Machine Learning Techniques for Time Series Classification 9783736978133

Classification of time series is an important task in various fields, e.g., medicine, finance, and industrial applicatio


English Pages 216 [217] Year 2023


Table of contents :
Contents
1. Introduction
2. Machine Learning
3. Interpretable Generalized Radial Basis Function Classifiers Based on the Random Forest Kernel
4. Classification of Temporal Data
5. Classification in Car Safety Systems
6. Scenario-Based Random Forest for On-Line Time Series Classification
7. Segmentation and Labeling for On-Line Time Series Classification
8. Conclusions
Appendices
Bibliography

Künstliche Intelligenz (KI) & Digitalisierung [Artificial Intelligence (AI) & Digitalization], Volume 2. Editorial Board: Prof. Dr.-Ing. Kai Peter Birke, Dipl.-Kfm. Annette Jentzsch-Cuvillier

This work is protected by copyright and may not be reproduced in any form or passed on to third parties. It is intended for personal use only.


Institute for Circuit Theory and Signal Processing (Lehrstuhl für Netzwerktheorie und Signalverarbeitung), Technische Universität München

Machine Learning Techniques for Time Series Classification
Dipl.-Ing. Univ. Michael-Felix Botsch

Complete reprint of the dissertation approved by the Faculty of Electrical Engineering and Information Technology of the Technische Universität München for the award of the academic degree of Doktor-Ingenieur.

Chair: Univ.-Prof. Dr.-Ing. habil. Gerhard Rigoll
Examiners of the dissertation:
1. Univ.-Prof. Dr. techn. Josef A. Nossek
2. Univ.-Prof. Dr. rer. nat., Dr. rer. nat. habil. Rupert Lasser

The dissertation was submitted to the Technische Universität München on 29.09.2008 and accepted by the Faculty of Electrical Engineering and Information Technology on 06.02.2009.


Bibliographic information of the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.ddb.de. 2nd ed. - Göttingen: Cuvillier, 2023. Also: (TU) München, Univ., Diss., 2009. ISBN 978-3-7369-7813-3

© CUVILLIER VERLAG, Göttingen 2023, Nonnenstieg 8, 37075 Göttingen. Telephone: 0551-54724-0, Fax: 0551-54724-21, www.cuvillier.de. All rights reserved. Without the express permission of the publisher, it is not permitted to reproduce the book or parts of it by photomechanical means (photocopy, microcopy). 2nd edition, 2023. Printed on acid-free paper. ISBN 978-3-7369-7813-3


Acknowledgements

This thesis emerged during my work at the Ingolstadt Institute der Technischen Universität München (INITUM) as a research assistant of the Institute for Circuit Theory and Signal Processing. I wish to express my deep respect and sincere gratitude to my supervisor Professor Josef A. Nossek for his continuous support, sound advice, and the invaluable suggestions arising from our numerous discussions. I would also like to extend my gratitude to Dr. rer. nat. Frank Keck and Dipl.-Ing. Christian Weiß, who enabled this research work at Audi AG and who provided support and encouragement along the way. I thank Professor Rupert Lasser for taking the time to serve on my dissertation committee and Professor Gerhard Rigoll for serving as its chairman. I would also like to thank Dr. Uwe Koser and Dr. Peter-Felix Tropschuh for supporting the work of all PhD students at INITUM.

I am deeply grateful to Professor Wolfgang Utschick from the Associate Institute for Signal Processing of the Munich University of Technology, whose profound insight has always been a great source of inspiration. Our shared commute from Munich to Ingolstadt offered the chance for many constructive discussions, which influenced not only my work but also me as a person. I would like to thank Dr.-Ing. Michael Joham and Dr.-Ing. Guido Dietl for guiding me as a student and teaching me how to conduct research.

My colleagues at the Institute for Circuit Theory and Signal Processing as well as at INITUM have been very helpful and supportive. I would like to thank in particular Johannes Speth, Markus Brändle, Meinhard Braedel, Peter Breun, Peter Knauer, Raphael Hunger, and Thomas Bock for the constructive atmosphere and the good time we had. I also want to thank the students Katrin Oberdorfer and Peter Bergmiller. Working with them has been enriching.

I am very grateful to my friends Katrin Böttger and Michael Hohendorf, who always had confidence in me and who supported me whenever it was necessary. Finally, and most importantly, I want to thank my parents, Elena and Hans Botsch, and my brother Eduard for their love and for their constant support and encouragement throughout my education. I dedicate this work to them.

The following quotation in German by Dietrich Bonhöfer should emphasize my gratefulness to all who contributed in their unique ways to this dissertation: „Im normalen Leben wird es einem oft gar nicht bewußt, dass der Mensch überhaupt unendlich viel mehr empfängt, als er gibt, und dass Dankbarkeit das Leben erst reich macht. Man überschätzt recht leicht das eigene Wirken und Tun in seiner Wichtigkeit gegenüber dem, was man nur durch andere geworden ist.“ (In English: "In ordinary life one is often not even aware that a person receives infinitely more than he gives, and that it is gratitude that makes life rich. One quite easily overestimates the importance of one's own work and deeds compared with what one has become only through others.")

Ingolstadt, December 2008


Contents

1. Introduction
  1.1 Motivation
  1.2 Outline and Major Contributions of the Thesis
  1.3 Notation

2. Machine Learning
  2.1 Basics of Statistical Learning
    2.1.1 Parametric and Nonparametric Techniques for Classification
    2.1.2 Learning and Generalization in Classification Tasks
      2.1.2.1 Bias-Variance Decomposition
      2.1.2.2 Evaluation of Machine Learning Algorithms in Practice
  2.2 Machine Learning Algorithms for Classification
    2.2.1 Linear Basis Expansion Models for Classification
    2.2.2 Multi-Layer Perceptron Neural Networks for Classification
    2.2.3 Kernel Classifiers
    2.2.4 Classification and Regression Trees
    2.2.5 Ensemble Learning
      2.2.5.1 Bias-Variance Framework in Averaging Ensemble Models
      2.2.5.2 Random Forest Algorithm
  2.3 Feature Extraction
    2.3.1 Feature Generation
    2.3.2 Feature Selection
      2.3.2.1 Feature Selection with CART
      2.3.2.2 Feature Selection with RF

3. Interpretable Generalized Radial Basis Function Classifiers Based on the Random Forest Kernel
  3.1 Kernels Defined by Ensembles of CART Classifiers
  3.2 GRBF Based on the RF Kernel
  3.3 Pruning
  3.4 Experimental Results

4. Classification of Temporal Data
  4.1 Time Series Representation and Dimensionality Reduction
  4.2 Difficulties in Time Series Classification
  4.3 Classical Approaches for Time Series Classification
  4.4 Similarity Measures for Time Series
  4.5 Feature Generation for Time Series Classification
    4.5.1 Global Features
    4.5.2 Local Features
    4.5.3 Event-Based Features

5. Classification in Car Safety Systems
  5.1 The Car Crash Dataset
  5.2 Evaluation of Classification Performance

6. Scenario-Based Random Forest for On-Line Time Series Classification
  6.1 Problem Formulation
  6.2 The Scenario-Based Random Forest Algorithm
  6.3 Feature Selection with SBRF
    6.3.1 Wrapper Method
    6.3.2 Embedded Methods
  6.4 SBRF for Car Crash Classification

7. Segmentation and Labeling for On-Line Time Series Classification
  7.1 On-Line Segmentation of Time Series
    7.1.1 Reduction of the Number of Change Detectors
    7.1.2 Linear Change Detectors
    7.1.3 Segmentation for Car Crash Classification
  7.2 Labeling Classifiers for On-Line Time Series Classification
    7.2.1 GRBF for Car Crash Classification
    7.2.2 Temporal Prototypes for Car Crash Classification

8. Conclusions
  8.1 Comparison of the Presented Methods for Car Crash Classification
  8.2 Contributions and Future Work

Appendices

A. Appendix on Machine Learning Procedures
  A.1 Bias-Variance Decomposition for Regression
  A.2 Example for the Bias-Variance Decomposition for Classification
  A.3 Bias-Variance Decomposition for Classification
  A.4 Computation of W and θ with p(y|x) Versus ȳ as Target Vector
  A.5 Confidence Mapping
  A.6 Support Vector Machines
  A.7 Minimum Local Risk in CART Classification
  A.8 Random Forests Do Not Overfit
  A.9 Computation of the Gradient for the GRBF Algorithm
  A.10 The AdaTron Algorithm

B. List of Frequently Used Symbols

Bibliography

1. Introduction

The usage of machine learning techniques for on-line time series classification is the main topic of this work. The motivation to deal with this task and its central challenges are described in Section 1.1. Major contributions of this thesis are summarized in Section 1.2.

1.1 Motivation

The classification of temporal data plays a central role in many fields, e.g., medicine, finance, speech recognition, or the monitoring of industrial processes. Approaches that are frequently applied in temporal classification tasks typically use statistical models like Auto-Regressive Moving Average (ARMA) or Hidden Markov Models (HMM). In cases where it is not possible to build accurate models that describe the considered time series, the classification task is normally solved by using empirically designed expert systems. In recent years the interest in applying machine learning techniques to time series has grown, mainly driven by enhanced computational resources, which are required for the large amount of data that has to be processed when dealing with time series. Thus, the difficult task of finding appropriate statistical models or the hand-crafted design of expert systems can be replaced by well-known machine learning methods or by new statistical learning algorithms that are emerging in order to deal with temporal data.

Machine learning procedures automatically "learn" the mapping implementing the classifier from observations, so that the main task is the acquisition of representative observations rather than the design of analytical models. The "learning" is accomplished by solving optimization problems in order to obtain the desired classification result for the available observations. Taking into account that most learning systems can be improved by integrating domain knowledge about the problem at hand, machine learning classifiers normally achieve a better performance than expert systems, although they can be designed with less effort. This makes machine learning an attractive approach to tackle classification problems.

Since time series are complex data structures, the common framework used in machine learning for classification cannot be applied straightforwardly, and modifications must be included in order to take the temporal aspect into account. The common machine learning framework assumes that an object which has to be classified is described by features. Each feature represents a characteristic of the object. When dealing with temporal data, the object to be classified is described by trajectories, which are represented in this work as time series. It is inappropriate to treat each sample in the series as a feature of the object to be classified for two reasons: firstly, this would lead to a huge number of features, and secondly, it is rather the interdependence between samples that contains the characteristic information which is required for classification and should be integrated into features. In this work, possibilities to represent time series in such a way that the common machine learning framework can be applied will be discussed.

The main difference between this thesis and other works that deal with the application of machine learning techniques to the classification of temporal data is that the time series considered here can change their class label over time, so that the classification problem covers two main tasks: the detection of those time instances when the class label changes and the correct assignment of the class label. Works in the technical literature that deal with the application of machine learning to time series classification only consider the latter task and assume that the series are already divided into segments in such a way that each segment belongs to just one class. For practical applications including dynamic systems that generate time series which must be classified, the former task is also highly important and will therefore be covered in this thesis. Whenever a new sample of a time series becomes available, it has to be decided in real time whether a class change has occurred and, if so, which one. Thus, the task will be called on-line time series classification.

Although many machine learning algorithms are able to solve a large number of classification problems that appear in practical applications with high accuracy, they have the drawback of not being interpretable. For most classification tasks there is a tradeoff between interpretability and low error rates, which stems from the inability of humans to imagine arbitrary hypersurfaces in high-dimensional spaces. Since interpretability is often desired, as in the case of safety-critical applications, the design of interpretable classifiers is an important issue that will be treated in this work.

The main application that will be discussed is the design of classifiers for the detection and categorization of car crashes. The aim is the deployment of safety systems, e.g., belt tensioners or airbags, at time instances where the best possible protection of passengers is assured. This is a safety-critical application where a decision must be taken based on temporal data stemming from sensors, e.g., deceleration or pressure sensors, which are incorporated in modern cars. State-of-the-art algorithms for this task are expert systems which rely on empirical experience and require many hand-crafted "trial and error" steps for each car type. Therefore, applying machine learning procedures represents a means to reduce the development costs and to improve the flexibility and performance of the employed algorithms. Solving the car crash classification problem with machine learning techniques requires addressing all topics that were mentioned above: a suitable representation of time series; the detection of those time instances when a class change occurs, i.e., when to deploy safety systems; a high accuracy, i.e., the correct decision about what safety systems to deploy; and interpretability.

1.2 Outline and Major Contributions of the Thesis

The thesis is divided into two main parts. The focus of the first part, which is covered by Chapters 2 and 3, is the classification task based on machine learning techniques, whereas the second part (Chapters 4 to 7) deals with temporal data, the possibilities to classify time series using machine learning procedures, and the car crash application.

Chapter 2 starts by introducing the basic concepts of machine learning. Here, the generalization ability of classification systems is treated in more detail, i.e., the property of classifiers to take the best decision not only for the data that has been used in the design phase but also for new, unseen data. In this context the bias-variance framework is very useful since it gives insight into the tradeoff that must be found when choosing the complexity of the model that is used to learn the underlying process generating the data. Since a main aim in the design of classifiers is a good generalization ability, methods that can be applied to measure the performance will also be discussed in this context. Chapter 2 continues by presenting important machine learning algorithms. The aim is to convey the basics of some learning algorithms that have proven their suitability in a large number of practical applications and to introduce the relatively new field of ensemble learning. In this part of the thesis an approach to explain the good generalization ability of ensemble techniques for classification tasks using the bias-variance framework is presented. Similar approaches exist for regression tasks, but there the bias-variance framework is different. Having stressed the advantages of ensemble techniques, a special representative, the Random Forest (RF) algorithm, is introduced since it is the basic algorithm that is used throughout the thesis to design classification systems and to develop new algorithms. The success of a classification system depends not only on the classification algorithm but also on the way the data is represented for the algorithm. Thus, the so-called feature extraction is highly important, i.e., the extraction of a small but representative set of attributes that facilitate a good classification performance. The last part of Chapter 2 covers this topic.

A major contribution of this thesis is described in Chapter 3, where a possibility to design interpretable Generalized Radial Basis Function (GRBF) classifiers based on the RF kernel is introduced. After showing how the RF kernel can be deduced from the kernel of a single decision tree, the similarity measure described by the RF kernel is used during the construction of GRBF classifiers in order to assure a good generalization performance. GRBF classifiers have some advantages that other classifiers do not have: they allow interpretability, which is important in safety-critical applications, and they offer the option to reject decisions. Both properties will be used in the second part of the thesis for the car crash application. Chapter 3 describes in detail how GRBF classifiers can be constructed using the RF kernel and shows how interpretability can be achieved by using a constrained optimization step. Moreover, it is shown how the number of generalized radial basis functions can be reduced in the GRBF classifier while assuring a good classification performance. The chapter ends with experimental results that confirm the advantages of the proposed classification algorithm.

Chapter 4 describes the temporal classification problem, elaborating on the possibilities to represent temporal data, the difficulties that arise when dealing with the topic, and possible ways to handle them. Some common approaches for the classification of temporal data are reviewed. Since the usage of suitable similarity measures between time series is one possible approach to the classification task, the chapter presents some standard measures and shows how these measures can be applied to construct class-specific prototypes. In Chapter 4 a new similarity measure is introduced which will be called Augmented Dynamic Time Warping (ADTW) similarity. Because it captures both the similarity in shape and the duration of time series, the ADTW similarity represents an adequate measure for the car crash application. In the final part of the chapter some possibilities to generate global or local features from time series are reviewed, and a new type is introduced that will be called event-based features. The essence of event-based features is the usage of classifiers in the generation step in such a way that application-specific events can be detected and the time instances when these events occur are incorporated into features.

The main application that is used in the second part of the thesis to evaluate the techniques that are developed in this work is presented in Chapter 5. The classification of car crashes is a challenging task for several reasons. Firstly, the task involves two classification steps: the detection of a suitable time instance when to deploy safety systems and the decision about what safety systems to activate. Secondly, causality must be taken into account, i.e., at time instances when a decision must be taken, only sensor signals up to this time stamp are available, which leads to time series of different length. Thirdly, it is a safety-critical application, which comes with high demands on the classification performance while favoring interpretable classifiers. The chapter describes how the performance of car crash classification systems can be measured. Both the time instance when to deploy and the decision about what safety systems to activate depend on the crash severity. Thus, measuring the performance must evaluate both aspects.

In Chapter 6 a contribution of this thesis to the problem of on-line time series classification is presented. The approach is an extension of the RF algorithm to temporal data, which is why it will be denoted the Scenario-Based Random Forest (SBRF) algorithm. SBRF comes with all advantages of the RF algorithm, making it possible to compute an honest estimate of the classification performance without putting aside a subset of observations from the available data set. A highly important property of the SBRF algorithm for practical applications is its ability to perform feature selection. Reducing the number of features that are used in the classification task not only decreases the computational load but also facilitates a good generalization ability. On-line classification using the SBRF approach is performed by computing a feature vector at each time stamp and assigning this feature vector to a class. The chapter ends by applying the SBRF algorithm to the car crash classification task. The results for two data sets that stem from real car crashes are shown.

In Chapter 7 the on-line time series classification problem is explicitly divided into two subtasks: detecting time instances when class changes possibly occur and assigning class labels. The former subtask segments a time series into intervals that belong to the same class, and the latter assigns class labels to the segments. In contrast to the SBRF approach, here the label is not computed at each time instance but only when the segmentation classifier signals a possible class change. The construction of linear segmentation classifiers is discussed and then adapted to the car crash classification task. In the final part of the chapter two labeling classification techniques are applied to the car crash datasets. The first is realized using GRBF classifiers that are constructed as described in Chapter 3 and the second using the ADTW similarity measure from Chapter 4.

The thesis ends with Chapter 8, where the presented methods for the car crash application are compared and aspects for future work in the field of on-line time series classification using machine learning techniques are discussed.
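The difference between the two on-line schemes outlined for Chapters 6 and 7 can be illustrated with a small Python sketch. It is not the SBRF algorithm or the segmentation-and-labeling method of the thesis: the function names (`classify_every_sample`, `classify_on_change`), the toy feature (mean of the last two samples), the threshold classifier, and the jump-based change detector are all assumptions chosen only to make the control flow concrete.

```python
from typing import Callable, List, Sequence

Features = List[float]


def classify_every_sample(
    series: Sequence[float],
    extract: Callable[[Sequence[float]], Features],
    classify: Callable[[Features], str],
) -> List[str]:
    """Assign a label at every time stamp, using only the samples seen so far."""
    labels = []
    for t in range(1, len(series) + 1):
        labels.append(classify(extract(series[:t])))  # causal: samples up to t only
    return labels


def classify_on_change(
    series: Sequence[float],
    extract: Callable[[Sequence[float]], Features],
    change_detected: Callable[[Sequence[float]], bool],
    classify: Callable[[Features], str],
    initial_label: str,
) -> List[str]:
    """Call the labeling classifier only when the change detector fires."""
    labels = []
    current = initial_label
    for t in range(1, len(series) + 1):
        seen = series[:t]
        if change_detected(seen):
            current = classify(extract(seen))
        labels.append(current)  # otherwise the previous label is kept
    return labels


# Toy components (hypothetical, for illustration only).
def mean_of_last_two(seen: Sequence[float]) -> Features:
    tail = seen[-2:]
    return [sum(tail) / len(tail)]


def threshold_classifier(features: Features) -> str:
    return "high" if features[0] > 2.0 else "low"


def jump_detector(seen: Sequence[float]) -> bool:
    return len(seen) >= 2 and abs(seen[-1] - seen[-2]) > 1.0


series = [0.0, 0.1, 0.0, 5.0, 5.1, 4.9]
per_sample = classify_every_sample(series, mean_of_last_two, threshold_classifier)
on_change = classify_on_change(
    series, mean_of_last_two, jump_detector, threshold_classifier, "low"
)
print(per_sample)
print(on_change)
```

In this toy run both schemes yield the same label sequence, but the second one invokes the labeling classifier only once, at the detected change, instead of at every time stamp.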

1.3 Notation

Throughout the thesis vectors and matrices are denoted by lower and upper case bold letters. Random variables are written using sans serif fonts. The matrix $\boldsymbol{I}_n$ is the $n \times n$ identity matrix, $\boldsymbol{e}_i$ its $i$-th column, and $\boldsymbol{0}_n$ the $n$-dimensional zero vector. The symbol "$\ast$" denotes the element-wise multiplication, $\operatorname{tr}\{\cdot\}$ the trace of a matrix, $\mathrm{E}_{\boldsymbol{\mathsf{x}}}\{\cdot\}$ the expectation with respect to $\boldsymbol{\mathsf{x}}$, $(\cdot)^{\mathrm{T}}$ the transpose, $\lfloor\cdot\rfloor$ the floor, $\lceil\cdot\rceil$ the ceil, $O(\cdot)$ the Landau symbol, and $\|\cdot\|$ the norm of a vector. A list of the most important symbols that are used in the thesis can be found in Appendix B. Expressions are emphasized by writing them in italic type.


2. Machine Learning

The branch of science that deals with the automatic discovery of regularities in data through the use of computer algorithms is called machine learning. If the discovery of regularities in data is not necessarily coupled to the use of computers one talks about statistical learning. Machine learning plays an important role in the areas of data mining, artificial intelligence, statistics and in various engineering disciplines. The focus of this thesis lies on the latter, aiming to use machine learning for the design of technical systems that have to react to signals coming from the environment by tuning the parameters of an adaptive model in such a way that an application-specific behavior is realized. The first section of this chapter introduces the basics underlying statistical learning. Section 2.2 presents state of the art algorithms for machine learning with a focus on linear basis expansion models, Classification and Regression Trees (CART), and the Random Forest (RF) algorithm since these methods are the basis for techniques that are developed later in the thesis for the task of temporal classification. Section 2.3 addresses the problem of finding the most compact and informative representation of data which is then used by a machine learning algorithm to realize the desired behavior.

2.1 Basics of Statistical Learning

Many relations that are found by statistical learning methods in data can be represented in the form of classification or regression functions. Classification and regression aim at estimating values of an attribute of a system based on previously measured attributes of this system. Given a set of measured observation attributes $\boldsymbol{v} = [v_1, \ldots, v_{N'}]^{\mathrm{T}} \in \mathbb{R}^{N'}$, statistical learning methods estimate the values of a different attribute $y$. If $y$ takes on continuous numerical values, i.e., $y \in \mathbb{R}$, one talks about regression, and if it takes on discrete values from a set of $K$ categorical values, called classes, i.e., $y \in \{c_1, \ldots, c_K\}$, one talks about classification. Often a preprocessing of the observation vector $\boldsymbol{v}$ is performed in order to simplify the mapping

\[ \tilde{f} : \mathbb{R}^{N'} \to \mathbb{R}, \quad \boldsymbol{v} \mapsto y \quad \text{for regression} \tag{2.1} \]

and

\[ \tilde{f} : \mathbb{R}^{N'} \to \{c_1, \ldots, c_K\}, \quad \boldsymbol{v} \mapsto y \quad \text{for classification.} \tag{2.2} \]

Preprocessing plays a very important role because it offers a possibility to introduce a priori knowledge about the considered machine learning problem. This preprocessing transforms the observation vector $\boldsymbol{v}$ into the so-called feature vector $\boldsymbol{x} \in \mathbb{R}^{N}$. Defining feature vectors is the most common and convenient means of data representation for classification and regression problems. A pair $(\boldsymbol{x}, y)$ is called a pattern, $\boldsymbol{x}$ the "input" and $y$ the "output" or "target". Because the measured attribute values are subject to variations which often cannot be described deterministically, a statistical framework must be adopted [Vid03]. In this framework, $\boldsymbol{x}$ is the realization of the random variable $\boldsymbol{\mathsf{x}}$ and $y$ of the random variable $\mathsf{y}$. One can think of the mapping from $\boldsymbol{v}$ to $y$ or the mapping from $\boldsymbol{x}$ to $y$ as a black box representing the process of interest.¹ In machine learning one is interested both in generating from the observation $\boldsymbol{v}$ a feature vector $\boldsymbol{x}$ that is suitable for the application at hand and in estimating the mapping from $\boldsymbol{x}$ to $y$ using a set of $M$ already known correspondences, the so-called training set²

\[ \mathcal{D} = \{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_M, y_M)\}. \tag{2.3} \]
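As a minimal sketch of how a training set of the form (2.3) might be assembled, the following Python snippet maps raw observation vectors to feature vectors and pairs them with their targets. The function names and the three summary features (mean, standard deviation, range) are assumptions made purely for illustration; the thesis discusses feature extraction proper in Section 2.3.

```python
import numpy as np


def extract_features(v):
    """Hypothetical feature extraction: map an observation v to N = 3 features."""
    v = np.asarray(v, dtype=float)
    # Mean, (population) standard deviation, and range of the observation.
    return np.array([v.mean(), v.std(), v.max() - v.min()])


def build_training_set(observations, targets):
    """Return D = [(x_1, y_1), ..., (x_M, y_M)] as a list of patterns."""
    return [(extract_features(v), y) for v, y in zip(observations, targets)]


# Two toy observations with four samples each and their class labels.
observations = [[0.0, 0.0, 0.0, 0.0], [1.0, 3.0, 1.0, 3.0]]
targets = ["c1", "c2"]
D = build_training_set(observations, targets)

for x, y in D:
    print(x, y)
```

Here each observation has four samples while each feature vector has three entries, mirroring the distinction between the observation dimension and the feature dimension N.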

Whereas most of the literature focuses on finding an estimate of the mapping from $\boldsymbol{x}$ to $y$, i.e., on computing a suitable mapping

\[ f : \mathbb{R}^{N} \to \mathbb{R}, \quad \boldsymbol{x} \mapsto \hat{y}, \quad \text{for regression} \tag{2.4} \]

and

\[ f : \mathbb{R}^{N} \to \{c_1, \ldots, c_K\}, \quad \boldsymbol{x} \mapsto \hat{y}, \quad \text{for classification,} \tag{2.5} \]

it should be noted that for a good performance of the learning system, i.e., for accurately predicting the output $y$ for a new, unseen measurement vector $\boldsymbol{v}$, the construction of the feature vector $\boldsymbol{x}$ is extremely important. Fig. 2.1 shows that the computed output $\hat{y}$ can only be a good estimate of the target $y$ corresponding to $\boldsymbol{v}$ if both the feature extraction and the mapping $f$ are chosen properly. The topic of generating suitable feature vectors is discussed in Section 2.3. The current and the following section focus on computing the mapping $f$.

[Figure 2.1: Learning system. The observation $\boldsymbol{v}$ enters the feature extraction, which produces $\boldsymbol{x}$; the mapping $f(\boldsymbol{x})$ then yields the output $\hat{y}$.]

In order to determine a suitable mapping $f$, a measure for the quality of the mapping is required. Based on statistical decision theory, not the quality but the lack of quality of $f$ is measured. Firstly, a loss $L(y, \hat{y})$ must be defined which assigns a cost to the prediction $\hat{y} = f(\boldsymbol{x})$, knowing that the true value is $y$. The measure for the lack of quality is the prediction risk $R(f)$, also called generalization error, which is defined as the expectation of $L(\mathsf{y}, f(\boldsymbol{\mathsf{x}}))$ over the $\boldsymbol{x}, y$-space. For regression one obtains

\[ R(f) = \mathrm{E}_{\boldsymbol{\mathsf{x}}, \mathsf{y}}\{L(\mathsf{y}, f(\boldsymbol{\mathsf{x}}))\} = \int_{\mathbb{R}^{N}} \int_{\mathbb{R}} L(y, f(\boldsymbol{x})) \, p(\boldsymbol{\mathsf{x}} = \boldsymbol{x}, \mathsf{y} = y) \, \mathrm{d}y \, \mathrm{d}\boldsymbol{x}, \tag{2.6} \]

where $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = y)$ is the joint probability density function of $\mathsf{x}$ and $\mathsf{y}$. For classification the risk is defined as

$$R(f) = \mathrm{E}_{\mathbf{x},y}\{L(y, f(\mathbf{x}))\} = \sum_{k=1}^{K} \int_{\mathbb{R}^N} L(c_k, f(\mathbf{x}))\, p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)\, \mathrm{d}\mathbf{x}. \tag{2.7}$$

In this framework, the aim in machine learning is to find the function $f_{\mathrm{B}}(\mathbf{x})$ which minimizes the prediction risk $R(f)$,

$$f_{\mathrm{B}} = \operatorname*{argmin}_{f} \{R(f)\}. \tag{2.8}$$

¹ In regression the mapping performed by the black box can be modeled by $y = f_{\mathrm{true}}(\mathbf{x}) + e$, where $e$ is noise and $f_{\mathrm{true}}$ the noiseless mapping from $\mathbf{x}$ to $y$. In classification the mapping can be represented by $y = f_{\mathrm{discr}}(f_{\mathrm{true}}(\mathbf{x}) + e)$, where $f_{\mathrm{discr}}$ maps the real-valued expression $f_{\mathrm{true}}(\mathbf{x}) + e$ to the classes $c_k$.

² This kind of learning problem is called supervised learning because for each input vector $\mathbf{x}_m$ in the training set the corresponding target $y_m$ is known. In unsupervised learning only a set of feature vectors without their corresponding targets is available, and instead of predicting the output for an unseen input the task changes to describing how the data is organized or clustered.

The function $f_{\mathrm{B}}(\mathbf{x})$ is called the Bayes regression function or Bayes classifier, respectively. Commonly used loss functions in regression are the absolute error $L(y, f(\mathbf{x})) = |y - f(\mathbf{x})|$ or the squared error $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$, whereas in classification problems all possible values of the loss $L(y, f(\mathbf{x}))$ can be stored in a $K \times K$ matrix. Often the loss for classification is chosen to be the 0/1-loss

$$L(y, f(\mathbf{x})) = 1 - \delta(y, f(\mathbf{x})) = \begin{cases} 0, & \text{if } y = f(\mathbf{x}) \\ 1, & \text{otherwise,} \end{cases} \tag{2.9}$$

where $\delta(\cdot, \cdot)$ is a function that generates the output 1 if its arguments are equal and 0 otherwise. The probability density functions are normally not known in practical applications and therefore $f_{\mathrm{B}}(\mathbf{x})$ cannot be computed. Thus, the empirical risk is introduced, which defines a measure for the lack of quality of $f(\mathbf{x})$ based on the training set $\mathcal{D}$:

$$R_{\mathrm{emp}}(f, \mathcal{D}) = \frac{1}{M} \sum_{m=1}^{M} L(y_m, f(\mathbf{x}_m)). \tag{2.10}$$
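The empirical risk from (2.10) is simply an average of losses over the training patterns. The following minimal Python sketch illustrates this for the squared error and the 0/1-loss from (2.9); the function names, the toy patterns and the two example mappings are illustrative assumptions and do not stem from the thesis.

```python
from typing import Callable, Sequence, Tuple

# A pattern is a pair (x, y): feature vector and target.
Pattern = Tuple[Sequence[float], object]


def squared_error(y, y_hat):
    """Squared-error loss used for regression."""
    return (y - y_hat) ** 2


def zero_one_loss(y, y_hat):
    """0/1-loss from (2.9): 0 for a correct decision, 1 otherwise."""
    return 0 if y == y_hat else 1


def empirical_risk(f: Callable, data: Sequence[Pattern], loss: Callable) -> float:
    """Empirical risk (2.10): mean loss of f over the patterns in data."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)


# Toy classification set (hypothetical): a threshold classifier on one feature.
cls_data = [([0.1], "a"), ([0.4], "a"), ([0.6], "b"), ([0.9], "a")]
threshold_classifier = lambda x: "a" if x[0] < 0.5 else "b"
risk_cls = empirical_risk(threshold_classifier, cls_data, zero_one_loss)

# Toy regression set (hypothetical): a fixed linear mapping f(x) = 2 x.
reg_data = [([1.0], 2.0), ([2.0], 4.5)]
linear_map = lambda x: 2.0 * x[0]
risk_reg = empirical_risk(linear_map, reg_data, squared_error)
```

In this toy setting one of the four classification patterns is misclassified, so `risk_cls` is the proportion of errors, and `risk_reg` is the mean squared error of the fixed linear mapping.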

For a mapping $f$ that is chosen independently of $\mathcal{D}$, the empirical risk from (2.10) is an unbiased estimate of $R(f)$ obtained from the training set $\mathcal{D}$. Learning procedures can now be applied in order to find an estimate of the mapping from $\mathbf{x}$ to $y$ by minimizing $R_{\mathrm{emp}}(f, \mathcal{D})$.

2.1.1 Parametric and Nonparametric Techniques for Classification

The joint probability density $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)$ offers a complete summary of the uncertainty associated with the random variables $\mathsf{x}$ and $\mathsf{y}$, such that the knowledge of $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)$ allows the computation of the optimal classifier $f_{\mathrm{B}}$. An arbitrary classification algorithm implementing the mapping $f$ segments the input space $\mathbb{R}^N$ into regions $\mathcal{X}_\ell$ in such a way that all inputs $\mathbf{x} \in \mathcal{X}_\ell$ are assigned to the class $c_\ell$. Therefore, the prediction risk from (2.7) can be rewritten as

$$R(f) = \mathrm{E}_{\mathbf{x},y}\{L(y, f(\mathbf{x}))\} = \sum_{k=1}^{K} \sum_{\ell=1}^{K} \int_{\mathcal{X}_\ell} L(c_k, c_\ell)\, p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)\, \mathrm{d}\mathbf{x}, \tag{2.11}$$

with $L(c_k, c_\ell)$ representing the cost of assigning an input $\mathbf{x}$ to the output $c_\ell$ when the true class is $c_k$. During the design of a classification algorithm one aims at choosing the regions $\mathcal{X}_\ell$ in such a way that the prediction risk from (2.11) is minimized. As a consequence the regions $\mathcal{X}_\ell$ emerge by minimizing for each $\mathbf{x} \in \mathbb{R}^N$ the cost function $\sum_{k=1}^{K} L(c_k, f(\mathbf{x}))\, p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)$, leading to the Bayes classifier

$$f_{\mathrm{B}}(\mathbf{x}) = \operatorname*{argmin}_{f(\mathbf{x})} \left\{ \sum_{k=1}^{K} L(c_k, f(\mathbf{x}))\, p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k) \right\} \tag{2.12}$$

$$\phantom{f_{\mathrm{B}}(\mathbf{x})} = \operatorname*{argmin}_{c_\ell} \left\{ \sum_{k=1}^{K} L(c_k, c_\ell)\, p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x}) \right\}, \tag{2.13}$$

since the common factor $p(\mathsf{x} = \mathbf{x})$ can be eliminated.³ Assuming the 0/1-loss from (2.9), the Bayes classifier can be simplified, resulting in the so-called Maximum-A-Posteriori (MAP) classifier⁴

$$f_{\mathrm{MAP}}(\mathbf{x}) = \operatorname*{argmax}_{c_\ell} \{ p(\mathsf{y} = c_\ell \mid \mathsf{x} = \mathbf{x}) \}. \tag{2.14}$$

If additionally the priors $p(\mathsf{y} = c_k)$ are equal for all classes, the MAP decision rule can be further simplified to the so-called Maximum Likelihood (ML) classifier⁵

$$f_{\mathrm{ML}}(\mathbf{x}) = \operatorname*{argmax}_{c_\ell} \{ p(\mathsf{x} = \mathbf{x} \mid \mathsf{y} = c_\ell) \}. \tag{2.15}$$
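The three decision rules (2.13), (2.14) and (2.15) can be compared on a small discrete example. The sketch below is only an illustration under stated assumptions: the likelihood values, the priors and the loss matrices are made up, and the function names are hypothetical. The loss matrix is indexed as `loss[k][l]`, i. e., true class first and decided class second, as in $L(c_k, c_\ell)$.

```python
def posteriors_from(likelihoods, priors):
    """Bayes rule: p(y=c_k|x) from p(x|y=c_k) and p(y=c_k) at one input x."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the common factor that cancels in (2.13)
    return [j / evidence for j in joint]


def bayes_decision(posteriors, loss):
    """Rule (2.13): index l minimizing sum_k loss[k][l] * p(y=c_k|x)."""
    n_classes = len(posteriors)
    expected = [
        sum(loss[k][l] * posteriors[k] for k in range(n_classes))
        for l in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


def map_decision(posteriors):
    """Rule (2.14): index of the largest posterior."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


def ml_decision(likelihoods):
    """Rule (2.15): index of the largest class-conditional likelihood."""
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)


# Hypothetical two-class example at one fixed input x.
likelihoods = [0.2, 0.5]   # p(x | y = c_1), p(x | y = c_2)
priors = [0.8, 0.2]        # p(y = c_1), p(y = c_2)
post = posteriors_from(likelihoods, priors)

zero_one = [[0, 1], [1, 0]]
asymmetric = [[0, 1], [5, 0]]  # missing class c_2 is five times as costly

decision_map = map_decision(post)
decision_ml = ml_decision(likelihoods)
decision_bayes_01 = bayes_decision(post, zero_one)
decision_bayes_asym = bayes_decision(post, asymmetric)
```

With these numbers the MAP rule and the Bayes rule under the 0/1-loss agree, the ML rule ignores the unequal priors and decides differently, and the asymmetric loss moves the Bayes decision away from the class with the largest posterior.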

In practical applications the probabilities that are required for the ML, MAP or Bayes decision rule are normally not known. Nevertheless, in some applications parameterized forms of the underlying densities $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)$ or $p(\mathsf{x} = \mathbf{x} \mid \mathsf{y} = c_k)$ are available, which makes it possible to use the training set in order to estimate the values of unknown parameters in the distributions. Then, based on the computed densities, one of the optimal decision rules from (2.13), (2.14) or (2.15) is applied. This approach is called a parametric classification method. On the other hand, if one uses methods for classification that do not require knowledge of the forms of the probability distributions, the approach is said to be nonparametric. A possible way to perform classification by applying nonparametric techniques is to estimate $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k)$ or $p(\mathsf{x} = \mathbf{x} \mid \mathsf{y} = c_k)$ without using any model of the densities, e. g., with kernel methods, and then to take a decision based on the estimated densities and one of the Eqs. (2.13), (2.14) or (2.15).

³ According to Bayes rule the joint probability density can be decomposed into $p(\mathsf{x} = \mathbf{x}, \mathsf{y} = c_k) = p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})\, p(\mathsf{x} = \mathbf{x})$.

⁴ The MAP classifier results from the simplification

$$\begin{aligned} f_{\mathrm{B}}(\mathbf{x}) &= \operatorname*{argmin}_{c_\ell} \left\{ \sum_{k=1}^{K} (1 - \delta(c_k, c_\ell))\, p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x}) \right\} \\ &= \operatorname*{argmax}_{c_\ell} \left\{ \sum_{k=1}^{K} p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x}) - \sum_{k=1}^{K} (1 - \delta(c_k, c_\ell))\, p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x}) \right\} \\ &= \operatorname*{argmax}_{c_\ell} \{ p(\mathsf{y} = c_\ell \mid \mathsf{x} = \mathbf{x}) \}. \end{aligned}$$

⁵ Due to Bayes rule the MAP classifier can be simplified for equal priors:

$$f_{\mathrm{ML}}(\mathbf{x}) = \operatorname*{argmax}_{c_\ell} \{ p(\mathsf{y} = c_\ell \mid \mathsf{x} = \mathbf{x}) \} = \operatorname*{argmax}_{c_\ell} \left\{ \frac{p(\mathsf{x} = \mathbf{x} \mid \mathsf{y} = c_\ell)\, p(\mathsf{y} = c_\ell)}{p(\mathsf{x} = \mathbf{x})} \right\} = \operatorname*{argmax}_{c_\ell} \{ p(\mathsf{x} = \mathbf{x} \mid \mathsf{y} = c_\ell) \}.$$

One of the fundamental problems in statistical learning is the so-called curse of dimensionality, which states that with increasing dimensionality of the input space it becomes harder to construct accurate statistical models. A manifestation of the curse of dimensionality that can be easily remembered is the fact that the sampling density is proportional to $M^{1/N}$, where $M$ is the size of the training set and $N$ the dimensionality of the input space. If one divides a region of the input space into regular cells, then the number of cells grows exponentially with the dimensionality of the input space, and as a consequence one needs an exponentially large number of examples in the training set to ensure that the cells are not empty. Therefore, the methods that are applied in statistical learning use assumptions that lie beyond the observed data to perform well in high-dimensional spaces. More details on the curse of dimensionality can be found in [Bel61] or [HTF01].

Almost all machine learning procedures that are used in practice are nonparametric techniques and some of them are presented in Section 2.2. When considering a classification problem, the structure in the observed data results from the fact that the data is produced by a source with certain statistical properties, so that the generated observations $\mathbf{v}$ do not fill out the whole input space $\mathbb{R}^N$. Moreover, observations $\mathbf{v}$ which lie close to each other with respect to a certain metric that is determined by the data source also belong to the same class. Nonparametric procedures exploit these facts by specifying metrics for measuring the neighborhood in the input space. Kernel methods or CART explicitly specify the metric, whereas methods based on linear combinations of basis functions implicitly define a metric. All methods that aim to escape the curse of dimensionality have explicitly or implicitly defined a metric for measuring the neighborhoods. This metric does not allow the neighborhood to be simultaneously small in all directions [HTF01]. Thus, every nonparametric technique makes intrinsic statistical assumptions about the data, but in general it cannot be predicted in advance whether these assumptions are valid, i. e., whether the algorithm is suitable for the considered classification task. A frequently adopted approach in machine learning is to check which of the available classifiers performs well, so that the most adequate algorithm from the cost and performance point of view can be chosen.
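The two manifestations of the curse of dimensionality mentioned above, the exponential growth of the number of cells and the sampling density proportional to $M^{1/N}$, can be made concrete with a few lines of Python. The function names and the chosen numbers are illustrative assumptions.

```python
def number_of_cells(bins_per_axis: int, n_dims: int) -> int:
    """Number of regular cells when each of the N axes is split into equal bins."""
    return bins_per_axis ** n_dims


def sampling_density(m: int, n_dims: int) -> float:
    """Sampling density along one axis, proportional to M**(1/N)."""
    return m ** (1.0 / n_dims)


def samples_for_density(density_per_axis: int, n_dims: int) -> int:
    """Training set size M needed to keep a given per-axis density in N dimensions."""
    return density_per_axis ** n_dims


# With 10 bins per axis the number of cells explodes with the dimension N.
cells = {n: number_of_cells(10, n) for n in (1, 2, 3, 10)}

# 1000 patterns are dense in one dimension but sparse in ten dimensions.
density_1d = sampling_density(1000, 1)
density_3d = sampling_density(1000, 3)
density_10d = sampling_density(1000, 10)

# Keeping the one-dimensional density of 1000 points per axis in N = 3.
needed_3d = samples_for_density(1000, 3)
```

In this sketch a training set of 1000 patterns corresponds to about 10 points per axis in three dimensions and to roughly 2 points per axis in ten dimensions, which is why additional assumptions about the data are needed in high-dimensional spaces.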

2.1.2 Learning and Generalization in Classification Tasks

The aim of a machine learning algorithm is to find a “good” approximation of the mapping from $\mathbf{x}$ to $y$. The approximation is “good” if it generalizes well, i. e., if it represents the underlying systematic properties of the data such that the prediction risk is small. In this context the question arises whether a classification method is superior or inferior to other methods if no prior assumptions about the nature of the classification task are made. The answer to this question is given by the no free lunch theorem, which states that there are no context-independent or usage-independent reasons to favor one classifier over another [Sch94]. A superior classifier can only exist if the probability over the class of problems is not uniform. If an algorithm is superior to another, it is due to the fact that its intrinsic assumptions fit the particular problem better. The no free lunch theorem is an essential theoretical result since it states that for a new classification task one should focus on the prior knowledge that one has about the process generating the data.

2.1.2.1 Bias-Variance Decomposition

In the following the so-called bias-variance framework will be introduced to measure the suitability of a learning algorithm for a specific classification problem. Colloquially, the bias measures the suitability of the chosen learning model for the problem at hand and the variance the specificity of a realization of the learning model. Given a certain type of classifier, e. g., neural networks or CART, both bias and variance can be influenced when designing the algorithm, but the two terms are not independent: a decreasing bias leads to an increasing variance and vice versa. This tradeoff is discussed in detail below because it is a key to understanding the generalization ability of statistical learning algorithms.

By deciding to use a certain type of machine learning algorithm, e. g., neural networks or CART, one chooses the family of functions, also called the model space, to which $f$ belongs. If the model space is too large, $f$ can be determined based on the training set $\mathcal{D}$ in such a way that the empirical risk is very small but the total risk $R(f)$ is large. This is called overfitting and the chosen model $f$ has a high variance.⁶ Intuitively, a model has a high variance if two different training sets $\mathcal{D}_1$ and $\mathcal{D}_2$ which are produced by the same source lead to significantly different mappings $f$. On the other hand, if the mapping $f$ differs significantly on average over the training sets from the optimal model $f_{\mathrm{B}}$, then the learning model $f$ has a high bias. This may occur for example if the model space is too small and the function $f$ does not have the flexibility to estimate the optimal mapping from $\mathbf{x}$ to $y$.

Although regression is not the focus of this work, the bias-variance decomposition of the prediction risk for regression tasks will be introduced since it is required at various places in the thesis. In regression the most widely used loss function is $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$, leading to the minimization of the mean squared error. With this loss function and taking into account that $f(\mathbf{x})$ is constructed based on a random training set $\mathcal{D}$, the prediction risk can be decomposed as⁷

$$\mathrm{E}_{\mathbf{x},y,\mathcal{D}}\{(y - f(\mathbf{x}, \mathcal{D}))^2\} = \underbrace{\mathrm{E}_{\mathbf{x}}\{\mathrm{E}_{y|\mathbf{x}}\{y^2\} - f_{\mathrm{B}}(\mathbf{x})^2\}}_{e_i = \mathrm{E}_{\mathbf{x}}\{\mathrm{ile}(\mathbf{x})\}} + \underbrace{\mathrm{E}_{\mathbf{x}}\{(f_{\mathrm{B}}(\mathbf{x}) - \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{f(\mathbf{x}, \mathcal{D})\})^2\}}_{e_b = \mathrm{E}_{\mathbf{x}}\{(\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\})^2\}} + \underbrace{\mathrm{E}_{\mathbf{x}}\{\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{(f(\mathbf{x}, \mathcal{D}) - \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{f(\mathbf{x}, \mathcal{D})\})^2\}\}}_{e_v = \mathrm{E}_{\mathbf{x}}\{\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\}\}}, \tag{2.16}$$

where the optimal solution $f_{\mathrm{B}}(\mathbf{x})$ is the conditional mean estimator $\mathrm{E}_{y|\mathbf{x}}\{y\}$. This decomposition can be found in Appendix A.1. Whereas the first term on the right-hand side, the expectation over the so-called irreducible local error $\mathrm{ile}(\mathbf{x})$, cannot be influenced and is due to the irreducible noise in the data, the bias and the variance can be changed by the choice of $f$. The bias measures the difference between the Bayes model $f_{\mathrm{B}}(\mathbf{x})$ and the average model $\mathrm{E}_{\mathcal{D}}\{f(\mathbf{x}, \mathcal{D})\}$. The variance measures the dependency of $f(\mathbf{x}, \mathcal{D})$ on the particular realization of $\mathcal{D}$. From this additive decomposition into bias and variance it is clear that one should design the function $f(\mathbf{x}, \mathcal{D})$ in such a way that both bias and variance are low.

⁶ Since $f(\mathbf{x})$ depends on the random training set $\mathcal{D}$, it will be denoted as $f(\mathbf{x}, \mathcal{D})$ in the following.

⁷ To cancel out the dependency of $f(\mathbf{x}, \mathcal{D})$ on the training set, (2.6) and (2.7) have to be extended by taking also the expectation over the $\mathcal{D}$-space.
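The bias and variance terms of (2.16) can be estimated by simulation at a single input $x_0$: many training sets are drawn from the same source, a model is built on each, and the predictions at $x_0$ are compared with the noiseless mapping. The sketch below is a hypothetical setup, not an experiment from the thesis. It assumes a one-dimensional source $y = x^2 + e$ with Gaussian noise and uses a k-nearest-neighbor average as the learning model, so that a small `k` gives a flexible model and a large `k` a rigid one.

```python
import random
import statistics


def f_true(x):
    """Assumed noiseless mapping of the toy source."""
    return x * x


def knn_predict(train, x0, k):
    """Average the targets of the k training inputs closest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_variance_at(x0, k, m=50, noise_std=0.3, n_sets=400, seed=0):
    """Monte Carlo estimate of squared bias and variance of f(x0, D)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_sets):
        # Draw a fresh training set D of size m from the same source.
        train = []
        for _ in range(m):
            x = rng.random()
            train.append((x, f_true(x) + rng.gauss(0.0, noise_std)))
        predictions.append(knn_predict(train, x0, k))
    mean_prediction = statistics.fmean(predictions)   # estimate of E_D{f(x0, D)}
    bias_sq = (f_true(x0) - mean_prediction) ** 2      # squared bias term
    variance = statistics.pvariance(predictions, mu=mean_prediction)
    return bias_sq, variance


bias_sq_flexible, var_flexible = bias_variance_at(0.9, k=1)
bias_sq_rigid, var_rigid = bias_variance_at(0.9, k=25)
```

In this setup the flexible model (`k=1`) shows a small bias and a large variance, whereas the rigid model (`k=25`) shows the opposite behavior. Adding the noise variance `noise_std ** 2` to the two returned terms gives the expected squared error at `x0`.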

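The bias-variance decomposition for classification is not equivalent to the bias-variance decomposition for regression since the loss function and thus the average error rate are defined in different ways. Whereas in regression one uses $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$, in classification the commonly used loss function is the 0/1-loss $L(y, f(\mathbf{x})) = 1 - \delta(y, f(\mathbf{x}))$. With the 0/1-loss the prediction risk for classification is

$$\begin{aligned} \mathrm{E}_{\mathbf{x},y,\mathcal{D}}\{L(y, f(\mathbf{x}, \mathcal{D}))\} &= \mathrm{E}_{\mathbf{x},y}\{\mathrm{E}_{\mathcal{D}|\mathbf{x},y}\{1 - \delta(y, f(\mathbf{x}, \mathcal{D}))\}\} \\ &= \mathrm{E}_{\mathbf{x}}\{\mathrm{E}_{y|\mathbf{x}}\{\mathrm{E}_{\mathcal{D}|\mathbf{x},y}\{1 - \delta(y, f(\mathbf{x}, \mathcal{D}))\}\}\} \\ &= \mathrm{E}_{\mathbf{x}}\left\{1 - \sum_{k=1}^{K} \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\delta(c_k, f(\mathbf{x}, \mathcal{D}))\}\, p(\mathsf{y} = c_k \mid \mathbf{x})\right\}. \end{aligned} \tag{2.17}$$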

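There are a variety of bias-variance decompositions for classification in the literature which try to decompose the prediction risk when using the 0/1-loss. An overview of these decompositions can be found in [Geu02] and three of them will be described in the following.

An intuitive way to decompose the prediction risk into a bias and a variance term will be discussed first. The equivalent to the irreducible local error from (2.16) can be set to

$$\mathrm{ile}(\mathbf{x}) = 1 - p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}), \tag{2.18}$$

with the Bayes classifier being [see (2.8) and (2.17)]

$$f_{\mathrm{B}}(\mathbf{x}) = \operatorname*{argmax}_{c_k} \{ p(\mathsf{y} = c_k \mid \mathbf{x}) \}. \tag{2.19}$$

The equivalent to the bias error from (2.16) can be defined as

$$\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\} = 1 - \delta(f_{\mathrm{B}}(\mathbf{x}), f_{\mathrm{maj}}(\mathbf{x})), \tag{2.20}$$

where

$$f_{\mathrm{maj}}(\mathbf{x}) = \operatorname*{argmax}_{c_k} \left\{ \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\delta(c_k, f(\mathbf{x}, \mathcal{D}))\} \right\}, \tag{2.21}$$

in other words the classifier that chooses the class receiving the majority of votes among the classifiers built from the distribution of the training set $\mathcal{D}$. The expectation $\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\delta(c_k, f(\mathbf{x}, \mathcal{D}))\}$ is equal to the probability that a random realization of $\mathcal{D}$ leads to $f(\mathbf{x}, \mathcal{D}) = c_k$, i. e.,

$$\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\delta(c_k, f(\mathbf{x}, \mathcal{D}))\} = \int_{\mathcal{D}:\, f(\mathbf{x}, \mathcal{D}) = c_k} p(\mathsf{D} = \mathcal{D} \mid \mathbf{x})\, \mathrm{d}\mathcal{D} = p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = c_k). \tag{2.22}$$

In order to find an expression for $\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$ which measures the variation of $f(\mathbf{x}, \mathcal{D})$ with respect to the training set $\mathcal{D}$, the following intuitive definition can be used:

$$\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\} = 1 - \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\delta(f_{\mathrm{maj}}(\mathbf{x}), f(\mathbf{x}, \mathcal{D}))\} = 1 - p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{maj}}(\mathbf{x})). \tag{2.23}$$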
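The definitions (2.18), (2.20) and (2.23) can be evaluated at a single input once the posteriors $p(\mathsf{y} = c_k \mid \mathbf{x})$ and the vote probabilities $p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = c_k)$ are given. The sketch below uses made-up numbers for a two-class problem and a hypothetical function name; it also evaluates the local average error according to the inner expression of (2.17) for comparison.

```python
def intuitive_terms(posterior, vote_prob):
    """Evaluate (2.18), (2.20), (2.23) and the local average error at one input.

    posterior[k] : p(y = c_k | x)
    vote_prob[k] : p_D(f(x, D) = c_k), cf. (2.22)
    """
    classes = range(len(posterior))
    f_bayes = max(classes, key=posterior.__getitem__)   # (2.19)
    f_maj = max(classes, key=vote_prob.__getitem__)     # (2.21)
    ile = 1.0 - posterior[f_bayes]                      # (2.18)
    bias = 0.0 if f_bayes == f_maj else 1.0             # (2.20)
    var = 1.0 - vote_prob[f_maj]                        # (2.23)
    # Local average error: inner expression of (2.17).
    local_error = 1.0 - sum(q * p for q, p in zip(vote_prob, posterior))
    return ile, bias, var, local_error


# Hypothetical input where the majority vote disagrees with the Bayes decision.
ile, bias, var, local_error = intuitive_terms([0.6, 0.4], [0.3, 0.7])
```

For these assumed values the three terms are 0.4, 1 and 0.3, whereas the local average error is 0.54, so in this example the terms do not add up to the local average error.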

Although the variance from (2.23) is a measure of the variability of the prediction, the decomposition of the local average error $\mathrm{E}_{y|\mathbf{x}}\{\mathrm{E}_{\mathcal{D}}\{1 - \delta(y, f(\mathbf{x}, \mathcal{D}))\}\}$ into the terms from (2.18), (2.20) and (2.23) has a drawback: the local average error is not equal to the sum of $\mathrm{ile}(\mathbf{x})$, $\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$ and $\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$. It should be noted that in classification tasks the following interesting effect may occur, which is impossible in regression: an increasing variance can lead to a decreasing prediction risk while the bias remains unchanged. This may happen when the predictions are wrong on average. A simple example illustrating this fact is presented in Appendix A.2.

A different decomposition of the local average error into a bias and a variance term has been introduced in [Tib97]. Here the irreducible local error $\mathrm{ile}(\mathbf{x})$ is the same as in (2.18), but the bias is defined as

$$\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\} = (1 - p(\mathsf{y} = f_{\mathrm{maj}}(\mathbf{x}) \mid \mathbf{x})) - (1 - p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x})). \tag{2.24}$$

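Due to the definition of $f_{\mathrm{B}}(\mathbf{x})$ the local bias, $\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$, is always greater than or equal to zero. With (2.24) and (2.18), it follows that⁸

$$\mathrm{ile}(\mathbf{x}) + \mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\} = 1 - p(\mathsf{y} = f_{\mathrm{maj}}(\mathbf{x}) \mid \mathbf{x}) = \mathrm{E}_{y|\mathbf{x}}\{L(y, f_{\mathrm{maj}}(\mathbf{x}))\}, \tag{2.25}$$

which is the local average error when using $f_{\mathrm{maj}}(\mathbf{x})$. In order to make sure that the local average error can be written as a sum of $\mathrm{ile}(\mathbf{x})$, $\mathrm{bias}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$ and $\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\}$, the variance term is defined by Tibshirani as

$$\mathrm{var}_{\hat{y}|\mathbf{x}}\{\hat{y}\} = \mathrm{E}_{\mathcal{D}}\{\mathrm{E}_{y|\mathbf{x},\mathcal{D}}\{L(y, f(\mathbf{x}, \mathcal{D}))\}\} - \mathrm{E}_{y|\mathbf{x}}\{L(y, f_{\mathrm{maj}}(\mathbf{x}))\}. \tag{2.26}$$

Since the term from (2.26) is not necessarily positive, it is called aggregation effect because it represents the aggregation of the variations between the decisions $f_{\mathrm{maj}}(\mathbf{x})$ and $f(\mathbf{x}, \mathcal{D})$. An example is shown in Appendix A.2.

It can be seen that a decomposition of the prediction risk into a simple sum of irreducible error, bias and variance as in regression is not suitable for classification tasks. That is the reason for the existence of a variety of bias-variance decompositions for classification. As shown above, some of them admit a negative variance and others do not sum up to the prediction risk.

A deeper insight into the bias-variance decomposition for classification can be gained from [Fri97]. The observation underlying this decomposition is that many classifiers, including CART, first compute an estimate⁹ $\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})$ of $p(\mathsf{y} = c_k \mid \mathbf{x})$ for each $c_k$, which is a regression task, and then choose

$$f(\mathbf{x}, \mathcal{D}) = \operatorname*{argmax}_{c_k} \{ \hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D}) \}. \tag{2.27}$$

Thus, the variance of the random variable $\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})$ is

$$\mathrm{var}_{\hat{p}|\mathbf{x}}\{\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})\} = \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{(\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D}) - \mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})\})^2\}. \tag{2.28}$$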

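In the following a two-class problem is analyzed since it conveys the main ideas.¹⁰ The decision of the Bayes classifier at input $\mathbf{x}$ is $f_{\mathrm{B}}(\mathbf{x})$. Then, using (2.17) one obtains

$$\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\mathrm{E}_{y|\mathbf{x},\mathcal{D}}\{L(y, f(\mathbf{x}, \mathcal{D}))\}\} = 1 - p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}) + (1 - p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{B}}(\mathbf{x}))) \cdot (2 p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}) - 1). \tag{2.29}$$

The derivation can be found in Appendix A.3. Since $p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{B}}(\mathbf{x}))$ denotes the probability that $f(\mathbf{x}, \mathcal{D})$, computed using a random training set $\mathcal{D}$, leads to the Bayes decision $f_{\mathrm{B}}(\mathbf{x})$, with (2.27) it holds that

$$p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{B}}(\mathbf{x})) = p(\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D}) > 0.5). \tag{2.30}$$

Assuming a Gaussian distribution¹¹ for $\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})$, the factor $(1 - p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{B}}(\mathbf{x})))$ from (2.29) is

$$1 - p_{\mathcal{D}}(f(\mathbf{x}, \mathcal{D}) = f_{\mathrm{B}}(\mathbf{x})) = \Phi\left( \frac{\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\} - 0.5}{\sqrt{\mathrm{var}_{\hat{p}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\}}} \right), \tag{2.31}$$

where $\Phi(\cdot)$ denotes the upper tail of the standard normal distribution.¹² Thus,

$$\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\mathrm{E}_{y|\mathbf{x},\mathcal{D}}\{L(y, f(\mathbf{x}, \mathcal{D}))\}\} = 1 - p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}) + \Phi\left( \frac{\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\} - 0.5}{\sqrt{\mathrm{var}_{\hat{p}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\}}} \right) \cdot (2 p(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}) - 1). \tag{2.32}$$

⁸ $\mathrm{E}_{y|\mathbf{x}}\{L(y, f_{\mathrm{maj}}(\mathbf{x}))\} = \sum_{k=1}^{K} (1 - \delta(c_k, f_{\mathrm{maj}}(\mathbf{x})))\, p(\mathsf{y} = c_k \mid \mathbf{x}) = 1 - p(\mathsf{y} = f_{\mathrm{maj}}(\mathbf{x}) \mid \mathbf{x})$.

⁹ $\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})$ is a random variable.

¹⁰ A multi-class problem can be decomposed into a series of two-class problems. For example, the one-against-all technique that is commonly used with support vector machines is such an approach [LZ05]. Here, the results from multiple classifiers, each solving a two-class problem, are combined for the final decision.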
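The behavior of (2.32) can be explored numerically. The following sketch evaluates the local average error of a two-class problem under the Gaussian assumption for several means and variances of the posterior estimate. The function names and the chosen numbers are illustrative assumptions; the upper tail of the standard normal distribution is computed with `math.erfc` from the Python standard library.

```python
import math


def upper_tail(t: float) -> float:
    """Upper tail of the standard normal distribution, p(z > t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def local_error(p_bayes: float, mean_p_hat: float, var_p_hat: float) -> float:
    """Local average error according to (2.32).

    p_bayes    : p(y = f_B(x) | x), the true posterior of the Bayes class
    mean_p_hat : expectation of the estimate p_hat(y = f_B(x) | x, D) over D
    var_p_hat  : variance of that estimate over D
    """
    # Probability that the classifier deviates from the Bayes decision, (2.31).
    p_not_bayes = upper_tail((mean_p_hat - 0.5) / math.sqrt(var_p_hat))
    return (1.0 - p_bayes) + p_not_bayes * (2.0 * p_bayes - 1.0)


p_bayes = 0.7  # hypothetical true posterior of the Bayes class

# Estimate correct on average (mean above 0.5): less variance, less error.
err_good_high_var = local_error(p_bayes, 0.6, 0.04)
err_good_low_var = local_error(p_bayes, 0.6, 0.0025)

# Estimate wrong on average (mean below 0.5): less variance, more error.
err_bad_high_var = local_error(p_bayes, 0.4, 0.04)
err_bad_low_var = local_error(p_bayes, 0.4, 0.0025)
```

In this sketch the error approaches its minimum value `1 - p_bayes` when the mean of the estimate lies above 0.5 and the variance shrinks, although the estimate of the posterior remains biased (0.6 instead of 0.7).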

With this decomposition it is possible to interpret the observation made with the other two bias-variance decompositions, namely that an increasing variance can lead to a decreasing local average error while the bias remains unchanged. As long as $\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\} < 0.5$, a decreasing variance of $\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})$ leads to a smaller argument in $\Phi(\cdot)$ and thus to a greater local average error. On the other hand, when $\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})\} > 0.5$, a decreasing variance of $\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})$ leads to a decreasing local average error $\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\mathrm{E}_{y|\mathbf{x},\mathcal{D}}\{L(y, f(\mathbf{x}, \mathcal{D}))\}\}$.

The main reason to present the last bias-variance decomposition for classification is an observation that can be made from (2.32): there is no additive relation between bias and variance of the estimates $\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})$ and the generalization error in the corresponding classification task. A low generalization error can be achieved by using a regression algorithm to estimate the posteriors that yields the largest value $\mathrm{E}_{\mathcal{D}|\mathbf{x}}\{\hat{p}(\mathsf{y} = c_k \mid \mathbf{x}, \mathcal{D})\}$ for $c_k = f_{\mathrm{B}}(\mathbf{x})$ and at the same time a low variance of $\hat{p}(\mathsf{y} = f_{\mathrm{B}}(\mathbf{x}) \mid \mathbf{x}, \mathcal{D})$. Since the variance appears in the denominator of the argument of $\Phi(\cdot)$, it is possible to reduce the classification error to its minimum value by reducing only the variance. This is the key to understanding the good performance of ensemble techniques, which are variance-reducing techniques and the basis for the algorithms that will be developed in this thesis.

¹¹ The final conclusions are not dependent on this assumption.

¹² If the random variable $z$ is Gaussian distributed with mean $\mu$ and variance $\sigma^2$, then the probability $p(z > a)$ is equal to $p\left(\frac{z - \mu}{\sigma} > \frac{a - \mu}{\sigma}\right)$. Since $(z - \mu)/\sigma$ has a standard normal distribution, $p(z > a) = \Phi\left(\frac{a - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{\mu - a}{\sigma}\right)$.

2.1.2.2 Evaluation of Machine Learning Algorithms in Practice

The generalization property of a machine learning model, i. e., its prediction capability on independent data, is described by (2.6) and (2.7). Having good estimates of this performance is extremely important in practice since it gives a measure of the quality of the chosen learning model. The intuitive way to estimate the prediction risk of a model $f$ that has been trained using the set $\mathcal{D}$ requires another very large (infinite) set of patterns $(\mathbf{x}_i, y_i)$ which stem from the same population as the patterns in $\mathcal{D}$. For each of the inputs $\mathbf{x}_i$ of this new set, $f(\mathbf{x}_i, \mathcal{D})$ is computed and compared to the true output $y_i$. For regression problems the average of the squared errors $(y_i - f(\mathbf{x}_i, \mathcal{D}))^2$ is an estimate of $R(f)$. For classification the proportion of misclassified inputs $\mathbf{x}_i$ is an estimate of $R(f)$ if the 0/1-loss is used. In practice there is often no possibility to generate another very large set of patterns in addition to $\mathcal{D}$. Thus, $R(f)$ must be estimated using only the patterns in $\mathcal{D}$. There are four commonly used types of such estimates.

I. First Estimate of R(f)

The first estimate is the worst and simply estimates $R(f)$ from the training set $\mathcal{D}$ which has been previously used to construct the learning model:

$$\hat{R}_{\mathcal{D}}(f) = \frac{1}{M} \sum_{m=1}^{M} (y_m - f(\mathbf{x}_m, \mathcal{D}))^2 \tag{2.33}$$

for regression and

$$\hat{R}_{\mathcal{D}}(f) = \frac{1}{M} \sum_{m=1}^{M} (1 - \delta(y_m, f(\mathbf{x}_m, \mathcal{D}))) \tag{2.34}$$

for classification. The estimate $\hat{R}_{\mathcal{D}}(f)$ is too optimistic since it is identical to $R_{\mathrm{emp}}(f)$ after training and $R_{\mathrm{emp}}(f)$ was used by the learning algorithm as cost function. As an example, the case where $R_{\mathrm{emp}}(f)$ can be reduced to zero is considered. Then also $\hat{R}_{\mathcal{D}}(f)$ is zero, suggesting that the generalization error is zero. This will most probably not be the case since the learning algorithm may have overfitted the training data and for many patterns $(\mathbf{x}, y)$ which do not belong to $\mathcal{D}$ the loss $L(y, f(\mathbf{x}, \mathcal{D}))$ is not zero.¹³ In order to get a good estimate of the prediction risk, the patterns in the set used for estimating $R(f)$ must be independent of the patterns in $\mathcal{D}$.

II. Second Estimate of R(f)

The second possibility to estimate $R(f)$ is to divide the set $\mathcal{D}$ randomly into sets $\mathcal{D}_1$ and $\mathcal{D}_2$. The idea here is to use only $\mathcal{D}_1$ for training and to estimate $R(f)$ based on $\mathcal{D}_2$. The number of patterns is often chosen to be $|\mathcal{D}_1| = \frac{2}{3} M$ and $|\mathcal{D}_2| = \frac{1}{3} M$. The estimate for $R(f)$ obtained here is

$$\hat{R}_{\mathcal{D}_2}(f) = \frac{1}{|\mathcal{D}_2|} \sum_{(\mathbf{x}_m, y_m) \in \mathcal{D}_2} (y_m - f(\mathbf{x}_m, \mathcal{D}_1))^2 \tag{2.35}$$

for regression and

$$\hat{R}_{\mathcal{D}_2}(f) = \frac{1}{|\mathcal{D}_2|} \sum_{(\mathbf{x}_m, y_m) \in \mathcal{D}_2} (1 - \delta(y_m, f(\mathbf{x}_m, \mathcal{D}_1))) \tag{2.36}$$

for classification. The main drawback of this procedure is that it reduces the size of the training data considerably. This might not be a problem in some applications where a large number of training patterns is available. However, since in most practical applications the size of the training set $\mathcal{D}$ is restricted, the following two methods, namely $V$-fold cross-validation and the bootstrap method, are presented.

¹³ In the bias-variance framework this means that the variance term is large.

III. Third Estimate of R(f)

In $V$-fold cross-validation the set $\mathcal{D}$ is randomly divided into $V$ disjoint subsets of nearly equal size $\mathcal{D}_1, \ldots, \mathcal{D}_V$. The idea underlying the $V$-fold method is to construct $V$ learning models $f_v$ by using the set $\mathcal{D} \setminus \mathcal{D}_v$ for training the $v$-th model and to compute the estimate $\hat{R}_{\mathcal{D}_v}(f_v)$ of the prediction error of this model by using $\mathcal{D}_v$. The final model $f$ is constructed by using the learning procedure with the whole training set $\mathcal{D}$. The $V$-fold cross-validation estimate of the prediction risk of the final model is

$$\hat{R}_{\mathrm{CV}}(f, V) = \frac{1}{V} \sum_{v=1}^{V} \hat{R}_{\mathcal{D}_v}(f_v). \tag{2.37}$$
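The first three estimates can be compared on a toy problem. The sketch below is a hypothetical setup and not an experiment from the thesis: it assumes a one-dimensional two-class source with 20 % label noise and uses a nearest-neighbor classifier as learning model, because this classifier reproduces its training set exactly and therefore makes the optimism of the first estimate obvious. All function names are illustrative.

```python
import random


def make_data(m, seed=0):
    """Toy source: class 1 if x > 0.5, with 20 % of the labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


def train_1nn(train):
    """Return a nearest-neighbor classifier built on the given patterns."""
    def f(x):
        return min(train, key=lambda p: abs(p[0] - x))[1]
    return f


def error_rate(f, patterns):
    """Proportion of misclassified patterns, i.e. mean 0/1-loss."""
    return sum(1 for x, y in patterns if f(x) != y) / len(patterns)


def cv_estimate(data, n_folds, seed=0):
    """V-fold cross-validation estimate (2.37)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[v::n_folds] for v in range(n_folds)]  # disjoint subsets D_v
    risks = []
    for v in range(n_folds):
        train = [p for w in range(n_folds) if w != v for p in folds[w]]  # D \ D_v
        risks.append(error_rate(train_1nn(train), folds[v]))
    return sum(risks) / n_folds


data = make_data(150)

# First estimate (2.34): evaluated on the training set itself.
resubstitution = error_rate(train_1nn(data), data)

# Second estimate (2.36): 2/3 for training, 1/3 for testing.
# The patterns are generated in random order, so slicing is a random split.
holdout = error_rate(train_1nn(data[:100]), data[100:])

# Third estimate (2.37) with V = 10.
cv10 = cv_estimate(data, 10)
```

In this sketch `resubstitution` is exactly zero because every training input is its own nearest neighbor, while the hold-out and cross-validation estimates reflect the label noise of the assumed source.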

A special case of the V -fold cross-validation method is the so-called Leave-One-Out (LOO) ˆ CV (f, M ) is called method where the special choice V = M is made. The resulting estimate R the jackknife estimate of R(f ). Jackknifing is a frequently used sampling technique. Its basic idea is to apply the LOO method and compute the statistic of interest with one point “left out” leading to M estimates. Based on these estimates properties of the statistic, e. g., mean or variance can be estimated. If there are enough patterns in the training set so that the variance resulting from a special choice of the training set is small, the jackknife gives good estimates of the prediction risk because each of the M classifiers is very similar to the final classifier that must be tested. This will be discussed below. A drawback of the jackknife estimate is its high computational burden that is required to compute it. In Chapters 6 and 7 the jackknife estimate will be used for the car crash classification task. In order to answer the question what value should be chosen for V , again the bias-variance ˆ CV for a fixed value of framework must be considered. Here, the local bias of the estimate R 14 V and the input x is   ˆ CV (f (x, D), V )} = ED|x,V R ˆ CV (f (x, D), V ) − R(f (x)) biasRˆCV |x,V {R = R(fv (x)) − R(f (x)), (2.38)

with R(f (x)) = ED|x Ey|x,D {L(y, f (x, D))} and the variance   2    ˆ ˆ ˆ RCV (f (x, D), V ) − ED|x,V RCV (f (x, D), V ) . varRˆCV |x,V RCV (f (x, D), V ) = ED|x,V (2.39) ˆ CV (f (x, D), V )} measures the difference between the prediction The local bias biasRˆCV |x,V {R risk for the input x which results from the mapping fv that is computed using a training set of size |D \ Dv | and the prediction risk for the input x of the mapping f whose training ˆ CV (f (x, D), V ) due to set has the size |D| = M . The variance measures the variation of R a special choice of the set D. Figures 2.2 and 2.3 help in understanding the bias-variance tradeoff better. Since the construction of the function f depends on the size M of the 14

ˆ CV (f (x, D), V ) is an unbiased estimate of R(fv (x)). R

Dieses Werk ist copyrightgeschützt und darf in keiner Form vervielfältigt werden noch an Dritte weitergegeben werden. Es gilt nur für den persönlichen Gebrauch.

16

2. Machine Learning   ˆ CV (f (x, D), V ) − R(f ))2 ED|x,V (R R(f )

R(fB ) M 

M

M

Figure 2.2: Dependence of R(f ) on M

V 

V

V =M

Figure 2.3: Mean squared error of the estimate ˆ CV (f (x, D), V ) R

training set also the prediction risk depends on $M$. Fig. 2.2 shows that the prediction error $R(f)$ decreases as a function of the set size $M$. Assuming a training set of size $|D| = M$, Fig. 2.2 also visualizes the bias for the two choices $M'$ and $M''$. If $V'$-fold cross-validation is performed, a training set of size $M' = |D \setminus D_v|$ is used during the construction of each of the models $f_v$, $v = 1, \dots, V'$. Due to the large value of $M'$, the bias of $\hat{R}_{\mathrm{CV}}(f(x, D), V')$ will be low since the models $f_v$ and $f$ use training sets of almost the same size, but the variance can be high because all $V'$ models are built on almost the same data. A different realization of $D$ may lead to a significantly different estimate $\hat{R}_{\mathrm{CV}}(f(x, D), V')$, especially when the input points in the training set do not cover the input space sufficiently. On the other hand, if $V$ is chosen to be $V'' < V'$, the models $f_v$, which are based on training sets of size $M''$, are constructed mainly with different data and the variance of $\hat{R}_{\mathrm{CV}}(f(x, D), V'')$ is low. However, the bias will be high since the mappings $f_v$ are based on significantly smaller training sets than the mapping $f$. The resulting mean squared error of the estimate $\hat{R}_{\mathrm{CV}}(f(x, D), V)$ is represented qualitatively in Fig. 2.3. If the set $D$ is rather small, five- or tenfold cross-validation leads to a good bias-variance tradeoff [BFOS84, HTF01].

Many machine learning algorithms depend on so-called metaparameters $\alpha$ which are not adapted by the learning algorithm. If no domain knowledge is available that would justify a special choice of these metaparameters, cross-validation can be used to tune them. The metaparameters are then chosen to be $\tilde{\alpha}$, where $\tilde{\alpha}$ minimizes $\hat{R}_{\mathrm{CV}}(f^{(\alpha)}, V)$, the superscript $(\alpha)$ indicating that the mapping $f^{(\alpha)}$ depends on the metaparameters $\alpha$. Here, $\hat{R}_{\mathrm{CV}}(f^{(\tilde{\alpha})}, V)$ is the $V$-fold cross-validation estimate of the prediction risk when using the metaparameters $\tilde{\alpha}$.

Then, the learning procedure for the final model $f^{(\tilde{\alpha})}$ uses the whole set $D$, and $\hat{R}_{\mathrm{CV}}(f^{(\tilde{\alpha})}, V)$ can be used as an estimate of its performance. Nevertheless, the estimate $\hat{R}_{\mathrm{CV}}(f^{(\tilde{\alpha})}, V)$ might$^{15}$ be slightly optimistic: the subset left apart when training the $v$-th model influences the choice of $\tilde{\alpha}$ through the cross-validation estimate of $R(f^{(\tilde{\alpha})})$, so that the learning model is not constructed independently of the data set that is used to compute the estimate of $R(f^{(\tilde{\alpha})})$. Therefore, another $\tilde{V}$-fold cross-validation step can be integrated on top of the presented $V$-fold cross-validation to determine the performance of $f^{(\tilde{\alpha})}$. For each fold in this additional $\tilde{V}$-fold cross-validation, which divides $D$ into $D_{\tilde{v}}$ and $D \setminus D_{\tilde{v}}$, the initial $V$-fold cross-validation is performed on the set $D \setminus D_{\tilde{v}}$ in order to compute a value $\tilde{\alpha}_{\tilde{v}}$. Setting the metaparameters to $\tilde{\alpha}_{\tilde{v}}$, one has to compute the mapping $f^{(\tilde{\alpha}_{\tilde{v}})}$ based on $D \setminus D_{\tilde{v}}$ and then the estimate $\hat{R}_{D_{\tilde{v}}}(f^{(\tilde{\alpha}_{\tilde{v}})})$ with the set $D_{\tilde{v}}$. Compared to $\hat{R}_{\mathrm{CV}}(f^{(\tilde{\alpha})}, V)$, the average of the $\tilde{V}$ estimates $\hat{R}_{D_{\tilde{v}}}(f^{(\tilde{\alpha}_{\tilde{v}})})$ is a rather pessimistic$^{16}$ estimate of the prediction risk of the model $f^{(\tilde{\alpha})}$.

$^{15}$ $\hat{R}_{\mathrm{CV}}(f^{(\tilde{\alpha})}, V)$ is not necessarily an optimistic estimate of $R(f^{(\tilde{\alpha})})$ since all models used to compute the estimate are based only on a subset of the data in $D$, namely $D \setminus D_v$ for the $v$-th model.

Due to the computational complexity of the latter approach, a combination of $V$-fold cross-validation and the second method to estimate $R(f)$ is preferred. Here, one divides the set $D$ randomly into a large subset $D_1$ and a smaller subset $D_2$. Then, the metaparameters $\tilde{\alpha}$ are found based on $V$-fold cross-validation using only $D_1$. A fair estimate of the prediction risk of the model $f$ that is built using $\tilde{\alpha}$ and all available data $D$ is given by $\hat{R}_{D_2}(f)$.
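The combination of $V$-fold cross-validation on $D_1$ with a hold-out estimate on $D_2$ can be sketched as follows. This is only an illustration under stated assumptions: the toy data, the one-dimensional k-nearest-neighbour rule (whose neighbourhood size plays the role of the metaparameter), and all function names are hypothetical and not taken from the thesis.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def error_rate(train, test, k):
    """Average 0/1 loss on `test` of the k-NN rule built from `train`."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)

def cv_error(data, k, n_folds):
    """V-fold cross-validation estimate of the prediction risk."""
    folds = [data[v::n_folds] for v in range(n_folds)]
    errors = []
    for v in range(n_folds):
        # Train on everything except fold v, evaluate on fold v.
        rest = [p for u, fold in enumerate(folds) if u != v for p in fold]
        errors.append(error_rate(rest, folds[v], k))
    return sum(errors) / n_folds

def make_data(n, rng):
    """Two overlapping 1-D Gaussian classes (toy data, purely illustrative)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

rng = random.Random(0)
D = make_data(200, rng)            # i.i.d. draws, so slicing is a random split
D1, D2 = D[:150], D[150:]          # large tuning subset, smaller evaluation subset

candidates = [1, 3, 5, 9, 15]      # candidate metaparameter values (odd k avoids ties)
cv_scores = {k: cv_error(D1, k, n_folds=5) for k in candidates}
best_k = min(candidates, key=lambda k: cv_scores[k])

# D2 played no role in choosing best_k, so this estimate is not biased by the tuning.
holdout_risk = error_rate(D1, D2, best_k)
print(best_k, cv_scores[best_k], holdout_risk)
```

The nested $\tilde{V}$-fold variant would wrap the selection of `best_k` in a second loop over outer folds, at roughly $\tilde{V}$ times the cost, which is the computational burden the text refers to.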

IV. Fourth Estimate of $R(f)$

The bootstrap is a general procedure to assess statistical accuracy based on resampling with replacement from the set $D$ containing $M$ realizations. A bootstrap set $D_{\mathrm{BS},b}$ is constructed by resampling with replacement $M$ times from $D$. A large number $B$ of such sets $D_{\mathrm{BS},b}$, $b = 1, \dots, B$, is constructed. Then, the bootstrap estimate of a statistic $x$ is computed by taking the mean over the individual estimates $\hat{x}_b$ generated by using the sets $D_{\mathrm{BS},b}$:

$$\bar{x}_{\mathrm{BS}} = \frac{1}{B} \sum_{b=1}^{B} \hat{x}_b. \tag{2.40}$$

With the bootstrap procedure any aspect of the distribution of $x$ can be estimated, e.g., its mean or variance [ZB98]. One should note that each bootstrap sample $D_{\mathrm{BS},b}$ contains only about 63.2% of the data in $D$. The remaining 36.8% are repetitions. This is due to the fact that the probability for a realization not to be in the bootstrap set $D_{\mathrm{BS},b}$ is $(1 - 1/M)^M \approx e^{-1} \approx 0.368$ for large $M$. In order to obtain a good bootstrap estimate of $R(f)$, a model $f_b$ is first constructed from each of the bootstrap sets. Denoting by $B_m$ the set containing the indices of those bootstrap samples which do not include the pattern $(x_m, y_m)$, the bootstrap estimate of $R(f)$ is$^{17}$

$$\hat{R}_{\mathrm{BS}}(f) = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{|B_m|} \sum_{b \in B_m} L(y_m, f_b(x_m)). \tag{2.41}$$

For the computation of $f_b$ only approximately 63.2% of the data in the set $D$ is used. Thus, $\hat{R}_{\mathrm{BS}}(f)$ is an estimate of $R(f)$ similar to the two-fold estimate $\hat{R}_{\mathrm{CV}}(f, V = 2)$. It has been explained above that such an estimator has a low variance but might have a high bias. Therefore, $\hat{R}_{\mathrm{BS}}(f)$ is in general too pessimistic an estimate of $R(f)$. This is the reason why the "0.632 estimator" is introduced in [HTF01] to reduce the bias:

$$\hat{R}_{\mathrm{BS},0.632}(f) = 0.368\, \hat{R}_{D}(f) + 0.632\, \hat{R}_{\mathrm{BS}}(f), \tag{2.42}$$

where $\hat{R}_{D}(f)$ is the resubstitution error measured on the training set from (2.33) or (2.34). The estimate $\hat{R}_{\mathrm{BS},0.632}(f)$ may be overoptimistic when the model $f$ strongly overfits the data.

$^{16}$ The average of the $\tilde{V}$ values $\hat{R}_{D_{\tilde{v}}}(f^{(\tilde{\alpha}_{\tilde{v}})})$ is a pessimistic estimate of $R(f^{(\tilde{\alpha})})$ since all models used to compute the estimates are based on a training set smaller than $D$, namely $D \setminus D_{\tilde{v}}$.

$^{17}$ For large $M$ the cardinality of $B_m$ is $|B_m| \approx 0.368\,B$.
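The estimates (2.41) and (2.42) can be sketched in a few lines. This is a minimal illustration, not the procedure used in the thesis: the nearest-class-mean classifier, the toy data and all names are hypothetical, and patterns that happen to appear in every bootstrap set (empty $B_m$) are simply skipped, which deviates slightly from (2.41).

```python
import random

def fit_nearest_mean(sample):
    """Return a classifier assigning x to the class with the closest mean."""
    means = {}
    for label in {y for _, y in sample}:
        xs = [x for x, y in sample if y == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def bootstrap_632(data, n_boot, rng):
    """Return (resubstitution error, bootstrap estimate, 0.632 estimate)."""
    M = len(data)
    losses = [[] for _ in range(M)]   # losses[m]: 0/1 losses of models with b in B_m
    for _ in range(n_boot):
        idx = [rng.randrange(M) for _ in range(M)]    # resample with replacement
        model = fit_nearest_mean([data[i] for i in idx])
        drawn = set(idx)
        for m in range(M):
            if m not in drawn:                        # pattern m was left out
                x, y = data[m]
                losses[m].append(float(model(x) != y))
    used = [l for l in losses if l]   # assumption: skip patterns with empty B_m
    r_bs = sum(sum(l) / len(l) for l in used) / len(used)
    full_model = fit_nearest_mean(data)
    r_resub = sum(full_model(x) != y for x, y in data) / M
    return r_resub, r_bs, 0.368 * r_resub + 0.632 * r_bs

rng = random.Random(1)
data = []
for _ in range(80):                   # toy data: two overlapping 1-D classes
    y = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * y, 1.0), y))

r_resub, r_bs, r_632 = bootstrap_632(data, n_boot=100, rng=rng)
print(r_resub, r_bs, r_632)
```

Because the 0.632 estimate is a convex combination, it always lies between the resubstitution error and the leave-out bootstrap estimate.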


Confusion Matrix and Receiver Operating Characteristic

In addition to computing estimates of the prediction risk from (2.7), two other evaluation methods are frequently used in classification tasks. The first is based on the confusion-matrix and the second, which can be applied only to binary classification tasks, on the Receiver Operating Characteristic (ROC) curve. Both techniques must be used on a test set in order to obtain a fair evaluation of a classifier's performance.

The confusion-matrix helps in analyzing quickly whether a classification system is confusing two classes. Each row in the matrix represents the examples in the test set that belong to an actual class and each column the examples that belong to a predicted class. The confusion matrix for a binary classification task is given in Table 2.1.

$$\begin{array}{c|cc} & \hat{y} = c_1 & \hat{y} = c_2 \\ \hline y = c_1 & \alpha_{cm} & \beta_{cm} \\ y = c_2 & \gamma_{cm} & \zeta_{cm} \end{array}$$

Table 2.1: Confusion-matrix for a binary classification task

There are $\alpha_{cm} + \beta_{cm}$ examples in the test set which belong to class $c_1$ but only $\alpha_{cm}$ are classified correctly; the remaining $\beta_{cm}$ are allocated to class $c_2$. From the $\gamma_{cm} + \zeta_{cm}$ examples in the test set which belong to class $c_2$ only $\zeta_{cm}$ are classified correctly and the remaining $\gamma_{cm}$ are allocated erroneously to class $c_1$. The following figures of merit exist for the two-class problem based on the entries in the confusion matrix from Table 2.1:

$$\text{Estimated True Positive Rate:} \quad \widehat{TP} = \frac{\zeta_{cm}}{\gamma_{cm} + \zeta_{cm}} \tag{2.43}$$

$$\text{Estimated False Positive Rate:} \quad \widehat{FP} = \frac{\beta_{cm}}{\alpha_{cm} + \beta_{cm}} \tag{2.44}$$

$$\text{Estimated Accuracy:} \quad \widehat{AC} = \frac{\alpha_{cm} + \zeta_{cm}}{\alpha_{cm} + \beta_{cm} + \gamma_{cm} + \zeta_{cm}} \tag{2.45}$$

$$\text{Estimated Balanced Error Rate:} \quad \widehat{BER} = \frac{1}{2} \left( \frac{\beta_{cm}}{\alpha_{cm} + \beta_{cm}} + \frac{\gamma_{cm}}{\gamma_{cm} + \zeta_{cm}} \right). \tag{2.46}$$

Whereas $\widehat{TP}$, $\widehat{FP}$ and $\widehat{AC}$ are self-explanatory, the estimated balanced error rate $\widehat{BER}$ has been introduced to take unbalanced data sets into account. If the data set is unbalanced one might obtain a high estimated accuracy $\widehat{AC}$ but a very poor performance in classifying the minority class. Other measures that are based on the confusion-matrix and their analysis can be found in [Fla03].

The ROC-curve has been introduced by the signal processing community in order to evaluate the capability of a binary classifier to distinguish informative radar signals from noise. Meanwhile it is used in various fields, e.g., in medicine to assess the usefulness of a diagnostic test. It is a plot with the false positive rate on the abscissa and the true positive rate on the ordinate. The ROC-curve is obtained by varying a decision threshold$^{18}$ that separates class $c_1$ from class $c_2$. For each possible value of this threshold one obtains a pair of $\widehat{FP}$ and $\widehat{TP}$ performance rates and thus a point on the ROC-curve.

$^{18}$ For example, the Neyman-Pearson test [Kay98] or the Bayes test for binary classification take a decision by comparing the likelihood ratio $p(\mathsf{x} = x \mid \mathsf{y} = c_2) / p(\mathsf{x} = x \mid \mathsf{y} = c_1)$ with a threshold.

Figure 2.4: ROC-curve (abscissa: $\widehat{FP}$, ordinate: $\widehat{TP}$, both ranging from 0 to 1)

Fig. 2.4 shows a ROC-curve. The better the classifier, the nearer the curve is to the upper left corner of the plot. The diagonal line represents a classifier that predicts at random. Thus, a frequently used performance measure is the area under the ROC-curve. When the area under the curve is 1 one has the ideal classifier since it can achieve an accuracy of 1 when the threshold is chosen correctly. An area under the curve of 0.5 corresponds to the classifier that decides randomly.

The best value of the threshold, i.e., the value leading to the smallest prediction risk, can be computed as described in the following. First, one expresses the dependence between the true positive rate and the false positive rate as

$$TP = g_{\mathrm{ROC}}(FP), \tag{2.47}$$

where the function $g_{\mathrm{ROC}}$ is estimated by the ROC-curve. The true positive rate and the false positive rate are

$$TP = \int_{\{x : f(x) = c_2\}} p(\mathsf{x} = x \mid \mathsf{y} = c_2)\, dx, \qquad FP = \int_{\{x : f(x) = c_2\}} p(\mathsf{x} = x \mid \mathsf{y} = c_1)\, dx. \tag{2.48}$$

Thus, the prediction risk from (2.7) can be written as

$$\begin{aligned} R(f) &= L(c_1, c_2)\, p(\mathsf{y} = c_1)\, FP + L(c_2, c_1)\, p(\mathsf{y} = c_2)\, (1 - TP) \\ &= L(c_1, c_2)\, p(\mathsf{y} = c_1)\, FP + L(c_2, c_1)\, p(\mathsf{y} = c_2)\, (1 - g_{\mathrm{ROC}}(FP)), \end{aligned} \tag{2.49}$$

where $p(\mathsf{y} = c_1)$ and $p(\mathsf{y} = c_2)$ are the prior probabilities for the classes $c_1$ and $c_2$, respectively. The prediction risk is minimal if

$$\frac{\partial R(f)}{\partial FP} = L(c_1, c_2)\, p(\mathsf{y} = c_1) - L(c_2, c_1)\, p(\mathsf{y} = c_2)\, \frac{\partial g_{\mathrm{ROC}}(FP)}{\partial FP} \overset{!}{=} 0. \tag{2.50}$$

Therefore, one has to choose the threshold which leads to the point on the ROC-curve where the slope is equal to the ratio $L(c_1, c_2)\, p(\mathsf{y} = c_1) / \big(L(c_2, c_1)\, p(\mathsf{y} = c_2)\big)$, i.e.,

$$\frac{\partial g_{\mathrm{ROC}}(\widehat{FP})}{\partial \widehat{FP}} = \frac{L(c_1, c_2)\, p(\mathsf{y} = c_1)}{L(c_2, c_1)\, p(\mathsf{y} = c_2)}. \tag{2.51}$$

The main benefit of the ROC-curve is that the performance of a binary classifier can be characterized and optimized without knowing the underlying probability densities or the prediction risk.
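As a rough illustration of the ROC discussion, the sketch below sweeps a threshold over classifier scores, computes the area under the curve by the trapezoidal rule, and picks the operating point that minimizes the empirical counterpart of the risk (2.49). Since an empirical ROC-curve is piecewise linear, the slope condition (2.51) is replaced here by a direct search over the observed points, which is an assumption of this sketch. The scores, labels and function names are hypothetical; label 1 plays the role of class $c_2$.

```python
def roc_points(scores, labels):
    """Return (threshold, FP rate, TP rate) triples; label 1 stands for class c2."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(float("inf"), 0.0, 0.0)]     # threshold above all scores
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC-curve by the trapezoidal rule."""
    area = 0.0
    for (_, fp0, tp0), (_, fp1, tp1) in zip(points, points[1:]):
        area += (fp1 - fp0) * (tp0 + tp1) / 2.0
    return area

def best_operating_point(points, prior_c1, prior_c2, loss_fp=1.0, loss_fn=1.0):
    """Point minimizing L(c1,c2) p(c1) FP + L(c2,c1) p(c2) (1 - TP), cf. (2.49)."""
    def risk(point):
        _, fp, tp = point
        return loss_fp * prior_c1 * fp + loss_fn * prior_c2 * (1.0 - tp)
    return min(points, key=risk)

scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9]   # hypothetical classifier outputs
labels = [0, 0, 0, 1, 0, 1, 1, 1]

points = roc_points(scores, labels)
area = auc(points)
# Missing a c2 example is assumed to cost twice as much as a false alarm.
thr, fp_rate, tp_rate = best_operating_point(points, 0.5, 0.5, loss_fp=1.0, loss_fn=2.0)
print(area, thr, fp_rate, tp_rate)
```

Raising `loss_fn` moves the selected point towards higher true positive rates at the cost of more false positives, which mirrors the role of the loss ratio in (2.51).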


2.2 Machine Learning Algorithms for Classification

This section gives an overview of prominent machine learning algorithms with a focus on the techniques that will be used in the second part of this thesis for time series classification. According to Shawe-Taylor and Cristianini [STC04], machine learning algorithms have gone through three important development stages. In the 1960s efficient algorithms were designed for detecting linear relations in the data, the Perceptron being one example. The second very important development occurred in the 1980s when a "nonlinear revolution" began due to the introduction of the backpropagation technique as a learning algorithm for Multi-Layer Perceptron (MLP) networks. Such Neural Networks (NN) are able to cope with nonlinear relations in the data. Despite having huge advantages like being universal approximators,$^{19}$ NN suffer from the drawback of local minima. The third significant development took place in the mid-1990s with the development of Support Vector Machines (SVM). SVM belong to the family of kernel methods, which have been known since the 1960s. The great advantage of SVM is that they come with all the advantages of the kernel methods, e.g., they are universal approximators and their learning model is built directly from the data, but unlike previous kernel methods they do not need to examine the whole data base $D$ to predict the value $y$ of a new input $x$. SVM enabled researchers to analyze nonlinear relations without the drawbacks known from backpropagation algorithms, but they require a decision about the kernel that is used. Choosing the kernel is a difficult task that depends on the knowledge one has about the data, and often this knowledge is very limited or not available.

$^{19}$ A universal approximator can represent any continuous function with compact support [Bis95, Lin95].

A fourth important development in pattern recognition started around the year 2000 with techniques like bagging, boosting or arcing, which are ensemble learning methods. The idea underlying these methods is the concept of weak learners and their linear combinations. In addition to being universal approximators, ensemble learning methods offer some advantages which help in analyzing the process producing the patterns. Ensemble learning methods offer the possibility to make statements about the relevance of features as well as statements about the confidence in the computed values. Ensemble learning techniques also provide methods for machine learning that do not suffer from overfitting. Moreover, they can be used as "off-the-shelf" tools for regression or classification. Ensemble learning methods, in particular the Random Forest (RF) algorithm, are the basis for the time series classification methods that are introduced in this thesis and they will be discussed in Subsection 2.2.5.

2.2.1 Linear Basis Expansion Models for Classification

A large category of useful classifiers $f$ is realized using linear combinations of a set of basis functions. In a first step the input vector $x \in \mathbb{R}^N$ is transformed into the vector $u \in \mathbb{R}^K$ with its $k$-th entry being

$$u_k = \sum_{h=1}^{H} \varphi_h(x, \vartheta_h)\, w_{k,h}. \tag{2.52}$$

Here, $\varphi_h(\cdot, \vartheta_h)$ represents the $h$-th basis function, which depends on the free parameters $\vartheta_h$, and $w_{k,h}$ is a weight. Thus, the vector $u$ can be expressed as

$$u = W \varphi(x, \vartheta), \tag{2.53}$$


where $W \in \mathbb{R}^{K \times H}$ is the weight matrix with the $(k, h)$-th entry being $w_{k,h}$, $\varphi(x, \vartheta) = [\varphi_1(x, \vartheta_1), \dots, \varphi_H(x, \vartheta_H)]^T \in \mathbb{R}^H$ is the vector containing the values of the $H$ basis functions evaluated for the input $x$, and $\vartheta = \{\vartheta_1, \dots, \vartheta_H\}$ denotes the free parameters in all basis functions. In order to approximate the Bayes classifier the mapping from $x$ to $u$, i.e., the matrix $W$ and the free parameters $\vartheta$, can be computed from

$$\{W, \vartheta\} = \operatorname*{argmin}_{\tilde{W}, \tilde{\vartheta}} \; E_{\mathsf{x}} \left\{ \left\| \tilde{W} \varphi(\mathsf{x}, \tilde{\vartheta}) - \boldsymbol{p}(\mathsf{y} \mid \mathsf{x}) \right\|_2^2 \right\}, \tag{2.54}$$

with

$$\boldsymbol{p}(\mathsf{y} \mid \mathsf{x}) = [p(\mathsf{y} = c_1 \mid \mathsf{x} = x), \dots, p(\mathsf{y} = c_K \mid \mathsf{x} = x)]^T. \tag{2.55}$$

As shown in Appendix A.4 the same values for $W$ and $\vartheta$ can be obtained from

$$\{W, \vartheta\} = \operatorname*{argmin}_{\tilde{W}, \tilde{\vartheta}} \; E_{\mathsf{x}, \bar{\mathsf{y}}} \left\{ \left\| \tilde{W} \varphi(\mathsf{x}, \tilde{\vartheta}) - \bar{\mathsf{y}} \right\|_2^2 \right\}, \tag{2.56}$$

where $\bar{\mathsf{y}} \in \{0, 1\}^K$ is the unit vector representation of the random variable $\mathsf{y}$, i.e., $\mathsf{y} = c_k$ is represented by the target unit vector $\bar{y} = e_k$, which is an all-zero vector except for the $k$-th entry that has the value 1. Thus, in practice $W$ and $\vartheta$ are computed based on the training set $D$ to minimize the residual sum-of-squares

$$R_{\mathrm{emp}}(\tilde{W}, \tilde{\vartheta}, D) = \frac{1}{M} \sum_{m=1}^{M} \left\| \tilde{W} \varphi(x_m, \tilde{\vartheta}) - \bar{y}_m \right\|_2^2. \tag{2.57}$$

Having computed suitable values for $W$ and $\vartheta$, the vector $u$ corresponding to an input $x$ is determined according to (2.53). Due to (2.54) and (2.56) the vector $u$ is not only an estimate of the unit vector representation $e_k$ of the class $c_k$ corresponding to $x$, but also an estimate of the vector $\boldsymbol{p}(\mathsf{y} \mid \mathsf{x})$.

A further step that can be performed, but which is often skipped in practice, is to transform the entries in $u$ into better estimates of the probabilities $p(\mathsf{y} = c_k \mid \mathsf{x} = x)$. The estimates should fulfill the properties of probabilities, i.e., the values must belong to the interval $[0, 1]$ and the sum over all classes must be 1. To accomplish this goal the so-called confidence mapping introduced in [Sch96] is adopted.$^{20}$ Given an input $x$, each entry $u_k \in \mathbb{R}$ in $u = W \varphi(x, \vartheta)$ is mapped to its confidence value $\mathrm{cnf}_k(u_k) \in [0, 1]$. These mappings are performed by taking the class-specific histograms obtained from the training set into consideration. The confidence mapping is described in Appendix A.5. The remaining requirement, namely that the estimates of the conditional probabilities sum up to 1, is achieved by the normalization of $\mathrm{cnf}_k(u_k)$:

$$\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = x) = \frac{\mathrm{cnf}_k(u_k)}{\sum_{\ell=1}^{K} \mathrm{cnf}_\ell(u_\ell)}. \tag{2.58}$$

$^{20}$ As stated in [Sch96] the confidence mapping can be treated as an additional functional approximation which uses the values in $u$ as features, with the goal of providing an improved approximation of the conditional class probabilities $\boldsymbol{p}(\mathsf{y} \mid \mathsf{x})$. The most important effect of confidence mapping lies in providing confidence scores that are consistent with statistical experience.
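For fixed basis functions, minimizing (2.57) is an ordinary least-squares problem with one-hot targets. The following sketch solves it via the normal equations for a scalar input and a quadratic basis. It is only an illustration: the basis, the toy data and all names are assumptions, and the confidence mapping is skipped in favour of the simpler clip-and-normalize step (negative entries of $u$ set to zero before normalizing).

```python
def basis(x):
    """Fixed basis functions for a scalar input (no free parameters)."""
    return [1.0, x, x * x]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(B[0]))] for i in range(n)]

def fit_weights(xs, ys, n_classes):
    """Least-squares fit to one-hot targets; returns W transposed (H x K)."""
    Phi = [basis(x) for x in xs]
    H = len(Phi[0])
    gram = [[sum(p[i] * p[j] for p in Phi) for j in range(H)] for i in range(H)]
    # Phi^T Y with one-hot Y: sum the basis values of the patterns of each class.
    rhs = [[sum(p[i] for p, y in zip(Phi, ys) if y == k) for k in range(n_classes)]
           for i in range(H)]
    return solve(gram, rhs)

def posterior(Wt, x):
    """Clip negative entries of u to zero and normalize them to sum to one."""
    phi = basis(x)
    u = [sum(phi[h] * Wt[h][k] for h in range(len(phi))) for k in range(len(Wt[0]))]
    clipped = [max(v, 0.0) for v in u]
    total = sum(clipped)
    return [v / total for v in clipped]

xs = [-3.0, -2.5, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]          # class indices 0, 1, 2

Wt = fit_weights(xs, ys, n_classes=3)
predict = lambda x: max(range(3), key=lambda k: posterior(Wt, x)[k])
print(predict(-3.0), predict(0.0), predict(3.0))
```

Because the constant function is part of the basis and the one-hot targets sum to one, the entries of $u$ also sum to one, so at least one entry is positive and the normalization is well defined in this setting.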


The values $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = x)$ form the vector of estimated posterior probabilities$^{21}$

$$\hat{\boldsymbol{p}}(\mathsf{y} \mid \mathsf{x}) = [\hat{p}(\mathsf{y} = c_1 \mid \mathsf{x} = x), \dots, \hat{p}(\mathsf{y} = c_K \mid \mathsf{x} = x)]^T, \tag{2.59}$$

which is used for classification by assigning to a query input $x$ the class $c_k$ whose corresponding entry in $\hat{\boldsymbol{p}}(\mathsf{y} \mid \mathsf{x})$ is maximal.

$^{21}$ If the confidence mapping step is skipped, the estimated conditional probabilities $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = x)$ are computed analogously to (2.58) by replacing the values $\mathrm{cnf}_k(u_k)$ with the clipped values $\max(u_k, 0)$, i.e., $u_k$ for $u_k \geq 0$ and $0$ for $u_k < 0$.

In many machine learning algorithms, e.g., polynomial regression [Sch96] or perceptron classification [Bis95], the basis functions $\varphi_h(x, \vartheta_h)$ from (2.52) are chosen to be fixed, i.e., there are no free parameters $\vartheta_h$. Due to the limitations$^{22}$ of such models, many successful algorithms applied in practice use a mechanism to adapt the basis functions to the data. The MLP neural networks and the Radial Basis Function (RBF) networks, which will be described in the following, belong to the latter group.

$^{22}$ The algorithms have a high bias in many applications since the model space that is spanned by the basis functions is too small to approximate the mapping from $x$ to $y$ appropriately.

2.2.2 Multi-Layer Perceptron Neural Networks for Classification

The term neural network is used in the machine learning community for the implementation of the mapping $f$ by a combination of simple processing elements.$^{23}$ The mapping $f$ from $x$ to $\hat{y}$ is determined by the connections between the processing units and the parameters in these units. Fig. 2.5 shows an MLP neural network for classification with one hidden layer, where the simple processing elements that build the network are realized by the basis functions

$$\varphi_h(x, \vartheta_h) = \varphi_h(\vartheta_h^T x). \tag{2.60}$$

$^{23}$ A well-founded description and analysis of neural networks can be found in [Bis95].

Figure 2.5: Multi-Layer Perceptron for classification (inputs $x_1, \dots, x_N$; hidden layer of basis functions $\varphi_1, \dots, \varphi_H$ with parameters $\vartheta$; weights $w_{1,1}, \dots, w_{K,H}$; outputs $u_1, \dots, u_K$; optional confidence mapping; normalization to $\hat{\boldsymbol{p}}(\mathsf{y} \mid \mathsf{x})$; $\max(\cdot)$)

In an MLP with more hidden layers the evaluations of the basis functions at $x$ are not directly multiplied with the weights $w_{k,h}$ but are instead used as inputs by another layer of basis functions. With one or more hidden layers, the mapping implemented by an MLP belongs to the group of linear basis expansions that was described in the previous subsection. The parameters $\vartheta = \{\vartheta_1, \dots, \vartheta_H\}$ are determined by minimizing $R_{\mathrm{emp}}(\tilde{W}, \tilde{\vartheta}, D)$ from (2.57) so that the resulting basis functions $\varphi_h(\cdot, \vartheta_h)$ are data adaptive. This leads to a large model space, which is necessary for a low bias in complex applications.

The MLP networks started to play an important role in machine learning due to the backpropagation technique,$^{24}$ which allows the gradient descent method to be applied very efficiently to the task of minimizing $R_{\mathrm{emp}}(\tilde{W}, \tilde{\vartheta}, D)$ from (2.57) when the basis functions are chosen to be sigmoidal,$^{25}$ i.e.,

$$\varphi_h(x, \tilde{\vartheta}_h) = \frac{1}{1 + \exp(-\tilde{\vartheta}_h^T x)}. \tag{2.61}$$

$^{24}$ The backpropagation technique was first described in [Wer74] but it gained recognition only after the publication of [RHW86].

$^{25}$ In order to obtain a larger model space the input $x$ is often replaced by the augmented input $[1, x^T]^T \in \mathbb{R}^{N+1}$ and the vector of free parameters in the $h$-th basis function, $\tilde{\vartheta}_h$, by the augmented vector $[\tilde{\vartheta}_{h,0}, \tilde{\vartheta}_{h,1}, \dots, \tilde{\vartheta}_{h,N}]^T \in \mathbb{R}^{N+1}$.

The main advantage of sigmoidal basis functions is that their derivatives are simple:

$$\frac{\partial \varphi_h(x, \tilde{\vartheta}_h)}{\partial \tilde{\vartheta}_h} = \varphi_h(x, \tilde{\vartheta}_h) \left( 1 - \varphi_h(x, \tilde{\vartheta}_h) \right) x. \tag{2.62}$$

Thus, the derivatives which are required in the gradient descent method can be calculated from the function values, so that the implementation of the optimization task is simplified significantly [Bis95]. Other optimization techniques that can be used to minimize $R_{\mathrm{emp}}(\tilde{W}, \tilde{\vartheta}, D)$ are presented in [Bis95], e.g., the conjugate gradient algorithm, the Newton method or the Levenberg-Marquardt algorithm.

Although MLP have been applied successfully to numerous applications, neural networks suffer from the drawback of local minima. The optimization criterion $R_{\mathrm{emp}}(\tilde{W}, \tilde{\vartheta}, D)$ that must be minimized is nonconvex, so that the computed solutions $W$ and $\vartheta$ in general represent local minima. Moreover, the solutions are highly dependent on the initialization values when starting the optimization. Since in machine learning the aim is to minimize the prediction risk and not the empirical risk, it is not required to find the global minimum in order to obtain a machine learning model that generalizes well. On the contrary, the global minimum or local minima with very small values for the empirical risk come along with overfitting, i.e., the variance is high and therefore also the prediction risk. This effect also occurs when the empirical risk is reduced dramatically by choosing a large number $H$ of basis functions. Thus, one of the few possibilities that remain to find a good trade-off between bias and variance, i.e., to generalize well, is to measure the prediction risk with an additional data set $D_2$ and to try out various initialization values as well as various choices for the number $H$ of basis functions. The necessity to try out various settings is unsatisfactory, not only because the available data must be divided into a training and a validation set, but also because of the computational resources and the time required to compute and evaluate the models.
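The derivative (2.62) is easy to verify numerically, which is a useful sanity check when implementing gradient descent by hand. The sketch below compares the analytic expression with central finite differences; the function names and the example values are hypothetical.

```python
import math

def sigmoid_basis(x, theta):
    """Sigmoidal basis function (2.61) for an (augmented) input vector x."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def analytic_gradient(x, theta):
    """Gradient with respect to theta according to (2.62): phi * (1 - phi) * x."""
    phi = sigmoid_basis(x, theta)
    return [phi * (1.0 - phi) * xi for xi in x]

def numeric_gradient(x, theta, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((sigmoid_basis(x, up) - sigmoid_basis(x, down)) / (2.0 * eps))
    return grad

x = [1.0, 0.5, -2.0]        # augmented input [1, x^T]^T
theta = [0.1, -0.3, 0.2]    # arbitrary parameter vector

max_diff = max(abs(a - n) for a, n in
               zip(analytic_gradient(x, theta), numeric_gradient(x, theta)))
print(max_diff)
```

The analytic gradient needs only the function value that was already computed in the forward pass, which is the simplification the text attributes to sigmoidal basis functions.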


2.2.3 Kernel Classifiers

A function $\kappa : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ is called a kernel if an inner product space $(\mathcal{F}, \langle \cdot, \cdot \rangle)$ and a mapping $\tilde{\Phi}$ into this space, $\tilde{\Phi} : \mathbb{R}^N \to \mathcal{F}$, exist with

$$\kappa(x_\ell, x_q) = \langle \tilde{\Phi}(x_\ell), \tilde{\Phi}(x_q) \rangle \quad \forall\, x_\ell, x_q \in \mathbb{R}^N. \tag{2.63}$$

A necessary and sufficient condition for a function $\kappa$ to be a valid kernel is that the Gram matrix $K$, whose elements are given by $\kappa(x_\ell, x_q)$, is positive semidefinite for all possible inputs $x_\ell \in \mathbb{R}^N$ and $x_q \in \mathbb{R}^N$ [STC04].

Kernel machine learning algorithms determine and keep points $x_K \in \mathbb{R}^N$ in the input space using the training set $D$, so that the output $f(x)$ is computed by a combination of the kernels $\kappa(x, x_K)$. This is different from MLP networks and many other machine learning algorithms, where the training set is used to adapt a vector $\vartheta$ of parameters. The kernel $\kappa(x, x_K)$ represents a weighting function which assigns a weight to $x_K$ based on its similarity to $x$. This way a localization is achieved since the points $x_K$ close to a query input $x$ have a higher weight in the computation of $f(x)$ than points that lie further away. The neighborhood relations are defined by the kernel. Most kernels possess a parameter $\alpha$ that dictates the width of the neighborhood, which is why they are denoted by $\kappa_\alpha$.

If all inputs $x_m$ from the training set are kept for the computation of $f(x)$, one speaks of memory-based methods because the entire training set must be stored. A prominent example of memory-based methods is the Nadaraya-Watson algorithm [HTF01].$^{26}$ In addition to memory-based models, so-called sparse kernel machines evolved, which compute the output $f(x)$ by evaluating the kernels $\kappa(x, x_K)$ only for a smaller number $H < M$ of points $x_K$. Important representatives of this group of machine learning algorithms are Support Vector Machines (SVM) and the Radial Basis Function (RBF) models. SVM are the most prominent kernel classification technique but they are only used in this thesis as a benchmark for the evaluation of the algorithm that will be developed in Chapter 3. Therefore, only a short overview of SVM is given in Appendix A.6. On the other hand, RBF models are the basis for the algorithm from Chapter 3 and thus they will be described in the following.

$^{26}$ The vector $u$ is computed by the Nadaraya-Watson algorithm as the kernel-weighted average
$$u = \frac{\sum_{m=1}^{M} \kappa_\alpha(x, x_m)\, \bar{y}_m}{\sum_{m=1}^{M} \kappa_\alpha(x, x_m)}.$$

Radial Basis Function Classifiers

RBF classifiers belong to the group of linear expansions of basis functions, where the basis functions are realized by kernels $\kappa(x, x_K)$. Thus, the $k$-th element of the vector $u$ is computed as in (2.52):

$$u_k = \sum_{h=1}^{H} \kappa_{\alpha_h}(x, x_{K,h})\, w_{k,h}, \tag{2.64}$$

where each basis function is characterized by a scale parameter $\alpha_h$ and a point $x_{K,h}$ in the input space $\mathbb{R}^N$, the so-called prototype. In addition to being valid kernels, the functions $\kappa_{\alpha_h}$ must possess the property that $\kappa_{\alpha_h}(x, x_{K,h})$ only depends on the distance between $x$ and the prototype, i.e., $\kappa_{\alpha_h}(x, x_{K,h}) = \kappa_{\alpha_h}(\| x - x_{K,h} \|)$. If the norm $\| x - x_{K,h} \|$ denotes the Euclidean distance one speaks of RBF,$^{27}$ otherwise of Generalized Radial Basis Functions (GRBF). The latter case will be presented in the next chapter. It should be noted that there exist other approaches than the one adopted in this thesis to introduce RBF, since the models have been developed and used independently in many different areas. For example, RBF have been used for interpolation in high-dimensional spaces [Pow87] or in the regularization framework [PG89] for learning a multivariate function from sparse data.

$^{27}$ The most popular choice for the kernel is the Gaussian kernel
$$\kappa_{\alpha_h}(x, x_{K,h}) = \exp\left( - \frac{\| x - x_{K,h} \|_2^2}{\alpha_h} \right).$$

The parameters $\alpha_h$, $x_{K,h}$ and $w_{k,h}$ must be determined from the training set $D$ according to (2.56) and (2.57) by minimizing the empirical risk

$$R_{\mathrm{emp}}(\tilde{\alpha}, \tilde{X}_K, \tilde{W}, D) = \frac{1}{M} \sum_{m=1}^{M} \left\| \tilde{W} \kappa_{\tilde{\alpha}}(x_m, \tilde{X}_K) - \bar{y}_m \right\|_2^2 \tag{2.65}$$

with respect to all parameters $\{\tilde{\vartheta}, \tilde{W}\} = \{\tilde{\alpha}, \tilde{X}_K, \tilde{W}\}$, with $\tilde{\alpha} = [\tilde{\alpha}_1, \dots, \tilde{\alpha}_H] \in \mathbb{R}^H$, $\tilde{X}_K = [\tilde{x}_{K,1}, \dots, \tilde{x}_{K,H}] \in \mathbb{R}^{N \times H}$ and $\kappa_{\tilde{\alpha}}(x_m, \tilde{X}_K) = [\kappa_{\tilde{\alpha}_1}(x_m, \tilde{x}_{K,1}), \dots, \kappa_{\tilde{\alpha}_H}(x_m, \tilde{x}_{K,H})]^T \in \mathbb{R}_+^H$. As with the MLP, the minimization criterion is nonconvex with multiple local minima, and the optimization algorithms that can be applied are similar to those used for MLPs. Another difficulty is the choice of a suitable number $H$ of prototypes. Therefore, the approach that is normally taken in order to find the parameters of RBF is to estimate $\{\alpha, X_K\}$ separately from $W$. Given $\{\alpha, X_K\}$, the computation of $W$ is a least squares problem. Some of the techniques that can be applied in order to determine $\{\alpha, X_K\}$ are the k-means clustering procedure [DHS01], the orthogonal least squares method [CCG91], genetic algorithms [WC96] or the expectation maximization method [Orr98]. Another successful method has been introduced in [Kub98], where a decision tree is used to perform a partitioning of the input space and a prototype is created for each region. This method is related to the results in [WD92], which state that supervised learning of center locations can be very important for RBF learning. In Chapter 3 a new method will be introduced that is also based on supervised learning and which leads to values $\alpha$ and $X_K$ that facilitate a low prediction risk.

2.2.4 Classification and Regression Trees

The majority of the algorithms used in classification and regression tasks focus on finding an accurate approximation of the mapping from $x$ to $y$ without considering an aspect that is important in many practical applications: interpretability. Classification and Regression Tree (CART) algorithms are a white box approach which facilitates interpretability and helps in understanding the interactions among the features in $x$ leading to the output $y$. The CART algorithm was introduced in 1984 [BFOS84], driven by the need to deal with complex data$^{28}$ and by the wish to uncover the structure in the data. The CART algorithm generates binary trees. In this thesis only binary trees will be presented because CART is the basis for the algorithms that will be used and developed later in this work, although other tree structured procedures like multiway-split or multivariate trees exist [DHS01].

$^{28}$ In this thesis the input space is considered to be $\mathbb{R}^N$. There exist applications where the features are not necessarily real numbers, e.g., they can be categorical variables. Moreover, it may happen that not all inputs in the training set have the same dimensionality. In contrast to most machine learning algorithms CART can deal with such problems [BFOS84].

A binary tree structured procedure represents a multi-stage decision process where a binary decision is made at each stage. An example is shown in Fig. 2.6. Beginning with the input space $\mathbb{R}^N$, repeated splits of subsets of $\mathbb{R}^N$ into two descendant subsets are performed.

Figure 2.6: Binary tree structure (the input space $\mathbb{R}^N$ is divided by Split 1, Split 2 and Split 3 into subsets $X_1, \dots, X_6$)

The resulting machine learning structure is a tree consisting of nodes and branches. The nodes can be either internal or terminal nodes. An internal node implements a split into a right and a left node, called children, whereas terminal nodes, called leaves, do not have children but have a value associated with them (a real number in regression or a class label in classification tasks). The splits are determined using the training set $D$. A new input $x$ starts at the first node, called the root node, and falls down the tree according to the splits and its feature values until the input reaches a terminal node. The value at that terminal node is associated with the input $x$, thereby implementing the mapping $f$.

A node $t$ in a tree represents a subspace $X_j$ of $\mathbb{R}^N$ for which $p(\mathsf{x} \in t) > 0$. A partition of the whole space $\mathbb{R}^N$ into a finite number of disjoint subspaces will be denoted by $\tilde{T}$ and can be interpreted as the partition realized by the terminal nodes of a tree. The function associated with $\tilde{T}$ which assigns each input $x$ to a terminal node $t \in \tilde{T}$ is

$$\tau : \mathbb{R}^N \to \tilde{T}, \quad x \mapsto t, \tag{2.66}$$

so that $\tau(x) = t$ if, and only if, $x \in t$. A mapping $f$ corresponds to the partition $\tilde{T}$ if for each node $t \in \tilde{T}$ there is an assignment $\nu(t) \in \mathbb{R}$ for regression or $\nu(t) \in \{c_1, \dots, c_K\}$ for classification tasks so that $f(x) = \nu(t)$ for all $x \in t$. Then,

$$f(x) = \nu(\tau(x)). \tag{2.67}$$

The prediction risk of such a mapping is

$$R(f) = \int_{\mathbb{R}^N} E_{\mathsf{y} \mid \mathsf{x}} \{ L(\mathsf{y}, f(x)) \}\, p(\mathsf{x} = x)\, dx = \sum_{t \in \tilde{T}} E_{\mathsf{y} \mid \mathsf{x} \in t} \{ L(\mathsf{y}, \nu(t)) \}\, p(\mathsf{x} \in t). \tag{2.68}$$


By introducing the local minimum risk

$$r_{\min}(t) = \min_{\nu} E_{\mathsf{y} \mid \mathsf{x} \in t} \{ L(\mathsf{y}, \nu(\tau(\mathsf{x}))) \}, \tag{2.69}$$

the minimal prediction risk that results from the partition $\tilde{T}$ of $\mathbb{R}^N$ is

$$R_{\min}(\tilde{T}) = \sum_{t \in \tilde{T}} r_{\min}(t)\, p(\mathsf{x} \in t). \tag{2.70}$$

In regression problems, given a node $t \in \tilde{T}$, the mean and the variance of the output $\mathsf{y}$ in this node are $\mu_{\tilde{T}}(t) = E_{\mathsf{y} \mid \mathsf{x} \in t}\{\mathsf{y}\}$ and $\sigma^2_{\tilde{T}}(t) = E_{\mathsf{y} \mid \mathsf{x} \in t}\{(\mathsf{y} - \mu_{\tilde{T}}(t))^2\}$. Since the conditional mean estimator is optimal for regression (see Appendix A.1), the optimal value corresponding to a partition $\tilde{T}$ is $f_{B,\tilde{T}}(x) = \mu_{\tilde{T}}(\tau(x))$. Moreover, for regression problems using the loss $L(y, f(x)) = (y - f(x))^2$ it holds that $r_{\min}(t) = \sigma^2_{\tilde{T}}(t)$.

Classification is performed by first computing estimates of $[p(\mathsf{y} = c_1 \mid \mathsf{x} = x), \dots, p(\mathsf{y} = c_K \mid \mathsf{x} = x)]^T$ and then choosing the class with the highest estimated posterior probability. Given a node $t \in \tilde{T}$, the local minimum risk during the regression subtask of estimating the posterior probabilities in the node $t$ is

$$r_{\min}(t) = \sum_{k=1}^{K} p(\mathsf{y} = c_k \mid \mathsf{x} \in t) \left( 1 - p(\mathsf{y} = c_k \mid \mathsf{x} \in t) \right). \tag{2.71}$$

The derivation of this result can be found in Appendix A.7. In the classification problem, by choosing the class with the highest posterior probability one obtains the minimum local risk

$$r_{\min}(t) = \min_{c_\ell} \sum_{k=1}^{K} (1 - \delta(c_k, c_\ell))\, p(\mathsf{y} = c_k \mid \mathsf{x} \in t) = 1 - \max_{c_k} \{ p(\mathsf{y} = c_k \mid \mathsf{x} \in t) \}. \tag{2.72}$$

Using these results one can determine the risk reduction obtained by the split $s$ which divides the node $t \in \tilde{T}$ into two disjoint nodes $t_L$ and $t_R$. This leads to a new partition $\tilde{T}'$ of $\mathbb{R}^N$. The risk reduction due to the split is given by

$$\Delta R(s, t) = R_{\min}(\tilde{T}) - R_{\min}(\tilde{T}') = p(\mathsf{x} \in t) \left( r_{\min}(t) - p(\mathsf{x} \in t_L \mid \mathsf{x} \in t)\, r_{\min}(t_L) - p(\mathsf{x} \in t_R \mid \mathsf{x} \in t)\, r_{\min}(t_R) \right). \tag{2.73}$$

The relative risk reduction is

$$\Delta r(s \mid t) = \frac{R_{\min}(\tilde{T}) - R_{\min}(\tilde{T}')}{p(\mathsf{x} \in t)} = r_{\min}(t) - p(\mathsf{x} \in t_L \mid \mathsf{x} \in t)\, r_{\min}(t_L) - p(\mathsf{x} \in t_R \mid \mathsf{x} \in t)\, r_{\min}(t_R). \tag{2.74}$$

If one wants to refine the partition $\tilde{T}$ by splitting the node $t$, the optimal relative risk reduction splitting rule is to choose $s$ to maximize $\Delta r(s \mid t)$:

$$s_{\mathrm{opt}}(t) = \operatorname*{argmax}_{\tilde{s}} \{ \Delta r(\tilde{s} \mid t) \}. \tag{2.75}$$
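The splitting rule (2.75) can be sketched for a single real-valued feature as follows. Since the true probabilities are unknown in practice, the sketch plugs in relative frequencies from the patterns in the node, which is an assumption made here for illustration. The local risks follow (2.71) and (2.72); the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini_risk(labels):
    """Plug-in version of (2.71): sum_k p_k (1 - p_k) with relative frequencies."""
    n = len(labels)
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())

def misclassification_risk(labels):
    """Plug-in version of (2.72): 1 - max_k p_k."""
    return 1.0 - max(Counter(labels).values()) / len(labels)

def best_split(xs, ys, risk=gini_risk):
    """Threshold on one feature maximizing the relative risk reduction (2.74)."""
    best_thr, best_delta = None, 0.0
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0                      # candidate split: x <= thr
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        p_left = len(left) / len(ys)               # estimate of p(x in t_L | x in t)
        delta = risk(ys) - p_left * risk(left) - (1.0 - p_left) * risk(right)
        if delta > best_delta:
            best_thr, best_delta = thr, delta
    return best_thr, best_delta

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = ["a", "a", "a", "b", "b", "b"]

thr, delta = best_split(xs, ys)
print(thr, delta)
```

Passing `risk=misclassification_risk` selects splits according to (2.72) instead; the candidate thresholds are placed halfway between neighbouring feature values, which is a common but not the only possible choice.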



Up to here it has been assumed that the probability functions needed to compute the optimal values for partitions are known. In the following it will be shown how tree structured procedures can be applied in practice where only a training set D is available. The


construction of a tree from a training set $D$ consists of three tasks: the selection of the splits, the decision when to declare a node a terminal node or to continue splitting it, and the assignment of each terminal node to a possible value of the output $y$. Whereas the latter task is simple, the first two can be quite difficult.

The main idea when looking for the best split is to select the split in such a way that a maximal relative risk reduction is obtained. Here, the training set $D$ is used to estimate the probabilities that are required for computing the risk reduction. Given a partition $\tilde{T}$ of $\mathbb{R}^N$ that is realized by a tree mapping $f$, and a node $t \in \tilde{T}$ in this tree containing $M(t)$ patterns from the training set $D$, an estimate for $p(x \in t)$ is computed as

$$\hat{p}(x \in t) = \frac{M(t)}{M}. \tag{2.76}$$

The set $\mathcal{M}(t) = \{m \in \mathcal{M} : x_m \in t\}$ contains the indices of all those training patterns that lie in node $t$, so that $|\mathcal{M}(t)| = M(t)$. Further, it will be assumed that $\hat{p}(x \in t) > 0$ for all $t \in \tilde{T}$. The estimates of the minimum local risk $r_{\min}(t)$ in a node and of the total risk $R_{\min}(f)$ for the tree mapping $f$ [see (2.69) and (2.70)] are$^{29}$

$$\hat{r}_{\min}(t) = \min_{\nu} \frac{\sum_{\mathcal{M}(t)} L(y_m, \nu(\tau(x_m)))}{M(t)} \quad \text{and} \quad \hat{R}_{\min}(f) = \sum_{\tilde{T}} \hat{r}_{\min}(t)\, \hat{p}(x \in t). \tag{2.77}$$

It should be noted that the risk estimate $\hat{R}_{\min}(f)$ is equivalent to the empirical risk $R_{\mathrm{emp}}(f)$ or to the estimate $\hat{R}_D(f)$ from (2.33) and (2.34).

The mean $\mu_{\tilde{T}}(t)$ and the variance $\sigma^2_{\tilde{T}}(t)$ of the output $y$ in the node $t$ are approximated by

$$\hat{\mu}_{\tilde{T}}(t) = \frac{\sum_{\mathcal{M}(t)} y_m}{M(t)} \quad \text{and} \quad \hat{\sigma}^2_{\tilde{T}}(t) = \frac{\sum_{\mathcal{M}(t)} (y_m - \hat{\mu}_{\tilde{T}}(t))^2}{M(t)}. \tag{2.78}$$

To estimate the optimal values in regression tasks one sets $\nu(t) = \hat{\mu}_{\tilde{T}}(t)$, leading to $\hat{r}_{\min}(t) = \hat{\sigma}^2_{\tilde{T}}(t)$. The estimate for the optimal value given an input $x$ in regression is $f(x) = \hat{f}_{B,\tilde{T}}(x) = \hat{\mu}_{\tilde{T}}(\tau(x))$.

In classification one has to compute the estimates for the probabilities $p(y = c_k \mid x \in t)$, $k = 1, \dots, K$, in order to compute $\hat{r}_{\min}(t)$. With the set of indices $\mathcal{M}_k(t) = \{m \in \mathcal{M} : x_m \in t \text{ and } y_m = c_k\}$ and the notation $M_k(t) = |\mathcal{M}_k(t)|$ for the number of training patterns that belong to class $c_k$ in the node $t$, the estimate for $p(y = c_k \mid x \in t)$ is

$$\hat{p}(y = c_k \mid x \in t) = \frac{M_k(t)}{M(t)}. \tag{2.79}$$

Therefore, (2.71) and (2.72) turn into

$$\hat{r}_{\min}(t) = \sum_{k=1}^{K} \hat{p}(y = c_k \mid x \in t)\,\bigl(1 - \hat{p}(y = c_k \mid x \in t)\bigr) = \sum_{k=1}^{K} \sum_{\substack{k'=1 \\ k' \neq k}}^{K} \hat{p}(y = c_k \mid x \in t)\, \hat{p}(y = c_{k'} \mid x \in t) \tag{2.80}$$

and

$$\hat{r}_{\min}(t) = \min_{c_\ell} \sum_{k=1}^{K} \bigl(1 - \delta(c_k, c_\ell)\bigr)\, \hat{p}(y = c_k \mid x \in t) = 1 - \max_{c_k} \{\hat{p}(y = c_k \mid x \in t)\}. \tag{2.81}$$

$^{29}$ Note that in the sum $\sum_{\tilde{T}}$ only the terminal nodes $t$ are considered since these nodes define the partition $\tilde{T}$.

What is still missing in order to compute an estimate $\hat{s}_{\mathrm{opt}}(t)$ of the best empirical split [see (2.75)] is the estimate of the relative empirical risk reduction $\Delta r(s|t)$. For this reason the two estimates

$$\hat{p}(x \in t_L \mid x \in t) = \frac{\hat{p}(x \in t_L)}{\hat{p}(x \in t)} \quad \text{and} \quad \hat{p}(x \in t_R \mid x \in t) = \frac{\hat{p}(x \in t_R)}{\hat{p}(x \in t)} \tag{2.82}$$

of $p(x \in t_L \mid x \in t)$ and $p(x \in t_R \mid x \in t)$ are introduced so that the estimate of $\Delta r(s|t)$ [see (2.74)] is

$$\Delta\hat{r}(s|t) = \hat{r}_{\min}(t) - \hat{p}(x \in t_L \mid x \in t)\, \hat{r}_{\min}(t_L) - \hat{p}(x \in t_R \mid x \in t)\, \hat{r}_{\min}(t_R). \tag{2.83}$$

Thus, the optimal empirical split at node $t$ is

$$\hat{s}_{\mathrm{opt}}(t) = \operatorname*{argmax}_{\tilde{s}} \{\Delta\hat{r}(\tilde{s}|t)\}. \tag{2.84}$$
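The estimate of the relative risk reduction for one given split can be sketched as follows, assuming the Gini form of the empirical risk and class labels stored in plain Python lists. The function names are hypothetical. The conditional probabilities of the children reduce to the ratios $M(t_L)/M(t)$ and $M(t_R)/M(t)$, because the factor $1/M$ cancels.

```python
from collections import Counter


def gini_estimate(labels):
    """Empirical Gini risk of a node, using relative class frequencies."""
    m = len(labels)
    return sum((c / m) * (1.0 - c / m) for c in Counter(labels).values())


def relative_risk_reduction(left_labels, right_labels):
    """Empirical relative risk reduction of a split into two non-empty children."""
    parent = list(left_labels) + list(right_labels)
    m = len(parent)
    p_left = len(left_labels) / m    # estimated p(x in t_L | x in t)
    p_right = len(right_labels) / m  # estimated p(x in t_R | x in t)
    return (gini_estimate(parent)
            - p_left * gini_estimate(left_labels)
            - p_right * gini_estimate(right_labels))


print(relative_risk_reduction(["a", "a", "a"], ["b", "b", "a"]))
```

A split that separates two equally frequent classes perfectly yields the largest possible reduction for that node, namely the full parent risk of 0.5.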



Now it is possible to answer the question of how to choose the best split in practice when only a training set $D$ is available. In the CART algorithm only univariate splitting rules are considered, i.e., each split depends on the value of only a single feature $x_n$. Then, for an input $x$ a split $s$ divides the node $t$ according to the question “Is $x_n < \tau_{\mathrm{CART}}(s)$?”, where the split $s$ defines both the threshold $\tau_{\mathrm{CART}}(s)$ and the index $n$ of the feature in $x$. If for an input $x$ the answer to this question is “yes” one moves down to the left child node, otherwise to the right child node, thereby implementing a partition of $t$ whose border is a hyperplane perpendicular to the axis corresponding to $x_n$ in $\mathbb{R}^N$.

Given the training set $D$ there is only a finite set of possible splits. In a node $t$, for the $n$-th feature only the midpoints between consecutive distinct values of $x_n$ are considered. Assuming that all $M(t)$ values of $x_n$ from inputs of the training set that lie in the node $t$ are distinct, there are $M(t) - 1$ possible splits associated with the $n$-th feature in $t$. Thus, $N(M(t) - 1)$ splits and their relative risk reductions must be taken into consideration before deciding which split should be chosen in the node $t$. In classification tasks the relative risk reduction is normally computed with the risk from (2.80), the so-called Gini impurity, so that in each terminal node estimates for the posterior probabilities are available.

It is necessary to clarify what value should be associated with a terminal node. For regression tasks the estimated conditional mean rule for the node is used, i.e., all inputs that lie in the partition of the terminal node $t$ are assigned the value $f(x \in t) = \hat{\mu}_{\tilde{T}}(t)$, which is computed according to (2.78). For the classification task the estimate of the optimal decision requires assigning all inputs that lie in the partition of the terminal node $t$ to the class $c_k$ with the maximal estimated posterior $\hat{p}(y = c_k \mid x \in t)$. The values $\hat{p}(y = c_k \mid x \in t)$ are computed using (2.79).

In order to answer the question of how to decide when to declare a node a terminal node or to continue splitting it, one must understand this task as an estimation problem. Whereas a tree with many nodes, i.e., many partitions, is likely to overfit the training set, leading to a high variance, a tree with only a few nodes will most probably not be able


to model the structure in the data, leading to a high bias. When CART was introduced, many complicated rules aiming to decide when to stop growing a tree were tried out, all leading to unsatisfactory results. A “shift in focus” [BFOS84] led to the breakthrough. Instead of stopping the splitting according to a rule, in a first step one continues splitting until all terminal nodes have only a small number of elements and the input space $\mathbb{R}^N$ is divided into many partitions. In a second step one prunes this tree by selectively recombining subtrees. As the number of possible pruned subtrees of a tree is very large, a selective pruning procedure based on a cost function is necessary. All algorithms that are used and developed in this thesis use full-grown trees for classification, i.e., each terminal node contains only inputs that belong to one class. Therefore, the pruning process is not discussed here. Details on pruning can be found in [BFOS84] and [BN07a].
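The univariate split search described for CART can be sketched as an exhaustive loop over all features and all midpoints between consecutive distinct feature values. This is a deliberately naive illustration under the assumption of the Gini risk and list-based data. The function names are hypothetical, and practical implementations sort each feature once instead of re-partitioning the node for every threshold.

```python
from collections import Counter


def gini(labels):
    """Empirical Gini risk of a list of class labels."""
    m = len(labels)
    return sum((c / m) * (1.0 - c / m) for c in Counter(labels).values())


def best_univariate_split(X, y):
    """Return (feature index, threshold, risk reduction) of the best split.

    X is a list of feature vectors and y the list of class labels in the node.
    Inputs with x_n < threshold move to the left child, all others to the right.
    """
    m = len(y)
    parent_risk = gini(y)
    best = (None, None, 0.0)
    for n in range(len(X[0])):
        values = sorted(set(row[n] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = 0.5 * (lo + hi)
            left = [label for row, label in zip(X, y) if row[n] < threshold]
            right = [label for row, label in zip(X, y) if row[n] >= threshold]
            reduction = (parent_risk
                         - len(left) / m * gini(left)
                         - len(right) / m * gini(right))
            if reduction > best[2]:
                best = (n, threshold, reduction)
    return best


# Hypothetical node with four patterns and two features.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]]
y = ["a", "a", "b", "b"]
print(best_univariate_split(X, y))
```

In this toy node the first feature separates the two classes at the midpoint 2.5, so that split is selected.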

2.2.5 Ensemble Learning

Ensemble learning procedures make predictions by running a “base learning algorithm”, e.g., Classification and Regression Tree (CART), Radial Basis Function (RBF) or Multi-Layer Perceptron (MLP), $B$ times and then combining the predictions of the individual models. Each individual learner additionally depends on the parameter vector $\theta_b$, $b = 1, \dots, B$, which assures that the models are not identical. Ensemble learning models can be divided into two categories. If each base learner is computed independently of the other base learners one talks about parallel ensemble learning models. An example of a parallel ensemble method is the Random Forest (RF) algorithm that is discussed below. If the construction of a base learner depends on the other base learners one talks about sequential ensemble learning models. The AdaBoost algorithm [FS97, BN07a] is an example of a sequential ensemble model. In the following the bias-variance framework for averaging parallel ensemble models will be presented, i.e., models where the output is computed as an average of the outputs of $B$ independently constructed base learners.

2.2.5.1 Bias-Variance Framework in Averaging Ensemble Models

The bias-variance framework is a suitable approach to gain insight into the good performance of ensemble learning algorithms. It is assumed that the parameter vectors $\theta_b$, $b = 1, \dots, B$, are drawn independently from the distribution $p(j_b = \theta_b \mid D)$. Since the bias-variance decomposition for classification problems can be discussed by using the bias-variance decomposition in regression, the latter will be presented first. In regression tasks the mapping $f$ considered here is realized by the average over base learners

$$f(x, D, j) = \frac{1}{B} \sum_{b=1}^{B} f(x, D, j_b), \tag{2.85}$$


where $j = [j_1^T, \dots, j_B^T]^T$. Due to the additional random vector $j_b$ the bias-variance decomposition of the prediction risk from (2.16) for one base learner in the ensemble becomes

$$E_{x,y,D,j_b}\{(y - f_b(x, D, j_b))^2\} = \underbrace{E_x\{E_{y|x}\{y^2\} - f_B(x)^2\}}_{e_{i,b} = E_x\{\mathrm{ile}_b(x)\}} + \underbrace{E_x\{(f_B(x) - E_{D,j_b|x}\{f_b(x, D, j_b)\})^2\}}_{e_{b,b} = E_x\{(\mathrm{bias}_{\hat{y}|x}\{\hat{y}\})^2\}} + \underbrace{E_x\{E_{D,j_b|x}\{(f_b(x, D, j_b) - E_{D,j_b|x}\{f_b(x, D, j_b)\})^2\}\}}_{e_{v,b} = E_x\{\mathrm{var}_{\hat{y}|x}\{\hat{y}\}\}}. \tag{2.86}$$

Using the theorem of total variance,$^{30}$ $e_{v,b}$ can be decomposed into two terms

$$e_{v,b} = E_x\{E_{D|x}\{\mathrm{var}_{j_b|D,x}\{f(x, D, j_b)\}\}\} + E_x\{\mathrm{var}_{D|x}\{E_{j_b|D,x}\{f(x, D, j_b)\}\}\}. \tag{2.87}$$

Due to the introduction of the random parameter $j_b$ the bias and the variance of the individual learners $f(x, D, j_b)$ will be larger than those of the original learner $f(x, D)$ with no randomization. But as will be shown below, this can in general be compensated by averaging over the $B$ learners $f(x, D, j_b)$.

When averaging over the $B$ individual models the expectation of the squared bias $e_b^{(B)}$ of the ensemble learner is the same as the expectation of the squared bias $e_{b,b}$ of a base learner, since $E_{D,j_b|x}\{f(x, D, j_b)\}$ is the same for all $B$ base learners:

$$e_b^{(B)} = E_x\left\{\left(f_B(x) - \frac{1}{B} \sum_{b=1}^{B} E_{D,j_b|x}\{f(x, D, j_b)\}\right)^2\right\} = e_{b,b}. \tag{2.88}$$

An averaging ensemble learner does not decrease the bias of an individual learner, which in general is larger than the bias of the original learner with no randomization. One possibility to decrease the bias is to use a weighted combination of the individual learners. This is done for example in the AdaBoost algorithm. As pointed out in [Die02], a weighted sum of learners expands the space of functions that can be represented, thereby reducing the bias. The huge advantage of averaging is that the variance can be reduced dramatically. Assuming that the $B$ random variables $j_b$ are i.i.d. according to the distribution $p(j_b \mid D, x)$, the variance $e_v^{(B)}$ of the ensemble is:$^{31}$

$$e_v^{(B)} = E_x\{E_{D|x}\{\mathrm{var}_{j|D,x}\{f(x, D, j)\}\}\} + E_x\{\mathrm{var}_{D|x}\{E_{j|D,x}\{f(x, D, j)\}\}\} = E_x\left\{\frac{1}{B} E_{D|x}\{\mathrm{var}_{j_b|D,x}\{f(x, D, j_b)\}\}\right\} + E_x\{\mathrm{var}_{D|x}\{E_{j_b|D,x}\{f(x, D, j_b)\}\}\}. \tag{2.89}$$
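To make the role of correlation concrete, consider the special case in which all $B$ base learners have the same conditional variance $\sigma^2$ and the same pairwise correlation $\rho$. This equicorrelated setting is an assumption made here for illustration and is not taken from the text. The variance of the average given for correlated learners in footnote 31 then reduces to $\sigma^2(\rho + (1 - \rho)/B)$, which the following sketch evaluates.

```python
def variance_of_average(sigma2, rho, B):
    """Variance of the mean of B variables with equal variance sigma2
    and equal pairwise correlation rho (illustrative special case)."""
    variance_terms = B * sigma2                    # sum of the B variances
    covariance_terms = B * (B - 1) * rho * sigma2  # 2 * sum over all pairs
    return (variance_terms + covariance_terms) / B ** 2


for rho in (0.0, 0.5, 1.0):
    print(rho, variance_of_average(1.0, rho, 100))
```

For $\rho = 0$ the variance shrinks as $1/B$, whereas for $\rho = 1$, i.e., $B$ copies of the same model, averaging brings no reduction at all.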

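$^{30}$ Theorem of total variance:

$$\mathrm{var}_{a,b}\{f(a,b)\} = E_{a,b}\{f(a,b)^2\} - \bigl(E_{a,b}\{f(a,b)\}\bigr)^2 = E_b\{E_{a|b}\{f(a,b)^2\}\} - \bigl(E_b\{E_{a|b}\{f(a,b)\}\}\bigr)^2$$

$$= E_b\bigl\{E_{a|b}\{f(a,b)^2\} - \bigl(E_{a|b}\{f(a,b)\}\bigr)^2\bigr\} + E_b\bigl\{\bigl(E_{a|b}\{f(a,b)\}\bigr)^2\bigr\} - \bigl(E_b\{E_{a|b}\{f(a,b)\}\}\bigr)^2 = E_b\{\mathrm{var}_{a|b}\{f(a,b)\}\} + \mathrm{var}_b\{E_{a|b}\{f(a,b)\}\}.$$

$^{31}$ If the $B$ random variables $j_b$ are correlated the variance $\mathrm{var}_{j|D,x}\{f(x, D, j)\}$ is

$$\mathrm{var}_{j|D,x}\{f(x, D, j)\} = \frac{1}{B^2} \left( \sum_{b=1}^{B} \mathrm{var}_{j_b|D,x}\{f(x, D, j_b)\} + 2 \sum_{i=1}^{B} \sum_{i'=i+1}^{B} \mathrm{Cov}_{j_i, j_{i'}|D,x}\{f(x, D, j_i), f(x, D, j_{i'})\} \right)$$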


Therefore, the more uncorrelated the base learners in the ensemble are, the lower is the variance. In order to achieve a lower generalization error than the original learner with no randomization, the variance reduction of the ensemble learner must be greater than the increase in bias and variance that appears when introducing $j$. As empirical bias-variance analyses show, in general this is the case [GEW06].

In classification tasks the majority vote among the $B$ base classifiers is taken. This corresponds to computing estimates $\hat{p}(y = c_k \mid x, D, j)$ of the posteriors $p(y = c_k \mid x)$ by averaging the decisions, in their unit vector representation, of all $B$ base learners

$$\mathbf{f}(x, D, j) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{f}(x, D, j_b) \in [0, 1]^K \tag{2.90}$$

and by choosing that class with the maximal estimated posterior probability. Here, $\mathbf{f}(x, D, j_b)$ denotes the unit vector representation of $f(x, D, j_b)$, and the $k$-th element of $\mathbf{f}(x, D, j)$ is the estimate $\hat{p}(y = c_k \mid x, D, j)$. As known from Subsection 2.1.2 and (2.32), the variance of $\hat{p}(y = c_k \mid x, D, j)$ plays a crucial role for the generalization ability. With $\hat{p}_b(y = c_k \mid x, D, j_b) \in \{0, 1\}$ denoting the $k$-th element of $\mathbf{f}(x, D, j_b)$, due to (2.90) it holds that

$$\hat{p}(y = c_k \mid x, D, j) = \frac{1}{B} \sum_{b=1}^{B} \hat{p}_b(y = c_k \mid x, D, j_b). \tag{2.91}$$
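The averaging of unit-vector decisions in (2.90) and (2.91) amounts to counting votes. A minimal sketch, assuming that the decisions of the $B$ base classifiers for one input are available as a list of class labels (the function names are hypothetical):

```python
def ensemble_posteriors(decisions, classes):
    """Fraction of base classifiers voting for each class, cf. (2.91)."""
    B = len(decisions)
    return {c: sum(1 for d in decisions if d == c) / B for c in classes}


def majority_vote(decisions, classes):
    """Class with the maximal estimated posterior.

    Ties are resolved in favor of the class listed first in `classes`,
    which is an arbitrary choice of this sketch.
    """
    posteriors = ensemble_posteriors(decisions, classes)
    return max(classes, key=lambda c: posteriors[c])


decisions = ["a", "b", "a", "c", "a"]  # hypothetical votes of B = 5 classifiers
print(ensemble_posteriors(decisions, ["a", "b", "c"]))
print(majority_vote(decisions, ["a", "b", "c"]))
```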

Eq. (2.91) is equivalent to (2.85), so that according to (2.89) the variance of $\hat{p}(y = c_k \mid x, D, j)$,

$$\mathrm{var}(\hat{p}(y = c_k \mid x, D, j)) = E_x\{E_{D,j|x}\{(\hat{p}(y = c_k \mid x, D, j) - E_{D,j|x}\{\hat{p}(y = c_k \mid x, D, j)\})^2\}\}, \tag{2.92}$$

is much smaller than the variance of $\hat{p}_b(y = c_k \mid x, D, j_b)$,

$$\mathrm{var}(\hat{p}_b(y = c_k \mid x, D, j_b)) = E_x\{E_{D,j_b|x}\{(\hat{p}_b(y = c_k \mid x, D, j_b) - E_{D,j_b|x}\{\hat{p}_b(y = c_k \mid x, D, j_b)\})^2\}\}. \tag{2.93}$$

It should be mentioned that $\hat{p}_b(y = c_k \mid x, D, j_b)$ is not the estimate $\hat{p}^{(b)}(y = c_k \mid x, D, j_b)$ of the posterior that would be used by the $b$-th base learner. The estimate $\hat{p}^{(b)}(y = c_k \mid x, D, j_b)$ of the posterior as computed by algorithms like CART, MLP or RBF takes on values in the interval $[0, 1]$, whereas $\hat{p}_b(y = c_k \mid x, D, j_b) \in \{0, 1\}$. Therefore, in contrast to the regression bias-variance decomposition, a direct comparison of the variance resulting from an ensemble with the variance from an individual base learner cannot be performed. Nevertheless, low variances for the estimates $\hat{p}(y = c_k \mid x, D, j)$ are obtained with ensemble models by averaging.

As shown for binary classification problems by (2.32), a low variance of the posterior estimate $\hat{p}(y = f_B(x) \mid x, D, j)$ assures a low generalization error if$^{32}$ $E_{D,j|x}\{\hat{p}(y = f_B(x) \mid x, D, j)\}$ is the largest value in $E_{D,j|x}\{\mathbf{f}(x, D, j)\}$. Due to (2.91) it holds that $E_{D,j|x}\{\hat{p}(y = c_k \mid x, D, j)\} = E_{D,j_b|x}\{\hat{p}_b(y = c_k \mid x, D, j_b)\}$, and therefore the largest value in $E_{D,j_b|x}\{\mathbf{f}(x, D, j_b)\}$ must be $E_{D,j_b|x}\{\hat{p}_b(y = f_B(x) \mid x, D, j_b)\}$. In other words, the base classifiers must decide, on average over the $(D, j_b)$-space, most frequently as the Bayes classifier. This is the only requirement for the base learners in order to achieve a low generalization error. Thus, base learners that compute bad estimates $\hat{p}^{(b)}(y = c_k \mid x, D, j_b)$ of the posteriors $p(y = c_k \mid x)$ can lead to the lowest possible classification error when considered in an ensemble. On the other hand, if on average over the $(D, j_b)$-space a class other than $f_B(x)$ is most frequently chosen, a decreasing variance leads to a larger generalization error.

In [GEW06] it is pointed out that there is a tradeoff between bias and variance which is controlled by the randomization strength. With an increasing amount of randomization the dependence of the randomized models on the learning sample becomes smaller, i.e., the base learners are only weakly correlated, leading to a small variance of the averaging model. On the other hand, an increasing amount of randomization makes the dependence of the output prediction on the input features smaller and the bias becomes higher. Compared to the regression bias-variance analysis of ensemble techniques this means that for classification a higher randomization should be used in terms of the tradeoff in order to decrease the variance and thereby the prediction error, as long as on average over the $(D, j_b)$-space the base learners decide for an input $x$ in most instances as the Bayes classifier $f_B(x)$. Such methods using high randomization are the RF algorithm, which will be presented next, and the Extremely Randomized Trees algorithm [GEW06].

$^{31}$ (continued) with $\mathrm{Cov}_{j_i, j_{i'}|D,x}\{f(x, D, j_i), f(x, D, j_{i'})\} = E_{j_i, j_{i'}|D,x}\{f(x, D, j_i)\, f(x, D, j_{i'})\} - E_{j_i|D,x}\{f(x, D, j_i)\}\, E_{j_{i'}|D,x}\{f(x, D, j_{i'})\}$. The worst case happens when the same model $f(x, D, j_b)$ is used $B$ times in the ensemble. Only in this case the variance is not reduced by the ensemble. If the $B$ random variables $j_b$ are i.i.d. according to the distribution $p(j_b \mid D, x)$, the covariance $\mathrm{Cov}_{j_i, j_{i'}|D,x}\{f(x, D, j_i), f(x, D, j_{i'})\}$ is zero.

$^{32}$ Due to the additional random variable $j$ the equations from Subsection 2.1.2 are extended by taking the expectations over the $(D, j)$-space instead of only the $D$-space.
2.2.5.2 Random Forest Algorithm

The RF algorithm is a recent machine learning method for classification and regression problems. Inspired by [AG97], Leo Breiman developed the RF algorithm in 2001 [Bre01]. The RF algorithm is strongly connected to the bagging$^{33}$ technique that was introduced by Leo Breiman in 1996 [Bre96a]. In bagging, using the bootstrap method, $B$ new sets $D_b$, $b = 1, \dots, B$, of size $M$ are constructed by sampling with replacement from $D$. On each of these bootstrap sets a model $f(x, D_b)$ is constructed using a base learning algorithm. The final prediction is made by taking the average of the predictions of the $B$ models for regression and the majority vote for classification. The RF is a randomized ensemble learning method strengthened by bagging using CART as base learners. RF is one of the most powerful off-the-shelf machine learning procedures available nowadays.

Breiman defines the RF for classification as follows [Bre01]: a random forest is a classifier consisting of a collection of tree-structured classifiers $\{f(x, j_b)\}$, $b = 1, \dots, B$, where the $j_b$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$. The majority vote among the $B$ CART classifiers yields the output of the RF classifier. For regression each tree computes a real number as output and the RF prediction is computed by taking the average over the $B$ individual outputs. In the following RF for classification will be discussed.

The RF algorithm grows $B$ trees, where CART learning is perturbed by two random elements which form the vector $j$. On the one hand, the training set $D_b$ for each tree is constructed using the bootstrap method, and on the other hand, the search for the optimal split during tree growing is not performed on all $N$ features but only on

$$N_{\mathrm{RF}} < N \tag{2.94}$$

randomly chosen features at each node.

An important result derived in [Bre01] is that RF do not overfit. As more trees are added to the RF the generalization error decreases, converging to a limiting value. The derivation can also be found in Appendix A.8. As described above in the discussion about the bias-variance relation for ensemble techniques, a CART classifier in the RF must decide on average over the $j_b$-space$^{34}$ as the Bayes classifier so that the majority vote among the classifiers leads to a small generalization error. In the RF algorithm $B$ full-grown trees are constructed, i.e., each leaf in each tree contains only examples that belong to one class. With the notation introduced above this means that

$$\hat{p}^{(b)}(y = c_k \mid x, j_b) = \hat{p}_b(y = c_k \mid x, j_b) \in \{0, 1\}. \tag{2.95}$$

$^{33}$ The word bagging is the short form for “bootstrap aggregating”.
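The two random elements that perturb CART learning in the RF, the bootstrap set for each tree and the random feature subset at each node, can be sketched with the standard library as follows. The function names and the fixed seed are assumptions of this illustration, not part of the RF definition.

```python
import random


def bootstrap_indices(M, rng):
    """Draw M pattern indices with replacement from {0, ..., M-1}."""
    return [rng.randrange(M) for _ in range(M)]


def random_feature_subset(N, n_rf, rng):
    """Choose n_rf < N distinct feature indices for the split search at a node."""
    return rng.sample(range(N), n_rf)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
bag = bootstrap_indices(10, rng)
out_of_bag = sorted(set(range(10)) - set(bag))  # patterns never drawn
features = random_feature_subset(16, 4, rng)
print(bag, out_of_bag, features)
```

The patterns that are never drawn into a bootstrap set are exactly those that the corresponding tree has not seen during training.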

The $b$-th tree $f(x, j_b)$ segments the input space $\mathbb{R}^N$ into many regions so that in each region only inputs from its training set that belong to one class are left. Due to this fact the model space that is represented by full-grown CART is very large, and it is assured that whenever an input $x$ belongs to the training set $D_b$, this input (and the segment in the input space around it) is also classified to the corresponding target from the training set, which most probably is $f_B(x)$. Thus, if each tree $f(\cdot, j_b)$ is able to model the underlying pattern source to a certain degree, i.e., to create a suitable segmentation of the input space, on average over the $j_b$-space $f(x, j_b)$ decides as $f_B(x)$. On the other hand, the variance of the estimate $\hat{p}_b(y = c_k \mid x, j_b)$ for a full-grown tree is large because small changes in $j_b$ yield big differences in the way the $b$-th tree segments the input space $\mathbb{R}^N$. Nevertheless, due to the aggregation described by (2.91), which is required to compute the RF decision, the estimate $\hat{p}(y = f_B(x) \mid x, j)$ has a low variance

$$\mathrm{var}(\hat{p}(y = f_B(x) \mid x, j)) < \mathrm{var}(\hat{p}_b(y = f_B(x) \mid x, j_b)). \tag{2.96}$$

This is the key message of (2.89). The two requirements that assure a low generalization error for averaging ensemble classifiers are fulfilled by the RF classifier: the largest value in $E_{j_b}\{\mathbf{f}(x, j_b)\}$ must be $E_{j_b}\{\hat{p}_b(y = f_B(x) \mid x, j_b)\}$, and the variance of the estimate $\hat{p}(y = f_B(x) \mid x, j)$ that results after averaging must be small. In order to obtain a low variance according to (2.89) the trees should be as uncorrelated as possible.$^{35}$ This is achieved by growing each tree on an independent bootstrap sample from the training data and by randomly selecting $N_{\mathrm{RF}}$ features at each node for which the best split is determined. Although other classifiers than CART and other randomization methods than bootstrap can be used to obtain a powerful ensemble model, these choices make it possible to gain a lot of insight into the data. In addition to being a tool applicable to both classification and regression and being able to obtain a high accuracy despite a large feature dimensionality, RF offer the possibility to determine the importance of variables,$^{36}$ compute an unbiased estimate of the generalization error during the building process, find interactions in the data, detect outliers, and deal with missing values. Whereas the feature selection using RF will be presented in the next section, in the following the other properties of RF will be discussed.

$^{34}$ The generation of the training set for each tree in the RF contains random elements. Therefore, instead of the notation $f(x, D_b, j_b)$ the expression $f(x, j_b)$ is used in the following, assuming that the random variable $D_b$ is included in $j_b$.

$^{35}$ The expression “as uncorrelated as possible” means that by knowing the outcome of $f_{b_1}(x, j_{b_1})$ one should not know much about the outcome of $f_{b_2}(x, j_{b_2})$ with $b_1 \neq b_2$.

$^{36}$ In the 2003 NIPS competition on feature selection in high dimensional data (thousands of features) there were over 1600 entries from some of the most prominent people in machine learning. The top entries used RF for feature selection.

Out-Of-Bag Estimates

Using bagging not only introduces randomization, which is a key characteristic of RF, but it also allows the computation of an unbiased estimate of the generalization error that is achieved by the ensemble. This is performed by using the so-called out-of-bag$^{37}$ (oob) classifier. In order to understand this term one should note that due to the bootstrap about one-third of the training data is left out when constructing $f(x, j_b)$. This is due to the fact that the probability for a pattern not to be in the bootstrap set $D_b$ is $(1 - 1/M)^M \approx 0.37$ for large $M$. Thus, to estimate the classification performance for the input $x_m$ one can take the majority vote only among those trees which have not seen the pattern $(x_m, y_m)$ during training. By denoting with $\mathcal{B}_m$ the set of indices for the trees whose training sets do not contain the pattern $(x_m, y_m)$, the oob classifier first computes, as in (2.90),

$$\mathbf{f}_{\mathrm{oob}}(x_m, j) = \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} \mathbf{f}_b(x_m, j_b) \tag{2.97}$$

and then takes that class as output $f_{\mathrm{oob}}(x_m)$ whose entry in the vector $\mathbf{f}_{\mathrm{oob}}(x_m, j)$ containing the estimated posteriors $\hat{p}(y = c_k \mid x_m, j)$ is maximal. Then, the out-of-bag estimate for the prediction risk of the RF $f$ is the error rate of the oob classifier

$$\hat{R}_{\mathrm{oob}}(f) = \frac{1}{M} \sum_{m=1}^{M} L(f_{\mathrm{oob}}(x_m), y_m). \tag{2.98}$$
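The oob classifier and its error rate can be sketched as follows for the 0/1 loss. The data layout (per-tree prediction lists and per-tree sets of in-bag indices) and the function name are assumptions of this illustration. Unlike (2.98), the sketch divides by the number of patterns that are out of bag for at least one tree, which only matters for very small $B$.

```python
from collections import Counter


def oob_error(tree_predictions, in_bag, y):
    """Out-of-bag error rate with 0/1 loss.

    tree_predictions[b][m]: class predicted by tree b for pattern m.
    in_bag[b]: set of pattern indices contained in the bootstrap set of tree b.
    y[m]: true class of pattern m.
    Patterns that are in the bag of every tree are skipped.
    """
    errors, counted = 0, 0
    for m, target in enumerate(y):
        # Votes only from trees that have not seen pattern m during training.
        votes = Counter(tree_predictions[b][m]
                        for b in range(len(in_bag)) if m not in in_bag[b])
        if not votes:
            continue
        oob_class = votes.most_common(1)[0][0]
        errors += int(oob_class != target)
        counted += 1
    return errors / counted if counted else float("nan")


# Hypothetical forest with B = 3 trees and M = 3 patterns.
y = ["a", "b", "a"]
in_bag = [{0, 1}, {1, 2}, {0, 2}]
tree_predictions = [["a", "b", "b"], ["a", "b", "a"], ["a", "a", "a"]]
print(oob_error(tree_predictions, in_bag, y))
```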

Unlike the cross-validation estimate of the prediction risk, where bias is present but unknown,$^{38}$ the oob estimate of $R(f)$ is unbiased, i.e., $E_j\{\hat{R}_{\mathrm{oob}}(f)\} = R(f)$, if the number $B$ of trees is large enough [Bre01]. Because the error rate decreases as the number of trees $B$ in the RF increases, the out-of-bag classifier for small $B$ tends to offer a pessimistic estimate of the generalization error, overestimating it. One must increase $B$ past the value where the out-of-bag estimate of the prediction error converges. In this case the decision for an input $x$ made by $B$ trees or only by $\approx 0.37B$ trees is the same, i.e., $f(x) = f_{\mathrm{oob}}(x)$, and thus $E_j\{\hat{R}_{\mathrm{oob}}(f)\} = R(f)$. In [Bre96b] Breiman gives empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same size $M$ as the training set. By using the out-of-bag estimate there is no need for an additional test set, which is a huge advantage of the RF algorithm.

The out-of-bag estimate $\hat{R}_{\mathrm{oob}}(f)$ is the basis for performing feature selection, finding out the effects of features on the prediction and determining a suitable value for the only adjustable parameter $N_{\mathrm{RF}}$.

$^{37}$ Out-of-bag stands for “out of bagging sample”.

$^{38}$ In $V$-fold cross-validation the size of the training set used to compute the final model and the size of the training sets for the models used to determine the estimate are different. Thus, as discussed in Subsection 2.1.2, the estimate is biased, i.e., $E_D\{\hat{R}_{\mathrm{CV}}(f(x, D), V)\} \neq R(f)$.


Due to the randomization introduced by selecting only $N_{\mathrm{RF}} < N$ features at each node for computing the best split, the RF procedure is fast to compute. The question that arises is what value should be chosen for $N_{\mathrm{RF}}$. As found out empirically by Breiman [Bre01], the RF algorithm is not sensitive to the value of $N_{\mathrm{RF}}$ over a wide range. The smaller $N_{\mathrm{RF}}$, the higher the randomization effect and the weaker the dependence of the structure of the grown trees on the target values of the learning sample. The correlation among the trees in the RF decreases as $N_{\mathrm{RF}}$ is made smaller, which according to (2.89) leads to a smaller variance of the ensemble classifier $f$. Since for classification problems the decrease in variance can theoretically lead to the smallest possible prediction error, a high randomization is desirable. But this only holds if on average over the $j_b$-space each tree in the forest decides for an input $x$ as $f_B(x)$, i.e., each tree requires some prediction accuracy, which decreases with smaller values of $N_{\mathrm{RF}}$. Therefore, $N_{\mathrm{RF}}$ can be seen as a regularization parameter: if $N_{\mathrm{RF}}$ is too small, the base learners do not decide on average over the $j_b$-space as the Bayes classifier for large regions of the input space, and despite the low variance of the RF the classification performance will be poor; if $N_{\mathrm{RF}}$ is too large, the trees in the forest are strongly correlated and the decrease of the variance due to averaging is only small. As found out empirically on a large number of data sets by Breiman, the choice

$$N_{\mathrm{RF}} = \left\lceil \sqrt{N} \right\rceil \tag{2.99}$$

is a good tradeoff between the two effects [Bre01]. In problems where the number of features $N$ is low, the randomization introduced by the choice $N_{\mathrm{RF}} = \lceil \sqrt{N} \rceil$ can be improved by additional randomization methods like linear combinations of the inputs [Bre01] or random selection of the splits [GEW06].
Another application of the oob estimation is to find those intervals in the range of an input feature that are important for the classification of a class $c_k$ and thus to gain insight into the problem [Bre02]. For each input $x_m$ from $D$ one has to determine the proportions of votes in the oob classifier for each class $c_k$. By denoting with $\mathcal{B}_{m,c_k}$ the set of indices marking those trees that have not seen the pattern $(x_m, y_m)$ during the training process and for which the decision $f_b(x_m, j_b) = c_k$ is taken, one computes the proportions as the ratios $|\mathcal{B}_{m,c_k}| / |\mathcal{B}_m|$. These proportions must be computed for all inputs $x_m$, $m = 1, \dots, M$. Then, if one wants to figure out what importance the $n$-th feature has for classifying class $c_k$, the information from the $n$-th entry in all inputs $x_m$ is removed. This can be done by randomly permuting the $n$-th entry in all oob patterns$^{39}$ for every tree $f(\cdot, j_b)$. With these perturbed inputs $x_m^n$, where $x_m^n$ denotes that the $n$-th entry in $x_m$ is random, one computes with the same RF again the proportions of votes in the out-of-bag classifier for each class, denoted by $|\mathcal{B}^n_{m,c_k}| / |\mathcal{B}_m|$. Now, if one is interested in the importance of the $n$-th feature for class $c_k$, for all values of the $n$-th feature that appear in the training set the difference between $|\mathcal{B}^n_{m,c_k}| / |\mathcal{B}_m|$ and $|\mathcal{B}_{m,c_k}| / |\mathcal{B}_m|$ is computed. If the difference is big in a certain interval of the $n$-th feature, i.e., the perturbation of the $n$-th entry deteriorates the classification of class $c_k$ significantly in this interval, then feature $n$ plays an important role in identifying class $c_k$. Thus, one can figure out in what interval of the values taken by feature $x_n$ this feature is important for class $c_k$. A similar technique using the oob classifier can be used to perform feature selection. This will be explained in detail in the next section.

$^{39}$ The oob patterns of a tree $f(\cdot, j_b)$ are those patterns in $D$ that have not been used during the training of $f(\cdot, j_b)$.
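The vote proportions with and without permutation of the $n$-th feature can be sketched as below. The representation of the trees as plain callables, the per-tree sets of oob indices, and the function name are assumptions of this illustration and not the interface of any particular RF implementation.

```python
import random


def oob_vote_fractions(trees, oob_sets, X, cls, permute_feature=None, seed=0):
    """For every pattern m, the share of its oob trees that vote for cls.

    trees[b]: callable mapping a feature vector to a class label.
    oob_sets[b]: set of indices of the patterns not used to train tree b.
    permute_feature: if given, this entry is randomly permuted among the
    oob patterns of each tree before the tree is evaluated.
    Returns None for patterns that are not out of bag for any tree.
    """
    rng = random.Random(seed)
    votes = [0] * len(X)
    voters = [0] * len(X)
    for tree, oob in zip(trees, oob_sets):
        members = sorted(oob)
        column = None
        if permute_feature is not None:
            column = [X[m][permute_feature] for m in members]
            rng.shuffle(column)
        for i, m in enumerate(members):
            x = list(X[m])
            if column is not None:
                x[permute_feature] = column[i]
            voters[m] += 1
            votes[m] += int(tree(x) == cls)
    return [v / n if n else None for v, n in zip(votes, voters)]
```

The difference between the two calls, with and without `permute_feature`, can then be inspected as a function of the original value of the $n$-th feature to locate the intervals in which the feature matters for class $c_k$.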


The Proximity Measure

Being an ensemble method using CART as base learners, RF inherit all advantages of CART except interpretability. Among these advantages are the robustness to noise and outliers, the ability to deal with categorical features and the ability to deal with missing entries in some of the inputs $x_m$. However, the main reason for using trees as base learners is the possibility to determine proximities between inputs $x_m$. The idea underlying the RF proximity measure is the following. Because the trees in the forest are unpruned, the leaf nodes will contain only a small number of inputs $x_m$ that belong to the same class. When an input $x_{m_2}$ is run down each tree in the RF and it lies in the same leaf node as an input $x_{m_1}$, the proximity of $x_{m_1}$ and $x_{m_2}$ is increased by one. This is done for all trees and at the end the proximities are divided by the number of trees. Moreover, the proximity between a pattern and itself is set to one. Thus,

$$\mathrm{prox}(x_{m_1}, x_{m_2}) = \frac{1}{B} \sum_{b=1}^{B} \delta(t_b(x_{m_1}), t_b(x_{m_2})), \tag{2.100}$$
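Given the leaf in which each tree places an input, the proximity of (2.100) is the fraction of trees in which two inputs share a leaf. A minimal sketch, assuming that the leaves are encoded by arbitrary identifiers, one per tree (the function name and the example identifiers are hypothetical):

```python
def proximity(leaves_1, leaves_2):
    """RF proximity of two inputs.

    leaves_1[b] and leaves_2[b] identify the leaf of tree b in which the
    first and the second input end up.
    """
    B = len(leaves_1)
    return sum(1 for a, b in zip(leaves_1, leaves_2) if a == b) / B


# Hypothetical leaf identifiers of two inputs in a forest with B = 4 trees.
print(proximity([3, 1, 4, 1], [3, 1, 5, 9]))
```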

where $t_b(x_m)$ denotes the leaf in the $b$-th tree to which the input $x_m$ belongs. The proximity measure is not directly related to the Euclidean distance. Instead it is a measure inherent in the data and the RF algorithm. For example, two inputs $x_{m_1}$ and $x_{m_2}$ can be far apart in the Euclidean space but might have a high proximity $\mathrm{prox}(x_{m_1}, x_{m_2})$ because they lie in the same leaves in many trees. On the other hand, two cases that are close together in Euclidean space can have a small proximity if they are close to the classification boundary. From the definition of proximities it follows that the values $d^2_{\mathrm{RF}}(x_{m_1}, x_{m_2}) = 1 - \mathrm{prox}(x_{m_1}, x_{m_2})$ are squared distances in a Euclidean space of high dimension [Bre02]. Therefore, one can use metric multidimensional scaling to project the data onto a lower dimensional space while preserving the distances between them as much as possible. A two or three dimensional space can be visualized, showing the clusters in the data. One of the main applications of the proximity measure is visualization, which is a very important tool to gain insight into the data structure. Examples for multidimensional scaling using $d^2_{\mathrm{RF}}(x_{m_1}, x_{m_2})$ are presented in the next chapter.

Another application of the proximity measure is outlier location. Outliers are defined as inputs $x_m$ having small proximities to all other inputs. Because the data in some classes is more spread out than in others, the notion of outliers is defined classwise [Bre02]. For the input $x_{m_1}$ belonging to class $y_{m_1}$ the quantity $\mathrm{out}(x_{m_1})$ is defined as

$$\mathrm{out}(x_{m_1}) = \frac{1}{\sum\limits_{\substack{m_2 = 1 \\ m_2 \neq m_1,\; y_{m_2} = y_{m_1}}}^{M} \bigl(\mathrm{prox}(x_{m_1}, x_{m_2})\bigr)^2}. \tag{2.101}$$

For all inputs $x_m$ in a class one has to compute the median of $\operatorname{out}(x_m)$ and also the mean absolute deviation from the median. Subtracting the median from each $\operatorname{out}(x_m)$ and dividing the result by this deviation yields a normalized measure of outlyingness. Values less than zero are set to zero. Values larger than 10 indicate that the considered input is an outlier. This procedure differs from the usual definition of outlyingness since the RF distance based on proximities is used instead of the Euclidean distance.
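The proximity (2.100) and the classwise outlier measure (2.101) with its normalization can be sketched in a few lines. The following is a minimal illustration, not the implementation used in the thesis: it assumes the leaf reached by every input in every tree is already available as a nested list `leaves[b][m]` (a hypothetical layout), and all function names are chosen here for illustration only.

```python
from statistics import median


def proximity_matrix(leaves):
    """Eq. (2.100). leaves[b][m] is the leaf reached by input m in tree b
    (hypothetical layout; any hashable leaf identifier works)."""
    n_trees = len(leaves)
    n_inputs = len(leaves[0])
    prox = [[0.0] * n_inputs for _ in range(n_inputs)]
    for i in range(n_inputs):
        for j in range(n_inputs):
            # Count the trees in which both inputs end up in the same leaf.
            shared = sum(1 for tree in leaves if tree[i] == tree[j])
            prox[i][j] = shared / n_trees
    return prox


def raw_outlyingness(prox, labels):
    """Eq. (2.101): inverse sum of squared proximities to same-class inputs."""
    out = []
    for i, y_i in enumerate(labels):
        total = sum(
            prox[i][j] ** 2
            for j, y_j in enumerate(labels)
            if j != i and y_j == y_i
        )
        if total == 0.0:
            # Assumption of this sketch: the degenerate case is not handled.
            raise ValueError(f"input {i} shares no leaf with its own class")
        out.append(1.0 / total)
    return out


def normalized_outlyingness(prox, labels):
    """Classwise centering by the median and scaling by the mean absolute
    deviation from the median; negative values are set to zero."""
    raw = raw_outlyingness(prox, labels)
    scores = [0.0] * len(labels)
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        values = [raw[i] for i in idx]
        med = median(values)
        dev = sum(abs(v - med) for v in values) / len(values)
        for i in idx:
            scores[i] = max((raw[i] - med) / dev, 0.0) if dev > 0 else 0.0
    return scores
```

With a trained forest, the leaf assignments would come from running every input down every tree; inputs whose normalized score exceeds 10 would then be flagged as outliers, as described above.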


The proximity measure can also be used to replace missing values in the entries of some of the inputs $x_m$. Due to restrictions during the measurement of some features or due to high costs it might be impossible to determine all $N$ features in $x_m$ for all $M$ patterns. A simple solution to this problem is to replace missing values of a given feature by the median of its non-missing values. Using RF, one can in a first step fill in the missing values with this median strategy and then compute the proximities between the inputs. In a second step, the replacement of missing values can be improved: the missing values in input $x_m$ are replaced by a weighted average of the non-missing values, with weights proportional to the proximity between input $x_m$ and the inputs with non-missing values.

Another highly interesting application of the proximity measure is the possibility to find prototypes for each class. Some homogeneous classes can be summarized by a single prototype, whereas other classes might need several prototypes to characterize them. The algorithm suggested in [BC03] to find prototypes for class $c_k$ is shown in Algorithm 2.1. Prototypes give insight into the structure of the data and increase the interpretability.

Algorithm 2.1 Prototypes for class $c_k$ using the RF algorithm
1: Think of class $c_k$
2: For each input, find its $\hat{K}$ “nearest” neighbors (using proximities)
3: Select the input that has the largest weighted number of class $c_k$ nearest neighbors
4: The median of these neighbors is the prototype, and the first and third quartiles give an idea of stability
5: Now look for another input that is not one of the previously-chosen neighbors and select the one that has the largest weighted number of class $c_k$ nearest neighbors
6: Repeat
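The steps of Algorithm 2.1 can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from [BC03]: the "weighted number" of class-$c_k$ neighbors is replaced by a plain count, an input is not counted as its own neighbor, neighbors that were already used for an earlier prototype are skipped, and the quartiles of step 4 are omitted. The function names `k_nearest` and `find_prototypes` are hypothetical.

```python
from statistics import median


def k_nearest(prox_row, i, k):
    """Indices of the k inputs with the largest proximity to input i."""
    others = [j for j in range(len(prox_row)) if j != i]
    others.sort(key=lambda j: prox_row[j], reverse=True)
    return others[:k]


def find_prototypes(X, labels, prox, target_class, k, n_prototypes):
    """Simplified sketch of Algorithm 2.1 (unweighted neighbor counts)."""
    n_inputs = len(X)
    n_features = len(X[0])
    neighbours = [k_nearest(prox[i], i, k) for i in range(n_inputs)]
    used = set()
    prototypes = []
    for _ in range(n_prototypes):
        best, best_members = None, []
        for i in range(n_inputs):
            if i in used:
                continue
            # Class-c_k neighbors of input i that are still available.
            members = [
                j for j in neighbours[i]
                if labels[j] == target_class and j not in used
            ]
            if len(members) > len(best_members):
                best, best_members = i, members
        if best is None:
            break  # no remaining input has class-c_k neighbors
        # Step 4: the feature-wise median of the neighbors is the prototype.
        prototypes.append(
            [median(X[j][n] for j in best_members) for n in range(n_features)]
        )
        used.update(best_members)
        used.add(best)
    return prototypes
```

How many prototypes a class needs is left to the caller here; a class that is well summarized by one prototype would simply be queried with `n_prototypes=1`.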

RF as an Adaptive Nearest Neighbor Algorithm

There are some very interesting connections between RF and other machine learning methods. In Chapter 3 the RF algorithm will be presented from the point of view of kernel classifiers, and in the following the RF will be described as an adaptive nearest neighbor method. For a better understanding a simplified model of RF is considered [Bre04]. In this model it is assumed that no bootstrapping is performed and the randomization is introduced only by randomly choosing the splitting values. Additionally, in this model there is only one pattern from the training set per leaf in each tree, i.e., the input space is fragmented into $M$ hyperrectangles. If a new test input $x$ has to be classified by the tree $f(\cdot, j_b)$ and $x$ lies in the same leaf as $x_m$, the class label $y_m$ is assigned to $x$ by the $b$-th tree. Taking all $B$ trees into account, the proportion of trees in which $x$ and $x_m$ are in the same leaf is the proximity $\operatorname{prox}(x, x_m)$ from (2.100). Since there is no bootstrapping and only one input per leaf it holds that

\[
\sum_{m=1}^{M} \operatorname{prox}(x, x_m) = 1. \tag{2.102}
\]


As expressed by (2.91), the RF determines the class belonging to $x$ by computing for each class $c_k$ the estimate of the posterior $p(c_k \mid x, j)$,

\[
\hat{p}(c_k \mid x, j) = \sum_{\substack{m = 1 \\ y_m = c_k}}^{M} \operatorname{prox}(x, x_m), \tag{2.103}
\]

and then choosing the class with the largest value $\hat{p}(c_k \mid x, j)$. In regression the output of the RF for the input $x$ is

\[
y = \sum_{m=1}^{M} \operatorname{prox}(x, x_m)\, y_m. \tag{2.104}
\]

Considering that for large $B$ the value $1 - \operatorname{prox}(x, x_m)$ is a squared Euclidean distance in a high-dimensional space between $x$ and $x_m$ [Bre04], it is clear that this simplified RF model is a classical nearest neighbor algorithm. In regression the output is computed as a weighted sum of the neighbor outputs, with the weights determined by the RF measure of closeness. Similarly, in classification the RF assigns much importance to patterns for which the input is close to $x$. An explanation for the good generalization properties of RF can be given by looking more closely at the squared distance measure $1 - \operatorname{prox}(x, x_m)$. Unlike the $\hat{K}$-nearest neighbor method, where a fixed number $\hat{K}$ of neighbors is consulted before taking a decision, RF determines the number of neighbors adaptively, depending on the location of $x$ in the input space. This is due to the adaptive distance measure defined by $\operatorname{prox}(x, x_m)$. In cases where $x$ lies in a region of the input space that clearly belongs to class $c_k$, only a few neighbors will be chosen to classify $x$. On the other hand, if $x$ is close to the boundary between classes, the RF uses a large number of neighbors to decide the class of $x$. Although a simplified model of RF has been used above, the interpretation of RF as an adaptive nearest neighbor algorithm captures much of its behavior.
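The simplified model can be made concrete with a small sketch of (2.102) to (2.104). It assumes, as in the simplified model above, that every leaf holds exactly one training pattern; the leaf assignments are passed in as plain lists, and all names are hypothetical.

```python
def proximities_to_training(test_leaves, train_leaves):
    """prox(x, x_m) for a test input x.

    test_leaves[b]     : leaf reached by x in tree b
    train_leaves[b][m] : leaf reached by training input x_m in tree b
    """
    n_trees = len(test_leaves)
    n_train = len(train_leaves[0])
    return [
        sum(1 for b in range(n_trees) if train_leaves[b][m] == test_leaves[b])
        / n_trees
        for m in range(n_train)
    ]


def posterior_estimates(prox, labels):
    """Eq. (2.103): sum the proximities of the training inputs per class."""
    posterior = {}
    for p, y in zip(prox, labels):
        posterior[y] = posterior.get(y, 0.0) + p
    return posterior


def regression_output(prox, targets):
    """Eq. (2.104): proximity-weighted sum of the training targets."""
    return sum(p * y for p, y in zip(prox, targets))


# Three training patterns, four trees, one pattern per leaf: the leaf of
# training pattern m is simply labelled m in every tree.
train_leaves = [[0, 1, 2]] * 4
test_leaves = [0, 0, 1, 0]   # x shares a leaf with x_0 three times, with x_1 once
prox = proximities_to_training(test_leaves, train_leaves)
posterior = posterior_estimates(prox, ["a", "b", "b"])
predicted_class = max(posterior, key=posterior.get)
```

In this toy case the proximities sum to one, as stated in (2.102), and only the training patterns that actually share a leaf with $x$ in some tree act as neighbors, which is the adaptive behavior described above.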

2.3 Feature Extraction

In the previous section it has been assumed that the input space $\mathbb{R}^N$ is fixed, and the optimization problems have focused on finding a suitable mapping $f$ by minimizing estimates of the prediction risk. As shown in Fig. 2.1, the computed output $\hat{y}$ can only be a good estimate of the target $y$ corresponding to the measured observation $v$ if both the feature extraction and the mapping $f$ are chosen properly. The design of a machine learning system involves the optimization of both blocks, the feature extraction and the mapping $f$. Most of the literature only focuses on the mapping $f$ since the choice of a suitable input space strongly depends on the application. It is often assumed that the available domain knowledge has been integrated in the generation of the features and the input space $\mathbb{R}^N$ so that only $f$ must be determined. Nevertheless, in cases where such domain knowledge is not available or when it is not clear what features to choose, the methods presented in Subsection 2.1.2 for the evaluation of machine learning algorithms in practice can be used. Here, one chooses a fixed machine learning algorithm and determines by honest estimates of the prediction risk which input space is best suited for the given application in combination with the chosen learning algorithm. Thus, the task of feature extraction is strongly related to the model assessment problem.

Fig. 2.7 shows the feature extraction block from Fig. 2.1 during the design phase of the learning system. Feature extraction can be divided into the two subtasks feature generation and feature selection. Feature generation denotes the preprocessing steps that are performed to transform the measurements $v \in \mathbb{R}^{N'}$ into the vector $\tilde{x} \in \mathbb{R}^{\tilde{N}}$ which contains a number $\tilde{N}$ of features, out of which $N$ are selected in the feature selection step for the final input $x \in \mathbb{R}^N$ which is mapped by $f$ to the output $\hat{y}$. After suitable features for the considered application have been identified, the generation block only computes these features and the selection block is removed.

[Figure 2.7: Feature extraction during the design phase. Block diagram: $v$ → Feature Generation → $\tilde{x}$ → Feature Selection → $x$; both blocks together form the Feature Extraction.]

2.3.1 Feature Generation

As mentioned above, finding a good representation of the data highly depends on the available measurements and the domain knowledge about the considered application. Since the success of all subsequent steps in the machine learning system is determined by the features, it is important to use all available knowledge about the data to compute expressive features and to avoid losing relevant information. More than any other part of the learning system, the feature generation step requires human expertise. Nevertheless, there are some preprocessing steps which are frequently used in the feature generation step and which will be presented briefly in the following.

The observation vector $v$ may contain entries that have different scales. In this case a standardization of the data can be performed. Here, one has to subtract the mean and to divide by the standard deviation for each feature, i.e., the $n$-th entry in $v$ is transformed into

\[
\tilde{x}_n = \frac{v_n - \hat{\mu}_n}{\hat{\sigma}_n}, \tag{2.105}
\]

where $\hat{\mu}_n$ and $\hat{\sigma}_n$ are the estimated mean and standard deviation of the $n$-th entry in $v$ based on the available measurements. This standardization prevents certain features from dominating distance calculations merely because they have large numerical values, which could happen for most of the frequently used distance measures. It must be stressed that the standardization is an appropriate technique if the spread of values is due to random variation, but it is inappropriate if the spread is caused by the existence of subclasses.

If the observation vector $v$ allows it, the signal-to-noise ratio should be improved. This can be done, for example, if $v$ represents an image or a time series. Popular methods for de-noising include low-pass filtering and the wavelet transform.

Although the reduction of dimensionality is an important issue in machine learning, in some cases the expansion of dimensionality may improve the performance. This occurs when one uses a function $f$ from a model space that is too small for the given data. Then, the expansion of dimensionality leads to more degrees of freedom, with the result that relatively simple models in the expanded space, e.g., linear mappings, can realize complex hypersurfaces in the original space. This expansion of dimensionality can be implemented by assigning products of the features in $v$ to features in $\tilde{x}$, e.g., $\tilde{x}_n = v_{n_1}^2 v_{n_2} v_{n_3}$, or by the introduction of kernels. The former method is used in polynomial classifiers [Sch96] and the latter in SVMs.

Due to the curse of dimensionality and the fact that many data sources correspond to low-dimensional manifolds that are embedded in the higher-dimensional observation space $\mathbb{R}^{N'}$, a popular preprocessing step is Principal Component Analysis (PCA). Principal component analysis can be understood as a projection technique. It can be introduced in two different ways. The first defines the average projection cost as the mean squared distance between the observation vectors $v$ and their projections $\tilde{x}$ and looks for the linear projection that minimizes this cost [Pea01]. The second defines the PCA as the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximized [Hot33]. Both approaches lead to the same algorithm and require in a first step the computation of the sample covariance matrix [Bis06]

\[
\hat{C}_v = \frac{1}{M-1} \sum_{m=1}^{M} (v_m - \hat{\mu}_v)(v_m - \hat{\mu}_v)^{T}, \tag{2.106}
\]

where $\hat{\mu}_v$ denotes the sample mean of the random variable $v$. The eigenvectors corresponding to the $\tilde{N}$ largest eigenvalues of the matrix $\hat{C}_v$ build the projection matrix that transforms a new vector $v$ into $\tilde{x}$ after subtracting $\hat{\mu}_v$ from it, i.e., $(v - \hat{\mu}_v)$. Geometrically, if one thinks of the $M$ observation vectors $v_m$ that are used in the training set as a hyperellipsoidally shaped cloud of $M$ points in the $N'$-dimensional space, then the eigenvectors of $\hat{C}_v$ are the principal axes of the hyperellipsoid. Principal component analysis reduces the dimensionality by focusing only on the main principal axes, i.e., the axes with the largest spread in the data.

Dimensionality reduction using principal component analysis can be used as an alternative to feature selection so that the selection block from Fig. 2.7 can be removed. A suitable value for $N$ is then obtained by removing the small eigenvalues. Another possibility is the usage of the low-dimensional representation in combination with other features that are computed from $v$, e.g., by nonlinear combinations of features, for the construction of the feature vector $\tilde{x}$. It should be noted that principal component analysis linearly combines observation features in order to determine new features, so that the interpretability of the resulting input space suffers from this transformation. Thus, principal component analysis is avoided in some applications.

Visualization

Visualization is a powerful tool to better understand the data, which is necessary to generate suitable features. The main idea is to transform the observation vectors $v$ into a two- or three-dimensional space that can be plotted. A representative of such a technique is the principal component analysis described above. Recently other methods, so-called spectral dimensionality reduction techniques, have been developed which compute, based on the principal eigenvectors of a symmetric matrix (see footnote 40), an embedding of the observation vectors that are used in the training set. Each observation vector $v_m$ is associated with a two- or three-dimensional representation $\tilde{x}_m$ which corresponds to its estimated coordinates on the manifold underlying the data source. In the following two nonlinear dimensionality reduction techniques will be presented briefly: multidimensional scaling and isomaps.

Multidimensional scaling is a way to represent the observation vectors $v$ in a lower-dimensional space in such a way that the distances in this lower-dimensional space correspond to the dissimilarities between points in the original space. Here, metric multidimensional scaling [CC94] is considered, where the starting point is the distance in the original space $\mathbb{R}^{N'}$ between two points, $d(v_i, v_j)$. Considering the $M$ observation vectors that are used for the training set, one can determine the matrix $\hat{\kappa} \in \mathbb{R}^{M \times M}$, with the $(m_1, m_2)$-th entry being $\hat{\kappa}(m_1, m_2) = d^2(v_{m_1}, v_{m_2})$. According to Torgerson's double-centering formula [Tor52], $\hat{\kappa}(m_1, m_2)$ is converted into

\[
\kappa(m_1, m_2) = -\frac{1}{2} \Biggl( \hat{\kappa}(m_1, m_2) - \frac{1}{M} \sum_{m'=1}^{M} \hat{\kappa}(m_1, m') - \frac{1}{M} \sum_{m'=1}^{M} \hat{\kappa}(m', m_2) + \frac{1}{M^2} \sum_{m'=1}^{M} \sum_{m''=1}^{M} \hat{\kappa}(m', m'') \Biggr), \tag{2.107}
\]

so that $\kappa(m_1, m_2)$ represents an equivalent centered dot product between $v_{m_1}$ and $v_{m_2}$ given by the distance $d^2(v_{m_1}, v_{m_2})$ [GGNZ06]. Similarly to principal component analysis, an eigenvalue decomposition of the matrix $\kappa \in \mathbb{R}^{M \times M}$, whose $(m_1, m_2)$-th entry is $\kappa(m_1, m_2)$, is performed and only the $\tilde{N} \leq 3$ largest eigenvalues $\lambda_\ell$ and their corresponding eigenvectors $b_\ell \in \mathbb{R}^M$, $\ell = 1, \ldots, \tilde{N}$, are considered further. The low-dimensional representation $\tilde{x}_m$ of a point $v_m$ is obtained by

\[
\tilde{x}_m = \Bigl[ \sqrt{\lambda_1}\, b_{m,1}, \ldots, \sqrt{\lambda_{\tilde{N}}}\, b_{m,\tilde{N}} \Bigr]^{T} \in \mathbb{R}^{\tilde{N}}, \tag{2.108}
\]

where $b_{m,\ell}$ denotes the $m$-th entry in $b_\ell$. Multidimensional scaling is a technique that can be used in combination with the RF-distance $d_{RF}(x_{m_1}, x_{m_2}) = \sqrt{1 - \operatorname{prox}(x_{m_1}, x_{m_2})}$, leading to a visualization of the RF-neighborhood relations in the original space.

Whereas principal component analysis finds a low-dimensional representation that best preserves the variance, and multidimensional scaling a representation that best maintains the distance relations of the observation points in the original space, isomaps [TdSL00] aim to preserve the intrinsic geometry of the data. This is done by defining a geodesic distance $d_{gd}(\cdot, \cdot)$ with respect to the observation vectors $v_m$, $m = 1, \ldots, M$, a distance $d(\cdot, \cdot)$, and a neighborhood $k$ as

\[
d_{gd}(v_{m_1}, v_{m_2}) = \min_{\Upsilon} \sum_{\ell=1}^{|\Upsilon| - 1} d(\nu_\ell, \nu_{\ell+1}), \tag{2.109}
\]

where $\Upsilon$ is a sequence of observation vectors of length $|\Upsilon| \geq 2$, $\nu_\ell$ are elements from the set $\{v_1, \ldots, v_M\}$, $\nu_1 = v_{m_1}$, $\nu_{|\Upsilon|} = v_{m_2}$, and $\{\nu_\ell, \nu_{\ell+1}\}$ are $k$-nearest neighbors with respect to $d(\cdot, \cdot)$. The length of the sequence $|\Upsilon|$ is part of the optimization. As for multidimensional scaling, the symmetric matrix $\hat{\kappa}$ with the entries $d_{gd}^2(v_{m_1}, v_{m_2})$ is computed and transformed with the double-centering formula to $\kappa$ according to (2.107). As in multidimensional scaling, after computing the eigen-decomposition of $\kappa$, the low-dimensional representation of the observation vectors $v_m$ that can be visualized for $\tilde{N} \leq 3$ is given by (2.108).

Footnote 40: Due to the eigen-decomposition the methods are called “spectral”.

2.3.2 Feature Selection

The second block in Fig. 2.7 represents the feature selection step. A reduced number of features not only combats the curse of dimensionality but also leads to a reduced computational complexity. Feature selection techniques can be divided into filter, wrapper, and embedded methods. Filter methods select features without using a learning machine, in general by ranking the features according to correlation coefficients that measure the dependence of the output $y$ on the individual feature $\tilde{x}_n$ [GGNZ06]. Wrapper and embedded methods involve a learning machine in the selection step, the importance of features being estimated based on the performance of the learning machine. An advantage of wrapper and embedded methods is that they also take the interactions among features into account when evaluating their importance for the classification task. Whereas wrapper methods use the learning machine as a black box, i.e., the performance of the machine and the features is measured based on a score which can be computed by every learning machine, e.g., the estimated risk, embedded methods use a measure which is specific to a given learning machine.

Feature selection can also be divided into forward and backward feature selection. In forward feature selection one starts with an empty set and iteratively adds to this set those features which lead to the maximal reduction of a performance score. In backward feature selection one starts with a set containing all features and iteratively removes the least relevant features from the set until the performance score reaches a critical value.
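As a small illustration of the filter approach mentioned above, the following sketch ranks features by the absolute value of their Pearson correlation with the output. The choice of the Pearson coefficient and the function names are assumptions made here for illustration; the text only states that filter methods rank features by correlation coefficients.

```python
from math import sqrt


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_a * var_b)


def rank_features(X, y):
    """Return feature indices sorted by decreasing |correlation| with y,
    together with the scores. X is a list of feature vectors (rows)."""
    n_features = len(X[0])
    scores = [
        abs(pearson([row[n] for row in X], y)) for n in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda n: scores[n], reverse=True)
    return order, scores
```

Because each feature is scored on its own, such a ranking ignores interactions among features, which is exactly the limitation that wrapper and embedded methods address.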
In the following, two feature selection methods will be presented since they are the starting point for developing the SBRF-feature selection that will be introduced in Chapter 6 for time series classification. The first method is based on the CART algorithm and the second on the RF algorithm. Both methods belong to the group of embedded techniques. The former performs forward feature selection whereas the latter implements a forward-backward feature selection approach.

2.3.2.1 Feature Selection with CART

CART is a suitable algorithm for feature selection since it allows the introduction of a very large number of candidate features and then, in a forward selection process, chooses the best among them to define the splits. Since the tree structure allows almost unlimited introduction of features, Breiman [BFOS84] encourages one to construct new features by combining features or guided by intuition and the preliminary exploration of the data, e.g., by visualization techniques. The feature selection process is performed by keeping only those features that still appear in the pruned tree $f$. These features are used to build the input vector $x$ of reduced dimensionality. Thereby, one can estimate the feature importance $\Delta^{(n)}$ of a variable $\tilde{x}_n$ by

\[
\Delta^{(n)} = \sum_{t} \Delta\hat{r}\bigl(\hat{s}_{\tilde{x}_n}(t) \mid t\bigr), \quad \text{where} \quad
\Delta\hat{r}\bigl(\hat{s}_{\tilde{x}_n}(t) \mid t\bigr) =
\begin{cases}
\Delta\hat{r}\bigl(\hat{s}_{opt}(t) \mid t\bigr) & \hat{s}_{opt}(t) = \hat{s}_{\tilde{x}_n}(t) \\
0 & \hat{s}_{opt}(t) \neq \hat{s}_{\tilde{x}_n}(t),
\end{cases} \tag{2.110}
\]


with the split $\hat{s}_{\tilde{x}_n}(t)$ being the estimated optimal split on the $n$-th feature at node $t$ in the pruned tree $f$. At each node only the importance of the variable performing the split is increased.

The above method may not always be satisfactory. This is due to the fact that a feature $\tilde{x}_{n_1}$ may never occur in the splits of the pruned tree since it is masked by a different variable $\tilde{x}_{n_2}$. But if one removes $\tilde{x}_{n_2}$ and grows a new tree, the feature $\tilde{x}_{n_1}$ may occur in many splits and the tree may also be almost as accurate as the initial tree where $\tilde{x}_{n_1}$ was not used. In order to obtain a ranking of the variables that takes the masking effect into account, a method using so-called surrogate splits (see footnote 41) is introduced in [BFOS84]. At a node $t$, the best split $\hat{s}_{opt}(t)$ computed based on the available data is defined by (2.84). The split $\hat{s}_{opt}(t)$ divides the node $t$ into the two nodes $t_L$ and $t_R$. Looking at the feature $\tilde{x}_n$, there is a finite number of possible splits $s_n(t)$ on $\tilde{x}_n$. A split $s_n(t)$ divides the node $t$ into the two nodes $t'_L$ and $t'_R$. The estimated probability that an input $\tilde{x}$ lying in $t$ belongs to $t_L \cap t'_L$ is

\[
\hat{p}(\tilde{x} \in t_L \cap t'_L \mid \tilde{x} \in t) = \frac{M(t_L \cap t'_L)}{M(t)}, \tag{2.111}
\]

where $M(t_L \cap t'_L)$ is the number of patterns in the training set that lie in the node $t$ and are sent into the left node by both $\hat{s}_{opt}(t)$ and $s_n(t)$. The estimated probability $\hat{p}(\tilde{x} \in t_R \cap t'_R \mid \tilde{x} \in t)$ is defined analogously. Then, the probability that the split $s_n(t)$ predicts the split $\hat{s}_{opt}(t)$ correctly is estimated by

\[
\hat{p}\bigl(s_n(t), \hat{s}_{opt}(t)\bigr) = \hat{p}(\tilde{x} \in t_L \cap t'_L \mid \tilde{x} \in t) + \hat{p}(\tilde{x} \in t_R \cap t'_R \mid \tilde{x} \in t). \tag{2.112}
\]
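The agreement estimate in (2.111) and (2.112) amounts to counting how often two splits send the same pattern to the same side. A minimal sketch, assuming each split is represented by a list of booleans with one entry per training pattern in node $t$ (`True` meaning "sent to the left child"); the representation and the function names are hypothetical.

```python
def split_agreement(goes_left_opt, goes_left_candidate):
    """Eq. (2.112): fraction of patterns in node t that the candidate split
    sends to the same side as the estimated optimal split."""
    n_patterns = len(goes_left_opt)
    pairs = list(zip(goes_left_opt, goes_left_candidate))
    both_left = sum(1 for a, b in pairs if a and b)            # M(t_L and t'_L)
    both_right = sum(1 for a, b in pairs if not a and not b)   # M(t_R and t'_R)
    return (both_left + both_right) / n_patterns


def threshold_split(values, threshold):
    """Boolean left/right assignment of a split 'feature <= threshold'."""
    return [v <= threshold for v in values]


# Five patterns in node t; the optimal split uses one feature,
# the candidate split a different one.
opt = threshold_split([0.1, 0.4, 0.9, 0.8, 0.2], 0.5)
cand = threshold_split([1.0, 3.5, 4.0, 5.0, 2.0], 3.0)
agreement = split_agreement(opt, cand)
```

Evaluating this quantity for all candidate splits on the features not used by the optimal split and keeping the best one corresponds to the search for the surrogate split.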

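In node $t$ the surrogate split $\hat{s}_s(t)$ of $\hat{s}_{opt}(t)$ is defined as the split on a different feature than the one chosen for $\hat{s}_{opt}(t)$ which most accurately predicts the action of $\hat{s}_{opt}(t)$, i.e.,

\[
\hat{s}_s(t) = \underset{s(t)}{\operatorname{argmax}} \bigl\{ \hat{p}\bigl(s(t), \hat{s}_{opt}(t)\bigr) \bigr\}. \tag{2.113}
\]

The feature importance $\Delta_s^{(n)}$ of a variable $\tilde{x}_n$, based on surrogate splits, is defined as

\[
\Delta_s^{(n)} = \sum_{t} \Delta\hat{r}\bigl(\hat{s}_{\tilde{x}_n}(t) \mid t\bigr), \quad \text{where} \quad
\Delta\hat{r}\bigl(\hat{s}_{\tilde{x}_n}(t) \mid t\bigr) =
\begin{cases}
\Delta\hat{r}\bigl(\hat{s}_{opt}(t) \mid t\bigr) & \hat{s}_{opt}(t) = \hat{s}_{\tilde{x}_n}(t) \\
\Delta\hat{r}\bigl(\hat{s}_{s}(t) \mid t\bigr) & \hat{s}_{s}(t) = \hat{s}_{\tilde{x}_n}(t) \\
0 & \text{else},
\end{cases} \tag{2.114}
\]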

with the split $\hat{s}_{\tilde{x}_n}(t)$ being the estimated optimal split on the $n$-th feature at node $t$ in the pruned tree $f$. The idea underlying the feature importance measure based on surrogate splits is to assign a high importance also to those features that predict the estimated optimal splits well, but are masked by the features which are used for the estimated optimal splits.

Another possible measure for the importance of a feature $\tilde{x}_n$ that can be computed with CART is the sum over all nodes of the relative risk reduction produced by the best split on $\tilde{x}_n$ at each node. This measure is inferior to $\Delta_s^{(n)}$ because the structure of the tree $f$ is defined by the estimated optimal splits $\hat{s}_{opt}$. To clarify this, assume that at a node $t$ the estimated optimal split is made on the variable $\tilde{x}_{n_1}$ and that there is a different variable $\tilde{x}_{n_2}$ whose best split has little association with $\hat{s}_{opt}(t)$ but also leads to a high reduction of the relative risk. Since the split in the node $t$ of the pruned tree is made on $\tilde{x}_{n_1}$, it may happen that splits in the child nodes $t_L$ and $t_R$ on $\tilde{x}_{n_2}$ are very close to the best split on $\tilde{x}_{n_2}$ in $t$ and still lead to a high relative risk reduction. Thus, the same split can contribute to the variable importance not only in $t$ but also in $t_L$ and $t_R$, leading to an estimated variable importance of $\tilde{x}_{n_2}$ that is too high.

The measure $\Delta_s^{(n)}$ from (2.114) does not have this drawback and is therefore the measure of choice if one wants to determine the feature importance by using CART. Since only the relative magnitudes of $\Delta^{(n)}$ or $\Delta_s^{(n)}$ are of interest, one can normalize these quantities so that the most important feature has the importance 100 by replacing

\[
\Delta^{(n)} \;\text{with}\; \frac{100\, \Delta^{(n)}}{\max\limits_{n'} \{\Delta^{(n')}\}}
\quad \text{and} \quad
\Delta_s^{(n)} \;\text{with}\; \frac{100\, \Delta_s^{(n)}}{\max\limits_{n'} \{\Delta_s^{(n')}\}}. \tag{2.115}
\]

Footnote 41: The concept of surrogate splits is useful not only for estimating the variable importance but also for detecting masking effects and handling missing values. The ability to handle missing values is important when there are some input vectors $\tilde{x}$ for which some entries are unknown, e.g., due to measurement difficulties. Details on these topics can be found in [BFOS84].

After the features with low importance values have been identified, they can be removed, thereby reducing the dimensionality of the input space. In order to determine the number of features that can be removed without deteriorating the performance, one should compute fair estimates of the prediction risk in the reduced input space with the methods presented in Subsection 2.1.2. As long as the estimated prediction risk is acceptably small, one can continue removing features with low importance.

2.3.2.2 Feature Selection with RF

Using the RF algorithm for feature selection has three main benefits. Firstly, being an ensemble technique, the RF generalizes well even in high-dimensional input spaces. Secondly, due to the oob-estimate it is possible to obtain a fair estimate of the prediction risk while using all available data for training. Thirdly, the RF is built as an ensemble of CART so that a forward feature selection also occurs during the construction of each tree.

There are several possibilities to perform feature selection using the RF algorithm. The most intuitive is to compute the average over the importance measures of the trees in the ensemble. Denoting by $\Delta_b^{(n)}$ the feature importance of $\tilde{x}_n$ as computed by the $b$-th tree according to (2.110), the average importance value among the $B$ trees in the RF is

\[
\Delta^{(n)} = \frac{1}{B} \sum_{b=1}^{B} \Delta_b^{(n)}. \tag{2.116}
\]
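The averaging in (2.116) and the normalization from (2.115) are simple to state in code. The sketch below assumes the per-tree importances are already available as a nested list `per_tree[b][n]`; this layout and the function names are hypothetical.

```python
def average_importance(per_tree):
    """Eq. (2.116): mean importance of each feature over the B trees.
    per_tree[b][n] is the importance of feature n in tree b."""
    n_trees = len(per_tree)
    n_features = len(per_tree[0])
    return [
        sum(tree[n] for tree in per_tree) / n_trees for n in range(n_features)
    ]


def normalize_to_100(importance):
    """Assignment (2.115): scale so the most important feature gets 100."""
    top = max(importance)
    if top <= 0.0:
        raise ValueError("at least one importance value must be positive")
    return [100.0 * value / top for value in importance]


# Two trees, three features.
per_tree = [[1.0, 0.0, 2.0], [3.0, 1.0, 2.0]]
relative_importance = normalize_to_100(average_importance(per_tree))
```

Only the relative magnitudes survive the normalization, so features can be compared across forests of different size.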

A consequence of the random choice of the $N_{RF}$ features that are examined for the best split in a tree node is the fact that masking effects do not appear as in the CART algorithm. Therefore, the computation of the feature importance based on surrogate splits is not necessary.

The SBRF-feature selection for time series from Chapter 6 builds on a different approach to compute the feature importance using the RF algorithm. This approach exploits the possibility to determine good estimates of the prediction error using the oob-estimates. Here, in a first step the oob-estimate $\hat{R}_{oob}(f)$ of the risk $R(f)$ is computed. Assuming that the importance of the $n$-th variable is analyzed, for each of the $B$ trees in the RF the information of the $n$-th variable is removed by randomly permuting the $n$-th entry of the input vectors $\tilde{x}$ which have not been used in the training phase by the tree. The risk with permuted $n$-th entry is estimated again using the oob-method, leading to $\hat{R}_{oob}^{n,perm}(f)$. The difference

\[
\Delta_{OOB}^{(n)} = \hat{R}_{oob}^{n,perm}(f) - \hat{R}_{oob}(f) \tag{2.117}
\]

represents an importance measure for the $n$-th variable. The larger $\Delta_{OOB}^{(n)}$, the more important the $n$-th variable is. Since only the relative magnitudes of the estimated feature importance values are of relevance, a normalization can be performed similarly to the Assignment (2.115).

Another method to compute the feature importance using the RF, which exploits the structure of the algorithm even more, makes use of the so-called margin of a pattern. Denoting by $\mathcal{B}_m$ the set of indices of the trees whose training sets do not contain the pattern $(\tilde{x}_m, y_m)$, the margin is defined as [Bre01]

\[
\operatorname{mg}(\tilde{x}_m, y_m) = \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} \delta\bigl(f(\tilde{x}_m, j_b), y_m\bigr) - \max_{c_k \neq y_m} \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} \delta\bigl(f(\tilde{x}_m, j_b), c_k\bigr), \tag{2.118}
\]

where $c_k$ denotes the $k$-th class. A positive margin means that the example has been classified correctly, whereas a negative margin occurs for misclassifications. The margin offers a measure of confidence in the classification result. The margin can be used to determine the importance of the $n$-th variable. Thereby, the decrease of the average margin over all data examples when removing the information from the $n$-th entry in the inputs $\tilde{x}_m$ is taken as a measure of importance of the $n$-th feature. Similarly to the computation of $\hat{R}_{oob}^{n,perm}(f)$, the margin of pattern $(\tilde{x}_m, y_m)$ can be determined after removing the information from the $n$-th feature. This value is denoted $\operatorname{mg}^{(n)}(\tilde{x}_m, y_m)$ in the following. The importance of the $n$-th feature based on the confidence information from the RF is calculated as

\[
\Delta_{mg}^{(n)} = \frac{1}{M} \sum_{m=1}^{M} \Bigl( \operatorname{mg}(\tilde{x}_m, y_m) - \operatorname{mg}^{(n)}(\tilde{x}_m, y_m) \Bigr). \tag{2.119}
\]

The values $\Delta_{mg}^{(n)}$ can be normalized according to the Assignment (2.115) to obtain the relative importance of features. Features with low values $\Delta_{mg}^{(n)}$ represent candidates for removal.

Another alternative suggested by Breiman in [Bre02] to determine a measure for the importance of the $n$-th feature is the computation of

\[
\Delta_{RF}^{(n)} = \frac{1}{B} \sum_{b=1}^{B} \Delta_{RF,b}^{(n)}, \quad \text{with} \quad \Delta_{RF,b}^{(n)} = \hat{R}_{oob}^{n,perm}\bigl(f(\cdot, j_b)\bigr) - \hat{R}_{oob}\bigl(f(\cdot, j_b)\bigr), \tag{2.120}
\]

where $\hat{R}_{oob}(f(\cdot, j_b))$ is the estimated risk for the $b$-th tree computed only with those patterns that were not used during its training and $\hat{R}_{oob}^{n,perm}(f(\cdot, j_b))$ is the estimated risk for the $b$-th tree after randomly permuting the $n$-th entry in the inputs $x_m$ that have not been used by this tree in the training phase.

The approach that is adopted with all feature importance measures from above, i.e., $\Delta^{(n)}$, $\Delta_{OOB}^{(n)}$, $\Delta_{mg}^{(n)}$, and $\Delta_{RF}^{(n)}$, in order to find a suitable input space for the classification task is based on an iterative removal of features. Here, one eliminates features with low importance as long as the estimated risk is still acceptable, thereby making a hypothesis about the features that should be used. After the removal of the weakest features a new RF is created and the performance in the reduced space is estimated using the oob-classifier. Then, the feature importance is computed in this space and the weakest features are removed. This procedure continues until the oob-estimate of the prediction risk is unacceptable. At this point one has to choose the input space from the previous iteration.
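The iterative removal described above can be sketched as a short loop. The sketch deliberately leaves the forest itself abstract: `importance_fn` and `risk_fn` are hypothetical callables that stand for "train an RF on these features and return the importances" and "return the oob-estimate of the risk in this input space", and `max_risk` is an assumed acceptance threshold.

```python
def backward_elimination(features, importance_fn, risk_fn, max_risk, n_remove=1):
    """Iteratively drop the weakest features while the estimated risk
    stays acceptable.

    importance_fn(features) -> dict mapping feature name to importance
    risk_fn(features)       -> estimated prediction risk (e.g. oob-estimate)
    """
    current = list(features)
    while len(current) > n_remove:
        importance = importance_fn(current)
        weakest = sorted(current, key=lambda f: importance[f])[:n_remove]
        candidate = [f for f in current if f not in weakest]
        if risk_fn(candidate) > max_risk:
            break  # unacceptable: keep the input space of the previous iteration
        current = candidate
    return current


# Toy stand-ins for the two callables (purely illustrative numbers).
relevance = {"a": 3.0, "b": 2.0, "c": 0.1, "d": 0.0}


def toy_importance(features):
    return {f: relevance[f] for f in features}


def toy_risk(features):
    # The risk stays low only while both informative features are present.
    return 0.1 if {"a", "b"} <= set(features) else 0.5


selected = backward_elimination(
    ["a", "b", "c", "d"], toy_importance, toy_risk, max_risk=0.2
)
```

Because the importances are recomputed after every removal, a feature that was masked in an earlier round can gain importance once its competitor has been dropped.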

Conclusion

In the first part of this chapter the basic concepts of machine learning have been presented. The bias-variance framework and possibilities to estimate the performance of machine learning procedures in practice have been discussed in this context. In the second part, state-of-the-art algorithms have been described, including ensemble learning techniques. An explanation for the good performance of ensemble learning algorithms in classification tasks has been provided using the bias-variance framework. The third part of the chapter has treated the feature extraction task, covering aspects of both the feature generation and the feature selection steps.


3. Interpretable Generalized Radial Basis Function Classifiers Based on the Random Forest Kernel

This chapter introduces an algorithm to compute Generalized Radial Basis Function (GRBF) classifiers based on the Random Forest (RF) kernel. The algorithm emerged from the wish to use a classifier with good generalization abilities while allowing interpretability [BN08]. In many fields, e.g., medicine or safety-critical applications, not only good generalization properties of classifiers but also their interpretability is required. Due to the limitations of humans in understanding arbitrary segmentations of high-dimensional spaces, there is always a tradeoff between low generalization error rates and interpretability. Whereas classifiers like Neural Networks (NN), Support Vector Machines (SVM) or RF generalize well but cannot be interpreted, Classification And Regression Trees (CART) can be read as a set of “IF-THEN-ELSE” rules, but in general they have larger misclassification rates due to the high variance in the bias-variance framework. Interpretability limits the model space that can be used to implement the classifier and, depending on the problem at hand, this may deteriorate the classifier's performance.

The approach introduced in this thesis uses RF as an ensemble algorithm that generalizes well in order to identify neighborhood relationships in the input space that lead to a low misclassification rate. Generalized radial basis functions are shaped based on the RF kernel in order to maintain the good classification performance. The generalized radial basis functions are used as kernels for the design of GRBF classifiers that allow fuzzy-like interpretability. In this way it is possible to obtain both a low prediction risk and interpretability, making the GRBF classifier an attractive tool for many applications. GRBF classifiers belong to the group of linear basis expansion models that were described in Subsection 2.2.1, representing an extension of the RBF classifiers from Subsection 2.2.3.
Similar to the RBF classifier, in a first step the vector $\mathbf{u} \in \mathbb{R}^K$ is computed as a linear combination of the similarities between $\mathbf{x}$ and some predefined points $\mathbf{x}_{K,h}$ in the input space, called centers, templates or prototypes. The vector $\mathbf{u}$ can then be transformed into $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$, the vector containing estimates of the posteriors $p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$. The similarity $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ between $\mathbf{x}$ and a prototype $\mathbf{x}_{K,h}$ is realized by the kernel $\kappa_{\alpha_h}(\mathbf{x}, \mathbf{x}_{K,h})$, which depends only on the distance between $\mathbf{x}$ and the prototype, i.e., $\kappa_{\alpha_h}(\mathbf{x}, \mathbf{x}_{K,h}) = \kappa_{\alpha_h}(\|\mathbf{x} - \mathbf{x}_{K,h}\|)$. In contrast to RBF, for the GRBF classifiers the norm $\|\mathbf{x} - \mathbf{x}_{K,h}\|$ does not denote the Euclidean distance. Fig. 3.1 shows the architecture of a GRBF network for classification.

Figure 3.1: GRBF for classification (block diagram: the input $\mathbf{x}$ feeds the similarities $\psi_1(\mathbf{x}, \mathbf{x}_{K,1}), \dots, \psi_H(\mathbf{x}, \mathbf{x}_{K,H})$, which are combined with the weights $w_{1,1}, \dots, w_{K,H}$ into $u_1, \dots, u_K$; the confidence mapping $\mathrm{cnf}_k(u_k)$ and a normalization yield $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$, and $\max(\cdot)$ gives $\hat{y}$)

The main difficulties in the GRBF classifier construction are setting the value for the number $H$ of centers, determining the corresponding locations $\mathbf{x}_{K,h}$ of the centers in the input space, and defining suitable similarity measures $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ to compute the distances to the centers. These are the factors that determine the GRBF mapping $f$ and thus the generalization performance. As mentioned in Subsection 2.2.3, the cost function that must be optimized in order to determine the parameters $\alpha_h$, $\mathbf{x}_{K,h}$ and $w_{k,h}$, $h = 1, \dots, H$, $k = 1, \dots, K$, is given by the empirical risk from (2.65). As stated there, the minimization criterion is nonconvex with multiple local minima, so that in general the optimization problem is divided into two subtasks. In the first subtask one has to determine the number of prototypes, their locations and the parameters $\alpha_h$ that are required to compute the similarities $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$. In the second subtask one has to determine the weights $w_{k,h}$.

Since the first subtask is mainly solved by using the RF kernel, this kernel is explored in more depth in the next section. In Section 3.2 the design of GRBF classifiers using the RF kernel is discussed in detail. Since the interpretability can be improved by reducing the number of prototypes, Section 3.3 introduces a method to reduce the number of centers. The chapter ends with Section 3.4, which presents experimental results that demonstrate the good performance of the new classification technique.

3.1 Kernels Defined by Ensembles of CART Classifiers

This section introduces kernels that are defined by ensembles of CART classifiers with the main focus on the RF algorithm. The CART classifier $f_b$ segments the whole input space $\mathbb{R}^N$ into a number of $L_b$ disjoint regions $X_{b,j}$, $j = 1, \dots, L_b$, which are represented by the leaves, and assigns a class label to each region. The region $X_{b,j}$ and the corresponding leaf obtain the label of the class with maximal estimated posterior probability in this region, the estimates being computed as in (2.79)

$$\hat{p}^{(b)}(\mathsf{y} = c_k \mid \mathsf{x} \in X_{b,j}) = \frac{M_{b,j}^{(c_k)}}{M_{b,j}}, \tag{3.1}$$

where $M_{b,j}$ denotes the number of examples $\mathbf{x}_m$ in the training set that lie in the $j$-th leaf and $M_{b,j}^{(c_k)}$ the number of examples in the training set lying in the $j$-th leaf with label $c_k$. In order to understand how the CART algorithm can be interpreted as a kernel classifier, the function $\ell_{b,j}(\mathbf{x})$ associated to the $j$-th leaf is defined as in [GEW06]

$$\ell_{b,j}(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \in X_{b,j} \\ 0 & \text{else,} \end{cases} \tag{3.2}$$

and the classes $c_k$ are represented by the target unit vectors $\mathbf{e}_k \in \{0, 1\}^K$, which are all-zero vectors except for the $k$-th entry, which has the value 1. With

$$\mathbf{l}_b(\mathbf{x}) = \left( \frac{\ell_{b,1}(\mathbf{x})}{\sqrt{M_{b,1}}}, \dots, \frac{\ell_{b,L_b}(\mathbf{x})}{\sqrt{M_{b,L_b}}} \right)^T \in [0, 1]^{L_b} \tag{3.3}$$


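the vector of estimated conditional probabilities

$$\hat{\mathbf{p}}^{(b)}(\mathsf{y} \mid \mathbf{x}) = [\hat{p}^{(b)}(\mathsf{y} = c_1 \mid \mathsf{x} = \mathbf{x}), \dots, \hat{p}^{(b)}(\mathsf{y} = c_K \mid \mathsf{x} = \mathbf{x})]^T \in [0, 1]^K \tag{3.4}$$

can be computed as

$$\hat{\mathbf{p}}^{(b)}(\mathsf{y} \mid \mathbf{x}) = \sum_{m=1}^{M} \bar{\mathbf{y}}_m \mathbf{l}_b^T(\mathbf{x}_m) \mathbf{l}_b(\mathbf{x}) = \sum_{m=1}^{M} \bar{\mathbf{y}}_m \kappa_b(\mathbf{x}_m, \mathbf{x}), \tag{3.5}$$

where $\bar{\mathbf{y}}_m$ is the target unit vector corresponding to the input $\mathbf{x}_m$. Eq. (3.5) reveals that the CART algorithm can be interpreted as a kernel algorithm with the kernel

$$\kappa_b(\mathbf{x}_m, \mathbf{x}) = \begin{cases} 1/M_{b,j} & \mathbf{x}, \mathbf{x}_m \in X_{b,j} \\ 0 & \text{else.} \end{cases} \tag{3.6}$$

The CART kernel is piecewise constant, since for $\mathbf{x}, \mathbf{x}_m \in X_{b,j}$ it has the value $1/M_{b,j}$. Therefore, the initialization of an RBF network with a decision tree by creating a center for each region $X_{b,j}$, as described in [Kub98], can be interpreted as an attempt to approximate the piecewise constant kernel of a decision tree by Gaussian radial basis functions.

In [GEW06] an expression for the kernel of an ensemble of trees for regression tasks has been introduced, assuming that each tree uses the whole set $\mathcal{D}$ in the training phase. This ensemble model is similar to the one presented in Subsection 2.2.5 to explain how the RF algorithm can be interpreted as an adaptive nearest neighbor algorithm. Applied to classification tasks, the vector of estimated conditional probabilities using an ensemble of trees where each tree is built from the whole set $\mathcal{D}$ can be computed as

$$\hat{\mathbf{p}}_{ET}(\mathsf{y} \mid \mathbf{x}) = \sum_{m=1}^{M} \bar{\mathbf{y}}_m \mathbf{l}_{ET}^T(\mathbf{x}_m) \mathbf{l}_{ET}(\mathbf{x}) = \sum_{m=1}^{M} \bar{\mathbf{y}}_m \kappa_{ET}(\mathbf{x}_m, \mathbf{x}), \tag{3.7}$$

with

$$\mathbf{l}_{ET}(\mathbf{x}) = \left[ \frac{\mathbf{l}_1^T(\mathbf{x})}{\sqrt{B}}, \dots, \frac{\mathbf{l}_B^T(\mathbf{x})}{\sqrt{B}} \right]^T \in [0, 1]^{\sum_{b=1}^{B} L_b}, \tag{3.8}$$

where $L_b$ is the number of leaves of the $b$-th tree. The vector $\mathbf{l}_{ET}(\mathbf{x})$ is a sparse vector having only $B$ nonzero entries, since each vector $\mathbf{l}_b(\mathbf{x})$ is sparse with only one entry different from zero. By taking the majority vote among the $B$ classifiers for a query input $\mathbf{x}$ one chooses the class with the largest estimated posterior probability $\hat{p}_{ET}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$, which is stored in $\hat{\mathbf{p}}_{ET}(\mathsf{y} \mid \mathbf{x})$. Thus, the kernel for such an ensemble of trees is

$$\kappa_{ET}(\mathbf{x}_m, \mathbf{x}) = \mathbf{l}_{ET}^T(\mathbf{x}_m) \mathbf{l}_{ET}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \kappa_b(\mathbf{x}_m, \mathbf{x}). \tag{3.9}$$

If $B$ is finite the kernel $\kappa_{ET}(\mathbf{x}_m, \mathbf{x})$ is piecewise constant, but compared to the CART algorithm a much finer segmentation of the input space with regions of constant value $\kappa_{ET}(\mathbf{x}_m, \mathbf{x})$ is realized.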
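As a minimal sketch of (3.5), (3.6) and (3.9), the following Python snippet evaluates the tree kernels from leaf assignments. It assumes, purely for illustration, that the leaf index of every training input in every tree is already available as an integer array `leaves_train` of shape (B, M) and that of a query as `leaves_query` of shape (B,). These names and the helper functions are hypothetical and not part of the original work; every tree is assumed to be built from the whole training set, as in the ensemble model above.

```python
import numpy as np

def cart_kernel(leaves_train_b, leaf_query_b):
    """kappa_b(x_m, x) from (3.6): 1/M_{b,j} if x_m shares the leaf of x, else 0."""
    same_leaf = leaves_train_b == leaf_query_b
    m_bj = same_leaf.sum()                 # M_{b,j}: training inputs in the leaf of x
    if m_bj == 0:
        return np.zeros(len(leaves_train_b))
    return same_leaf / m_bj

def ensemble_kernel(leaves_train, leaves_query):
    """kappa_ET(x_m, x) from (3.9): average of the B tree kernels."""
    per_tree = [cart_kernel(lt, lq) for lt, lq in zip(leaves_train, leaves_query)]
    return np.mean(per_tree, axis=0)

def posterior(kernel_values, labels, n_classes):
    """Kernel form of (3.5) and (3.7): sum over m of ybar_m * kappa(x_m, x)."""
    targets = np.eye(n_classes)[labels]    # rows are the target unit vectors ybar_m
    return kernel_values @ targets

# Toy example: B = 2 trees, M = 4 training inputs, K = 2 classes.
leaves_train = np.array([[0, 0, 1, 1],
                         [0, 1, 1, 1]])
labels = np.array([0, 0, 1, 1])
leaves_query = np.array([0, 1])            # leaf of the query x in each tree
p_et = posterior(ensemble_kernel(leaves_train, leaves_query), labels, 2)
```

In this toy case the first tree places the query in a pure class-0 leaf and the second in a leaf with one class-0 and two class-1 inputs, so the averaged estimate is the mean of the two per-leaf class frequencies.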


The kernel from (3.9) is not valid for the RF algorithm because here the bootstrap procedure is applied for each tree. The majority vote among the $B$ trees in the RF corresponds to choosing that class $c_k$ whose entry in the vector of estimated posterior probabilities

\begin{align}
\hat{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x}) &= \frac{1}{B} \sum_{b=1}^{B} \sum_{(\mathbf{x}_m, y_m) \in \mathcal{D}^{(b)}} \bar{\mathbf{y}}_m \kappa_b(\mathbf{x}_m, \mathbf{x}) \tag{3.10} \\
&= \sum_{m=1}^{M} \bar{\mathbf{y}}_m \kappa_{RF}(\mathbf{x}_m, \mathbf{x}) \tag{3.11}
\end{align}

is maximal, where $\mathcal{D}^{(b)}$ represents the training set for the $b$-th tree obtained with the bootstrap method and

$$\kappa_{RF}(\mathbf{x}_m, \mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \beta_{m,b} \kappa_b(\mathbf{x}_m, \mathbf{x}) \tag{3.12}$$

the RF kernel, with $\beta_{m,b}$ denoting the number of drawings of the pattern $(\mathbf{x}_m, y_m)$ in $\mathcal{D}^{(b)}$. Thus, due to the bootstrap procedure, for each tree about $0.37M$ factors $\beta_{m,b}$ are zero. It should be noted that the inner sum in (3.10) is a unit vector for each fully grown tree. Although it is simple to take the majority vote among $B$ trees and thus to classify with the RF, it is computationally expensive to evaluate $\kappa_{RF}(\mathbf{x}_m, \mathbf{x})$. Therefore, the following RF-approximation of the vector of class conditional probabilities, which takes into account all $M$ patterns in $\mathcal{D}$ per tree, is introduced:

$$\tilde{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x}) = \alpha_{RF} \sum_{m=1}^{M} \bar{\mathbf{y}}_m \tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x}), \tag{3.13}$$

with $\alpha_{RF}$ being a normalization factor which assures that the sum over the entries in $\tilde{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x})$ is one and

$$\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} M_{b,j(\mathbf{x})} \kappa_b(\mathbf{x}_m, \mathbf{x}) = \frac{B'(\mathbf{x}, \mathbf{x}_m)}{B}. \tag{3.14}$$

Here, $M_{b,j(\mathbf{x})}$ denotes the number of training inputs in $\mathcal{D}^{(b)}$ that lie in the same leaf as $\mathbf{x}$, $B'(\mathbf{x}, \mathbf{x}_m)$ is the number of trees in the RF in which $\mathbf{x}$ and $\mathbf{x}_m$ lie in the same leaf, and $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$ stands for a kernel that is defined by the RF and which is identical to the proximity measure $\mathrm{prox}(\mathbf{x}_m, \mathbf{x})$ from (2.100). Since $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$ is a linear combination, with positive weights, of kernels $\kappa_b(\mathbf{x}_m, \mathbf{x})$, it is also a valid kernel [STC04]. The estimate from (3.13) is an $M$-nearest neighbor estimate of the posterior probabilities, where the neighborhood relations are defined by the RF kernel $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$. In contrast to $\kappa_{RF}(\mathbf{x}_m, \mathbf{x})$, the kernel $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$ is easy to evaluate and to interpret: it represents the percentage of trees in which $\mathbf{x}_m$ and $\mathbf{x}$ lie in the same leaf. Both $\kappa_{RF}(\mathbf{x}_m, \mathbf{x})$ and $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$ are piecewise constant kernels that describe the intrinsic neighborhood relations of the RF classifier in the input space $\mathbb{R}^N$. It should be noted that inputs $\mathbf{x}$ and $\mathbf{x}_m$ that are close to each other with respect to the Euclidean distance can be far from each other with respect to the RF similarity and vice versa.

Although the estimate $\hat{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x})$ is used for classification tasks by the RF algorithm, both $\hat{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x})$ and $\tilde{\mathbf{p}}_{RF}(\mathsf{y} \mid \mathbf{x})$ are based on the intrinsic similarity measure of the RF. For reasons of computational complexity, the kernel $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x})$ will therefore be used in the following to describe the RF neighborhood relations in the input space. In [GEW06], [Bre00], and [Zha00] an in-depth analysis of the case $B \to \infty$ is carried out. A main result states that the RF kernel in this case is continuous and piecewise linear. In practice the number $B$ of trees is finite and thus the kernel from (3.9), for ensembles of trees that do not use the bootstrap procedure, or the kernel from (3.14) is used to represent the piecewise constant similarity measure in the input space.
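The proximity kernel from (3.14) and the estimate from (3.13) can be sketched in a few lines of Python. The sketch assumes that leaf indices are given as integer arrays with one row per input and one column per tree (with scikit-learn, for instance, `RandomForestClassifier.apply` returns an array of this shape); the function names and the toy data are hypothetical.

```python
import numpy as np

def rf_proximity(leaves_a, leaves_b):
    """Proximity kernel of (3.14): fraction of trees in which two inputs share a leaf.

    leaves_a: (Ma, B) leaf indices, leaves_b: (Mb, B) leaf indices.
    Returns an (Ma, Mb) matrix with entries in [0, 1].
    """
    return (leaves_a[:, None, :] == leaves_b[None, :, :]).mean(axis=2)

def rf_posterior_approx(leaves_train, labels, leaves_query, n_classes):
    """Estimate of (3.13): proximity-weighted vote, normalized to sum to one."""
    prox = rf_proximity(leaves_query, leaves_train)       # (Mq, M)
    votes = prox @ np.eye(n_classes)[labels]              # (Mq, K)
    total = votes.sum(axis=1, keepdims=True)              # 1 / alpha_RF per query
    # ASSUMPTION: a query that shares no leaf with any training input gets a
    # uniform estimate; the text does not treat this case.
    safe_total = np.where(total > 0, total, 1.0)
    return np.where(total > 0, votes / safe_total, 1.0 / n_classes)

# Toy example: M = 4 training inputs, B = 3 trees, K = 2 classes.
leaves_train = np.array([[0, 0, 0],
                         [0, 1, 0],
                         [1, 1, 1],
                         [1, 1, 1]])
labels = np.array([0, 0, 1, 1])
leaves_query = np.array([[0, 1, 1]])
p_tilde = rf_posterior_approx(leaves_train, labels, leaves_query, 2)
```

Because only equality of leaf indices is needed, the kernel matrix among all training inputs is obtained by calling `rf_proximity(leaves_train, leaves_train)`.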

3.2 GRBF Based on the RF Kernel

The GRBF algorithm considered in this work implements the similarities $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ to the centers $\mathbf{x}_{K,h}$ by the Gaussian functions

$$\psi_h(\mathbf{x}, \mathbf{x}_{K,h}) = \exp\left( -\gamma (\mathbf{x} - \mathbf{x}_{K,h})^T \mathbf{A}_h^T \mathbf{A}_h (\mathbf{x} - \mathbf{x}_{K,h}) \right), \tag{3.15}$$

where $\gamma > 0$ and $\mathbf{A}_h \in \mathbb{R}^{N \times N}$ is a matrix that linearly transforms input vectors $\mathbf{x}$ into the vectors $\mathbf{A}_h \mathbf{x}$. With respect to the metric

$$\|\mathbf{x} - \mathbf{x}_{K,h}\|_{\mathbf{A}_h}^2 = (\mathbf{x} - \mathbf{x}_{K,h})^T \mathbf{A}_h^T \mathbf{A}_h (\mathbf{x} - \mathbf{x}_{K,h}) \tag{3.16}$$

the Gaussian function from (3.15) is radial. As shown in Fig. 3.1 the output $\mathbf{u} \in \mathbb{R}^K$ can be computed as

$$\mathbf{u} = \mathbf{W} \boldsymbol{\psi}, \tag{3.17}$$

where $\mathbf{W} \in \mathbb{R}^{K \times H}$ is the weight matrix with the $(k, h)$-th entry being $w_{k,h}$ and $\boldsymbol{\psi} = [\psi_1(\mathbf{x}, \mathbf{x}_{K,1}), \dots, \psi_H(\mathbf{x}, \mathbf{x}_{K,H})]^T \in [0, 1]^H$ denotes the similarity of the input vector $\mathbf{x}$ to the $H$ centers.

The proposed strategy in order to determine the GRBF parameters $H$, $\mathbf{x}_{K,h}$, $\hat{\mathbf{C}}_h^{-1} = \mathbf{A}_h^T \mathbf{A}_h$, $w_{k,h}$ and $\gamma$ is discussed in the following. In a first step the RF kernel $\tilde{\kappa}_{RF}(\mathbf{x}_{m_1}, \mathbf{x}_{m_2})$ is used to compute $H$, $\mathbf{x}_{K,h}$, and $\hat{\mathbf{C}}_h^{-1}$. The RF neighborhood relations among the training points are described by the values $\tilde{\kappa}_{RF}(\mathbf{x}_{m_1}, \mathbf{x}_{m_2}) \in [0, 1]$, with $\mathbf{x}_{m_1}, \mathbf{x}_{m_2} \in \mathcal{D}$, so that a class-specific clustering of the input space can be computed. Algorithm 3.1 shows how this class-specific clustering can be implemented. The number of clusters that are generated for class $c_k$ is denoted with $Q_k$. The parameter $\tau_{sim} \in [0, 1]$ in Line 7 of Algorithm 3.1 is important since a large value of $\tau_{sim}$ tends to produce a large number of clusters, each with few input vectors. Since for each cluster a center in the GRBF is constructed, $\tau_{sim}$ controls the parameter

$$H = \sum_{k=1}^{K} Q_k. \tag{3.18}$$

This topic is discussed in more detail in Section 3.3. It should be noted that at this stage an identification of outliers can be performed. Outliers can be defined as those inputs $\mathbf{x}_{\tilde{m}}$ in the training set with $\tilde{\kappa}_{RF}(\mathbf{x}_{\tilde{m}}, \mathbf{x}_m)$ being smaller than a predefined threshold for all $m \neq \tilde{m}$. Another possibility is to define those points as outliers that lie in clusters containing fewer than a predefined number of inputs from the training set. Outliers and the clusters generated by them are removed in order to increase the robustness of the resulting classifier.

Algorithm 3.1 Clustering based on $\tilde{\kappa}_{RF}(\mathbf{x}_{m_1}, \mathbf{x}_{m_2})$
 1: for all classes $c_k$ do
 2:   include all $\mathbf{x}_m \in \mathcal{D}$ belonging to $c_k$ into the set $\mathcal{D}_{c_k}$
 3:   for all $\mathbf{x}_{m_1}$ in $\mathcal{D}_{c_k}$ do
 4:     compute $\tilde{\kappa}_{RF}(\mathbf{x}_{m_1}, \mathbf{x}_{m_2})$ for all $\mathbf{x}_{m_2} \in \mathcal{D}_{c_k}$
 5:   end for
 6:   while $\mathcal{D}_{c_k}$ not empty do
 7:     determine that $\mathbf{x}_m \in \mathcal{D}_{c_k}$ which has the largest number of similar inputs $\mathbf{x}_{m'} \in \mathcal{D}_{c_k}$. The inputs $\mathbf{x}_m$ and $\mathbf{x}_{m'}$ are similar if $\tilde{\kappa}_{RF}(\mathbf{x}_m, \mathbf{x}_{m'}) > \tau_{sim}$
 8:     let $\mathbf{x}_m$ from Line 7 and the inputs which are similar to it build a new cluster and remove them from $\mathcal{D}_{c_k}$
 9:   end while
10: end for

Denoting with $\mathcal{D}_{c_k,q}$ the set of input vectors $\mathbf{x}_m$ from the training set which belong to the $q$-th cluster of the $k$-th class, the parameters $\mathbf{x}_{K,h}$ and the inverse of $\hat{\mathbf{C}}_h^{-1}$ in the GRBF corresponding to this cluster are computed according to

$$\mathbf{x}_{K,h} = \frac{1}{|\mathcal{D}_{c_k,q}|} \sum_{\mathbf{x}_m \in \mathcal{D}_{c_k,q}} \mathbf{x}_m, \tag{3.19}$$

$$\hat{\mathbf{C}}_h = \frac{1}{|\mathcal{D}_{c_k,q}| - 1} \sum_{\mathbf{x}_m \in \mathcal{D}_{c_k,q}} (\mathbf{x}_m - \mathbf{x}_{K,h})(\mathbf{x}_m - \mathbf{x}_{K,h})^T, \tag{3.20}$$

where $|\mathcal{D}_{c_k,q}|$ stands for the number of input vectors in $\mathcal{D}_{c_k,q}$. With $\mathbf{x}_m \in \mathcal{D}_{c_k,q}$ being modeled as realizations of independent random variables with the same distribution, $\mathbf{x}_{K,h}$ and $\hat{\mathbf{C}}_h$ from (3.19) and (3.20) represent the sample mean and the sample covariance of this distribution. Thus, the term $(\mathbf{x} - \mathbf{x}_{K,h})^T \mathbf{A}_h^T \mathbf{A}_h (\mathbf{x} - \mathbf{x}_{K,h})$ in (3.15) implements the squared Mahalanobis distance between $\mathbf{x}$ and $\mathbf{x}_{K,h}$. At this point the number and the location of centers as well as the distance underlying the kernels

$$\kappa_{\alpha_h}(\mathbf{x}, \mathbf{x}_{K,h}) = \psi_h(\mathbf{x}, \mathbf{x}_{K,h}), \quad \text{with } \alpha_1 = \alpha_2 = \dots = \alpha_H = \gamma \tag{3.21}$$

can be computed. What is still required in order to minimize the criterion from (2.65) is the computation of the weights $w_{k,h}$ and of the parameter $\gamma$. In order to assure interpretability the weight $w_{k,h}$ combining the $h$-th similarity measure $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ with the $k$-th output $u_k$ must be determined in such a way that it has a positive value if $\mathbf{x}_{K,h}$ is the center of a cluster belonging to class $c_k$ and a negative value if $\mathbf{x}_{K,h}$ is the center of a cluster belonging to another class than $c_k$. It can be stated that the maximal output $u_k$ to an input $\mathbf{x}$ has the largest value due to the fact that it is most similar to prototypes belonging to class $c_k$. The weights $w_{k,h}$ and the parameter $\gamma$ are computed by minimizing with respect to $\mathbf{W}$ and $\gamma$ the empirical risk

$$R_{emp}(\gamma, \mathbf{X}_K, \mathbf{W}, \mathcal{D}) = \frac{1}{M} \sum_{m=1}^{M} \|\mathbf{W} \boldsymbol{\psi}_m - \bar{\mathbf{y}}_m\|_2^2 = \frac{1}{M} \sum_{m=1}^{M} \|\mathbf{u}_m - \bar{\mathbf{y}}_m\|_2^2, \tag{3.22}$$

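with $\boldsymbol{\psi}_m = \boldsymbol{\psi}_\gamma(\mathbf{x}_m, \mathbf{X}_K) = [\psi_\gamma(\mathbf{x}_m, \mathbf{x}_{K,1}), \dots, \psi_\gamma(\mathbf{x}_m, \mathbf{x}_{K,H})]^T \in [0, 1]^H$ denoting the similarities of the $m$-th input vector to the $H$ centers represented by $\mathbf{X}_K = [\mathbf{x}_{K,1}, \dots, \mathbf{x}_{K,H}] \in \mathbb{R}^{N \times H}$. Due to the assignment of each cluster to a single class, in general the constraints with respect to the positive or negative values of the weights $w_{k,h}$ are fulfilled by the pseudoinverse solution:

$$\mathbf{W} = \bar{\mathbf{Y}} \boldsymbol{\Psi}^T (\boldsymbol{\Psi} \boldsymbol{\Psi}^T)^{-1}, \tag{3.23}$$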

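with $\bar{\mathbf{Y}} = [\bar{\mathbf{y}}_1, \dots, \bar{\mathbf{y}}_M] \in \{0, 1\}^{K \times M}$ and $\boldsymbol{\Psi} = [\boldsymbol{\psi}_1, \dots, \boldsymbol{\psi}_M] \in [0, 1]^{H \times M}$. If this is not the case one has to compute the weights $w_{k,h}$ from the constrained linear least-squares problem of minimizing $R_{emp}(\gamma, \mathbf{X}_K, \mathbf{W}, \mathcal{D})$ so that

$$\mathbf{P}_k \tilde{\mathbf{w}}_k \leq \mathbf{0}_H \tag{3.24}$$

is fulfilled for all $k = 1, \dots, K$, with $\tilde{\mathbf{w}}_k = [w_{k,1}, \dots, w_{k,H}]^T \in \mathbb{R}^H$ and $\mathbf{P}_k$ being a diagonal matrix with entry $[\mathbf{P}_k]_{i,i} = -1$ if the $i$-th cluster belongs to class $c_k$ and $[\mathbf{P}_k]_{i,i} = 1$ else. In the great majority of the experiments that have been performed the constraints were fulfilled by the pseudoinverse solution due to the class-specific construction of the clusters.

Finally, the parameter $\gamma$ is computed with the gradient descent method in order to minimize $R_{emp}(\gamma, \mathbf{X}_K, \mathbf{W}, \mathcal{D})$. Since the values $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ depend on $\gamma$, for the gradient descent method an iteration index $j$ must be introduced for $\gamma$, $\boldsymbol{\Psi}$, $\mathbf{u}_m$, $\mathbf{W}$ and $R_{emp}(\gamma, \mathbf{X}_K, \mathbf{W}, \mathcal{D})$. At each iteration the parameter $\gamma$ is updated according to

$$\gamma^{(j+1)} = \gamma^{(j)} - \eta_{GD} \frac{\partial R_{emp}^{(j)}(\gamma^{(j)}, \mathbf{X}_K, \mathbf{W}^{(j)}, \mathcal{D})}{\partial \gamma^{(j)}}, \tag{3.25}$$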

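where $\eta_{GD}$ is the gradient descent step size and the gradient is computed as shown in Appendix A.9 by

$$\frac{\partial R_{emp}^{(j)}(\gamma^{(j)}, \mathbf{X}_K, \mathbf{W}^{(j)}, \mathcal{D})}{\partial \gamma^{(j)}} = -\frac{2}{M} \operatorname{tr}\left\{ (\mathbf{U}^{(j)} - \bar{\mathbf{Y}})^T \mathbf{W}^{(j)} (\boldsymbol{\Psi}^{(j)} \odot \mathbf{D}) \right\}, \tag{3.26}$$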

with $\operatorname{tr}\{\cdot\}$ denoting the trace operator, $\odot$ the elementwise multiplication, $\mathbf{D} \in \mathbb{R}_+^{H \times M}$ the distance matrix with the $(h, m)$-th entry $(\mathbf{x}_m - \mathbf{x}_{K,h})^T \hat{\mathbf{C}}_h^{-1} (\mathbf{x}_m - \mathbf{x}_{K,h})$, and $\mathbf{U}^{(j)} = [\mathbf{u}_1^{(j)}, \dots, \mathbf{u}_M^{(j)}] \in \mathbb{R}^{K \times M}$. The iterations are stopped if the difference between the empirical risk at step $j$ and the empirical risk at the previous step $(j - 1)$ is smaller than a predefined threshold $\tau_{GD}$ or if the number of iterations exceeds a predefined threshold $J_{GD}$. The weight matrix $\mathbf{W}^{(j)}$ and the parameter $\gamma^{(j)}$ from the last iteration are assigned to $\mathbf{W}$ and $\gamma$, respectively. This way all parameters that are required to compute the vector $\mathbf{u}$ for a query input $\mathbf{x}$ can be determined.

As described in Subsection 2.2.1, a vector $\mathbf{u}$ can first be transformed using the confidence mapping described in Appendix A.5 into the vector $\mathbf{cnf}(\mathbf{u}) = [\mathrm{cnf}_1(u_1), \dots, \mathrm{cnf}_K(u_K)]^T \in [0, 1]^K$ and then by normalization as presented in (2.58) into the estimates $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$ of the posterior probabilities $p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$. The values $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$ build the vector of estimated conditional probabilities $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$, which is used for classification by assigning to an input $\mathbf{x}$ that class $c_k$ whose corresponding entry in $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$ is maximal.

It should be noted that $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$ can only be a reliable estimate of $p(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$ in those regions of the input space where enough training data is available. Due to the localized nature of the generalized radial basis functions it is possible to identify whether an input $\mathbf{x}$ belongs to a region of the input space $\mathbb{R}^N$ where the estimates $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$ are not reliable. This is the case if all confidence values $\mathrm{cnf}_k(u_k(\mathbf{x}))$, $k = 1, \dots, K$, are small. Thus, if all values $\mathrm{cnf}_k(u_k(\mathbf{x}))$ lie below a given threshold $\tau_{cnf}$, this indicates that the training data is not suitable for reliable statistics on the input $\mathbf{x}$ and a decision for the class to which $\mathbf{x}$ belongs should be rejected.

A second possibility to identify whether a decision should be rejected is obtained by analyzing the estimates $\hat{p}(\mathsf{y} = c_k \mid \mathsf{x} = \mathbf{x})$. If there is no dominating value in the vector $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$, this indicates that the input $\mathbf{x}$ lies at the border between regions that belong to different classes, and it is useful to reject the decision if the costs associated to a reject are lower than the costs associated to a misclassification. Denoting with $\hat{p}(\mathsf{y} = c_{k_1} \mid \mathsf{x} = \mathbf{x})$ the largest and with $\hat{p}(\mathsf{y} = c_{k_2} \mid \mathsf{x} = \mathbf{x})$ the second largest entry in $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$, a decision should be rejected if the difference between the two estimates is lower than a threshold $\tau_{rej}$. Both thresholds $\tau_{cnf}$ and $\tau_{rej}$ can be adjusted by cross-validation techniques. By increasing the thresholds the part of the input space which leads to a reject-decision grows and by decreasing the thresholds it shrinks. Suitable values that are used for the experiments are $\tau_{cnf} = 0.3$ and $\tau_{rej} = 0.1$.

Figure 3.2: GRBF with reject option (block diagram: $\mathbf{x}$ → GRBF → $\mathbf{u}$ → confidence mapping → $\mathbf{cnf}(\mathbf{u})$ → normalization → $\hat{\mathbf{p}}(\mathsf{y} \mid \mathbf{x})$ → $\max(\cdot)$ → $c_k$, with one reject branch after the confidence mapping and one after the normalization)

Fig. 3.2 summarizes the stages in the GRBF classification process. The GRBF design proposed here inherits the good generalization property of the RF classifier by approximating its neighborhood relations in the input space but it additionally offers advantages that RF do not have. The first main advantage is the interpretability which is realized by the similarity to class-specific templates in the input space. The interpretability must be understood here as a statement of the form: “The query input x obtains the class label ck because it is close to at least one center xK,h that belongs to class ck and far away from the centers that belong to the other classes”. Of course the centers to which x is similar can be specified. If these centers are interpretable then also the similarity to them is interpretable. The second advantage is the reduced computational complexity of the trained classifiers. Another useful property of the GRBF is the ability to recognize that a query input has a small similarity to all training examples, offering the possibility to reject a decision for such an input. Additionally it should be mentioned that by using the RF algorithm as part of the design process for GRBF classifiers, one can easily determine a suitable input space by performing feature selection as presented in Subsection 2.3.2.
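The classification chain of Fig. 3.2 can be sketched as follows. The similarity follows (3.15) and the output follows (3.17); the confidence mapping of Appendix A.5 and the normalization of (2.58) are not reproduced in this chapter, so the sketch substitutes a simple clipping to $[0, 1]$ and a division by the sum. Both substitutions, as well as all function names and toy values, are assumptions made only for illustration.

```python
import numpy as np

def similarities(x, centers, inv_covs, gamma):
    """psi_h(x, x_{K,h}) from (3.15), with inv_covs[h] playing the role of A_h^T A_h."""
    diffs = x - centers                                      # (H, N)
    d2 = np.einsum("hi,hij,hj->h", diffs, inv_covs, diffs)   # squared Mahalanobis distances
    return np.exp(-gamma * d2)

def classify_with_reject(x, centers, inv_covs, gamma, W, tau_cnf=0.3, tau_rej=0.1):
    """Return the class index, or None if the decision is rejected."""
    u = W @ similarities(x, centers, inv_covs, gamma)        # (3.17)
    cnf = np.clip(u, 0.0, 1.0)       # ASSUMPTION: stand-in for the mapping of Appendix A.5
    if np.all(cnf < tau_cnf):        # x is dissimilar to all prototypes
        return None
    p = cnf / cnf.sum()              # ASSUMPTION: normalization to unit sum, cf. (2.58)
    second, first = np.sort(p)[-2:]
    if first - second < tau_rej:     # no dominating posterior estimate
        return None
    return int(np.argmax(p))

# Toy setup: one prototype per class, identity covariances, hand-picked weights.
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
inv_covs = np.stack([np.eye(2), np.eye(2)])
W = np.array([[1.0, -0.1], [-0.1, 1.0]])
near = classify_with_reject(np.array([0.2, 0.0]), centers, inv_covs, 1.0, W)
far = classify_with_reject(np.array([50.0, 50.0]), centers, inv_covs, 1.0, W)
```

In this toy setup `near` is assigned to the class of the first prototype, while `far` is rejected by the $\tau_{cnf}$ test because it is dissimilar to every prototype.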

3.3 Pruning

In this section the topic of how to choose the parameter $\tau_{sim}$ in Alg. 3.1 is discussed, because this question leads to the tradeoff between low generalization error rate and interpretability. A large value of $\tau_{sim}$ yields a large number of centers in the GRBF classifier. Although a larger number of centers assures a better approximation of the RF similarity measure (examples must be very close to each other in terms of the RF neighborhood in order to belong to the same cluster), it also means an increased variance in the bias-variance framework and a reduced interpretability. The strategy proposed in order to deal with this tradeoff is to assign a high value to $\tau_{sim}$ and to prune the resulting GRBF as much as possible while assuring a good generalization performance.

To prune a GRBF an impurity measure can be used that indicates how many examples from other classes are in a cluster that belongs to class $c_k$. The following impurity measure is used for the $h$-th cluster:

$$\mathrm{imp}_h = 1 - \left( \frac{|\mathcal{D}_h^{(c_k)}|}{|\mathcal{D}_h|} \right)^2, \tag{3.27}$$

where $|\mathcal{D}_h^{(c_k)}|$ is the cardinality of the set containing the training examples $\mathbf{x}_m$ that lie in the $h$-th cluster and belong to class $c_k$, and $|\mathcal{D}_h|$ is the cardinality of the set containing the training examples $\mathbf{x}_m$ that lie in the $h$-th cluster. An input $\mathbf{x}$ lies in a cluster with center $\mathbf{x}_{K,h}$ if its similarity $\psi_h(\mathbf{x}, \mathbf{x}_{K,h})$ is greater than the smallest value among the similarities computed with the training inputs that are used to calculate the sample mean and covariance matrix for this cluster. The merging of two clusters belonging to the same class is performed by combining the corresponding sets $\mathcal{D}_h$ and $\mathcal{D}_{h'}$ to $\mathcal{D}_{h,h'} = \mathcal{D}_h \cup \mathcal{D}_{h'}$. Analogously the set $\mathcal{D}_{h,h'}^{(c_k)} = \mathcal{D}_h^{(c_k)} \cup \mathcal{D}_{h'}^{(c_k)}$ is constructed. Then, as in (3.27), the impurity $\mathrm{imp}_{h,h'}$ of the cluster after the merging step is computed using $\mathcal{D}_{h,h'}$ and $\mathcal{D}_{h,h'}^{(c_k)}$.

Alg. 3.2 describes the steps required to determine which two clusters should be merged in such a way that the sum of the impurities in all clusters after the pruning of one center in the GRBF is minimal. The parameters $\mathbf{x}_{K,h,h'}$ and $\hat{\mathbf{C}}_{h,h'}$ that are required for the similarity $\psi_{h,h'}(\mathbf{x}, \mathbf{x}_{K,h,h'})$ to the newly created center $\mathbf{x}_{K,h,h'}$ are computed as the sample mean and the sample covariance matrix from the set $\mathcal{D}_{h,h'}^{(c_k)}$ according to (3.19) and (3.20).

Algorithm 3.2 Pruning strategy for GRBF
1: for all clusters $h = 1, \dots, H$ do
2:   find the center $\mathbf{x}_{K,h'}$, $h' \neq h$, belonging to the same class as $\mathbf{x}_{K,h}$ with largest value $\psi_h(\mathbf{x}_{K,h}, \mathbf{x}_{K,h'})$
3:   merge the $h$-th with the $h'$-th cluster, compute $\mathrm{imp}_{h,h'}$ and undo the merging
4: end for
5: prune the GRBF by merging those clusters $h$ and $h'$ which lead to the smallest impurity $\mathrm{imp}_{h,h'}$

In order to determine a suitable number of centers for a given application the $V$-fold cross validation method presented in Subsection 2.1.2 with the 0/1-loss is used. For each fold one starts with a large value of $\tau_{sim}$ so that a large number of clusters is generated. By applying Alg. 3.2 the number of clusters is reduced until a single cluster per class remains. For each number of clusters $H_i$ the risk $\hat{R}(f_{H_i})$ is estimated as in (2.37) as the average over the $V$ values $\hat{R}_{\mathcal{D}_v}(f_{H_i,v})$, which represent an estimate of the GRBF risk with $H_i$ clusters obtained by using the $v$-th fold. Based on the estimates $\hat{R}(f_{H_i})$, the smallest number of centers $H_{opt}$ that still leads to an acceptably small risk $\hat{R}(f_{H_{opt}})$ is chosen, and a GRBF with a large number of centers is created using the whole training set $\mathcal{D}$. With Alg. 3.2 the number of centers is reduced to $H_{opt}$ so that the final GRBF classifier is obtained. A good estimate of the risk for the final GRBF classifier is given by the value $\hat{R}(f_{H_{opt}})$. The total number of RF classifiers that must be computed to implement this method is $V + 1$.

The following example illustrates the pruning procedure for a two-dimensional classification problem. The training set consists of 320 two-dimensional inputs belonging to class $c_1$ and 320 belonging to class $c_2$ in such a way that they form a spiral as shown on the left of Fig. 3.3. On the right side of Fig. 3.3 the decisions of an RF with $B = 100$ trees are plotted for the input space, where the white area represents class $c_1$ and the black area class $c_2$.

Figure 3.3: Spiral training set and RF decisions

Setting τsim = 0.5, i. e., inputs belong to the same cluster only if they land in the same leaf in more than 50% of the trees, generates a GRBF with 21 centers, which leads to the partitioning of the input space that is shown in Fig. 3.4. On the left hand side of the figure four different regions can be differentiated. The black area indicates that for all inputs x lying here a decision is rejected since all confidence values cnfk (uk (x)), k = 1, 2 are smaller than τcnf = 0.3 and thus, the estimates p ˆ(y = ck |x = x) are not reliable. The bright grey area denotes those regions in the input space that are assigned to class c1 and the dark grey area denotes those regions that are assigned to class c2 . Finally, the white area marks the border between the two classes, where a decision is rejected since there is no dominating estimated class posterior probability when using τrej = 0.1. On the right side of Fig. 3.4 level curves with equal similarity are shown for all H = 21 centers. This representation visualizes the GRBF neighborhood relations in the input space as they are extracted from the RF kernel.

Figure 3.4: τsim = 0.5, H = 21 centers
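The effect of $\tau_{sim}$ on the number of clusters, and hence on the number of centers, can be reproduced with a small sketch of the greedy clustering of Algorithm 3.1 for a single class. The function name, the input format (a precomputed proximity matrix among the inputs of one class) and the toy values are assumptions made for illustration; ties are broken by taking the first candidate.

```python
import numpy as np

def cluster_class(prox, tau_sim):
    """Greedy clustering of one class from its proximity matrix (sketch of Algorithm 3.1).

    prox: (n, n) matrix of RF proximities among the inputs of a single class.
    Returns a list of clusters, each a list of indices into that class.
    """
    remaining = np.arange(prox.shape[0])
    clusters = []
    while remaining.size > 0:
        similar = prox[np.ix_(remaining, remaining)] > tau_sim
        np.fill_diagonal(similar, False)             # an input is not its own neighbour
        seed = int(np.argmax(similar.sum(axis=1)))   # most similar inputs (Line 7)
        members = similar[seed].copy()
        members[seed] = True                         # seed plus its similar inputs (Line 8)
        clusters.append(remaining[members].tolist())
        remaining = remaining[~members]
    return clusters

# Toy proximities: inputs 0-2 are mutually similar (0.8), inputs 3-4 as well (0.9).
prox = np.full((5, 5), 0.1)
prox[:3, :3] = 0.8
prox[3:, 3:] = 0.9
np.fill_diagonal(prox, 1.0)

coarse = cluster_class(prox, tau_sim=0.5)
fine = cluster_class(prox, tau_sim=0.85)
```

With $\tau_{sim} = 0.5$ the toy data split into two clusters, whereas $\tau_{sim} = 0.85$ breaks the first group into singletons and yields four clusters, which mirrors the statement that a large $\tau_{sim}$ produces many small clusters.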

In Fig. 3.5 the partitioning of the input space as well as the level curves of equal similarity are shown after pruning the GRBF classifier as described above from $H = 21$ to $H = 10$ prototypes. Additional pruning leads to an estimated risk that is large compared to $\hat{R}(f_{H_{10}})$, the estimated risk for $H = 10$, indicating that the pruning should be stopped. It can be seen from the level curves that the similarity measures are intuitive.

Figure 3.5: Pruning with the impurity measure, H = 10 centers
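One pruning step of Algorithm 3.2 can be sketched with plain Python sets. The sketch assumes that each cluster is described by two index sets, the training examples of its own class (`own`) and all training examples lying in it (`all`), and that the most similar same-class center of every cluster has already been determined via $\psi_h(\mathbf{x}_{K,h}, \mathbf{x}_{K,h'})$ and is passed in as `nearest`. These names and the data layout are hypothetical.

```python
def impurity(own_class_members, all_members):
    """imp_h from (3.27): 1 - (|D_h^(c_k)| / |D_h|)^2."""
    return 1.0 - (len(own_class_members) / len(all_members)) ** 2

def best_merge(clusters, nearest):
    """Select the pair (h, h') whose trial merge gives the smallest impurity.

    clusters: list of dicts with index sets 'own' (D_h^(c_k)) and 'all' (D_h).
    nearest:  nearest[h] is the most similar same-class cluster h', or None
              if the class of cluster h has no other cluster.
    Returns (impurity, h, h') or None if no merge is possible.
    """
    best = None
    for h, h2 in enumerate(nearest):
        if h2 is None:
            continue
        own = clusters[h]["own"] | clusters[h2]["own"]       # merged own-class set
        members = clusters[h]["all"] | clusters[h2]["all"]   # merged set of all members
        imp = impurity(own, members)                         # trial merge, nothing is modified
        if best is None or imp < best[0]:
            best = (imp, h, h2)
    return best

# Toy example: three clusters of one class; cluster 1 also contains a foreign input (10).
clusters = [
    {"own": {0, 1, 2}, "all": {0, 1, 2}},
    {"own": {3, 4}, "all": {3, 4, 10}},
    {"own": {5, 6}, "all": {5, 6}},
]
choice = best_merge(clusters, nearest=[2, 0, 0])
```

Here merging clusters 0 and 2 keeps the merged cluster pure, so this pair is preferred over any merge involving cluster 1. Recomputing the center and covariance of the merged cluster according to (3.19) and (3.20) is left out of the sketch.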

3.4 Experimental Results

In this section the RF-based GRBF algorithm is compared to other state of the art algorithms by presenting its performance on four datasets where interpretability is desired. It will be shown that the GRBF algorithm achieves approximately the same generalization error rate and offers additionally the advantages of interpretability and reduced computational complexity.

One advantage of the GRBF algorithm is the possibility to reject decisions as described in Section 3.2. In order to be able to compare the performance of the algorithm with results from other classifiers which do not offer this opportunity, the $V$-fold estimate

$$\hat{R}_{CV}(f) = \frac{1}{V} \sum_{v=1}^{V} \left( \frac{1}{|\mathcal{D}_v|} \sum_{(\mathbf{x}_m, y_m) \in \mathcal{D}_v} \left( 1 - \delta(y_m, f(\mathbf{x}_m, \mathcal{D} \setminus \mathcal{D}_v)) \right) \right) \tag{3.28}$$

is replaced for GRBF classifiers by

$$\hat{R}'_{CV}(f_{GRBF}) = \frac{1}{V} \sum_{v=1}^{V} \left( \frac{1}{|\mathcal{D}_{v,nr}|} \sum_{(\mathbf{x}_m, y_m) \in \mathcal{D}_{v,nr}} \left( 1 - \delta(y_m, f_{GRBF}(\mathbf{x}_m, \mathcal{D} \setminus \mathcal{D}_v)) \right) \right), \tag{3.29}$$

where $\mathcal{D}_v$ denotes the validation set for the $v$-th classifier and $\mathcal{D}_{v,nr}$ the set which contains all patterns from $\mathcal{D}_v$ where the $v$-th classifier does not reject the decision. Thus, $\hat{R}'_{CV}(f_{GRBF})$ represents the misclassification error among those test inputs that are not rejected.

The four datasets used in this chapter stem from medical applications where it is important to predict whether a patient is healthy or not based on symptoms which are quantified and stored in the input vector $\mathbf{x}$. All datasets represent two-class problems, i.e., $y \in \{c_1, c_2\}$. A detailed description of the four datasets breast cancer, diabetes, heart and thyroid can be found in [IDAG01]. For each dataset $V = 100$ folds are created by randomly choosing patterns from $\mathcal{D}$ to create the validation sets $\mathcal{D}_v$. The remaining patterns represent the training data $\mathcal{D} \setminus \mathcal{D}_v$ and the performance estimates are computed with (3.28) or (3.29). The description of the datasets is given in Table 3.1. Hereby, $|\mathcal{D} \setminus \mathcal{D}_v|$ denotes the size of the training set, $|\mathcal{D}_v|$ the size of the validation set, and $N$ the dimensionality of the input space. The inputs $\mathbf{x}$ are standardized as described in (2.105) in all datasets.

Name          | $|\mathcal{D} \setminus \mathcal{D}_v|$ | $|\mathcal{D}_v|$ | $N$
breast cancer | 200 | 77  | 9
diabetes      | 468 | 300 | 8
heart         | 170 | 100 | 13
thyroid       | 140 | 75  | 5

Table 3.1: Description of the four datasets
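The two cross-validation estimates (3.28) and (3.29) differ only in the set over which each fold is averaged. A minimal sketch, assuming that a rejected decision is represented by `None` in the list of predictions (an encoding chosen here for illustration, not taken from the original implementation):

```python
import numpy as np

def cv_risk(y_true_folds, y_pred_folds):
    """V-fold 0/1-loss estimate; predictions equal to None count as rejects.

    Without rejects this is (3.28). With rejects each fold is averaged over
    its non-rejected patterns only, as in (3.29).
    """
    fold_risks = []
    for y_true, y_pred in zip(y_true_folds, y_pred_folds):
        kept = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
        if not kept:     # ASSUMPTION: a fold with only rejected decisions is skipped
            continue
        fold_risks.append(np.mean([t != p for t, p in kept]))
    return float(np.mean(fold_risks))

# Toy example with V = 2 folds; the third pattern of the first fold is rejected.
y_true_folds = [[0, 1, 1, 0], [1, 0]]
y_pred_folds = [[0, 1, None, 1], [1, 0]]
risk = cv_risk(y_true_folds, y_pred_folds)
```

In the toy example the first fold has one error among three non-rejected patterns and the second fold has none, so the estimate is the mean of 1/3 and 0.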

To evaluate the GRBF algorithm, its performance is compared with that of popular algorithms such as SVM, One-Nearest-Neighbor (1NN), and RF. The 1NN algorithm has not been discussed in Chapter 2 because it is a simple procedure: the output $y$ for a query input $x$ takes the value $y_m$ of the input $x_m$ in the training set that is closest to $x$. Details about the 1NN algorithm and asymptotic bounds for its prediction risk can be found in [DHS01]. In the bias-variance framework the 1NN classifier has a low bias but a high variance. Since 1NN is a memory-based method, a drawback of the procedure is its computational and spatial complexity, because for every query input the closeness to all inputs from the training set must be computed. It should be noted that “closeness” implies a similarity or dissimilarity measure, which is chosen to be the Euclidean distance in the following.

The classification performances of the SVM, RF, 1NN, and GRBF algorithms for the four datasets are given in Table 3.2 (footnote 1). To give an impression of the stability of a prediction risk estimate, each average among the $V$ folds is accompanied by the sample standard deviation.

Dataset         $\hat{R}(f_{SVM})$   $\hat{R}(f_{RF})$   $\hat{R}(f_{1NN})$   $\hat{R}'(f_{GRBF})$
breast cancer   0.260 ± 0.047        0.249 ± 0.046       0.327 ± 0.049        0.265 ± 0.053
diabetes        0.235 ± 0.017        0.242 ± 0.018       0.301 ± 0.021        0.221 ± 0.031
heart           0.160 ± 0.033        0.173 ± 0.034       0.232 ± 0.037        0.157 ± 0.038
thyroid         0.048 ± 0.022        0.047 ± 0.023       0.044 ± 0.022        0.031 ± 0.019

Table 3.2: Estimated generalization error

Footnote 1: The SVM and the 1NN algorithms from the Matlab Spider toolbox [WEBS06] and the author's own Matlab implementation of the RF and GRBF algorithms have been used to compute the results in Table 3.2.

In addition to the SVM results, [IDAG01] presents the performances of other algorithms such as AdaBoost or regular RBF networks. Their generalization error estimates are very close to those of the SVM classifier. It should be mentioned that all these state-of-the-art algorithms require a number of iteration steps to adapt their meta-parameters, e.g., the number of centers in RBF networks or the kernel parameters and $C_{SVM}$ in SVM classifiers. All these classifiers offer slightly optimistic estimates of the generalization error and come with the effort of performing the iteration steps for setting the meta-parameters. On the other hand, no adaptation of parameters is necessary for the RF algorithm on any of the datasets. Given an RF classifier, the only parameter that has to be set in order to generate a GRBF network is the value of $\tau_{sim}$. As already mentioned in the previous section, it is recommended to start with a high value of $\tau_{sim}$ and then to prune the network in order to obtain better interpretability and lower variance. However, the meaning of “high” depends on the dataset, and it should be taken into account that the higher $\tau_{sim}$, the higher the number of identified outliers (an outlier being defined as in Section 3.2 based on the RF kernel). The method applied in this work is to set $\tau_{sim}$ in such a way that the number of outliers is less than 9% of the training data.

Let $\bar{M}_{nc}$ denote the mean number of patterns in the validation sets that were not classified correctly among the $V$ folds, and $\bar{M}_r$ the mean number of patterns where the GRBF classifiers reject a decision. These values are given in Table 3.3 for the above datasets. With these values the estimated risk $\hat{R}'(f_{GRBF})$ can be expressed as

\[ \hat{R}'(f_{GRBF}) = \frac{\bar{M}_{nc} - \bar{M}_r}{|D_v| - \bar{M}_r}. \]

Dataset         $|D_v|$   $\bar{M}_{nc}$   $\bar{M}_r$
breast cancer   77        27.89            10.21
diabetes        300       105.40           50.08
heart           100       22.82            8.49
thyroid         75        7.57             5.42

Table 3.3: Values $|D_v|$, $\bar{M}_{nc}$, and $\bar{M}_r$ for the 4 datasets
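As a quick consistency check, the risk expression above can be evaluated on the values of Table 3.3. The following is only a small sketch; the names `table_3_3` and `grbf_risk` are chosen here for illustration and do not come from the thesis.

```python
# Values copied from Table 3.3: (|D_v|, mean not correctly classified, mean rejected).
table_3_3 = {
    "breast cancer": (77, 27.89, 10.21),
    "diabetes": (300, 105.40, 50.08),
    "heart": (100, 22.82, 8.49),
    "thyroid": (75, 7.57, 5.42),
}

def grbf_risk(n_val, m_nc, m_r):
    """Estimated risk on the validation patterns for which a decision was made."""
    return (m_nc - m_r) / (n_val - m_r)

risks = {name: grbf_risk(*values) for name, values in table_3_3.items()}
for name, risk in risks.items():
    print(f"{name:14s} {risk:.3f}")
```

Rounded to three decimals, the four results agree with the GRBF column of Table 3.2.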

In order to construct a GRBF classifier, an RF classifier must be created first, which is beneficial not only for the feature selection task but also for the visualization of both the data and the GRBF prototypes. As mentioned in Subsection 2.3.1, multidimensional scaling can be accomplished using the RF distance $d_{RF}(x_{m_1}, x_{m_2}) = 1 - \mathrm{prox}(x_{m_1}, x_{m_2})$. Hereby, the inputs $x_m$ from $D$ as well as the prototypes $x_{K,h}$ are projected from $\mathbb{R}^N$ into $\mathbb{R}^2$ so that the RF-neighborhood relations in $\mathbb{R}^N$ are preserved in their two-dimensional representation. The visualizations of the training sets and of the prototypes for the four classification tasks are shown in Figures 3.6, 3.7, 3.8, and 3.9.

[Figure 3.6: Breast cancer dataset]
[Figure 3.7: Diabetes dataset]
[Figure 3.8: Heart dataset]
[Figure 3.9: Thyroid dataset]

Points marked as small stars represent inputs that belong to class $c_1$, i.e., patient healthy, and inputs that belong to class $c_2$, i.e., patient ill, are marked as small circles. In addition to the input vectors, the GRBF prototypes are shown in the figures by bold markers: a bold square for class $c_1$ and a bold circle for class $c_2$. The figures are expressive since they give a hint about the separability of the data. The estimates of the prediction risks in Table 3.2 indicate that there is a relatively large region in the input space where both $p(\mathsf{y} = c_1 \mid \mathsf{x} = x)$ and $p(\mathsf{y} = c_2 \mid \mathsf{x} = x)$ are large for the breast cancer and diabetes datasets, whereas the heart and especially the thyroid datasets can be separated well. This is also apparent from the two-dimensional visualization of the datasets using multidimensional scaling. For the breast cancer and diabetes datasets there is a large region where the inputs corresponding to the two classes overlap. On the other hand, for the thyroid dataset no overlapping occurs. It is interesting to note that in Fig. 3.9 there are two separate regions in the input space which indicate that a patient is not healthy. This suggests that there are two kinds of symptoms that are typical for the thyroid illness.

Figures 3.6, 3.7, 3.8, and 3.9 also help in understanding the GRBF algorithm, since the prototypes can be visualized. As shown in Subsection 2.2.5, the RF can be interpreted as an adaptive nearest neighbor algorithm. As a consequence, in regions of the input space that clearly belong to a class $c_k$, i.e., where the posterior $p(\mathsf{y} = c_k \mid \mathsf{x} = x)$ is large compared to all other posteriors, one or a few prototypes are enough, whereas in regions of the input space where classes overlap the number of prototypes is large. Thus, the location of the prototypes is adaptive: in regions of the input space where a finer resolution is required to discriminate the classes, more prototypes are generated. This adaptive location of prototypes is a strength of the GRBF algorithm presented here.

Unlike other state-of-the-art algorithms that offer low generalization error rates, the proposed GRBF network also allows interpretability. Additionally, the computational complexity of a trained GRBF classifier is low. The RF algorithm requires the evaluation of many CART classifiers. Nearest neighbor algorithms require a comparison with every sample in the training set, and the number of support vectors in an SVM is in general considerably higher than the number $H$ of centers in RF-based GRBF networks. For example, for the above applications the mean number of support vectors $\bar{M}_{SVM}$ versus the number of centers $H$ is given in Table 3.4. In safety-critical applications both interpretability and a low generalization error rate at low computational costs are desirable. The RF-based GRBF algorithm fulfills these requirements.
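The two-dimensional visualization via multidimensional scaling on the RF distance can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for Figures 3.6 to 3.9: it uses classical (Torgerson) MDS, which may differ from the variant referred to in Subsection 2.3.1, and it replaces a trained forest by randomly generated leaf indices (`leaves` is a hypothetical stand-in), with the proximity taken as the fraction of trees in which two samples share a leaf.

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Classical (Torgerson) MDS: embed points so that Euclidean
    distances approximate the entries of the distance matrix."""
    m = dist.shape[0]
    centering = np.eye(m) - np.ones((m, m)) / m
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double centering
    eigval, eigvec = np.linalg.eigh(gram)               # ascending eigenvalues
    top = np.argsort(eigval)[::-1][:dim]                # largest ones first
    scale = np.sqrt(np.clip(eigval[top], 0.0, None))    # guard against negatives
    return eigvec[:, top] * scale

# Toy stand-in for a forest: leaf index of every sample in every tree.
rng = np.random.default_rng(0)
leaves = rng.integers(0, 4, size=(50, 30))              # 50 trees, 30 samples
prox = (leaves[:, :, None] == leaves[:, None, :]).mean(axis=0)
d_rf = 1.0 - prox                                       # RF distance
coords = classical_mds(d_rf)                            # shape (30, 2)
```

With a real forest, `leaves` would hold the terminal node reached by each training input (and each prototype) in each tree, and `coords` would supply the points to plot.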

Dataset         $\bar{M}_{SVM}$   $H$
breast cancer   115.73            7
diabetes        265               17
heart           82.80             7
thyroid         30.02             3

Table 3.4: Number of support vectors vs. $H$

Another application of GRBF classifiers is presented in [BBSH08], where a system for vehicle rear detection in images is described. Using the algorithm introduced in this chapter as the principal component of the classification system, a low error rate was obtained. The error rate was estimated using a large test set of video images from real traffic situations.

Conclusion

This chapter has introduced a new method to compute GRBF classifiers. The algorithm makes use of the RF kernel. Thus, in the first part it has been presented how the RF kernel can be derived from the kernel of CART classifiers, providing some insight into the structure of the RF algorithm. The second part of the chapter has shown how GRBF classifiers can be constructed in such a way that their kernel approximates the RF kernel, so that a good classification performance can be achieved. It has been described how the algorithm can be designed to make the decisions of GRBF classifiers interpretable. The third part has treated the topic of pruning a GRBF classifier in order to achieve a good tradeoff between performance and interpretability. The final part of the chapter has presented experimental results which underline the advantages of the proposed classification technique.


4. Classification of Temporal Data

A time series is a sequence of data points in which each element has a time stamp associated with it. Time series are often generated by recording measurements at regular or irregular intervals. Thus, time series occur in almost every field of practical interest, e.g., medicine, economy, meteorology, or industry. This thesis considers time series whose elements are obtained by sampling a bandlimited continuous signal at regular intervals with a sampling rate that exceeds the Nyquist rate [OS75].

Classical approaches dealing with time series use a model of the process that generated the data. Since the process often cannot be described completely deterministically and the measurements leading to the time series are subject to noise, a statistical framework is adopted, i.e., stochastic models are used to deal with temporal data. Stochastic models have been studied extensively, mainly in four important areas of application [BJ94]: forecasting, estimating the transfer function of a system, analyzing the effects of unusual intervention events on a system, and discrete control systems. Among the most successful techniques applied here are modeling the time series as an autoregressive moving average stochastic process and the use of hidden Markov models.

Due to the abundance of applications dealing with time series, the interest of the machine learning community in this field has grown in recent years. This is motivated by the fact that in many applications it is not possible or too expensive to build accurate stochastic models of the processes generating the time series. In such cases machine learning provides a possibility to handle the temporal data. The areas that have received much interest from machine learning researchers are indexing [FRM94, APWZ95, CF99, YF00, KCPM01], clustering [KP98, KGP01], and classification [Kad99, GD00, Geu01, Geu02, KS05, GW06].

In indexing, given a query time series $s$, a training set containing $M$ other time series $s_m$, $m = 1, \dots, M$, and a dissimilarity measure $d(s, s_m)$, the task is to find, in an efficient way, the time series in the training set that is most similar to $s$. In clustering one looks for a natural grouping of time series in a database, given some similarity measure between time series. Although the topic of time series classification can be treated in a more general form, the literature considers the special case where one aims at assigning a class to an unlabeled time series based on a training set containing labeled time series. This thesis focuses on the extension of the latter task by dealing with multivariate time series classification. In its general form, multivariate time series classification aims at estimating, at each time instance, the class to which the current segment of a series belongs. Solving this problem using machine learning is a challenging task since it does not fit easily into the usual feature-value model described in Chapter 2, where the input vectors stored in the training set are considered to be realizations of i.i.d. random variables. Here, a multivariate time series can be seen as a long sequence of elementary features with adjacent elements being highly correlated, which leads to complications when defining and estimating the risk functional. It is the way the elementary features interact with each other, rather than their individual values, that is important for the classification task. Whereas for time series the elementary features are the successive values of the signals to be classified, the same problem also appears in sequence recognition, e.g., pattern recognition in DNA sequences or character sequences in a written text. A possible point of view for these problems is to interpret each sequence as a single input vector to a classifier, thereby avoiding the problems resulting from the correlation of adjacent elements. This approach is unsatisfactory due to the curse of dimensionality and the complications related to the change of the output space when a multivariate time series has segments belonging to different classes.

The technical literature on time series classification based on machine learning procedures only considers the restricted case where a time series belongs to a single class. The more complicated classification task, with some segments of a multivariate time series belonging to one class and other segments belonging to other classes, is not analyzed in any work known to the author. This type of classification is only mentioned in [KS05] and denoted there as “strong temporal classification”.

Another categorization of time series classifiers is based on causality. One speaks of off-line classifiers when at time stamp $n$ the whole multivariate time series can be used for a decision. On the other hand, when at time stamp $n$ only the multivariate time series up to this point is known, one speaks of on-line classifiers. An example for the use of on-line classifiers is the change detection task, where at time instance $n$ it must be decided whether a class change occurred based only on the information up to this point. In Chapter 5 an application for on-line multivariate time series classification is presented.

This chapter describes the time series classification task, presents the difficulties related to the topic, and introduces the basics required for the methods that are introduced in this thesis to deal with on-line classification of temporal data. The procedures are presented in Chapters 6 and 7. After introducing the notation and some possibilities to represent time series in the next section, the difficulties appearing in time series classification are described in Section 4.2. Then some classical approaches to handle time series are reviewed in Section 4.3. Since time series classification can be accomplished based on distance measures, these are explored in Section 4.4. The topic of feature generation, including not only usual preprocessing but also the concepts of event-based features and prototypes, is introduced in Section 4.5.

4.1 Time Series Representation and Dimensionality Reduction

As in many other problems, a suitable choice of representation can simplify the multivariate time series classification task significantly. Since the best-suited representation depends on the specific classification task, this section presents some possibilities to change the representation of time series. In the feature-value framework of Chapter 2 the input $x$ and the output $y$ of a classifier are modeled by random variables, and the elements of the training set $D$ are assumed to be realizations of i.i.d. random variables. In multivariate time series classification the smallest statistically independent entity is a scenario consisting of one or more time series. A scenario can be modeled as a finite-dimensional multivariate random process $S$ consisting of $L$ finite-dimensional univariate random processes $s_\ell$, $\ell = 1, \dots, L$, each of length $n_{\mathrm{end}}$:

\[ S = [s_1, s_2, \dots, s_L]^T \in \mathbb{R}^{L \times n_{\mathrm{end}}}, \quad \text{with} \quad s_\ell = [s_\ell[0], s_\ell[1], \dots, s_\ell[n_{\mathrm{end}} - 1]]^T \in \mathbb{R}^{n_{\mathrm{end}}}. \tag{4.1} \]

In this thesis it is assumed that any realization of $s_\ell$, $\ell = 1, \dots, L$, is a square integrable function. The scenario $S$ can also be represented using the random variables $s[n] \in \mathbb{R}^L$ for each time stamp $n = 0, \dots, (n_{\mathrm{end}} - 1)$:

\[ S = [s[0], \dots, s[n_{\mathrm{end}} - 1]], \quad \text{with} \quad s[n] = [s_1[n], \dots, s_L[n]]^T \in \mathbb{R}^L. \tag{4.2} \]

Thus, a scenario contains $L \times n_{\mathrm{end}}$ random variables whose statistical interdependence is normally not known. The corresponding target vector is the random process $y \in \{c_1, \dots, c_K\}^{1 \times n_{\mathrm{end}}}$, which contains a class label for the scenario $S$ at each time stamp:

\[ y = [y[0], \dots, y[n_{\mathrm{end}} - 1]]. \tag{4.3} \]

The training set $D_S$ for this type of classification task is built from $M$ scenarios and the corresponding target values:

\[ D_S = \{(S_1, y_1), \dots, (S_M, y_M)\}. \tag{4.4} \]
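To make the notation of (4.1) to (4.4) concrete, a scenario and a training set might be held in memory as sketched below. The array layout, the random data, and the helper `make_scenario` are assumptions made for illustration only.

```python
import numpy as np

L, n_end, classes = 2, 8, ("c1", "c2", "c3", "c4")

def make_scenario(rng):
    """Return one (S, y) pair: S has shape (L, n_end), y has length n_end."""
    S = rng.normal(size=(L, n_end))                  # row l holds the series s_l
    y = np.repeat(classes, n_end // len(classes))    # one label per time stamp
    return S, y

rng = np.random.default_rng(0)
D_S = [make_scenario(rng) for _ in range(3)]         # training set with M = 3

S, y = D_S[0]
s_at_3 = S[:, 3]   # s[n] for n = 3: the L values observed at that time stamp
s_1 = S[0, :]      # s_1: the first univariate series (row index is 0-based)
```

Rows of `S` correspond to the view in (4.1) and columns to the view in (4.2); `y` carries one label per column, as in (4.3).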

As an example, Fig. 4.1 shows a scenario consisting of $L = 2$ time series. The scenario is divided into 4 segments with the labels $c_1$, $c_2$, $c_3$, and $c_4$. All targets $y[n]$ in one segment belong to the same class.

[Figure 4.1: Multivariate time series. Two panels show $s_1[n]$ and $s_2[n]$ over $n = 0, \dots, n_{\mathrm{end}} - 1$; the time axis is divided into four segments labeled $c_1$, $c_2$, $c_3$, $c_4$.]

An approach to describe time series is to find a suitable model for the underlying stochastic process. These models can be Hidden Markov Models (HMM) or Auto-Regressive Moving Average (ARMA) models. Both will be described in more detail in Section 4.3. Such models are appropriate for stationary random processes, i.e., the joint probability density functions $p(s[n_i], s[n_j])$ depend only on the difference

\[ m = n_i - n_j, \qquad m \in \{-n_{\mathrm{end}} + 1, \dots, n_{\mathrm{end}} - 1\}, \tag{4.5} \]

but not on the location in time of $n_i$ and $n_j$. This type of stationarity is referred to as strong stationarity. Weak or second-order stationary random processes are not as strict and only consider the stationarity of first and second order moments. Here it is enough if the expectation $E_{s[n]}\{s[n]\}$ does not depend on $n$ and if the cross-covariance matrix

\[ C_S[n_i, n_j] = E_{s[n_i], s[n_j]}\left\{ \left( s[n_i] - E_{s[n_i]}\{s[n_i]\} \right) \left( s[n_j] - E_{s[n_j]}\{s[n_j]\} \right)^T \right\} \in \mathbb{R}^{L \times L} \tag{4.6} \]

reduces to [Rei03]

\[ C_S[m] = E_{s[n], s[n+m]}\left\{ \left( s[n] - E_{s[n]}\{s[n]\} \right) \left( s[n+m] - E_{s[n+m]}\{s[n+m]\} \right)^T \right\} \in \mathbb{R}^{L \times L} \tag{4.7} \]

\[ \phantom{C_S[m]} = C_S^T[-m], \tag{4.8} \]

with $n, n + m \in \{0, \dots, n_{\mathrm{end}} - 1\}$. For a weakly stationary random process the cross-covariance matrix provides a useful summary of information on aspects of the dynamic interrelations among the components of the process. Since the cross-covariance matrix only depends on the index $m$, stationary random processes, unlike non-stationary processes, can be represented by their spectral density matrix. The $(h, i)$-th element of the spectral density matrix $D_S[k]$, $k = -n_{\mathrm{end}} + 1, \dots, n_{\mathrm{end}} - 1$, is the Discrete Fourier Transform of the $(h, i)$-th element of $C_S[m]$ (footnote 1):

\[ [D_S[k]]_{h,i} = \sum_{m = -n_{\mathrm{end}} + 1}^{n_{\mathrm{end}} - 1} [C_S[m]]_{h,i} \, e^{-j \frac{2\pi}{2 n_{\mathrm{end}} - 1} m k}. \tag{4.9} \]
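The quantities in (4.7) to (4.9) can be estimated numerically from a single realization, as in the hedged sketch below. Replacing the expectations by time averages is an assumption made here (it presumes a weakly stationary, ergodic process) and is not prescribed by the text; `sample_cross_cov` and `spectral_density` are illustrative names.

```python
import numpy as np

def sample_cross_cov(S, m):
    """Estimate C_S[m] from one realization S of shape (L, n_end) by
    averaging (s[n] - mean)(s[n+m] - mean)^T over all valid n."""
    L, n_end = S.shape
    X = S - S.mean(axis=1, keepdims=True)     # time average replaces E{.}
    if m >= 0:
        a, b = X[:, :n_end - m], X[:, m:]
    else:
        a, b = X[:, -m:], X[:, :n_end + m]
    return a @ b.T / n_end

def spectral_density(S):
    """Evaluate (4.9) for k = -n_end+1, ..., n_end-1 (first array axis)."""
    L, n_end = S.shape
    lags = np.arange(-n_end + 1, n_end)
    C = np.stack([sample_cross_cov(S, m) for m in lags])    # (2*n_end-1, L, L)
    phase = np.exp(-2j * np.pi * np.outer(lags, lags) / (2 * n_end - 1))
    return np.einsum("km,mhi->khi", phase, C)

rng = np.random.default_rng(0)
S = rng.normal(size=(2, 16))
D = spectral_density(S)
```

The estimator satisfies the symmetry (4.8) exactly, which in turn makes every `D[k]` a Hermitian matrix.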

For stationary random processes not only a well-developed time-domain theory but also spectral methods can be applied in order to analyze and model problems. These models can be used for classification by computing one or more models for each class.

Many applications produce multivariate time series that cannot be modeled by stationary processes, so non-stationarity must be taken into account. Since in general it is very hard to obtain statistical knowledge about non-stationary processes (footnote 2), the methods applied here are often specialized solutions for the given task, requiring a lot of domain knowledge. One of the most frequent approaches applied for non-stationary processes is to change the representation of the time series and to use this new representation in order to solve the required task. Because in multivariate time series adjacent random variables, e.g., $s[n]$ and $s[n + 1]$, are highly correlated, it is often possible to apply dimensionality reduction techniques and thereby generate a new representation of the time series. Depending on the problem to be solved, the reduced-dimensionality representation can be used either to improve the performance or to decrease the computational load.

The performance can often be improved by reducing the time series up to time instance $n$ to a number $N_n$, $N_n \ll n_{\mathrm{end}}$, of features, which accumulate the relevant information needed to compute $y[n]$. These $N_n$ features describing the time series will be called high level features in this thesis and are stored in the vector $x[n] \in \mathbb{R}^{N_n}$. The mappings

\[ f_e[n] : \mathbb{R}^{L \times (n+1)} \to \mathbb{R}^{N_n}, \qquad S^{[0,n]} \mapsto x[n], \tag{4.10} \]

with $S^{[0,n]}$ denoting the elementary feature representation of $S$ up to time instance $n$ and $x[n]$ the corresponding high level representation, implement the feature extraction process and will be examined in more detail in Section 4.5. In order to apply machine learning techniques, the dimensionality $N_n$ of the high level representation is often chosen to be constant for all time instances $n$:

\[ N := N_1 = \dots = N_{n_{\mathrm{end}}}. \tag{4.11} \]

Thereby, an input space $\mathbb{R}^N$ is created which can be used by machine learning algorithms.

Machine learning methods which are based on the similarity of scenarios in the elementary feature representation usually use dimensionality reduction techniques because of the computational load. For example, in classification tasks where a scenario belongs to one class, i.e., $y[n] = c_k$, $n = 0, \dots, (n_{\mathrm{end}} - 1)$, and the whole multivariate time series is available when one aims to determine $c_k$, state-of-the-art off-line classification algorithms rely on similarity measures computed in the elementary feature representation. To classify a new scenario $S$, the nearest neighbor algorithm is frequently applied, which labels the new scenario with the class of the multivariate time series from the training set $D_S$ that has the highest similarity to $S$ [RK04]. A drawback of all methods which have to examine the whole training set before computing an output, e.g., the nearest neighbor algorithm, is the huge complexity. For such techniques dimensionality reduction methods, which transform a scenario $S$ into a new representation $S_{\mathrm{red}}$, are highly important. The idea is to reduce the dimensionality of $S$ and of every scenario in $D_S$ and to compute the similarities with these approximations. If the approximations allow lower bounding, i.e.,

\[ d(S_{\mathrm{red}}, S_{m,\mathrm{red}}) \le d(S, S_m), \tag{4.12} \]

where $d(\cdot, \cdot)$ is the distance measure used to compute the similarities, the number of distance computations in the elementary feature space, and thereby the total computational complexity, can be reduced dramatically without any loss of accuracy due to the dimensionality reduction [AFS93] (footnote 3). If, in the reduced-dimensionality space, a scenario $S_{m,\mathrm{red}}$ is further away from $S_{\mathrm{red}}$ than a given threshold $\gamma_{\mathrm{dist}}$, then, due to $\gamma_{\mathrm{dist}} < d(S_{\mathrm{red}}, S_{m,\mathrm{red}}) \le d(S, S_m)$, its distance in the elementary feature space also exceeds the threshold, and it need not be considered further as a candidate for the nearest neighbor. Therefore, (4.12) is the basis of indexing methods, i.e., organizing the data in such a way that a brute force scan of all time series in the training set in their elementary feature representation is avoided.

Footnote 1: The $(h, i)$-th element of $C_S[m]$ can be computed from the $(h, i)$-th element of $D_S[k]$ using the Inverse Discrete Fourier Transform:

\[ [C_S[m]]_{h,i} = \frac{1}{2 n_{\mathrm{end}} - 1} \sum_{k = -n_{\mathrm{end}} + 1}^{n_{\mathrm{end}} - 1} [D_S[k]]_{h,i} \, e^{+j \frac{2\pi}{2 n_{\mathrm{end}} - 1} k m}. \]

Note that the orthogonal basis used here for the set of $(2 n_{\mathrm{end}} - 1)$-dimensional complex vectors is $\exp\left(j \frac{2\pi}{2 n_{\mathrm{end}} - 1} k m\right)$ with $m = -n_{\mathrm{end}} + 1, \dots, n_{\mathrm{end}} - 1$ and $k = -n_{\mathrm{end}} + 1, \dots, n_{\mathrm{end}} - 1$:

\[ \sum_{m = -n_{\mathrm{end}} + 1}^{n_{\mathrm{end}} - 1} e^{j \frac{2\pi}{2 n_{\mathrm{end}} - 1} m k} \, e^{-j \frac{2\pi}{2 n_{\mathrm{end}} - 1} m k'} = (2 n_{\mathrm{end}} - 1) \, \delta(k, k'). \]

Footnote 2: A (non-stationary) random process is stochastically known perfectly if all joint probability density functions of the elements in the process are known.

Footnote 3: Lower bounding is the basis of the GEneric Multimedia INdexIng technique (GEMINI).
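The pruning argument based on (4.12) can be illustrated with a small filter-and-refine nearest neighbor search. This is a sketch under simplifying assumptions rather than the GEMINI implementation: the series are univariate, the distance is Euclidean, and the reduction `reduce_dim` (a hypothetical helper) simply keeps the first few samples, which lower-bounds the Euclidean distance trivially.

```python
import numpy as np

def reduce_dim(series, k=4):
    """Hypothetical reduction: keep only the first k samples.
    Dropping coordinates can only shrink a Euclidean distance,
    so the lower-bounding condition (4.12) holds."""
    return series[..., :k]

def nearest_neighbor(query, train):
    """Exact 1NN search that skips candidates ruled out by the bound."""
    lower = np.linalg.norm(reduce_dim(train) - reduce_dim(query), axis=1)
    best_idx, best_dist, n_full = -1, np.inf, 0
    for idx in np.argsort(lower):             # most promising candidates first
        if lower[idx] >= best_dist:           # no remaining candidate can win
            break
        dist = np.linalg.norm(train[idx] - query)   # full-length distance
        n_full += 1
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, n_full

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 64))            # M = 200 series of length 64
query = rng.normal(size=64)
idx, dist, n_full = nearest_neighbor(query, train)
```

The result is the same as that of a brute force scan, while `n_full` counts how many full-length distances were actually needed; how much is saved depends on how tight the lower bound is for the data at hand.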


In the following, some general techniques to represent and approximate time series will be reviewed. These representations lead to state-of-the-art dimensionality reduction techniques, whereas in Section 4.5 rather problem-specific reduction techniques for the classification of non-stationary random processes will be described.

The Discrete Fourier Transform (DFT) is used to analyze the spectral characteristics of time series. The main idea is to represent a time series $s_\ell = [s_\ell[0], s_\ell[1], \dots, s_\ell[n_{\mathrm{end}} - 1]]^T$ as a linear combination of the orthogonal basis $\exp(2\pi j \, k n / n_{\mathrm{end}})$, with $k, n = 0, \dots, n_{\mathrm{end}} - 1$, of the set of $n_{\mathrm{end}}$-dimensional complex vectors [OS75]:

\[ s_\ell[n] = \frac{1}{n_{\mathrm{end}}} \sum_{k=0}^{n_{\mathrm{end}} - 1} \mathrm{DFT}\{s_\ell\}[k] \, e^{\frac{2\pi j}{n_{\mathrm{end}}} k n} \in \mathbb{R}, \tag{4.13} \]

where $\mathrm{DFT}\{s_\ell\}[k]$ is the $k$-th element, $k = 0, \dots, (n_{\mathrm{end}} - 1)$, of the DFT of $s_\ell$:

\[ \mathrm{DFT}\{s_\ell\}[k] = \sum_{n=0}^{n_{\mathrm{end}} - 1} s_\ell[n] \, e^{-\frac{2\pi j}{n_{\mathrm{end}}} k n} \in \mathbb{C}. \tag{4.14} \]
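The transform pair (4.13) and (4.14) can be checked numerically with the sketch below. The direct matrix evaluation is chosen for readability only. The last lines additionally illustrate lower bounding in the sense of (4.12) for a subset of coefficients; the scaling by $1/\sqrt{n_{\mathrm{end}}}$ follows from Parseval's theorem for the unnormalized transform (4.14), and the choice of the index set `keep` is an arbitrary assumption.

```python
import numpy as np

def dft(s):
    """Direct evaluation of (4.14); O(n_end^2), for illustration only."""
    n_end = len(s)
    n = np.arange(n_end)
    return np.exp(-2j * np.pi * np.outer(n, n) / n_end) @ s

def idft(coeffs):
    """Direct evaluation of (4.13)."""
    n_end = len(coeffs)
    n = np.arange(n_end)
    return np.exp(2j * np.pi * np.outer(n, n) / n_end) @ coeffs / n_end

rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=32), rng.normal(size=32)

keep = [0, 1, 2, 3]                           # same coefficients for both series
full = np.linalg.norm(s1 - s2)
reduced = np.linalg.norm((dft(s1) - dft(s2))[keep]) / np.sqrt(len(s1))
```

Here `dft` agrees with `numpy.fft.fft`, which uses the same sign and normalization convention, and `reduced` never exceeds `full`.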

To obtain a reduced-dimensionality representation of $S$ one can use a truncated version of $\mathrm{DFT}\{S\} \in \mathbb{C}^{L \times n_{\mathrm{end}}}$, which is the matrix obtained by performing the DFT for each time series of $S$. Due to Parseval's theorem, the Euclidean distance between two DFT coefficient vectors which are truncated at the same frequencies is, after scaling by $1/\sqrt{n_{\mathrm{end}}}$, always less than or equal to the Euclidean distance of the time series from which the transforms were computed (footnote 4). When using the Euclidean distance, the DFT therefore allows lower bounding and is a frequently used technique to reduce complexity in methods that require the computation of Euclidean distances between time series in the elementary feature representation [AFS93, FRM94]. An important question in this context is which coefficients should be truncated. A frequently adopted practice is to discard the frequency components where the absolute values of the DFT coefficients are small, keeping only the dominant frequencies in the signal.

Footnote 4: With $s_\ell$ being a discrete time sequence and $\mathrm{DFT}\{s_\ell\}$ its Discrete Fourier Transform, Parseval's theorem states that

\[ \sum_{n=0}^{n_{\mathrm{end}} - 1} |s_\ell[n]|^2 = \frac{1}{n_{\mathrm{end}}} \sum_{k=0}^{n_{\mathrm{end}} - 1} |\mathrm{DFT}\{s_\ell\}[k]|^2. \]

Parseval's theorem implies that the DFT also preserves the Euclidean distance between two signals $s_{\ell_1}$ and $s_{\ell_2}$ [AFS93]:

\[ \sum_{n=0}^{n_{\mathrm{end}} - 1} |s_{\ell_1}[n] - s_{\ell_2}[n]|^2 = \frac{1}{n_{\mathrm{end}}} \sum_{k=0}^{n_{\mathrm{end}} - 1} |\mathrm{DFT}\{s_{\ell_1}\}[k] - \mathrm{DFT}\{s_{\ell_2}\}[k]|^2, \]

since $\mathrm{DFT}\{s_{\ell_1} - s_{\ell_2}\} = \mathrm{DFT}\{s_{\ell_1}\} - \mathrm{DFT}\{s_{\ell_2}\}$.

The Discrete Wavelet Transform (DWT) uses basis functions of finite duration in time in order to represent a signal in the time-frequency or time-scale description. Whereas the DFT transforms a series completely into the frequency domain, the discrete wavelet transform allows for a time-frequency localization. Hereby, the compromise between time and frequency resolution is dictated by Heisenberg's uncertainty principle [Fla99]. The main idea is to construct the basis for the new representation recursively, starting with the so-called mother wavelet $\Psi[n]$, by shifting, for a good resolution in the time domain, and by dilation, for a good resolution in the frequency domain. Denoting with $m$ the discrete shifting index and with $k$ the discrete dilation index, the $(m, k)$-th basis function is

\[ \Psi_{m,k}[n] = 2^{k/2} \, \Psi[2^k n - m], \qquad k, m \in \mathbb{Z}. \tag{4.15} \]

Using the new basis, the $n$-th element of the discrete time series $s_\ell$ can be represented as

\[ s_\ell[n] = \sum_{m,k} \mathrm{DWT}\{s_\ell\}[m, k] \, \Psi[2^k n - m]. \tag{4.16} \]
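As a concrete instance of such a decomposition, the sketch below computes an orthonormal Haar transform by repeated pairwise averaging and differencing. The Haar wavelet and this particular implementation are assumptions made for illustration; the ordering of the output coefficients is a convention chosen here and is not meant to reproduce the $(m, k)$ indexing of (4.15). Because the transform is orthonormal, it preserves Euclidean distances, so a distance computed on a subset of coefficients lower-bounds the distance between the original series in the sense of (4.12).

```python
import numpy as np

def haar_dwt(s):
    """Orthonormal Haar transform of a series whose length is a power of two.
    Output order: overall average first, then details from coarse to fine."""
    approx = np.asarray(s, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))   # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)          # coarser approximation
    return np.concatenate([approx] + details[::-1])

rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=16), rng.normal(size=16)
w1, w2 = haar_dwt(s1), haar_dwt(s2)

full = np.linalg.norm(s1 - s2)
truncated = np.linalg.norm(w1[:4] - w2[:4])           # keep 4 coarsest terms
```

Keeping the coarsest coefficients is only one possible truncation rule; keeping the largest-magnitude coefficients is another.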

In (4.16), $\mathrm{DWT}\{s_\ell\}[m, k]$ denotes the $(m, k)$-th wavelet coefficient, which can be computed efficiently using filterbanks [Rie97]. The implementation of the DWT by filterbanks will be presented in Subsection 4.5.2. The range of $m$ and $k$ depends on the desired decomposition depth and is determined by the number of stages in the filterbank. To obtain a reduced-dimensionality representation of $S$ one can use a truncated version of $\mathrm{DWT}\{S\} \in \mathbb{R}^{L \times n_{\mathrm{end}}}$, which is the matrix obtained by performing the DWT for each univariate time series $s_\ell$ of $S$. Since the wavelet transform gives a time-frequency localization of the signal, most of the energy of the signal can be represented by only a few wavelet coefficients, making the DWT an ideal technique for dimensionality reduction. Although there are many different types of wavelets, in machine learning the Haar wavelet [Haa10] is usually used, since it has been proven in [CF99] that this type of transformation preserves the Euclidean distance and therefore guarantees lower bounding after truncating the wavelet coefficients.

The Singular Value Decomposition (SVD) interprets a discrete time series $s_\ell \in \mathbb{R}^{n_{\mathrm{end}}}$ as a point in the $n_{\mathrm{end}}$-dimensional space. Given $M$ scenarios $S_m$, one can extract from each scenario the $\ell$-th time series and interpret it as a point in the $n_{\mathrm{end}}$-dimensional space. This way a matrix $A_\ell \in \mathbb{R}^{M \times n_{\mathrm{end}}}$ is obtained for the $\ell$-th time series of the multivariate scenarios. With $R$ being the rank of $A_\ell$, its reduced SVD is

\[ A_\ell = U_\ell \Sigma_\ell V_\ell^T, \tag{4.17} \]

where $U_\ell = [u_{\ell,1}, \dots, u_{\ell,R}] \in \mathbb{R}^{M \times R}$, $\Sigma_\ell = \mathrm{diag}\{\sigma_{\ell,1}, \dots, \sigma_{\ell,R}\} \in \mathbb{R}^{R \times R}$, with $\sigma_{\ell,1} \ge \sigma_{\ell,2} \ge \dots \ge \sigma_{\ell,R} > 0$, and $V_\ell = [v_{\ell,1}, \dots, v_{\ell,R}] \in \mathbb{R}^{n_{\mathrm{end}} \times R}$. With this representation it is clear that the $m$-th time series $s_{\ell,m}$, whose transpose $s_{\ell,m}^T$ is stored in the $m$-th row of $A_\ell$, is a linear combination of the $R$ column vectors $v_{\ell,r} \in \mathbb{R}^{n_{\mathrm{end}}}$, $r = 1, \dots, R$, of $V_\ell$. The vectors $v_{\ell,r}$ are often called eigenwaves, and therefore the time series $s_{\ell,m} \in \mathbb{R}^{n_{\mathrm{end}}}$ can be interpreted as a linear combination of eigenwaves:

\[ s_{\ell,m} = \sum_{r=1}^{R} [U_\ell]_{m,r} [\Sigma_\ell]_{r,r} \, v_{\ell,r}. \tag{4.18} \]

It should be noted that the eigenwaves used by the SVD are data dependent, unlike the DFT or the DWT, where the basis used to represent the time series is not data dependent. To use the SVD for dimensionality reduction, the idea is to choose only the $k$ largest singular values $\sigma_{\ell,1}, \dots, \sigma_{\ell,k}$ with their corresponding vectors $u_{\ell,1}, \dots, u_{\ell,k}$, $v_{\ell,1}, \dots, v_{\ell,k}$ and to approximate $A_\ell$ by

\[ A_{\ell,\mathrm{red}} = U_{\ell,\mathrm{red}} \Sigma_{\ell,\mathrm{red}} V_{\ell,\mathrm{red}}^T \tag{4.19} \]

and a univariate time series $s_\ell$ by a linear combination of eigenwaves

\[ s_{\ell,\mathrm{red}} = \sum_{r=1}^{k} w_{\mathrm{SVD},r} \, v_{\ell,r} = V_{\ell,\mathrm{red}} \, w_{\mathrm{SVD}}, \tag{4.20} \]

with $U_{\ell,\mathrm{red}} = [u_{\ell,1}, \dots, u_{\ell,k}] \in \mathbb{R}^{M \times k}$, $\Sigma_{\ell,\mathrm{red}} = \mathrm{diag}\{\sigma_{\ell,1}, \dots, \sigma_{\ell,k}\} \in \mathbb{R}^{k \times k}$, $V_{\ell,\mathrm{red}} = [v_{\ell,1}, \dots, v_{\ell,k}] \in \mathbb{R}^{n_{\mathrm{end}} \times k}$, and $w_{\mathrm{SVD}} \in \mathbb{R}^k$. The SVD coefficients $w_{\mathrm{SVD}}$ of $s_\ell$ based on $A_{\ell,\mathrm{red}}$ are computed as

\[ w_{\mathrm{SVD}} = V_{\ell,\mathrm{red}}^T \, s_\ell. \tag{4.21} \]
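The steps (4.19) to (4.21) can be traced on synthetic data as in the sketch below. The toy matrix `A`, built from three smooth components plus a little noise, and the choice $k = 3$ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_end, k = 20, 32, 3

# Toy data: M series built from three smooth components plus a little noise.
t = np.linspace(0.0, 1.0, n_end)
components = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t])
A = rng.normal(size=(M, 3)) @ components + 0.01 * rng.normal(size=(M, n_end))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(sigma) Vt
V_red = Vt[:k].T                                        # eigenwaves, (n_end, k)
A_red = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k]          # (4.19)

s = A[0]                      # one series from the training matrix
w_svd = V_red.T @ s           # (4.21): k coefficients
s_red = V_red @ w_svd         # (4.20): reconstruction from k eigenwaves
```

For a series that is a row of `A`, the reconstruction `s_red` coincides with the corresponding row of `A_red`; a new series is simply projected onto the same eigenwaves.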

The SVD is the optimal linear dimensionality reduction technique in the sense that, among all approximations of rank $k$, the total approximation error $\|A_\ell - A_{\ell,\mathrm{red}}\|^2$ is minimized (footnote 5) [SZ04]. To obtain a reduced-dimensionality representation of a scenario $S$ one can compute the SVD coefficients for each of the $L$ univariate time series contained in $S$ based on the matrices $A_{\ell,\mathrm{red}}$, $\ell = 1, \dots, L$.

The Piecewise Linear Approximation (PLA) represents a time series by a series of linear segments. PLA belongs to the group of segmentation algorithms, which divide a time series into a number of segments and find a different representation for each segment. The trade-off between accuracy and compactness of representation, which is common to all dimensionality reduction techniques, is in this case determined by the number of segments. Three methods can be used to deal with this trade-off [KCHP03]: sliding window, top-down, or bottom-up. In the sliding window approach a segment is grown until the error in this segment between the data in the original representation and in the approximation reaches a predefined bound. The process repeats with the next value of the time series not included in the newly approximated segment. In the top-down approaches the time series is recursively partitioned until a stopping criterion is met. In the bottom-up approaches, starting with the finest possible approximation, segments are merged until a stopping criterion is met. Denoting the segment of the time series $s_\ell$ from time stamp $n_1$ up to time stamp $n_2$ by $s_\ell^{[n_1, n_2]}$, in PLA this segment is approximated by the linear model

\[ s_{\ell,\mathrm{red}}^{[n_1, n_2]}[n] = a_{(n_1, n_2)} \, n + b_{(n_1, n_2)}, \qquad n_1 \le n \le n_2. \tag{4.22} \]

where the coefficients $a^{(n_1,n_2)}$ and $b^{(n_1,n_2)}$ are determined using the least squares method [MYAU01]. This way a time series $s_\ell$ can be approximated by a series of linear segments. In order to obtain a reduced dimensionality representation of a scenario $S$ one can use PLA for each time series in $S$.

Footnote 5: The matrix norm of $A_\ell - A_{\ell,\mathrm{red}}$ is $\|A_\ell - A_{\ell,\mathrm{red}}\| = \sqrt{\sum_{m=1}^{M} \sum_{n=0}^{n_{\mathrm{end}}-1} \left([A_\ell]_{m,n} - [A_{\ell,\mathrm{red}}]_{m,n}\right)^2}$.

The Piecewise Aggregate Approximation (PAA) divides a time series $s_\ell$ of length $n_{\mathrm{end}}$ into $n_{\mathrm{red}}$ equal-sized segments and describes each segment by its mean value, thereby representing $s_\ell$ in the $n_{\mathrm{red}}$-dimensional space. Usually $n_{\mathrm{red}}$ is chosen to be a factor of $n_{\mathrm{end}}$. Then the $i$-th element of the reduced dimensionality representation $s_{\ell,\mathrm{red}}$ is computed according to

$$ s_{\ell,\mathrm{red}}[i] = \frac{n_{\mathrm{red}}}{n_{\mathrm{end}}} \sum_{n = \frac{n_{\mathrm{end}}}{n_{\mathrm{red}}}(i-1)+1}^{\frac{n_{\mathrm{end}}}{n_{\mathrm{red}}}\, i} s_\ell[n], \qquad i = 1, \ldots, n_{\mathrm{red}}. \tag{4.23} $$
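The segment means in (4.23) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `paa` is hypothetical, the list is indexed from 0 rather than from 1 as in (4.23), and $n_{\mathrm{red}}$ is required to be a factor of $n_{\mathrm{end}}$.

```python
def paa(series, n_red):
    """Piecewise Aggregate Approximation: mean value of n_red equal-sized segments."""
    n_end = len(series)
    if n_end % n_red != 0:
        # Assumption of this sketch: n_red is a factor of n_end.
        raise ValueError("n_red must be a factor of n_end")
    seg_len = n_end // n_red
    # Segment i (0-based) covers the samples i*seg_len ... (i+1)*seg_len - 1.
    return [sum(series[i * seg_len:(i + 1) * seg_len]) / seg_len
            for i in range(n_red)]


if __name__ == "__main__":
    s = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0]
    print(paa(s, 4))  # [2.0, 2.0, 6.0, 2.0]
```

Each output value replaces `seg_len` input values, so the compression ratio is fixed in advance by the choice of `n_red`.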

PAA was introduced independently in two publications [KCPM00, YF00] and has since become a very popular technique in time series indexing. This is due to the fact that the distance measure

$$ d_{\mathrm{PAA}}(s_{1,\mathrm{red}}, s_{2,\mathrm{red}}) = \sqrt{\frac{n_{\mathrm{end}}}{n_{\mathrm{red}}} \sum_{i=1}^{n_{\mathrm{red}}} \left(s_{1,\mathrm{red}}[i] - s_{2,\mathrm{red}}[i]\right)^2} \tag{4.24} $$

is provably [KCPM00] less than or equal to the Euclidean distance between $s_1$ and $s_2$ in the elementary feature space. In order to obtain a reduced dimensionality representation of a scenario $S$ one can use PAA for each time series in $S$.

The Adaptive Piecewise Constant Approximation (APCA) method is a generalization of PAA that allows the piecewise constant segments to have arbitrary lengths. Thus, a segment in the elementary feature space needs two coefficients in the reduced space: one for its mean value and one for its length. In order to determine the reduced dimensionality representation of a time series $s_\ell$, the authors who introduced APCA [KCPM01] suggest first performing a wavelet compression of $s_\ell$, which can then be converted into the APCA representation. Similar to the DWT dimensionality reduction technique, APCA exploits the fact that time series have little detail in some regions and high detail in others: by adapting the representation to the data it achieves a better approximation. Whereas APCA interprets a time series as a linear combination of non-overlapping “box” basis functions, the basis functions $\Psi_{m,k}$ of the wavelet representation overlap. A new distance measure for the APCA representation [KCPM01] lower bounds the Euclidean distance in the elementary feature space, making APCA an attractive representation for time series indexing. Details about the APCA distance and its use in indexing can be found in [KCPM01]. To obtain a reduced dimensionality representation of a scenario $S$ one can use APCA for each time series in $S$.
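Returning to the PAA distance (4.24), its lower-bounding property can be checked numerically on a small example. The sketch below is illustrative only: the helper names `paa`, `d_paa`, and `euclidean` as well as the two test series are assumptions, not part of the original text.

```python
import math


def paa(series, n_red):
    """Mean value of n_red equal-sized segments (n_red must divide len(series))."""
    seg_len = len(series) // n_red
    return [sum(series[i * seg_len:(i + 1) * seg_len]) / seg_len
            for i in range(n_red)]


def d_paa(a_red, b_red, n_end):
    """Distance (4.24) between two PAA representations of series of length n_end."""
    n_red = len(a_red)
    return math.sqrt(n_end / n_red * sum((x - y) ** 2 for x, y in zip(a_red, b_red)))


def euclidean(a, b):
    """Euclidean distance in the elementary feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    s1 = [1.0, 3.0, 2.0, 2.0, 5.0, 7.0, 0.0, 4.0]
    s2 = [0.0, 0.0, 2.0, 4.0, 5.0, 5.0, 1.0, 1.0]
    low = d_paa(paa(s1, 4), paa(s2, 4), len(s1))
    print(low <= euclidean(s1, s2))  # True
```

In an indexing setting this property means that candidates can be discarded using the cheap reduced-space distance without ever missing a true match.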
The Symbolic Aggregate approXimation (SAX) changes the representation of a time series $s_\ell$ into a string of symbols which are elements of an alphabet $A_{\mathrm{SAX}} = \{a_{\mathrm{SAX},1}, \ldots, a_{\mathrm{SAX},Q_{\mathrm{SAX}}}\}$ of size $Q_{\mathrm{SAX}}$. Although there have been many attempts to tokenize time series, many of them summarized in [DFT02], SAX is the first symbolic representation of time series that allows lower bounding and can therefore be used in time series indexing. The method was introduced in [LKLC03] and uses PAA as an intermediate step between the elementary space and the symbolic representation. Dimensionality reduction is obtained by applying PAA to the time series; the values $s_{\ell,\mathrm{red}}[i]$, $i = 1, \ldots, n_{\mathrm{red}}$, from (4.23) are then mapped to symbols $a_{\mathrm{SAX},q}$, $q = 1, \ldots, Q_{\mathrm{SAX}}$, of the alphabet $A_{\mathrm{SAX}}$:

$$ s_{\ell,\mathrm{red}}[i] \mapsto a_{\mathrm{SAX},q}, \qquad \text{if } \gamma_{\mathrm{SAX},q-1} \le s_{\ell,\mathrm{red}}[i] < \gamma_{\mathrm{SAX},q}, \tag{4.25} $$

where the quantization thresholds $\gamma_{\mathrm{SAX},q}$ are determined by taking the histogram of the time series in the PAA representation into account, so that the resulting symbols are equiprobable. This way a time series $s_\ell$ can be transformed into a word $w_{\ell,\mathrm{SAX}}$ consisting of $n_{\mathrm{red}}$ symbols of the alphabet $A_{\mathrm{SAX}}$. In [LKLC03] a distance measure for words $w_{\ell,\mathrm{SAX}}$ is introduced which lower bounds the PAA distance from (4.24) and therefore also the Euclidean distance in the elementary feature space. To obtain a reduced dimensionality representation of a scenario $S$ one can use the SAX transformation for each time series $s_\ell$, $\ell = 1, \ldots, L$.

The Karhunen-Loève Transform (KLT) was introduced to represent a continuous stochastic process as an infinite linear combination of orthogonal functions [CF67]. Whereas these orthogonal functions are deterministic and can be determined from the covariance of the stochastic process, the coefficients are pairwise uncorrelated random variables. It should be emphasized that, unlike the above methods, the KLT is a decomposition of the stochastic process that generates the data and not of a specific realization. For a finite dimensional discrete random process $s_\ell \in \mathbb{R}^{n_{\mathrm{end}}}$ the KLT can be written as

$$ s_\ell = \sum_{k=1}^{n_{\mathrm{end}}} b_k \left(b_k^{T}(s_\ell - \mu_{s_\ell})\right) + \mu_{s_\ell} = \sum_{k=1}^{n_{\mathrm{end}}} z_k\, b_k + \mu_{s_\ell}, \tag{4.26} $$

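where $b_k \in \mathbb{R}^{n_{\mathrm{end}}}$ are the orthonormal basis vectors (see footnote 6), $\mu_{s_\ell} = \mathrm{E}_{s_\ell}\{s_\ell\}$ is the expectation of $s_\ell$, and the coefficients $z_k = b_k^{T}(s_\ell - \mu_{s_\ell}) \in \mathbb{R}$ are random variables. The basis vectors $b_k$ can be computed starting from the requirement that the coefficients $z_k$ must be pairwise uncorrelated:

$$ \mathrm{E}_{z_k,z_{k'}}\{z_k z_{k'}\} = \mathrm{E}_{s_\ell,s_\ell}\left\{\left(b_k^{T}(s_\ell - \mu_{s_\ell})\right)\left(b_{k'}^{T}(s_\ell - \mu_{s_\ell})\right)^{T}\right\} = \lambda_k\, \delta(k, k'). \tag{4.27} $$

Eq. (4.27) can be rewritten as

$$ \mathrm{E}_{s_\ell,s_\ell}\left\{(s_\ell - \mu_{s_\ell})(s_\ell - \mu_{s_\ell})^{T}\right\} b_k = \lambda_k\, b_k, \tag{4.28} $$

which shows that the basis vectors $b_k$ are the eigenvectors of the covariance matrix $C_{s_\ell} = \mathrm{E}_{s_\ell,s_\ell}\{(s_\ell - \mu_{s_\ell})(s_\ell - \mu_{s_\ell})^{T}\} \in \mathbb{R}^{n_{\mathrm{end}} \times n_{\mathrm{end}}}$. In order to reduce the dimensionality of a time series representation one can use only the $n_{\mathrm{red}}$ eigenvectors $b_k$ belonging to the $n_{\mathrm{red}}$ largest eigenvalues of the covariance matrix in (4.26). This leads to the approximation $s_{\ell,\mathrm{red}}$ of $s_\ell$. Keeping only the elements belonging to the largest eigenvalues in the sum means retaining just those characteristics of the data that contribute most to its variance, thereby minimizing the error $\mathrm{E}_{s_\ell}\{\|s_\ell - s_{\ell,\mathrm{red}}\|_2^2\}$. Thus, in the reduced basis of eigenvectors a realization of the centered stochastic process $(s_\ell - \mu_{s_\ell})$ has the representation

$$ [z_1, \ldots, z_{n_{\mathrm{red}}}]^{T} = \left[b_1^{T}(s_\ell - \mu_{s_\ell}), \ldots, b_{n_{\mathrm{red}}}^{T}(s_\ell - \mu_{s_\ell})\right]^{T}. \tag{4.29} $$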
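To make (4.26) to (4.29) concrete, the following sketch computes a single coefficient $z_1$ for a toy data set. It rests on several assumptions that are not part of the original text: the mean and covariance are replaced by sample estimates from a handful of realizations, only the eigenvector of the largest eigenvalue is computed, and this is done by plain power iteration (any eigenvalue solver could be used instead). All function names are hypothetical.

```python
import math


def mean_vector(X):
    """Sample mean of the realizations stored as rows of X."""
    return [sum(row[j] for row in X) / len(X) for j in range(len(X[0]))]


def sample_covariance(X):
    """Unbiased sample estimate of the covariance matrix."""
    mu = mean_vector(X)
    n = len(mu)
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in X) / (len(X) - 1)
             for j in range(n)] for i in range(n)]


def top_eigenvector(C, iterations=200):
    """Power iteration: unit-norm eigenvector of the largest eigenvalue of C.

    Assumes the start vector is not orthogonal to that eigenvector.
    """
    n = len(C)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def reduce_and_reconstruct(x, mu, b):
    """Coefficient z_1 = b^T (x - mu) and the rank-one approximation z_1 b + mu."""
    z = sum(bi * (xi - mi) for bi, xi, mi in zip(b, x, mu))
    return z, [z * bi + mi for bi, mi in zip(b, mu)]


if __name__ == "__main__":
    # Four realizations of a length-2 process lying on a straight line.
    X = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    mu = mean_vector(X)
    b1 = top_eigenvector(sample_covariance(X))
    z1, approx = reduce_and_reconstruct(X[3], mu, b1)
    print(b1, z1, approx)
```

Because the toy realizations lie exactly on a line, a single coefficient already reconstructs them without error; for real data the discarded eigenvalues determine the residual error.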

For practical applications, given $M$ realizations of $s_\ell$, the basis vectors $b_k$ can be approximated by the eigenvectors of the sample estimate of the covariance matrix. In this case the KLT is called Principal Component Analysis. The multivariate case in its general form is much more complicated, since for the time instances $n_i$ and $n_j$ the covariance is no longer a scalar as in the univariate case (the $(n_i, n_j)$-th entry of $C_{s_\ell}$) but a matrix $C_S[n_i, n_j] \in \mathbb{R}^{L \times L}$ as in (4.6). Also the basis vectors $b_k$ in (4.26) must be replaced with matrices $B_k \in \mathbb{R}^{L \times n_{\mathrm{end}}}$. One can avoid such complications by using the vec-operator on the multivariate process $S \in \mathbb{R}^{L \times n_{\mathrm{end}}}$, transforming it to the vector $s_{\mathrm{vec}} = \mathrm{vec}\{S\} \in \mathbb{R}^{L n_{\mathrm{end}}}$, and applying the KLT to $s_{\mathrm{vec}}$.

Footnote 6: For orthonormal vectors $b_k$ the following holds: $\sum_{k=1}^{n_{\mathrm{end}}} b_k b_k^{T} = I_{n_{\mathrm{end}}}$.

The time series representations described in this section can be categorized as shown in Table 4.1.

with statistical model | statistical model not required |
data adaptive | data adaptive | not data adaptive
HMM | SVD | DFT
ARMA | PLA | DWT
KLT | APCA | PAA
 | SAX |

Table 4.1: Representations of time series

4.2 Difficulties in Time Series Classification

The main difficulty when dealing with time series classification is the high number of elementary features. Whereas it is possible to use the elementary feature representation in off-line time series classification, leading to an $L \times n_{\mathrm{end}}$ dimensional input space for the scenario $S$, it cannot be applied for on-line classification with standard machine learning algorithms. The reason for this is the changing dimensionality of the input space, which grows at every time stamp. Although machine learning algorithms such as those presented in Section 2.2 can be used for off-line classification of time series in their elementary feature representation, this approach is rather naive. Not only are the computational costs high, but the curse of dimensionality [Bel61] and the large variance of the classifier in the bias-variance framework also lead to very poor performance. The high variance is a consequence of the large dimensionality of the input space, since the classifiers must be very complex in order to deal with an input of dimensionality $L \times n_{\mathrm{end}}$. Thus, they can easily learn the training set exactly but are unable to generalize.

More advanced time series classification techniques take advantage of the special structure of a scenario, using the correlation among the elements of $S$ in order to construct a lower dimensional representation of the multivariate time series that is better suited for the classification task. Some possibilities to change the representation of a scenario have been presented in the last section. However, no general statement can be made about which time series representation should be used, since this depends strongly on the classification problem. Any a priori knowledge about the process that generates the data should be used for the mappings f e[n] in (4.10) in order to change the representation of a time series at time stamp $n$.
All successful time series classification methods which do not examine the whole training set in order to find nearest neighbors in the elementary feature representation make some assumptions about the data, so that a priori knowledge is required. The most frequent approach is to suppose that the time series contain patterns of limited support which must be specified a priori, for example in the form of a dictionary of elementary signal patterns [Geu02]. Therefore, in time series classification, more than in other classification tasks, much of the work must be done in the preprocessing step, i. e., in the feature generation and feature selection step.

Other difficulties that are typical for time series classification arise from the different lengths of the observed realizations, from different time or amplitude scales, and from the correct alignment in time of the scenarios in the training set. If the observed multivariate time series have different lengths, one can either truncate long scenarios or apply zero padding to short scenarios so that all examples $S_m$ have the length $n_{\mathrm{end}}$. This is visualized in Fig. 4.2 for a scenario with $L = 2$ where the time series $s_1$ has been truncated to the length $n_{\mathrm{end}}$ and $s_2$ has been extended to the length $n_{\mathrm{end}}$ by zero padding.

Figure 4.2: Truncation and zero padding (panels: $s_1[n]$ and $s_2[n]$ plotted over $n$, with $n_{\mathrm{end}} - 1$ marked on the time axis)
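The length normalization illustrated in Fig. 4.2 amounts to a few lines of code. The sketch below is a minimal illustration; the function names `fit_length` and `fit_scenario` are hypothetical, and a scenario is assumed to be stored as a list of $L$ univariate series.

```python
def fit_length(series, n_end):
    """Truncate a long series or zero-pad a short one to exactly n_end samples."""
    if len(series) >= n_end:
        return list(series[:n_end])
    return list(series) + [0.0] * (n_end - len(series))


def fit_scenario(scenario, n_end):
    """Apply the length normalization to every univariate series of a scenario."""
    return [fit_length(series, n_end) for series in scenario]


if __name__ == "__main__":
    scenario = [[1.0, 2.0, 3.0, 4.0, 5.0],  # too long: truncated
                [7.0, 8.0]]                 # too short: zero padded
    print(fit_scenario(scenario, 4))  # [[1.0, 2.0, 3.0, 4.0], [7.0, 8.0, 0.0, 0.0]]
```

Note that truncation discards information and zero padding introduces an artificial step at the end of the original data, so either choice can affect features computed afterwards.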

If the shape of a scenario is its discriminative feature, two scenarios which have the same target values may differ considerably in amplitude or in the duration of certain temporal patterns. In this case distance measures that focus on the shape of time series, e. g., Dynamic Time Warping, which will be discussed in Section 4.4, can be used in the preprocessing stage. Fig. 4.3 shows an example of these types of problems. The time series $s_1$ of scenario $S_2$ differs from the time series $s_1$ of scenario $S_1$ only by a scaling factor in the amplitude. Likewise, $s_2$ of scenario $S_2$ can be obtained from $s_2$ of scenario $S_1$ by squeezing the time axis and zero padding to the length $n_{\mathrm{end}}$.

Figure 4.3: Amplitude and temporal scaling (panels: $s_1[n]$ and $s_2[n]$ of the scenarios $S_1$ and $S_2$ plotted over $n$, with $n_{\mathrm{end}} - 1$ marked on the time axis)

The difficulty of correct alignment in time arises from the fact that finite length scenarios are normally slices of duration $n_{\mathrm{end}}$ from random processes of longer length. Thus, in many applications it is not the absolute location in time of certain features that is important but rather their relative location in time. If the algorithms that are used for an application require an absolute location of the origin, the question of how to align the scenarios in time arises. This can be done, for example, by linking the origin with a certain event that is typical for the considered classification problem. This problem will be discussed in more detail in Subsection 4.5.3.

For the general multivariate time series classification case, when a scenario $S$ has segments that belong to one class and other segments that belong to other classes, additional difficulties appear, since not only the classes but also the time instances when changes occur must be detected. Here, a scenario must additionally be divided correctly into segments. The topic of time series segmentation is complex since it already involves a classification step. Usually, when one class label changes into another there is a transition time where no class label can be assigned accurately to the time stamps in this interval. For example, a local maximum can only be recognized as a maximum after observing some values after the peak. Therefore, “do not care” intervals are introduced, which represent segments of a scenario that lie around those time instances where a change in the target value occurs. These intervals do not have any class label and are not considered in the computation of the risk functional. In Fig. 4.4 a scenario with three “do not care” intervals, represented by hatched areas, is visualized. The topic of segmentation for on-line time series classification is discussed in more detail in Section 7.1.

Figure 4.4: “Do Not Care” intervals (panels: $s_1[n]$ and $s_2[n]$ plotted over $n$; the class labels $c_1$, $c_2$, $c_3$, $c_4$ are separated by three hatched intervals)
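The effect of “do not care” intervals on the evaluation can be sketched as follows. This is an illustration under assumptions: the risk functional is taken to be the empirical misclassification rate, unlabeled time stamps are encoded as `None`, and the function name `masked_error_rate` is hypothetical.

```python
DONT_CARE = None  # assumed encoding of time stamps without a class label


def masked_error_rate(y_true, y_pred):
    """Misclassification rate over all time stamps that carry a class label."""
    labeled = [(t, p) for t, p in zip(y_true, y_pred) if t is not DONT_CARE]
    if not labeled:
        raise ValueError("no labeled time stamps")
    errors = sum(1 for t, p in labeled if t != p)
    return errors / len(labeled)


if __name__ == "__main__":
    y_true = ["c1", "c1", DONT_CARE, DONT_CARE, "c2", "c2"]
    y_pred = ["c1", "c1", "c1", "c2", "c2", "c1"]
    print(masked_error_rate(y_true, y_pred))  # 0.25
```

Whatever the classifier outputs inside the unlabeled interval has no influence on the result; only the four labeled time stamps are counted.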

4.3 Classical Approaches for Time Series Classification

Classical approaches for time series classification build models of the temporal data and have a class label or a sequence of class labels associated with every model. In order to classify a new scenario it must be determined which model the current scenario fits best; the class labels corresponding to this model represent the classifier's decision. The most popular techniques in this context for modeling time series are autoregressive moving average models and hidden Markov models. Another approach that can be used for time series classification is implemented by recurrent neural networks. Similar to autoregressive moving average and hidden Markov models, recurrent neural networks were initially designed to predict future values of a time series, but they can also be used for classification. In the following the three methods are briefly discussed.

Hidden Markov Models (HMM) play an important role in temporal classification, mainly due to their success in speech recognition [RJ86]. A HMM consists of $N_{\mathrm{HMM}}$ states, and it is assumed that in each state the random process to be modeled possesses distinctive properties. HMMs can be compared to finite state machines, with the difference that both the output produced by a state and the transitions between states are not deterministic but parts of a statistical model. At each time stamp $n$ a new state is entered based on a transition probability which only depends on the last state (see footnote 7). After each transition an output $a[n] \in A_{\mathrm{HMM}}$, with $A_{\mathrm{HMM}}$ being an alphabet of symbols, is produced according to an emission probability distribution which only depends on the current state. Thus, there are $N_{\mathrm{HMM}}$ emission probability distributions in a HMM, one for each state. Fig. 4.5 shows a HMM with $N_{\mathrm{HMM}} = 4$ states and an alphabet $A_{\mathrm{HMM}} = \{a_1, a_2, a_3\}$.

Figure 4.5: HMM with four states (upper part: state graph with the states 1 to 4 and the transition probabilities $p_0, \ldots, p_{15}$; lower part: one table per state $i$ listing the output probabilities $p_{i a_1}$, $p_{i a_2}$, $p_{i a_3}$ of the outputs $a_1$, $a_2$, $a_3$)

In the upper part of the figure the graph visualizes the four states and the transition probabilities between the states. In the lower part the emission probabilities for the outputs in each state are presented. The emission probabilities in a state sum to 1. If the output sequence $\{a[n] = a_1, a[n+1] = a_1, a[n+2] = a_3, a[n+3] = a_2, a[n+4] = a_3\}$ is observed, one cannot say exactly which state produced, for example, the output $a_3$ at time stamp $n + 4$ (see footnote 8).

Footnote 7: This is the Markovian property.

Footnote 8: Therefore, these Markov models are called “hidden”.

If we assume that at time stamp $n$ the current state is state 1, the output $a_3$ at time stamp $n + 4$ can be the result of any state sequence starting with 1, e. g., $\{1, 2, 1, 1, 3\}$. For this state sequence one starts with state 1 and emits the output $a_1$ with probability $p_{1a_1}$; then a transition to state 2 takes place with probability $p_1$, where the output $a_1$ is emitted with probability $p_{2a_1}$. Next a transition to state 1 occurs with probability $p_2$, where the output $a_3$ is produced with probability $p_{1a_3}$, followed by a transition to the same state 1 with probability $p_0$ and an emission of the output $a_2$ with probability $p_{1a_2}$. Finally, a transition to state 3 occurs with probability $p_{12}$, where the output $a_3$ is emitted with probability $p_{3a_3}$. In order to compute the probability $p(a = [a_1, a_1, a_3, a_2, a_3]^{T}, \mathrm{states} = \{1, 2, 1, 1, 3\})$ that the sequence of states $\{1, 2, 1, 1, 3\}$ generates the observation $\{a[n] = a_1, a[n+1] = a_1, a[n+2] = a_3, a[n+3] = a_2, a[n+4] = a_3\}$, all of the above mentioned probabilities and the probability of being in state 1 at time stamp $n$, $p(\mathrm{state} = 1)$, must be multiplied:

$$ p(a = [a_1, a_1, a_3, a_2, a_3]^{T}, \mathrm{states} = \{1, 2, 1, 1, 3\}) = p(\mathrm{state} = 1)\, p_{1a_1}\, p_1\, p_{2a_1}\, p_2\, p_{1a_3}\, p_0\, p_{1a_2}\, p_{12}\, p_{3a_3}. \tag{4.30} $$
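The product in (4.30) can be written as a short loop. The sketch below is only an illustration: since the numerical values of the probabilities in Fig. 4.5 are not given, it uses a hypothetical two-state HMM with two output symbols, and the dictionary layout and the function name `joint_probability` are assumptions.

```python
def joint_probability(states, outputs, initial, transition, emission):
    """Probability that the given state sequence occurs and emits the given outputs."""
    # Probability of the first state times the emission in that state.
    prob = initial[states[0]] * emission[states[0]][outputs[0]]
    # Each further step contributes one transition and one emission probability.
    for prev, cur, out in zip(states, states[1:], outputs[1:]):
        prob *= transition[prev][cur] * emission[cur][out]
    return prob


if __name__ == "__main__":
    # Hypothetical model parameters (not taken from Fig. 4.5).
    initial = {1: 0.6, 2: 0.4}
    transition = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.4, 2: 0.6}}
    emission = {1: {"a1": 0.5, "a2": 0.5}, 2: {"a1": 0.1, "a2": 0.9}}
    # 0.6 * 0.5 * (0.3 * 0.9) * (0.6 * 0.9), approximately 0.04374
    print(joint_probability([1, 2, 2], ["a1", "a2", "a2"], initial, transition, emission))
```

The number of factors grows linearly with the length of the observation, exactly as in (4.30), where five emissions, four transitions, and one initial state probability are multiplied.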

The probability that this HMM produced the sequence of observations is the sum of the probabilities of all possible sequences of states leading to the output $\{a[n] = a_1, a[n+1] = a_1, a[n+2] = a_3, a[n+3] = a_2, a[n+4] = a_3\}$. Although there are algorithms that can compute this sum over all possible state sequences that lead to a given sequence of observations, the Viterbi algorithm [For73] is often used instead. It is a dynamic programming algorithm that only computes the probability of the most likely sequence of states leading to the observation.

HMM can be used for the classification of time series $S$ by quantizing the time series to a sequence of symbols $\{a[0], \ldots, a[n_{\mathrm{end}} - 1]\}$. This sequence can be interpreted as the output of a HMM. If several HMMs are available and each model stands for a certain sequence of class labels, one only has to determine which model has generated the observation with the highest probability, e. g., by using the Viterbi algorithm. The corresponding sequence of class labels then represents the decision of this classifier. It should be noted that such a classifier requires a large number of HMMs which have to be built based on the training set $D_S$.

Two important questions arise when constructing a HMM: firstly, how many states and transitions should be chosen, and secondly, how to compute the transition and emission probabilities. There is no clear answer to the first question; here one has to rely on a priori knowledge about the process or apply trial and error. For the second question, on the other hand, there is an algorithm for computing the transition and emission probabilities in a HMM such that the likelihood of obtaining the observations in the training set associated with this model is maximized. This algorithm is called the Baum-Welch reestimation algorithm [BPSW70] and is based on the expectation maximization technique.
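The two quantities contrasted above, the total probability of an observation and the probability of its most likely state sequence, can be sketched as two nearly identical recursions. The code is illustrative only: the recursion that accumulates the sum is commonly known as the forward algorithm, which the text does not name; the two models, their parameters, and all function names are hypothetical. A brute-force enumeration of all state sequences is included solely to check the recursions on this tiny example.

```python
from itertools import product


def forward_probability(outputs, initial, transition, emission):
    """Sum of the joint probabilities over all state sequences (forward recursion)."""
    states = list(initial)
    alpha = {s: initial[s] * emission[s][outputs[0]] for s in states}
    for out in outputs[1:]:
        alpha = {s: emission[s][out] * sum(alpha[r] * transition[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi_probability(outputs, initial, transition, emission):
    """Probability of the single most likely state sequence (max replaces sum)."""
    states = list(initial)
    delta = {s: initial[s] * emission[s][outputs[0]] for s in states}
    for out in outputs[1:]:
        delta = {s: emission[s][out] * max(delta[r] * transition[r][s] for r in states)
                 for s in states}
    return max(delta.values())


def path_probabilities(outputs, initial, transition, emission):
    """Brute force: joint probability of every possible state sequence."""
    probs = []
    for path in product(list(initial), repeat=len(outputs)):
        p = initial[path[0]] * emission[path[0]][outputs[0]]
        for prev, cur, out in zip(path, path[1:], outputs[1:]):
            p *= transition[prev][cur] * emission[cur][out]
        probs.append(p)
    return probs


# Two hypothetical models, each standing for one class label.
MODELS = {
    "c1": ({1: 0.6, 2: 0.4},
           {1: {1: 0.7, 2: 0.3}, 2: {1: 0.4, 2: 0.6}},
           {1: {"a1": 0.9, "a2": 0.1}, 2: {"a1": 0.6, "a2": 0.4}}),
    "c2": ({1: 0.5, 2: 0.5},
           {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}},
           {1: {"a1": 0.2, "a2": 0.8}, 2: {"a1": 0.1, "a2": 0.9}}),
}


def classify(outputs, models):
    """Pick the class whose model explains the observation best (Viterbi score)."""
    return max(models, key=lambda label: viterbi_probability(outputs, *models[label]))


if __name__ == "__main__":
    print(classify(["a1", "a1", "a1"], MODELS))  # c1
    print(classify(["a2", "a2", "a2"], MODELS))  # c2
```

The brute-force enumeration grows exponentially with the length of the observation, whereas both recursions need a number of operations that is linear in the length and quadratic in the number of states, which is why they are used in practice.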
Although HMM have been applied with success in many temporal classification tasks, there are some difficulties related to their use. The most serious drawback is the huge number of parameters that must be determined. A large number of parameters bears the risk of overfitting the training set. As already mentioned, there is no general way to find the number of states and the transitions; rather, one has to rely on domain knowledge or on trial and error. Further, HMM do not generate an interpretable classifier, which may be important for some applications. Moreover, there is no clear approach on how to perform the quantization step mentioned above, which transforms a time series into a sequence $\{a[0], \ldots, a[n_{\mathrm{end}} - 1]\}$ (see footnote 9).

Footnote 9: Methods like SAX or vector quantization can be used.


Auto-Regressive Moving Average (ARMA) models became popular after the work of Box and Jenkins [BJ70] on modeling stationary time series. The ARMA class of models can also deal with data containing homogeneous nonstationarity, because this type of time series can be transformed into a stationary time series by differencing, which removes the trends. The models that assume stationarity after differencing are called Auto-Regressive Integrated Moving Average (ARIMA) models. Since ARIMA models can be reduced to ARMA models, only a short introduction to the latter type is presented in the following. ARMA models describe a stationary random process $\{s[n]\}$, where a scenario $S$ is a slice of $\{s[n]\}$, with mean vector $\mu_s$ by a relation of the form

$$ (s[n] - \mu_s) = \sum_{j=1}^{p} C[j]\,(s[n-j] - \mu_s) + e[n] - \sum_{j=1}^{q} \Theta[j]\, e[n-j], \tag{4.31} $$

with $C[j]$, $\Theta[j]$ being $L \times L$ matrices and $e[n] \in \mathbb{R}^{L}$ a vector white noise process with mean zero and covariance matrix $\mathrm{E}_{e,e}\{e[n]\, e[n]^{T}\}$. The first two terms on the right hand side of (4.31) represent the auto-regressive part and the last two terms the moving-average part of $\{s[n]\}$; the term $e[n]$ belongs to both. An Auto-Regressive (AR) process only depends on the current value of the input, $e[n]$, and on the past values of the output. A Moving-Average (MA) process, on the other hand, only depends on the current and past values of the input.

The difficulties related to the use of ARMA models originate not only from a suitable choice of the model orders $p$ and $q$ but also from the computation of the parameter matrices $C[j]$ and $\Theta[j]$. For the choice of $p$ and $q$, model selection criteria such as the likelihood ratio test, the “An” Information Criterion (AIC) [Aka74], or the Bayesian Information Criterion (BIC) [Sch78] are used. Whereas the likelihood ratio test decides whether a null hypothesis, e. g., “the data is a result of an ARMA process with the current choice of $p$ and $q$”, has enough statistical evidence by comparing a suitable test statistic [Rei03] with a constant threshold, the AIC and BIC criteria additionally penalize models with a large number of parameters. In order to determine a suitable model for some given data one starts with a low-order AR model, and if this model does not provide an adequate representation of the time series a low-order ARMA model is considered. The model parameters $C[j]$ and $\Theta[j]$ can be estimated for AR models using the least squares technique, while for ARMA models maximum likelihood methods must be applied [Rei03].

ARMA models have a thorough statistical foundation and have been used in many applications dealing with time series prediction and modeling. However, the models are prone to overfitting the given data, and therefore techniques similar to those used in machine learning, i. e., validation on an independent data set, must be applied.
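Two steps mentioned above, differencing to remove a trend and the least squares estimation of an AR model, can be sketched for the simplest case. The code makes strong simplifying assumptions that are not part of the original text: the series is univariate ($L = 1$), the model order is $p = 1$, and the mean is assumed to be known. The function names are hypothetical.

```python
def difference(series):
    """First-order differencing: d[n] = s[n + 1] - s[n], which removes a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series, mu=0.0):
    """Least squares estimate of c in (s[n] - mu) = c * (s[n - 1] - mu) + e[n]."""
    x = [v - mu for v in series]
    numerator = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    denominator = sum(x[n - 1] ** 2 for n in range(1, len(x)))
    return numerator / denominator


if __name__ == "__main__":
    print(difference([2.0, 5.0, 8.0, 11.0]))   # [3.0, 3.0, 3.0]
    print(fit_ar1([8.0, 4.0, 2.0, 1.0, 0.5]))  # 0.5
```

The second example is noise free, so the estimate recovers the generating coefficient exactly; with noisy data the estimate only minimizes the sum of squared one-step prediction errors.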
In order to perform time series classification one has to compute one or more models for each class and find out which model a query time series fits best. This procedure leads to considerable computational complexity. The main drawback of time series classification using ARMA models, however, is their inability to deal with non-stationary time series. As mentioned above, some extensions are possible to model homogeneous non-stationarity, but for many time series encountered in practice this assumption does not hold.

Recurrent Neural Networks (RNN) are extensions of feedforward NN that aim to deal better with temporal data. The architecture of a NN is described by a graph that specifies which neuron is connected to which. The graph of a RNN contains cycles, which show that past context is used to compute the current output, thereby introducing the temporal dimension. An example of a RNN is given in Fig. 4.6, where the network from Fig. 2.5 has been extended by the unit R in the hidden layer and by the input $r[n]$. The value of $r[n]$ is defined as the output value of the unit R at time stamp $n - 1$. This implements the recurrence, since the output of R depends both on $x[n]$ and on $r[n]$, so that the output $u[n]$ is influenced by information from earlier values of $x$ that are arbitrarily distant in time. The input $x[n]$ of a RNN can be chosen to be the vector of elementary features $s[n]$ or the vector of high level features described in (4.10). The output vector $u[n]$ approximates the unit vector representation of the target $y[n]$, similarly to Subsection 2.2.2. The final steps, which are not shown in Fig. 4.6, are confidence mapping, normalization, and taking the maximum value in the vector of estimated conditional probabilities. If the RNN generates the output $\hat{y}[n] = c_k$, the current segment of the multivariate time series is assigned to class $c_k$.

Figure 4.6: Recurrent neural network (inputs $x_1[n], \ldots, x_N[n]$ and the recurrent input $r[n]$; hidden units $1, \ldots, H$ and R with weights $\vartheta$; output units $1, \ldots, K$ with weights $w$ and outputs $u_1[n], \ldots, u_K[n]$)
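The recurrence via $r[n]$ can be sketched as a single update step that is applied once per time stamp. The details below are assumptions made for illustration and are not taken from Fig. 4.6: a tanh activation is used in the hidden layer, bias terms are omitted, the output units are linear and are fed by the hidden units $1, \ldots, H$ only, and the function and parameter names are hypothetical.

```python
import math


def rnn_step(x, r_prev, theta_hidden, theta_R, w_out):
    """One time step: returns the output u[n] and the value fed back as r[n + 1]."""
    z = list(x) + [r_prev]  # every hidden unit sees x[n] and r[n]
    hidden = [math.tanh(sum(t * v for t, v in zip(row, z))) for row in theta_hidden]
    r_next = math.tanh(sum(t * v for t, v in zip(theta_R, z)))  # output of unit R
    u = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return u, r_next


def run_rnn(xs, theta_hidden, theta_R, w_out, r0=0.0):
    """Process a whole input sequence, carrying the recurrent value along."""
    r, outputs = r0, []
    for x in xs:
        u, r = rnn_step(x, r, theta_hidden, theta_R, w_out)
        outputs.append(u)
    return outputs


if __name__ == "__main__":
    theta_hidden = [[0.5, 1.0]]  # one hidden unit: weights for x_1[n] and r[n]
    theta_R = [1.0, 0.5]         # unit R: weights for x_1[n] and r[n]
    w_out = [[1.0]]              # one output unit
    # Same input at the second time stamp, but a different history.
    print(run_rnn([[1.0], [0.0]], theta_hidden, theta_R, w_out)[1])
    print(run_rnn([[0.0], [0.0]], theta_hidden, theta_R, w_out)[1])
```

Both sequences present the same input at the second time stamp, yet the outputs differ because the recurrent value carries information about the first input.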

Many other topologies for RNN than the one shown in Fig. 4.6 can be implemented. For example, one can insert several layers between the input $x[n]$ and the unit R. Another possibility is to add more than just one unit in the hidden layer and more than just one input in order to realize the recurrence. RNN are more difficult to train than feedforward NN, and several methods have been proposed [Elm90, WZ95, Moz95]. A popular algorithm, similar to the backpropagation algorithm, is called BackPropagation Through Time (BPTT) [Wer90]. The central idea of BPTT is the unfolding of the RNN into a sequence of feedforward NN as shown in Fig. 4.7. The unfolded network has a copy of the initial RNN for each time step, where the feedback loop is replaced by connections between the various copies. Therefore, the weights of the unfolded network can be trained using the standard backpropagation algorithm, but with one restriction: the weights in all copies of the unfolded network should be the same, since one wishes to keep just one copy of the RNN and one set of weights. To achieve this, BPTT updates all equivalent weights using the sum of the gradients obtained for weights in equivalent layers.

Figure 4.7: Unfolding of a recurrent neural network (copies of the network for the time stamps $n-2$, $n-1$, and $n$ with the inputs $x[n-2]$, $x[n-1]$, $x[n]$, the recurrent values $r[n-2]$, $r[n-1]$, $r[n]$, and the outputs $u_1, \ldots, u_K$ at each time stamp)

The drawbacks of RNN are similar to those of HMM, namely the large number of parameters and the missing interpretability. The parameters that must be chosen for a RNN include the number of neurons in the hidden layer, the recurrence structure, and the weights in the network. Due to the temporal dimension it is even harder than for feedforward NN to extract meaningful rules from RNN, which makes them unsuitable for applications that require interpretability.
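The weight sharing in BPTT described above, where the gradients of all copies of a weight are summed, can be sketched for a deliberately tiny network. Everything in this example is an assumption made for illustration: the network has a single recurrent unit $h[t] = \tanh(w\, h[t-1] + v\, x[t])$, which is much simpler than the topology of Fig. 4.6, and the loss is the squared error of the last output only. A finite difference check is included to confirm the gradients numerically.

```python
import math


def forward(w, v, xs, h0=0.0):
    """Unfold the recurrence: hs[t] is the unit output after t inputs."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + v * x))
    return hs


def loss(w, v, xs, y):
    """Squared error between the final output and the target y."""
    return 0.5 * (forward(w, v, xs)[-1] - y) ** 2


def bptt_gradients(w, v, xs, y):
    """Backpropagate through the unfolded copies and sum the shared-weight gradients."""
    hs = forward(w, v, xs)
    dh = hs[-1] - y  # derivative of the loss w.r.t. the final output
    grad_w = grad_v = 0.0
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh of copy t
        grad_w += da * hs[t - 1]      # contribution of copy t to the shared w
        grad_v += da * xs[t - 1]      # contribution of copy t to the shared v
        dh = da * w                   # pass the error on to copy t - 1
    return grad_w, grad_v


if __name__ == "__main__":
    w, v, xs, y = 0.8, -0.5, [1.0, 0.3, -0.7, 0.2], 0.4
    gw, gv = bptt_gradients(w, v, xs, y)
    eps = 1e-6
    num_gw = (loss(w + eps, v, xs, y) - loss(w - eps, v, xs, y)) / (2 * eps)
    num_gv = (loss(w, v + eps, xs, y) - loss(w, v - eps, xs, y)) / (2 * eps)
    print(abs(gw - num_gw) < 1e-6, abs(gv - num_gv) < 1e-6)
```

The accumulation with `+=` inside the backward loop is exactly the restriction stated in the text: there is one weight $w$ and one weight $v$, so the contributions of all time steps are added before any update is made.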

4.4 Similarity Measures for Time Series

As already mentioned in Chapter 2, the metric for measuring neighborhoods in the input space is the key to solving classification problems. Especially in high dimensions, where the input space is almost empty despite a large training set, assumptions about the neighborhood relations must be made, e. g., by the choice of a kernel or of a similarity measure. In this section distance measures for time series are considered. Each multivariate time series can be interpreted as a point in a high dimensional space, and the similarity between two time series is defined by the distance measure in this high dimensional space. Which similarity measure is most suitable depends on the classification task: if the shape of the time series is their discriminative feature, a different distance measure must be used than in cases where the time of occurrence of certain events is the discriminative factor.

According to [RKBL05] time series similarities can be divided into similarity in time, similarity in shape, and similarity in change. Similarity in time can be measured using the $L_2$ norm, which will be described below in more detail, and can be seen as a measure for the correlation between time series. Similarity in shape, on the other hand, does not focus on the exact time of occurrence and duration of characteristic patterns but on the existence and the temporal succession of such characteristic patterns. The most frequently used measure for similarity in shape is the Dynamic Time Warping (DTW) measure, which will be discussed below. Finally, similarity in change is a measure for the degree of similarity of the autocorrelation of the time series. An approach to measure similarity in change is to use models like HMM or ARMA for the random process and to evaluate the degree of fitting [RKBL05].
In the following the Minkowski metric as a mean to measure the similarity in time and the DTW as a mean to measure similarity in shape will be presented. Then, a new similarity measure, which will be called in the following Augmented Dynamic Time Warping, will be introduced that captures both the similarity in shape as DTW but also the similarity in the duration of time series, i. e., Augmented Dynamic Time Warping penalizes also differences in the length of the scenarios that must be compared. At the end of this subsection some dissimilarity measures which became popular for indexing time series will be shortly described: the weighted Minkowski metric, the Levenshtein distance, and the Threshold Distance (TD). The most widely used dissimilarity measure is the Minkowski metric which is induced by the Lp norm. Given two univariate time series s and sk of length nend , the Lp norm of s − sk


is defined as

$$L_p(s - s_k) = \left( \sum_{n=0}^{n_{\mathrm{end}}-1} \left| s[n] - s_k[n] \right|^p \right)^{\frac{1}{p}}. \tag{4.32}$$

When $p = 1$ the dissimilarity measure from (4.32) is called the Manhattan or city-block norm, when $p = 2$ the Euclidean norm, and when $p = \infty$ the maximum norm,¹⁰ which can be rewritten as

$$L_\infty(s - s_k) = \max_{n=0,\dots,n_{\mathrm{end}}-1} \left\{ \left| s[n] - s_k[n] \right| \right\}. \tag{4.33}$$
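The definitions (4.32) and (4.33) can be illustrated with a minimal Python sketch. It evaluates the three norms on the four test sequences that underlie Table 4.2. The function name `lp_norm` and the choice to subtract 0.5 at even and add 0.5 at odd time stamps when building $s_4$ are assumptions made here for illustration only; the norms do not depend on that choice.

```python
import math

def lp_norm(s, s_k, p):
    """L_p norm of the difference s - s_k, following (4.32) and (4.33)."""
    diffs = [abs(a - b) for a, b in zip(s, s_k)]
    if math.isinf(p):
        return max(diffs)                       # maximum norm (4.33)
    return sum(d ** p for d in diffs) ** (1.0 / p)

n_end = 64
s1 = [0.01 * (n - 32) ** 2 for n in range(n_end)]   # reference sequence
s2 = list(s1)
s2[4] -= 3.0                                        # one large deviation
s3 = list(s1)
s3[4] -= 2.0                                        # two medium deviations
s3[57] -= 2.0
# many small deviations; the sign pattern is an assumption of this sketch
s4 = [v - 0.5 if n % 2 == 0 else v + 0.5 for n, v in enumerate(s1)]

for name, s in (("s2", s2), ("s3", s3), ("s4", s4)):
    print(name, [round(lp_norm(s1, s, p), 1) for p in (1, 2, math.inf)])
```

Rounded to one decimal, the printed values agree with the entries of Table 4.2: a single large deviation is penalized least by $L_1$, whereas many small deviations are penalized least by $L_\infty$.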

Figures 4.8 to 4.11 show the differences between the $L_1$, $L_2$, and $L_\infty$ norms. Fig. 4.8 illustrates the reference series $s_1$ with $s_1[n] = 0.01(n-32)^2$ and $n_{\mathrm{end}} = 64$. The series $s_2$ differs from $s_1$ only at $n = 4$, where $s_2[4] = s_1[4] - 3$. The sequence $s_3$ differs from $s_1$ at $n = 4$ and at $n = 57$, where $s_3[4] = s_1[4] - 2$ and $s_3[57] = s_1[57] - 2$, respectively. The sequence $s_4$ is constructed from $s_1$ by alternately subtracting and adding the value 0.5 to $s_1[n]$. Table 4.2 presents the distances between $s_1$ and $s_2$, $s_3$, $s_4$ computed using the $L_1$, $L_2$, and $L_\infty$ norms.

Figure 4.8: Reference sequence $s_1$

Figure 4.9: Sequence $s_2$: $L_1 < L_2 < L_\infty$

Figure 4.10: Sequence $s_3$: $L_2 < L_1 < L_\infty$

Figure 4.11: Sequence $s_4$: $L_\infty < L_2 < L_1$

          $L_1(s_1 - (\cdot))$    $L_2(s_1 - (\cdot))$    $L_\infty(s_1 - (\cdot))$
$s_2$     3.0                     3.0                     3.0
$s_3$     4.0                     2.8                     2.0
$s_4$     32.0                    4.0                     0.5

Table 4.2: Distances to $s_1$

It can be seen that under the $L_1$ norm sequence $s_2$ is the most similar to $s_1$. When the distance is measured with the $L_2$ norm, sequence $s_3$ is the most similar to $s_1$, whereas the $L_\infty$ norm identifies sequence $s_4$ as the most similar to $s_1$. In the Maximum Likelihood sense the $L_2$ norm is optimal if the measurement errors at each time stamp are i.i.d. Gaussian random variables,¹¹ and the $L_1$ norm is optimal if the errors are i.i.d. Laplacian random variables. Since the Laplacian distribution has much longer tails than the Gaussian, it is better suited to model impulsive noise [SB99, YF00]. In most tasks involving time series the $L_2$ norm is used, since the Euclidean distance allows lower bounding for many of the dimensionality reduction techniques discussed in Section 4.1. For example, both the DFT and the DWT are orthogonal linear transformations, and thus the $L_2$ norm, being invariant under rotations, is preserved by these transformations. This is not the case for other $L_p$ norms with $p \neq 2$.

¹⁰ As $p$ approaches infinity, the largest of the component distances completely dominates the overall distance measure.

¹¹ Assuming the regression model $y = A\theta + e$ with $e \sim \mathcal{N}(0, C_\varepsilon)$, the maximization of the likelihood $p_y(y = y; \theta)$ is realized by the minimization of $\| y - A\theta \|^2_{C_\varepsilon^{-1/2}} = (y - A\theta)^T C_\varepsilon^{-1} (y - A\theta)$.

For multivariate time series $S$ the dissimilarity measure can be defined as the weighted sum of the $L_p$ distances between the corresponding univariate time series:

$$d_p(S_{m_1}, S_{m_2}) = \sum_{\ell=1}^{L} w_\ell \, L_p(s_{m_1,\ell} - s_{m_2,\ell}), \tag{4.34}$$

where $w_\ell$ denotes the weight of the $\ell$-th time series in the scenarios. In applications where the similarity in shape is important, a dissimilarity measure other than the Minkowski metrics is better suited: Dynamic Time Warping (DTW).¹² Unlike the Euclidean distance, DTW allows an elastic shifting of the time axis so that patterns which are similar but out of phase can be lined up. The difference between the Euclidean distance and DTW is visualized in Figures 4.12 and 4.13, where the time series are drawn as continuous curves for better illustration. Whereas the Euclidean distance is very sensitive to small distortions in the time axis because it assumes that the $n$-th point in $s_1$ is aligned with the $n$-th point in $s_2$, DTW can warp the time axes of both series to achieve a better alignment.

Figure 4.12: No time warping

Figure 4.13: With time warping

¹² Since the triangle inequality does not hold for the DTW dissimilarity, it is not a metric and thus it will not be called a “distance”.

DTW is a technique based on dynamic programming and has been known in the speech processing community since 1978 [SC78]. It was introduced to data mining in 1994 [BC94], but it only became popular through the works of Kim et al. [KPC01] and Keogh [KP01], who introduced indexing techniques for DTW. Unlike other dissimilarity measures, DTW does not require that the two sequences to be compared have the same length. Given sequence $s_1$ of length $n_{\mathrm{end},1}$ and sequence $s_2$ of length $n_{\mathrm{end},2}$, the distance between the $n$-th point in $s_1$ and the $k$-th point in $s_2$ is defined as

$$d[n, k] = (s_1[n] - s_2[k])^2. \tag{4.35}$$

In order to compute the DTW dissimilarity, in a first step the $n_{\mathrm{end},1} \times n_{\mathrm{end},2}$ matrix $D_{\mathrm{DTW}}$ is built, with the $(n,k)$-th element being $d[n,k]$. A contiguous path in $D_{\mathrm{DTW}}$ from $d[0,0]$ to $d[n_{\mathrm{end},1}-1, n_{\mathrm{end},2}-1]$ is called a warping path

$$W_{\mathrm{DTW}} = \{ w_{1,\mathrm{DTW}}, \dots, w_{I,\mathrm{DTW}} \}, \tag{4.36}$$

where $\max(n_{\mathrm{end},1}, n_{\mathrm{end},2}) \le I < n_{\mathrm{end},1} + n_{\mathrm{end},2} - 1$ and $w_{i,\mathrm{DTW}} = [n,k]_i$ stores the indices $n$ and $k$ of the $i$-th element in the path. The warping path $W_{\mathrm{DTW}}$ is subject to three constraints. The first constraint, called the boundary condition, has already been mentioned and states that $w_{1,\mathrm{DTW}} = [0,0]$ and $w_{I,\mathrm{DTW}} = [n_{\mathrm{end},1}-1, n_{\mathrm{end},2}-1]$. This forces the path to start and finish in diagonally opposite corners of $D_{\mathrm{DTW}}$. The second constraint, called the continuity condition, restricts the allowable steps in the path to adjacent cells: if $w_{i,\mathrm{DTW}} = [n,k]$ and $w_{i-1,\mathrm{DTW}} = [n',k']$, then $n - n' \le 1$ and $k - k' \le 1$. Finally, the third constraint, called the monotonicity condition, forces the points in $W_{\mathrm{DTW}}$ to be monotonically spaced in time: if $w_{i,\mathrm{DTW}} = [n,k]$ and $w_{i-1,\mathrm{DTW}} = [n',k']$, then $n - n' \ge 0$ and $k - k' \ge 0$. There are many paths which satisfy these three constraints. The DTW algorithm determines the path $W^\star_{\mathrm{DTW}}$ which minimizes the warping cost

$$W^\star_{\mathrm{DTW}} = \operatorname*{argmin}_{W_{\mathrm{DTW}}} \left\{ \sqrt{ \sum_{i=1}^{I(W_{\mathrm{DTW}})} d[w_{i,\mathrm{DTW}}] } \right\}, \tag{4.37}$$

where $I(W_{\mathrm{DTW}})$ stands for the length of the warping path $W_{\mathrm{DTW}}$. The optimal warping path $W^\star_{\mathrm{DTW}}$ leads to the DTW dissimilarity

$$d_{\mathrm{DTW}}(s_1, s_2) = \frac{1}{I(W^\star_{\mathrm{DTW}})} \sqrt{ \sum_{i=1}^{I(W^\star_{\mathrm{DTW}})} d[w^\star_{i,\mathrm{DTW}}] }. \tag{4.38}$$

The optimization problem in (4.37) can be solved very efficiently by dynamic programming. For this purpose a new $n_{\mathrm{end},1} \times n_{\mathrm{end},2}$ matrix $\Gamma_{\mathrm{DTW}}$ is computed, which contains the cumulative distances $\gamma_{\mathrm{DTW}}[n,k]$ defined as

$$\gamma_{\mathrm{DTW}}[n,k] = d[n,k] + \min \left\{ \gamma_{\mathrm{DTW}}[n-1,k-1],\; \gamma_{\mathrm{DTW}}[n-1,k],\; \gamma_{\mathrm{DTW}}[n,k-1] \right\}. \tag{4.39}$$

The sequence of indices leading to $\gamma_{\mathrm{DTW}}[n_{\mathrm{end},1}-1, n_{\mathrm{end},2}-1]$, from which $d_{\mathrm{DTW}}(s_1, s_2)$ results according to (4.38), builds the warping path $W^\star_{\mathrm{DTW}}$. It should be noted that for time series of the same length, $n_{\mathrm{end},1} = n_{\mathrm{end},2}$, the Euclidean distance can be seen as a special case of DTW with the constraint that $w_{i,\mathrm{DTW}} = [n,k]_i$, where $n = k = i - 1$.
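The dynamic programming procedure can be made concrete with a short Python sketch for univariate sequences. It fills the cumulative matrix according to (4.39), recovers a warping path by backtracking, and normalizes as in (4.38). The function name `dtw`, the treatment of cells outside the matrix as infinitely expensive, and the tie-breaking rule during backtracking are assumptions of this sketch and are not prescribed by the text.

```python
import math

def dtw(s1, s2):
    """Return (DTW dissimilarity, warping path) for two univariate sequences."""
    n1, n2 = len(s1), len(s2)
    inf = math.inf
    gamma = [[inf] * n2 for _ in range(n1)]       # cumulative distances (4.39)
    for n in range(n1):
        for k in range(n2):
            d = (s1[n] - s2[k]) ** 2              # local distance (4.35)
            if n == 0 and k == 0:
                best = 0.0
            else:
                best = min(
                    gamma[n - 1][k - 1] if n > 0 and k > 0 else inf,
                    gamma[n - 1][k] if n > 0 else inf,
                    gamma[n][k - 1] if k > 0 else inf,
                )
            gamma[n][k] = d + best
    # Backtrack from the end corner to [0, 0]; each step moves to the cheapest
    # admissible predecessor, which respects continuity and monotonicity.
    n, k = n1 - 1, n2 - 1
    path = [(n, k)]
    while (n, k) != (0, 0):
        candidates = []
        if n > 0 and k > 0:
            candidates.append((gamma[n - 1][k - 1], n - 1, k - 1))
        if n > 0:
            candidates.append((gamma[n - 1][k], n - 1, k))
        if k > 0:
            candidates.append((gamma[n][k - 1], n, k - 1))
        _, n, k = min(candidates)
        path.append((n, k))
    path.reverse()
    # Normalization by the path length, following (4.38).
    return math.sqrt(gamma[n1 - 1][n2 - 1]) / len(path), path

a = [0, 0, 1, 1, 0]
b = [0, 1, 0]
dissimilarity, path = dtw(a, b)
```

For the two toy sequences the plateau of `a` is simply a stretched version of the peak of `b`, so the sketch returns a dissimilarity of 0.0 with a path of length 5, although the sequences have different lengths.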

Figure 4.14: Original sequences

Figure 4.15: Warping path

Figure 4.16: Warped sequences

Figures 4.14 to 4.16 show an example of DTW. The time series $s_1$ and $s_2$ in Fig. 4.14 both have length $n_{\mathrm{end},1} = n_{\mathrm{end},2} = 128$ and share a plateau as their common shape. The plateau of $s_1$, represented by the continuous line, is larger than the plateau of $s_2$, represented by the dashed line, which leads to a large Euclidean distance between the two time series. Using the dynamic programming procedure described by (4.39), the optimal warping path $W^\star_{\mathrm{DTW}}$, which is shown in Fig. 4.15, can be computed. Knowing $W^\star_{\mathrm{DTW}}$, the time axes can be locally stretched for each series in order to account for the similarity in shape. The result, showing both time series stretched to the length 193, is presented in Fig. 4.16. The complexity of DTW is $O(n_{\mathrm{end},1} n_{\mathrm{end},2})$, but the computation can be sped up slightly by limiting how far the path may depart from the diagonal of $D_{\mathrm{DTW}}$ [BC94]. The width of the window around the diagonal marking the subset of $D_{\mathrm{DTW}}$ which the path is allowed to visit is called the warping window. Besides reducing the complexity, the main advantage of introducing a warping window is the prevention of pathological warpings, where a small section of one sequence maps to a large section of the other. DTW can be extended to multivariate time series for measuring the dissimilarity between two scenarios $S_{m_1}$ and $S_{m_2}$ by computing the $(n,k)$-th entry in $D_{\mathrm{DTW}}$ as

$$d[n, k] = \sum_{\ell=1}^{L} \left( s_{m_1,\ell}[n] - s_{m_2,\ell}[k] \right)^2. \tag{4.40}$$
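The warping window and the multivariate local distance (4.40) can be combined in a small Python sketch. A scenario is represented as a list of $L$ equally long channels. The window is modeled here as a band $|n - k| \le r$ around the diagonal; this particular band shape, the parameter name `r`, and the function name are assumptions of the sketch, since the text only states that the path may not depart too far from the diagonal. The function returns the minimal cumulative distance in the end corner of the matrix.

```python
import math

def windowed_cumulative_cost(S1, S2, r):
    """Minimal cumulative distance for paths restricted to |n - k| <= r."""
    n1, n2 = len(S1[0]), len(S2[0])
    if abs(n1 - n2) > r:
        raise ValueError("window too narrow to reach the end corner")
    inf = math.inf
    gamma = [[inf] * n2 for _ in range(n1)]
    for n in range(n1):
        # only cells inside the warping window are visited
        for k in range(max(0, n - r), min(n2, n + r + 1)):
            # local distance (4.40): squared differences summed over channels
            d = sum((a[n] - b[k]) ** 2 for a, b in zip(S1, S2))
            if n == 0 and k == 0:
                best = 0.0
            else:
                best = min(
                    gamma[n - 1][k - 1] if n > 0 and k > 0 else inf,
                    gamma[n - 1][k] if n > 0 else inf,
                    gamma[n][k - 1] if k > 0 else inf,
                )
            gamma[n][k] = d + best
    return gamma[n1 - 1][n2 - 1]

S1 = [[0, 0, 1, 1, 0]]   # one channel, plateau late
S2 = [[0, 1, 1, 0, 0]]   # same plateau, one sample earlier
cost_no_warping = windowed_cumulative_cost(S1, S2, 0)
cost_window_one = windowed_cumulative_cost(S1, S2, 1)
```

With `r = 0` only the diagonal is admissible and the result is the squared Euclidean distance (here 2); a window of one sample already suffices to align the shifted plateaus (cumulative distance 0), while the number of visited cells grows only linearly with the sequence length.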

The rest of the procedure remains the same, i.e., the warping path must be calculated from $D_{\mathrm{DTW}}$ using (4.39) and the DTW dissimilarity results from (4.38). In [RM03] a method is presented to lower bound the DTW measure for multivariate time series, which enables indexing in large databases of multivariate time series. In some applications the property of DTW to focus only on the shape and to neglect differences in the duration of scenarios is not desired. Therefore, an extension of DTW is introduced in the following which also penalizes differences in the duration of the considered scenarios. This extension will be called Augmented Dynamic Time Warping (ADTW). The main idea underlying the extension is simple: in a first step the scenario $S = [s_1, \dots, s_L]^T \in \mathbb{R}^{L \times n_{\mathrm{end}}}$ is extended to the scenario

$$S_{\mathrm{ext}} = [s_1, \dots, s_L, s_{(L+1)}]^T \in \mathbb{R}^{(L+1) \times n_{\mathrm{end}}} \tag{4.41}$$

by including an additional univariate time series $s_{(L+1)} \in \mathbb{N}^{n_{\mathrm{end}}}$ which contains indices related to all samples in $S$,

$$s_{(L+1)} = [1, \dots, n_{\mathrm{end}}]^T. \tag{4.42}$$

Since it is assumed that all scenarios result from signals that are sampled at the same rate, $s_{(L+1)}[n]$ represents the duration of $S$ up to time instance $n$. If (4.40) is applied to two extended scenarios $S_{\mathrm{ext},m_1}$ and $S_{\mathrm{ext},m_2}$ in order to compute the DTW dissimilarity, the difference in the duration of $S_{m_1}$ and $S_{m_2}$ is also taken into account. However, working with $S_{\mathrm{ext},m_1}$ and $S_{\mathrm{ext},m_2}$ does not take into account that the univariate time series in one scenario may use different scales. Therefore, in a second step, the scenarios $S_{\mathrm{ext}}$ are transformed by normalization into scenarios $\check{S}_{\mathrm{ext}} \in \mathbb{R}^{(L+1) \times n_{\mathrm{end}}}$. The normalization is achieved by dividing each element $s_\ell[n]$ of the univariate time series $s_\ell$ by the value $s_\ell^{\mathrm{norm}}$, which represents the maximal absolute value in the $\ell$-th time series among all scenarios in the training set:

$$\check{s}_\ell[n] = \frac{1}{s_\ell^{\mathrm{norm}}} \, s_\ell[n], \quad \text{with} \quad s_\ell^{\mathrm{norm}} = \max_{\substack{m=1,\dots,M \\ n_m = 0,\dots,n_{\mathrm{end},m}-1}} \left\{ \left| s_{\ell,m}[n_m] \right| \right\}, \tag{4.43}$$

where $n_{\mathrm{end},m}$ denotes the length of the $m$-th scenario. The DTW dissimilarity of two scenarios $\check{S}_{\mathrm{ext},m_1} \in \mathbb{R}^{(L+1) \times n_{\mathrm{end},m_1}}$ and $\check{S}_{\mathrm{ext},m_2} \in \mathbb{R}^{(L+1) \times n_{\mathrm{end},m_2}}$, which is computed by searching for the smallest accumulated distance in the matrix $D_{\mathrm{DTW}}$ whose entries are computed according to (4.40), represents the ADTW dissimilarity of the scenarios $S_{m_1}$ and $S_{m_2}$, denoted by $d_{\mathrm{ADTW}}(S_{m_1}, S_{m_2})$. Figures 4.17, 4.18, 4.19, and 4.20 present an example which underlines the advantage of the ADTW dissimilarity measure for applications where the discriminative information between scenarios lies in the slope.

Figure 4.17: Original scenarios, $L = 1$ (axes: $s_m[n]$ over $n$)

Figure 4.18: Warped scenarios, $L = 1$ (axes: $s_m[n]$ over $n$)

Figure 4.19: Original scenarios, $L = 2$ (axes: $\check{s}_{1,m}[n]$, $\check{s}_{2,m}[n]$, $n$)

Figure 4.20: Warped scenarios, $L = 2$ (axes: $\check{s}_{1,m}[n]$, $\check{s}_{2,m}[n]$, $n$)

In Fig. 4.17 the scenario $s_1$, with $s_1[n] = 2n+1$, $n_{\mathrm{end},1} = 9$, is visualized by the dotted bright grey (green) line, the scenario $s_2$, with $s_2[n] = n+1$, $n_{\mathrm{end},2} = 19$, is visualized by the solid dark grey (blue) line, and the scenario $s_3$, with $s_3[n] = 0.5n+1$, $n_{\mathrm{end},3} = 38$,


is visualized by the dot-dashed dark grey (red) line. In order to compute the DTW dissimilarities $d_{\mathrm{DTW}}(s_1, s_2)$ and $d_{\mathrm{DTW}}(s_2, s_3)$ the series are warped as shown in Fig. 4.18, leading to the values $d_{\mathrm{DTW}}(s_1, s_2) = 0.50$ and $d_{\mathrm{DTW}}(s_2, s_3) = 0.24$. The DTW dissimilarity indicates that the scenarios $s_2$ and $s_3$ are closer to each other than $s_2$ is to $s_1$, although the difference in duration is much larger between $s_2$ and $s_3$ than between $s_2$ and $s_1$. If ADTW is used, on the other hand, one obtains the dissimilarity values $d_{\mathrm{ADTW}}(s_1, s_2) = 0.14$ and $d_{\mathrm{ADTW}}(s_2, s_3) = 0.25$. The extension of $s_1$, $s_2$ and $s_3$ to the scenarios $\check{S}_{\mathrm{ext},1}$, $\check{S}_{\mathrm{ext},2}$ and $\check{S}_{\mathrm{ext},3}$ is shown in Fig. 4.19. In Fig. 4.20 the warped scenarios as required for the computation of the ADTW dissimilarity are visualized. The ADTW dissimilarity between $s_1$ and $s_2$ results from summing up, for all indices $n$ in the warped representation, the distances $(\check{s}_{1,1}[n] - \check{s}_{1,2}[n])^2 + (\check{s}_{2,1}[n] - \check{s}_{2,2}[n])^2$. The ADTW dissimilarity between $s_2$ and $s_3$ is computed analogously. Thus, the differences in duration of the original signals are also penalized through the introduction of the second univariate time series $\check{s}_{2,m}[n]$, $m = 1, \dots, 3$. This way, ADTW captures not only similarity in shape but also differences in the duration of the considered scenarios. This property is important in many applications. The ADTW dissimilarity will be used in Subsection 7.2.2 for car crash classification.

Another group of distance measures that can be used to compare scenarios is based on the weighted Minkowski metrics. Weighted measures take into account that different parts of the time series may have different significance for the similarity. Thus, the $L_p$ norm can be extended to

$$L_p^{(w)}(s - s_k) = \left( \sum_{n=0}^{n_{\mathrm{end}}-1} w_n \left| s[n] - s_k[n] \right|^p \right)^{\frac{1}{p}}, \tag{4.44}$$

where $w_n$ denotes the weight at time stamp $n$. In practical applications the weights $w_n$ can be determined using the relevance feedback framework [KP98], an iterative procedure in which the user adjusts the weights to fit his or her sense of similarity. Weighted measures for time series have been used, for example, in [WFSP00].

Distance measures based on string matching first convert the time series into a sequence of letters from a predefined alphabet, e.g., using SAX, and then compute distances based on the string representation. The best known similarity measure for strings is the Levenshtein distance [Lev66]. The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other. Three operations are allowed: substitution, deletion, and insertion. For example, the Levenshtein distance between the strings “Saturday” and “Sunday” is 3, since one has to remove the letters “a” (the first operation) and “t” (the second operation) from the string “Saturday”, and finally the letter “r” must be substituted with “n”, which represents the third operation. The Levenshtein distance between two strings of length $n_{1,\mathrm{end}}$ and $n_{2,\mathrm{end}}$ can easily be determined by computing an $(n_{1,\mathrm{end}} + 1) \times (n_{2,\mathrm{end}} + 1)$ matrix $D_{\mathrm{Lv}}$ using dynamic programming. The $(n,k)$-th entry in $D_{\mathrm{Lv}}$ results from the relation

$$d_{\mathrm{Lv}}[n,k] = \min \begin{cases} d_{\mathrm{Lv}}[n-1,k-1] + 0 & \text{same letter} \\ d_{\mathrm{Lv}}[n-1,k-1] + 1 & \text{substitution} \\ d_{\mathrm{Lv}}[n-1,k] + 1 & \text{insertion} \\ d_{\mathrm{Lv}}[n,k-1] + 1 & \text{deletion} \end{cases}, \tag{4.45}$$

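which is initialized with $d_{\mathrm{Lv}}[0,0] = 0$. For the above example with the strings “Saturday” and “Sunday” the matrix $D_{\mathrm{Lv}}$ is

$$D_{\mathrm{Lv}} = \begin{pmatrix}
0 & 1 & 2 & 3 & 4 & 5 & 6 \\
1 & 0 & 1 & 2 & 3 & 4 & 5 \\
2 & 1 & 1 & 2 & 3 & 3 & 4 \\
3 & 2 & 2 & 2 & 3 & 4 & 4 \\
4 & 3 & 2 & 3 & 3 & 4 & 5 \\
5 & 4 & 3 & 3 & 4 & 4 & 5 \\
6 & 5 & 4 & 4 & 3 & 4 & 5 \\
7 & 6 & 5 & 5 & 4 & 3 & 4 \\
8 & 7 & 6 & 6 & 5 & 4 & 3
\end{pmatrix}. \tag{4.46}$$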
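The recursion (4.45) translates almost literally into code. The following Python sketch builds the full matrix for two strings; the function name `levenshtein_matrix` is chosen for illustration, and filling the first row and column with $0, 1, 2, \dots$ is the usual border convention, assumed here because the text only states the initialization of the corner element.

```python
def levenshtein_matrix(a, b):
    """(len(a) + 1) x (len(b) + 1) matrix of prefix distances, cf. (4.45)."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    for n in range(rows):
        D[n][0] = n              # border: distance to the empty prefix of b
    for k in range(cols):
        D[0][k] = k              # border: distance from the empty prefix of a
    for n in range(1, rows):
        for k in range(1, cols):
            cost = 0 if a[n - 1] == b[k - 1] else 1   # same letter or substitution
            D[n][k] = min(
                D[n - 1][k - 1] + cost,
                D[n - 1][k] + 1,
                D[n][k - 1] + 1,
            )
    return D

D_lv = levenshtein_matrix("Saturday", "Sunday")
distance = D_lv[-1][-1]          # bottom-right entry is the Levenshtein distance
```

For “Saturday” and “Sunday” the sketch reproduces the matrix in (4.46), with the distance 3 in the bottom-right corner.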

A relatively new distance for time series is the so-called Threshold Distance (TD) [AKK+06]. It is a useful distance for many applications in which the time instances when the amplitude exceeds or falls below some predefined threshold are of interest. Whereas the Minkowski metric and the DTW dissimilarity measure consider all values of the time series [see (4.32) and (4.38)], the TD focuses only on the relevant parts of the signals. Of course, setting the right threshold, which defines what “relevant” is, requires a priori knowledge about the application. Using the TD, two univariate time series are considered similar if their amplitudes exceed a threshold $\tau_{\mathrm{TD}}$ within similar time intervals. Thus, the exact values of the time series are not required; the TD rather measures whether the time series are above or below $\tau_{\mathrm{TD}}$ at similar time intervals. In [AKK+06] some applications are listed where this kind of similarity is required. For example, in the pharmaceutical industry it is of interest to compare, for different patients, the time intervals during which a blood value exceeds a critical threshold after a medical treatment. This can then be used to determine all patients who show a similar reaction to the treatment. Whereas the dimensionality reduction techniques presented in Section 4.1 lose the temporal information¹³ and are mainly applied for indexing as lower bounding distances for the Euclidean distance [see (4.12)], here the temporal information is maintained by reducing the signals to the time intervals when the amplitude lies above a threshold $\tau_{\mathrm{TD}}$. As an example, Figures 4.21 to 4.23 show the ECG signals of three persons,¹⁴ as well as the time intervals when the amplitudes of the signals exceed the threshold $\tau_{\mathrm{TD}} = 0.4$.
Figure 4.21: Low risk patient 1 (ECG signal, threshold $\tau_{\mathrm{TD}}$, intervals with amplitude $> \tau_{\mathrm{TD}}$)

Figure 4.22: Low risk patient 2 (ECG signal, threshold $\tau_{\mathrm{TD}}$, intervals with amplitude $> \tau_{\mathrm{TD}}$)

Figure 4.23: High risk patient 3 (ECG signal, threshold $\tau_{\mathrm{TD}}$, intervals with amplitude $> \tau_{\mathrm{TD}}$)

¹³ In the reduced space the time intervals when the original signals exceed a threshold cannot be determined. This is due to the fact that the presented dimensionality reduction techniques aggregate time series values over time. The dimensionality reduction that results when computing the TD aggregates the time series over the amplitude spectrum, thereby preserving the temporal information.

¹⁴ The ECG as well as other time series datasets have been kindly donated by Professor Eamonn Keogh. The reference to the datasets is: Keogh, E., Xi, X., Wei, L. and Ratanamahatana, C. A. (2006). The UCR Time Series Classification/Clustering Homepage: www.cs.ucr.edu/~eamonn/time_series_data/.

It can be seen that the patient with a high risk of cardiac infarction can be easily identified by comparing the time intervals when the amplitude lies above $\tau_{\mathrm{TD}}$, since these intervals are similar for the first two patients and quite distinct from those of the third patient. In order to quantify the subjective notions of “similar” and “distinct”, a distance measure for time intervals is needed. Before introducing such a distance measure, it must be specified how a time series can be transformed into a sequence of intervals using the threshold $\tau_{\mathrm{TD}}$. For this purpose, in [AKK+06] a threshold crossing time interval for the series $s$ is defined as a time interval

$$[n_l, n_u] \quad \text{with} \quad s[n_l - 1] \le \tau_{\mathrm{TD}}, \quad s[n_u + 1] \le \tau_{\mathrm{TD}}, \quad \text{and} \quad s[n] > \tau_{\mathrm{TD}}, \; n_l \le n \le n_u. \tag{4.47}$$

Using this definition one can transform the time series $s$ of length $n_{\mathrm{end}}$ into a sequence of threshold crossing time intervals

$$TS_{s,\tau_{\mathrm{TD}}} = \left\{ [n_{l,1}, n_{u,1}], \dots, [n_{l,P_s(\tau_{\mathrm{TD}})}, n_{u,P_s(\tau_{\mathrm{TD}})}] \right\}, \tag{4.48}$$

with $P_s(\tau_{\mathrm{TD}}) < n_{\mathrm{end}}$, where $P_s(\tau_{\mathrm{TD}})$ denotes the number of intervals into which $s$ has been decomposed using the threshold $\tau_{\mathrm{TD}}$. Among the various possibilities to define a similarity measure for time intervals, the Euclidean distance based on endpoints is presented in the following. Given the two threshold crossing time intervals $[n_{l,i}, n_{u,i}]$ and $[n_{l,j}, n_{u,j}]$, the distance is defined as

$$d_{\mathrm{int}}([n_{l,i}, n_{u,i}], [n_{l,j}, n_{u,j}]) = \sqrt{ (n_{l,j} - n_{l,i})^2 + (n_{u,j} - n_{u,i})^2 }. \tag{4.49}$$

Since a time series can be reduced according to (4.48) to a sequence of threshold crossing time intervals, it is now possible to develop a distance measure for time series based on (4.49). An intuitive measure of similarity between two sequences $TS_{s,\tau_{\mathrm{TD}}}$ and $TS_{q,\tau_{\mathrm{TD}}}$ can be defined using the sum of minimum distances. It is called the threshold distance between the time series $s$ and $q$:

$$\begin{aligned}
d_{\mathrm{TD}}(s, q) &= d_{\mathrm{TD}}(TS_{s,\tau_{\mathrm{TD}}}, TS_{q,\tau_{\mathrm{TD}}}) \\
&= \frac{1}{P_s(\tau_{\mathrm{TD}})} \sum_{i=1}^{P_s(\tau_{\mathrm{TD}})} \min_{j=1,\dots,P_q(\tau_{\mathrm{TD}})} \left\{ d_{\mathrm{int}}([n_{l,i}, n_{u,i}], [n_{l,j}, n_{u,j}]) \right\} \\
&\quad + \frac{1}{P_q(\tau_{\mathrm{TD}})} \sum_{j=1}^{P_q(\tau_{\mathrm{TD}})} \min_{i=1,\dots,P_s(\tau_{\mathrm{TD}})} \left\{ d_{\mathrm{int}}([n_{l,j}, n_{u,j}], [n_{l,i}, n_{u,i}]) \right\}.
\end{aligned} \tag{4.50}$$
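The three steps (4.47) to (4.50) can be sketched in Python as follows. The function names are illustrative. Two points are assumptions of this sketch and not taken from the text: an interval that touches the first or last sample of a series is closed at the boundary, and an error is raised if one of the series never exceeds the threshold, because (4.50) is undefined in that case.

```python
import math

def threshold_intervals(s, tau):
    """Threshold crossing time intervals (n_l, n_u) with s[n] > tau, cf. (4.47)."""
    intervals, start = [], None
    for n, value in enumerate(s):
        if value > tau and start is None:
            start = n                           # interval opens
        elif value <= tau and start is not None:
            intervals.append((start, n - 1))    # interval closes
            start = None
    if start is not None:                       # series ends above the threshold
        intervals.append((start, len(s) - 1))
    return intervals

def d_int(a, b):
    """Euclidean distance between the endpoints of two intervals, cf. (4.49)."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def threshold_distance(s, q, tau):
    """Sum of minimum distances between the interval sequences, cf. (4.50)."""
    ts_s, ts_q = threshold_intervals(s, tau), threshold_intervals(q, tau)
    if not ts_s or not ts_q:
        raise ValueError("each series must exceed the threshold at least once")
    part_s = sum(min(d_int(i, j) for j in ts_q) for i in ts_s) / len(ts_s)
    part_q = sum(min(d_int(j, i) for i in ts_s) for j in ts_q) / len(ts_q)
    return part_s + part_q

s = [0, 1, 1, 0, 0, 1, 0]
q = [0, 0, 1, 1, 0, 1, 1]
td = threshold_distance(s, q, 0.5)
```

In this toy example `s` is decomposed into the intervals (1, 2) and (5, 5) and `q` into (2, 3) and (5, 6). Each interval is matched to its closest counterpart, which gives a threshold distance of $1 + \sqrt{2}$.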

The basic idea of the threshold distance is to map every interval from one sequence to the closest interval of the other sequence and vice versa. This distance is also suited for measuring the similarity in shape: even if two time series with similar patterns are decomposed into a different number of threshold crossing time intervals, they will have a small distance because of the division by $P_s(\tau_{\mathrm{TD}})$ and $P_q(\tau_{\mathrm{TD}})$ on the right-hand side of (4.50). Another advantage of the threshold distance is that it mainly considers local similarity, in the sense that for each time interval only its closest counterpart from the other sequence is taken into account. For multivariate time series the threshold distance can be extended to a weighted sum of the TD of corresponding univariate time series. Thus, for the scenarios $S_{m_1}$ and $S_{m_2}$ the threshold distance is

$$d_{\mathrm{TD}}(S_{m_1}, S_{m_2}) = \sum_{\ell=1}^{L} d_{\mathrm{TD}}(s_{m_1,\ell}, s_{m_2,\ell}). \tag{4.51}$$

Construction of Prototypes

The standard approach for off-line classification is the One Nearest Neighbor (1NN) classifier based on an appropriate similarity measure for time series [RK04]. The main drawback of this procedure is its high spatial¹⁵ and computational complexity, since for a query scenario the distances to all examples in the training set must be computed. The approach presented in the following aims at reducing this complexity by reducing the scenarios from the training set to a small number of representative prototypes, so that the similarity to these prototypes can be used for the classification. In many cases an on-line time series classification task can be changed into an off-line time series classification task by using change detectors, which signal when a change in the targets $y[n]$ occurs. This way, a scenario $S$ can be divided into segments, such that it is sufficient to determine the label at the time instances determined by the change detectors. The on-line classification task is thereby reduced to the classification of time series segments, which is an off-line classification problem. Thus, in this subsection it is assumed that the target has the same label for all time instances $n = 0, \dots, n_{\mathrm{end}} - 1$ in a scenario $S$, keeping in mind that for on-line classification this corresponds to only one segment of a scenario. The approach of using change detectors to generate segments will be utilized in Subsection 7.1.3 for the classification of car crashes. The computation of class-specific prototypes must be performed according to the type of similarity that is best suited for the given application. The similarities in time and in shape are considered in the following for the construction of prototypes. The first step in computing class-specific prototypes is to determine clusters of scenarios belonging to the same class, such that a representative prototype can be computed for each cluster.
One possibility to perform this clustering is shown in Alg. 4.1. The procedure is similar to the clustering from Chapter 3, which is used to determine the centers for the GRBF classifier. Whereas in Chapter 3 the distance measure required to describe the similarity is based on the RF proximity, here the dissimilarity is computed using the measures for temporal data introduced in Section 4.4. An advantage of Alg. 4.1 is its ability to identify outliers: clusters which contain only one or a small number of scenarios can be removed and their elements marked as outliers. If there is no domain knowledge about which similarity measure is best suited for the given data, this can be determined by cross-validation using the 1NN classifier.

¹⁵ The whole training set must be stored.

Algorithm 4.1 Clustering for the computation of class-specific temporal prototypes
 1: determine the similarity measure that is best suited for the given data
 2: for all classes $c_k$ do
 3:    include all scenarios $S_m$ belonging to $c_k$ into the set $D(c_k)$
 4:    for all scenarios in $D(c_k)$ do
 5:       compute similarities to all other scenarios in $D(c_k)$
 6:    end for
 7:    while $D(c_k)$ not empty do
 8:       determine that scenario in $D(c_k)$ which has the largest number of similar scenarios in $D(c_k)$; segments are similar if their similarity exceeds the threshold $\tau_{\mathrm{sim}}$
 9:       let the scenario from Line 8 and all scenarios that are similar to it build a new cluster and remove them from $D(c_k)$
10:    end while
11: end for

It should be noted that the choice of $\tau_{\mathrm{sim}}$ is very important for the generation of prototypes since it determines the number of clusters. A large value of $\tau_{\mathrm{sim}}$ leads to a large number of prototypes, whereas a small value of $\tau_{\mathrm{sim}}$ leads to a small number of prototypes. Thus, $\tau_{\mathrm{sim}}$ is a parameter that determines the tradeoff between accuracy and complexity. The peculiarities of the most frequently used measure for the similarity in time, the Euclidean distance, and for the similarity in shape, the DTW dissimilarity, are discussed in the following.

The Euclidean distance and other Minkowski metrics from (4.32) and (4.34) assume that the time series are synchronized in time and of the same length. Therefore, in a first step one has to make sure that these requirements are met. Whereas the synchronization can be achieved by cross-correlation techniques in off-line classification, in on-line classification the detection of specific patterns in the signals can be used for synchronization. Concerning the requirement of having scenarios of the same length, one of the simplest techniques, which is adequate for many applications, is to truncate all scenarios that belong to a class to the length of the shortest scenario in this class. In a subsequent step, after the clustering has been performed according to Alg. 4.1, the length of the prototypes is set according to the mean length of the scenarios in the corresponding clusters.
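The inner part of Alg. 4.1, i.e., the clustering of the scenarios of one class, can be sketched in Python as follows. The function expects a similarity function and the threshold $\tau_{\mathrm{sim}}$ and returns clusters as lists of scenario indices. The names, the tie-breaking rule (lowest index wins), and the use of the negative Euclidean distance as a similarity in the toy example are assumptions made for illustration.

```python
import math

def cluster_scenarios(scenarios, similarity, tau_sim):
    """Greedy clustering of the scenarios of one class (Lines 3 to 10 of Alg. 4.1)."""
    remaining = set(range(len(scenarios)))
    # Lines 4-6: similarities between all pairs of scenarios in the class
    sim = {(i, j): similarity(scenarios[i], scenarios[j])
           for i in remaining for j in remaining if i != j}
    clusters = []
    while remaining:                                        # Line 7
        def neighbours(i):
            return [j for j in remaining if j != i and sim[(i, j)] > tau_sim]
        # Line 8: scenario with the largest number of similar scenarios
        center = max(sorted(remaining), key=lambda i: len(neighbours(i)))
        members = [center] + sorted(neighbours(center))
        clusters.append(members)                            # Line 9
        remaining -= set(members)
    return clusters

# Toy class with six short scenarios; similarity = negative Euclidean distance
toy = [[0, 0], [1, 0], [1, 0.5], [5, 5], [5, 6], [9, 9]]
clusters = cluster_scenarios(toy, lambda a, b: -math.dist(a, b), tau_sim=-1.5)
```

The toy data yield the clusters `[[0, 1, 2], [3, 4], [5]]`. The last cluster contains a single scenario and would be a candidate for removal as an outlier.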
With $M_{i,n}$ denoting the set of scenarios of length greater than or equal to $n$ which are included in the $i$-th cluster belonging to class $c_k$, the value $P_{\mathrm{Eucl},i}[n]$ of the prototype at time instance $n$ is computed according to

$$P_{\mathrm{Eucl},i}[n] = \frac{1}{|M_{i,n}|} \sum_{m \in M_{i,n}} S_m[n], \qquad n = 0, \dots, \tilde{\ell}_i - 1. \tag{4.52}$$

Here, $|M_{i,n}|$ is the cardinality of the set $M_{i,n}$, and the length $\tilde{\ell}_i$ of the prototype $P_{\mathrm{Eucl},i}$ is computed as the mean length of the scenarios in this cluster. Thus, assuming that all scenarios $S_m$ are realizations of i.i.d. random processes, $P_{\mathrm{Eucl},i}[n]$ represents the sample mean of the $(n+1)$-th random variable in the processes. In many applications DTW is better suited for the computation of the prototypes than the Euclidean distance. Here, the time series can have different lengths and do not have to be exactly synchronized, owing to the ability of the procedure to stretch the time axis. Therefore, Alg. 4.1 can be applied directly to obtain a class-specific clustering of the scenarios, without truncation. Unfortunately, the computation of prototypes is not as unambiguous as it was when using the Euclidean distance. In order to determine the prototypes it is important


to explain how the combination of two scenarios $S_{m_1}$ and $S_{m_2}$ can be accomplished in such a way that the shapes of both scenarios are preserved. First, the matrix $D_{\mathrm{DTW}}$ and the optimal warping path $W^\star_{\mathrm{DTW}}$ for the two scenarios must be computed according to (4.40) and (4.37). Each element $w^\star_{j,\mathrm{DTW}} = [n', k']$ of the warping path describes a mapping from the $n'$-th time stamp in $S_{m_1}$ to the $k'$-th time stamp in $S_{m_2}$. The combination $P_{m_1,m_2}$ of the two scenarios is computed based on the warping path. In a first step the intermediate combination $P'_{m_1,m_2}$ of the stretched scenarios $S_{m_1}$ and $S_{m_2}$ is computed according to

$$P'_{m_1,m_2}[j] = \frac{S_{m_1}[w^\star_{j,\mathrm{DTW}}(1)] + S_{m_2}[w^\star_{j,\mathrm{DTW}}(2)]}{2}. \tag{4.53}$$

Then, from $W^\star_{\mathrm{DTW}}$ a time index for the prototype value $P'_{m_1,m_2}[j]$ is computed as

$$\tilde{w}[j] = \left\lfloor \frac{w^\star_{j,\mathrm{DTW}}(1) + w^\star_{j,\mathrm{DTW}}(2)}{2} \right\rfloor. \tag{4.54}$$

The final shape-preserving combination $P_{m_1,m_2}$ of the two scenarios results after averaging the values in $P'_{m_1,m_2}$ that are assigned to the same index. The computation of a prototype for the $i$-th cluster requires a pairwise combination of the scenarios in this cluster. The resulting averaged scenarios are then combined again, and this procedure continues until a single scenario remains, the prototype $P_{\mathrm{DTW},i}$, which is a combination of all scenarios in the cluster. Unfortunately, the procedure is not unambiguous because the order in which the scenarios are combined slightly influences the shape of the resulting prototype. This can be seen by comparing Fig. 4.24 with Fig. 4.25. In both figures the same three univariate time series $s_1$, $s_2$ and $s_3$ are used for the computation of their shape-preserving prototype. The only difference is the order in which they are combined when computing the prototype. In Fig. 4.24, $s_1$ and $s_2$ are combined in a first step and the resulting time series is then used together with $s_3$ to build the prototype $p_{\mathrm{DTW},1}$, which is shown at the bottom. In Fig. 4.25, $s_2$ and $s_3$ are combined first. This leads to the prototype $p_{\mathrm{DTW},2}$, which is slightly different from $p_{\mathrm{DTW},1}$. However, both $p_{\mathrm{DTW},1}$ and $p_{\mathrm{DTW},2}$ can be used as prototypes since both capture the similarity in shape of $s_1$, $s_2$ and $s_3$.

Figure 4.24: Prototype No. 1 using DTW (panels from top to bottom: $s_1$, $s_2$, $s_3$, $p_{\mathrm{DTW},1}$)

Figure 4.25: Prototype No. 2 using DTW (panels from top to bottom: $s_2$, $s_3$, $s_1$, $p_{\mathrm{DTW},2}$)
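The pairwise, shape-preserving combination described by (4.53) and (4.54) can be sketched in Python for two univariate sequences. The warping path is passed in as a list of index pairs, so any DTW implementation can supply it. The function name `combine` and, in particular, the rounding of the averaged time index downwards are assumptions of this sketch.

```python
def combine(a, b, path):
    """Shape-preserving combination of two sequences along a warping path."""
    sums, counts = {}, {}
    for n, k in path:
        value = (a[n] + b[k]) / 2.0     # intermediate combination, cf. (4.53)
        index = (n + k) // 2            # time index, cf. (4.54), rounded down
        sums[index] = sums.get(index, 0.0) + value
        counts[index] = counts.get(index, 0) + 1
    # average all intermediate values that were assigned to the same index
    return [sums[i] / counts[i] for i in sorted(sums)]

a = [0, 0, 1, 1, 0]
b = [0, 1, 0]
# warping path of a and b, given explicitly for this toy example
path = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2)]
prototype = combine(a, b, path)
```

The resulting sequence `[0.0, 1.0, 1.0, 0.0]` keeps the common shape (a plateau between two zero levels) and has a length between those of the two inputs. Combining more than two sequences amounts to applying `combine` repeatedly, which is where the dependence on the order of combination arises.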

In a similar way prototypes can be computed for the ADTW dissimilarity measure. The only difference to the DTW technique is that, instead of starting the prototype generation with the scenarios $S_1, \dots, S_M$, one starts with the scenarios $\check{S}_{\mathrm{ext},1}, \dots, \check{S}_{\mathrm{ext},M}$, whose construction is described by (4.43). If the scenarios $S_m$ consist of $L$ univariate time series, the ADTW prototypes have $(L+1)$ univariate time series.
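The construction of the extended and normalized scenarios, i.e., (4.41) to (4.43), can be sketched in Python as follows. A scenario is represented as a list of $L$ equally long channels. The function name is illustrative, and falling back to a normalization constant of 1 for a channel that is zero everywhere is an assumption added to avoid a division by zero.

```python
def extend_and_normalize(scenarios):
    """Append the index channel (4.41)-(4.42) and normalize per channel (4.43)."""
    # channel L+1 holds the sample indices 1, ..., n_end of each scenario
    extended = [list(S) + [list(range(1, len(S[0]) + 1))] for S in scenarios]
    n_channels = len(extended[0])
    # maximal absolute value per channel over all scenarios of the training set
    s_norm = [max(abs(v) for S in extended for v in S[ch]) or 1.0
              for ch in range(n_channels)]
    return [[[v / s_norm[ch] for v in S[ch]] for ch in range(n_channels)]
            for S in extended]

# The three ramps used in the example of Figures 4.17 to 4.20
S1 = [[2 * n + 1 for n in range(9)]]
S2 = [[n + 1 for n in range(19)]]
S3 = [[0.5 * n + 1 for n in range(38)]]
normalized = extend_and_normalize([S1, S2, S3])
```

After this step every channel lies in the interval $[0, 1]$, and the additional channel of the shortest scenario ends at $9/38$ while that of the longest ends at 1, so that differences in duration contribute to the local distance (4.40).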


After the construction of $M_P$ prototypes, with $M_P \ll M$, the similarity to these prototypes can be used for classification, e.g., with the 1NN algorithm. Compared with a nearest neighbor classifier that uses all $M$ training scenarios, the main benefits of introducing prototypes are the reduced computational complexity and the possibility to identify outliers.
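The classification step with prototypes can be sketched as follows. This is only a minimal illustration, not the implementation used in the thesis: the function names, the `reject_above` threshold used to flag outliers, and the Euclidean stand-in for the DTW or ADTW dissimilarity are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Stand-in dissimilarity for equal-length series (DTW or ADTW would go here)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_with_prototypes(series, prototypes, dissimilarity=euclidean, reject_above=None):
    """1NN over prototypes: return (label, distance) of the closest prototype.

    prototypes is a list of (prototype_series, label) pairs. If reject_above is
    given and even the closest prototype is farther away than this value, the
    label None is returned to flag a possible outlier.
    """
    best_label, best_dist = None, math.inf
    for proto, label in prototypes:
        d = dissimilarity(series, proto)
        if d < best_dist:
            best_label, best_dist = label, d
    if reject_above is not None and best_dist > reject_above:
        return None, best_dist
    return best_label, best_dist


# Hypothetical prototypes for two classes.
prototypes = [
    ([0.0, 1.0, 2.0, 1.0, 0.0], "peak"),
    ([2.0, 1.0, 0.0, 1.0, 2.0], "valley"),
]
label, dist = classify_with_prototypes([0.1, 0.9, 2.1, 1.0, 0.0], prototypes)
outlier_label, _ = classify_with_prototypes([10.0] * 5, prototypes, reject_above=5.0)
```

Each query is compared with the $M_P$ prototypes only, instead of with all $M$ training scenarios, and a query that is far from every prototype can be rejected.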

4.5 Feature Generation for Time Series Classification

In Chapter 2 it has been emphasized that a classification model consists of an input space and a procedure to estimate the attribute $y$ associated with an element $x$ of the input space. The input space is constructed from a set of observable attributes which describe the process of interest and which are stored in the vector $v$.$^{16}$ The performance of the classification model depends not only on the procedure used to compute the estimate of $y$ but also on the choice of the input space, since each procedure uses a specific similarity measure which defines the neighborhood relations in the input space. Therefore, a main task when designing a classification model is the generation and identification of relevant features.

This also holds when the observations of the process of interest are organized as multivariate time series. The difficulties related to time series have been presented in Section 4.2, and the importance of changing the representation from elementary to high-level features has also been underlined. This change in the representation from $S^{[0,n]}$ to $x[n]$, which is described by the mapping in (4.10), is equivalent to the transition from the vector $v$ to the feature vector $x$. It aims at generating an input space where the posterior probability $p(y = c_k \mid x[n] = x[n])$, with $c_k$ being the class with the largest posterior $p(y = c_k \mid S^{[0,n]} = S^{[0,n]})$, is clearly dominant. However, a distinction between the elementary feature representation of a scenario $S$ and the random vector $v$ is necessary due to the temporal dimension in $S$. The temporal dimension and the correlation among the elements of $S$ permit the extraction of high-level features, which are more directly related to the output variable $y$ and are thus suited to define the input space for the classification model.
In this section, methods to generate high-level features for on-line classification from the elementary feature representation of time series are examined. The feature generation step is represented by the mappings

$$ f_{\mathrm{g}}^{[n]} : \mathbb{R}^{L \times (n+1)} \to \mathbb{R}^{\tilde{N}}, \quad S^{[0,n]} \mapsto \tilde{x}[n], \tag{4.55} $$

where the vector $\tilde{x}[n] \in \mathbb{R}^{\tilde{N}}$ contains high-level features. It is important to note that the dimensionality of the vector $\tilde{x}[n]$ does not change with time. If, for example, the $i$-th entry in $\tilde{x}[n]$ stands for the amplitude of the first local minimum in the $\ell$-th time series, its value will be zero until this local minimum is included in $S^{[0,n]}$. Since the dimensionality $\tilde{N}$ of $\tilde{x}[n]$ is normally large and many entries may contain redundant information, the feature selection step, which will be discussed in Section 6.3, reduces the vector $\tilde{x}[n]$ to the vector $x[n]$ with the smallest dimensionality $N$ that is required for an accurate classification.

As stated in Section 2.3, the feature generation step, and thereby the performance of the classification model, can be improved dramatically if a priori knowledge about the process that generated the data is available. If one has such domain knowledge, features should be constructed that capture the known properties of the data. In applications where a priori knowledge is not available, one approach is to construct a large set of possible templates for high-level features [Ric99, Ols01, KS05, MM05]. In [Ols01], for example, structure detectors are introduced which try to fit a function (e.g., constant, exponential, triangular) with free parameters to a time series such that the difference between the raw data and the function is minimized. All dimensionality reduction techniques discussed in Section 4.1 represent possible methods to generate high-level features, since all of them aim at aggregating elementary features in such a way that the reduced-dimensionality approximation still contains the characteristics of the original scenario.

The feature-generating preprocessing of time series can be divided into the two categories of local and global methods. Whereas global features at time stamp $n$ describe characteristics of the whole time series up to this point, local features are constructed from segments of the time series, which are normally defined by local time windows. Both methods will be described in more detail in Sections 4.5.1 and 4.5.2. In this thesis a different kind of feature will be introduced in addition, designed to take the time-specific nature of the data into account. These features will be called event-based features and are presented in Subsection 4.5.3. The main idea is to define some application-specific characteristics of the time series, called events, and to generate high-level features based on the occurrence of events.

$^{16}$ In the statistical framework used in this thesis, $v$, $x$ and $y$ are modeled as random variables.

4.5.1 Global Features

Given a scenario $S$ up to time stamp $n$, denoted by $S^{[0,n]}$, global features describe characteristics of this multivariate time series by using the information from the time interval $[0, n]$. For example, the global feature $x_g[n] \in \mathbb{R}$ may represent the maximum value in the means of all $L$ univariate time series $s_\ell^{[0,n]}$ in $S^{[0,n]}$ up to time instance $n$. By generating at each time stamp $n$ a vector $\tilde{x}_g[n] \in \mathbb{R}^{N_g}$ of global features, a transformation of $S^{[0,n]}$ to the finite-dimensional random process $\tilde{X}_g^{[0,n]}$ is performed:

$$ \mathbb{R}^{L \times (n+1)} \to \mathbb{R}^{N_g \times (n+1)}, \quad S^{[0,n]} \mapsto \tilde{X}_g^{[0,n]}, \tag{4.56} $$

where

$$ \tilde{X}_g^{[0,n]} = [\tilde{x}_g[0], \ldots, \tilde{x}_g[n]] \in \mathbb{R}^{N_g \times (n+1)}. \tag{4.57} $$

The advantage of this transformation is that $\tilde{x}_g[n]$, unlike the vector $s[n]$, aggregates the information contained in $S^{[0,n]}$, allowing for a more reliable on-line classification of the scenario $S$ in the input space $\mathbb{R}^{N_g}$. Since the number of global features that can be extracted from $S^{[0,n]}$ is unlimited, the remainder of this subsection presents a selection of useful global features. Useful features are in general built from an aggregation of values in the original time series, since aggregation increases the robustness to noise.

A large number of features are based on the frequency-domain characteristics of time series. In Section 4.1 the DFT and the DWT have already been introduced as methods to obtain frequency-domain information about time series. Such a change of representation makes it possible to identify dominant components in the frequency domain and to use them as global features in the vector $\tilde{x}[n]$. The DFT of $S^{[0,n]}$ leads to $(n + 1)$ frequency components for each of the $L$ time series $s_\ell^{[0,n]}$ in $S^{[0,n]}$. If it is known, a priori or from a spectral analysis of the signals,$^{17}$ that only certain frequencies in $s_\ell^{[0,n]}$ are relevant for the classification, one can keep only these frequency components and use them as global features for the vector $\tilde{x}_g[n]$. For example, the $q_1$-th entry in $\tilde{x}_g[n]$ can be defined to be the amplitude of the $k$-th frequency component in $s_\ell^{[0,n]}$:

$$ \tilde{x}_{g,q_1}[n] = \left| \sum_{n'=0}^{n} s_\ell[n'] \, e^{-\frac{2\pi j}{n+1} n' k} \right|, \quad k = 0, \ldots, n. \tag{4.58} $$

$^{17}$ An analysis of the class-specific histograms of the $M$ time series $s_\ell$ from the training set may indicate which frequency components are relevant for the discrimination among the classes.

Another expressive global feature can be constructed by considering the sum of the squared amplitudes of the frequency components in the signal between two frequencies which are specified by the indices $k_1$ and $k_2$:

$$ \tilde{x}_{g,q_2}[n] = \sum_{k=k_1}^{k_2} \left( \left| \sum_{n'=0}^{n} s_\ell[n'] \, e^{-\frac{2\pi j}{n+1} n' k} \right|^2 \right). \tag{4.59} $$
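The two DFT-based features in (4.58) and (4.59) can be sketched in a few lines. The sketch evaluates the defining sums directly with the standard library; the function names and the test signal are assumptions made for this illustration.

```python
import cmath
import math


def dft_coefficient(s, k):
    """k-th DFT coefficient of the samples s[0..n] (the inner sum of (4.58))."""
    length = len(s)  # n + 1 samples
    return sum(v * cmath.exp(-2j * cmath.pi * m * k / length) for m, v in enumerate(s))


def dft_amplitude(s, k):
    """Amplitude of the k-th frequency component, as in (4.58)."""
    return abs(dft_coefficient(s, k))


def dft_band_energy(s, k1, k2):
    """Sum of squared amplitudes for the indices k1..k2, as in (4.59)."""
    return sum(abs(dft_coefficient(s, k)) ** 2 for k in range(k1, k2 + 1))


# Hypothetical example: 16 samples of a cosine with exactly two periods.
signal = [math.cos(2 * math.pi * 2 * m / 16) for m in range(16)]
amplitude_k2 = dft_amplitude(signal, 2)
energy_band = dft_band_energy(signal, 1, 3)
```

For this signal the energy is concentrated in the component $k = 2$ (and its mirror component), so the band from $k_1 = 1$ to $k_2 = 3$ captures it while neighboring indices contribute almost nothing.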

Although the features generated with the DFT can be meaningful, because the whole available temporal information is used for each frequency component, a drawback of this method is its high computational cost, since the transformation must be computed for a growing sequence at each time stamp $n$. Furthermore, the DFT is not well suited for non-stationary signals, because all temporal information is lost through the transformation. Therefore, more local spectral feature generation methods such as the short-time Fourier transform or the DWT are used for non-stationary time series.

A feature that is robust to noise can be obtained by computing the cumulative sum for each time series $s_\ell^{[0,n]}$:

$$ CS_\ell[n] = CS_\ell[n-1] + (s_\ell[n] - \mu_{CS,\ell}), \tag{4.60} $$

with $CS_\ell[-1] = 0$ and $\mu_{CS,\ell}$ being a reference threshold value, e.g., the mean value of the $\ell$-th time series computed from the training set. By setting $\tilde{x}_{g,q_3}[n] = CS_\ell[n]$ one obtains a global feature that separates general trend changes from noise, since the random noise cancels out, having as many positive as negative values. A consistent change in $s_\ell$ appears as a gradual departure from zero in $CS_\ell[n]$, so that this feature is suited to detect not only sharp changes in time series but also gradual ones. The cumulative sum has been used with success in applications such as burst detection [SS07] and can be obtained at low computational cost.

Other global features that are easy to compute and can be extracted from $s_\ell^{[0,n]}$ are the squared $L_2$ norm of the signal up to time stamp $n$,

$$ \tilde{x}_{g,q_4}[n] = \sum_{n'=0}^{n} |s_\ell[n']|^2, \tag{4.61} $$

or its mean per sample,

$$ \tilde{x}_{g,q_5}[n] = \frac{1}{n+1} \sum_{n'=0}^{n} |s_\ell[n']|^2. \tag{4.62} $$

One can also use the features from (4.61) and (4.62) for multivariate time series $S^{[0,n]}$ by summing up the values from all $L$ time series.
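Because (4.60) to (4.62) are running sums, they can be updated with constant effort per time stamp. The generator below is a minimal sketch of this for one univariate time series; the name `running_global_features` and the example values are assumptions, and `mu_cs` stands for the reference value $\mu_{CS,\ell}$, which would normally come from the training set.

```python
def running_global_features(s, mu_cs):
    """Yield (CS[n], energy[n], mean_energy[n]) for n = 0, 1, ... following (4.60)-(4.62)."""
    cs, energy = 0.0, 0.0  # CS[-1] = 0 and an empty energy sum
    for n, value in enumerate(s):
        cs += value - mu_cs                 # cumulative sum, (4.60)
        energy += abs(value) ** 2           # squared L2 norm up to n, (4.61)
        yield cs, energy, energy / (n + 1)  # mean per sample, (4.62)


# Hypothetical series whose level changes from 1 to 3 at n = 3; reference value 1.
features = list(running_global_features([1, 1, 1, 3, 3, 3], mu_cs=1.0))
cusum = [f[0] for f in features]
```

In this example the cumulative sum stays at zero while the series matches the reference value and then departs from zero step by step once the level has changed.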


A group of global features is based on the computation of estimates of statistical moments. For example, the mean value

$$ \tilde{x}_{g,q_6}[n] = \frac{1}{n+1} \sum_{n'=0}^{n} s_\ell[n'] \tag{4.63} $$

represents such a feature. It should be noted that for non-stationary time series $\tilde{x}_{g,q_6}[n]$ is not the sample mean of a random variable but must rather be interpreted as an aggregated value of a realization of the finite-length random process $s_\ell^{[0,n]}$.

Especially in applications where the crossing of certain amplitude thresholds contains discriminative information about the classes, the extremal values up to time stamp $n$ may represent very useful global features. For example, $\tilde{x}_{g,q_7}[n]$ can be defined as

$$ \tilde{x}_{g,q_7}[n] = \max_{n'} \{ s_\ell[n'] \}, \quad n' = 0, \ldots, n. \tag{4.64} $$
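The mean in (4.63) and the maximum in (4.64) can likewise be tracked on-line. The sketch below is only an illustration with assumed names and data; it also shows how differently the two features react to a single spike in the series.

```python
def running_mean_and_max(s):
    """Yield (mean[n], max[n]) over s[0..n], i.e. the features (4.63) and (4.64)."""
    total, peak = 0.0, float("-inf")
    for n, value in enumerate(s):
        total += value
        peak = max(peak, value)
        yield total / (n + 1), peak


# Hypothetical series with one isolated spike at n = 2.
spiky = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0]
trace = list(running_mean_and_max(spiky))
final_mean, final_max = trace[-1]
```

After the spike the running maximum stays at 9 for all later time stamps, whereas the running mean returns towards the level of the remaining samples.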

Unlike the other features presented in this subsection, which are built from time averages of the time series, the use of extremal values in $s_\ell^{[0,n]}$ leads to features which are very sensitive to noise. Therefore, it is advisable to filter the scenario $S^{[0,n]}$ before looking for extremal values. Noise-removing filters can range from low-pass Finite Impulse Response (FIR) filters to more complicated procedures such as filterbanks.

4.5.2 Local Features

In non-stationary time series and in on-line classification tasks the use of global features may not be appropriate, because it is rather local properties of the time series that indicate that a change in the label has occurred. Given a scenario $S$ up to time stamp $n$, denoted by $S^{[0,n]}$, local features describe characteristics of this multivariate time series by using only the information in $S^{[0,n]}$ from time intervals of limited extent. A distinction must be made between time windows of fixed length and windows of variable length. Time windows of variable length are mainly used in applications where some a priori knowledge about the data is available, so that the length of the windows can be set according to the considered problem. This is the main idea of the event-based features from the next subsection, where the occurrence of certain characteristics in the time series is used to define the localized time windows of interest. In this subsection the extraction of features from time windows $w_i$ of length $\tilde{\ell}_i$ is presented. An example of a time window $w_i$ is visualized in Fig. 4.26.

Figure 4.26: Time window $w_i$ of fixed length $\tilde{\ell}_i$ (axis labels: $0$, $\tilde{\ell}_i - 1$, $n$)

Let $S^{[n_1,n_2]}$, with $n_2 = n_1 + \tilde{\ell}_i - 1$, denote the segment of the scenario $S$ which is taken into account at time instance $n$ due to the use of time window $w_i$. From $S^{[n_1,n_2]}$ a large number of local features can be extracted. Assuming that only $N_{l,i}$ features are generated from the segment $S^{[n_1,n_2]}$, this can be written as the mapping

$$ \mathbb{R}^{L \times \tilde{\ell}_i} \to \mathbb{R}^{N_{l,i}}, \quad S^{[n_1,n_2]} \mapsto \tilde{x}_{l,i}[n], \tag{4.65} $$

where the vector $\tilde{x}_{l,i}[n]$ contains local features. If not just one but $q_w$ time windows are used, then one obtains $N_l = \sum_{i=1}^{q_w} N_{l,i}$ local features at each time instance and, similar to (4.56), a transformation of $S^{[0,n]}$ to the finite-dimensional random process for local features, $\tilde{X}_l^{[0,n]}$, is realized:

$$ \mathbb{R}^{L \times (n+1)} \to \mathbb{R}^{N_l \times (n+1)}, \quad S^{[0,n]} \mapsto \tilde{X}_l^{[0,n]}, \tag{4.66} $$

where

$$ \tilde{X}_l^{[0,n]} = [\tilde{x}_l[0], \ldots, \tilde{x}_l[n]] \in \mathbb{R}^{N_l \times (n+1)}, \tag{4.67} $$

with $\tilde{x}_l[n] = [\tilde{x}_{l,1}^T[n], \ldots, \tilde{x}_{l,q_w}^T[n]]^T \in \mathbb{R}^{N_l}$. As in the previous subsection, the extraction of local features from only one of the $L$ time series in $S^{[0,n]}$ is considered. This represents a restriction, but by using a priori knowledge to generate a new time series that combines the time series $s_\ell^{[0,n]}$, one can also consider properties of the whole scenario rather than of a single univariate time series when constructing features for the vector $\tilde{x}_l[n]$.

A method to extract features from the frequency domain for non-stationary signals is the Short-Time Fourier Transform (STFT). Unlike the DFT, the STFT produces a time-frequency representation of the signal by mapping the time series $s_\ell^{[0,n]}$ into a two-dimensional time-frequency plane. The main idea is to assume that the non-stationary time series $s_\ell^{[0,n]}$ is stationary when seen through the window $w_i$. The Fourier transform of the windowed signal leads to the STFT

$$ \mathrm{STFT}[n, k] = \sum_{n'=0}^{\tilde{\ell}_i - 1} s_\ell[n + n' - \tilde{\ell}_i + 1] \, w_i[n'] \, e^{-\frac{2\pi j}{\tilde{\ell}_i} n' k}. \tag{4.68} $$
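The windowed transform in (4.68) can be sketched by evaluating the sum over the last $\tilde{\ell}_i$ samples directly. The function names, the rectangular window and the test signal are assumptions made for this illustration; a band energy is added as one possible feature built from the squared amplitudes of the coefficients.

```python
import cmath
import math


def stft_coefficient(s, n, window, k):
    """STFT coefficient (4.68) at time stamp n and frequency index k.

    window holds the taps w_i[0], ..., w_i[length - 1]; the last `length`
    samples of s up to and including s[n] are used.
    """
    length = len(window)
    start = n - length + 1
    if start < 0:
        raise ValueError("window does not fit into s[0..n] yet")
    return sum(
        s[start + m] * window[m] * cmath.exp(-2j * cmath.pi * m * k / length)
        for m in range(length)
    )


def stft_band_energy(s, n, window, k1, k2):
    """Sum of squared STFT amplitudes at time stamp n for the indices k1..k2."""
    return sum(abs(stft_coefficient(s, n, window, k)) ** 2 for k in range(k1, k2 + 1))


# Hypothetical non-stationary signal: silence followed by an oscillation.
rect = [1.0] * 8  # rectangular window, chosen only for simplicity
signal = [0.0] * 16 + [math.cos(2 * math.pi * 2 * m / 8) for m in range(16)]
quiet = stft_band_energy(signal, 15, rect, 1, 3)   # window covers the silent part
active = stft_band_energy(signal, 31, rect, 1, 3)  # window covers the oscillation
```

Unlike a DFT over the whole sequence, the feature is zero while the window only covers the silent part and becomes large once the oscillation enters the window, so the time at which the spectral content appears is preserved.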

The window $w_i$ is in general chosen to meet the needs of the considered application, i.e., time-frequency localization$^{18}$ or computational efficiency. The $q_1$-th feature in $\tilde{x}_l[n]$ can be constructed by using the STFT and computing the sum of the squared amplitudes of the frequency components in the windowed time series between two frequencies which are specified by the indices $k_1$ and $k_2$:

$$ \tilde{x}_{l,q_1}[n] = \sum_{k=k_1}^{k_2} \left( \left| \sum_{n'=0}^{\tilde{\ell}_i - 1} s_\ell[n + n' - \tilde{\ell}_i + 1] \, w_i[n'] \, e^{-\frac{2\pi j}{\tilde{\ell}_i} n' k} \right|^2 \right). \tag{4.69} $$

$^{18}$ A Gaussian window is the only window that meets the Heisenberg bound with equality, and the STFT with this special choice of the window is called the Gabor transform [Gab46].

Another spectral feature generation method is the use of filterbanks, which can be utilized to implement the DWT. The idea of using the wavelet decomposition for time series classification has already been adopted in some works. For example, in [SC94] and [Eng98] the authors describe methods to construct wavelet bases which maximize the class separability. However, these works do not deal with the on-line classification task.

Unlike for the STFT, the fundamental property of the DWT is that the time resolution and the frequency resolution vary in the time-frequency plane. Given a time window $w_{i'}$ that cuts the last $\tilde{\ell}_{i'} > \tilde{\ell}_i$ values out of $s_\ell^{[0,n]}$, the resulting signal $s_\ell^{[n-\tilde{\ell}_{i'}+1,n]}$ can be decomposed in the time-frequency plane either with the STFT or with the DWT, so that the transformation coefficients represent local features. Whereas for the STFT one must choose an appropriate window width $\tilde{\ell}_i$ by hand, leading to a fixed segmentation of the time-frequency plane, the DWT allows for a more flexible time-frequency decomposition of $s_\ell^{[n-\tilde{\ell}_{i'}+1,n]}$. This is shown in Figures 4.27 and 4.28. It should be noted that it is not necessary to compute the STFT at every time stamp to obtain a decomposition of the signal containing the whole information in $s_\ell^{[n-\tilde{\ell}_{i'}+1,n]}$; it is enough to compute the STFT every $\tilde{\ell}_i$ time instances.

Figure 4.27: Time-Frequency plane STFT (axes: time, frequency)

Figure 4.28: Time-Frequency plane DWT (axes: time, scale)

A very efficient way to implement the DWT is the use of filterbanks. A lowpass FIR filter with impulse response $g[n']$ and a highpass FIR filter with impulse response $h[n']$ are chosen in such a way that, after downsampling the output of each filter, the original signal can still be reconstructed. The most common downsampling factor is 2, leading to the so-called dyadic DWT. A filterbank showing the first three stages of a dyadic DWT is presented in Fig. 4.29. Each output of the filterbank is a DWT coefficient and represents the value of $s_\ell^{[n-\tilde{\ell}_{i'}+1,n]}$ in a slice of the time-frequency plane from Fig. 4.28.

Figure 4.29: Filterbank for the dyadic DWT (input $s_\ell[n']$; three stages with the filters $h[n']$ and $g[n']$, each followed by downsampling by 2)

To ensure that the signal can be perfectly reconstructed, i.e., that no information is lost by the decomposition, the DWT filters $g[n']$ and $h[n']$ are frequently chosen to be quadrature mirror filters. This can be realized by taking an FIR filter with an even number $N_{\mathrm{DWT}}$ of taps for the lowpass filter $g[n']$ fulfilling

$$ \sum_{n'=0}^{N_{\mathrm{DWT}}-1} g[n'] \, g[n' - 2k] = \delta(k, 0) \tag{4.70} $$

and setting the taps of the highpass filter to be [SB84, Rie97]

$$ h[n'] = (-1)^{n'} g[N_{\mathrm{DWT}} - 1 - n']. \tag{4.71} $$
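The relations (4.70) and (4.71) and one possible form of the dyadic filterbank can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the thesis: the function names are hypothetical, the Haar lowpass filter is used only as the simplest example of a filter satisfying (4.70), and the window is extended periodically at its borders so that no coefficients are lost.

```python
import math


def satisfies_4_70(g, tol=1e-12):
    """Check the condition (4.70): sum_n g[n] g[n - 2k] = delta(k, 0)."""
    taps = len(g)
    for k in range(taps // 2):
        acc = sum(g[m] * g[m - 2 * k] for m in range(2 * k, taps))
        if abs(acc - (1.0 if k == 0 else 0.0)) > tol:
            return False
    return True


def highpass_from_lowpass(g):
    """Relation (4.71): h[n] = (-1)^n g[N_DWT - 1 - n]."""
    taps = len(g)
    return [(-1) ** m * g[taps - 1 - m] for m in range(taps)]


def analysis_stage(x, g, h):
    """One filterbank stage: filter with g and h, keep every second output.

    Assumption: indices wrap around (periodic extension of the window).
    """
    size = len(x)
    low = [sum(g[m] * x[(2 * i + m) % size] for m in range(len(g))) for i in range(size // 2)]
    high = [sum(h[m] * x[(2 * i + m) % size] for m in range(len(h))) for i in range(size // 2)]
    return low, high


def dyadic_dwt(window, g, stages):
    """Apply `stages` lowpass/highpass splits; the window length must be divisible by 2**stages."""
    h = highpass_from_lowpass(g)
    approx, details = list(window), []
    for _ in range(stages):
        approx, detail = analysis_stage(approx, g, h)
        details.append(detail)
    return approx, details


haar = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
window = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
approx, details = dyadic_dwt(window, haar, stages=3)
energy_out = sum(a * a for a in approx) + sum(d * d for det in details for d in det)
```

With a filter pair that satisfies (4.70) and (4.71), the coefficients carry the same energy as the window, which is one way to check numerically that no information is lost by the decomposition. The three stages deliver 4, 2 and 1 detail coefficients for a window of length 8, which illustrates the different rates at the filterbank outputs.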

Local features for on-line time series classification result from computing the DWT coefficients of $s_\ell^{[n-\tilde{\ell}_{i'}+1,n]}$. In off-line classification, similar to signal compression, a frequently adopted technique is to compute the DWT of the whole signal $s_\ell$ and to keep only the largest coefficients. For on-line classification it is useful to work with time windows $w_i$ of length $\tilde{\ell}_i$ and to decide in the feature selection step which positions in the reduced time-frequency plane are relevant for the given application. In a dyadic DWT, $\tilde{\ell}_i$ must be chosen to be a power of 2. It should also be noted that, due to the downsampling operations, the rate at the outputs differs from the rate at the input of the filterbank. When constructing local features for the vector $\tilde{x}_l[n]$, this can easily be handled by holding the value of each filterbank output until a new one is computed.

Another class of local features can be constructed by so-called structure detectors. A structure detector identifies a particular structure in a time series, e.g., constant, linear, exponential, sinusoidal, triangular or trapezoidal, and generates values that describe the identified structure. In [Ols01] structure detectors for off-line classification are chosen in such a way that the sum of squared errors between a time series and its approximation, which is constructed from the identified structures, is minimized. In order to apply this idea to on-line classification one can cut out the part $s_\ell^{[n_1,n_2]}$ from $s_\ell^{[0,n]}$ by using a time window $w_i$ and try to fit this part with a predefined function containing free parameters. For example, $s_\ell^{[n_1,n_2]}$ can be fitted by a linear function as in PLA. The coefficients are then included in the feature vector $\tilde{x}_l[n]$. The main drawback of this kind of local feature is the sensitivity to the choice of the time window $w_i$, which leads to the segmentation problem mentioned in Section 4.2.

There is a huge number of other local features that may be useful in certain applications.
For example, all global feature construction methods mentioned in the previous subsection can be used to compute local features, such as extremal points in a time window of the filtered time series or estimates of statistical moments assuming stationarity in a given time window. All these local features can then be integrated in the vector $\tilde{x}_l[n]$.

Finally, after generating $\tilde{N} = N_g + N_l$ global and local features one obtains the vectors

$$ \tilde{x}[n] = [\tilde{x}_g^T[n], \tilde{x}_l^T[n]]^T \in \mathbb{R}^{\tilde{N}} \tag{4.72} $$

which can now be used as input to a classifier. Hereby, a transformation of $S^{[0,n]}$ to the finite-dimensional random process

$$ \tilde{X}^{[0,n]} = \begin{bmatrix} \tilde{X}_g^{[0,n]} \\ \tilde{X}_l^{[0,n]} \end{bmatrix} \in \mathbb{R}^{\tilde{N} \times (n+1)} \tag{4.73} $$

is performed.

4.5.3 Event-Based Features

Event-based features offer the opportunity to define application-specific features for time series classification and thereby to include a priori knowledge in the classification system. The characteristic of event-based features is that they imply a detection step, and thus their generation requires a classifier. In addition to the possibility of creating expressive and class-specific features, a main advantage is the interpretability that can be achieved with event-based features.


In this thesis an event $E_j$ is defined as a mapping

$$ E_j : \mathbb{R}^{L \times (n+1)} \to \mathbb{N}_0, \quad S^{[0,n]} \mapsto \begin{cases} 0 & \text{signal property associated with } E_j \text{ not yet identified} \\ n_j & \text{signal property associated with } E_j \text{ firstly identified at time stamp } n_j. \end{cases} \tag{4.74} $$

Each event $E_j$ is associated with a predefined signal property. If this signal property has not been detected in $S^{[0,n]}$, the event generates the output 0; otherwise it outputs the time instance at which the signal property was first identified in $S^{[0,n]}$. In the following, the expression “the event occurs at time instance $n_j$” refers to the fact that the signal property associated with $E_j$ was first identified at time stamp $n_j$. Examples of such signal properties are the exceeding of a certain threshold by a value that is computed from a realization of $S^{[0,n]}$, or a characteristic pattern in this realization. The occurrence of an event can either be incorporated directly in the feature vector $\tilde{x}[n]$, e.g., in the form of the time instance $n_j$, or events can be used as aids, e.g., by specifying a time window $w_i$, to construct new and more expressive features.

A very useful application of identifying events in time series is the assessment of the origin. As already mentioned, the finite-dimensional random process $S$ of length $n_{\mathrm{end}}$ is usually only a part of a longer random process, and in many applications the assessment of the origin is based on the occurrence of events. For example, in monitoring systems the exceeding of thresholds is often used as the signal property which triggers the subsequent observation process. It is clear that events associated with such signal properties are of great importance, since their correct identification determines the performance of the classification model.

The detection of characteristic signal properties is performed by specialized classifiers. Thus, as in every classification task, the detection of events must cope with the bias-variance tradeoff.
To keep both terms small, domain knowledge should be used to define suitable application-specific signal characteristics, so that the detection step can be implemented with low bias by simple classifiers, e.g., linear classifiers, which also come with a low variance. This is the strategy that is applied in Section 7.1.

Since event-based features are application-specific, there is no general rule for defining them. However, some properties are desirable. Firstly, the identification of signal properties associated with events should be robust to noise in $S^{[0,n]}$. To achieve this, the signal characteristics used for the definition of events should be constructed from preprocessed time series, e.g., from the signals that result after a low-pass filtering of $S^{[0,n]}$. Secondly, the classifiers used for identifying the signal properties associated with an event must have a low generalization error, i.e., low bias and variance. As mentioned above, this can be achieved by using a priori knowledge about the considered application. Thirdly, if interpretability is an issue for the classification model, the events should be defined in such a way that they can be easily understood by humans.

In the following, some possibilities to incorporate events in the feature construction process are presented. One approach to using events in on-line classification is to define time windows $w_j$ between the occurrence of events and to extract local features from these windows, as presented in the previous subsection. Unlike in Subsection 4.5.2, the time windows used here are random variables, i.e., they are of variable length and data-adaptive. Therefore, the local features generated this way are more flexible in capturing the characteristics of the considered application. An example is shown in Figures 4.30 and 4.31, where the time window $w_j$ is defined by two events occurring at time instances $n_{j_1}$ and $n_{j_2}$. In this example the events are based on a univariate random process which is constructed from $S^{[0,n]}$, typically by a mapping similar to those used for the local or global features. Thus, signal properties of a realization of the scenario $S^{[0,n]}$ can be defined based on signal properties of the corresponding realization of the newly constructed univariate random process. In the example the events $E_{j_1}$ and $E_{j_2}$ occur when a realization of the univariate random process crosses the two thresholds $\tau_1$ and $\tau_2$ for the first time, as shown in Fig. 4.30. Given a time window $w_j$ defined by the events $E_{j_1}$ and $E_{j_2}$, local features such as those from Subsection 4.5.2 can be extracted from $S^{(w_j)} = S^{[n_{j_1},n_{j_2}]}$.

Figure 4.30: Events defined by thresholds (thresholds $\tau_1$ and $\tau_2$; time stamps $n_{j_1}$ and $n_{j_2}$ on the time axis $n$)

Figure 4.31: Window $w_j$ of variable length (between $n_{j_1}$ and $n_{j_2}$ on the time axis $n$)

Another approach to using events in order to classify time series is based on the temporal relation between the time instances at which they occur. One way to quantify the temporal relation between events and to create features from it is to consider the time intervals that are defined by them. Based on these time intervals it is possible to construct features that represent the temporal succession of the occurrence of events. The number of possible temporal relations between events grows exponentially with the number of events. Therefore, domain knowledge should be used in order to keep the number of events low and to consider only those sequences of events that may occur in the given application. In the following, the construction of features from two, three and four events is presented, assuming that the events $E_{j_1}$, $E_{j_2}$, $E_{j_3}$ and $E_{j_4}$ are chosen in such a way that$^{19}$

$$ n_{j_1} < n_{j_2} < n_{j_3} < n_{j_4}. \tag{4.75} $$

For two events $E_{j_1}$ and $E_{j_2}$ which occurred at time stamps $n_{j_1}$ and $n_{j_2}$, the $q_2$-th entry in $\tilde{x}_l[n]$ can be set to

$$ \tilde{x}_{l,q_2}[n] = n_{j_2} - n_{j_1} + 1, \tag{4.76} $$

and interpreted as the duration between the events $E_{j_1}$ and $E_{j_2}$. For three events $E_{j_1}$, $E_{j_2}$ and $E_{j_3}$ which occurred at time stamps $n_{j_1}$, $n_{j_2}$ and $n_{j_3}$, the intervals defined by the events, which can be used for the construction of features that reflect their temporal succession, are shown in Fig. 4.32. Thus, the $q_3$-th entry in $\tilde{x}_l[n]$ can be chosen to be

$$ \tilde{x}_{l,q_3}^2[n] = (n_{j_3} - n_{j_2})^2 + (n_{j_3} - n_{j_1})^2 \tag{4.77} $$

or

$$ \tilde{x}_{l,q_3}[n] = n_{j_3} - \frac{n_{j_2} + n_{j_1}}{2}. \tag{4.78} $$

$^{19}$ This can be achieved by defining as events the crossing of thresholds by a monotonically increasing time series that can be computed from a realization of $S^{[0,n]}$, e.g., the accumulated energy.

Figure 4.32: Intervals representing the temporal relation of three events (left: the intervals $n_{j_3} - n_{j_2}$ and $n_{j_3} - n_{j_1}$; right: the interval $n_{j_3} - \frac{n_{j_2} + n_{j_1}}{2}$)

Given four events $E_{j_1}$, $E_{j_2}$, $E_{j_3}$ and $E_{j_4}$ occurring at $n_{j_1}$, $n_{j_2}$, $n_{j_3}$ and $n_{j_4}$, some useful intervals that are defined by these events are shown in Fig. 4.33. For the intervals shown on the left side of the figure a distance measure has already been introduced in (4.49), leading to a possible feature

$$ \tilde{x}_{l,q_4}^2[n] = (n_{j_4} - n_{j_2})^2 + (n_{j_3} - n_{j_1})^2. \tag{4.79} $$

The interval on the right side of Fig. 4.33, defined by the midpoints of the instances where $E_{j_1}$, $E_{j_2}$ and $E_{j_3}$, $E_{j_4}$ occurred, also expresses the temporal relation of the four events and can be used as an alternative to the feature in (4.79):

$$ \tilde{x}_{l,q_4}[n] = \frac{n_{j_4} + n_{j_3}}{2} - \frac{n_{j_2} + n_{j_1}}{2}. \tag{4.80} $$

Figure 4.33: Intervals representing the temporal relation of four events (left: the intervals $n_{j_4} - n_{j_2}$ and $n_{j_3} - n_{j_1}$; right: the interval between the midpoints $\frac{n_{j_2} + n_{j_1}}{2}$ and $\frac{n_{j_4} + n_{j_3}}{2}$)
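The event-based interval features can be illustrated with threshold-crossing events on the accumulated energy of a signal. The following is a minimal sketch under stated assumptions: the function names, the thresholds and the example signal are hypothetical, the square root is taken of the squared quantities in (4.77) and (4.79), and the ordering (4.75) holds for this particular example but is not enforced by the code.

```python
import math


def accumulated_energy(s):
    """Monotonically non-decreasing series of accumulated squared samples."""
    out, total = [], 0.0
    for value in s:
        total += value * value
        out.append(total)
    return out


def first_crossing(values, threshold):
    """Event output in the sense of (4.74): first index at which values exceeds
    the threshold, or 0 if this has not happened yet."""
    for n, value in enumerate(values):
        if value > threshold:
            return n
    return 0


def interval_features(n1, n2, n3, n4):
    """Features built from four event time stamps with n1 < n2 < n3 < n4."""
    return {
        "duration_12": n2 - n1 + 1,                       # (4.76)
        "three_distance": math.hypot(n3 - n2, n3 - n1),   # root of (4.77)
        "three_midpoint": n3 - (n2 + n1) / 2,             # (4.78)
        "four_distance": math.hypot(n4 - n2, n3 - n1),    # root of (4.79)
        "four_midpoints": (n4 + n3) / 2 - (n2 + n1) / 2,  # (4.80)
    }


signal = [0, 1, 1, 2, 2, 3, 3, 4]
energy = accumulated_energy(signal)  # 0, 1, 2, 6, 10, 19, 28, 44
times = [first_crossing(energy, tau) for tau in (0.5, 5.0, 15.0, 40.0)]
features = interval_features(*times)
```

Because the output 0 is also used for "not yet identified", an event at time stamp 0 cannot be distinguished from a missing event in this encoding; a practical implementation may need to handle this case separately.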

It should be noted that there exists a calculus called “Allen’s Interval Algebra” [All83] that defines relations between intervals and which can be used in rule-based systems dealing with temporal data.

A major advantage of event-based features is the possibility to achieve interpretability in temporal on-line classification. Since the signal characteristics associated with events are in general defined based on domain knowledge, the occurrence of events can be interpreted by humans. In this way events can be used to obtain an interpretable input space.

Conclusion

The aim of this chapter is to provide access to the topic of time series classification. In the first part the representation of time series, the challenges that come with temporal data, and common attempts to deal with them have been discussed. The second part has focused on similarity measures for time series, introducing the ADTW dissimilarity. In this context, ways to construct representative prototypes based on the adopted similarity measure have been pointed out. The third part has reviewed possibilities to construct global or local features from time series and has presented the concept of event-based features.


5. Classification in Car Safety Systems

This chapter presents the car crash classification task. Section 5.1 describes the peculiarities associated with this application and the data sets which are used in this work, while in Section 5.2 it is discussed how the classification performance can be evaluated.

5.1 The Car Crash Dataset

Modern cars carry various sensors that detect a crash and categorize its severity. The data considered in the following was generated during standardized front-crash tests. To detect this type of crash, the cars are equipped with four deceleration sensors, each producing a signal that is processed at a rate of 1 kHz. Three of the four sensors measure the deceleration in the longitudinal direction of movement and one sensor measures the acceleration in the transversal direction. Two of the sensors measuring the longitudinal deceleration are located in the front part of the car, one on the left, called ECSL (Early Crash Sensor Left), and one on the right, called ECSR (Early Crash Sensor Right). Owing to their location, these two sensors offer valuable information about the crash severity at a very early stage of an impact. The third sensor measuring the longitudinal deceleration, called ZAEX (Zentrale Airbag Einheit X-Richtung, i.e., central airbag unit, X direction), is located in the airbag control unit, which lies in the passenger compartment close to the gear shift. The fourth sensor, called ZAEY (Zentrale Airbag Einheit Y-Richtung, i.e., central airbag unit, Y direction), measures the deceleration in the transversal direction and is also located in the airbag control unit. The advantage of the latter two sensors is that they are not damaged as easily as ECSL and ECSR during a front crash. On the other hand, during a front crash ZAEX and ZAEY do not sense the crash as early as ECSL or ECSR.

Depending on the crash severity, safety systems such as belt tensioners or airbags must be activated at the right time. The output $y[n]$ at time instance $n$ belongs to one of the four classes $c_1$, $c_2$, $c_3$ and $c_4$. Each class is related to an action that must start when the decision is taken. These relations and the special case of rejecting a decision are given in Table 5.1.

  Class    Representation
  $c_0$    “reject decision”
  $c_1$    “no deployment of safety systems”
  $c_2$    “deployment of seat belt tensioner”
  $c_3$    “deployment of airbag stage 1”
  $c_4$    “deployment of airbag stage 2”

Table 5.1: Classes and their representations

If the crash severity is classified as $c_4$, first the belt tensioner, then the first stage and finally the second stage of the airbags are deployed. If the crash severity is classified as $c_3$, first the belt tensioner and then only the first stage of the airbag are activated. If the crash severity is classified as $c_2$, only the belt tensioner is activated. A decision for $c_1$ or $c_0$ does not lead to the deployment of any safety system.

A characteristic of this application is that in the initial phase of every severe crash the target class is $c_1$ and then, at a later time instance, one of the classes $c_2$, $c_3$ or $c_4$. This ensures that the safety systems are activated during a time window which is defined in such a way that the injury risk of the passengers is minimized. These time windows represent “do not care” intervals, i.e., the outputs of the classifier during this time are not penalized. Nevertheless, at the end of such a time window the crash must be associated with one of the classes $c_2$, $c_3$ or $c_4$. If the classifier's decision before or after the “do not care” interval is not the same as the target $y[n]$, a misclassification occurs.

To evaluate the classification performance, an additional weighting of misclassifications, $\gamma[n]$, is introduced to take into account the costs of a too early or too late deployment of the safety systems. Fig. 5.1 shows the time-dependent course of the weighting $\gamma[n]$ for crashes where a deployment of safety systems is required, i.e., $c_k \in \{c_2, c_3, c_4\}$. Here, $n_s$ denotes the time instance when the “do not care” interval starts and $n_e$ the time instance when it ends. The further a misclassification lies from $n_s$ and $n_e$, the more strongly it is penalized. Deploying a given number of milliseconds too late is worse than deploying the same number of milliseconds too early. For crashes that do not require the deployment of safety systems, the values $\gamma[n]$ are set to $\gamma_{\min}$ for the entire duration of the scenario, i.e., if the classifier decides $c_2$, $c_3$ or $c_4$ at an arbitrary time instance, this misclassification is penalized with $\gamma_{\min}$.

Figure 5.1: $\gamma[n]$ for the crash-detection application (abscissa: time index $n$ with marks at $1$, $0.25 n_s$, $0.75 n_s$, $n_s$, $n_e$, $1.25 n_e$ and $1.5 n_e$; ordinate: $\gamma[n]$ between $\gamma_{\min}$ and $\gamma_{\max}$; class labels $c_1$ and $c_k$)

Finding the “do not care” intervals $\{n_s, n_s + 1, \ldots, n_e\}$ in such a way that the injury risk of the passengers is minimized is a difficult task, since not only the crash impulse influences the optimal deployment time but also the body dimensions and the exact location of the passengers. Thus, $n_s$ and $n_e$ vary from crash to crash. For the data sets used in this thesis the “do not care” intervals were predetermined by experts, and their analysis is not part of this work. Nevertheless, a rule of thumb for estimating a suitable time instance to deploy the airbags should be mentioned. It is called the “5″ − 30 ms” criterion. This criterion requires the deployment at the time instance when an unbelted passenger (represented by a so-called 50% male dummy that is subject to the negative of the deceleration measured by the ZAEX sensor) has moved forward 5 inches, minus the 30 ms that are required for the inflation of the airbag. The time instance computed with the “5″ − 30 ms” criterion is set to $n_e$, and $n_s$ is then set to some time stamps before $n_e$.

Since the requirements set by law, insurance guidelines and consumer tests are very demanding, simple thresholds on the deceleration signals are not enough to activate the safety systems in the required time window according to the crash severity. Generally, crashes with
stiff collision partners are easier to classify than crashes with deformable objects. The latter may lead to signals that are very hard to distinguish from signals where the activation of safety systems is not desired. These facts are illustrated in Figures 5.2 and 5.3, where the abscissa shows the time in milliseconds (ms) and the ordinate the deceleration in g (9.81 m/s²).

Figure 5.2: Sensor signals from a 27 km/h and a 56 km/h crash (four panels: ECSL upper left, ECSR upper right, ZAEX lower left, ZAEY lower right; abscissa 0 to 50 ms, ordinate −150 to 150 g)

Fig. 5.2 shows, for the first 50 ms after the impact, the signals from two crashes against a stiff wall with 0° angle of impact, one with 56 km/h represented by the solid curve and the other with 27 km/h represented by the dashed curve. The upper left plot shows the ECSL signals, the upper right the ECSR signals, the lower left the ZAEX signals, and the lower right the ZAEY signals. For the 56 km/h crash all safety systems must be deployed between $n_s = 7$ and $n_e = 13$, and for the 27 km/h crash only the belt tensioner and the first stage of the airbag must be deployed between $n_s = 14$ and $n_e = 33$. These “do not care” intervals are marked by the shaded areas. One can see that a simple threshold on the ZAEX signals makes it possible to discriminate reliably between the two crashes before the corresponding “do not care” intervals end.

The discrimination task is more complicated for crashes with deformable objects. This is visualized in Fig. 5.3. Here, the solid curves represent a crash with 40 km/h against a deformable barrier with an offset of 40%, i.e., only 40% of the car front (the left side) hits the deformable collision partner with 0° angle of impact. This kind of standardized setting stands for traffic crashes with other cars, which normally have a deformable crush zone. The dashed curves show the signals stemming from a 15 km/h crash with 40% offset and 10° angle of impact. Whereas for the 40 km/h crash the belt tensioner and the first stage of the airbag must be deployed, for the 15 km/h crash no safety systems should be activated, since the risk of a severe injury is very small at such a low speed. Thus, a “do not care” interval is only defined for the 40 km/h crash and is marked by the shaded areas between $n_s = 28$ and $n_e = 48$ in Fig. 5.3. As above, the four plots in Fig. 5.3 present the ECSL, ECSR, ZAEX and ZAEY deceleration signals. The plots show that the discrimination between the two crash types cannot be accomplished by a simple threshold on the ZAEX signal. For these two crashes a discrimination can be achieved by using the information from the ECSL sensor.

Figure 5.3: Sensor signals from a 15 km/h and a 40 km/h ODB crash (four panels: ECSL, ECSR, ZAEX, ZAEY; abscissa 0 to 50 ms, ordinate −150 to 150 g)

The number of crash types used to design reliable classification algorithms is large and, as the above examples show, it is necessary to combine the information from different sensors in order to obtain an input space where the crash severity can be correctly categorized. One of the most frequently applied signal processing steps in this feature generation is low-pass filtering. Besides reducing noise effects, which simplifies the classification task, removing the high-frequency parts and generating features from the low-frequency signal components aids the use of computer-generated data based on the finite element method. Current computer simulations are able to model only the low-frequency parts of crash signals well. In order to integrate data from simulations in the design of algorithms, only features based on the low-frequency signal components should be used. This work does not consider the design of algorithms based on data from computer simulations but, as will be shown in Chapters 6 and 7, the features used for the classification of crashes contain only information from low-frequency signal components.

As can also be seen from Figures 5.2 and 5.3, the length of the signals up to the time instance when the “do not care” interval ends varies from crash to crash. A characteristic of the car crash application is that not only a decision about the class to which the sensor signals belong is required, but also a decision about the time instance at which to take it.
This makes the problem at hand much more complicated than the standard time series classification tasks described in the literature, where mainly time series of the same length, each belonging to a single class, must be categorized. Two possible ways to handle this difficulty are presented in this work. The first, described in the next chapter, uses a single machine learning algorithm which must fulfill both tasks: determination of the time instance for deployment and classification of the crash severity. The second, described in Chapter 7, divides the problem into a segmentation step, where the time instance for deployment is computed, and a labeling step, where the categorization to one of the classes is realized.

State-of-the-art algorithms for crash detection [Hua02] use features which are deduced from the deceleration signals and define rules based on these features that fulfill the requirements for the available crash test measurements. Both the choice of the features and the rules are based on empirical experience and require many manually performed “trial and error” steps for each car type. In Chapters 6 and 7 methods will be introduced that replace the empiricism with machine learning algorithms for classification and feature selection.

The data sets from two car types, denoted I and II, are considered in the following. For car type I there are $M = 63$ scenarios $S_m$ available, and for car type II the data set contains $M = 87$ scenarios. Since the car types are different, the data sets cannot be mixed and must be treated separately. The data is recorded during crash tests with real cars, which explains why the number of available scenarios is so small. The crashes in the data sets cover a large spectrum of crash scenarios that can occur in reality: crashes with different velocities, offsets, angles of impact and collision partners. These crash settings are categorized into 16 groups so that all crashes in one group stem from the same setting, i.e., the same relative velocity, angle of impact, stiffness of the collision object, and offset.

Among the $M$ scenarios there are about 17% so-called misuse crashes for each car type. The misuse crashes represent situations where the safety systems should not be activated, e.g., collisions with deer, driving against the curb or through potholes. The number of misuse cases could be increased easily, since in many misuse scenarios the cars are not damaged and thus the costs of producing additional misuse data are low. Nevertheless, no additional misuse data is produced, in order to keep the data set balanced. The classification method from Chapter 7 is not strongly affected by a large number of misuse scenarios in the data set, since there each crash produces a single example for the training set used to design the classification algorithm. On the other hand, for methods like the one presented in Chapter 6 a large number of examples is extracted from each crash, since the time instance for the activation of the safety systems must also be learned. Misuse scenarios are associated with class $c_1$ for their entire duration. Additionally, each crash must be classified as belonging to class $c_1$ before the “do not care” interval starts. Therefore, one must ensure that input vectors associated with class $c_1$ are not over-represented, since this would lead to an unbalanced training set.
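The “5″ − 30 ms” rule of thumb mentioned in this section can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used by the experts who labeled the data: the function name `five_inch_30ms`, the use of simple rectangular integration, and the convention that the ZAEX deceleration is given as a positive value in g at 1 kHz are all assumptions made here.

```python
import numpy as np

G = 9.81              # m/s^2 per g
FIVE_INCHES = 0.127   # 5 inches in metres
FS = 1000.0           # sampling rate in Hz (1 kHz, as stated in the text)

def five_inch_30ms(decel_g, fs=FS, travel=FIVE_INCHES, inflation_s=0.030):
    """Estimate n_e from a ZAEX-like deceleration signal (hypothetical helper).

    decel_g: deceleration samples in g, positive while the car decelerates
             (sign convention assumed here).
    Returns the sample index n_e, or None if the free mass never travels
    `travel` metres or if the resulting index would be negative.
    """
    # Relative acceleration of the unbelted occupant with respect to the car.
    a = np.asarray(decel_g, dtype=float) * G
    dt = 1.0 / fs
    v = np.cumsum(a) * dt          # relative velocity (rectangular rule)
    x = np.cumsum(v) * dt          # relative forward displacement
    reached = np.nonzero(x >= travel)[0]
    if reached.size == 0:
        return None
    # Subtract the airbag inflation time, expressed in samples.
    n_e = int(reached[0]) - int(round(inflation_s * fs))
    return n_e if n_e >= 0 else None

# Synthetic example: a constant 20 g deceleration pulse of 100 ms.
n_e_example = five_inch_30ms(np.full(100, 20.0))
```

For the synthetic constant pulse the displacement threshold is crossed after roughly 36 ms, so the estimate lands a few milliseconds after the impact; real crash pulses are of course far from constant.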

5.2 Evaluation of Classification Performance

The quality of the proposed classification system is measured by the Leave-One-Out (LOO) method. The LOO estimate is a suitable measure of quality for this application since the data set is small and only a few representatives exist for each crash setting. In order to get an expressive measure of quality, the crashes are categorized into three levels of importance, $\gamma_{\mathrm{imp}} \in \{1, 2, 3\}$, depending on the relative velocity, the stiffness of the collision partner, the angle of impact, and the displacement between the car and the collision partner measured by the offset. The levels of importance $\gamma_{\mathrm{imp}}$ for the crash settings in the training set $D_S$ are shown in Table 5.2.

                             relative velocity   stiffness of collision partner   angle of impact   offset
  $\gamma_{\mathrm{imp}} = 1$   0 − 27 km/h         rigid                            0°                0%
                             0 − 32 km/h         rigid                            30°               0%
                             0 − 40 km/h         deformable                       0°                40%
  $\gamma_{\mathrm{imp}} = 2$   27 − 40 km/h        rigid                            0°                0%
                             32 − 40 km/h        rigid                            30°               0%
                             40 − 56 km/h        deformable                       0°                40%
  $\gamma_{\mathrm{imp}} = 3$   > 40 km/h           rigid                            0°                0%
                             > 40 km/h           rigid                            30°               0%
                             > 56 km/h           deformable                       0°                40%

Table 5.2: Penalization $\gamma_{\mathrm{imp}}$ for various crash settings

Additionally, to measure the classification performance it must be distinguished whether the crash severity is underestimated, correctly estimated or overestimated. These possibilities are presented in Table 5.3. If a crash requires the deployment of a safety system that is represented by class $c_k$ and the classification system decides $c_{k'}$ with $k' > k$, then the crash severity is overestimated. Analogously, if a crash requires the deployment of a safety system that is represented by class $c_k$ and the classification system decides $c_{k'}$ with $k' < k$, then the crash severity is underestimated. Otherwise the correct crash severity has been detected. This categorization by means of the crash severity is presented in the columns of Table 5.3. Since the time instance $n_{\mathrm{fire}}$ at which a decision $\hat{y}[n_{\mathrm{fire}}] \in \{c_2, c_3, c_4\}$ is taken for the first time must also be considered, the row in Table 5.3 shows whether the decision comes too early, i.e., $\Delta t < 0$, during the “do not care” interval, i.e., $\Delta t = 0$, or too late, i.e., $\Delta t > 0$, with

\[
\Delta t =
\begin{cases}
n_{\mathrm{fire}} - n_s, & n_{\mathrm{fire}} < n_s \\
0, & n_s \le n_{\mathrm{fire}} \le n_e \\
n_{\mathrm{fire}} - n_e, & n_e < n_{\mathrm{fire}}.
\end{cases}
\tag{5.1}
\]

If the classifier's decisions change during the course of the crash, $n_{\mathrm{fire}}$ is set to the time stamp when the class with the highest index $k$ is first identified. The case where deployment systems are activated although this is not required leads to $\Delta t = -\infty$, and the case where no deployment systems are activated although this is required leads to $\Delta t = \infty$. This way, the classification of a crash scenario with an arbitrary algorithm leads to a single entry in Table 5.3, describing the quality of the decision with respect to the crash severity and the time instance when the decision is taken. Here, the entries $\Delta t_o$, $\Delta t_c$ and $\Delta t_u$ are computed according to (5.1).

                     crash severity    crash severity         crash severity
                     overestimated     estimated correctly    underestimated
  deployment time    $\Delta t_o$      $\Delta t_c$           $\Delta t_u$

Table 5.3: Categorization of a car-crash decision for a fixed scaling factor $\alpha_{\mathrm{scale}}$
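The mapping from a sequence of classifier decisions to a single entry of Table 5.3 can be sketched as follows. This is an illustration only: the function names are hypothetical, classes are encoded as integer indices $k$ (0 for $c_0$ up to 4 for $c_4$), the sign convention is the one stated in the text (negative means too early), and returning 0 for a correctly handled no-deployment scenario is an assumption, since the text does not define that case.

```python
import math

def delta_t(n_fire, n_s, n_e):
    """Timing error relative to the "do not care" interval [n_s, n_e]."""
    if n_fire < n_s:
        return n_fire - n_s      # negative: too early
    if n_fire > n_e:
        return n_fire - n_e      # positive: too late
    return 0                     # inside the interval

def categorize(decisions, k_target, n_s=None, n_e=None):
    """Return (severity, delta_t) for one scenario (hypothetical helper).

    decisions: list of class indices decided at time stamps 0, 1, 2, ...
    k_target:  index of the class the crash finally requires (1 = no deployment).
    """
    k_max = max(decisions)           # class with the highest index ever decided
    fired = k_max >= 2
    required = k_target >= 2
    if not required:
        # Deployment although not required -> -inf; otherwise assumed 0.
        return ("over", -math.inf) if fired else ("correct", 0)
    if not fired:
        return ("under", math.inf)   # required deployment never happened
    n_fire = decisions.index(k_max)  # first time the highest class is decided
    if k_max > k_target:
        severity = "over"
    elif k_max < k_target:
        severity = "under"
    else:
        severity = "correct"
    return (severity, delta_t(n_fire, n_s, n_e))

# Example with the interval of the 27 km/h crash from Fig. 5.2 (n_s = 14, n_e = 33).
example = categorize([1] * 10 + [2] * 3 + [3] * 20, k_target=3, n_s=14, n_e=33)
```

Collecting these pairs over all LOO decisions yields the raw material for the histograms discussed next.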

The performance of different classification systems for the car crash classification application can now be evaluated by computing the histograms for the entries in Table 5.3 with the LOO estimates. Such histograms will be used further in this work as basis for the comparison of different classification algorithms.


Robustness and Interpretability

The car crash classification task is a safety-critical application; thus, besides the required low misclassification rate, the interpretability of the classification system is also desirable. The time series produced by the sensors during a car crash are nonstationary and represent the output of a nonlinear system. There are many sources of uncertainty during a car crash, e.g., the positions of the passengers, the relative velocity of the collision partners, the angle of impact, and the stiffness of the objects involved in the collision. Therefore, it is advisable to use statistical methods both in the design and in the validation of the classification system. In order to find an optimal classification system, i.e., one that minimizes the rate of misclassifications regarding both the crash severity and the time for activating the safety systems, it is required to know the joint probability functions of the random variables that influence the system. Since these are not known in practice, two possible approaches to examine the performance are the use of cross-validation techniques combined with models for variations that may occur in the real application, and the design of interpretable classifiers that can be checked by experts. In this work both approaches are considered. As mentioned above, the LOO technique is applied to obtain a fair estimate of the classifier's performance. Additionally, following the tradition of testing the robustness of algorithms designed for the deployment of safety systems, the raw data is scaled up and down by 15% and the classification algorithm is then evaluated again with these scaled signals using the LOO estimate. Thus, for each scaling factor $\alpha_{\mathrm{scale}} \in \{0.85, 1.00, 1.15\}$ that is applied to the raw data, the classifier's LOO decision and the time instance when the decision is made are recorded and rated according to the penalizations from Fig. 5.1.
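The combination of LOO evaluation and signal scaling can be sketched generically as below. This is an illustration under assumptions, not the evaluation code of this work: the function `loo_with_scaling` and the callables `fit` and `decide` are hypothetical, only the held-out scenario is scaled while the training scenarios stay unscaled (the text does not state this explicitly), and the threshold "classifier" is a toy stand-in used solely to make the example runnable.

```python
def loo_with_scaling(scenarios, targets, fit, decide, scales=(0.85, 1.00, 1.15)):
    """Leave-one-out decisions for each scaling factor (hypothetical helper).

    fit(train_scenarios, train_targets) -> model
    decide(model, scenario) -> (class_index, n_fire or None)
    Returns a dict: scale -> list with one decision per held-out scenario.
    """
    results = {}
    for alpha in scales:
        decisions = []
        for m in range(len(scenarios)):
            # Train on all scenarios except the m-th one.
            train_x = scenarios[:m] + scenarios[m + 1:]
            train_y = targets[:m] + targets[m + 1:]
            model = fit(train_x, train_y)
            # Evaluate on the scaled held-out scenario.
            scaled = [alpha * v for v in scenarios[m]]
            decisions.append(decide(model, scaled))
        results[alpha] = decisions
    return results

# Toy stand-ins, used only to exercise the loop above.
def fit_threshold(train_x, train_y):
    peaks = [max(s) for s in train_x]
    return 0.5 * sum(peaks) / len(peaks)      # half of the mean peak value

def decide_threshold(threshold, signal):
    for n, v in enumerate(signal):
        if v >= threshold:
            return (2, n)                     # "fire" at the first crossing
    return (1, None)                          # never fires

scenarios = [[slope * float(n) for n in range(50)] for slope in (1, 2, 3, 4)]
targets = [1, 2, 2, 2]
results = loo_with_scaling(scenarios, targets, fit_threshold, decide_threshold)
```

With this toy setup, downscaling the held-out signal delays the firing time and upscaling advances it, which is the qualitative behavior that the robustness check looks for.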
The scaling with $\alpha_{\mathrm{scale}}$ represents a model of the variations that may appear due to the numerous random variables in the system and is based on empiricism. The method attempts to gain a better estimate of the risk, which is defined as the expectation of the loss function, by covering a larger region of interest in the classifier's input space. A behavior that reflects the expectations, i.e., a downscaling of the signals shifts the deployment time to a later time instance and an upscaling to an earlier time instance, is interpreted as a measure of the robustness of the classification system.

A frequently applied approach in practice, especially in safety-critical applications, is the use of expert systems, e.g., a set of empirically obtained rules. The advantage of expert systems is their interpretability, i.e., the way a decision is taken based on the input values can be understood by humans. This ensures that no unexpected decisions are taken for arbitrary inputs. As already mentioned, there is a tradeoff between interpretability and low misclassification rates. While there are many techniques that aim at incorporating domain knowledge into the design of standard machine learning algorithms, e.g., Knowledge Based Artificial Neural Networks [TS94], the resulting classifiers are not interpretable. Among the standard classifiers that are based on statistical learning, the majority of the techniques are not interpretable. Decision trees are interpretable because they generate “IF-THEN-ELSE” rules, but they pay for their interpretability with higher misclassification rates. The GRBF classifiers described in Chapter 3 allow only fuzzy-like interpretability, but they are able to model much more complex separating hypersurfaces than decision trees and therefore their misclassification rate is normally lower. The GRBF algorithm is considered for the car crash classification task in Subsection 7.2.1. Another category of classifiers, whose interpretability is based on the distance to some predefined points in the input space, is the One-Nearest Neighbor (1NN) procedure. The 1NN algorithm is applied to the task of car crash classification in Subsection 7.2.2.

For a robust design, not only the classification algorithm and the extracted features are of great importance but also the representativeness of the input-target pairs in the training set. The existence of outliers in the training set may deteriorate the generalization ability of a classifier. Therefore, a frequently used technique to increase the robustness of classifiers is the elimination of outliers from the training set. This can be done, for example, by using tree ensembles which measure the similarity among the input points in the training set. The classification methods from Chapter 6 and Subsection 7.2.1 for the car crash data sets incorporate the option to eliminate outliers.

Conclusion

This chapter has presented the car crash classification task and the challenges related to it. The first part has described the data sets and has provided background knowledge about the origin of the signals. The second part has shown how classification systems can be evaluated for this application in order to capture the validity of both the estimated crash severity and the time instance when a decision to deploy safety systems is taken. The last part of the chapter has addressed the topics of robustness and interpretability, which are important for the car crash classification application.


6. Scenario-Based Random Forest for On-Line Time Series Classification

In this chapter a machine learning technique for on-line time series classification will be introduced, which computes the label of a query scenario at every time stamp n. The algorithm is a modified version of the Random Forest (RF) classifier [Bre01] in such a way that the peculiarities that come along with on-line time series classification can be considered. Therefore, the algorithm will be called Scenario-Based Random Forest (SBRF) [BN07b]. The next section formulates the classification problem, Section 6.2 introduces the SBRF classifier, Section 6.3 shows how feature selection can be performed with the SBRF, and in Section 6.4 the algorithm is applied to the task of car crash classification.

6.1 Problem Formulation

On-line time series classification is a challenging task, as already described in Chapter 4, since the usual feature-value model cannot be applied. The aim of on-line time series classification is to determine for a new scenario which label the current segment has at a given time index $n$, using only the information from the time series up to this point. With $S^{[0,n]} = [s[0], \ldots, s[n]] \in \mathbb{R}^{L \times (n+1)}$ denoting the time series up to time index $n$, one is looking for the mappings

\[
g_n : \mathbb{R}^{L \times (n+1)} \to \{c_1, \ldots, c_K\}, \quad S^{[0,n]} \mapsto \hat{y}[n].
\tag{6.1}
\]

The smallest entity that can be assumed independent from the others is an entire scenario $S$, although the decision of the classifier is taken at every time stamp $n$. This requires a new definition of the risk, i.e., the classification error. If the targets corresponding to the time series $S$ are accumulated in the vector $y$, as described in Section 4.1, the expected risk is

\[
R(g) = \mathrm{E}_{S,y}\{L(g(S), y)\},
\tag{6.2}
\]

and the empirical risk is

\[
R_{\mathrm{emp}}(g) = \frac{1}{M} \sum_{m=1}^{M} L(g(S_m), y_m),
\tag{6.3}
\]

where $L(g(S), y)$ denotes the loss function with $g(S) \in \{c_1, \ldots, c_K\}^{1 \times n_{\mathrm{end}}}$,

\[
g(S) = \left[ g_0\!\left(S^{[0,0]}\right), \ldots, g_{n_{\mathrm{end}}-1}\!\left(S^{[0,n_{\mathrm{end}}-1]}\right) \right],
\tag{6.4}
\]

and $\mathrm{E}_{S,y}\{L(g(S), y)\}$ is the expectation of $L(g(S), y)$ with respect to the random processes $S$ and $y$.


The mappings $f_g[n]$ from (4.55) transform the segment $S^{[0,n]}$ of a scenario from its elementary representation to the vector $\tilde{x}[n]$ of high-level features, which can be both local and global features. Thus, the whole scenario $S$ is mapped to

\[
\tilde{X} = [\tilde{x}[0], \ldots, \tilde{x}[n_{\mathrm{end}}-1]] \in \mathbb{R}^{\tilde{N} \times n_{\mathrm{end}}}.
\tag{6.5}
\]

To reduce the curse of dimensionality and the computational complexity, feature selection is performed by using the selection matrix $J \in \{0, 1\}^{N \times \tilde{N}}$, $N < \tilde{N}$:

\[
x[n] = J \tilde{x}[n].
\tag{6.6}
\]

Thus, the vector of high-level features $x[n]$ contains only some of the entries in $\tilde{x}[n]$. This way, the scenario $S$ is finally transformed into

\[
X = J \tilde{X} = [x[0], \ldots, x[n_{\mathrm{end}}-1]] \in \mathbb{R}^{N \times n_{\mathrm{end}}}.
\tag{6.7}
\]
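The effect of the selection matrix in (6.6) and (6.7) can be made concrete with a small numerical sketch. The helper name `selection_matrix`, the chosen dimensions and the selected feature indices are assumptions for illustration only; the sketch also assumes that each row of $J$ picks exactly one feature.

```python
import numpy as np

def selection_matrix(selected, n_tilde):
    """Build J in {0,1}^(N x N_tilde) that keeps the features listed in `selected`."""
    J = np.zeros((len(selected), n_tilde), dtype=int)
    J[np.arange(len(selected)), selected] = 1   # one 1 per row
    return J

# Toy scenario with N_tilde = 5 high-level features and n_end = 4 time stamps.
X_tilde = np.arange(20, dtype=float).reshape(5, 4)

J = selection_matrix([0, 3], n_tilde=5)   # keep features 0 and 3, so N = 2
X = J @ X_tilde                           # reduced representation, shape (N, n_end)
```

Multiplying by $J$ is equivalent to row indexing, `X_tilde[[0, 3], :]`; the matrix form is convenient mainly because it matches the notation of (6.6) and (6.7).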

Based on the input vectors $x[n]$ and the target values $y[n]$, a machine learning algorithm can then be used to learn the mapping

\[
f_J : \mathbb{R}^{N} \to \{c_1, \ldots, c_K\}, \quad x[n] \mapsto \hat{y}[n].
\tag{6.8}
\]

Thus, instead of searching for the mappings $g_n$, the task changes into finding the mapping $f_J$ using a machine learning algorithm. However, this is not a classical classification problem, since the successive vectors $x[n]$ from the same scenario are strongly correlated. Moreover, the risk functional used for the classification must be modified because the costs are evaluated on scenario level and not on time stamp level. The expected risk $R(f_J)$ and the empirical risk $R_{\mathrm{emp}}(f_J)$ can now be introduced analogously to (6.2) and (6.3) by replacing $g(S)$ with

\[
f(X) = f(J\tilde{X}) = [f_J(x[0]), \ldots, f_J(x[n_{\mathrm{end}}-1])].
\tag{6.9}
\]

For the evaluation of the classifier performance it does not matter whether one works with elementary or high-level features, so the loss function must be chosen to fulfil

\[
R(f_J) = R(g) \quad \text{and} \quad R_{\mathrm{emp}}(f_J) = R_{\mathrm{emp}}(g).
\tag{6.10}
\]

In off-line time series classification, where a single label is assigned to a whole scenario, the loss function measures whether the time series has been assigned to the correct class. In the on-line time series classification task, the time instances when class changes are identified or missed determine the loss function. In both cases it does not matter whether elementary or high-level features are used. For on-line time series classification the loss function $L(f(X), y)$ is designed in such a way that it takes account of the consequences that a misclassification has at the various time instances of a scenario. If the penalizations of misclassifications are stored in the vector

\[
\gamma = [\gamma[0], \ldots, \gamma[n_{\mathrm{end}}-1]] \in \mathbb{R}^{1 \times n_{\mathrm{end}}}
\tag{6.11}
\]

and the time instances when a class change occurs either in $y$ or in $f(X)$ are denoted by $n_{\mathrm{ch},i}$, then the loss is set to

\[
L(f(X), y) = \frac{1}{C_{\mathrm{ch}}} \sum_{i=1}^{C_{\mathrm{ch}}} \gamma[n_{\mathrm{ch},i}] \left(1 - \delta\!\left(f_J(x[n_{\mathrm{ch},i}]), y[n_{\mathrm{ch},i}]\right)\right),
\tag{6.12}
\]

where $C_{\mathrm{ch}}$ is the total number of changes in $y$ and $f(X)$. Note that by setting $\gamma[n] = 0$ for all time stamps where an observed scenario has a length smaller than $n_{\mathrm{end}}$, scenarios of different lengths can be considered. Another possible definition of the loss function is

\[
L(f(X), y) = \frac{\sum_{n=0}^{n_{\mathrm{end}}-1} \gamma[n] \left(1 - \delta\!\left(f_J(x[n]), y[n]\right)\right)}{\sum_{n=0}^{n_{\mathrm{end}}-1} \gamma[n]},
\tag{6.13}
\]

but this definition does not differentiate whether misclassifications are caused by the same or a different event in the scenario, thereby putting too much emphasis on misclassifications that have the same cause. Thus, in the following the loss from (6.12) will be used.

The penalizations $\gamma[n]$ for the example from Fig. 4.4 may have the values shown in Fig. 6.1. The values of $\gamma[n]$ have been chosen here to penalize a misclassification strongly if it occurs in regions that clearly belong to one of the classes, while the penalization gets smaller the closer one moves towards a change to a different class. During the “do not care” intervals the penalization is set to zero, so that a misclassification there is not taken into consideration when computing the risk. It should be noted that this requires setting the corresponding value $n_{\mathrm{ch},i}$ from (6.12) to the next time instance following the “do not care” interval. The penalizations $\gamma[n]$ which are used in the car crash application have been presented in Fig. 5.1.

Figure 6.1: Penalization terms $\gamma[n]$ (class sequence $c_1$, $c_2$, $c_3$, $c_4$ over the time index $n$; $\gamma[n]$ between $\gamma_{\min}$ and $\gamma_{\max}$)

In order to obtain a good on-line time series classification system one has to find the mappings $g_n$ that minimize the expected loss $R(g)$ or, according to (6.10), the selection matrix $J$ and the mapping $f_J$ which minimize $R(f_J)$. The selection matrix $J$ and the mapping $f_J$ must be determined based on a training set $D_S$ consisting of $M$ scenarios and the corresponding labels for each time stamp [see (4.4)]. The next two sections show how the matrix $J$ and the mapping $f_J$ can be computed using the SBRF algorithm.
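The difference between the two loss definitions (6.12) and (6.13) can be illustrated with a short sketch. Several details are assumptions made here and not taken from the text: the function names are hypothetical, a time stamp at which both $y$ and $f(X)$ change is counted once, the loss is set to 0 when there is no change at all, and the shifting of change points out of “do not care” intervals is omitted for brevity.

```python
def change_points(y, y_hat):
    """Time stamps n >= 1 where the target or the prediction changes its class."""
    return [n for n in range(1, len(y))
            if y[n] != y[n - 1] or y_hat[n] != y_hat[n - 1]]

def loss_change_based(y, y_hat, gamma):
    """Loss in the spirit of (6.12): penalized errors at change points, averaged."""
    points = change_points(y, y_hat)
    if not points:
        return 0.0                       # assumption: no change, no loss
    errors = sum(gamma[n] for n in points if y_hat[n] != y[n])
    return errors / len(points)

def loss_per_time_stamp(y, y_hat, gamma):
    """Loss in the spirit of (6.13): penalized error rate over all time stamps."""
    errors = sum(g for yn, pn, g in zip(y, y_hat, gamma) if pn != yn)
    return errors / sum(gamma)

y       = [1, 1, 1, 2, 2, 2]             # target changes class at n = 3
gamma   = [1.0] * 6                      # uniform penalization for the toy case
early_1 = [1, 1, 2, 2, 2, 2]             # prediction switches one step too early
early_2 = [1, 2, 2, 2, 2, 2]             # prediction switches two steps too early

l612 = (loss_change_based(y, early_1, gamma), loss_change_based(y, early_2, gamma))
l613 = (loss_per_time_stamp(y, early_1, gamma), loss_per_time_stamp(y, early_2, gamma))
```

With a uniform $\gamma[n]$, the change-based loss treats both predictions as one premature switching event, while the per-time-stamp loss doubles for the second prediction because the same event is counted at every affected time stamp. In the change-based variant the severity of the timing error enters only through the value of $\gamma[n]$ at the change point.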

6.2 The Scenario-Based Random Forest Algorithm

The SBRF algorithm is a modification of the RF algorithm from Subsection 2.2.5 that makes it possible to handle the on-line time series classification problem described in the previous section. In time series classification one must take into account that the smallest entity that can be assumed independent from the others is a scenario. Therefore, in the SBRF algorithm the bootstrap used for building a forest must be done on scenario level, i.e., the training set $D_{\tilde{X}}^{(b)}$ needed for the $b$-th tree is constructed by sampling with replacement from the $M$ scenarios in their high-level feature representation

\[
D_{\tilde{X}} = \{(\tilde{X}_1, y_1), \ldots, (\tilde{X}_M, y_M)\}.
\tag{6.14}
\]
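Bootstrapping on scenario level means that whole scenarios, not individual time stamps, are drawn with replacement. The following sketch illustrates this under assumptions: the function name `scenario_bootstrap` is hypothetical, scenarios are represented only by their indices, and $M = 87$ is taken from car type II in Section 5.1.

```python
import random

def scenario_bootstrap(num_scenarios, rng):
    """Draw scenario indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(num_scenarios) for _ in range(num_scenarios)]
    out_of_bag = sorted(set(range(num_scenarios)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)      # fixed seed so that the sketch is reproducible
M = 87                      # number of scenarios for car type II
B = 500                     # number of trees, chosen arbitrarily here

fractions = []
for b in range(B):
    in_bag, out_of_bag = scenario_bootstrap(M, rng)
    # Share of distinct scenarios that the b-th tree would see during training.
    fractions.append(len(set(in_bag)) / M)

mean_in_bag = sum(fractions) / len(fractions)
```

The expected share of distinct scenarios per tree is $1 - (1 - 1/M)^M$, which approaches $1 - 1/e \approx 0.63$ for large $M$; the remaining scenarios are out-of-bag for that tree and all of their time stamps are withheld together.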

As presented in Fig. 6.2, a scenario $S$ is transformed by the feature generation mappings $f_g$ into $\tilde{X}$. The SBRF classifier has to learn the mappings from $\tilde{x}[n]$ to $y[n]$ for each time instance. Due to the bootstrap procedure a tree in the forest has access only to approximately 63% of the scenarios from $D_S$ during the training phase. Equivalently, every scenario is used only by approximately 63% of the trees in the SBRF for training.

Figure 6.2: Scenario-Based Random Forest (a scenario $S$ is mapped by $f_g$ to the feature trajectories $\tilde{x}_1[n], \ldots, \tilde{x}_{\tilde{N}}[n]$ that form $\tilde{X}$; approximately 63% of the scenarios are used for training and approximately 37% for testing)

To take into account that the penalization for misclassifications $\gamma[n]$ depends on the time stamp $n$, different weightings of the patterns $(\tilde{x}[n], y[n])$ are implemented by means of oversampling. The idea is based on the methods applied to imbalanced data sets, where one class constitutes only a very small minority of the data [CLB04]. In these cases one can either assign a high cost to the misclassification of the minority class and then minimize the overall cost [Dom99], or apply sampling techniques. Different sampling methods exist, e.g., downsampling the majority class, oversampling the minority class [LL98], or both. Oversampling raises the weight of those samples that are replicated, and this is exactly what is needed in on-line time series classification. More precisely, the patterns $(\tilde{x}[n], y[n])$ are replicated according to their weighting $\gamma[n]$. The number of replications for an example $(\tilde{x}[n], y[n])$ is obtained by rounding up the ratio $\gamma[n]/\gamma_{\mathrm{low}}$, where $\gamma_{\mathrm{low}}$ denotes the lowest nonzero penalization, $\gamma[n'] \neq 0$. The oversampling transforms the set $D_{\tilde{X}}^{(b)}$ into the set $\tilde{D}_{\tilde{X}}^{(b)}$, which is used for the training of the $b$-th tree in the forest. Thus, the training set for the $b$-th tree contains patterns $(\tilde{x}[n], y[n])$ from approximately 63% of the scenarios, with repetitions on time stamp level to consider the penalization $\gamma[n]$ and repetitions on scenario level which are a consequence of the bootstrap procedure.

From this point on there is no difference to the RF algorithm, i.e., unpruned trees are grown where at each node only a reduced number of features is examined for the best split, thereby introducing an additional source of randomization. After building $B$ trees, the class label for each time instance of a new scenario can be computed by taking the majority vote among the trees, thereby implementing the mapping $f_J$ with $J = I_{\tilde{N}}$. Since for the $m$-th scenario in $D_{\tilde{X}}$ approximately 37% of the $B$ trees have not used this scenario in the training phase, one can compute the oob-estimate $\hat{R}_{\mathrm{oob}}(f_J)$ of $R(f_J)$:

\[
\hat{R}_{\mathrm{oob}}(f_J) = \frac{1}{M} \sum_{m=1}^{M} L\!\left(f^{(m)}(X_m), y_m\right),
\tag{6.15}
\]

with

\[
f^{(m)}(X_m) = \left[ f_J^{(m)}\!\left(x^{(m)}[0]\right), \ldots, f_J^{(m)}\!\left(x^{(m)}[n_{\mathrm{end}}-1]\right) \right].
\tag{6.16}
\]

Hereby, $x^{(m)}[n]$ denotes the $(n+1)$-th feature vector of the $m$-th scenario and $f_J^{(m)}(x^{(m)}[n])$ is computed by taking the majority vote among only those trees in the forest which have not

utilized the m-th scenario in the training stage. This way, a good estimate of the classifier's performance can be obtained. The steps required for the SBRF algorithm are summed up in Alg. 6.1.

Algorithm 6.1 SBRF
1: compute $D_{\tilde{X}}$ by transforming the scenarios $S_m$, $m = 1, \ldots, M$, into their high-level representations $\tilde{X}_m$
2: for b = 1 : B do
3:   construct $D_{\tilde{X}}^{(b)}$: sample M scenarios with replacement from $D_{\tilde{X}}$
4:   construct $D_{\tilde{X}}^{\prime(b)}$: in each scenario oversample the patterns $(\tilde{x}[n], y[n])$ according to $\gamma[n]$
5:   construct the full-grown tree $f_b$ by considering only a reduced number of the $\tilde{N}$ features at each split and by using the training set $D_{\tilde{X}}^{\prime(b)}$
6: end for
7: the mapping $f_J$ is realized by taking the majority vote among the B trees

The SBRF algorithm is an off-the-shelf approach for on-line time series classification that attempts to learn the mappings from the feature representation of a scenario to the corresponding class labels for every time instance $n$. Although the SBRF comes with advantages that other algorithms do not have, e. g., the opportunity to compute a fair estimate of the performance while using all the available data for learning, the resulting classifier is not interpretable. Therefore, one must rely on statistical evidence when deciding about the suitability of the algorithm for a given classification problem. Nevertheless, the algorithm is a valuable tool for safety-critical applications since it offers the opportunity to estimate the difficulty of the classification task and to perform feature selection. The latter task is discussed in the next section.
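The scenario-level bootstrap and the oversampling described for Alg. 6.1 can be illustrated with a short Python sketch. It is not the implementation used in this work: the data layout (a scenario as a list of `(x, y, gamma)` tuples), the function name `build_tree_training_set`, and the choice of taking $\gamma_{\mathrm{low}}$ over the whole data set are assumptions made for illustration. Growing the tree itself is left out.

```python
import math
import random


def build_tree_training_set(scenarios, rng):
    """Build the oversampled training patterns for one tree.

    scenarios: list of scenarios; each scenario is a list of
               (x, y, gamma) tuples, one tuple per time stamp n.
    Returns (patterns, oob), where oob lists the scenarios not drawn.
    """
    M = len(scenarios)
    # Bootstrap on scenario level: draw M scenario indices with replacement.
    drawn = [rng.randrange(M) for _ in range(M)]
    oob = sorted(set(range(M)) - set(drawn))

    # Assumption: gamma_low is the smallest nonzero penalization in the data.
    gamma_low = min(g for sc in scenarios for (_, _, g) in sc if g > 0)

    patterns = []
    for m in drawn:
        for x, y, g in scenarios[m]:
            # Round gamma / gamma_low up; the small tolerance guards against
            # floating-point noise. gamma == 0 yields no copy at all.
            copies = math.ceil(g / gamma_low - 1e-9)
            patterns.extend([(x, y)] * copies)
    return patterns, oob


if __name__ == "__main__":
    toy = [
        [([0.1], "c1", 2), ([0.2], "c1", 0), ([0.9], "c2", 4)],
        [([0.3], "c1", 2), ([0.8], "c2", 3)],
    ]
    patterns, oob = build_tree_training_set(toy, random.Random(0))
    print(len(patterns), oob)
```

The list `oob` returned for each tree is what makes the oob-estimate (6.15) possible: a scenario is evaluated only with trees for which it appears in `oob`.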

6.3 Feature Selection with SBRF

This section focuses on wrapper and embedded methods for time series where the learning machine involved in the selection step is realized by the SBRF. An advantage of wrapper and embedded methods is that they also take into account interactions among the features when evaluating their importance for the classification task. The algorithm represents a forward-backward feature selection method. The forward feature selection is accomplished by the use of decision trees in the SBRF, and the backward feature elimination is realized by starting with a large set of features and reducing this set based on the performance of the SBRF.

The task of feature selection is to determine the matrix $J$ from (6.6). As already mentioned in Section 2.3, feature selection is strongly related to the task of model assessment. The approach presented here is based on the out-of-bag estimate $\hat{R}_{\mathrm{oob}}(f_J)$ from (6.15). For the feature selection task it is assumed that the only unknown mapping that influences $R(f_J)$ is the selection matrix $J$. In order to assure that this assumption roughly holds, it is necessary to implement $f_J$ with machine learning algorithms that are able to generalize well in high-dimensional spaces. For the same reasons as the Random Forest algorithm, the SBRF classifier leads to low generalization errors even in high dimensions. Additionally, it offers the option to compute fair estimates of the performance, which is a key


issue for feature selection. In the next subsection a wrapper method and in Subsection 6.3.2 embedded methods using the SBRF algorithm will be presented.

6.3.1 Wrapper Method

The loss function for the construction of a tree $f_b$ is

$$L(f_b(\tilde{x}[n]), y[n]) = 1 - \delta(f_b(\tilde{x}[n]), y[n]), \qquad (6.17)$$

where the realizations of $\tilde{x}[n]$ and $y[n]$ used to build the b-th tree are stored in the oversampled training set $D_{\tilde{X}}^{\prime(b)}$. The minimization of the expectation of $L(f_b(\tilde{x}[n]), y[n])$ also assures the minimization of the risk using the losses from (6.12) or (6.13). To find out the importance of the j-th high-level feature, one computes the loss in performance due to the j-th feature,

$$\Delta^{(j)} = \hat{R}_{\mathrm{oob}}^{(j)}(f_J) - \hat{R}_{\mathrm{oob}}(f_J). \qquad (6.18)$$
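The importance measure (6.18) can be sketched in a few lines of Python. The sketch assumes that the information of feature j is removed by randomly permuting its values within each scenario, as is common for random forests, and that a callable `risk_fn` returns the risk for a list of scenarios. Both `risk_fn` and the data layout are hypothetical stand-ins for the oob-estimate of a trained SBRF.

```python
import random


def permutation_importance(risk_fn, scenarios, n_features, rng):
    """Delta^(j): risk with feature j permuted minus the baseline risk.

    risk_fn:   hypothetical callable mapping a list of scenarios to a risk
               value (stands in for the oob-estimate of a trained SBRF).
    scenarios: list of scenarios, each a list of feature vectors (lists).
    """
    baseline = risk_fn(scenarios)
    deltas = []
    for j in range(n_features):
        permuted = []
        for sc in scenarios:
            values = [x[j] for x in sc]
            rng.shuffle(values)  # destroy the information carried by feature j
            permuted.append([x[:j] + [v] + x[j + 1:] for x, v in zip(sc, values)])
        deltas.append(risk_fn(permuted) - baseline)
    return deltas
```

A feature that the classifier does not rely on yields a value of zero, because permuting it leaves the risk unchanged.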

Hereby, $\hat{R}_{\mathrm{oob}}^{(j)}(f_J)$ is the oob-estimate of the risk when the information from the j-th feature has been removed by randomly permuting the j-th column in $\tilde{X}_m$. Alg. 6.2 sums up the steps required to perform feature selection using the SBRF algorithm.

Algorithm 6.2 SBRF feature selection
1: $J = I_{\tilde{N}}$
2: $\hat{R}_{\mathrm{oob}}(f_J) = 0$
3: while $\hat{R}_{\mathrm{oob}}(f_J) \leq$ critical value do
4:   for m = 1 : M do
5:     $X_m = J \tilde{X}_m$
6:   end for
7:   construct a SBRF with B trees according to Alg. 6.1
8:   compute $\hat{R}_{\mathrm{oob}}(f_J)$
9:   for j = 1 : $\tilde{N}$ do
10:    remove the information from the j-th feature and determine $\hat{R}_{\mathrm{oob}}^{(j)}(f_J)$
11:    compute $\Delta^{(j)} = \hat{R}_{\mathrm{oob}}^{(j)}(f_J) - \hat{R}_{\mathrm{oob}}(f_J)$
12:  end for
13:  choose a new $J$ in such a way that the feature with the minimal value of $\Delta^{(j)}$ is discarded
14: end while
15: the penultimate $J$ is the desired selection matrix

The larger the value of $\Delta^{(j)}$ in Alg. 6.2, the more important is the j-th feature. Since only the relative magnitudes of $\Delta^{(j)}$ are of interest, one can normalize these quantities so that the most important feature has the importance 100. Then, in an iterative process, one can eliminate the features with low importance as long as the estimated risk is still acceptable, thereby making a hypothesis about the features to use. Since a new SBRF is required for every new hypothesis, one can speed up the selection process by eliminating, especially in the first iterations, more than one feature in line 13 of Alg. 6.2.

As with the RF feature selection method, masking effects do not appear, i. e., features that share much of the same important information will lead to high values of the


corresponding $\Delta^{(j)}$, making it difficult to eliminate such redundant features. Fortunately, this effect only plays a role in the final selection iterations. Therefore, when only a small number of features is left, one can extend the algorithm and test whether the removal of a different feature than the one with the lowest estimated importance still fulfills the requirement of having a generalization error below the critical value from line 3 of Alg. 6.2.

As presented up to here, the feature selection can be performed with an arbitrary machine learning algorithm $f_J$ that is able to generalize well in high-dimensional spaces. Therefore, a wrapper method is realized by using the above procedure. As an example, Fig. 6.3 shows a two-dimensional space and the separating boundary that is defined by the mapping $f_J$, represented by the bold curve. Additionally, one can see a dashed trajectory representing a segment of a scenario $S$ in the input space of $f_J$, where a change from label $c_{k_1}$ to $c_{k_2}$ occurs. The inputs $\tilde{x}[n]$ are represented by the markers on the dashed curve. Although the SBRF can provide confidence information about a decision, this is not used in the wrapper approach. Due to the loss function from (6.12), only the time instance $n = n_{\mathrm{ch},i}$ when the change from $c_{k_1}$ to $c_{k_2}$ occurs is needed to evaluate the performance. The input $\tilde{x}[n_{\mathrm{ch},i}]$ is visualized in Fig. 6.3 by the black rectangle. In the vicinity of $\tilde{x}[n_{\mathrm{ch},i}]$ there are inputs $\tilde{x}[n]$ where not all B trees in the SBRF decide for the same class. In Fig. 6.3 the light grey area represents a region of the input space where some trees in the SBRF vote for class $c_{k_2}$ but the majority votes for class $c_{k_1}$. Equivalently, the dark grey area stands for a region in the input space where some trees in the SBRF vote for class $c_{k_1}$ but the majority votes for class $c_{k_2}$. This soft information can be used for feature selection, leading to the embedded methods from the next subsection.

Figure 6.3: Trajectory of a scenario and its relevant point for the wrapper method (axes $\tilde{x}_1[n]$ and $\tilde{x}_2[n]$; regions $\hat{y}[n] = c_{k_1}$ and $\hat{y}[n] = c_{k_2}$; marked time instances $n_{\mathrm{ch},i} - 2$, $n_{\mathrm{ch},i}$, $n_{\mathrm{ch},i} + 2$)

6.3.2 Embedded Methods

Similarly to the RF procedure for the usual feature-value model, other loss functions can be defined based on the margin of the SBRF algorithm, leading to embedded methods for feature selection in change detection. For this purpose the confidence based penalization is introduced,

$$\mathrm{cp}[n] = \gamma[n] \left(1 - \mathrm{mg}(\tilde{x}[n], y[n])\right), \qquad (6.19)$$

where $\mathrm{mg}(\tilde{x}[n], y[n])$ is defined as in (2.118),

$$\mathrm{mg}(\tilde{x}[n], y[n]) = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \delta(f_b(\tilde{x}[n]), y[n]) - \max_{c_k \neq y[n]} \left[ \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \delta(f_b(\tilde{x}[n]), c_k) \right], \qquad (6.20)$$


where $\mathcal{B}$ denotes the set of decision trees $f_b$ which have not used the scenario to which $\tilde{x}[n]$ and $y[n]$ belong during the training phase. The margin $\mathrm{mg}(\tilde{x}[n], y[n])$ measures the degree of confidence one can have in the decision of the SBRF at time instance $n$. If for the input $\tilde{x}[n]$ all trees in the SBRF identify the correct class $y[n]$, then $\mathrm{cp}[n]$ is zero, whereas the greatest value of $\mathrm{cp}[n]$ is $2\gamma[n]$ since the margin $\mathrm{mg}(\tilde{x}[n], y[n]) \in [-1, 1]$. Whenever a misclassification occurs, the value $\mathrm{cp}[n]$ exceeds the value $\gamma[n]$.

Fig. 6.4 shows by way of example how $\mathrm{cp}[n]$ is generated from the penalization $\gamma[n]$ and the margin $\mathrm{mg}(\tilde{x}[n], y[n])$ for a scenario with two class labels $c_1$ and $c_2$. The upper left plot presents $y[n]$ and the upper right plot the decisions taken by a SBRF $f_J$. The “do not care” interval contains the time instances n = 3 and n = 4. The soft information underlying the SBRF decisions is described by the margin in the lower right plot. Then, with the penalization $\gamma[n]$ that can be seen in the lower left plot, the confidence penalization $\mathrm{cp}[n]$ is computed according to (6.19) and also shown in the lower left plot. The SBRF produces two misclassifications at n = 5 and n = 6. For these two time instances the margin has the value −1, which indicates that all trees in the forest have voted for the wrong class. For n = 7 and n = 8 the SBRF decides for the correct class $c_2$, but there are trees in the forest which vote for $c_1$, so that $\mathrm{cp}[n]$ is not zero here. Finally, for n = 9 the margin becomes 1, indicating that all trees in the forest vote for class $c_2$.

Figure 6.4: Example for the computation of cp[n] (panels: $y[n]$, $f_J(\tilde{x}[n])$, $\gamma[n]$ with $\mathrm{cp}[n]$, and $\mathrm{mg}(\tilde{x}[n], y[n])$, each over n = 0, ..., 9)
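The margin (6.20) and the confidence based penalization (6.19) can be computed directly from the votes of the trees. The following Python sketch assumes that `votes` already contains only the labels predicted by those trees that have not seen the scenario during training; the function names are chosen for illustration.

```python
from collections import Counter


def margin(votes, y):
    """Margin (6.20): share of votes for y minus the largest share of any other class."""
    counts = Counter(votes)
    total = len(votes)
    correct = counts.get(y, 0) / total
    best_other = max(
        (c / total for label, c in counts.items() if label != y), default=0.0
    )
    return correct - best_other


def confidence_penalization(votes, y, gamma):
    """cp[n] = gamma[n] * (1 - mg) according to (6.19)."""
    return gamma * (1.0 - margin(votes, y))
```

With four trees of which three vote for the correct class, the margin is 0.75 − 0.25 = 0.5, so a penalization of γ[n] = 2 gives cp[n] = 1.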

Using the confidence based penalization, a loss function on scenario level is defined as

$$L_{\mathrm{emb},1}(f(X), y) = \frac{\sum_{n=0}^{n_{\mathrm{end}}-1} \mathrm{cp}[n]}{2 \sum_{n=0}^{n_{\mathrm{end}}-1} \gamma[n]}. \qquad (6.21)$$

The lower $L_{\mathrm{emb},1}(f(X), y)$, the more confidence one can have that the decisions of the SBRF algorithm are correct for the whole scenario. Using the loss from (6.21) in lines 8 and 10 of Alg. 6.2 in order to estimate the risk $R(f_J)$, feature selection based on the SBRF confidence information can be performed. For the same setting that underlies Fig. 6.3, in Fig. 6.5 the points $\tilde{x}[n]$ with $\mathrm{cp}[n] \neq 0$, i. e., those points that increase the loss from (6.21), are marked by black rectangles. Performing


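Figure 6.5: Trajectory of a scenario and its relevant points with $L_{\mathrm{emb},1}(f(X), y)$ (axes $\tilde{x}_1[n]$ and $\tilde{x}_2[n]$; regions $\hat{y}[n] = c_{k_1}$ and $\hat{y}[n] = c_{k_2}$; marked time instances $n_{\mathrm{ch},i} - 2$, $n_{\mathrm{ch},i}$, $n_{\mathrm{ch},i} + 2$)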

feature selection this way favors those features which assure a high confidence in the correct decision for the whole duration of the scenarios. Nevertheless, this strategy has a drawback due to the correlation among adjacent input vectors. A chain of misclassifications that have the same reason, i. e., the inputs lie in the same region of the input space leading to a wrong decision, is weighted according to the duration of the stay in this region. Thus, misclassifications of longer duration are penalized more heavily than misclassifications of shorter duration. If one wants to penalize the underlying reasons leading to misclassifications fairly, the loss $L_{\mathrm{emb},1}(f(X), y)$ must be changed. To do so, two further loss functions are introduced in the following.

Using the confidence based penalization, the loss function $L_{\mathrm{emb},2}(f(X), y)$ is defined as

$$L_{\mathrm{emb},2}(f(X), y) = \frac{1}{C_{\mathrm{ch}}} \sum_{i=1}^{C_{\mathrm{ch}}} \mathrm{cp}[n_{\mathrm{ch},i}], \qquad (6.22)$$

where, equivalently to (6.12), $C_{\mathrm{ch}}$ denotes the total number of changes in $y$ and $f(X)$ and $n_{\mathrm{ch},i}$ the time instances when these changes occur. As already mentioned, if $n_{\mathrm{ch},i}$ lies in a “do not care” interval, then $n_{\mathrm{ch},i}$ is set to the next time instance following the “do not care” interval. The loss from (6.22) is closely related to the loss from (6.12), but instead of penalizing a misclassification hard with the cost $\gamma[n_{\mathrm{ch},i}]$, here a misclassification is penalized softly with cp. Thus, for the setting underlying Figures 6.3 and 6.5, the relevant point for performing feature selection using $L_{\mathrm{emb},2}(f(X), y)$ is the one marked by a black rectangle in Fig. 6.3.

Finally, a loss function is introduced that evaluates cp at those time instances when the trajectory described by the scenario in the input space enters or departs from a region where the SBRF has low confidence in the decision. By choosing these time instances, regions of the input space with low confidence, which occur at the border between classes, are weighted equally, without taking into account how long the trajectory stays there. To introduce such a loss function, a confidence measure for SBRF decisions is first defined as

$$\mathrm{cm}[n] = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \delta(f_b(\tilde{x}[n]), \hat{y}[n]) - \max_{c_k \neq \hat{y}[n]} \left[ \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \delta(f_b(\tilde{x}[n]), c_k) \right], \qquad (6.23)$$

where $\hat{y}[n]$ represents the class for which the majority of the trees in the SBRF vote at time stamp $n$ and $\mathcal{B}$ is, equivalently to (6.20), the set of decision trees $f_b$ which have not used the scenario to which $\tilde{x}[n]$ belongs during the training phase. Unlike the margin $\mathrm{mg}(\tilde{x}[n], y[n])$, the confidence measure $\mathrm{cm}[n]$ can be computed without knowing $y[n]$. Time instances $n_{\mathrm{cm},i}$ are


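chosen using $\mathrm{cm}[n]$ as

$$n_{\mathrm{cm},i} = \min_n \left\{ n : |\mathrm{cm}[n] - \mathrm{cm}[n_{\mathrm{cm},i-1}]| > \tau_{\mathrm{SBRF}},\ n > n_{\mathrm{cm},i-1} \right\}, \qquad (6.24)$$

where $n_{\mathrm{cm},0} = 0$ and $\tau_{\mathrm{SBRF}}$ is a threshold that is set to 0.1 in order to assure that at least 10% of the trees have changed their decision since $n_{\mathrm{cm},i-1}$. Based on the time instances $n_{\mathrm{cm},i}$, $i = 1, \ldots, C_{\mathrm{cm}}$, the loss $L_{\mathrm{emb},3}(f(X), y)$ is defined as

$$L_{\mathrm{emb},3}(f(X), y) = \frac{1}{C_{\mathrm{cm}}} \sum_{i=1}^{C_{\mathrm{cm}}} \mathrm{cp}[n_{\mathrm{cm},i}]. \qquad (6.25)$$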
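The recursive choice of the time instances in (6.24) amounts to a single pass over the confidence measure. A minimal Python sketch, assuming `cm` is a list holding cm[n] for n = 0, ..., n_end − 1 and using the illustrative name `select_confidence_changes`:

```python
def select_confidence_changes(cm, tau=0.1):
    """Time instances n_cm,i according to (6.24), starting from n_cm,0 = 0."""
    picked = []
    last = 0  # n_cm,0
    for n in range(1, len(cm)):
        # The first n after the previous pick whose confidence differs by more than tau.
        if abs(cm[n] - cm[last]) > tau:
            picked.append(n)
            last = n
    return picked
```

For `cm = [1.0, 1.0, 0.95, 0.6, 0.6, -0.2, -0.25, 0.5, 1.0, 1.0]` the sketch picks the instances 3, 5, 7, and 8: small fluctuations such as the step from 1.0 to 0.95 are ignored, and a long stay at the same confidence level contributes only once.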

For the same setting that underlies Figures 6.3 and 6.5, a possible set of points $\tilde{x}[n]$ that are chosen for computing the loss from (6.25) is marked by black rectangles in Fig. 6.6.

Figure 6.6: Trajectory of a scenario and its relevant points with $L_{\mathrm{emb},3}(f(X), y)$ (axes $\tilde{x}_1[n]$ and $\tilde{x}_2[n]$; regions $\hat{y}[n] = c_{k_1}$ and $\hat{y}[n] = c_{k_2}$; marked time instances $n_{\mathrm{ch},i} - 2$, $n_{\mathrm{ch},i}$, $n_{\mathrm{ch},i} + 2$)

Table 6.1 gives a brief overview of the three losses $L_{\mathrm{emb},1}(f(X), y)$, $L_{\mathrm{emb},2}(f(X), y)$, and $L_{\mathrm{emb},3}(f(X), y)$.

Loss                                 Confidence penalization considered for
$L_{\mathrm{emb},1}(f(X), y)$        all time instances
$L_{\mathrm{emb},2}(f(X), y)$        those time instances when a class change occurs in $y$ or in $f(X)$
$L_{\mathrm{emb},3}(f(X), y)$        those time instances when the trajectory described by the scenario in the input space enters or departs from a region where the SBRF has low confidence in the decision

Table 6.1: The three losses $L_{\mathrm{emb},1}(f(X), y)$, $L_{\mathrm{emb},2}(f(X), y)$, and $L_{\mathrm{emb},3}(f(X), y)$
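The three losses differ only in the set of time instances over which cp[n] is evaluated. The following Python sketch makes this explicit; it assumes that `cp` and `gamma` are lists indexed by n and that the index sets for (6.22) and (6.25) have already been determined. Returning 0 for an empty index set is an assumption of this sketch, not something stated above.

```python
def loss_emb1(cp, gamma):
    """(6.21): total cp, normalized by its largest possible value 2 * sum(gamma)."""
    return sum(cp) / (2.0 * sum(gamma))


def mean_cp_at(cp, instances):
    """Mean of cp over the given time instances (0.0 if there are none)."""
    if not instances:
        return 0.0
    return sum(cp[n] for n in instances) / len(instances)


def loss_emb2(cp, change_instances):
    """(6.22): mean cp over the class-change instances n_ch,i."""
    return mean_cp_at(cp, change_instances)


def loss_emb3(cp, confidence_instances):
    """(6.25): mean cp over the confidence-change instances n_cm,i."""
    return mean_cp_at(cp, confidence_instances)
```

Because (6.22) and (6.25) average over a handful of selected instances, a long chain of identical misclassifications raises `loss_emb1` but not necessarily the other two losses.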

6.4 SBRF for Car Crash Classification

To determine the best suited loss function for robust feature selection, the strategy from Section 5.2 is adopted. Hereby, for each of the M folds the feature importance is computed using Alg. 6.2. A SBRF is constructed with M − 1 scenarios and the importance of every feature is estimated from these M − 1 scenarios with all four loss functions: $L$ from (6.12), $L_{\mathrm{emb},1}$ from (6.21), $L_{\mathrm{emb},2}$ from (6.22), and $L_{\mathrm{emb},3}$ from (6.25). Since there are M folds, one obtains M estimates for the importance of every feature with each selection method. Based


on suitability and robustness, which are measured by the estimated mean and standard deviation of the importance of each feature, that selection method is chosen where the important features are identified most reliably among the M folds.

The feature selection method that is introduced in this work can be applied to many industrial applications. In the following it will be shown how feature selection can be performed for the car crash application. The number of features at the beginning of the selection process is set to $\tilde{N} = 33$. The construction of features $\tilde{x}_1[n]$, $\tilde{x}_4[n]$, and $\tilde{x}_7[n]$ requires the description of the classification system that is presented in Chapter 7. Therefore, at this point it is only mentioned that the three features represent aggregated values over time: $\tilde{x}_1[n]$ stems from the ZAEX sensor, $\tilde{x}_4[n]$ is a sum of two signals that are obtained from ECSL and ECSR, and $\tilde{x}_7[n]$ is a difference of two signals that are obtained from ECSL and ECSR.¹ In order to take into account how these signals evolve over time, the following features are used:

$$\tilde{x}_2[n] = \tilde{x}_1[n] - \tilde{x}_1[n - \tau_{\mathrm{back},1}], \qquad (6.26)$$
$$\tilde{x}_3[n] = \tilde{x}_1[n] - \tilde{x}_1[n - \tau_{\mathrm{back},2}], \qquad (6.27)$$
$$\tilde{x}_5[n] = \tilde{x}_4[n] - \tilde{x}_4[n - \tau_{\mathrm{back},1}], \qquad (6.28)$$
$$\tilde{x}_6[n] = \tilde{x}_4[n] - \tilde{x}_4[n - \tau_{\mathrm{back},2}], \qquad (6.29)$$
$$\tilde{x}_8[n] = \tilde{x}_7[n] - \tilde{x}_7[n - \tau_{\mathrm{back},1}], \qquad (6.30)$$
$$\tilde{x}_9[n] = \tilde{x}_7[n] - \tilde{x}_7[n - \tau_{\mathrm{back},2}], \qquad (6.31)$$

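with $\tau_{\mathrm{back},1} = 8$ and $\tau_{\mathrm{back},2} = 16$. The features $\tilde{x}_{10}[n]$, $\tilde{x}_{11}[n]$, $\tilde{x}_{12}[n]$, $\tilde{x}_{16}[n]$, $\tilde{x}_{17}[n]$, $\tilde{x}_{18}[n]$, $\tilde{x}_{22}[n]$, $\tilde{x}_{23}[n]$, $\tilde{x}_{24}[n]$, $\tilde{x}_{28}[n]$, $\tilde{x}_{29}[n]$, and $\tilde{x}_{30}[n]$ are computed from the multivariate time series $U' = [u'_{\mathrm{ZAEX}}, u'_{\mathrm{ZAEY}}, u'_{\mathrm{ECSL}}, u'_{\mathrm{ECSR}}]^T \in \mathbb{R}^{4 \times n_{\mathrm{end}}}$, which results after low-pass filtering each univariate time series in the scenario $S$. The low-pass filter chosen here is an all-ones FIR filter with 16 taps, scaled by the factor 1/16. Similarly, the features $\tilde{x}_{13}[n]$, $\tilde{x}_{14}[n]$, $\tilde{x}_{15}[n]$, $\tilde{x}_{19}[n]$, $\tilde{x}_{20}[n]$, $\tilde{x}_{21}[n]$, $\tilde{x}_{25}[n]$, $\tilde{x}_{26}[n]$, $\tilde{x}_{27}[n]$, $\tilde{x}_{31}[n]$, $\tilde{x}_{32}[n]$, and $\tilde{x}_{33}[n]$ are computed from the multivariate time series $U'' = [u''_{\mathrm{ZAEX}}, u''_{\mathrm{ZAEY}}, u''_{\mathrm{ECSL}}, u''_{\mathrm{ECSR}}]^T \in \mathbb{R}^{4 \times n_{\mathrm{end}}}$, which results after low-pass filtering each univariate time series in the scenario $S$ with an all-ones FIR filter with 8 taps, scaled by the factor 1/8. With

$$u_{\mathrm{ECS+}}[n] = |u_{\mathrm{ECSL}}[n]| + |u_{\mathrm{ECSR}}[n]| \qquad (6.32)$$

and

$$u_{\mathrm{ECS-}}[n] = \big|\, |u_{\mathrm{ECSL}}[n]| - |u_{\mathrm{ECSR}}[n]| \,\big| \qquad (6.33)$$

the features are constructed as follows, where each signal $u$ is taken from the filtered time series assigned above to the respective feature: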

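$$\tilde{x}_{10}[n] = \max\{|u_{\mathrm{ZAEX}}[n]|, |u_{\mathrm{ZAEX}}[n-1]|, \ldots, |u_{\mathrm{ZAEX}}[n-\tau_{\mathrm{hist}}]|\}, \qquad (6.34)$$
$$\tilde{x}_{11}[n] = \tilde{x}_{10}[n] - \tilde{x}_{10}[n - \tau_{\mathrm{back},1}], \qquad (6.35)$$
$$\tilde{x}_{12}[n] = \tilde{x}_{10}[n] - \tilde{x}_{10}[n - \tau_{\mathrm{back},2}], \qquad (6.36)$$

$$\tilde{x}_{13}[n] = \max\{|u_{\mathrm{ZAEX}}[n]|, |u_{\mathrm{ZAEX}}[n-1]|, \ldots, |u_{\mathrm{ZAEX}}[n-\tau_{\mathrm{hist}}]|\}, \qquad (6.37)$$
$$\tilde{x}_{14}[n] = \tilde{x}_{13}[n] - \tilde{x}_{13}[n - \tau_{\mathrm{back},1}], \qquad (6.38)$$
$$\tilde{x}_{15}[n] = \tilde{x}_{13}[n] - \tilde{x}_{13}[n - \tau_{\mathrm{back},2}], \qquad (6.39)$$

¹ The exact computation of $\tilde{x}_1[n]$, $\tilde{x}_4[n]$, and $\tilde{x}_7[n]$ is described by Eqs. (7.37), (7.38), and (7.39).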


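$$\tilde{x}_{16}[n] = \max\{|u_{\mathrm{ZAEY}}[n]|, |u_{\mathrm{ZAEY}}[n-1]|, \ldots, |u_{\mathrm{ZAEY}}[n-\tau_{\mathrm{hist}}]|\}, \qquad (6.40)$$
$$\tilde{x}_{17}[n] = \tilde{x}_{16}[n] - \tilde{x}_{16}[n - \tau_{\mathrm{back},1}], \qquad (6.41)$$
$$\tilde{x}_{18}[n] = \tilde{x}_{16}[n] - \tilde{x}_{16}[n - \tau_{\mathrm{back},2}], \qquad (6.42)$$

$$\tilde{x}_{19}[n] = \max\{|u_{\mathrm{ZAEY}}[n]|, |u_{\mathrm{ZAEY}}[n-1]|, \ldots, |u_{\mathrm{ZAEY}}[n-\tau_{\mathrm{hist}}]|\}, \qquad (6.43)$$
$$\tilde{x}_{20}[n] = \tilde{x}_{19}[n] - \tilde{x}_{19}[n - \tau_{\mathrm{back},1}], \qquad (6.44)$$
$$\tilde{x}_{21}[n] = \tilde{x}_{19}[n] - \tilde{x}_{19}[n - \tau_{\mathrm{back},2}], \qquad (6.45)$$

$$\tilde{x}_{22}[n] = \max\{u_{\mathrm{ECS+}}[n], u_{\mathrm{ECS+}}[n-1], \ldots, u_{\mathrm{ECS+}}[n-\tau_{\mathrm{hist}}]\}, \qquad (6.46)$$
$$\tilde{x}_{23}[n] = \tilde{x}_{22}[n] - \tilde{x}_{22}[n - \tau_{\mathrm{back},1}], \qquad (6.47)$$
$$\tilde{x}_{24}[n] = \tilde{x}_{22}[n] - \tilde{x}_{22}[n - \tau_{\mathrm{back},2}], \qquad (6.48)$$

$$\tilde{x}_{25}[n] = \max\{u_{\mathrm{ECS-}}[n], u_{\mathrm{ECS-}}[n-1], \ldots, u_{\mathrm{ECS-}}[n-\tau_{\mathrm{hist}}]\}, \qquad (6.49)$$
$$\tilde{x}_{26}[n] = \tilde{x}_{25}[n] - \tilde{x}_{25}[n - \tau_{\mathrm{back},1}], \qquad (6.50)$$
$$\tilde{x}_{27}[n] = \tilde{x}_{25}[n] - \tilde{x}_{25}[n - \tau_{\mathrm{back},2}], \qquad (6.51)$$

$$\tilde{x}_{28}[n] = \max\{u_{\mathrm{ECS+}}[n], u_{\mathrm{ECS+}}[n-1], \ldots, u_{\mathrm{ECS+}}[n-\tau_{\mathrm{hist}}]\}, \qquad (6.52)$$
$$\tilde{x}_{29}[n] = \tilde{x}_{28}[n] - \tilde{x}_{28}[n - \tau_{\mathrm{back},1}], \qquad (6.53)$$
$$\tilde{x}_{30}[n] = \tilde{x}_{28}[n] - \tilde{x}_{28}[n - \tau_{\mathrm{back},2}], \qquad (6.54)$$

$$\tilde{x}_{31}[n] = \max\{u_{\mathrm{ECS-}}[n], u_{\mathrm{ECS-}}[n-1], \ldots, u_{\mathrm{ECS-}}[n-\tau_{\mathrm{hist}}]\}, \qquad (6.55)$$
$$\tilde{x}_{32}[n] = \tilde{x}_{31}[n] - \tilde{x}_{31}[n - \tau_{\mathrm{back},1}], \qquad (6.56)$$
$$\tilde{x}_{33}[n] = \tilde{x}_{31}[n] - \tilde{x}_{31}[n - \tau_{\mathrm{back},2}], \qquad (6.57)$$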

with $\tau_{\mathrm{hist}} = 16$. Using the categorization from Section 4.5, all 33 features belong to the group of local features since only signal properties from a limited time interval are considered in the feature generation step.

In order to perform feature selection, the data set is divided into M folds, and for each fold a SBRF with B = 100 CART trees is grown and used to estimate the importance of the features. Hereby, the penalizations $\gamma[n]$ are set according to the course in Fig. 5.1 with $\gamma_{\min} = 2$ and $\gamma_{\max} = 4$. For each feature the importance is estimated with $L$, $L_{\mathrm{emb},1}$, $L_{\mathrm{emb},2}$, and $L_{\mathrm{emb},3}$. Additionally, the classification performance is measured in each fold for the scenario that is not used for the training of the SBRF, so that a LOO estimate of the performance can be computed. This procedure generates M estimates of the importance for each feature and loss function and M estimates of the classification performance. The features are then sorted by decreasing importance for each loss function in every fold, so that 4M importance lists result. The rank of a feature in these importance lists is treated as a random variable whose mean and standard deviation are estimated from the M folds, leading to the values $\hat{\mu}_{\mathrm{rank}}$ and $\hat{\sigma}_{\mathrm{rank}}$ for every feature and each loss function. For car type I these estimates are shown in Table 6.2 and for car type II in Table 6.3. For clarity, the values $\hat{\mu}_{\mathrm{rank}}$ of the top ten features for each selection method are written with bold numbers.
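Before turning to the rankings, the construction of the local features in (6.26) to (6.57) can be made concrete. They combine three operations: a scaled all-ones FIR filter, a running maximum of absolute values over the last $\tau_{\mathrm{hist}}$ samples, and a difference to a delayed copy. The Python sketch below shows one possible realization; the handling of the first samples of a scenario (truncated windows, zeros before n = 0) is an assumption, since the boundary treatment is not specified here.

```python
def moving_average(u, taps):
    """All-ones FIR filter with `taps` coefficients, scaled by 1/taps (zero initial state)."""
    return [sum(u[max(0, n - taps + 1): n + 1]) / taps for n in range(len(u))]


def running_max_abs(u, tau_hist=16):
    """max{|u[n]|, ..., |u[n - tau_hist]|}, truncated at the start of the scenario."""
    return [
        max(abs(v) for v in u[max(0, n - tau_hist): n + 1]) for n in range(len(u))
    ]


def back_difference(x, tau_back):
    """x[n] - x[n - tau_back]; samples before n = 0 are treated as zero (assumption)."""
    return [x[n] - (x[n - tau_back] if n >= tau_back else 0.0) for n in range(len(x))]
```

Under these assumptions, a feature such as $\tilde{x}_{11}[n]$ would be obtained as `back_difference(running_max_abs(moving_average(u, 16)), 8)` for the raw ZAEX signal `u`.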


Entries: $\hat{\mu}_{\mathrm{rank}} \backslash \hat{\sigma}_{\mathrm{rank}}$ with the respective loss function

Feature   L              L_emb,1        L_emb,2        L_emb,3
x1        16.41\12.34    6.86\1.62      14.12\12.51    5.84\1.48
x2        20.02\11.60    18.97\2.05     14.62\10.39    15.60\3.12
x3        26.90\9.54     22.57\2.64     21.34\10.63    23.75\4.34
x4        21.46\11.18    8.68\1.66      17.87\11.32    8.00\1.67
x5        26.30\8.36     21.39\4.62     21.76\10.50    21.52\4.78
x6        28.77\5.37     25.84\4.05     28.07\7.18     27.00\4.84
x7        25.96\7.06     15.96\5.09     21.79\8.70     20.38\5.25
x8        21.42\8.35     27.79\3.91     18.47\9.18     22.79\3.66
x9        26.66\4.53     30.17\2.92     26.63\6.24     27.84\4.20
x10       6.85\8.79      1.50\0.55      8.01\9.82      2.79\0.86
x11       12.95\9.35     14.71\1.27     12.77\8.82     11.42\2.79
x12       23.14\6.09     24.30\3.28     20.50\7.92     23.03\3.89
x13       3.95\5.27      3.73\0.47      6.09\7.85      2.825\0.78
x14       13.09\9.04     17.98\1.49     11.69\8.03     15.25\2.37
x15       18.55\7.24     21.63\1.59     16.79\8.53     22.26\2.66
x16       21.15\6.78     9.25\1.65      20.20\8.10     11.44\2.54
x17       18.30\7.30     11.80\1.43     16.69\9.16     14.87\4.15
x18       21.65\5.04     27.53\3.25     21.98\6.22     26.98\3.60
x19       17.26\7.20     6.09\1.28      16.28\8.22     7.84\2.26
x20       14.77\6.82     10.25\1.93     14.55\8.71     12.03\3.61
x21       20.25\4.50     29.74\2.50     22.63\5.78     27.39\3.74
x22       3.84\4.66      1.55\0.52      4.19\6.20      1.03\0.17
x23       16.31\5.98     14.46\2.32     15.74\6.74     13.66\2.32
x24       19.07\4.49     27.34\2.71     21.71\5.92     26.06\3.94
x25       13.77\6.91     8.42\1.77      17.28\7.91     7.26\1.52
x26       14.52\5.65     15.71\3.25     13.92\5.93     19.46\4.64
x27       16.11\4.41     28.77\2.79     20.09\5.18     30.42\2.49
x28       9.12\5.94      3.20\0.53      12.58\8.17     3.34\0.73
x29       13.15\5.03     18.68\3.47     14.82\5.72     18.14\3.74
x30       14.74\4.59     28.15\2.32     19.49\4.85     27.61\2.95
x31       11.77\4.91     7.28\2.02      16.00\6.03     7.65\1.85
x32       10.80\4.32     21.68\4.83     14.38\4.73     25.33\4.18
x33       11.82\4.88     28.87\2.17     17.79\4.84     30.07\2.35

Table 6.2: Feature ranking for car type I, $\tilde{N} = 33$
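The entries of such a ranking table are per-feature statistics of the rank over the folds. A minimal Python sketch of this computation follows; it assumes that a larger importance value means a more important feature, that ties are broken by feature index, and that the sample standard deviation is used as the estimator, none of which is specified above.

```python
import statistics


def rank_statistics(importances_per_fold):
    """Mean and standard deviation of each feature's rank over the folds.

    importances_per_fold: list (one entry per fold) of lists with one
                          importance value per feature; rank 1 = most important.
    """
    n_features = len(importances_per_fold[0])
    ranks = [[] for _ in range(n_features)]
    for importances in importances_per_fold:
        # Sort feature indices by decreasing importance within this fold.
        order = sorted(range(n_features), key=lambda j: importances[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            ranks[j].append(rank)
    return [(statistics.mean(r), statistics.stdev(r)) for r in ranks]
```

A feature that is ranked first in every fold obtains the pair (1, 0), which corresponds to the behavior of the top features in the columns with small $\hat{\sigma}_{\mathrm{rank}}$.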

It can be seen that by using the loss functions $L$ from (6.12) and $L_{\mathrm{emb},2}$ from (6.22) the standard deviations for all 33 features are high, which means that the selection methods using these loss functions are not robust, leading to large variations when identifying the most important features. The variations are not only caused by the removal of a scenario in the LOO procedure but also by the randomness in the SBRF, i. e., evaluating the feature importance with different SBRF classifiers, all grown with the same data, still leads to large variations in the ranking. The large variations of the feature ranks computed with $L$ from (6.12) are shown by the high values of $\hat{\sigma}_{\mathrm{rank}}$ in the second column of Tables 6.2 and 6.3. Similarly, large values for the estimated standard deviations result when using the


loss function $L_{\mathrm{emb},2}$ from (6.22) for car type I, as can be seen in the fourth column of Table 6.2. A hint that $L_{\mathrm{emb},2}$ is better suited for the problem at hand than $L$ from (6.12) is given by the small values of $\hat{\sigma}_{\mathrm{rank}}$ in the fourth column of Table 6.3 for the top features of car type II.

Entries: $\hat{\mu}_{\mathrm{rank}} \backslash \hat{\sigma}_{\mathrm{rank}}$ with the respective loss function

Feature   L              L_emb,1        L_emb,2        L_emb,3
x1        21.45\12.18    6.27\0.76      13.49\10.01    9.58\2.54
x2        28.94\7.88     18.47\2.12     21.47\9.38     21.60\3.37
x3        27.20\7.91     25.24\2.47     23.86\7.91     29.09\3.62
x4        6.31\7.64      3.17\0.43      3.03\1.15      3.21\0.53
x5        16.37\10.23    13.95\2.43     10.78\6.76     12.29\3.30
x6        22.79\9.79     23.72\2.65     21.47\9.44     25.55\4.15
x7        23.50\9.61     12.80\2.84     23.19\9.00     16.27\4.53
x8        19.91\9.80     20.41\1.44     21.33\9.42     18.86\3.54
x9        21.67\9.00     27.70\2.56     23.27\7.81     28.22\3.75
x10       8.25\8.21      4.03\0.66      5.78\5.07      5.10\0.94
x11       20.51\8.35     12.02\2.51     13.81\7.23     14.41\3.04
x12       21.77\7.46     26.47\2.27     21.94\7.04     25.93\3.63
x13       16.12\9.11     4.86\0.55      9.02\6.60      4.29\0.84
x14       20.80\7.10     15.21\2.81     17.57\7.53     18.83\3.49
x15       24.31\5.04     27.41\2.13     22.02\6.07     25.54\3.26
x16       18.43\8.04     15.33\2.37     20.64\8.40     16.06\4.25
x17       19.24\6.82     15.29\3.05     15.25\6.69     17.80\4.15
x18       21.71\5.64     31.65\1.61     25.02\5.47     31.62\1.67
x19       16.98\7.00     11.19\2.12     18.26\7.77     9.28\2.17
x20       18.93\5.52     20.98\1.58     20.04\6.60     22.22\4.15
x21       20.36\4.91     30.22\1.42     24.54\4.94     28.94\2.68
x22       2.55\3.07      1.17\0.37      1.66\0.78      1.35\0.47
x23       11.97\6.58     7.71\1.24      10.54\6.56     8.83\2.30
x24       15.98\5.43     24.29\2.15     19.25\6.47     24.02\3.92
x25       15.10\5.87     7.59\0.95      20.77\6.97     5.50\0.78
x26       14.70\6.50     16.82\2.55     20.47\7.02     12.37\2.78
x27       16.06\5.19     31.12\1.53     21.08\5.18     25.96\3.27
x28       2.21\2.51      1.85\0.41      1.83\0.85      1.65\0.49
x29       12.74\5.61     13.85\2.18     11.60\6.32     12.58\3.02
x30       14.81\5.79     27.60\2.10     19.57\6.02     27.10\3.27
x31       13.03\5.20     9.59\1.69      18.60\6.50     9.18\2.04
x32       11.65\5.02     21.91\2.10     18.16\6.30     20.11\4.06
x33       14.48\5.62     30.95\1.34     21.57\4.70     27.48\3.13

Table 6.3: Feature ranking for car type II, $\tilde{N} = 33$

When using $L_{\mathrm{emb},1}$ or $L_{\mathrm{emb},3}$, the most important features, i. e., those with small $\hat{\mu}_{\mathrm{rank}}$, also have small values for $\hat{\sigma}_{\mathrm{rank}}$, as can be seen in the third and fifth columns of Tables 6.2 and 6.3. This fact indicates that feature selection with $L_{\mathrm{emb},1}$ or $L_{\mathrm{emb},3}$ is robust both to


random effects introduced through the SBRF algorithm and to variations in the training set $D_S$.

The analysis of the four loss functions for the application at hand shows that $L_{\mathrm{emb},1}$ from (6.21) and $L_{\mathrm{emb},3}$ from (6.25) lead to stable selection methods, since the most important features are identified among all folds with small variations. This fact indicates that the influence of a single scenario is marginal in selecting the most important features with $L_{\mathrm{emb},1}$ or $L_{\mathrm{emb},3}$, so that the performance measurement with the LOO method in an input space of reduced dimensionality, where the dimensionality reduction is accomplished with all M scenarios, is still a good estimate of the true performance. Therefore, in the following the feature selection criterion is based on $L_{\mathrm{emb},1}$ and $L_{\mathrm{emb},3}$.

The selection procedure that is applied for this application does not remove only a single feature per iteration as proposed in Alg. 6.2 but, for complexity reasons, a larger set of input variables. In a first step only those features are kept which both the selection with $L_{\mathrm{emb},1}$ and the selection with $L_{\mathrm{emb},3}$ rank in the first third, i. e., where the values of $\hat{\mu}_{\mathrm{rank}}$ lie below 11. Thus, for car type I the features $\tilde{x}_1[n]$, $\tilde{x}_4[n]$, $\tilde{x}_{10}[n]$, $\tilde{x}_{13}[n]$, $\tilde{x}_{19}[n]$, $\tilde{x}_{22}[n]$, $\tilde{x}_{25}[n]$, $\tilde{x}_{28}[n]$, $\tilde{x}_{31}[n]$ and for car type II the features $\tilde{x}_1[n]$, $\tilde{x}_4[n]$, $\tilde{x}_{10}[n]$, $\tilde{x}_{13}[n]$, $\tilde{x}_{22}[n]$, $\tilde{x}_{23}[n]$, $\tilde{x}_{25}[n]$, $\tilde{x}_{28}[n]$, $\tilde{x}_{31}[n]$ remain for further consideration after the first iteration. To make sure that these selections are suitable, the performance of the classifier is measured with the LOO method both with 33 and with the remaining 9 features. The results are shown in Tables 6.10 and 6.12 for car type I and in Tables 6.11 and 6.13 for car type II. The reduction of dimensionality from $\tilde{N} = 33$ to $\tilde{N} = 9$ does not deteriorate the classification performance significantly while combatting the curse of dimensionality.
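The first selection step described above is a simple filter on the mean ranks. The Python sketch below applies it to a few mean ranks for car type I copied from Table 6.2; the function name `keep_top_third` and the dictionary layout are chosen for illustration only.

```python
def keep_top_third(mean_rank_a, mean_rank_b, n_features=33):
    """Keep features whose mean rank lies in the first third under both criteria."""
    limit = n_features / 3  # 11 for 33 features
    return [
        name
        for name in mean_rank_a
        if mean_rank_a[name] < limit and mean_rank_b[name] < limit
    ]


# Mean ranks of a few features for car type I, copied from Table 6.2.
emb1 = {"x1": 6.86, "x4": 8.68, "x16": 9.25, "x19": 6.09, "x20": 10.25, "x23": 14.46}
emb3 = {"x1": 5.84, "x4": 8.00, "x16": 11.44, "x19": 7.84, "x20": 12.03, "x23": 13.66}
print(keep_top_third(emb1, emb3))
```

Of the six features in this excerpt, only x1, x4, and x19 pass: x16 and x20 lie below 11 with $L_{\mathrm{emb},1}$ but not with $L_{\mathrm{emb},3}$, which is why they are not among the nine features kept for car type I.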
For car type I and γ = 1 there is one crash that was mistakenly fired in the 33-dimensional space and which is classified correctly in the 9-dimensional space. On the other hand, with $\tilde{N} = 9$ there is a crash which is mistakenly underestimated and which was classified correctly in the 33-dimensional space. All crashes with γ = 2 and γ = 3 are classified correctly both with $\tilde{N} = 33$ and $\tilde{N} = 9$ for car type I.

For car type II and γ = 1, reducing $\tilde{N}$ to 9 decreases the number of misclassifications, while for γ = 3 an additional misclassification occurs, as presented in Tables 6.11 and 6.13. One could stop the selection process here for car type II, since the misclassification rate for crashes with γ = 3 increased through the dimensionality reduction, and enlarge the input space with some of the features which were discarded. Nevertheless, in the following additional feature selection is performed for two reasons. Firstly, there is only one additional misclassification in the LOO estimates after selection, which does not necessarily mean that the input space is not suited to solve the classification task accurately. Secondly, it is interesting to find out how much the input space can be reduced before the classifier's performance drops drastically. The number of features before such a drop occurs represents an empirical estimate of the intrinsic dimensionality of the problem at hand.

Similarly to the results in Tables 6.2 and 6.3, the feature importance of the 9 remaining features is analyzed and shown in Table 6.4 for car type I and in Table 6.5 for car type II. One can notice that the ratio between the average estimated standard deviation and the number of features is smaller for both car types when $\tilde{N} = 9$. This means that the ranking of the features is more stable with all four selection methods in the lower-dimensional space. For an additional selection step the results from Tables 6.4 and 6.5 are used.
           μ̂rank\σ̂rank with
Feature    L            Lemb,1       Lemb,2       Lemb,3
x1         5.73\2.24    6.30\0.52    6.90\2.29    5.98\0.12
x4         6.95\1.78    8.52\0.70    5.47\1.97    7.79\0.56
x10        1.66\0.83    1.04\0.21    1.33\0.69    1.80\0.58
x13        1.77\0.74    3.41\0.52    4.31\2.13    2.87\0.37
x19        6.20\1.82    3.69\0.60    5.84\1.91    4.38\0.51
x22        3.73\2.03    1.95\0.21    2.68\1.84    1.31\0.49
x25        6.69\1.45    7.50\0.87    6.77\1.58    7.42\0.63
x28        5.49\1.85    4.88\0.36    5.03\1.89    4.63\0.48
x31        6.74\1.55    7.66\0.97    6.63\1.72    8.77\0.54

Table 6.4: Feature ranking for car type I, Ñ = 9

           μ̂rank\σ̂rank with
Feature    L            Lemb,1       Lemb,2       Lemb,3
x1         7.96\1.38    6.31\0.50    7.83\0.98    8.79\0.45
x4         4.24\1.91    4.59\0.51    4.88\1.21    4.28\0.84
x10        2.43\1.21    2.60\0.51    2.66\1.08    3.90\0.95
x13        6.09\1.80    4.39\0.48    5.77\1.06    3.95\0.85
x22        2.68\1.71    1.03\0.18    1.68\1.02    1.14\0.35
x23        6.05\1.40    8.70\0.45    4.06\0.91    7.80\0.67
x25        7.05\1.57    6.71\0.45    8.29\0.85    5.88\0.44
x28        2.09\1.30    2.36\0.54    2.27\1.10    1.85\0.35
x31        6.36\1.56    8.27\0.49    7.50\0.93    7.36\0.58

Table 6.5: Feature ranking for car type II, Ñ = 9

Again the loss functions Lemb,1 and Lemb,3 are chosen for selection and those features are kept which are ranked to belong to the better half with both loss functions. For car type I the remaining

features are x10, x13, x19, x22, and x28. For car type II the remaining features are x4, x10, x13, x22, and x28. The choice of the five features identified as the most important by the SBRF is meaningful and indicates that the selection procedure retains features whose importance can be confirmed by domain knowledge. The classification results with these 5 features are presented in Tables 6.14 and 6.17. For car type I no degradation in the performance occurs when reducing the dimensionality of the input space from Ñ = 9 to Ñ = 5. For car type II, in the 5-dimensional input space the crash with γ = 3 that was misclassified with Ñ = 9 is now classified correctly. Nevertheless, the applied feature selection leads to more misclassifications for the category with γ = 1 than was the case with Ñ = 9. Compared with the original input space with Ñ = 33 the performance in the 5-dimensional space does not change remarkably, which justifies the large number of features that were removed at each selection step.

           μ̂rank\σ̂rank with
Feature    L            Lemb,1       Lemb,2       Lemb,3
x10        1.25\0.59    1.01\0.12    1.12\0.41    1.17\0.45
x13        2.01\0.60    4.22\0.78    3.53\0.83    2.87\0.33
x19        4.84\0.44    4.25\0.79    4.69\0.60    4.95\0.21
x22        3.76\0.72    1.98\0.12    3.06\1.08    1.95\0.48
x28        3.12\0.74    3.52\0.63    2.57\0.84    4.04\0.21

Table 6.6: Feature ranking for car type I, Ñ = 5

           μ̂rank\σ̂rank with
Feature    L            Lemb,1       Lemb,2       Lemb,3
x4         4.82\0.52    4.97\0.14    4.89\0.34    4.89\0.34
x10        1.56\0.75    2.19\0.47    1.86\0.76    2.96\0.28
x13        3.86\0.60    4.01\0.18    4.00\0.42    4.08\0.31
x22        2.22\0.93    1.04\0.20    1.66\0.78    1.24\0.47
x28        2.51\0.96    2.77\0.47    2.57\0.81    1.81\0.46

Table 6.7: Feature ranking for car type II, Ñ = 5

A further reduction of dimensionality to Ñ = 4 according to Tables 6.6 and 6.7 would lead to the strongest features x10, x13, x22 and x28 for both car types. With these features a worse performance than in the 5-dimensional space is obtained for both car types, indicating that the feature selection should be stopped. As already mentioned, the SBRF feature selection method does not suffer from masking effects. Therefore, when the dimensionality of the input space is small, methods to remove the redundancy from the features which contain relevant information for the classification task can be considered for further selection. For example the linear correlation coefficient of Pearson

\[
\rho_{\mathrm{Pearson}}(x_k, x_\ell) = \frac{\mathrm{E}_{x_k,x_\ell}\{x_k x_\ell\} - \mathrm{E}_{x_k}\{x_k\}\,\mathrm{E}_{x_\ell}\{x_\ell\}}{\sqrt{\mathrm{Var}\{x_k\}\,\mathrm{Var}\{x_\ell\}}} \tag{6.58}
\]

can be used as a measure for the redundancy between two features. For car type I the estimated Pearson correlation coefficients ρ̂Pearson are given in Table 6.8 and for car type II in Table 6.9. Hereby, all available data has been used, which might lead to a slightly increased variance of the final classification system if features are eliminated based on these estimates.

ρ̂Pearson   x10     x13     x19     x22     x28
x10        1.00    0.97    0.73    0.87    0.83
x13        0.97    1.00    0.74    0.90    0.87
x19        0.73    0.74    1.00    0.74    0.71
x22        0.87    0.90    0.74    1.00    0.98
x28        0.83    0.87    0.71    0.98    1.00

Table 6.8: Estimates of the Pearson correlation coefficients for car type I, Ñ = 5

ρ̂Pearson   x4      x10     x13     x22     x28
x4         1.00    0.72    0.76    0.95    0.96
x10        0.72    1.00    0.97    0.81    0.80
x13        0.78    0.97    1.00    0.84    0.84
x22        0.95    0.81    0.84    1.00    0.98
x28        0.96    0.80    0.84    0.98    1.00

Table 6.9: Estimates of the Pearson correlation coefficients for car type II, Ñ = 5

The highest correlation coefficients are obtained between x10 and x13, and between x22 and x28 for both car types. In order to decide which features can be further eliminated without deteriorating the classifier's performance one can choose among those features which exhibit high correlation coefficients and apply a wrapper method. Estimating the performance with the LOO procedure after removing a further feature shows that for car type I the best results are obtained with x10, x19, x22, x28, and for car type II with x4, x10, x13, x22, i. e., the features x13 and x28 are removed, respectively. This result shows that the Pearson correlation coefficient offers a good hint about which features could be removed during the last iterations

of the SBRF feature selection. The classification results with the above mentioned features for Ñ = 4 are shown in Tables 6.22 and 6.23. It can be seen that, compared to Ñ = 5, for car type I the performance gets slightly worse since an additional misclassification occurs for γimp = 1. For car type II the performance gets slightly better since a misclassification for γimp = 1 vanishes.

A further reduction to Ñ = 3 leads to a drop in the performance since crashes with γimp = 3 that were classified correctly when Ñ = 4 are misclassified for both car types. Therefore, one should stop the dimensionality reduction at Ñ = 5 or Ñ = 4.

The case with Ñ = 5 is taken into account for comparisons with other classification systems in Section 8.1 since only the reduction using the SBRF feature selection as described in Alg. 6.2 should be evaluated. Therefore, the five features x10, x13, x19, x22, x28 are chosen for car type I and x4, x10, x13, x22, x28 for car type II. To validate the robustness of the resulting classification systems the LOO estimates are also computed for the cases where the raw data in the scenarios Sm are scaled up and down by 15%. The results are shown in Tables 6.15 and 6.16 for car type I and in Tables 6.18 and 6.19 for car type II.

γimp    overestimated    correct    underestimated
1       30.7%            69.3%      0.0%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.10: Car type I, αscale = 1.00, Ñ = 33
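The redundancy check based on the Pearson coefficient in (6.58) can be illustrated with a short sketch. It is only an illustration under stated assumptions: `pearson` is a sample version of (6.58) in which expectations are replaced by arithmetic means, `most_correlated_pairs` is a hypothetical helper, and the matrix is copied from Table 6.8 (car type I). The sketch only proposes candidate pairs; which feature of a pair is actually removed is decided by the wrapper step with the LOO estimate, which is not reproduced here.

```python
import math


def pearson(xk, xl):
    """Sample version of (6.58): (E{xk*xl} - E{xk}E{xl}) / sqrt(Var{xk}Var{xl})."""
    n = len(xk)
    mean_k = sum(xk) / n
    mean_l = sum(xl) / n
    mean_kl = sum(a * b for a, b in zip(xk, xl)) / n
    var_k = sum((a - mean_k) ** 2 for a in xk) / n
    var_l = sum((b - mean_l) ** 2 for b in xl) / n
    return (mean_kl - mean_k * mean_l) / math.sqrt(var_k * var_l)


# Estimated coefficients copied from Table 6.8 (car type I).
FEATURES = ["x10", "x13", "x19", "x22", "x28"]
RHO = [
    [1.00, 0.97, 0.73, 0.87, 0.83],
    [0.97, 1.00, 0.74, 0.90, 0.87],
    [0.73, 0.74, 1.00, 0.74, 0.71],
    [0.87, 0.90, 0.74, 1.00, 0.98],
    [0.83, 0.87, 0.71, 0.98, 1.00],
]


def most_correlated_pairs(names, rho, count):
    """Return the `count` feature pairs with the largest off-diagonal coefficient."""
    pairs = [
        (rho[i][j], names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]
    pairs.sort(reverse=True)
    return [(a, b) for _, a, b in pairs[:count]]


candidates = most_correlated_pairs(FEATURES, RHO, count=2)
print(candidates)
```

The two pairs returned for the Table 6.8 values are the ones named in the text, (x22, x28) and (x10, x13).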



γimp    overestimated    correct    underestimated
1       23.7%            60.5%      15.8%
2       0.0%             100.0%     0.0%
3       0.0%             96.7%      3.3%

Table 6.11: Car type II, αscale = 1.00, Ñ = 33

γimp    overestimated    correct    underestimated
1       26.9%            69.3%      3.8%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.12: Car type I, αscale = 1.00, x̃ = [x1, x4, x10, x13, x19, x22, x25, x28, x31]^T



γimp    overestimated    correct    underestimated
1       21.1%            65.8%      13.1%
2       0.0%             100.0%     0.0%
3       0.0%             93.3%      6.7%

Table 6.13: Car type II, αscale = 1.00, x̃ = [x1, x4, x10, x13, x22, x23, x25, x28, x31]^T

γimp    overestimated    correct    underestimated
1       26.9%            69.3%      3.8%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.14: Car type I, αscale = 1.00, x̃ = [x10, x13, x19, x22, x28]^T



γimp    overestimated    correct    underestimated
1       15.38%           65.39%     19.23%
2       0.0%             76.47%     23.53%
3       0.0%             85.0%      15.0%

Table 6.15: Car type I, αscale = 0.85, x̃ = [x10, x13, x19, x22, x28]^T

γimp    overestimated    correct    underestimated
1       26.9%            73.1%      0.0%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.16: Car type I, αscale = 1.15, x̃ = [x10, x13, x19, x22, x28]^T



γimp    overestimated    correct    underestimated
1       23.7%            57.9%      18.4%
2       0.0%             100.0%     0.0%
3       0.0%             96.7%      3.3%

Table 6.17: Car type II, αscale = 1.00, x̃ = [x4, x10, x13, x22, x28]^T

γimp    overestimated    correct    underestimated
1       13.2%            57.9%      28.9%
2       0.0%             84.2%      15.8%
3       0.0%             80.0%      20.0%

Table 6.18: Car type II, αscale = 0.85, x̃ = [x4, x10, x13, x22, x28]^T



γimp    overestimated    correct    underestimated
1       34.2%            57.9%      7.9%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.19: Car type II, αscale = 1.15, x̃ = [x4, x10, x13, x22, x28]^T

γimp    overestimated    correct    underestimated
1       30.8%            65.4%      3.8%
2       0.0%             88.2%      11.8%
3       0.0%             95.0%      5.0%

Table 6.20: Car type I, αscale = 1.00, x̃[n] = [x10, x13, x22, x28]^T



γimp    overestimated    correct    underestimated
1       23.7%            55.3%      21.0%
2       0.0%             100.0%     0.0%
3       0.0%             93.3%      6.7%

Table 6.21: Car type II, αscale = 1.00, x̃[n] = [x10, x13, x22, x28]^T

γimp    overestimated    correct    underestimated
1       30.7%            65.5%      3.8%
2       0.0%             100.0%     0.0%
3       0.0%             100.0%     0.0%

Table 6.22: Car type I, αscale = 1.00, x̃[n] = [x10, x13, x19, x22]^T



γimp    overestimated    correct    underestimated
1       23.7%            60.5%      15.8%
2       0.0%             100.0%     0.0%
3       0.0%             96.7%      3.3%

Table 6.23: Car type II, αscale = 1.00, x̃[n] = [x4, x10, x22, x28]^T

Conclusion

After formulating, in the first part of this chapter, the on-line time series classification task using a suitable risk and loss function, the SBRF algorithm has been introduced in the second part as an extension of the RF classifier. The main differences are the bootstrap sampling on scenario level, the modified loss function and the oversampling that is introduced to take into account different levels of penalization. The third part of the chapter has treated the topic of feature selection, showing how the SBRF classifier can be used to accomplish this task. In the last part the SBRF algorithm has been applied to the car crash classification problem.


7. Segmentation and Labeling for On-Line Time Series Classification

The previous chapter has introduced a general method for on-line time series classification that produces a decision at every time stamp n, hereby implementing a change detection and classification system. Due to the complexity of the on-line time series classification problem the learning algorithms must be able to realize complex separating boundaries in order to obtain a low bias and they must also use a mechanism, e. g., aggregation, to obtain a low variance. These requirements lead to classification systems which in general are not interpretable. Moreover, the computational burden is high since a decision is required at each time stamp n. Since in on-line time series classification it is enough to compute the class labels at the time instances when a class-change occurs, the approaches described in this chapter rely on dividing the problem into a segmentation and a labeling step. This way, not only can the computational complexity be reduced but it is also possible to formulate the problem in such a way that traditional classification algorithms can be applied. This comes along with the option of using interpretable classifiers, which is important in safety-critical applications. The main idea is visualized in Fig. 7.1. A multivariate time series is segmented in such a way that the time instances when a class-change occurs are detected at low costs with high reliability, i. e., the true positive rate of the Segmentation Classifier must be high. Then, for each candidate time instance a more complex algorithm, the Labeling Classifier, is applied which determines the class label to which the current segment belongs.

[Figure 7.1: Segment and label approach. Blocks: Feature Generation 1, Segmentation Classifier, Feature Generation 2, Labeling Classifier; input S[0,n], output Label; the labeling path is activated by an enable signal.]

Interpreting the segmentation as a two-class classification task for every time stamp n, with the classes “0” (no change) and “1” (change), the above requirement involves a high rate of false alarms, i. e., often the beginning of a new segment and thus a class-change is wrongly signalized. A high rate of false alarms is also favored by the wish to use interpretable segmentation algorithms, e. g., the exceeding of thresholds or linear classifiers. In the subsequent step, for those time instances when the segmentation algorithm indicates a change, a more complex classifier with low false positive rate is triggered to label the current segment. A crucial benefit of this modular approach is the fact that the task which has to be fulfilled by the labeling classification system, consisting of the Feature Generation 2 and Labeling Classifier blocks in Fig. 7.1, is simplified compared to the classification method described in the previous chapter. The labeling classification system is not responsible for decisions at every time step n but only


for decisions at representative time instances. The labeling classification system does not have to identify these representative time instances since this task is solved by the segmentation classification system, consisting of the Feature Generation 1 and the Segmentation Classifier block in Fig. 7.1. By this division of the overall on-line classification problem it is possible to find input spaces of smaller dimensionality for each classification module compared to the input space of a single classifier having to take a decision at each time step. Hereby, the modular approach represents a way to combat the curse of dimensionality. One can think of the modular approach as a way to focus the classification task on those regions of the input space that are relevant for the considered application. This way the system consisting of segmentation and labeling classifiers can achieve both a high true positive rate and a small false positive rate. The next section deals with the topic of segmentation focusing on a high positive rate for detecting changes in the class label, whereas Section 7.2 presents two ways to implement the labeling stage in the classification system.
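The division of labor between the two stages can be illustrated with a minimal sketch. The names `run_segment_and_label`, `jump_detector` and `level_labeler` are hypothetical and not taken from this work; the threshold rule merely stands in for an interpretable Segmentation Classifier and the level test for an arbitrary, possibly much more complex Labeling Classifier. For simplicity the sketch passes the raw history S[0,n] to both stages instead of separate feature generation blocks.

```python
from typing import Callable, List, Optional, Sequence


def run_segment_and_label(
    series: Sequence[float],
    segmentation_classifier: Callable[[Sequence[float]], int],
    labeling_classifier: Callable[[Sequence[float]], str],
) -> List[Optional[str]]:
    """Return one entry per time stamp: a label where the labeling stage ran, else None."""
    outputs: List[Optional[str]] = []
    for n in range(len(series)):
        history = series[: n + 1]  # the scenario up to time stamp n
        if segmentation_classifier(history) == 1:
            # The costly stage is enabled only at candidate change instants.
            outputs.append(labeling_classifier(history))
        else:
            outputs.append(None)
    return outputs


def jump_detector(history: Sequence[float], threshold: float = 1.0) -> int:
    """Hypothetical cheap stage: signal a change if the last step exceeds a threshold."""
    if len(history) < 2:
        return 0
    return int(abs(history[-1] - history[-2]) > threshold)


def level_labeler(history: Sequence[float]) -> str:
    """Hypothetical labeling stage: assign a class from the current signal level."""
    return "high" if history[-1] > 2.5 else "low"


series = [0.0, 0.1, 0.0, 5.0, 5.1, 5.0, 0.2, 0.1]
labels = run_segment_and_label(series, jump_detector, level_labeler)
print(labels)
```

In this toy series the labeling stage is invoked at two of the eight time stamps, which is the source of the reduced computational burden described above.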

7.1 On-Line Segmentation of Time Series

In the technical literature segmentation of time series refers to the process of dividing a scenario S into a concatenation of subsequences in such a way that each subsequence is in some sense homogeneous. In most cases this homogeneity is represented by a linear approximation. In [KCHP03] Keogh defines a segmentation algorithm as an algorithm whose input is a time series and whose output is a piecewise linear representation of this time series. Then, each linear piece stands for a segment. In this work the following problem is denoted as on-line segmentation of time series: given the scenario S[0,n] up to time instance n determine whether the class label y[n] has changed with respect to previous time stamps. Thus, the segmentation is realized by the change detector

\[
\mathrm{cd}: \mathbb{R}^{N_z} \to \{0, 1\}, \qquad z[n] \mapsto d[n], \tag{7.1}
\]

with d[n] ∈ {0, 1}, and z[n] ∈ R^{N_z} being a feature vector that is extracted from S[0,n] in a similar way as x[n] in (4.10):

\[
\mathrm{fe}_z[n]: \mathbb{R}^{L \times (n+1)} \to \mathbb{R}^{N_z}, \qquad S^{[0,n]} \mapsto z[n]. \tag{7.2}
\]

If the class label y[n] has changed at time instance n, the mapping cd has to produce the output d[n] = 1, otherwise the output d[n] = 0. It should be emphasized that the segmentation step does not differentiate between the classes c1, . . . , cK, being a two-class classification problem. Thus, on-line time series segmentation is rather related to the topic of change detection in evolving data streams [Agg03, KBDG04, GY06] than to the classical segmentation of time series. A frequently used approach for change detection in evolving data streams makes the assumption that the distribution of the observed data is stable for each class label ck and that a class-change can be detected by identifying a change in the distribution of the data. In [KBDG04] a two-window paradigm is used for change detection, i. e., the distribution of the observed data in some “reference time window” is compared to the distribution in a current time window, which slides forward at each new time instance. Using statistical tests to decide


whether the two distributions are equal or not, a change is signalized if the distribution in the current window is significantly different from the one in the reference window. If a change is detected the reference window is updated. Another approach for change detection that uses estimates of the probability density functions in sliding windows is described in [KL01]. Here, at each time stamp the probability density function of the data in a window containing the most recent samples is estimated and the idea is to find a compact representation of the sequence of probability density functions in terms of a small set of prototype densities. Then, the beginning of a new segment is signalized at time instance n if the distance of the associated prototype to the preceding prototype exceeds a certain threshold. Hereby, the distance between probability density functions is measured by the Integrated Squared Error [Sil86], which can be calculated analytically if the density estimates are generated using a mixture of Gaussians. The approaches for on-line segmentation based on density estimation rely on two assumptions which cannot be guaranteed in many practical applications. Firstly, the data in each of the time windows must be stationary and secondly, enough data must be available to perform reliable multivariate density estimation. Even if the former assumption holds, the results presented in [Sil86] are rather discouraging since in practical applications the amount of data required to obtain good density estimates is usually not available.

[Figure 7.2: Change detector cd(1) and segmentation task. The plot shows z[n], the class labels y[n] of six segments from the classes c1, . . . , c4, and the change indicator d[n] ∈ {0, 1} over n; hatched areas mark the “do not care” intervals.]

The approach introduced in this thesis takes into account the fact that the segmentation step is only a part of a divide and conquer technique for on-line time series classification. The segmentation procedure will be explained by using Fig. 7.2. Firstly, at time stamp n the feature vector z[n] is computed from S[0,n] according to (7.2). Fig. 7.2 shows the temporal sequence of z[n] (which is chosen to be univariate in this example) and a sequence of six segments belonging to the classes c1, . . . , c4. Whereas it is the task of the on-line time series classification system to estimate y[n], the class for each time stamp, the change detector cd has to determine the sequence d[n] which indicates when a new segment begins. Between segments there exist “do not care” intervals which are represented by hatched areas in Fig. 7.2. As mentioned in Section 4.2 the “do not care” intervals are introduced in order to simplify the segmentation task and are justified by the fact that in most practical applications the data in the scenario S is continuous and thus, it is more adequate to introduce a small interval with no class label at the change from one segment to another than to fix the change to a single time stamp. The “do not care” intervals represent the transition time from one segment to another. The feature vectors z[n] to the left and to the right of each “do not care” interval are used for the construction of the change detector cd. In Fig. 7.2 the inputs


z[n] that belong to the class “0” lie in a bright grey area and the inputs z[n] that belong to the class “1” lie in a dark grey area. This way a training set for the construction of the classifier cd can be obtained. Each feature vector z[n] can be built with the entire information from S[0,n]. Thus, it is enough to consider only one feature vector z[n] before and one feature vector after each “do not care” interval for the training set that is used to compute the change detector cd. The training set obtained this way is balanced, which is advantageous for many learning algorithms. Given a training set

\[
\mathcal{D}_S = \{(S_1, y_1), \ldots, (S_M, y_M)\} \tag{7.3}
\]

and assuming that in the scenario S_m there are α_m class-changes one obtains

\[
M_{\mathrm{cd}} = 2 \sum_{m=1}^{M} \alpha_m \tag{7.4}
\]

examples in the training set used to compute cd. It is important to note that a fair estimate of the performance of the change detector cd can only be obtained if one takes into account the fact that the Mcd examples stem from M scenarios. Thus, methods aiming to estimate the performance of cd like V-fold cross validation, jackknife or bootstrap must sample on scenario level, i. e., all 2αm examples resulting from (Sm, ym) belong either to the training or to the test set. The role of the change detector in the classification system is to determine the time instances nc when a change occurs and to activate at these time stamps another classifier fL, whose task is to decide the class label of the scenario S at the instances nc. The inputs of fL are the vectors x[nc] ∈ R^N which are computed according to (4.10). Thus, the classification system is a hierarchical classifier for on-line time series. It can be interpreted as a decision tree in which the root node has access to the features stored in z[n] and whose child node has access to the features stored in x[n]. Unlike in the CART algorithm the decisions in the nodes are not realized by threshold classifiers working on single features but by more complex classifiers working on subsets of the available features. Therefore, the introduction of the change detector cd is motivated not only by a reduced complexity (the classifier fL(x[n]) is triggered only at the time instances nc) but also by the structure that comes along with the hierarchical system. Similarly to a decision tree where the nodes close to the root decide about the main partitioning of the input space, here the change detector implements the main temporal partitioning of a scenario S. This kind of organization of a temporal classifier breaks the overall optimization task into a number of simpler optimizations but it requires the development of the classifier's architecture.
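Returning to the construction of the training set for cd and to the requirement of sampling on scenario level, both steps can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (a list `z` of univariate features and a list `gaps` of inclusive “do not care” index intervals per scenario) and all function names are hypothetical, and it is assumed that the vector before an interval receives the target “0” and the vector after it the target “1”.

```python
from typing import Dict, Iterator, List, Tuple

Example = Tuple[int, float, int]  # (scenario index m, feature z[n], target d)


def build_cd_training_set(scenarios: List[Dict]) -> List[Example]:
    """One example before and one after each "do not care" interval."""
    examples: List[Example] = []
    for m, scenario in enumerate(scenarios):
        z = scenario["z"]
        for start, end in scenario["gaps"]:
            examples.append((m, z[start - 1], 0))  # last stamp before the interval
            examples.append((m, z[end + 1], 1))    # first stamp after the interval
    return examples


def leave_one_scenario_out(
    examples: List[Example],
) -> Iterator[Tuple[List[Example], List[Example]]]:
    """Yield (train, test) splits in which a whole scenario is held out at once."""
    scenario_ids = sorted({m for m, _, _ in examples})
    for held_out in scenario_ids:
        train = [e for e in examples if e[0] != held_out]
        test = [e for e in examples if e[0] == held_out]
        yield train, test


# Two toy scenarios with alpha_1 = 2 and alpha_2 = 1 class-changes.
scenarios = [
    {"z": [0.1, 0.2, 0.1, 0.5, 0.8, 1.0, 1.1, 0.6, 0.2, 0.1], "gaps": [(3, 4), (7, 7)]},
    {"z": [1.0, 1.1, 0.7, 0.4, 0.1, 0.2], "gaps": [(2, 3)]},
]
examples = build_cd_training_set(scenarios)
splits = list(leave_one_scenario_out(examples))
print(len(examples), len(splits))
```

The number of examples equals 2(α1 + α2) as in (7.4), the two targets are balanced, and no scenario contributes examples to both sides of a split.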
For example, instead of using a single change detector for all class-changes one can use several change detectors $cd^{(q)}$, each trained to identify with high accuracy only a subset of the possible class-changes. This is the approach that will be presented in more detail in this chapter. The concept of modularization in classifier design can lead to an improved performance by reducing both bias and variance of the overall classification system. Firstly, smaller classifiers have fewer parameters, resulting in a low variance. Secondly, due to the simplified tasks of each component classifier it is possible to achieve a small bias with relatively simple classifiers. Assuming a decomposition of a classification task into $Q$ modules, this decomposition leads


to an improved performance if

$$R(f) > R(f_M), \tag{7.5}$$

with

$$R(f_M) \le \sum_{q=1}^{Q} R(f^{(q)}), \tag{7.6}$$

where $R(f)$ is the risk functional of a single classifier $f$ that is responsible for the overall classification task, $R(f_M)$ the risk of the modularized classifier $f_M$, and $R(f^{(q)})$ the risk of the $q$-th module in $f_M$. As already mentioned, the bias and variance of each component classifier can be kept small by a proper decomposition of the classification task, leading to small values of $R(f^{(q)})$. Thus, the inequality in (7.5) can be fulfilled if the decomposition into the $Q$ modules is an adequate representation of the underlying process generating the data. According to [Sch96] the success of modularization approaches lies in the fact that imperfections concerning estimation and approximation can be withheld from becoming effective. This is due to the fact that by decomposing a complex classification problem into smaller subtasks one normally integrates a priori knowledge about the data in the classification system and excludes regions of the input space which are irrelevant for the problem at hand, knowledge that otherwise must be learned. A frequently used approach is the decomposition of the problem according to its input space, i.e., the classification system consists of a number of modules that are specialized for different parts of the input space [JY93]. Another possibility to divide a classification problem into subtasks is based on a decomposition of the output space [KGC02]. In addition to the improved performance that can be achieved by modular classifiers, a main benefit of these techniques is the resulting interpretability. Although normally there is a tradeoff between a low generalization error and interpretability, due to the classifier's structure it is possible to improve both of them by modularization. For example, by decomposing the output space such that a module is responsible only for a subset of the classes, feature selection for this subtask helps in understanding which features are relevant for the chosen subset of classes.
This way it is possible to divide the output space in such a way that each module uses simple and interpretable feature spaces. The approach presented in this chapter to solve on-line time series classification tasks not only divides the overall problem into a segmentation and a labeling step but also introduces an additional modularization step, based on the decomposition of the output space, in order to improve both the performance and the interpretability. The mapping $cd$ that has to detect the changes between arbitrary classes is replaced by $Q$ specialized change detectors

$$cd^{(q)}: \mathbb{R}^{N_{z^{(q)}}} \to \{0, 1\}, \quad z^{(q)}[n] \mapsto d^{(q)}[n], \tag{7.7}$$

with $q = 1, \dots, Q$, $d^{(q)}[n] \in \{0, 1\}$, and $z^{(q)}[n] \in \mathbb{R}^{N_{z^{(q)}}}$. Hereby, each change detector $cd^{(q)}$ has the task to identify with high true positive rate the class-changes from a subset, $\Omega_c^{(q)}$, of the possible class-changes $\Omega_c$:

$$\Omega_c^{(q)} \subset \Omega_c = \{c_1 \to c_2, c_1 \to c_3, \dots, c_{K-1} \to c_K\}. \tag{7.8}$$

For example, $cd^{(q)}$ must identify all class-changes from $c_{k_1}$ to $c_{k_2}$ and from $c_{k_3}$ to $c_{k_4}$, i.e., $\Omega_c^{(q)} = \{c_{k_1} \to c_{k_2}, c_{k_3} \to c_{k_4}\}$, but no requirements are set to the classification of other


class-changes. In the example from Fig. 7.2 a change detector $cd^{(1)}$ could be used to identify the class-changes from $c_1$ to $c_2$ and from $c_3$ to $c_4$, then another change detector $cd^{(2)}$ could be used to find the changes from $c_2$ to $c_3$ and so on. On the left side of Fig. 7.2 the training features, i.e., all values $z[n]$ in the bright and dark grey areas, are shown, and it can be seen that a linear classifier $cd^{(1)}$ (in this case a threshold classifier) can be used to detect the changes from $c_1$ to $c_2$ and from $c_3$ to $c_4$. The corresponding training set for $cd^{(1)}$ consists of the values $z[n]$ which are marked by black triangles for the class “0” and black rectangles for the class “1”. The splitting of $cd$ into $Q$ classifiers is motivated by the fact that a subset of the class-changes can be detected by a relatively simple and interpretable classifier $cd^{(q)}$ with high true positive rate. This can be achieved by finding the features $z^{(q)}[n]$ in such a way that classifiers with low variance, e.g., linear classifiers, can achieve a low bias in the task of discriminating between the inputs $z^{(q)}[n]$ before and after the “do not care” interval. If this is not possible, either domain knowledge should be used to define more adequate features or the detection of a single class-change must be implemented by two or more classifiers $cd^{(q)}$. This way it is possible to implement a modular segmentation system for which (7.5) holds. Fig. 7.3 presents the classification system that is proposed in this chapter for on-line time series classification. The classification system consists of the $Q$ change detectors $cd^{(q)}$ and the associated labeling classifiers $f_L^{(q)}$. The change detectors $cd^{(q)}$ compute, based on the inputs $z^{(q)}[n]$, at each time stamp $n$ the outputs $d^{(q)}[n]$. In the following

$$\Omega^{(q)} \subset \Omega = \{c_1, \dots, c_K\} \tag{7.9}$$

denotes that subset of $\Omega$ which contains the classes to which a change must be detected by $cd^{(q)}$. In the example from above the set of class-changes $\Omega_c^{(q)} = \{c_{k_1} \to c_{k_2}, c_{k_3} \to c_{k_4}\}$ leads to $\Omega^{(q)} = \{c_{k_2}, c_{k_4}\}$. The cardinality of $\Omega^{(q)}$ is $|\Omega^{(q)}|$.

[Figure 7.3: Classification system with segmentation step. Each change detector $cd^{(q)}$, $q = 1, \dots, Q$, receives $z^{(q)}[n]$ and enables the labeling classifier $f_L^{(q)}$, which receives $x^{(q)}[n]$ and either rejects or passes its soft output $\hat{y}_s^{(q)}[n]$ to the Decision Logic, which outputs $\hat{y}[n]$.]

The detection of a possible beginning of a new segment signaled by $cd^{(q)}$, i.e., $d^{(q)}[n] = 1$, enables the subsequent labeling classifier $f_L^{(q)}$ that has the task to determine, based on the input $x^{(q)}[n] \in \mathbb{R}^{N_{x^{(q)}}}$ (which may contain local, global, event-based or temporal distance-based features), if at time instance $n$ the scenario $S$ changes to one of the classes included in $\Omega^{(q)}$ and if this is the case, to which class. The design of the classifiers $f_L^{(q)}$ is discussed in Section 7.2. If two or more classifiers $f_L^{(q)}$ are enabled at the same time and they do not reject the decisions for the classes they were trained for, the soft values $\hat{y}_s^{(q)}[n]$ of their outputs are compared and the class with


highest estimated posterior probability is chosen by the Decision Logic as output of the time series classification system at time stamp $n$. The class label $\hat{y}[n]$ is kept as output of the on-line time series classification system until the beginning of a new segment is identified by one of the $Q$ modules.

Analogously to (7.4) the number of training examples $M_{cd}^{(q)}$ used for the change detector $cd^{(q)}$ results from the number $\alpha_m^{(q)}$ of class-changes from the set $\Omega_c^{(q)}$ in the $m$-th scenario:

$$M_{cd}^{(q)} = 2 \sum_{m=1}^{M} \alpha_m^{(q)}. \tag{7.10}$$
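The behavior of the Decision Logic described above (keep the current label until a module is triggered, then pick the class with the largest soft value among the enabled, non-rejecting labeling classifiers) can be sketched as follows. The tuple layout of `module_outputs`, the string labels and the name `decision_logic` are assumptions chosen for this illustration, not part of the original system description.

```python
REJECT = "c0"  # label used here for the reject decision of a labeling classifier

def decision_logic(previous_label, module_outputs):
    """module_outputs: one (enabled, hard_decision, soft_values) tuple per
    module, where soft_values maps class label -> estimated posterior."""
    best_label, best_score = previous_label, float("-inf")
    for enabled, hard_decision, soft_values in module_outputs:
        if not enabled or hard_decision == REJECT:
            continue  # module not triggered, or it rejects its decision
        label, score = max(soft_values.items(), key=lambda item: item[1])
        if score > best_score:
            best_label, best_score = label, score
    return best_label  # unchanged if no module delivers a decision

outputs = [
    (True, 0, {"c2": 0.7, "c4": 0.1}),
    (True, 0, {"c3": 0.9}),
    (True, REJECT, {"c5": 0.99}),   # ignored: the module rejects
    (False, 0, {"c6": 0.95}),       # ignored: the module is not enabled
]
label = decision_logic("c1", outputs)
```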

Fair estimates of the performance of the change detectors $cd^{(q)}$ can only be obtained by taking into account that the $\alpha_m^{(q)}$ class-changes stem from a single scenario. Thus, all $2\alpha_m^{(q)}$ training examples from the $m$-th scenario must belong either to the training set or to the test set. In this work it is assumed that a classifier $cd^{(q)}$ can detect at least one class-change, e.g., $c_{k_1}$ to $c_{k_2}$, with high accuracy. Then, in the worst case, i.e., one change detector for each possible class-change, the maximal number of detectors is $Q_{max} = K(K-1)$, where $K$ denotes the number of classes. In practical applications the number of change detectors is much smaller than $Q_{max}$ since some classifiers $cd^{(q)}$ can be designed to detect more than just one class-change and because some of the class-changes might never occur in the considered time series classification problem. The approach adopted in this work to keep $Q$ small is a bottom-up approach that will be presented in the next subsection.

7.1.1 Reduction of the Number of Change Detectors

A reduction of the number of change detectors is not only desirable from a computational point of view but it can also decrease the variance of the classification system. This is due to the fact that the training set of a change detector that combines the tasks of two or more change detectors is larger than the training set of each individual classifier. The algorithm that is introduced in the following is related to the agglomerative hierarchical clustering technique [DHS01] and is presented in Alg. 7.1. Similar to the notation in [KGC02], where so-called meta-classes are introduced, here meta-class-changes

$$cc^{(q')} \subset \Omega_c, \tag{7.11}$$

which represent disjoint subsets of $\Omega_c$, are used. Here, $\Omega_c$ denotes those class-changes that can occur in the considered task. Starting with $Q = |\Omega_c|$ segmentation classifiers, each specialized for only one of the class-changes, the bottom-up approach reduces the number of change detectors iteratively by one until a stopping criterion is reached. Such a stopping criterion is the reduction to a predefined number of change detectors or, as applied in this work, the exceeding of a predefined limit by the estimated cost in Line 8 of Alg. 7.1. After a merger, the task of two detectors $cd^{(q_1)}$ and $cd^{(q_2)}$ is taken over by a single detector $cd^{(q_{1,2})}$. As shown in Fig. 7.3 the $Q$ change detectors may have different input spaces, which leads to difficulties if one tries to merge two change detectors, as required in Lines 5 or 10 of Alg. 7.1. A possible solution when merging $cd^{(q_1)}$ and $cd^{(q_2)}$ can be realized by creating in a first step the input space for $cd^{(q_{1,2})}$ as the union of the two input spaces of $cd^{(q_1)}$ and $cd^{(q_2)}$. Then, in a second step feature selection is performed. Hereby, a wrapper method using a


classifier of the same type as $cd^{(q_1)}$ and $cd^{(q_2)}$, e.g., a linear classifier, should be used, as described in Subsection 2.3.2, in order to find the final input space of $cd^{(q_{1,2})}$.

Algorithm 7.1 Reduction of the number of change detectors
 1: set $Q = |\Omega_c|$, the cardinality of $\Omega_c$
 2: each element in $\Omega_c$ is assigned to a meta-class-change $cc^{(q')}$, $q' = 1, \dots, Q$
 3: while stopping criterion not reached do
 4:   for all combinations of two meta-class-changes $\{cc^{(q_1)}, cc^{(q_2)}\}$, $q_1, q_2 = 1, \dots, Q$ do
 5:     compute a change detector $cd^{(q_{1,2})}$ for the current combination of meta-class-changes
 6:     estimate the costs of detecting $cc^{(q_1)}$ and $cc^{(q_2)}$ with $cd^{(q_{1,2})}$
 7:   end for
 8:   find that pair of meta-class-changes whose merging leads to minimal costs
 9:   if stopping criterion is not reached then
10:     merge the pair of meta-class-changes and store the corresponding change detector
11:     set $Q = Q - 1$
12:   end if
13: end while

In the car crash classification task the special case will be considered where all change detectors use the same input space $\mathbb{R}^{N_z}$. This makes the merging of two classifiers easier but it requires domain knowledge in generating an input space that is suitable for the detection of all class-changes. If all classifiers $cd^{(q)}$ use the same input space, the resulting segmentation algorithm represents an overlapping partitioning of $\mathbb{R}^{N_z}$ so that every time the border of a partition associated to $cd^{(q)}$ is crossed the corresponding classifier $f_L^{(q)}$ is triggered. The computational complexity associated with Alg. 7.1 is high since at each iteration a number of $Q(Q-1)/2$ change detectors and the costs associated to using them have to be computed. Since the number of possible class-changes is normally not very large and Alg. 7.1 is applied off-line, the presented approach to reduce the number of change detectors is appropriate for practical applications.

7.1.2 Linear Change Detectors

The task of a change detector $cd^{(q)}$ can be solved by a two-class classifier

$$f_{cd}^{(q)}: \mathbb{R}^{N_{z^{(q)}}} \to \{-1, 1\}, \quad z^{(q)}[n] \mapsto \hat{y}_{cd}^{(q)}[n] \tag{7.12}$$

if the change detector outputs

$$d^{(q)}[n] = \begin{cases} 1 & \text{if } \hat{y}_{cd}^{(q)}[n-1] = \mp 1 \text{ and } \hat{y}_{cd}^{(q)}[n] = \pm 1 \\ 0 & \text{otherwise.} \end{cases} \tag{7.13}$$
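As a small illustration of (7.13), the following Python sketch turns a sequence of two-class outputs in $\{-1, 1\}$ into change flags. The function name `change_flags` and the convention that no change is flagged at the very first sample (where no predecessor exists) are assumptions of this sketch.

```python
def change_flags(y_hat):
    """Return d[n] per (7.13): 1 where the two-class output flips its sign
    between n-1 and n (in either direction), 0 otherwise."""
    if not y_hat:
        return []
    flags = [0]  # assumption: no predecessor at the first sample, so no change
    for previous, current in zip(y_hat, y_hat[1:]):
        flags.append(1 if previous * current == -1 else 0)
    return flags

y_hat = [-1, -1, 1, 1, -1]
print(change_flags(y_hat))  # [0, 0, 1, 0, 1]
```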

Each $cd^{(q)}$ is responsible for detecting only a subset of the possible class-changes in $\Omega_c$ so that the Bayes classifier for the task that has to be solved by $f_{cd}^{(q)}$ can be approximated by a linear classifier while still obtaining a low bias. On the other hand, linear classifiers are not greatly affected by slight differences in the training set, leading to a small variance. Thus, linear models represent robust methods to implement the change detectors.


This subsection focuses on the implementation of the change detectors by linear classifiers $f_{cd}^{(q)}$, i.e.,

$$\hat{y}_{cd}^{(q)}[n] = \mathrm{sign}\left(a^{(q),T} z^{(q)}[n] + a_0^{(q)}\right), \tag{7.14}$$

where $\mathrm{sign}(\cdot)$ denotes the signum function, $a^{(q),T} = [a_1^{(q)}, \dots, a_{N_{z^{(q)}}}^{(q)}] \in \mathbb{R}^{N_{z^{(q)}}}$ and $a_0^{(q)} \in \mathbb{R}$. By introducing the augmented input vector

$$\tilde{z}^{(q)}[n] = [1, z^{(q)}[n]^T]^T \tag{7.15}$$

and the augmented weight vector

$$\tilde{a}^{(q)} = [a_0^{(q)}, a^{(q),T}]^T, \tag{7.16}$$

the output $\hat{y}_{cd}^{(q)}[n]$ can also be expressed as

$$\hat{y}_{cd}^{(q)}[n] = \mathrm{sign}\left(\tilde{a}^{(q),T} \tilde{z}^{(q)}[n]\right). \tag{7.17}$$

The training set $D_{cd}^{(q)}$ used to compute $\tilde{a}^{(q)}$ consists of $M_{cd}^{(q)}$ examples [see (7.10)], where inputs $z^{(q)}[n]$ before the “do not care” intervals are associated to the class $y_{cd}^{(q)}[n] = -1$ and inputs after the “do not care” intervals to the class $y_{cd}^{(q)}[n] = +1$. It should be mentioned that with an increasing dimensionality of the input space the likelihood of linear separability of $M_{cd}^{(q)}$ points increases rapidly. As shown in [Cov65] the number of linearly separable dichotomies of $M_{cd}^{(q)}$ points in general position in an $N_{z^{(q)}}$-dimensional Euclidean space is

$$G(M_{cd}^{(q)}, N_{z^{(q)}}) = 2 \sum_{k=0}^{N_{z^{(q)}} - 1} \binom{M_{cd}^{(q)} - 1}{k}. \tag{7.18}$$
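To get a feeling for how quickly separability grows with the dimension, the count in (7.18) can be evaluated directly and related to the total number $2^M$ of dichotomies of $M$ points. The sketch below is only illustrative; the function names are chosen here and are not taken from the source.

```python
from math import comb

def separable_dichotomies(num_points, dim):
    """Cover's count G(M, N) from (7.18) for M points in general position."""
    return 2 * sum(comb(num_points - 1, k) for k in range(dim))

def separable_fraction(num_points, dim):
    """Share of all 2**M dichotomies that are linearly separable."""
    return separable_dichotomies(num_points, dim) / 2 ** num_points

for dim in (2, 5, 10):
    print(dim, separable_fraction(10, dim))
```

For 10 points the fraction is 20/1024 in two dimensions, exactly one half in five dimensions, and one in ten dimensions, which illustrates why starting in a high-dimensional input space makes linear separability likely.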

Hereby, a set of $M_{cd}^{(q)}$ vectors in an $N_{z^{(q)}}$-dimensional Euclidean space is said to be in general position if every subset of $N_{z^{(q)}}$ or fewer vectors is linearly independent. Even if the requirement related to the general position is not fulfilled, a higher dimensionality leads to a better linear separability. Therefore, one can start with a high-dimensional input space, where the training data is linearly separable, for each change detector and reduce the dimensionality by feature selection to a size $N_{z^{(q)}}$ where the performance of the linear classifier is still good, in order to combat the curse of dimensionality and overfitting. Wrapper methods which use linear methods as classifiers are suitable for feature selection in this context. There are a variety of linear methods for classification. Some stem from classical statistics¹, like Linear Discriminant Analysis or Logistic Regression [HTF01], others from the machine learning community², like the Perceptron [Bis95] or AdaTron [FCC98] algorithm, and others are rather related to numerics since the classification task is reformulated as finding a solution to a set of linear equations [DHS01].

¹ Here discriminant functions are modeled for each class and a query input is assigned to the class with the largest value of its discriminant function.
² Here the boundaries between the classes in the input space are modeled to be linear and one is explicitly looking for separating hyperplanes.

In Appendix A.10 the AdaTron algorithm is reviewed


since it is an optimal algorithm if the training data is linearly separable, i.e., if there exists a vector $\tilde{a}^{(q)}$ where

$$y_{cd}^{(q)}[n] \, \tilde{a}^{(q),T} \tilde{z}^{(q)}[n] \ge 0 \tag{7.19}$$

holds for all patterns $(\tilde{z}_k^{(q)}[n], y_{cd,k}^{(q)}[n])$, $k = 1, \dots, M_{cd}^{(q)}$, in the set $D_{cd}^{(q)}$. In the following the minimum squared-error procedure will be presented because it can also be applied to nonseparable cases and because its computational complexity is manageable, which is important for the use in Alg. 7.1.

The minimum squared-error procedure replaces the problem of finding a solution $\tilde{a}^{(q)}$ to the set of linear inequalities from (7.19) by the problem of finding the solution to a set of linear equations. Here the task is to find $\tilde{a}^{(q)}$ satisfying

$$\tilde{Z}^{(q)} \tilde{a}^{(q)} = Y_{cd}^{(q)}, \tag{7.20}$$

with

$$\tilde{Z}^{(q)} = \left[\tilde{z}_1^{(q)}[n], \dots, \tilde{z}_{M_{cd}^{(q)}}^{(q)}[n]\right]^T \in \mathbb{R}^{M_{cd}^{(q)} \times (N_{z^{(q)}} + 1)} \tag{7.21}$$

and

$$Y_{cd}^{(q)} = \left[y_{cd,1}^{(q)}[n], \dots, y_{cd,M_{cd}^{(q)}}^{(q)}[n]\right]^T \in \{-1, 1\}^{M_{cd}^{(q)}}. \tag{7.22}$$

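In general there are more equations than unknowns in (7.20) so that the system is overdetermined and no exact solution for $\tilde{a}^{(q)}$ exists. Therefore, one finds a suitable solution for $\tilde{a}^{(q)}$ by minimizing the sum-of-squared-error criterion

$$\tilde{a}^{(q)} = \underset{\tilde{\alpha}^{(q)}}{\mathrm{argmin}} \left\| \tilde{Z}^{(q)} \tilde{\alpha}^{(q)} - Y_{cd}^{(q)} \right\|_2^2 = \underset{\tilde{\alpha}^{(q)}}{\mathrm{argmin}} \left\{ \sum_{k=1}^{M_{cd}^{(q)}} \left( \tilde{z}_k^{(q),T}[n] \, \tilde{\alpha}^{(q)} - y_{cd,k}^{(q)}[n] \right)^2 \right\}. \tag{7.23}$$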

As can be easily verified by computing the gradient of the minimization criterion and setting it to zero, the solution is

$$\tilde{a}^{(q)} = \left(\tilde{Z}^{(q),T} \tilde{Z}^{(q)}\right)^{-1} \tilde{Z}^{(q),T} Y_{cd}^{(q)}. \tag{7.24}$$
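The minimum squared-error solution can be sketched numerically as follows. The sketch assumes NumPy is available and uses `numpy.linalg.lstsq`, which minimizes the criterion in (7.23) and coincides with the closed form (7.24) when the augmented data matrix has full column rank. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_mse_detector(z, y):
    """z: (M, N) feature matrix, y: (M,) targets in {-1, +1}.
    Returns the augmented weight vector [a_0, a_1, ..., a_N]."""
    z_aug = np.hstack([np.ones((z.shape[0], 1)), z])  # rows are augmented inputs, cf. (7.15)
    a_aug, *_ = np.linalg.lstsq(z_aug, y, rcond=None)  # least-squares solution of (7.20)
    return a_aug

def predict(a_aug, z):
    """Linear two-class decision as in (7.17); sign(0) is mapped to +1 here."""
    z_aug = np.hstack([np.ones((z.shape[0], 1)), z])
    return np.where(z_aug @ a_aug >= 0, 1, -1)

# Toy data: inputs before the change (class -1) and after the change (class +1).
z = np.array([[1, 1], [2, 1], [1, 2], [6, 7], [7, 6], [8, 8]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
a_aug = fit_mse_detector(z, y)
print(predict(a_aug, z))
```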

The mean-squared-error criterion puts emphasis on points where $p_{z^{(q)}}(z^{(q)})$ is large, rather than on points that are near the Bayes decision surface. As a consequence, the minimum-squared-error solution does not necessarily lead to a separating hyperplane if one exists. Despite this property, it is an attractive solution since it always exists and one obtains a useful classifier in both the separable and the nonseparable case. The mean-squared-error procedure can be extended if some a priori knowledge about the data is available. For example, if monotonicity of the output with respect to all features of the input is given, i.e.,

$$\text{if } \tilde{\alpha}^{(q),T} \tilde{z}_k^{(q)}[n] = \delta_k, \text{ then } \tilde{\alpha}^{(q),T} \tilde{z}_k^{(q),\epsilon}[n] \ge \delta_k, \tag{7.25}$$

with $\tilde{z}_k^{(q),\epsilon}[n] = \tilde{z}_k^{(q)}[n] + \epsilon e_i$, $\epsilon > 0$ and arbitrary $i \in \{1, \dots, N_{z^{(q)}}\}$, one can include this domain knowledge in the optimization problem:

$$\tilde{a}^{(q)} = \underset{\tilde{\alpha}^{(q)}}{\mathrm{argmin}} \left\| \tilde{Z}^{(q)} \tilde{\alpha}^{(q)} - Y_{cd}^{(q)} \right\|_2^2 \quad \text{s.t.} \quad a_1^{(q)} \ge 0, \dots, a_{N_{z^{(q)}}}^{(q)} \ge 0. \tag{7.26}$$

Since the objective is quadratic and the constraints are linear, this is a simple quadratic programming task that can be solved by standard software.
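For a low-dimensional input space, such as the two-dimensional one used later in the car crash application, the constrained problem (7.26) can also be solved by a simple exhaustive search, which is swapped in here for a general-purpose solver: every subset of features is fitted by ordinary least squares with the remaining weights fixed to zero, and the best fit whose feature weights are all non-negative is kept. This sketch assumes NumPy, full column rank of each fitted submatrix, and a small number of features (the search grows as $2^N$); the function name and the toy data are hypothetical.

```python
from itertools import combinations
import numpy as np

def fit_monotone_detector(z, y, tol=1e-12):
    """Least squares with a free bias a_0 and feature weights a_i >= 0,
    found by enumerating all feature subsets (only sensible for small N)."""
    m, n = z.shape
    ones = np.ones((m, 1))
    best_a, best_cost = None, np.inf
    for size in range(n + 1):
        for support in combinations(range(n), size):
            cols = list(support)
            z_aug = np.hstack([ones, z[:, cols]])
            coef, *_ = np.linalg.lstsq(z_aug, y, rcond=None)
            if np.any(coef[1:] < -tol):
                continue  # violates the non-negativity constraints of (7.26)
            cost = np.sum((z_aug @ coef - y) ** 2)
            if cost < best_cost:
                a = np.zeros(n + 1)
                a[0] = coef[0]
                a[1 + np.array(cols, dtype=int)] = coef[1:]
                best_a, best_cost = a, cost
    return best_a

# Toy data for which the unconstrained fit gives a negative weight for feature 2.
z = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, -1.0, -1.0])
a_constrained = fit_monotone_detector(z, y)
print(a_constrained)
```

In this toy example the unconstrained least-squares weights are $[-0.5, 1, -1]$, so the constraint is active and the second feature weight is set to zero.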


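7.1.3 Segmentation for Car Crash Classification

The classification system for the car crash application described in Chapter 5 is shown in Fig. 7.4. Compared to the general system depicted in Fig. 7.3, here the change detectors use the same inputs $z[n] \in \mathbb{R}^{N_z}$. The way the segmentation unit works is described in the following.

[Figure 7.4: Classification system for car crash detection. The input $S[n]$ passes a low-pass filter (LP) yielding $U[n]$; the feature generator $fg_z$ computes $z[n]$ for the change detectors $cd^{(1)}, \dots, cd^{(Q)}$ of the Segmentation Unit, which enable the labeling classifiers $f_L^{(1)}, \dots, f_L^{(Q)}$ working on $x[n]$ from $fg_x$; their soft outputs $\hat{y}_s^{(q)}[n]$ or rejects enter the Decision Logic, which outputs $\hat{y}[n]$. A further classifier $f_L^{(Q+1)}$ with its own feature generator $fg_x^{(Q+1)}$ outputs $\hat{y}^{(Q+1)}[n]$.]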

In a first step, the reliable ZAEX-signal enters a comparator and if its amplitude exceeds the threshold $\tau_{ZAEX,1}$,

$$|s_{ZAEX}[n]| > \tau_{ZAEX,1}, \tag{7.27}$$

starting with this time instance the multivariate time series $S$ is fed into a low-pass filter leading to the multivariate time series $U = [u_{ECSL}, u_{ECSR}, u_{ZAEX}, u_{ZAEY}]^T \in \mathbb{R}^{4 \times n_{end}}$. The low-pass filter is applied to each univariate time series in $S$. It is an all-ones FIR filter with 8 taps, scaled by the factor 1/8. Then, if the amplitude of the low-pass filtered ZAEX-signal exceeds a second threshold $\tau_{ZAEX,2}$,

$$|u_{ZAEX}[n]| > \tau_{ZAEX,2}, \tag{7.28}$$

this time instance is set to the origin of the time axis n0 = n and the feature generation fgz starts. The thresholds τZAEX,1 and τZAEX,2 result from the training set D. Firstly, the low-pass filtered ZAEX-signals from all scenarios Sm that require the deployment of safety systems are computed up to the time instance when their “do not care” interval ends. From these signals the maximal value in each scenario is stored. The value τZAEX,2 is chosen to be the minimum among the stored values times a scaling factor, which is set to 0.4. Finally, the first threshold is set to τZAEX,1 = 0.2τZAEX,2 . The length of the filter and the scaling factors 0.4 and 0.2 are chosen empirically for this application. The filter is chosen to be a scaled all-ones filter so that its output can be interpreted as velocity reduction and the length 8 takes into account that for some crashes the “do not care” intervals end after 8 ms. The factor 0.4 is motivated by the fact that there might be crashes that require the deployment of safety systems whose low-passed ZAEX-amplitude is smaller than the minimum of the stored maximal values from the training set. Finally, the factor 0.2 leads to a small value of the threshold τZAEX,1 , which is necessary in order to start with the low-pass filtering at an early stage of every crash.
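The filter and the threshold rule described above can be sketched in a few lines of Python. The sketch makes several assumptions that are not stated in the source: the filter starts with a zero initial state at the first recorded sample, the "maximal value" is taken over absolute amplitudes, and each training scenario is given as a pair of the raw ZAEX signal and the last sample index of its “do not care” interval (a hypothetical data layout).

```python
def lowpass(signal, taps=8):
    """Causal all-ones FIR filter with `taps` taps, scaled by 1/taps
    (samples before the start of the signal are treated as zero)."""
    out = []
    for n in range(len(signal)):
        window = signal[max(0, n - taps + 1): n + 1]
        out.append(sum(window) / taps)
    return out

def zaex_thresholds(deploy_scenarios, scale_2=0.4, scale_1=0.2):
    """deploy_scenarios: list of (s_zaex, dnc_end) for scenarios that require
    deployment; dnc_end is the last index of the "do not care" interval."""
    peaks = []
    for s_zaex, dnc_end in deploy_scenarios:
        u = lowpass(s_zaex)[: dnc_end + 1]
        peaks.append(max(abs(v) for v in u))
    tau_2 = scale_2 * min(peaks)   # second threshold, factor 0.4
    tau_1 = scale_1 * tau_2        # first threshold, factor 0.2
    return tau_1, tau_2

scenarios = [([8.0] * 10, 3), ([16.0] * 10, 2)]  # toy signals, not crash data
tau_1, tau_2 = zaex_thresholds(scenarios)
```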


The feature generation step starts at $n_0$, the time instance when $|u_{ZAEX}[n]| > \tau_{ZAEX,2}$ for the first time. The entries in $z[n]$ are computed as

$$z_1[n] = \sum_{n'=n_0}^{n} |u_{ZAEX}[n']|, \qquad z_2[n] = \sum_{n'=n_0}^{n} \left( |u_{ECSL}[n']| + |u_{ECSR}[n']| \right), \tag{7.29}$$

which leads to a two-dimensional input space that can be visualized. Due to the positive and monotonically increasing values $z_1[n]$ and $z_2[n]$, linear classifiers with $a_1^{(q)} > 0$, $a_2^{(q)} > 0$ and $a_0^{(q)} < 0$ can be used to identify the time instances when new segments begin. In the given application, a time series consists at most of two segments: a first segment, where no safety system should be deployed, and then a segment where the deployment of safety systems is required. As mentioned in Section 5.1 there are 16 groups of different crash configurations, where 13 groups require the deployment of safety systems. Thus, the initial cardinality of $\Omega_c$ is set to $Q = 13$ since for each of the groups that require the deployment of safety systems, a distinct class-change is assumed. As an example, Fig. 7.5 shows the data for the training set of the change detectors from the 0°, 27 km/h group and Fig. 7.6 the data from the 0°, 56 km/h group in the $(z_1, z_2)$-plane for car type II. Every crash is represented by two points in the figures, a circle for the time instance before the “do not care” interval begins and a cross for the time instance after the “do not care” interval ends. It can be seen that two linear classifiers, one for each group, can easily detect the time instances when a class-change occurs. But it is not possible to merge the two groups and detect the class-change with a single linear classifier. Thus, in the final classification system the two groups will be segmented by different change detectors in order to assure the deployment of safety systems at the right time. In Fig. 7.6 the data is closer to the origin than in Fig. 7.5 since the safety systems in a 56 km/h crash must be deployed earlier than in a 27 km/h crash.

[Figure 7.5: Car type II, 27 km/h. Training data of the change detectors in the $(z_1, z_2)$-plane.]

[Figure 7.6: Car type II, 56 km/h. Training data of the change detectors in the $(z_1, z_2)$-plane.]
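The cumulative features of (7.29) can be computed with running sums, as in the following sketch. The function name, the return layout and the toy signals are assumptions for illustration; the inputs are taken to be the already low-pass filtered signals.

```python
def segmentation_features(u_zaex, u_ecsl, u_ecsr, tau_2):
    """Return (n0, z) where z[i] = (z1, z2) at time n = n0 + i, cf. (7.29)."""
    # n0: first time instance at which |u_ZAEX[n]| exceeds the second threshold.
    n0 = next((n for n, v in enumerate(u_zaex) if abs(v) > tau_2), None)
    if n0 is None:
        return None, []  # threshold never exceeded: no features are generated
    z, z1, z2 = [], 0.0, 0.0
    for n in range(n0, len(u_zaex)):
        z1 += abs(u_zaex[n])
        z2 += abs(u_ecsl[n]) + abs(u_ecsr[n])
        z.append((z1, z2))
    return n0, z

n0, z = segmentation_features(
    u_zaex=[0, 1, -3, 2], u_ecsl=[0, 1, 1, -1], u_ecsr=[0, 0, 2, 1], tau_2=2
)
print(n0, z)  # 2 [(3.0, 3.0), (5.0, 5.0)]
```

Because both entries are sums of absolute values, they never decrease over time, which is the property exploited by the sign constraints on the linear classifiers.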

In the next step the number of change detectors must be reduced according to Alg. 7.1. The stopping criteria required in the algorithm can be defined either by the maximum number of change detectors or by the misclassification costs. The costs that are mentioned in Lines 6 and 8 of Alg. 7.1 and which may be used as stopping criteria, can be computed either as the misclassification rate of the linear classifier in the two class problem or according to the importance weights γ[n] associated to each time stamp in every crash. For the latter option


$\gamma[n_c]$ is computed for every scenario that should be detected by the considered classifier, with $n_c$ being the first time stamp after the crossing of the separating line that is described by the classifier in the $(z_1, z_2)$-plane. The sum of the $\gamma[n_c]$ values over all scenarios that must be detected by the considered classifier yields the costs. In the following the misclassification rate in the two-class problem combined with the maximal number of change detectors is used as stopping criterion: change detectors are merged as long as the misclassification rate is smaller than 0.03; if this threshold is exceeded and the current number of change detectors is still greater than a predefined number, which is set to 4 in the considered application, the merging continues. The linear classifiers $f_{cd}^{(q)}$ are computed according to the optimization problem described in (7.26). An advantage of the two-dimensional space in which the segmentation classifiers are designed is the possibility to visualize their separating line and validate their suitability. With these settings the number of change detectors for car type I is reduced to $Q_I = 2$ and for car type II to $Q_{II} = 4$. Figures 7.7 and 7.8 show the separating lines that are included in the change detectors $cd^{(1)}$ and $cd^{(2)}$ for car type I. Again each crash is represented by two points, a circle for the time instance before the “do not care” interval starts and a cross for the time instance after the “do not care” interval ends. As can be seen, two change detectors are enough for car type I to determine time instances when a deployment is required. There is a single crash (represented in Fig. 7.7 by the circle with $z_1[n] = 181$ and $z_2[n] = 276$ lying above the separating line of $f_{cd}^{(1)}$) where the corresponding labeling classifier $f_L^{(1)}$ is triggered slightly too early.

[Figure 7.7: Car type I, separating line of $f_{cd}^{(1)}$ in the $(z_1, z_2)$-plane.]

[Figure 7.8: Car type I, separating line of $f_{cd}^{(2)}$ in the $(z_1, z_2)$-plane.]
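The bottom-up merging of Alg. 7.1 together with the stopping rule used in this application can be sketched as follows. The function `merge_cost` stands for training a change detector on a merged group of class-changes and estimating its cost; it is a hypothetical, user-supplied callback, and the toy cost used in the example is invented purely to exercise the code.

```python
from itertools import combinations

def reduce_change_detectors(class_changes, merge_cost, max_cost=0.03, target_q=4):
    """Greedy bottom-up merging in the spirit of Alg. 7.1.

    merge_cost(group) is assumed to train one change detector for the
    frozenset `group` of class-changes and to return its estimated cost.
    """
    groups = [frozenset([cc]) for cc in class_changes]  # one group per class-change
    while len(groups) > 1:
        # Evaluate every pair of meta-class-changes.
        candidates = [(merge_cost(g1 | g2), g1, g2)
                      for g1, g2 in combinations(groups, 2)]
        cost, g1, g2 = min(candidates, key=lambda c: c[0])
        # Stop once the cheapest merger is too costly and Q is already small enough.
        if cost >= max_cost and len(groups) <= target_q:
            break
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups

def toy_cost(group):
    # Invented cost: merging class-changes with the same first letter is free.
    return 0.0 if len({name[0] for name in group}) == 1 else 1.0

groups = reduce_change_detectors(["a1", "a2", "a3", "b1", "b2"], toy_cost)
print(groups)
```

With the toy cost, the three "a" class-changes and the two "b" class-changes end up in one group each, since any further merger would exceed the cost limit.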

Figures 7.9, 7.10, 7.11 and 7.12 show the separating lines that are included in the change detectors cd(1) , cd(2) , cd(3) and cd(4) for car type II. Here, four change detectors are needed to identify the required deployment times for the 13 groups of crashes. Comparing car type I with type II one notices that for similar crash configurations at the required deployment times, type I has higher values in the entries of z[n]. The reason is the higher mass of cars from type II which leads to smaller deceleration signals. Due to the construction of the features in z[n] every car collision with |uZAEX [n]| > τZAEX,2 also leads to the crossing of the lines defined by the segmentation classifiers. This means that every change detector has a high false positive rate if one considers more than the subset of class-changes for which the classifier was trained. Despite this fact, the true positive rate of the change detectors is high and for the overall classification system only the high true positive rate is important, since it guarantees the triggering of the labeling classifiers at the correct time instances. Due to the simple two-dimensional input space and the possibility

to validate the classifiers by visualization and domain knowledge it is not necessary to use methods like cross-validation to state that the true positive rate is high. Thus, in the given application linear classifiers are suitable for the implementation of the change detectors.

[Figures 7.9 to 7.12: Car type II, separating lines of $f_{cd}^{(1)}$, $f_{cd}^{(2)}$, $f_{cd}^{(3)}$ and $f_{cd}^{(4)}$ in the $(z_1, z_2)$-plane.]

7.2 Labeling Classifiers for On-Line Time Series Classification

The labeling classifiers $f_L^{(q)}$ are triggered by the corresponding change detectors $cd^{(q)}$, $q = 1, \dots, Q$, and as mentioned in connection with Fig. 7.3 they have the task to decide if a class-change occurs at the time instance when they are enabled and if yes, which one. The $q$-th labeling classifier

$$f_L^{(q)}: \mathbb{R}^{N_{x^{(q)}}} \to [0, 1]^{|\Omega^{(q)}|} \times \{0, c_0\}, \quad x^{(q)}[n] \mapsto \hat{y}^{(q)}[n] = \begin{bmatrix} \hat{y}_s^{(q)}[n] \\ \hat{y}_h^{(q)}[n] \end{bmatrix}, \tag{7.30}$$

must determine, based on the input $x^{(q)}[n] \in \mathbb{R}^{N_{x^{(q)}}}$, which is constructed according to (4.10), if at time instance $n$ the scenario $S$ changes its label to one of the classes included in $\Omega^{(q)}$ and if this is the case, to which class. The vector

$$\hat{y}_s^{(q)}[n] \in [0, 1]^{|\Omega^{(q)}|} \tag{7.31}$$

contains soft values of the decisions for the classes in $\Omega^{(q)}$, which in general are the estimated values of the conditional probabilities $p(y[n] = c_{k_q} \mid \mathsf{x}^{(q)}[n] = x^{(q)}[n])$, with $c_{k_q} \in \Omega^{(q)}$. On the other hand,

$$\hat{y}_h^{(q)}[n] \in \{0, c_0\} \tag{7.32}$$

represents a hard decision. Here, the class $c_0$ stands for “reject” and includes all classes in $\Omega \setminus \Omega^{(q)}$ for which $cd^{(q)}$ and $f_L^{(q)}$ have not been trained. If the output $\hat{y}_h^{(q)}[n]$ of $f_L^{(q)}$ is $c_0$, then the current module consisting of $cd^{(q)}$ and $f_L^{(q)}$ cannot detect a class-change reliably at time instance $n$. Otherwise, if $\hat{y}_h^{(q)}[n] = 0$, the classifier $f_L^{(q)}$ does not reject the decision and the soft values $\hat{y}_s^{(q)}[n]$ are used for the computation of $\hat{y}[n]$. Hereby, the Decision Logic allows a change of the output $\hat{y}[n]$ only at those time instances when a labeling classifier is triggered. At these time instances, among the triggered classifiers that do not reject the decision, the label with the highest confidence, i.e., the largest value in the considered vectors $\hat{y}_s^{(q)}[n]$, is determined and assigned to $\hat{y}[n]$. To assure that the labeling classifier $f_L^{(q)}$ discriminates correctly, its training set is built based on all scenarios $S_m$ which are triggered by the change detector $cd^{(q)}$. The scenarios in the training set $D_S$ are passed to $cd^{(q)}$ and for every time instance $n_{c,q}$ when the detector hypothesizes a change an input vector $x^{(q)}[n_{c,q}]$ is constructed according to (4.10) and stored in the training set for the classifier $f_L^{(q)}$. If the time stamp $n_{c,q}$ lies in a “do not care” interval, the target $y[n_{c,q}]$ associated to $x^{(q)}[n_{c,q}]$ is set to the class label following the interval. If the time stamp $n_{c,q}$ lies after a “do not care” interval, the target $y[n_{c,q}]$ is used. Finally, if the time stamp $n_{c,q}$ lies before the “do not care” interval, it must be differentiated whether the crash from which $x^{(q)}[n_{c,q}]$ is extracted is used for the training of $cd^{(q)}$. If this is the case, $y[n_{c,q}]$ is set to the class label $c_k$ following the “do not care” interval in this scenario, otherwise $y[n_{c,q}]$ is set to $c_1$, in order to assure that the safety systems are not deployed too early.
If $cd^{(q)}$ signals a change for a scenario where no safety systems should be deployed, $y[n_{c,q}]$ is also set to $c_1$. In this way, training sets for all $Q$ labeling classifiers are constructed.

It should be noted that for practical use the classification system from Fig. 7.3 requires an additional reset module which brings the system back into its initial state. This is necessary since the segmentation classifiers utilize features which are constructed as cumulative sums. A suitable time interval for a reset in the car crash classification problem has a length of approximately 100 ms.

An advantage of the proposed modular design of the classification system is that $f_L^{(q)}$ can be implemented with traditional machine learning algorithms. Thanks to the change detectors, the on-line time series classification problem is reduced to the better explored problem of static time series classification, since the scenarios are labeled only at selected time instances. The following subsections show how the GRBF classifier from Chapter 3 and nearest neighbor classifiers using temporal prototypes can be utilized to implement the labeling classifiers $f_L^{(q)}$ for the car crash classification application.

7.2.1 GRBF for Car Crash Classification

The Generalized Radial Basis Function (GRBF) classifiers from Chapter 3 offer valuable advantages for the implementation of the labeling modules $f_L^{(q)}$. In particular, properties like interpretability or the access to confidence values of a decision are desirable in many applications. GRBF classifiers can reject decisions if the confidence is too low, and their interpretability results from the similarity of the current input to predefined templates. Therefore, GRBF classifiers are very attractive for safety-critical applications, and their usage for crash classification is presented in the following.

As shown in Fig. 7.4, the labeling modules contain $Q$ classifiers $f_L^{(q)}$ which work in the input space generated by $f_{gx}$ and a classifier $f_L^{(Q+1)}$ that decides based on the features generated by $f_{gx}^{(Q+1)}$. The first $Q$ classifiers are implemented using the GRBF algorithm, whereas $f_L^{(Q+1)}$ is a threshold classifier. The reason for introducing $f_L^{(Q+1)}$ in the system is that a GRBF classifier rejects decisions for a large part of its input space. As already mentioned, this is an advantage since it leads to interpretability and confidence information about decisions, but it may be a disadvantage if the training set is small. In the car crash application there are only examples for collisions with relative velocity up to 64 km/h. Due to the localized nature of the GRBF classifier, it may happen that the features from crashes with much higher relative velocity lie too far away from the centers $\mathbf{t}_h$ which lead to the deployment of safety systems. Therefore, $f_L^{(Q+1)}$ is introduced as an expert system that is responsible for the deployment of safety systems in crashes with a very high velocity reduction in a short time interval. The output of $f_L^{(Q+1)}$ at time stamp $n$ is $\hat{y}^{(Q+1)}[n] \in \{c_1, \dots, c_K\}$. Firstly, the implementation of the $Q$ labeling modules containing the GRBF classifiers is discussed, and later on the implementation of $f_{gx}^{(Q+1)}$ and $f_L^{(Q+1)}$.

At time stamp $n$ the feature generation unit $f_{gx}$ has access to the samples of the multivariate time series $\mathbf{U}$ from the last $\ell_{mem}$ time instances:

$$f_{gx}: \mathbb{R}^{L \times \ell_{mem}} \to \mathbb{R}^{N}, \qquad \mathbf{U}^{[n-\ell_{mem}+1,\,n]} \mapsto \mathbf{x}[n]. \tag{7.33}$$

The parameter $\ell_{mem}$ is set to 16, since for crashes with high relative velocity the time between impact and required deployment is smaller than 16 ms, and for crashes with rather low relative velocity it is assumed that, at the time instances when the change detectors trigger, the last 16 ms of the signal $\mathbf{U}$ contain enough information about the collision. The computation of fair estimates of the system's performance shows that choosing $\ell_{mem}$ greater than 16 brings no improvement. Driven by the wish to compute interpretable features from the low-frequency parts of the signals, domain knowledge is used for the feature generation. In a first step the sets $M_{ZAEX}[n]$, $M_{ECSL}[n]$ and $M_{ECSR}[n]$ are defined to contain those time instances in the interval $[n - \ell_{mem} + 1, n]$ where the amplitudes of the corresponding signals $u_{ZAEX}$, $u_{ECSL}$, and $u_{ECSR}$ exceed the threshold $\tau_{fg}$:

$$M_{ZAEX}[n] = \{n' : |u_{ZAEX}[n']| > \tau_{fg},\; n - \ell_{mem} < n' \le n\} \tag{7.34}$$
$$M_{ECSL}[n] = \{n' : |u_{ECSL}[n']| > \tau_{fg},\; n - \ell_{mem} < n' \le n\} \tag{7.35}$$
$$M_{ECSR}[n] = \{n' : |u_{ECSR}[n']| > \tau_{fg},\; n - \ell_{mem} < n' \le n\}. \tag{7.36}$$

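For $n < n_0$ the signal $\mathbf{U}$ is assumed to consist of zeros. The entries of the feature vector $\mathbf{x}[n] \in \mathbb{R}^N$, with $N = 3$, are computed as

$$x_1[n] = \frac{1}{|M_{ZAEX}[n]|} \sum_{n' \in M_{ZAEX}[n]} |u_{ZAEX}[n']|, \tag{7.37}$$

$$x_2[n] = \frac{1}{|M_{ECSL}[n]|} \sum_{n' \in M_{ECSL}[n]} |u_{ECSL}[n']| + \frac{1}{|M_{ECSR}[n]|} \sum_{n' \in M_{ECSR}[n]} |u_{ECSR}[n']|, \tag{7.38}$$

$$x_3[n] = \left| \frac{1}{|M_{ECSL}[n]|} \sum_{n' \in M_{ECSL}[n]} |u_{ECSL}[n']| - \frac{1}{|M_{ECSR}[n]|} \sum_{n' \in M_{ECSR}[n]} |u_{ECSR}[n']| \right|. \tag{7.39}$$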
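The feature computation in (7.34) to (7.39) can be illustrated with a minimal sketch. It assumes that each sensor signal is stored as a one-dimensional array whose index 0 corresponds to $n_0$; the function names, the default value of `tau_fg`, and the convention of returning 0 for an empty set $M[n]$ are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def local_mean_abs(u, n, mem=16, tau_fg=0.5):
    """Mean of |u[n']| over the last `mem` samples whose magnitude exceeds tau_fg."""
    u = np.asarray(u, dtype=float)
    start = max(0, n - mem + 1)        # samples before n0 are zeros and never exceed tau_fg
    window = np.abs(u[start:n + 1])
    active = window[window > tau_fg]   # the set M[n] from (7.34)-(7.36)
    # Assumption: an empty set yields 0.0 (this case is not specified in the text).
    return float(active.mean()) if active.size else 0.0

def crash_features(u_zaex, u_ecsl, u_ecsr, n, mem=16, tau_fg=0.5):
    """Feature vector [x1, x2, x3] following (7.37), (7.38) and (7.39)."""
    m_zaex = local_mean_abs(u_zaex, n, mem, tau_fg)
    m_left = local_mean_abs(u_ecsl, n, mem, tau_fg)
    m_right = local_mean_abs(u_ecsr, n, mem, tau_fg)
    return np.array([m_zaex, m_left + m_right, abs(m_left - m_right)])
```

Because only samples above the threshold enter the mean, the zeros that precede the crash do not pull the features down at the beginning of a scenario.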

The threshold $\tau_{fg}$ has a low value and is introduced to ensure that the features described by (7.37), (7.38) and (7.39) are computed as means over only those signal values that stem from the crash. For the experimental results the value of $\tau_{fg}$ is set to $\tau_{ZAEX,1}$. The mean over all $\ell_{mem}$ values would distort the features at the beginning of every crash due to the zeros that are assumed to precede $n_0$ in the signal $\mathbf{U}$. Thus, $x_1[n]$ measures the local mean absolute value at time instance $n$ in the ZAEX signal produced by a crash. Similarly, $x_2[n]$ measures the sum of the local mean absolute values at time instance $n$ in the signals produced by the crash in the two front sensors ECSL and ECSR. Finally, $x_3[n]$ is a measure for the difference in the local mean absolute values at time instance $n$ in the signals produced by the crash in the two front sensors ECSL and ECSR.

The training set $D^{(q)}$ for the GRBF classifier $f_L^{(q)}$ is constructed as described above from the input vectors $\mathbf{x}[n_{c,q}] = [x_1[n_{c,q}], x_2[n_{c,q}], x_3[n_{c,q}]]^T$ and the output vectors $y[n_{c,q}]$, where $n_{c,q}$ denotes the time instances when $cd^{(q)}$ signals a change. As described in Section 5.1, in this application the output vector $y[n_{c,q}]$ belongs to one of the four classes $c_1$, $c_2$, $c_3$ and $c_4$. Based on $D^{(q)}$, the $q$-th GRBF classifier, with $q = 1, \dots, Q$, is computed as described in Chapter 3.

The $(Q+1)$-th module uses a threshold classifier and is designed to ensure the deployment of safety systems for crashes where the velocity reduction is very high in a short time interval, which mainly occurs for collisions with high relative velocity at an early time instance. The input vectors $\mathbf{x}^{(Q+1)}[n] = [x_1^{(Q+1)}[n], x_2^{(Q+1)}[n]]^T$ of $f_L^{(Q+1)}$ are computed based on the features $x_1[n]$ and $x_2[n]$ from (7.37) and (7.38) as

$$x_1^{(Q+1)}[n] = \max(0,\; x_1[n] - x_1[n - \ell_{past}]) \tag{7.40}$$
$$x_2^{(Q+1)}[n] = \max(0,\; x_2[n] - x_2[n - \ell_{past}]). \tag{7.41}$$

Here, the max-operator is introduced to produce only non-negative features in $\mathbf{x}^{(Q+1)}[n]$, which are then a measure of how much the local mean absolute values per sample have increased in the last $\ell_{past}$ time stamps. The parameter $\ell_{past}$ is set to 8, since for crashes with high relative velocity the local mean absolute values of the signals grow strongly during a time interval of this length. For crashes with high relative velocity, $x_1^{(Q+1)}[n]$ and $x_2^{(Q+1)}[n]$ have large values at early stages of the collision and are thus suitable features for safety system deployment algorithms.

The classifier $f_L^{(Q+1)}$ itself is built from three threshold classifiers, one for each of the classes $c_2$, $c_3$ and $c_4$. The main idea when constructing the threshold classifier for class $c_k$, with $k \in \{2, 3, 4\}$, is to quantize the feature $x_1^{(Q+1)}[n]$ into $J_{quant}$ intervals and to define for every interval a threshold $\tau^k_{j_{quant}}$ for $x_2^{(Q+1)}[n]$ that discriminates between $c_k$ and the other classes. If for an input $\mathbf{x}^{(Q+1)}[n]$ the feature $x_1^{(Q+1)}[n]$ lies in the $j_{quant}$-th interval and the feature $x_2^{(Q+1)}[n]$ exceeds the threshold $\tau^k_{j_{quant}}$, the class $c_k$ is detected. In Fig. 7.13 the threshold classifier for class $c_k$, with $J_{quant} = 5$, is visualized.

The quantization steps and the thresholds for each classifier are computed by selecting in a first step all scenarios in the training set which do not require the deployment of those safety systems that are represented by $c_k$. Then, the greatest value of $x_1^{(Q+1)}[n]$ among all these scenarios is determined, and the range up to this value is divided into $(J_{quant} - 1)$ equally large intervals. The upper bound of the $(J_{quant} - 1)$-th interval is enlarged by a scaling factor $\gamma_{scale}$, since the safety systems represented by $c_k$ are deployed if the value of $x_1^{(Q+1)}[n]$ exceeds this bound. The scaling factor $\gamma_{scale}$ is set to 1.2. Finally, in each of the quantization intervals of $x_1^{(Q+1)}[n]$ the largest value of $x_2^{(Q+1)}[n]$ among the selected scenarios is determined and multiplied by $\gamma_{scale}$. The resulting value for the $j$-th interval is $\tau_j^k$.

Figure 7.13: Threshold classifier for class $c_k$ [plot not reproduced; horizontal axis $x_1^{(Q+1)}[n]$, divided into the intervals $j^k_{quant} = 1, \dots, 5$; vertical axis $x_2^{(Q+1)}[n]$, with the thresholds $\tau_1^k, \dots, \tau_5^k$ bounding the region assigned to $c_k$]

If in a scenario the thresholds for more than one class are exceeded, the classifier $f_L^{(Q+1)}$ generates the output $\hat{y}^{(Q+1)}[n] = c_k$ with the highest index $k$, since the deployment of the first stage of the airbag requires the deployment of the seat belt tensioner, and the deployment of the second stage of the airbag requires the deployment of the first stage and the belt tensioner. The introduction of $f_L^{(Q+1)}$ in the classification system is taken into account by the Decision Logic in such a way that $\hat{y}^{(Q+1)}[n]$ overrules the decision computed from the soft values $\hat{y}_s^{(q)}[n]$ only if $\hat{y}^{(Q+1)}[n] = c_k$, with $k > 1$.

The performance of the classification systems is evaluated as described in Section 5.2 by computing the histograms for the entries in Table 5.3 with the Leave-One-Out (LOO) estimates for each $\gamma_{imp}$ and each $\alpha_{scale} \in \{0.85, 1.00, 1.15\}$. In Table 7.1 the results are presented for car type I and the unscaled raw data, i.e., $\alpha_{scale} = 1$. As can be seen, crashes with $\gamma_{imp} = 1$ produce the highest rate of false classifications. The upper left figure in Table 7.1 shows that there are two crashes that lead to the deployment of safety systems, although this is not wanted. There are also three crashes whose severity is overestimated, since they require only the deployment of the first stage of the airbags, but the second stage is deployed as well. Similarly, the upper right figure in Table 7.1 shows that there are two crashes which require the deployment of safety systems but which are labeled as belonging to class $c_1$. These misclassified crashes are represented by points in the input spaces of the labeling classifiers which lie at the border between regions that require and regions that do not require the deployment of safety systems. Since for the LOO estimates the crashes leading to the histograms are not in the training data, points lying at the border between classes can lead to a reject or to a misclassification. It should be mentioned that by increasing the threshold defining a reject (see Chapter 3) it is possible to classify all these crashes as belonging to $c_0$, but this also decreases the true positive rate.
Moreover, the middle figure at the top in Table 7.1 shows that there is one crash where the safety systems are deployed too early. All crashes with $\gamma_{imp} = 2$ and $\gamma_{imp} = 3$ deploy the required safety systems. Thus, the classification system performs very well. The misclassifications that occur affect crashes with low relative velocity ($\gamma_{imp} = 1$), favoring the deployment of safety systems in ambiguous collisions.

Table 7.1: Car type I, $\alpha_{scale} = 1.00$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 19.2%, correct 73.1%, underestimated 7.7%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

If one computes the resubstitution performance of the classification system, i.e., the validation is performed based on crashes that are used in the training set, the two misclassifications in the top left figure of Table 7.1, where safety systems are mistakenly deployed, remain. The two crashes are “misuse” collisions with deer (boxing sacks). These misclassifications occur because the points in the input spaces of the labeling classifiers representing the crashes lie in a region that is dominated by points representing crashes that require the deployment of safety systems. Therefore, during the RF clustering for the construction of the GRBF classifiers the points from the two deer collisions are identified as outliers and are not considered for the construction of the labeling classifiers. This is not a weakness of the classification system but rather a strength, since it leads to robust classifiers that avoid the generation of clusters for only one or two examples, which would implement special rules for outliers.

Table 7.2 presents the histograms for the LOO evaluation of the classification system obtained by scaling the raw data with $\alpha_{scale} = 0.85$, and Table 7.3 the histograms resulting from scaling with $\alpha_{scale} = 1.15$. As can be seen, for crashes with $\gamma_{imp} = 1$ a downscaling leads to a slight shift to the right in the histograms, i.e., the deployment times are later, whereas an upscaling leads to more crashes where the safety systems are mistakenly deployed. This confirms the intuitive expectation of monotonicity. For $\gamma_{imp} = 3$ no changes occur by up- or downscaling. These results underline the high robustness of the classification system.

Table 7.4 presents the performance of the classification system without the labeling classifier $f_L^{(Q+1)}$. By comparing with Table 7.1 it can be seen that $f_L^{(Q+1)}$ has almost no influence on the performance of the classification system.
Nevertheless, it is important to include a classifier like $f_L^{(Q+1)}$ in the system to ensure that the safety systems are deployed at an early stage of the collision also in crashes whose velocity reduction in short time intervals is higher than in the available data sets.
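The construction of the threshold classifier $f_L^{(Q+1)}$ from the increase features (7.40) and (7.41) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names are hypothetical, an interval without training points inherits the largest scaled threshold, and every input beyond the enlarged upper bound is treated as a detection of $c_k$.

```python
import numpy as np

def delta_features(x1, x2, past=8):
    """Increase features in the spirit of (7.40) and (7.41)."""
    def increase(x):
        x = np.asarray(x, dtype=float)
        # x[n - past], with zeros assumed before the start of the scenario
        prev = np.concatenate([np.zeros(past), x])[:len(x)]
        return np.maximum(0.0, x - prev)
    return increase(x1), increase(x2)

def fit_thresholds(d1, d2, j_quant=5, gamma_scale=1.2):
    """d1, d2: increase features of training scenarios that must NOT deploy c_k."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    edges = np.linspace(0.0, d1.max(), j_quant)  # j_quant - 1 equally large intervals
    edges[-1] *= gamma_scale                     # enlarge the last upper bound
    idx = np.clip(np.searchsorted(edges, d1, side="right") - 1, 0, j_quant - 2)
    # Assumption: empty intervals fall back to the largest scaled threshold.
    taus = np.full(j_quant - 1, gamma_scale * d2.max())
    for j in range(j_quant - 1):
        if np.any(idx == j):
            taus[j] = gamma_scale * d2[idx == j].max()
    return edges, taus

def detects(edges, taus, v1, v2):
    """True if the input (v1, v2) is assigned to c_k."""
    if v1 > edges[-1]:                           # beyond the enlarged bound
        return True
    j = int(np.clip(np.searchsorted(edges, v1, side="right") - 1, 0, len(taus) - 1))
    return bool(v2 > taus[j])

def classify(models, v1, v2):
    """models maps a class index k in {2, 3, 4} to (edges, taus); highest k wins."""
    hits = [k for k, (edges, taus) in models.items() if detects(edges, taus, v1, v2)]
    return max(hits) if hits else 1              # 1 stands for class c_1
```

Choosing the highest detected index mirrors the nesting of the safety systems: the second airbag stage implies the first stage and the belt tensioner.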

Table 7.2: Car type I, $\alpha_{scale} = 0.85$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 19.2%, correct 73.1%, underestimated 7.7%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Table 7.3: Car type I, $\alpha_{scale} = 1.15$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 34.6%, correct 61.6%, underestimated 3.8%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Table 7.4: Car type I, $\alpha_{scale} = 1.00$, no $f_L^{(Q+1)}$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 19.2%, correct 73.1%, underestimated 7.7%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Analogously to car type I, the performance of the classification system is validated for car type II. Table 7.5 presents the results for $\alpha_{scale} = 1.00$, Table 7.6 for $\alpha_{scale} = 0.85$ and Table 7.7 for $\alpha_{scale} = 1.15$. The results for car type II also confirm the expectation that a downscaling of the signals shifts the deployment times to a later time instance and an upscaling to an earlier one. Again, for all crashes with $\gamma_{imp} = 2$ and $\gamma_{imp} = 3$ the required safety systems are deployed. There are four crashes with $\gamma_{imp} = 2$ and six crashes with $\gamma_{imp} = 3$ that deploy the safety systems later than required. Nevertheless, when considering the “do not care” intervals obtained with the 5-30 ms rule, only for a single crash, the one with $\gamma_{imp} = 2$ and $\Delta t_c = 14$, is the second stage of the airbag deployed 2 ms too late. It is a crash with a rigid collision partner at 30 km/h and 0° angle of impact. For this crash the first stage of the airbag is deployed within the “do not care” interval for the second stage of the airbag. The decision to inflate the second stage of the airbag as well is taken by $f_L^{(Q+1)}$. One can see from Table 7.8, where the case without $f_L^{(Q+1)}$ is considered, that without the $(Q+1)$-th classification module only the first stage of the airbag is inflated for this crash. Table 7.8 illustrates that also for car type II the introduction of $f_L^{(Q+1)}$ does not influence the performance of the classification system significantly for the considered types of crashes.

Table 7.5: Car type II, $\alpha_{scale} = 1.00$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 31.5%, correct 63.2%, underestimated 5.3%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Table 7.6: Car type II, $\alpha_{scale} = 0.85$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 31.5%, correct 60.6%, underestimated 7.9%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 94.7%, underestimated 5.3%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Table 7.7: Car type II, $\alpha_{scale} = 1.15$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 36.9%, correct 60.5%, underestimated 2.6%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Table 7.8: Car type II, $\alpha_{scale} = 1.00$, no $f_L^{(Q+1)}$ [histogram panels not reproduced; panel titles:]
$\gamma_{imp} = 1$: overestimated 31.5%, correct 63.2%, underestimated 5.3%
$\gamma_{imp} = 2$: overestimated 0.0%, correct 100.0%, underestimated 0.0%
$\gamma_{imp} = 3$: overestimated 0.0%, correct 100.0%, underestimated 0.0%

Comparing the classification systems for car type I and car type II, one notices that the performance for car type I is superior, since the time instances when the safety systems are deployed fit the requirements better. The fact that four change detectors are needed for car type II, as presented in Subsection 7.1.3, whereas two change detectors are enough for car type I, is a first hint that the requirements for car type II are harder to meet. Nevertheless, the classification systems show a very reliable and robust performance for crashes where the injury risk for the passengers is high ($\gamma_{imp} = 2$ and $\gamma_{imp} = 3$). For crashes with lower relative velocities ($\gamma_{imp} = 1$) some misclassifications occur during the LOO estimation of the performance.

In addition to the statistical validation or the interpretability that can be achieved for some classifiers, visualization is an instrument that can be used by experts to verify whether a classification system leads to suitable solutions for the considered task. The input vector $\mathbf{x}[n]$ is three-dimensional, and thus it is possible to visualize how the classifiers $f_L^{(q)}$ label the input space. As an example, Fig. 7.14 shows the segmentation of the input space as performed by $f_L^{(1)}$ for car type I. On the left, the regions of the input space that belong to class $c_1$ are marked bright grey (green) and regions that belong to class $c_3$ are marked dark grey (blue). On the right of Fig. 7.14 the region belonging to class $c_4$ is marked dark grey (red). Since there is no training data for class $c_2$ for car type I, all other parts of the input space lead to $c_0$, i.e., a decision is rejected by $f_L^{(1)}$. This segmentation of the input space is interpretable: all impacts that do not require the deployment of safety systems lie close to the origin, since the local mean absolute values in the signals produced by such a crash are low. The crashes that require only the deployment of the first stage of the airbag have larger local mean absolute values in the signals at the time instances when $cd^{(1)}$ triggers $f_L^{(1)}$. Finally, all crashes that require the deployment of the second stage of the airbag have large local mean absolute values in the deceleration signals. The GRBF classifier that implements $f_L^{(1)}$ has three centers, one for each of the classes $c_1$, $c_3$ and $c_4$.

Figure 7.14: Regions of the input space that belong to the classes c1 and c3 are shown on the left and the region belonging to class c4 on the right for car type I
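The interplay between the $Q$ GRBF modules and the expert module $f_L^{(Q+1)}$ in the Decision Logic, as described at the beginning of this section and in this subsection, can be summarized in a short sketch. The data layout is an assumption made for illustration only: each triggered module delivers a pair of hard label and soft values, class $c_k$ is encoded as the integer $k$, and `REJECT` encodes $c_0$.

```python
import numpy as np

REJECT = 0  # encodes the reject class c_0

def decision_logic(prev_label, triggered, expert_label=1):
    """Return the system output at one time instance.

    prev_label:   output at the previous time instance (kept if nothing changes)
    triggered:    list of (hard_label, soft_values) from the modules triggered now;
                  soft_values[k - 1] is the confidence for class c_k
    expert_label: output of the threshold classifier f_L^(Q+1)
    """
    if expert_label > 1:                 # the expert overrules only for c_k with k > 1
        return expert_label
    best_label, best_conf = prev_label, -np.inf
    for hard_label, soft_values in triggered:
        if hard_label == REJECT:         # module cannot decide reliably
            continue
        k = int(np.argmax(soft_values))
        if soft_values[k] > best_conf:   # keep the most confident label
            best_label, best_conf = k + 1, float(soft_values[k])
    return best_label
```

If no module is triggered, or all triggered modules reject, the previous output is kept, which reflects that the output may change only when a labeling classifier is triggered.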

7.2.2 Temporal Prototypes for Car Crash Classification

The feature vectors $\mathbf{x}[n] = [x_1[n], x_2[n], x_3[n]]^T$, which are constructed in the feature generation block from Fig. 7.4 according to (7.37), (7.38) and (7.39), form the scenario $\mathbf{X}^{[n_0,n]} = [\mathbf{x}[n_0], \dots, \mathbf{x}[n]] \in \mathbb{R}^{L \times (n - n_0 + 1)}$, with $L = 3$. The scenario $\mathbf{X}^{[n_0,n]}$ can be segmented according to the change detectors $cd^{(q)}$. If the $q$-th change detector signals a possible class change at time instance $n_{c,q}$, then the resulting segment $\mathbf{X}^{[n_0,n_{c,q}]}$ can be classified by computing the similarity of $\mathbf{X}^{[n_0,n_{c,q}]}$ to some predefined prototypes. As already mentioned in Chapter 4, in such an approach the choice of the similarity measure is crucial. Here, the Augmented Dynamic Time Warping (ADTW) technique from Section 4.4 is used due to its ability to capture the similarity in shape as well as the differences in duration between scenarios. The latter property is important since the length of $\mathbf{X}^{[n_0,n_{c,q}]}$, $\tilde{\ell}^{(q)} = n_{c,q} - n_0 + 1$, is a discriminative feature for the car crash detection task: $\tilde{\ell}^{(q)}$ is small for severe and large for harmless crashes.

In the following, the output of the feature generation block from Fig. 7.4 at the trigger time instance $n_{c,q}$ is not $\mathbf{x}[n_{c,q}]$ but a scenario $\check{\mathbf{X}}^{[n_0,n_{c,q}]} \in \mathbb{R}^{(L+1) \times (n_{c,q} - n_0 + 1)}$, with $L = 3$. The first three rows in $\check{\mathbf{X}}^{[n_0,n_{c,q}]}$ are the normalized rows from $\mathbf{X}^{[n_0,n_{c,q}]}$, and the fourth row in $\check{\mathbf{X}}^{[n_0,n_{c,q}]}$ is built from the time stamps $\mathbf{x}_4^{[n_0,n_{c,q}],T} = [1, \dots, n_{c,q} - n_0 + 1]^T$ by normalizing each entry in $\mathbf{x}_4^{[n_0,n_{c,q}],T}$. The normalization for the ADTW similarity measure is achieved by dividing each element $x_\ell[n]$ in the univariate time series $\mathbf{x}_\ell^{[n_0,n_{c,q}],T}$ by $x^{norm}_{q,\ell}$:

$$\check{x}_{q,\ell}[n] = \frac{1}{x^{norm}_{q,\ell}}\, x_\ell[n], \quad \text{with} \quad x^{norm}_{q,\ell} = \max_{m \in M_{cc}} \; \max_{n_{0,m} \le n \le n_{c,q,m}} \{ |x_{m,\ell}[n]| \}, \tag{7.42}$$

where $M_{cc}$ denotes the set of indices for the scenarios in the training set where a class change occurs, $n_{0,m}$ the origin of the $m$-th scenario and $n_{c,q,m}$ the time instance when the $q$-th change detector indicates a possible class change for the $m$-th scenario.

For the labeling classifiers $f_L^{(q)}$ the One-Nearest-Neighbor (1NN) algorithm is applied for the car crash detection task due to its interpretability. In contrast to the classification method using the GRBF classifiers, the $(Q+1)$-th module is not required here, so that it is absent from the block diagram from Fig. 7.4. This will be discussed below. As already mentioned above, due to the modular approach containing change detectors, the on-line classification problem is reduced to the task of static time series classification. Here, the $q$-th labeling classifier implements the mapping

$$f_L^{(q)}: \mathbb{R}^{(L+1) \times (n_{c,q} - n_0 + 1)} \to \{c_1, \dots, c_K\}, \qquad \check{\mathbf{X}}^{[n_0,n_{c,q}]} \mapsto \hat{y}^{(q)}[n_{c,q}]. \tag{7.43}$$
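The construction of the augmented and normalized scenario can be illustrated with a small sketch. It assumes that every training segment is available as an array of shape $(L, T_m)$ covering $n_0$ up to the trigger time, and that no row is identically zero; the function names are hypothetical.

```python
import numpy as np

def augment(segment):
    """Append the time stamps 1, ..., T as an additional (L+1)-th row."""
    segment = np.asarray(segment, dtype=float)
    stamps = np.arange(1, segment.shape[1] + 1, dtype=float)
    return np.vstack([segment, stamps])

def fit_norms(train_segments):
    """Row-wise normalization constants as in (7.42).

    train_segments: segments of those training scenarios in which a class
    change occurs (the index set M_cc), each of shape (L, T_m).
    """
    row_maxima = [np.abs(augment(seg)).max(axis=1) for seg in train_segments]
    return np.max(row_maxima, axis=0)   # maximum over all scenarios, per row

def normalize(segment, norms):
    """Augmented scenario with every row divided by its normalization constant."""
    return augment(segment) / norms[:, None]
```

With this normalization the fourth row grows linearly with the elapsed time, so the duration of a segment enters the similarity computation on the same scale as the other rows.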

The training set for $f_L^{(q)}$ is built from the $M$ scenarios $\check{\mathbf{X}}_m^{[n_0,n_{c,q}]}$ resulting from $D_S$ and the corresponding targets $y_m[n_{c,q}]$. The target associated with $\check{\mathbf{X}}_m^{[n_0,n_{c,q}]}$ is $y_m[n_{c,q}]$, whose value is set to the class label that follows the “do not care” interval of the $m$-th crash. If the crash does not have a “do not care” interval, then $y_m[n_{c,q}]$ is set to $c_1$.

Each change detector has its own set of prototypes. Considering the $q$-th change detector, in a first step the segments $\check{\mathbf{X}}_m^{[n_0,n_{c,q}]}$, $m = 1, \dots, M$, from all scenarios in the training set are determined, and then their number is reduced similarly to the procedure from Alg. 4.1. The modified algorithm for the construction of prototypes for the $q$-th change detector is presented in Alg. 7.2. There are two major modifications compared to Alg. 4.1. The first change must be included in order to take into account that each change detector is only responsible for a group of crash settings. Thus, in a first step all $M$ scenarios $\check{\mathbf{X}}_m^{[n_0,n_{c,q}]}$ resulting from the training set $D_S$ are separated into sets $D_r^{(q)}$. All scenarios in $D_r^{(q)}$ belong to the same class and their class change times are determined by the same change detector. Since a change detector may be trained to trigger scenarios belonging to different classes, and the scenarios belonging to a class can be triggered by different change detectors, the number of emerging sets $D_r^{(q)}$ depends on the data in $D_S$. The second modification in Alg. 7.2 compared to Alg. 4.1 is the additional IF-condition in Line 7. This condition ensures that at least $\ell_{sim}$ scenarios are used for the construction of one prototype.

Algorithm 7.2 Generation of prototypes for the q-th change detector
 1: include all segments $\check{\mathbf{X}}_m^{[n_0,n_{c,q}]}$ into the set $D^{(q)}$
 2: generate subsets $D_r^{(q)}$ from $D^{(q)}$ in such a way that all scenarios that belong to the same class and whose class change times must be determined by the same change detector are included into one subset $D_r^{(q)}$
 3: for all subsets $D_r^{(q)}$ do
 4:   compute the similarities between all scenarios in $D_r^{(q)}$
 5:   while $D_r^{(q)}$ not empty do
 6:     determine that segment in $D_r^{(q)}$ which has the largest number of similar segments in $D_r^{(q)}$. Segments are similar if their similarity exceeds the threshold $\tau_{sim}$
 7:     if the set of scenarios built by the segment from Line 6 and all segments that are similar to it contains fewer than $\ell_{sim}$ elements then
 8:       mark all scenarios from $D_r^{(q)}$ as outliers and let $D_r^{(q)}$ be the empty set
 9:     else
10:       let the segment from Line 6 and all segments that are similar to it build a new cluster and remove them from $D_r^{(q)}$
11:       merge all scenarios from the newly created cluster to a prototype
12:     end if
13:   end while
14: end for

After the construction of prototypes, which is described for the ADTW technique in Section 4.4, one obtains for the $q$-th module $H_q$ prototypes $P_h^{(q)}$, with $h = 1, \dots, H_q$. The classification of a query scenario $\check{\mathbf{X}}^{[n_0,n_{c,q}]}$ based on these prototypes is presented in Alg. 7.3. As can be seen, a $(Q+1)$-th module as depicted in Fig. 7.4 is not required, since the module that is triggered last always generates an output according to the class of the nearest neighbor if no other module takes the decision at an earlier time.

Algorithm 7.3 Classification based on prototypes with the 1NN algorithm
 1: for all time instances $n_{c,q}$ in their chronological order do
 2:   determine the nearest neighbor prototype $P_{h_{NN}}^{(q)}$ to $\check{\mathbf{X}}^{[n_0,n_{c,q}]}$
 3:   if $P_{h_{NN}}^{(q)}$ is constructed from a set $D_r^{(q)}$ containing scenarios whose class-change times are recognized by the q-th change detector then
 4:     set the output $\hat{y}[n_{c,q}]$ to the class label associated with $P_{h_{NN}}^{(q)}$ and exit
 5:   else if $n_{c,q} \ge n_{c,q'}, \; \forall q' \ne q$ then
 6:     set the output $\hat{y}[n_{c,q}]$ to the class label associated with $P_{h_{NN}}^{(q)}$
 7:   else
 8:     set the output $\hat{y}[n_{c,q}]$ to $c_1$, i.e., reject the decision for the current time instance
 9:   end if
10: end for

An advantage of classification methods based on the similarity to prototypes is the possibility to visualize results for plausibility. For Figures 7.15 and 7.16 a 40 km/h crash against a rigid wall with 0° angle of impact and 0% offset was removed from the training set and used as query scenario. The figures show the three feature sequences $\check{x}_4[n]$, $\check{x}_1[n]$, and $\check{x}_3[n]$ for scenarios of car type I. In Fig. 7.15 no reduction to prototypes is performed.
The light grey (green) solid curves represent scenarios belonging to class c1, the dotted dark grey (blue) curves scenarios belonging to class c3, and the dot-dashed dark grey (red) curves scenarios whose class label is c4. The query scenario is marked by a bold black curve. Its nearest neighbor according to the ADTW similarity is drawn as a bold dot-dashed dark grey (red) curve, i. e., the query scenario is assigned to class c4, since the first change detector, which leads to the curves in the figure, is responsible for identifying the class change of the scenario represented by that curve. Fig. 7.16 presents the same setting, but here a reduction to prototypes according to Alg. 7.2 is performed. As can be seen, there are only $H_1 = 8$ prototypes $P_h^{(1)}$: two for class c1, one for


class c3 and five for class c4. Again, according to the ADTW similarity, the nearest neighbor of the query 40 km/h crash, which is marked by a bold black curve, is the bold dot-dashed dark grey (red) prototype belonging to class c4. The query scenario is assigned to class c4 since the first change detector is responsible for the class change of the bold dot-dashed dark grey (red) prototype.
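The greedy clustering of Alg. 7.2 for a single subset can be sketched as follows. This is only a minimal illustration, not the implementation used in the thesis: the function names `cluster_subset` and `medoid` are hypothetical, the input is assumed to be a precomputed matrix of pairwise dissimilarities (so "similar" means "below the threshold `tau`"), and merging a cluster into a prototype is replaced here by picking its medoid, which is an assumption made purely for illustration.

```python
def cluster_subset(dissim, tau, min_size):
    """Greedy clustering of one subset, following the loop of Alg. 7.2.

    dissim   -- symmetric matrix (list of lists) of pairwise dissimilarities
    tau      -- two segments are similar if their dissimilarity is below tau
    min_size -- clusters with fewer members are treated as outliers
    Returns (clusters, outliers) as lists of segment indices.
    """
    remaining = list(range(len(dissim)))
    clusters, outliers = [], []
    while remaining:
        # Line 6: for each remaining segment collect the segments similar to it
        # (each segment is similar to itself because dissim[i][i] == 0 < tau).
        neighbours = {i: [j for j in remaining if dissim[i][j] < tau]
                      for i in remaining}
        best = max(remaining, key=lambda i: len(neighbours[i]))
        members = neighbours[best]
        if len(members) < min_size:
            # Lines 7-8: too few similar segments, mark the rest as outliers.
            outliers.extend(remaining)
            remaining = []
        else:
            # Line 10: build a new cluster and remove it from the subset.
            clusters.append(members)
            remaining = [i for i in remaining if i not in members]
    return clusters, outliers


def medoid(dissim, members):
    # Assumed stand-in for Line 11: represent the cluster by the member
    # with the smallest summed dissimilarity to all cluster members.
    return min(members, key=lambda i: sum(dissim[i][j] for j in members))


# Toy data: five scalar "segments"; the dissimilarity is the absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 9.0]
toy = [[abs(a - b) for b in points] for a in points]
clusters, outliers = cluster_subset(toy, tau=0.3, min_size=2)
```

In this toy run the three close points form one cluster, while the two isolated points are marked as outliers, which mirrors the way isolated scenarios such as the deer collisions are excluded from the construction of prototypes.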

Figure 7.15: Classification of a 40 km/h crash, car type I, with all scenarios $\check{X}_m^{[n_0, n_{c,1}]}$ (3-D plot over the axes $\check{x}_4[n]$, $\check{x}_1[n]$, and $\check{x}_3[n]$; not reproduced)

Figure 7.16: Classification of a 40 km/h crash with prototypes $P_h^{(1)}$, car type I (3-D plot over the axes $\check{x}_4[n]$, $\check{x}_1[n]$, and $\check{x}_3[n]$; not reproduced)

The performance of the classification systems is evaluated as described in Section 5.2 by computing histograms for the entries in Table 5.3 with the LOO estimates for each γimp and each αscale ∈ {0.85, 1.00, 1.15}. Tables 7.9, 7.10, and 7.11 consider the case where no reduction to prototypes is performed for car type I, so that all scenarios from the training set are treated as prototypes. This procedure leads to a large number of scenarios that must be stored and a high computational burden for determining the nearest neighbor. In Table 7.9 the results are presented for car type I and αscale = 1.00. As can be seen, crashes with γimp = 1 produce the highest rate of false classifications. The upper left figure in Table 7.9 shows that there are two crashes that lead to the deployment of safety systems, although this is not wanted. There are also two crashes whose severity is overestimated, since they require only the deployment of the first stage of the airbags but the second stage is deployed as well. The upper right figure in Table 7.9 shows that there are two crashes which require the deployment of safety systems but which are labeled as belonging to class c1. These results are very similar to those from Subsection 7.2.1; the misclassified crashes represent the same time series that cannot be assigned reliably to only one class. In contrast to the GRBF classification system from the previous subsection, for γimp = 2 there are two crashes where the second stage of the airbag is deployed 3 ms and 4 ms too late. These delays result from the fact that, according to Alg. 7.3, some decisions are rejected by all labeling classifiers up to the one that is triggered last. For γimp = 3 no misclassifications occur. The jackknife classification performance for a down- and upscaling of the raw data by 15% in the query scenario is shown for car type I in Tables 7.10 and 7.11, respectively.
As expected downscaling increases the rate of underestimated scenarios and upscaling the rate of overestimated scenarios. When downscaling the raw data with αscale = 0.85 there is a crash with γimp = 3 where a misclassification occurs, as it can be seen from the bottom right figure in Table 7.10. This misclassification is due to the presence of two crashes representing collisions with deer where no deployment of safety systems is required. As already mentioned in the previous subsection these deer-collisions generate signals that are very similar to signals where the deployment of the second stage of the airbag is required. A possible solution to increase the robustness of the classification system is to mark the deer-crashes as outliers and


to not take them into consideration when computing the nearest neighbors. When upscaling the raw data with αscale = 1.15 no crashes with γimp = 2 or γimp = 3 are underestimated, but the rate of overestimated crashes grows slightly, as can be seen in Table 7.11. In order to reduce the number of scenarios for the 1NN algorithm, and thereby the spatial and computational complexity, Alg. 7.2 is applied. Since the ADTW computes dissimilarities, a change is necessary in Line 6: scenarios are considered similar if their ADTW dissimilarity lies below a threshold, which is set to 0.3 for car type I. The value of $\ell_{\text{sim}}$ is set to 3 so that the two deer-crashes are not considered for the construction of prototypes. With these settings one obtains the results from Tables 7.12, 7.13, and 7.14. Hereby, the average number of prototypes over the 63 folds during the jackknife procedure is $\bar{H} = 8.7$, i. e., the number of prototypes is reduced by 86%. The performance is almost the same as the one from Tables 7.9, 7.10, and 7.11, indicating that a reduction to prototypes is recommendable. A downscaling of the raw data with αscale = 0.85 leads to longer intervals $[n_0, \ldots, n_{c,q}]$, and since the ADTW dissimilarity measure is sensitive to the duration of scenarios, the number of underestimated crashes grows. This is the reason for the results presented in the middle and bottom right figures of Table 7.13. Analogously to car type I, the performance of the classification system is validated for car type II. Table 7.15 presents the results for αscale = 1.00, Table 7.16 for αscale = 0.85, and Table 7.17 for αscale = 1.15, assuming that all scenarios from the training set are used as prototypes. For γimp = 1 there are five crashes where safety systems are deployed although this is not wanted, one crash where no safety system is deployed although this is required, and six other crashes where the deployed safety systems were not adequate for the crash severity.
For γimp = 2 there are four crashes where only the first stage of the airbag is inflated, although the second stage should also be fired. The reason for these results is that the dataset for car type II requires ca. 4 change detectors, and according to Alg. 7.3 a decision is only taken either if the nearest neighbor to the query scenario belongs to the triggering change detector or, otherwise, by the chronologically last labeling classifier. The results from Tables 7.16 and 7.17 confirm the expectation that a downscaling of the signals shifts the deployment times to a later and an upscaling to an earlier time instance. Tables 7.18, 7.19, and 7.20 present the classification performance for αscale = 1.00, αscale = 0.85, and αscale = 1.15 when the number of prototypes is reduced according to Alg. 7.2 with the ADTW dissimilarity threshold set to 0.2 and $\ell_{\text{sim}}$ set to 2. When computing the jackknife estimates, these values lead on average to a number of $\bar{H} = 13.7$ prototypes per change detector, so that a reduction of 84% is realized. Hereby, the performance is only slightly worse than in Table 7.15, where H = 86 prototypes are used. Again, downscaling the raw signals of the query scenarios shifts the deployment times to later and upscaling to earlier time instances. It is conspicuous that the number of crashes with γimp = 1 where safety systems are deployed although this is not wanted varies other than expected when increasing the scaling factor αscale from 1.00 to 1.15. This is a result of the high variance of the 1NN classifier in the bias-variance tradeoff: in regions of the input space where no dominant posterior probability exists, the rate of false classifications is high. On the other hand, it should be noted that by a suitable construction of prototypes such regions can be kept small. Therefore, the presented classification method based on the generation of prototypes and the 1NN algorithm represents an interesting implementation of labeling classifiers.
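The decision rule of Alg. 7.3, including the rejection of decisions by all but the chronologically last labeling classifier, can be sketched as follows. This is a hedged illustration only: the names `Prototype`, `label_online`, and `toy_dissimilarity` are hypothetical, the prototypes are assumed to carry the index of the change detector that is responsible for their class change, and the ADTW dissimilarity is replaced by a simple sum of absolute differences on equally long sequences.

```python
from dataclasses import dataclass


@dataclass
class Prototype:
    label: str     # class label of the prototype, e.g. "c3" or "c4"
    detector: int  # change detector that recognizes its class-change time
    series: list   # stored feature sequence (hypothetical flat representation)


def toy_dissimilarity(a, b):
    # Stand-in for the ADTW dissimilarity, NOT the measure from the thesis:
    # sum of absolute differences of two equally long sequences.
    return sum(abs(u - v) for u, v in zip(a, b))


def label_online(triggers, prototypes, dissimilarity=toy_dissimilarity,
                 reject_label="c1"):
    """Apply the decision rule of Alg. 7.3.

    triggers   -- list of (n_c, q, segment): trigger time, detector index,
                  and the query segment from n_0 up to n_c
    prototypes -- dict mapping detector index q to its list of prototypes
    Returns the list of (n_c, label) outputs in chronological order.
    """
    ordered = sorted(triggers, key=lambda t: t[0])
    outputs = []
    for pos, (n_c, q, segment) in enumerate(ordered):
        nearest = min(prototypes[q],
                      key=lambda p: dissimilarity(segment, p.series))
        if nearest.detector == q:
            # Lines 3-4: the triggering detector is responsible, decide and exit.
            outputs.append((n_c, nearest.label))
            break
        if pos == len(ordered) - 1:
            # Lines 5-6: chronologically last detector, decide anyway.
            outputs.append((n_c, nearest.label))
        else:
            # Lines 7-8: reject the decision for this time instance.
            outputs.append((n_c, reject_label))
    return outputs


protos = {
    1: [Prototype("c3", 1, [0.0, 0.0]), Prototype("c4", 2, [1.0, 1.1])],
    2: [Prototype("c4", 2, [1.0, 1.0, 2.1]), Prototype("c3", 1, [0.0, 0.0, 0.0])],
}
decisions = label_online([(10, 1, [1.0, 1.0]), (14, 2, [1.0, 1.0, 2.0])], protos)
```

In the toy call the first detector finds a nearest prototype whose class change belongs to the second detector, so its decision is rejected and the label is only set when the second detector triggers. This is the mechanism behind the delayed deployments discussed above.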


            overestimated   correct   underestimated
γimp = 1    15.4%           76.9%     7.7%
γimp = 2    0.0%            100.0%    0.0%
γimp = 3    0.0%            100.0%    0.0%

Table 7.9: Car type I, αscale = 1.00, no reduction to prototypes (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    15.4%           73.1%     11.5%
γimp = 2    0.0%            100.0%    0.0%
γimp = 3    0.0%            95.0%     5.0%

Table 7.10: Car type I, αscale = 0.85, no reduction to prototypes (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    26.9%           69.2%     3.9%
γimp = 2    0.0%            100.0%    0.0%
γimp = 3    0.0%            100.0%    0.0%

Table 7.11: Car type I, αscale = 1.15, no reduction to prototypes (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    15.4%           76.9%     7.7%
γimp = 2    0.0%            100.0%    0.0%
γimp = 3    0.0%            100.0%    0.0%

Table 7.12: Car type I, αscale = 1.00, reduction to prototypes, $\bar{H} = 8.7$ (panel titles only; the histograms are not reproduced)



            overestimated   correct   underestimated
γimp = 1    11.5%           80.8%     7.7%
γimp = 2    0.0%            88.2%     11.8%
γimp = 3    0.0%            100.0%    0.0%

Table 7.13: Car type I, αscale = 0.85, reduction to prototypes, $\bar{H} = 8.7$ (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    15.4%           76.9%     7.7%
γimp = 2    0.0%            100.0%    0.0%
γimp = 3    0.0%            95.0%     5.0%

Table 7.14: Car type I, αscale = 1.15, reduction to prototypes, $\bar{H} = 8.7$ (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    23.7%           68.4%     7.9%
γimp = 2    0.0%            78.9%     21.1%
γimp = 3    0.0%            100.0%    0.0%

Table 7.15: Car type II, αscale = 1.00, no reduction to prototypes (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    5.3%            81.6%     13.1%
γimp = 2    0.0%            78.9%     21.1%
γimp = 3    0.0%            93.3%     6.7%

Table 7.16: Car type II, αscale = 0.85, no reduction to prototypes (panel titles only; the histograms are not reproduced)



            overestimated   correct   underestimated
γimp = 1    13.2%           76.3%     10.5%
γimp = 2    0.0%            73.7%     26.3%
γimp = 3    0.0%            100.0%    0.0%

Table 7.17: Car type II, αscale = 1.15, no reduction to prototypes (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    26.3%           68.4%     5.3%
γimp = 2    0.0%            73.7%     26.3%
γimp = 3    0.0%            96.7%     3.3%

Table 7.18: Car type II, αscale = 1.00, reduction to prototypes, $\bar{H} = 13.7$ (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    15.8%           73.7%     10.5%
γimp = 2    0.0%            78.9%     21.1%
γimp = 3    0.0%            96.7%     3.3%

Table 7.19: Car type II, αscale = 0.85, reduction to prototypes, $\bar{H} = 13.7$ (panel titles only; the histograms are not reproduced)

            overestimated   correct   underestimated
γimp = 1    18.4%           76.3%     5.3%
γimp = 2    0.0%            73.7%     26.3%
γimp = 3    0.0%            100.0%    0.0%

Table 7.20: Car type II, αscale = 1.15, reduction to prototypes, $\bar{H} = 13.7$ (panel titles only; the histograms are not reproduced)

Conclusion

In this chapter a segment and label approach has been introduced as a possible way to tackle the on-line time series classification problem. The first part of the chapter has focused on the segmentation task, showing how the number of modules in the classification system can be reduced and how the segmentation can be implemented by linear classifiers for the car crash application. The second part has presented two possibilities to implement the labeling classifiers. The first option is to use the GRBF algorithm from Chapter 3 and the second to apply the 1NN algorithm in combination with the ADTW dissimilarity measure. Both options have been evaluated using the available car crash datasets.


8. Conclusions

The thesis deals with the topic of on-line time series classification using machine learning techniques considering as main application a car crash classification task. In the first part of the work it is shown how classification problems can be solved using machine learning methods. Besides presenting state-of-the-art algorithms, the bias-variance framework, possibilities to estimate the classification performance and tools for feature selection in Chapter 2, the Generalized Radial Basis Function (GRBF) classifier is introduced as an approach to achieve both a low prediction risk and interpretability in Chapter 3. In the second part of the thesis the focus turns to the classification of temporal data. Chapter 4 provides access to this topic by introducing concepts which are important when dealing with time series. In Chapter 5 the car crash classification problem is introduced. It is the main application in this work belonging to the group of on-line time series classification tasks. Then, in Chapters 6 and 7 three approaches to deal with this application are developed. Firstly, the Scenario Based Random Forest (SBRF) algorithm is introduced in Chapter 6 as an extension of the Random Forest (RF) algorithm in order to handle the on-line time series classification task. Secondly, in the segment and label framework that is introduced in Chapter 7 it is shown how the GRBF classifier can be applied to take decisions about the crash severity and the safety systems that must be activated. Thirdly, in Chapter 7 it is also presented how the Augmented Dynamic Time Warping (ADTW) similarity that is introduced in Section 4.4 can be employed in the car crash application. 
In order to evaluate the suitability of each technique for the usage in practice the following important properties of the three approaches are discussed in the next section based on the car crash application: classification performance including the robustness to variations in the sensor signals, interpretability, and computational load.

8.1 Comparison of the Presented Methods for Car Crash Classification

The classification performance for the car crash application has been evaluated as described in Chapter 5 using Leave-One-Out estimates and has been presented in the form of histograms. For car type I the results are shown in Tables 6.14, 6.15, and 6.16 for the SBRF classifier, in Tables 7.1, 7.2, and 7.3 for the GRBF classifier, and in Tables 7.12, 7.13, and 7.14 for the classifier using the ADTW similarity measure. Comparing the histograms, one notices that for αscale = 1.00, i. e., when the raw signals are not scaled up or down, all three methods perform very well, with the SBRF classifier tending to overestimate the crash severity. A downscaling of the raw data by 15% affects the SBRF strongly because for some crashes belonging to the groups with γimp = 2 and γimp = 3 no safety systems are deployed. The classifier using the ADTW similarity is only slightly affected by the downscaling, whereas the GRBF performance remains uninfluenced. An upscaling by 15% leads to approximately the same decisions as in the unscaled case for the SBRF and ADTW techniques, whereas the GRBF classifier tends to overestimate the crashes belonging to the group with γimp = 1. Taking into account the robustness to variations in the raw signals and the overall performance for car type I, the GRBF classifier should be favored. The results for car type II are


shown in Tables 6.17, 6.18, and 6.19 for the SBRF classifier, in Tables 7.5, 7.6, and 7.7 for the GRBF classifier, and in Tables 7.18, 7.19, and 7.20 for the classifier using the ADTW similarity measure. As already mentioned in the previous chapters, the classification task for this car type is harder than for car type I, leading to a larger percentage of erroneous decisions for all three methods. With αscale = 1.00 the SBRF and the technique using the ADTW similarity lead to misclassifications for crashes with high severity, which does not happen with the GRBF method. The up- and downscaling of the raw data by 15% also reflects the superiority of the GRBF classifier. Comparing the ADTW with the SBRF labeling procedure, it can be seen that the former achieves a higher robustness to variations in the sensor signals. Summing up the results with respect to the classification performance, it can be concluded that among the presented techniques the classification method using GRBF is best suited for the car crash application, followed by the ADTW and finally the SBRF method. Considering the interpretability of the three classification systems, the SBRF fares worse than GRBF or ADTW due to the majority vote among the base classifiers. As described in Chapter 3, the usage of constraints in the design of GRBF classifiers leads to fuzzy-like interpretability. A similar interpretability is also obtained by the ADTW technique, since the decisions are based on the similarities to class-specific prototypes. Thus, from this point of view the classification systems using GRBF or the ADTW similarity should be favored. A comparison of the computational load among the three approaches is an ambiguous task since it depends on the hardware that is used to implement the procedures. Factors like memory efficiency, means of parallel computing, or the way numerical values are represented make the task difficult and will not be taken into account.
Here, to get a rough approximation of the running time complexity, the O-notation is used for the number of operations in an algorithm, and in order to approximate the spatial complexity it is assumed that each numerical value requires one byte. Only the computational load of applying a classification system, and not of designing it, is considered because no further changes occur once a classifier is implemented in a car. The resources that are necessary in the feature generation step are almost the same for all three procedures, since the computational load that is required for each feature x[n] can be approximated well by considering two lowpass filtering steps. Therefore, the resources that are required by the classification algorithms must be compared. A SBRF algorithm consists of B classification trees and the number of levels in each tree is $O(\log(M_{\text{SBRF}}))$, with $M_{\text{SBRF}}$ denoting the number of training pairs (x[n], y[n]) that are used by one tree. Taking into account that there are M available scenarios and assuming an average scenario length of $\ell_{\text{scen}}$, the complexity of applying a SBRF algorithm is given by the number of levels per tree times the number of trees, i. e., $O(B \log(M \ell_{\text{scen}}))$. The spatial resources that are required result from storing the split value per node and the feature index for the split. A tree with $\log(M_{\text{SBRF}})$ levels has $\sum_{i=0}^{\log(M_{\text{SBRF}})-1} 2^i = \frac{1 - 2^{\log(M_{\text{SBRF}})}}{1 - 2} = 2^{\log(M_{\text{SBRF}})} - 1$ nodes. Thus, in a SBRF with B trees the number of nodes can be approximated by $B M \ell_{\text{scen}}$, leading to a requirement of $B M \ell_{\text{scen}}$ bytes for the classifier. In the GRBF classification algorithm the number of operations is dominated by the matrix multiplication that must be computed for each basis function to implement the similarities $\psi_h(\cdot, \cdot)$. For the H basis functions in a GRBF classifier $O(HN^2)$ operations must be performed, where N denotes the dimensionality of the input.
The multiplication of the H similarities with the weights requires O(HK) operations, where K is the number of classes,


and the confidence mapping only O(K) operations. Assuming that all Q labeling classifiers could be triggered at the same time instance, the computation needs $O(QHN^2)$ operations. The spatial resources that are required result from storing for each GRBF classifier the parameter $\gamma \in \mathbb{R}$, the H matrices $C_h^{-1} \in \mathbb{R}^{N \times N}$, the H prototypes $x_{K,h} \in \mathbb{R}^N$, the weight matrix $W \in \mathbb{R}^{K \times H}$, and for each of the K classes the 6 parameters that are needed for the confidence mapping. For the classification system with Q labeling classifiers this leads to $Q(H(N^2 + N + K) + 6K)$ bytes. Additionally, the backup classifier needs $2KJ_{\text{quant}}$ bytes for storing the thresholds of the expert system, with $J_{\text{quant}}$ denoting the number of intervals into which the abscissa is divided. Since K and $J_{\text{quant}}$ are small, the spatial complexity can be approximated by $QH(N^2 + N + K)$ bytes. The classification approach using the ADTW similarity measure makes high demands on both the computational performance and the storage capacity of the required hardware. To determine the distance matrix $D_{\text{DTW}}$ between two univariate time series, $O(\ell_{\text{scen}}^2)$ operations are needed, assuming that both time series have the length $\ell_{\text{scen}}$. If there are L univariate time series in the two scenarios which should be compared, then $O(\ell_{\text{scen}}^2 (L+1))$ operations must be performed. Searching for the warping path needs $O(\ell_{\text{scen}}^2)$ additional operations. The spatial requirement is determined by the number $\bar{H}$ and the average length $\ell_{\text{scen}}$ of the prototypes that must be stored, leading to a number of approximately $\bar{H}(L+1)\ell_{\text{scen}}$ required bytes per labeling classifier. Assuming that all Q labeling classifiers can be triggered at the same time instance, both the number of operations and the number of bytes must be multiplied by Q. An overview of the approximated resources for the three approaches is given in Table 8.1. To compare the three possible ways with respect to their computational demands in the

        number of operations                        storage requirement (bytes)
SBRF    $O(B \log(M \ell_{\text{scen}}))$           $B M \ell_{\text{scen}}$
GRBF    $O(QHN^2)$                                  $QH(N^2 + N + K)$
ADTW    $O(Q \bar{H} \ell_{\text{scen}}^2 L)$       $Q \bar{H} (L + 1) \ell_{\text{scen}}$

Table 8.1: Approximation of the resources that are required by the presented approaches

car crash classification task the following values, which are representative for the application, can be used: B = 100, M = 70, $\ell_{\text{scen}}$ = 30, H = 7, N = 3, K = 4, L = 3, $\bar{H}$ = 12, and Q = 3. Although the required resources are only approximated roughly, these values make clear that the GRBF approach should be favored since both the number of operations and the needed storage are much smaller than those of the other two classification methods. With respect to the required number of operations the SBRF algorithm should be preferred to the ADTW technique, but with respect to the storage demands the ADTW is more economical. It should be noted that the SBRF classifier has to take a decision at each time stamp, whereas the GRBF and the ADTW algorithms are invoked only when they are triggered by the segmentation classifiers, whose computational demands are low since they are implemented by linear change detectors. Nevertheless, because the peak performance is important for practical considerations, the above approximations hold. The considerations from above show that among the presented techniques the classification system using GRBF is best suited for the car crash application. This is a consequence of the segment and label approach, where more application-specific knowledge is integrated in the system than in the SBRF classifier. Taking into account that the SBRF is an off-the-shelf technique for on-line time series classification, its performance in the car crash application is


remarkable. Moreover, the usefulness of the SBRF classifier in order to identify important features must be underlined, making the SBRF algorithm a suitable tool for on-line time series classification tasks. The technique using the ADTW similarity measure is also based on the segment and label approach but its performance is worse than that of the GRBF classifier. One of the reasons for this result is the high variance of the One-Nearest-Neighbor (1NN) classifier that is applied with the ADTW similarity, whereas the GRBF method uses a more sophisticated procedure to combine the similarities to prototypes. Nevertheless, classifying with the ADTW similarity measure is attractive due to the good interpretability of the procedure. Table 8.2 summarizes the considerations from above, the symbol “+” denoting that in the considered field the classification method provides advantages, whereas the symbol “−” stands for disadvantages.

                              SBRF   GRBF   ADTW and 1NN
performance and robustness    +      ++     +
interpretability              −      +      +
computational load            −      +      −

Table 8.2: Comparison of the methods used for car crash classification
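The approximations from Table 8.1 can be evaluated for the representative values quoted above with a few lines of code. The following sketch is only meant to make the orders of magnitude tangible: the function name `resource_estimates` is hypothetical, the logarithm is assumed to be taken to base 2 (binary trees), and the O-expressions are evaluated as if all constant factors were one.

```python
import math


def resource_estimates(B, M, l_scen, H, N, K, L, H_bar, Q):
    """Evaluate the entries of Table 8.1 as (operations, bytes) per method.

    Assumptions: base-2 logarithm, constant factors of the O-notation
    set to one, one byte per stored numerical value.
    """
    return {
        "SBRF": (B * math.log2(M * l_scen), B * M * l_scen),
        "GRBF": (Q * H * N**2, Q * H * (N**2 + N + K)),
        "ADTW": (Q * H_bar * l_scen**2 * L, Q * H_bar * (L + 1) * l_scen),
    }


# Representative values for the car crash application as given in the text.
estimates = resource_estimates(B=100, M=70, l_scen=30, H=7, N=3, K=4,
                               L=3, H_bar=12, Q=3)
for method, (ops, storage) in estimates.items():
    print(f"{method}: about {ops:.0f} operations, {storage} bytes")
```

Under these assumptions the GRBF system needs the fewest operations and the least storage, the SBRF needs fewer operations than the ADTW technique, and the ADTW technique needs less storage than the SBRF, which is consistent with the row "computational load" in Table 8.2.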

8.2 Contributions and Future Work

The main contributions of this work include the development of GRBF classifiers that are constructed using the RF kernel, the extension of the RF algorithm in order to deal with temporal data leading to the SBRF classifier, the usage of the SBRF algorithm for feature selection, the introduction of a segment and label approach to tackle the on-line time series classification task, the extension of the dynamic time warping similarity in order to penalize differences in the length of time series, and the application of all these techniques to the car crash classification task. An interesting field for future work that has already attracted the attention of some researchers is a better theoretical understanding of ensemble techniques. In this work an attempt to interpret the good performance of ensemble techniques in classification tasks based on the bias-variance framework has been presented. Hereby, the importance of reducing the variance has been described using a two-class problem. Although a multi-class problem can be decomposed into a sequence of two-class problems, a deep understanding of the bias-variance framework for multi-class classification is still missing and constitutes an attractive topic in statistical learning for future research. Moreover, the connection between various ensemble techniques, e. g., the AdaBoost and the RF algorithm, represents an open field. The classification of temporal data using machine learning is a demanding problem, and many aspects concerning both the theoretical background and practical applications require attention in future work. Especially the task of on-line time series classification, i. e., the detection and categorization of class changes in real-time using machine learning techniques, is poorly explored. This thesis has presented some approaches to address the problem, attempting to formulate the methods in such a way that they are applicable to a large spectrum of problems.
However, the approaches have only been evaluated for the car crash classification task. Some adaptation of the techniques to the considered application occurred,


e. g., in the feature generation stage. Thus, it would be interesting to apply the presented approaches to other on-line time series classification problems, but unfortunately no datasets for this type of classification task are available. The construction of a freely available repository containing data from various applications that require on-line time series classification would advance research in the field.

One of the advantages often mentioned in connection with machine learning is the possibility of solving classification or regression problems automatically from a set of representative examples. This paradigm is rarely fully adopted in practice, and even for standard classification problems hand-crafted preprocessing of the data is often performed. When dealing with time series, completely automatic learning is a highly challenging task. As mentioned in connection with the no free lunch theorem in Chapter 2, a classification system performs better in an application than another because its intrinsic assumptions fit the problem better. Therefore, an automatic learning approach must be able to explore various time series representations and various algorithms in order to cover a large spectrum of problems that occur in practical applications. Since this requirement is highly time-consuming and resource-intensive, an approach that facilitates the use of machine learning techniques for temporal data is the integration of domain knowledge into the system. This strategy is applied in most machine learning applications in practice, and its usefulness is also underlined by Breiman, who states that “The cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem”. However, for applications where no domain knowledge is available, it is useful to have a set of off-the-shelf classifiers that can be employed. One element of this set is the SBRF algorithm, and future work on the topic should also consider the development of other off-the-shelf classifiers for time series.

An issue that plays an important role, especially in safety-critical applications, is the interpretability of classification systems. The approach presented in this thesis to assure interpretability is classification based on similarities to class-specific prototypes. The development of other machine learning algorithms for temporal data that allow interpretability while assuring a low generalization error can foster the use of machine learning in industry, economy and other fields.


Appendices


A. Appendix on Machine Learning Procedures

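A.1 Bias-Variance Decomposition for Regression

The bias-variance decomposition for the loss function $L(y, f(x)) = (y - f(x, D))^2$ is presented in the following [GBD92, Bis95]. The generalization error can be decomposed as

\begin{align}
\mathrm{E}_{D}\{\mathrm{E}_{x,y|D}\{(y - f(x,D))^2\}\}
&= \mathrm{E}_{D}\{\mathrm{E}_{x|D}\{\mathrm{E}_{y|x}\{y^2\} - 2\,\mathrm{E}_{y|x}\{y\}\, f(x,D) + f(x,D)^2\}\} \notag\\
&= \mathrm{E}_{D}\{\mathrm{E}_{x|D}\{f(x,D)^2 - 2\,\mathrm{E}_{y|x}\{y\}\, f(x,D) + \mathrm{E}_{y|x}\{y\}^2 + \mathrm{E}_{y|x}\{y^2\} - \mathrm{E}_{y|x}\{y\}^2\}\} \notag\\
&= \mathrm{E}_{D}\{\mathrm{E}_{x|D}\{(f(x,D) - \mathrm{E}_{y|x}\{y\})^2\}\} + \underbrace{\mathrm{E}_{x}\{\mathrm{E}_{y|x}\{y^2\} - \mathrm{E}_{y|x}\{y\}^2\}}_{e_i} \tag{A.1}\\
&= e_i + \mathrm{E}_{D}\{\mathrm{E}_{x|D}\{(f(x,D) - \mathrm{E}_{D|x}\{f(x,D)\} + \mathrm{E}_{D|x}\{f(x,D)\} - \mathrm{E}_{y|x}\{y\})^2\}\} \notag\\
&= e_i + \mathrm{E}_{x}\{\mathrm{E}_{D|x}\{(f(x,D) - \mathrm{E}_{D|x}\{f(x,D)\})^2\}\} + \mathrm{E}_{x}\{\mathrm{E}_{D|x}\{(\mathrm{E}_{D|x}\{f(x,D)\} - \mathrm{E}_{y|x}\{y\})^2\}\} \notag\\
&\quad + 2\,\mathrm{E}_{x}\{\underbrace{\mathrm{E}_{D|x}\{f(x,D) - \mathrm{E}_{D|x}\{f(x,D)\}\}}_{0} \cdot (\mathrm{E}_{D|x}\{f(x,D)\} - \mathrm{E}_{y|x}\{y\})\} \notag\\
&= e_i + \mathrm{E}_{x}\{\underbrace{\mathrm{E}_{D|x}\{(f(x,D) - \mathrm{E}_{D|x}\{f(x,D)\})^2\}}_{\operatorname{var}_{\hat{y}|x}\{\hat{y}\}}\} + \mathrm{E}_{x}\{\underbrace{(\mathrm{E}_{D|x}\{f(x,D)\} - \mathrm{E}_{y|x}\{y\})^2}_{(\operatorname{bias}_{\hat{y}|x}\{\hat{y}\})^2}\} \notag\\
&= e_i + e_v + e_b. \tag{A.2}
\end{align}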

From (A.1) it results that the minimization of the generalization error is achieved by choosing f (x, D) = Ey|x {y} = fB (x), i. e., the Bayes regression function is the conditional mean estimator. This leads to (2.16).
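The decomposition in (A.2) can be checked numerically on a small discrete case. The following is a minimal sketch, not part of the original derivation: it assumes a single fixed input x, a target y that takes two hypothetical values with equal probability, and a predictor f(x, D) that takes two hypothetical values with equal probability over the training sets, independently of y.

```python
from itertools import product

# Hypothetical toy setting at one fixed input x (all numbers are assumptions):
# y takes each listed value with equal probability, and so does f(x, D) over D.
y_values = [0.0, 2.0]
f_values = [0.5, 2.5]


def mean(values):
    return sum(values) / len(values)


# Left-hand side: E_D E_{y|x} (y - f(x, D))^2 with y and D independent.
total_error = mean([(y - f) ** 2 for y, f in product(y_values, f_values)])

bayes = mean(y_values)       # E_{y|x}{y}, the Bayes regression function
avg_model = mean(f_values)   # E_{D|x}{f(x, D)}

e_i = mean([(y - bayes) ** 2 for y in y_values])       # irreducible part
e_v = mean([(f - avg_model) ** 2 for f in f_values])   # variance part
e_b = (avg_model - bayes) ** 2                         # squared bias part

print(total_error, e_i + e_v + e_b)
```

Both printed numbers agree, because the cross term vanishes exactly as in the derivation.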

A.2 Example for the Bias-Variance Decomposition for Classification

In classification tasks an increasing variance can lead to a decreasing prediction risk while the bias remains unchanged. This is shown in the following. Assume that the problem under consideration is a four-class problem with the class probabilities for an input x given in Table A.1. Two machine learning algorithms are considered, leading to $f^{(1)}(x, D)$ and $f^{(2)}(x, D)$ for a given input x and a training set D. The probabilities

\begin{align}
p_D(f^{(1)}(x,D) = c_k) &= \mathrm{E}_{D|x}\{\delta(c_k, f^{(1)}(x,D))\} \quad\text{and} \tag{A.3}\\
p_D(f^{(2)}(x,D) = c_k) &= \mathrm{E}_{D|x}\{\delta(c_k, f^{(2)}(x,D))\} \tag{A.4}
\end{align}

for the four classes are given in Tables A.2 and A.3.

p(y = c1|x)    p(y = c2|x)    p(y = c3|x)    p(y = c4|x)
0.15           0.60           0.25           0.10

Table A.1: Probabilities p(y = ck|x) for the problem under consideration

pD(f(1)(x, D) = c1)    pD(f(1)(x, D) = c2)    pD(f(1)(x, D) = c3)    pD(f(1)(x, D) = c4)
0.50                   0.20                   0.20                   0.10

Table A.2: Probabilities pD(f(1)(x, D) = ck) for the first classifier

pD(f(2)(x, D) = c1)    pD(f(2)(x, D) = c2)    pD(f(2)(x, D) = c3)    pD(f(2)(x, D) = c4)
0.10                   0.10                   0.10                   0.70

Table A.3: Probabilities pD(f(2)(x, D) = ck) for the second classifier

The Bayes classifier takes the decision $f_B(x) = c_2$ since $p(y = c_2|x)$ has the largest value. On the other hand, for the first machine learning algorithm one obtains $f_{\mathrm{maj}}^{(1)}(x) = c_1$ and for the second machine learning algorithm $f_{\mathrm{maj}}^{(2)}(x) = c_4$. The first bias-variance decomposition described by Eqs. (2.18) to (2.23) leads to the following results for this classification problem. Both classifiers have the same bias

\begin{align}
\operatorname{bias}_{\hat{y}|x}^{(1)}\{\hat{y}\} &= 1 - \delta(f_B(x), f_{\mathrm{maj}}^{(1)}(x)) = 1 \tag{A.5}\\
\operatorname{bias}_{\hat{y}|x}^{(2)}\{\hat{y}\} &= 1 - \delta(f_B(x), f_{\mathrm{maj}}^{(2)}(x)) = 1 \tag{A.6}
\end{align}

but the first classifier has a larger variance

\begin{align}
\operatorname{var}_{\hat{y}|x}^{(1)}\{\hat{y}\} &= 1 - \mathrm{E}_{D|x}\{\delta(f_{\mathrm{maj}}^{(1)}(x), f^{(1)}(x,D))\} = 1 - 0.5 = 0.5 \tag{A.7}\\
\operatorname{var}_{\hat{y}|x}^{(2)}\{\hat{y}\} &= 1 - \mathrm{E}_{D|x}\{\delta(f_{\mathrm{maj}}^{(2)}(x), f^{(2)}(x,D))\} = 1 - 0.7 = 0.3. \tag{A.8}
\end{align}

Although the first classifier has a larger variance and the same bias as the second, the first classifier has a smaller local mean average error

\begin{align}
\mathrm{E}_{y|x}\{\mathrm{E}_{D|x,y}\{1 - \delta(y, f^{(1)}(x,D))\}\} &= 1 - 0.5 \cdot 0.15 - 0.2 \cdot 0.6 - 0.2 \cdot 0.25 - 0.1 \cdot 0.1 = 0.745 \tag{A.9}\\
\mathrm{E}_{y|x}\{\mathrm{E}_{D|x,y}\{1 - \delta(y, f^{(2)}(x,D))\}\} &= 1 - 0.1 \cdot 0.15 - 0.1 \cdot 0.6 - 0.1 \cdot 0.25 - 0.7 \cdot 0.1 = 0.83. \tag{A.10}
\end{align}

The second bias-variance decomposition described by Eqs. (2.24) to (2.26) is slightly different. Here, one obtains

\begin{align}
\operatorname{bias}_{\hat{y}|x}^{(1)}\{\hat{y}\} &= (1 - p(y = f_{\mathrm{maj}}^{(1)}(x)|x)) - (1 - p(y = f_B(x)|x)) = (1 - 0.15) - (1 - 0.6) = 0.45 \tag{A.11}\\
\operatorname{bias}_{\hat{y}|x}^{(2)}\{\hat{y}\} &= (1 - p(y = f_{\mathrm{maj}}^{(2)}(x)|x)) - (1 - p(y = f_B(x)|x)) = (1 - 0.1) - (1 - 0.6) = 0.5 \tag{A.12}
\end{align}

and

\begin{align}
\operatorname{var}_{\hat{y}|x}^{(1)}\{\hat{y}\} &= \mathrm{E}_{D|x}\{\mathrm{E}_{y|x,D}\{L(y, f^{(1)}(x,D))\}\} - (1 - p(y = f_{\mathrm{maj}}^{(1)}(x)|x)) = 0.745 - (1 - 0.15) = -0.105 \tag{A.13}\\
\operatorname{var}_{\hat{y}|x}^{(2)}\{\hat{y}\} &= \mathrm{E}_{D|x}\{\mathrm{E}_{y|x,D}\{L(y, f^{(2)}(x,D))\}\} - (1 - p(y = f_{\mathrm{maj}}^{(2)}(x)|x)) = 0.83 - (1 - 0.1) = -0.07. \tag{A.14}
\end{align}

Again, the learning algorithm with the smaller absolute value of the variance (aggregation) has the larger local mean average error.
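The numbers of this example can be reproduced with a few lines of code. The sketch below only copies the probabilities from Tables A.1 to A.3; the function and variable names are illustrative choices and do not come from the thesis.

```python
# Values copied from Tables A.1 to A.3 (index k = 0..3 stands for c1..c4).
p_y = [0.15, 0.60, 0.25, 0.10]    # p(y = c_k | x)
p_f1 = [0.50, 0.20, 0.20, 0.10]   # p_D(f1(x, D) = c_k)
p_f2 = [0.10, 0.10, 0.10, 0.70]   # p_D(f2(x, D) = c_k)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def decompose(p_y, p_f):
    """Return the local error and both bias-variance decompositions."""
    bayes, maj = argmax(p_y), argmax(p_f)
    # local mean error: 1 - sum_k p(y = c_k | x) * p_D(f = c_k)
    error = 1.0 - sum(py * pf for py, pf in zip(p_y, p_f))
    first = {"bias": 1.0 - float(bayes == maj), "var": 1.0 - p_f[maj]}
    second = {"bias": (1.0 - p_y[maj]) - (1.0 - p_y[bayes]),
              "var": error - (1.0 - p_y[maj])}
    return error, first, second


err1, first1, second1 = decompose(p_y, p_f1)
err2, first2, second2 = decompose(p_y, p_f2)
print(err1, first1, second1)
print(err2, first2, second2)
```

Up to floating-point rounding, the printed values match (A.5) to (A.14).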

A.3 Bias-Variance Decomposition for Classification

In order to derive (2.29) one must take into account (2.22), i. e., $\mathrm{E}_{D|x}\{\delta(c_k, f(x,D))\} = p_D(f(x,D) = c_k)$. In a two-class problem, at input x one of the two classes is $f_B(x)$. Thus, using (2.17) one obtains:

\begin{align}
\mathrm{E}_{D|x}\{\mathrm{E}_{y|x,D}\{L(y, f(x,D))\}\}
&= 1 - p(y = f_B(x)|x)\, p_D(f(x,D) = f_B(x)) \notag\\
&\quad - (1 - p(y = f_B(x)|x))(1 - p_D(f(x,D) = f_B(x))) \notag\\
&= p(y = f_B(x)|x) + p_D(f(x,D) = f_B(x)) - 2\, p(y = f_B(x)|x)\, p_D(f(x,D) = f_B(x)) \notag\\
&= 1 - p(y = f_B(x)|x) + 2\, p(y = f_B(x)|x) - 1 \notag\\
&\quad - 2\, p(y = f_B(x)|x)\, p_D(f(x,D) = f_B(x)) + p_D(f(x,D) = f_B(x)) \notag\\
&= 1 - p(y = f_B(x)|x) + (1 - p_D(f(x,D) = f_B(x)))(2\, p(y = f_B(x)|x) - 1). \tag{A.15}
\end{align}
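The algebra in (A.15) can be verified numerically. In the sketch below, p stands for p(y = fB(x)|x) and q for pD(f(x, D) = fB(x)); these short names are introduced only for this illustration. The identity is purely algebraic, so the grid also covers values of p below 0.5 that cannot occur for the Bayes class.

```python
def expected_loss(p, q):
    """First line of (A.15): 1 - p*q - (1 - p)*(1 - q)."""
    return 1.0 - p * q - (1.0 - p) * (1.0 - q)


def decomposed_loss(p, q):
    """Last line of (A.15): (1 - p) + (1 - q)*(2p - 1)."""
    return (1.0 - p) + (1.0 - q) * (2.0 * p - 1.0)


grid = [i / 10 for i in range(11)]
max_gap = max(abs(expected_loss(p, q) - decomposed_loss(p, q))
              for p in grid for q in grid)
print(max_gap)
```

The first summand 1 - p is the Bayes error at x, and the second summand shrinks as the learner agrees with the Bayes decision more often.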

A.4 Computation of W and θ with p(y|x) Versus ȳ as Target Vector

The cost function in the optimization problem from (2.54) can be decomposed into

\begin{align}
\mathrm{E}_{x}\{\|\tilde{W}\varphi(x,\tilde{\vartheta}) - \boldsymbol{p}(y|x)\|^2\}
&= \mathrm{E}_{x,\bar{y}}\{\|(\tilde{W}\varphi(x,\tilde{\vartheta}) - \bar{y}) - (\boldsymbol{p}(y|x) - \bar{y})\|^2\} \notag\\
&= \mathrm{E}_{x,\bar{y}}\{\|\tilde{W}\varphi(x,\tilde{\vartheta}) - \bar{y}\|^2\} - 2\,\mathrm{E}_{x,\bar{y}}\{(\tilde{W}\varphi(x,\tilde{\vartheta}))^T(\boldsymbol{p}(y|x) - \bar{y})\} \notag\\
&\quad + 2\,\mathrm{E}_{x,\bar{y}}\{\bar{y}^T(\boldsymbol{p}(y|x) - \bar{y})\} + \mathrm{E}_{x,\bar{y}}\{\|\boldsymbol{p}(y|x) - \bar{y}\|^2\}. \tag{A.16}
\end{align}

The first term on the right-hand side is the cost function from the optimization problem in (2.56). The second term on the right-hand side in (A.16) is 0 since it can be decomposed into

\[
\mathrm{E}_{x,\bar{y}}\{(\tilde{W}\varphi(x,\tilde{\vartheta}))^T(\boldsymbol{p}(y|x) - \bar{y})\}
= \int_{\mathbb{R}^N} (\tilde{W}\varphi(x,\tilde{\vartheta}))^T \Big( \sum_{\bar{y}} (\boldsymbol{p}(y|x) - \bar{y})\, p(\bar{y}|x) \Big)\, p(\mathsf{x} = x)\, dx,
\]

where

\[
\sum_{\bar{y}} \boldsymbol{p}(y|x)\, p(\bar{y}|x) = \boldsymbol{p}(y|x) \sum_{\bar{y}} p(\bar{y}|x) = \boldsymbol{p}(y|x) \tag{A.17}
\]

and

\[
\sum_{\bar{y}} \bar{y}\, p(\bar{y}|x) =
\begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} p(y = c_1|x) +
\begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} p(y = c_2|x) + \ldots +
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} p(y = c_K|x)
= \boldsymbol{p}(y|x), \tag{A.18}
\]

so that

\[
\sum_{\bar{y}} (\boldsymbol{p}(y|x) - \bar{y})\, p(\bar{y}|x) = 0_K. \tag{A.19}
\]

The third and fourth term on the right-hand side in (A.16) cannot be influenced by the choice of W or θ. Therefore, the matrix W and the free parameters θ in the basis functions which minimize the cost function from (2.54) are the same as those that minimize the cost function from (2.56).
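The step that makes the second term in (A.16) vanish, i. e., (A.17) to (A.19), can be illustrated with a small numerical check. The posterior vector below is a hypothetical example with K = 3 classes, and the helper name is an assumption made for this sketch.

```python
posterior = [0.2, 0.5, 0.3]   # hypothetical p(y = c_k | x), sums to 1
K = len(posterior)


def unit_vector(k, size):
    """Unit vector representation y_bar of class c_k."""
    return [1.0 if i == k else 0.0 for i in range(size)]


# sum over all unit vectors y_bar of (p(y|x) - y_bar) * p(y_bar | x)
residual = [0.0] * K
for k in range(K):
    y_bar = unit_vector(k, K)
    for i in range(K):
        residual[i] += (posterior[i] - y_bar[i]) * posterior[k]

print(residual)
```

Every entry of the result is zero up to rounding, which is the statement of (A.19).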

A.5 Confidence Mapping

Given an input x, the value $u_k = [W\varphi(x,\vartheta)]_k$ represents the k-th entry in the vector that results after mapping x to u according to (2.53). In a first step all inputs $x_m$ from the training set D are mapped to the corresponding vectors $u_m$. Then, for each k = 1, . . . , K, a histogram is determined for the values $u_{m,k} = [W\varphi(x_m,\vartheta)]_k$, m = 1, . . . , M. For an input x that is mapped to u, the notation $\mathrm{hist}(u_k|c_k)$ is introduced. It represents the number of training inputs $x_m$ in D belonging to class $c_k$ whose k-th entries in $u_m$, denoted by $u_{m,k}$, lie in the same histogram bin as $u_k$. Likewise, $\mathrm{hist}(u_k|\bar{c}_k)$ denotes the number of training examples that belong to a class other than $c_k$ and whose k-th entries in $u_m$ lie in the same histogram bin as $u_k$. The number of bins can be determined by quantizing the values $u_m$ into intervals of fixed length, e. g., 0.05. In a second step all values $u_{m,k}$ are transformed into confidence values $\mathrm{cfd}_k(u_{m,k})$ by computing

\[
\mathrm{cfd}_k(u_{m,k}) = \frac{\mathrm{hist}(u_{m,k}|c_k)}{\mathrm{hist}(u_{m,k}|c_k) + \mathrm{hist}(u_{m,k}|\bar{c}_k)} \in [0, 1]. \tag{A.20}
\]

In order to obtain a smooth, monotonically increasing mapping¹ $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ from $\mathbb{R}$ to the interval [0, 1], where $\bar{\vartheta}_k$ stands for the free parameters in the mapping, the values $\mathrm{cfd}_k(u_{m,k})$ are used in the optimization task

\[
\bar{\vartheta}_k = \operatorname*{argmin}_{\bar{\vartheta}_k'} \left\{ \sum_{m=1}^{M} \big( \mathrm{cnf}_k(u_{m,k}, \bar{\vartheta}_k') - \mathrm{cfd}_k(u_{m,k}) \big)^2 \right\}. \tag{A.21}
\]

¹ To express that the mapping $\mathrm{cnf}_k(u_k)$ depends on the free parameters $\bar{\vartheta}_k$, it is denoted in the following as $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$.


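In [Sch96] $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ consists of three pieces, which are chosen in such a way that the first derivative of $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ is the same at the merging points: a lower sigmoidal part, a straight line part and an upper sigmoidal part. Fig. A.1 shows $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ and the 6 parameters $\bar{\vartheta}_k = [\bar{\vartheta}_k^{(1)}, \ldots, \bar{\vartheta}_k^{(6)}]^T$ which determine the exact mapping.

Figure A.1: Confidence mapping function $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ (vertical axis: $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$ with marks at $\bar{\vartheta}_k^{(2)}$, $\bar{\vartheta}_k^{(3)}$, $\bar{\vartheta}_k^{(4)}$, $\bar{\vartheta}_k^{(5)}$ and 1; horizontal axis: $u_k$ with marks at $\bar{\vartheta}_k^{(1)}$, $u_k^{(3)}$, $u_k^{(4)}$ and $\bar{\vartheta}_k^{(6)}$)

The mapping can be described analytically as

\[
\mathrm{cnf}_k(u_k, \bar{\vartheta}_k) =
\begin{cases}
\bar{\vartheta}_k^{(2)} + \dfrac{1}{1 + \exp(-\gamma_L (u_k - \mu_L))} & u_k < u_k^{(3)} \\[2ex]
\dfrac{1}{\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}}\, u_k - \dfrac{\bar{\vartheta}_k^{(1)}}{\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}} & u_k^{(3)} \le u_k \le u_k^{(4)} \\[2ex]
\dfrac{\bar{\vartheta}_k^{(5)}}{1 + \exp(-\gamma_U (u_k - \mu_U))} & u_k > u_k^{(4)}.
\end{cases} \tag{A.22}
\]

The lower sigmoidal part and the linear part have the point $(u_k^{(3)}, \bar{\vartheta}_k^{(3)})$ in common and the same derivative in this point. Likewise, the upper sigmoidal part and the linear part have the point $(u_k^{(4)}, \bar{\vartheta}_k^{(4)})$ in common and the same derivative in this point. Taking this into account, one obtains the following equations:

\begin{align}
u_k^{(3)} &= \big(\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}\big)\bar{\vartheta}_k^{(3)} + \bar{\vartheta}_k^{(1)} \tag{A.23}\\
\gamma_L &= \frac{1}{u_k^{(3)} - \mu_L} \ln\left( \frac{\bar{\vartheta}_k^{(3)} - \bar{\vartheta}_k^{(2)}}{1 + \bar{\vartheta}_k^{(2)} - \bar{\vartheta}_k^{(3)}} \right) \tag{A.24}\\
\mu_L &= u_k^{(3)} - \big(\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}\big)\big(1 + \bar{\vartheta}_k^{(2)} - \bar{\vartheta}_k^{(3)}\big)\big(\bar{\vartheta}_k^{(3)} - \bar{\vartheta}_k^{(2)}\big) \ln\left( \frac{\bar{\vartheta}_k^{(3)} - \bar{\vartheta}_k^{(2)}}{1 + \bar{\vartheta}_k^{(2)} - \bar{\vartheta}_k^{(3)}} \right) \tag{A.25}\\
u_k^{(4)} &= \big(\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}\big)\bar{\vartheta}_k^{(4)} + \bar{\vartheta}_k^{(1)} \tag{A.26}\\
\gamma_U &= \frac{1}{u_k^{(4)} - \mu_U} \ln\left( \frac{\bar{\vartheta}_k^{(4)}}{\bar{\vartheta}_k^{(5)} - \bar{\vartheta}_k^{(4)}} \right) \tag{A.27}\\
\mu_U &= u_k^{(4)} - \frac{\big(\bar{\vartheta}_k^{(6)} - \bar{\vartheta}_k^{(1)}\big)\big(\bar{\vartheta}_k^{(5)} - \bar{\vartheta}_k^{(4)}\big)\bar{\vartheta}_k^{(4)}}{\bar{\vartheta}_k^{(5)}} \ln\left( \frac{\bar{\vartheta}_k^{(4)}}{\bar{\vartheta}_k^{(5)} - \bar{\vartheta}_k^{(4)}} \right). \tag{A.28}
\end{align}

The expressions from Eqs. (A.23) to (A.28) can be inserted into (A.22) so that the dependence on $\gamma_L$, $\mu_L$, $\gamma_U$ and $\mu_U$ disappears in the confidence mapping $\mathrm{cnf}_k(u_k, \bar{\vartheta}_k)$. Thus, the derivative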


$\partial\, \mathrm{cnf}_k(u_k, \bar{\vartheta}_k) / \partial \bar{\vartheta}_k$ can be expressed analytically, and suitable values for $\bar{\vartheta}_k$ can be determined by applying the gradient descent method to the optimization task from (A.21).
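The three-piece mapping can be sketched in code to check that the pieces join continuously. The sketch below is an illustration under stated assumptions: the linear piece is taken as (u_k − ϑ̄_k^(1)) / (ϑ̄_k^(6) − ϑ̄_k^(1)), which is the form consistent with the merging points in (A.23) and (A.26); the six parameter values are hypothetical; and the function name is not from the thesis. Parameter choices for which a logarithm becomes zero are not handled.

```python
import math


def confidence_mapping(u, theta):
    """Piecewise mapping of (A.22); theta = (t1, ..., t6) are hypothetical values."""
    t1, t2, t3, t4, t5, t6 = theta
    width = t6 - t1
    u3 = width * t3 + t1                                # (A.23)
    u4 = width * t4 + t1                                # (A.26)
    s = t3 - t2
    log_l = math.log(s / (1.0 - s))
    mu_l = u3 - width * (1.0 - s) * s * log_l           # (A.25)
    gamma_l = log_l / (u3 - mu_l)                       # (A.24)
    log_u = math.log(t4 / (t5 - t4))
    mu_u = u4 - width * (t5 - t4) * t4 / t5 * log_u     # (A.28)
    gamma_u = log_u / (u4 - mu_u)                       # (A.27)
    if u < u3:
        # lower sigmoidal part
        return t2 + 1.0 / (1.0 + math.exp(-gamma_l * (u - mu_l)))
    if u <= u4:
        # straight line part
        return (u - t1) / width
    # upper sigmoidal part
    return t5 / (1.0 + math.exp(-gamma_u * (u - mu_u)))


theta = (-1.0, 0.05, 0.3, 0.7, 0.95, 2.0)   # assumed example parameters
samples = [confidence_mapping(i / 100, theta) for i in range(-500, 501)]
print(min(samples), max(samples))
```

With these parameters the merging points lie near u = -0.1 and u = 1.1, where the mapping takes the values 0.3 and 0.7, and the sampled values increase monotonically between the two asymptotes 0.05 and 0.95.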

A.6 Support Vector Machines

Whereas for RBF classifiers the points $x_{K,h}$ that are required for the kernels are not necessarily inputs $x_m$ from D, SVMs start the learning phase with one kernel $\kappa_{\alpha_m}$ per input $x_m$ and select a subset of these input vectors during training, the so-called support vectors. The number H of support vectors that are finally used for the mapping f is normally much smaller than the training set size M, although it is often relatively large and increases with M. In contrast to MLP or RBF classifiers, SVM classifiers do not provide estimates for the conditional probabilities $p(y = c_k | \mathsf{x} = x)$. They were initially designed for binary classification, but SVMs can also be applied to multi-class tasks by decomposing a K-class problem into a series of two-class problems with the one-against-all technique [BGV92, Bur98]. A key to understanding SVMs is the fact that kernels can be decomposed into the inner product

\[
\kappa(x_\ell, x_q) = \langle \tilde{\Phi}(x_\ell), \tilde{\Phi}(x_q) \rangle, \tag{A.29}
\]

where $\tilde{\Phi}$ denotes the mapping from $\mathbb{R}^N$ into an inner product space $(\mathcal{F}, \langle \cdot, \cdot \rangle)$ which usually is of higher dimensionality than N. This formulation of kernels as inner products allows the use of many well-known algorithms by applying the kernel trick, i. e., if an algorithm can be formulated in such a way that a query input x appears only in the form of scalar products, then these scalar products can be replaced by kernels according to (A.29) [STC04]. The mapping $\tilde{\Phi}$ need not be evaluated or known explicitly, although the solutions are computed in the space that is defined by $\tilde{\Phi}$. SVMs result from applying the kernel trick to the linear maximum margin classifier.

The linear maximum margin classifier is designed for linearly separable two-class problems, i. e., $y \in \{-1, 1\}$, and seeks to maximize the margin, which is defined as the nearest distance from the training inputs $x_m$ to the hyperplane separating the two classes. The distance from an input vector x to the hyperplane defined by $\tilde{w} = [w_0, w^T]^T$ is $|w^T x + w_0| / \|w\|$ [DHS01]. Thus, the goal is to find the weight vector $\tilde{w}$ that maximizes $\beta_{MM} > 0$ in

\[
\frac{y_m (w^T x_m + w_0)}{\|w\|} \ge \beta_{MM}, \quad m = 1, \ldots, M. \tag{A.30}
\]

Since for any w and $w_0$ satisfying the inequality in (A.30) any positively scaled multiple satisfies it too, one can arbitrarily set $\|w\| = 1/\beta_{MM}$. Therefore, the optimization problem that has to be solved in order to determine $\tilde{w}$ is

\[
\tilde{w} = \operatorname*{argmin}_{w, w_0} \left\{ \frac{1}{2}\|w\|^2 \right\} \quad \text{s. t.} \quad y_m (w^T x_m + w_0) \ge 1, \quad m = 1, \ldots, M. \tag{A.31}
\]

This is a convex optimization problem having a quadratic criterion with linear inequality conditions. Thus, it can be formulated and solved as an unconstrained optimization problem by the method of Lagrange undetermined multipliers [HTF01].²

² An iterative method to solve the optimization task from (A.31), the AdaTron algorithm, is presented in Appendix A.10.

The resulting vector w can


be written as

\[
w = \sum_{m=1}^{M} \lambda_{MM,m}\, x_m, \tag{A.32}
\]

where the coefficients $\lambda_{MM,m}$ represent the Lagrange multipliers. In general a large number of Lagrange multipliers are zero so that only a small number H < M of inputs $x_h$ from the training set, the support vectors, contribute to the solution. A query input x is then classified based on the sign of $w^T x + w_0$, which indicates whether x lies on the left or on the right side of the hyperplane defined by w and $w_0$. The input x appears only in the form of scalar products when computing the output of the maximum margin classifier

\[
f(x) = \operatorname{sign}\left( \sum_{h=1}^{H} \lambda_{MM,h}\, x_h^T x + w_0 \right). \tag{A.33}
\]

Therefore, the kernel trick can be applied to the maximum margin classifier, leading to the SVM solution

\[
f(x) = \operatorname{sign}\left( \sum_{h=1}^{H} \lambda_{SVM,h}\, \langle \tilde{\Phi}(x_h), \tilde{\Phi}(x) \rangle + w_0 \right) = \operatorname{sign}\left( \sum_{h=1}^{H} \lambda_{SVM,h}\, \kappa(x_h, x) + w_0 \right). \tag{A.34}
\]

Similarly to (A.31) the optimization task leading to the solution from (A.34) is

\[
\tilde{w} = \operatorname*{argmin}_{w, w_0} \left\{ \frac{1}{2}\|w\|^2 \right\} \quad \text{s. t.} \quad y_m (\langle w, \tilde{\Phi}(x_m) \rangle + w_0) \ge 1, \quad m = 1, \ldots, M. \tag{A.35}
\]

In order to find the solution to this minimization task it is not necessary to evaluate $\tilde{\Phi}(\cdot)$ at any time due to (A.29). A main benefit of SVM classifiers is that, although the training involves nonlinear optimization, the optimization criterion is convex, so that the difficulties that come along with local minima are avoided.

In general the mapping $\tilde{\Phi}$ transforms the inputs x into a space of much higher dimensionality. This is advantageous since the linear separability grows with increasing dimensionality, so that the linear maximum margin algorithm is a good choice for a classifier in the space defined by the mapping $\tilde{\Phi}$. It should be noted that the hyperplane resulting from the linear maximum margin algorithm corresponds to a nonlinear boundary in the input space $\mathbb{R}^N$. In contrast to many classification methods that attempt to minimize the empirical risk from (2.10), the SVM algorithm includes the correct classification of the training data as constraints in the optimization task. If the data is not linearly separable in the space that is defined by the mapping $\tilde{\Phi}$, it is necessary to relax the constraints in (A.35) by introducing slack variables that allow misclassifications [Bur98]. The effect of relaxing the constraints on the solution is that the Lagrange multipliers are bounded from above by a constant $C_{SVM}$. The value for $C_{SVM}$ must be fixed before the optimization starts, so that $C_{SVM}$ is normally determined with hold-out methods like cross-validation.

SVMs have a sound theoretical background and are state-of-the-art classifiers that perform well on many benchmark data sets. Nevertheless, there are also some drawbacks that come along with this technique. The main drawback is that it is not clear how to choose a suitable


kernel when no domain knowledge about the data under consideration is available. Moreover, SVMs were initially designed for binary classification and the extension to K classes is problematic. When the number M of training patterns is large, it becomes hard to compute the Lagrange multipliers, since the quadratic programming task that leads to the coefficients $\lambda_{SVM,h}$ must be performed with the M × M kernel matrix resulting from the training inputs $x_m$. Another drawback is the lack of interpretability, which may play an important role for some practical applications.
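The kernel identity (A.29) and the decision function (A.34) can be illustrated with a kernel whose feature map is known in closed form. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs as an assumed example; the support vectors, coefficients and offset are hypothetical and are not the result of a training run.

```python
import math


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x^T z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def feature_map(x):
    """Explicit map for 2-D inputs whose inner product equals poly_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def svm_decision(x, support_vectors, coeffs, w0, kernel):
    """Sign of sum_h lambda_h * kappa(x_h, x) + w0, as in (A.34)."""
    score = sum(c * kernel(xh, x) for xh, c in zip(support_vectors, coeffs)) + w0
    return 1 if score >= 0 else -1


x, z = [1.0, 2.0], [3.0, -1.0]
print(poly_kernel(x, z), inner(feature_map(x), feature_map(z)))

support_vectors = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical support vectors
coeffs = [1.0, -1.0]                         # hypothetical signed multipliers
print(svm_decision([2.0, 0.5], support_vectors, coeffs, 0.0, poly_kernel))
```

The kernel value and the inner product in the three-dimensional feature space coincide, while the decision function only ever calls the kernel and never the feature map.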

A.7 Minimum Local Risk in CART Classification

The Bayes decision in a partition assigns to x ∈ t the class $c_k$ that maximizes the posterior probability $p(y = c_k | x \in t)$. The learning problem that is considered in the following is to determine the function $T(x) = [T_1(x), \ldots, T_K(x)]^T$, with $T_k(x) \in [0, 1]$ and $\sum_{k=1}^{K} T_k(x) = 1$, that maps an input x to its posterior probabilities $[p(y = c_1 | \mathsf{x} = x), \ldots, p(y = c_K | \mathsf{x} = x)]^T$. The loss function for this learning problem is

\[
L(y, T(x)) = \|T(x) - \bar{y}\|^2 = \sum_{k=1}^{K} (\delta(y, c_k) - T_k(x))^2, \tag{A.36}
\]

leading to the risk

\[
R(T) = \int_{\mathbb{R}^N} \mathrm{E}_{y|x}\left\{ \sum_{k=1}^{K} (\delta(y, c_k) - T_k(x))^2 \right\} p(\mathsf{x} = x)\, dx, \tag{A.37}
\]

where $\bar{y} \in \{0, 1\}^K$ is the unit vector representation of the random variable y, as in (2.56). Since $\mathrm{E}_{y|x}\{\delta(y, c_k)\} = p(y = c_k | \mathsf{x} = x)$ and

\begin{align}
\mathrm{E}_{y|x}\{(\delta(y, c_k) - p(y = c_k | \mathsf{x} = x))^2\}
&= p(y = c_k | \mathsf{x} = x) - 2\, p(y = c_k | \mathsf{x} = x)^2 + p(y = c_k | \mathsf{x} = x)^2 \notag\\
&= p(y = c_k | \mathsf{x} = x)(1 - p(y = c_k | \mathsf{x} = x)), \tag{A.38}
\end{align}

the risk can be rewritten as

\begin{align}
R(T) &= \int_{\mathbb{R}^N} \sum_{k=1}^{K} \mathrm{E}_{y|x}\{(\delta(y, c_k) - p(y = c_k | \mathsf{x} = x) + p(y = c_k | \mathsf{x} = x) - T_k(x))^2\}\, p(\mathsf{x} = x)\, dx \notag\\
&= \int_{\mathbb{R}^N} \left( \sum_{k=1}^{K} \mathrm{E}_{y|x}\{(\delta(y, c_k) - p(y = c_k | \mathsf{x} = x))^2\} + \sum_{k=1}^{K} (p(y = c_k | \mathsf{x} = x) - T_k(x))^2 \right) p(\mathsf{x} = x)\, dx \notag\\
&= \int_{\mathbb{R}^N} \left( \sum_{k=1}^{K} p(y = c_k | \mathsf{x} = x)(1 - p(y = c_k | \mathsf{x} = x)) + \sum_{k=1}^{K} (p(y = c_k | \mathsf{x} = x) - T_k(x))^2 \right) p(\mathsf{x} = x)\, dx. \tag{A.39}
\end{align}

The cross term vanishes because $\mathrm{E}_{y|x}\{\delta(y, c_k) - p(y = c_k | \mathsf{x} = x)\} = 0$. From (A.39) it follows that the Bayes estimates for the class probabilities are

\[
T_B(x) = [p(y = c_1 | \mathsf{x} = x), \ldots, p(y = c_K | \mathsf{x} = x)]^T, \tag{A.40}
\]

because for these values the second sum in (A.39) is zero, whereas the first sum cannot be influenced by the choice of T. From (A.39) it also follows that the minimal risk in a node t of a partition $\tilde{T}$ is

\[
r_{\min}(t) = \min_{T} \mathrm{E}_{y|x \in t}\{L(y, T(x))\} = \sum_{k=1}^{K} p(y = c_k | x \in t)(1 - p(y = c_k | x \in t)). \tag{A.41}
\]
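The result (A.41) can be checked for a single node. The sketch below assumes hypothetical class posteriors in a node t and compares the expected squared loss of the posterior vector itself with that of a different estimate; function names are illustrative only.

```python
def expected_loss(posterior, estimate):
    """E_{y|x in t}{ sum_k (delta(y, c_k) - T_k)^2 } for a fixed estimate T."""
    K = len(posterior)
    return sum(
        p_true * sum(((1.0 if k == j else 0.0) - estimate[k]) ** 2
                     for k in range(K))
        for j, p_true in enumerate(posterior)
    )


def minimal_risk(posterior):
    """r_min(t) from (A.41): sum_k p_k * (1 - p_k)."""
    return sum(p * (1.0 - p) for p in posterior)


posterior = [0.7, 0.2, 0.1]   # hypothetical p(y = c_k | x in t)
r_min = minimal_risk(posterior)
loss_at_posterior = expected_loss(posterior, posterior)
loss_at_hard_vote = expected_loss(posterior, [1.0, 0.0, 0.0])
print(r_min, loss_at_posterior, loss_at_hard_vote)
```

The loss evaluated at the posterior vector equals r_min(t), and the hard assignment to the most probable class yields a larger value, in line with (A.39).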

A.8 Random Forests Do Not Overfit

Given a RF with B trees $\{f_b(x, \theta_b)\}$, b = 1, . . . , B, implementing the mapping f(x) as the majority vote among the trees, the margin of the RF for a pattern (x, y) is defined as

\[
\operatorname{margin}(x, y) = \frac{1}{B} \sum_{b=1}^{B} \delta(f_b(x, \theta_b), y) - \max_{c_k \ne y} \left\{ \frac{1}{B} \sum_{b=1}^{B} \delta(f_b(x, \theta_b), c_k) \right\}. \tag{A.42}
\]

The larger the margin, the more confidence in the classification, since the margin measures the extent to which the average number of votes at (x, y) for the right class exceeds the average vote for any other class. For the zero-one loss, the generalization error is

\begin{align}
R(f) &= \mathrm{E}_{x,y}\{1 - \delta(y, f(x))\} = 1 - \mathrm{E}_{x,y}\{\delta(y, f(x))\} \tag{A.43}\\
&= 1 - p_{x,y}(f(x) = y) = p_{x,y}(f(x) \ne y) \tag{A.44}\\
&= p_{x,y}(\operatorname{margin}(x, y) < 0), \tag{A.45}
\end{align}

the probability being over the (x, y)-space. RFs do not overfit as more trees are added to the forest, i. e., the generalization error converges. As the number of trees increases, the generalization error R(f) converges to

\[
p_{x,y}\left( p_{\theta_b}(f_b(x, \theta_b) = y) - \max_{c_k \ne y} p_{\theta_b}(f_b(x, \theta_b) = c_k) < 0 \right). \tag{A.46}
\]

In order to prove (A.46) it is enough to show [see (A.42)] that

\[
\frac{1}{B} \sum_{b=1}^{B} \delta(f_b(x, \theta_b), c_k) \to p_{\theta_b}(f_b(x, \theta_b) = c_k). \tag{A.47}
\]

This is guaranteed by the law of large numbers³.

³ $\lim_{B \to \infty} \frac{1}{B} \sum_{b=1}^{B} \delta(\varphi(\theta_b), k)$ converges to $p_{\theta_b}(\varphi(\theta_b) = k)$.
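The margin from (A.42) is straightforward to compute from the individual tree votes. The following sketch assumes that the votes of the B trees for one pattern are available as a list of class labels; the labels and vote counts are hypothetical.

```python
from collections import Counter


def rf_margin(votes, true_class, classes):
    """Margin (A.42): vote share of the true class minus the largest other share."""
    counts = Counter(votes)
    num_trees = len(votes)
    right = counts[true_class] / num_trees
    wrong = max(counts[c] / num_trees for c in classes if c != true_class)
    return right - wrong


classes = ["c1", "c2", "c3"]
votes = ["c1"] * 6 + ["c2"] * 3 + ["c3"] * 1   # hypothetical votes of B = 10 trees
print(rf_margin(votes, "c1", classes))
print(rf_margin(votes, "c2", classes))
```

A positive margin means that the majority vote is correct for the pattern, while a negative margin corresponds to a misclassification, which is the event whose probability appears in (A.45).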


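A.9 Computation of the Gradient for the GRBF Algorithm

For the minimization of $R_{emp}(\gamma, X_K, W, \mathcal{D})$ from (3.22) with the gradient descent method, the derivative of $R_{emp}(\gamma, X_K, W, \mathcal{D})$ with respect to γ is required. The empirical risk can be rewritten as

\begin{align}
R_{emp}(\gamma, X_K, W, \mathcal{D}) &= \frac{1}{M} \sum_{m=1}^{M} (\bar{y}_m - u_m)^T (\bar{y}_m - u_m) = \frac{1}{M} \operatorname{tr}\{(\bar{Y} - U)^T (\bar{Y} - U)\} \tag{A.48}\\
&= \frac{1}{M} \left( \operatorname{tr}\{\bar{Y}^T \bar{Y}\} - 2 \operatorname{tr}\{\bar{Y}^T U\} + \operatorname{tr}\{U^T U\} \right), \tag{A.49}
\end{align}

with $U = [u_1, \ldots, u_M] \in \mathbb{R}^{K \times M}$ and

\begin{align}
\bar{Y} = [\bar{y}_1, \ldots, \bar{y}_M] &= W \Psi \tag{A.50}\\
&= \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_K^T \end{bmatrix}
\begin{bmatrix}
e^{-\gamma d_M(x_1, x_{K,1})} & e^{-\gamma d_M(x_2, x_{K,1})} & \cdots & e^{-\gamma d_M(x_M, x_{K,1})} \\
e^{-\gamma d_M(x_1, x_{K,2})} & e^{-\gamma d_M(x_2, x_{K,2})} & \cdots & e^{-\gamma d_M(x_M, x_{K,2})} \\
\vdots & \vdots & \ddots & \vdots \\
e^{-\gamma d_M(x_1, x_{K,H})} & e^{-\gamma d_M(x_2, x_{K,H})} & \cdots & e^{-\gamma d_M(x_M, x_{K,H})}
\end{bmatrix} \in [0, 1]^{K \times M}, \tag{A.51}
\end{align}

where $w_k = [w_{k,1}, \ldots, w_{k,H}]^T$ and $d_M(x_m, x_{K,h}) = (x_m - x_{K,h})^T \hat{C}_h^{-1} (x_m - x_{K,h})$. Thus, the gradient can be expressed as

\begin{align}
\frac{\partial R_{emp}(\gamma, X_K, W, \mathcal{D})}{\partial \gamma} &= \frac{1}{M} \left( 2 \operatorname{tr}\left\{ \bar{Y}^T \frac{\partial \bar{Y}}{\partial \gamma} \right\} - 2 \operatorname{tr}\left\{ U^T \frac{\partial \bar{Y}}{\partial \gamma} \right\} \right) \tag{A.52}\\
&= \frac{2}{M} \operatorname{tr}\left\{ (\bar{Y} - U)^T \frac{\partial \bar{Y}}{\partial \gamma} \right\}. \tag{A.53}
\end{align}

The derivative $\partial \bar{Y} / \partial \gamma$ is

\begin{align}
\frac{\partial \bar{Y}}{\partial \gamma} &= \begin{bmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_K^T \end{bmatrix}
\begin{bmatrix}
-e^{-\gamma d_M(x_1, x_{K,1})} d_M(x_1, x_{K,1}) & \cdots & -e^{-\gamma d_M(x_M, x_{K,1})} d_M(x_M, x_{K,1}) \\
-e^{-\gamma d_M(x_1, x_{K,2})} d_M(x_1, x_{K,2}) & \cdots & -e^{-\gamma d_M(x_M, x_{K,2})} d_M(x_M, x_{K,2}) \\
\vdots & \ddots & \vdots \\
-e^{-\gamma d_M(x_1, x_{K,H})} d_M(x_1, x_{K,H}) & \cdots & -e^{-\gamma d_M(x_M, x_{K,H})} d_M(x_M, x_{K,H})
\end{bmatrix} \tag{A.54}\\
&= -W (\Psi \odot \boldsymbol{D}), \tag{A.55}
\end{align}

with $\odot$ denoting the elementwise multiplication and $\boldsymbol{D} \in \mathbb{R}_{+}^{H \times M}$ the distance matrix with the (h, m)-th entry $d_M(x_m, x_{K,h})$. Therefore, the derivative of the empirical risk with respect to γ is

\[
\frac{\partial R_{emp}(\gamma, X_K, W, \mathcal{D})}{\partial \gamma} = -\frac{2}{M} \operatorname{tr}\left\{ (\bar{Y} - U)^T W (\Psi \odot \boldsymbol{D}) \right\}. \tag{A.56}
\]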
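The analytic derivative in (A.56) can be compared with a finite-difference approximation. The sketch below is a plain-Python illustration with small hypothetical matrices: W is a K × H weight matrix, Dist holds the distances d_M(x_m, x_K,h) as an H × M matrix, and U is a K × M matrix. The trace in (A.56) is evaluated as a sum over matrix entries.

```python
import math


def grbf_risk_and_gradient(gamma, W, Dist, U):
    """Return R_emp from (A.48) and its derivative w.r.t. gamma from (A.56)."""
    K, H, M = len(W), len(Dist), len(Dist[0])
    Psi = [[math.exp(-gamma * Dist[h][m]) for m in range(M)] for h in range(H)]
    risk, grad = 0.0, 0.0
    for k in range(K):
        for m in range(M):
            # entry (k, m) of W Psi
            y_km = sum(W[k][h] * Psi[h][m] for h in range(H))
            # entry (k, m) of -W (Psi ⊙ D), i.e. of dY/dgamma
            dy_km = -sum(W[k][h] * Psi[h][m] * Dist[h][m] for h in range(H))
            risk += (y_km - U[k][m]) ** 2
            grad += 2.0 * (y_km - U[k][m]) * dy_km
    return risk / M, grad / M


# Hypothetical example data (K = 2, H = 2, M = 3)
W = [[0.6, 0.4], [0.3, 0.7]]
Dist = [[0.0, 1.0, 2.5], [1.5, 0.2, 0.0]]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
gamma, step = 0.8, 1e-6

_, analytic = grbf_risk_and_gradient(gamma, W, Dist, U)
risk_plus, _ = grbf_risk_and_gradient(gamma + step, W, Dist, U)
risk_minus, _ = grbf_risk_and_gradient(gamma - step, W, Dist, U)
numeric = (risk_plus - risk_minus) / (2.0 * step)
print(analytic, numeric)
```

The two printed values agree to several decimal places, which supports the sign and the factor 2/M in (A.56).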

A.10 The AdaTron Algorithm

The AdaTron⁴ [FCC98] is an iterative algorithm for the separable case that finds the coefficients $\tilde{a}^{(q)}$ by maximizing the margin, which is defined here as the distance from the nearest training inputs to the hyperplane described by $\tilde{a}^{(q)}$. As for SVMs, the intuition underlying this approach is that a large margin on the training data can lead to good separation of the test data. The distance from an input vector $z^{(q)}[n]$ to the hyperplane defined by $\tilde{a}^{(q)}$ is $|a^{(q),T} z^{(q)}[n] + a_0^{(q)}| / \|a^{(q)}\|$ [DHS01]. Thus, the goal is to find the weight vector $\tilde{a}^{(q)}$ that maximizes $\beta_{AT} > 0$ in

\[
\frac{y_{cd,k}^{(q)}[n]\, (a^{(q),T} z_k^{(q)}[n] + a_0^{(q)})}{\|a^{(q)}\|} \ge \beta_{AT}, \quad k = 1, \ldots, M_{cd}^{(q)}. \tag{A.57}
\]

⁴ AdaTron is a name derived from Adaptive PercepTron.

Since for any $a^{(q)}$ and $a_0^{(q)}$ satisfying the inequality (A.57) any positively scaled multiple satisfies it too, one can arbitrarily set $\|a^{(q)}\| = 1/\beta_{AT}$. Therefore, the optimization problem that has to be solved in order to determine $\tilde{a}^{(q)}$ is

\[
\tilde{a}^{(q)} = \operatorname*{argmin}_{\alpha^{(q)}, \alpha_0^{(q)}} \left\{ \frac{1}{2} \|\alpha^{(q)}\|^2 \right\} \quad \text{s. t.} \quad y_{cd,k}^{(q)}[n]\, (\alpha^{(q),T} z_k^{(q)}[n] + \alpha_0^{(q)}) \ge 1, \quad k = 1, \ldots, M_{cd}^{(q)}. \tag{A.58}
\]

This is a convex optimization problem having a quadratic criterion with linear inequality conditions. Thus, it can be formulated as an unconstrained optimization problem by the method of Lagrange undetermined multipliers. The Lagrangian functional for (A.58) that has to be minimized with respect to $\alpha^{(q)}$ and $\alpha_0^{(q)}$ is

\[
L_P = \frac{1}{2} \|\alpha^{(q)}\|^2 - \sum_{k=1}^{M_{cd}^{(q)}} \lambda_k \left( y_{cd,k}^{(q)}[n]\, (\alpha^{(q),T} z_k^{(q)}[n] + \alpha_0^{(q)}) - 1 \right), \tag{A.59}
\]

with $\lambda_k$ being the Lagrange multipliers. Setting the derivatives to zero yields

\[
\alpha^{(q)} = \sum_{k=1}^{M_{cd}^{(q)}} \lambda_k\, y_{cd,k}^{(q)}[n]\, z_k^{(q)}[n] \quad \text{and} \quad 0 = \sum_{k=1}^{M_{cd}^{(q)}} \lambda_k\, y_{cd,k}^{(q)}[n], \tag{A.60}
\]

which can be substituted in (A.59), leading to the so-called Wolfe dual

\[
L_D = \sum_{k=1}^{M_{cd}^{(q)}} \lambda_k - \frac{1}{2} \sum_{k=1}^{M_{cd}^{(q)}} \sum_{\ell=1}^{M_{cd}^{(q)}} \lambda_k \lambda_\ell\, y_{cd,k}^{(q)}[n]\, y_{cd,\ell}^{(q)}[n]\, z_k^{(q),T}[n]\, z_\ell^{(q)}[n] \quad \text{s. t.} \quad \lambda_k \ge 0. \tag{A.61}
\]

The Wolfe dual represents a simpler convex optimization problem which can be solved iteratively. For this reason the derivatives

\[
\frac{\partial L_D}{\partial \lambda_k} = 1 - \sum_{\ell=1}^{M_{cd}^{(q)}} \lambda_\ell\, y_{cd,k}^{(q)}[n]\, y_{cd,\ell}^{(q)}[n]\, z_k^{(q),T}[n]\, z_\ell^{(q)}[n] = 1 - y_{cd,k}^{(q)}[n]\, z_k^{(q),T}[n]\, \alpha^{(q)} \tag{A.62}
\]

are required. Since the Lagrange multipliers must be non-negative, one obtains the adaptation rule

\[
\lambda_k \leftarrow \lambda_k + \max\left\{ -\lambda_k,\; \eta_{AT} \left( 1 - y_{cd,k}^{(q)}[n]\, z_k^{(q),T}[n]\, \alpha^{(q)} \right) \right\}, \tag{A.63}
\]


where $\eta_{AT}$ is the step-size for this iterative algorithm. According to (A.60) the contribution of the k-th pattern to changing $\alpha^{(q)}$ is realized by the update

\[
\alpha^{(q)} \leftarrow \alpha^{(q)} + \lambda_k\, y_{cd,k}^{(q)}[n]\, z_k^{(q)}[n]. \tag{A.64}
\]

The solution to the optimization problem must satisfy the Karush-Kuhn-Tucker conditions, which include the equations in (A.60), the inequalities $\lambda_k \ge 0$ and

\[
\lambda_k \left( y_{cd,k}^{(q)}[n]\, (a^{(q),T} z_k^{(q)}[n] + a_0^{(q)}) - 1 \right) = 0, \quad k = 1, \ldots, M_{cd}^{(q)}. \tag{A.65}
\]

The latter equation can be used to compute $a_0^{(q)}$ after $a^{(q)}$ is determined iteratively using the update rule in (A.64). To do so one has to choose a $\lambda_k > 0$ and solve for $\alpha_0^{(q)}$. The AdaTron is summarized in Alg. A.1, where all examples from the training set $S_{cd}^{(q)}$ are iterated maxIter times in order to assure the convergence of the algorithm.

Algorithm A.1 AdaTron algorithm

set $\lambda_k = 0$, $k = 1, \ldots, M_{cd}^{(q)}$
for i = 1, . . . , maxIter do
    for $k = 1, \ldots, M_{cd}^{(q)}$ do
        compute $\zeta_k = \alpha^{(q),T} z_k^{(q)}[n] = \sum_{\ell=1}^{M_{cd}^{(q)}} \lambda_\ell\, y_{cd,\ell}^{(q)}[n]\, z_k^{(q),T}[n]\, z_\ell^{(q)}[n]$
        set $\lambda_k = \lambda_k + \max\{ -\lambda_k,\; \eta_{AT} (1 - y_{cd,k}^{(q)}[n]\, \zeta_k) \}$
        set $\alpha^{(q)} = \alpha^{(q)} + \lambda_k\, y_{cd,k}^{(q)}[n]\, z_k^{(q)}[n]$
    end for
end for
find an index k with $\lambda_k > 0$ and set $\alpha_0^{(q)} = y_{cd,k}^{(q)}[n] - \alpha^{(q),T} z_k^{(q)}[n]$
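A compact sketch of the iteration may help to see how the pieces fit together. It is an illustration under stated assumptions, not the implementation used in the thesis: ζ_k is computed from the multipliers as in the "compute" step of Alg. A.1, the multipliers follow the rule (A.63), the weight vector is recomputed once at the end from (A.60) instead of being accumulated, and the offset is taken from the pattern with the largest multiplier via (A.65). The toy data set, the step size and the iteration count are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def adatron(Z, y, eta=0.05, max_iter=500):
    """Iterate the update (A.63) and return (alpha, alpha0, lambdas)."""
    M, dim = len(Z), len(Z[0])
    lam = [0.0] * M
    for _ in range(max_iter):
        for k in range(M):
            # zeta_k = sum_l lambda_l * y_l * z_k^T z_l
            zeta = sum(lam[l] * y[l] * dot(Z[k], Z[l]) for l in range(M))
            # multiplier update, clipped so that lambda_k stays non-negative
            lam[k] += max(-lam[k], eta * (1.0 - y[k] * zeta))
    # weight vector from (A.60)
    alpha = [sum(lam[k] * y[k] * Z[k][i] for k in range(M)) for i in range(dim)]
    # offset from a pattern with positive multiplier, cf. (A.65)
    k_sv = max(range(M), key=lam.__getitem__)
    alpha0 = y[k_sv] - dot(alpha, Z[k_sv])
    return alpha, alpha0, lam


# Hypothetical linearly separable toy data, symmetric about the origin
Z = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
alpha, alpha0, lam = adatron(Z, y)
margins = [y[k] * (dot(alpha, Z[k]) + alpha0) for k in range(len(Z))]
print(alpha, alpha0, margins)
```

For this data set the patterns at (2, 0) and (-2, 0) end up as the support vectors with functional margin 1, and the multipliers of the two other patterns are driven to zero. The step size has to be small relative to the squared norms of the inputs for the iteration to remain stable.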


B. List of Frequently Used Symbols

α_scale          scaling factor
bias(·)          local bias
B                number of base learners in an ensemble classifier
b                index for a base learner in an ensemble classifier
b_ℓ              ℓ-th eigenvector
Ĉ                estimated covariance matrix
c_k              the k-th class
cnf_k(·)         confidence value for the k-th entry in u
cd^(q)           q-th change detector
D                random variable representing the training set
D                available training set
d(·, ·)          distance between the arguments
d_DTW(·, ·)      Dynamic Time Warping dissimilarity
d_ADTW(·, ·)     Augmented Dynamic Time Warping dissimilarity
Δ(n)             importance of n-th feature
δ(·, ·)          function whose output is 1 only if both arguments are identical, otherwise the output is 0
e_b              bias part of the prediction risk
e_b^(B)          bias of an ensemble learner
e_i              irreducible part of the prediction risk
e_v              variance part of the prediction risk
e_v^(B)          variance of an ensemble method
f                mapping from x to ŷ
f_B              Bayes classifier or Bayes regression function
f_J              mapping f that depends on the selection matrix J
f_L^(q)          mapping representing the q-th labeling classifier
f_oob            out-of-bootstrap classifier
f̃                mapping from v to y
f_e              feature extraction
f_g              feature generation
γ                scaling parameter in the GRBF similarity
γ[n]             penalization term for a misclassification at time instance n
γ_imp            variable used for the categorization of crash severities
ile(·)           irreducible local error
J                selection matrix
H                number of basis functions in linear basis expansion models
K                number of classes
κ(·, ·)          kernel function


L                number of univariate time series in a scenario S
L(·, ·)          loss function
M                number of examples in the training set D
M(t)             number of examples from D that lie in node t
mg(·, ·)         margin in a RF
μ̂                estimated mean
N                dimensionality of v
N                dimensionality of x
Ñ                dimensionality of x̃
N_RF             number of features that are used to compute the split in a tree-node of the RF
n                time index
n_c              time index when a change occurs
n_end            length of a scenario S
Ω^(q)            set of classes to whom a change must be detected by cd^(q)
Ω_c              set with possible class changes
Ω_c^(q)          set with class changes that must be identified by the q-th module
p(·)             probability or probability density function
p̂(·)             estimate of p(·)
prox(·, ·)       proximity measure
Φ(·)             upper tail of the standard normal distribution
Φ̃                mapping into the inner dot space related to a kernel κ(·, ·)
ψ_h(·, ·)        similarity measure in the GRBF classifier
ϕ(·, ϑ)          vector containing the evaluations of the H basis functions which depend on ϑ at the input value specified by the argument
Q                number of modules in the segmentation and labeling classification system
R(·)             prediction risk
R_emp(·)         empirical risk
R_oob(·)         out-of-bootstrap estimate of R(·)
R̂_D(·)           resubstitution estimate of R_emp(·)
R̂_D2(·)          estimate of R(·) using the set D2 that was not utilized during the training
R̂_CV(·, V)       V-fold cross validation estimate of R(·)
R̂_BS(·)          bootstrap estimate of R(·)
ΔR(s|t)          relative risk reduction in a CART due to a split s in node t
S                a scenario, i. e., a finite dimensional multivariate random process
S^[n1,n2]        scenario S from time stamp n1 up to time stamp n2
s_ℓ              ℓ-th univariate time series in the scenario S
s_ℓ^[n1,n2]      the univariate time series s_ℓ from time stamp n1 up to time stamp n2
s                split in a CART tree
ŝ_opt(t)         optimal empirical split at node t
σ̂                estimated standard deviation
t                node in a CART tree
θ                random elements in the Random Forest algorithm

ϑ              vector of free parameters for basis functions
U              multivariate time series resulting after a low-pass filtering of S
U              multivariate time series resulting after a low-pass filtering of S
u              output of a linear basis model before performing the confidence mapping and normalization steps
u_ZAEX         low-pass filtered signal stemming from the ZAEX sensor
u_ECSL         low-pass filtered signal stemming from the ECSL sensor
u_ECSR         low-pass filtered signal stemming from the ECSR sensor
v              random variable representing the observation attributes
v              vector of observation attributes
var(·)         local variance
W              weight matrix
w_k,h          weight from the h-th basis function to the k-th output
X              X̃ after the feature selection step
X̃              representation of S with high-level features
X^[n1,n2]      scenario X from time stamp n1 up to time stamp n2
x              random variable representing the feature vector
x̃              feature vector before the feature selection step
x̃_g            global features in x̃
x̃_l            local features in x̃
x              realization of x
y              random variable representing the target attribute
y              target vector for a scenario S
ŷ_s^(q)        soft decisions produced by the q-th labeling classifier
ŷ              output of a learning system
ŷ_h^(q)        hard decision of the q-th labeling classifier with respect to the rejection of a decision
y              realization of y
ȳ_m            target unit vector corresponding to input x_m
z^(q)          feature vector for the q-th change detector
