Signal Processing in Medicine and Biology 3030368432, 9783030368432


Table of contents:
Preface
Contents
1 An Analysis of Automated Parkinson's Diagnosis Using Voice: Methodology and Future Directions
1.1 Introduction
1.1.1 Voice as a Biomarker
1.1.2 Parkinson's Disease Background
1.1.3 Parkinson's Disease Detection—Current Methods
1.1.4 Parkinson's Disease Pathophysiology
1.1.5 Previous Work in Parkinson's Diagnosis Using Voice
1.2 mPower Voice Dataset
1.3 Methods
1.3.1 Voice Activation Detection
1.3.2 Feature Selection
1.3.2.1 Mel-Frequency Cepstrum Coefficients
1.3.2.2 GeMAPS Features
1.3.2.3 Audio Visual Emotion Recognition Challenge 2013 (AVEC) Features
1.3.3 Maximum Relevance Minimum Redundancy
1.3.4 Machine Learning
1.3.4.1 Cross Validation and Grid Search
1.3.4.2 Decision Trees
1.3.4.3 Random Forest
1.3.4.4 Extra Trees
1.3.4.5 Gradient Boosted Decision Trees
1.3.4.6 Support Vector Machine
1.3.4.7 Artificial Neural Networks
1.4 Results
1.5 Discussion
1.6 Conclusion
References
2 Noninvasive Vascular Blood Sound Monitoring Through Flexible Microphone
2.1 Introduction and Background
2.2 Prior Work in Phonoangiographic Detection of Stenosis
2.2.1 Effect of Stenosis and Blood Flow on Bruit Spectra and Intensity
2.2.2 Effect of Recording Location on Bruit Spectra
2.2.3 Stenosis Severity Classification from PAG Signal Analysis
2.3 Phonoangiogram Signal Processing
2.3.1 Bruit-Enhancing Filter
2.3.2 PAG Wavelet Analysis
2.3.3 Wavelet-Derived Auditory Signals
2.3.4 PAG Systole/Diastole Segmentation
2.3.5 PAG Spectral Feature Extraction
2.4 Skin-Coupled Recording Microphone Design and Assembly
2.4.1 Sensor Construction
2.4.1.1 Frequency Response
2.4.1.2 Signal to Noise Ratio Calculation
2.4.1.3 Array of Microphones
2.5 Detection of Vascular Access Stenosis Location and Severity In Vitro
2.5.1 Feature Performance
2.5.1.1 Auditory Spectral Flux (ASF)
2.5.1.2 Auditory Spectral Centroid (ASC)
2.5.2 Threshold-Based Phonoangiographic Detection of Vascular Access Stenosis
2.6 Summary of Stenosis Detection and Classification Performance
2.7 Conclusion
References
3 The Temple University Hospital Digital Pathology Corpus
3.1 Introduction
3.1.1 Digital Pathology
3.1.2 Deep Learning
3.2 The TUH Digital Pathology Corpus (TUDP)
3.2.1 Computing Infrastructure
3.2.2 Image Digitization
3.2.3 Data Organization
3.2.4 Data Anonymization
3.2.5 Annotation
3.3 Deep Learning Experiments
3.3.1 Baseline System Architecture
3.3.2 Experimental Results
3.4 Summary
References
4 Transient Artifacts Suppression in Time Series via Convex Analysis
4.1 Introduction
4.1.1 Related Work
4.2 Preliminaries
4.2.1 Difference Matrices
4.2.2 Soft-Thresholding, Total Variation, and Fused Lasso Penalty
4.2.3 The Generalized Moreau Envelope
4.3 Transient Artifacts Suppression
4.3.1 Problem Formulation
4.3.2 Optimization Algorithm
4.3.3 Parameters
4.4 The Generalized Conjoint Penalty
4.5 Transient Artifact Suppression Using the Generalized Conjoint Penalty
4.5.1 Design of Parametric Matrix B
4.5.2 Optimization Algorithm
4.6 Numerical Examples
4.6.1 Example 1
4.6.2 Example 2
4.7 Conclusion and Future Work
References
5 The Hurst Exponent: A Novel Approach for Assessing Focus During Trauma Resuscitation
5.1 Introduction
5.2 Method
5.2.1 Hurst Exponent (H)
5.2.2 Experimental Protocol
5.2.3 Measure of Head Movements
5.2.4 Application of Hurst Exponent to Head Movements
5.3 Results
5.4 Discussion
5.5 Conclusion
Appendix
A Simplified Approach for the Estimation of the Hurst Exponent
References
6 Gaussian Smoothing Filter for Improved EMG Signal Modeling
6.1 Introduction
6.2 Related Works
6.3 Problem Formulation
6.4 GSF-Based Enhanced Classification Process
6.4.1 Gaussian Smoothing Filter (GSF)
6.4.2 Filtered EMG Signals Classification
6.4.3 Support Vector Machine (SVM)
6.4.4 k-Nearest Neighbor (k-NN)
6.4.5 Naïve Bayes Classification (NBC)
6.4.6 Linear Discriminant Analysis (LDA)
6.4.7 Gaussian Mixtures Model (GMM)-Based Classifier
6.5 Experimental Validations
6.5.1 Experiment 1: Hand Gestures
6.5.2 Experiment 2: Grasping Task
6.6 Discussions
6.7 Conclusion
References
7 Clustering of SCG Events Using Unsupervised Machine Learning
7.1 Introduction
7.2 Methods
7.2.1 Experimental Measurements
7.2.2 Preprocessing
7.2.2.1 Filtering
7.2.2.2 SCG Segmentation
7.2.3 Unsupervised Machine Learning
7.2.3.1 Clustering SCG Morphology
7.2.4 Dynamic Time Warping (DTW)
7.2.5 Averaging SCG Beats
7.2.5.1 DTW Barycenter Averaging (DBA)
7.2.5.2 Clustering Algorithms
7.2.5.3 k-Medoid Clustering with DTW as a Distance Measure
7.3 Results and Discussion
7.3.1 Optimum Number of Clusters
7.3.2 Purity of Clustering with Labels HLV/LLV and INS/EXP
7.3.3 Analyzing Cluster Distribution with Respiratory Phases
7.3.4 Cluster Switching
7.3.5 Relation Between Heart Rate and Clustering
7.3.6 Intra-cluster Variability
7.4 Conclusion
References
8 Deep Learning Approaches for Automated Seizure Detection from Scalp Electroencephalograms
8.1 Introduction
8.1.1 Leveraging Recent Advances in Deep Learning
8.1.2 Big Data Enables Deep Learning Research
8.2 Temporal Modeling of Sequential Signals
8.2.1 A Linear Frequency Cepstral Coefficient Approach to Feature Extraction
8.2.2 Temporal and Spatial Context Modeling
8.3 Improved Spatial Modeling Using CNNs
8.3.1 Deep Two-Dimensional Convolutional Neural Networks
8.3.2 Augmenting CNNs with Deep Residual Learning
8.3.3 Unsupervised Learning
8.4 Learning Temporal Dependencies
8.4.1 Integration of Incremental Principal Component Analysis with LSTMs
8.4.2 End-to-End Sequence Labeling Using Deep Architectures
8.4.3 Temporal Event Modeling Using LSTMs
8.5 Experimentation
8.5.1 Evaluation Metrics
8.5.2 Postprocessing with Heuristics Improves Performance
8.5.3 A Comprehensive Evaluation of Hybrid Approaches
8.5.4 Optimization of Core Components
8.6 Conclusions
References
Correction to: The Temple University Hospital Digital Pathology Corpus
Index


Iyad Obeid • Ivan Selesnick • Joseph Picone Editors

Signal Processing in Medicine and Biology Emerging Trends in Research and Applications

Editors

Iyad Obeid
Department of Electrical & Computer Engineering
Temple University
Philadelphia, PA, USA

Ivan Selesnick
Department of Electrical & Computer Engineering
New York University
Brooklyn, NY, USA

Joseph Picone
Department of Electrical & Computer Engineering
Temple University
Philadelphia, PA, USA

ISBN 978-3-030-36843-2    ISBN 978-3-030-36844-9 (eBook)
https://doi.org/10.1007/978-3-030-36844-9

© Springer Nature Switzerland AG 2020, corrected publication 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This edited volume consists of the expanded versions of the exceptional papers presented at the 2018 IEEE Signal Processing in Medicine and Biology (IEEE SPMB) Symposium held at Temple University in Philadelphia, Pennsylvania, USA. IEEE SPMB promotes interdisciplinary papers across a wide range of topics, including the analysis of biomedical signals and images, machine learning, and data and educational resources. The symposium was first held in 2011 at New York University Polytechnic (now known as NYU Tandon School of Engineering). Since 2014, it has been hosted by the Neural Engineering Data Consortium (NEDC) at Temple University as part of a broader mission to promote machine learning and big data applications in bioengineering. The symposium typically consists of 18 highly competitive full paper submissions that include oral presentations, and 12–18 single-page abstracts that are presented as posters. Two plenary lectures are included: one focused on research and the other focused on emerging technology. The symposium provides a stimulating environment where multidisciplinary research in the life sciences is presented. More information about the symposium can be found at www.ieeespmb.org.

Machine learning has recently had a considerable impact on bioengineering applications with the emergence of deep learning techniques. Deep learning is revolutionizing the way we think about pattern recognition and delivering unprecedented levels of performance on well-understood tasks such as speech recognition and machine translation. However, such approaches require a large amount of data to be successful, and such data often is not available in typical bioengineering applications. Therefore, the NEDC at Temple University, which sponsors this symposium, has a primary goal of promoting community interest in the development of big data resources. The symposium has been one of the premier forums for the discussion of big data applications in bioengineering. Common evaluation paradigms with shared performance metrics are also extremely important for making progress in such experimental fields. Therefore, an important secondary theme of the symposium has been the promotion of open source evaluation paradigms that allow researchers to easily share results and replicate experiments.


The papers selected for this volume share three themes: (1) a machine-learning or statistical signal processing approach, (2) application to a biomedical signal such as an electroencephalogram (EEG) or electromyogram (EMG), and (3) a focus on clinical applications. The latter is very important because clinical data introduces many practical challenges, such as noise artifacts, variability in data acquisition environments, unpredictable subject behavior, and a lack of cleanly segmented signal data. These real-world concerns greatly increase the complexity of the machine-learning problem, resulting in levels of performance much lower than what is published in the research literature using pristine data. We refer to this as the clinical gap, and closing this gap is important if the technology is going to achieve clinical acceptance.

The first group of four papers directly addresses clinical applications. The first two papers, titled An Analysis of Automated Parkinson's Diagnosis Using Voice: Methodology and Future Directions and Noninvasive Vascular Blood Sound Monitoring Through Flexible Microphone, deal with the challenges that result from the acquisition of data in clinical settings, in which leveraging existing well-defined signal interfaces is important. The third paper, titled The Temple University Hospital Digital Pathology Corpus, focuses on an extensive big data resource for digital pathology and introduces preliminary experiments on machine-learning approaches to the automatic classification of pathology slide images. The fourth paper, titled Transient Artifacts Suppression in Time Series via Convex Analysis, deals with the suppression of artifacts in time series data, with an emphasis on impulsive noise, which is quite common in clinical settings.

The next group of four papers focuses more on signal processing methods for high performance classification. The first paper, titled The Hurst Exponent: A Novel Approach for Assessing Focus During Trauma Resuscitation, discusses the use of a measure of nonlinearities in the signal to aid in monitoring patients during trauma resuscitation. The second paper in this group, titled Gaussian Smoothing Filter for Improved EMG Signal Modeling, discusses the use of Gaussian filtering in EMG analysis. The third paper, titled Clustering of SCG Events Using Unsupervised Machine Learning, discusses the use of machine learning to classify seismocardiography (SCG) events. The final paper, titled Deep Learning Approaches for Automated Seizure Detection from Scalp Electroencephalograms, discusses the use of deep learning principles to predict seizures. These papers represent leading developments in the automated processing of physiological signals.

A sincere thanks goes to all our authors, who helped make IEEE SPMB 2018 a great success. We are particularly appreciative of the authors represented in this volume, who provided excellent expanded chapters. It is our goal to continue publishing these volumes consisting of the very best papers from IEEE SPMB.

Philadelphia, PA, USA    Iyad Obeid
Brooklyn, NY, USA    Ivan Selesnick
Philadelphia, PA, USA    Joseph Picone
September 2019

Contents

1  An Analysis of Automated Parkinson's Diagnosis Using Voice: Methodology and Future Directions . . . 1
   Timothy J. Wroge and Reza Hosseini Ghomi

2  Noninvasive Vascular Blood Sound Monitoring Through Flexible Microphone . . . 35
   Binit Panda, Stephanie Chin, Soumyajit Mandal, and Steve J. A. Majerus

3  The Temple University Hospital Digital Pathology Corpus . . . 69
   Nabila Shawki, M. Golam Shadin, Tarek Elseify, Luke Jakielaszek, Tunde Farkas, Yuri Persidsky, Nirag Jhala, Iyad Obeid, and Joseph Picone

4  Transient Artifacts Suppression in Time Series via Convex Analysis . . . 107
   Yining Feng, Baoqing Ding, Harry Graber, and Ivan Selesnick

5  The Hurst Exponent: A Novel Approach for Assessing Focus During Trauma Resuscitation . . . 139
   Ikechukwu P. Ohu, Jestin N. Carlson, and Davide Piovesan

6  Gaussian Smoothing Filter for Improved EMG Signal Modeling . . . 161
   Ibrahim F. J. Ghalyan, Ziyad M. Abouelenin, Gnanapoongkothai Annamalai, and Vikram Kapila

7  Clustering of SCG Events Using Unsupervised Machine Learning . . . 205
   Peshala T. Gamage, Md Khurshidul Azad, Amirtaha Taebi, Richard H. Sandler, and Hansen A. Mansy

8  Deep Learning Approaches for Automated Seizure Detection from Scalp Electroencephalograms . . . 235
   Meysam Golmohammadi, Vinit Shah, Iyad Obeid, and Joseph Picone

Correction to: The Temple University Hospital Digital Pathology Corpus . . . C1

Index . . . 277

Chapter 1

An Analysis of Automated Parkinson's Diagnosis Using Voice: Methodology and Future Directions

Timothy J. Wroge and Reza Hosseini Ghomi

1.1 Introduction

Artificial intelligence (AI) has been actively discussed but largely dormant in implementation for 75 years for most use cases, including healthcare. In a similar vein, voice has been explored as a dense signal since the 1980s but, much like AI, implementation of these concepts has only become possible in recent years. As a result, voice is a field that has not developed a standardized approach, which has limited its development. Existing research studies are siloed and small in size, with the generalizability of results limited by a lack of transparency about technical approaches (method of recording, quality, etc.). Although this has limited the field, there is reason to expect voice to play an active role in healthcare delivery in the next 5–20 years given the trajectory we have seen in voice technology (e.g., AI-based scribe technology for clinical visits is currently in development). We are at a critical time to merge machine learning and voice analysis techniques in health-related applications using standardized and generalizable methods.

1.1.1 Voice as a Biomarker

The voice field has exploded in recent years, with major advances by several consumer and business facing organizations allowing for unprecedented advances in the application of voice technology to healthcare. Transcription has reached unprecedented accuracy, and numerous studies (more than 50 throughout 2015–2017) demonstrate that existing machine learning algorithms (e.g., support vector machines (SVMs)) can classify a range of neurocognitive disorders with high accuracy using voice recordings. This work, however, has yet to achieve commercial viability. These trends indicate a need to bring this work to a point where it can be used in clinical trials and eventually have clinical impact. We have worked to perform a comprehensive inclusion of features using datasets much larger than previously available. The outcome of this study (including some of our work pending publication) and others demonstrates that voice can be used as a tool to detect and track symptoms. This field of research holds promise for increasing remote health monitoring and decreasing the cost associated with evaluating disease status and associated outcomes in clinic settings. One limitation in the field, however, is a lack of standardization regarding hardware and post-collection processing to allow for data sharing and increasing the accuracy of analyses. This is an area we are working on by collaborating across institutions to build the largest database of voice samples easily accessible by any researcher while facilitating voice audio collection with universal methods. One form this is taking is the collection of the largest standardized and normalized voice dataset to date via the Voiceome Study [7]. One collaborator, NeuroLex Laboratories, is leading the way here by open-sourcing much of their work and making voice computing more accessible.

1.1.2 Parkinson’s Disease Background Affecting more than 1% of individuals above age 60 and 10 million people globally, Parkinson’s disease (PD) is the second most prevalent neurodegenerative disease worldwide [8]. According to the Parkinson’s Disease Foundation, medication costs for a person living with PD can range from $2500 to $100,000 dollars per patient, depending on the type of treatment such as surgery options like deep brain stimulation (DBS) implants [9]. PD is typically diagnosed by primary care physicians, neurologists, or movement disorder specialists. The process is based on symptom screening, physical exam, and trial-and-error medication use, and is often inaccurate and timely (2–3 years) due to the lack of a standard diagnostic test for Parkinson’s. In some cases, though, a DaT scan, which uses a radioactive agent and a special camera, can detect early signs of Parkinson’s, but this is rarely used due to cost, accessibility, and poor sensitivity and specificity [10]. Treatment often begins too late because physical examinations rely on advanced symptoms, such as tremors, rigidity, and bradykinesia. However, Parkinson’s is more treatable than most neurodegenerative diseases, with many patients experiencing significant

4

T. J. Wroge and R. H. Ghomi

symptom improvement after treatment onset. Thus, there is motivation to detect and treat PD earlier to improve health outcomes and lower healthcare costs. To diagnose PD earlier and more accurately, a sensitive and specific biomarker is needed.

1.1.3 Parkinson’s Disease Detection—Current Methods Symptom inventories are used to evaluate the neurological state of PD patients: the Unified Parkinson’s Disease Rating Scale (UPDRS) [11] and Hoehn and Yahr staging scale [12]. These inventories consider motor and non-motor symptoms. Clinicians and researchers alike use the UPDRS to follow the progression of a person’s Parkinson’s disease. The scale is beneficial because it provides researchers with a standardized way to measure benefits and a unified accepted rating system; however, it is burdensome—requiring significant time—and requires a movement disorder specialist to complete. It also remains subjective despite representing a significant improvement for the field previously without a standardized scale. For these reasons, inventories such as the UPDRS should eventually be replaced with quick, cheap, accessible biomarkers. Due to the length of the scale, most providers do not complete a full UPDRS and rather rely on a more focused exam. The specificity of the UPDRS is also a challenge because many other diseases may score highly but not represent PD (e.g., progressive supranuclear palsy, multiple system atrophy, corticobasal degeneration, etc.). Machine learning can be helpful here because if given enough data to learn, it can become very reliable in classifying a disease with high specificity and preclude the need for further expensive and time consuming testing, such as a DaT scan.

1.1.4 Parkinson's Disease Pathophysiology

The disease affects functions of the basal ganglia and is characterized by the progressive loss of dopaminergic neurons in the midbrain, the loss of which disrupts corticostriatal circuits involved in motor function and multiple high-level domains [13]. Those with the disease experience a number of motor and cognitive symptoms, including resting tremors, bradykinesia, rigidity, postural instability, slowed thinking, memory loss, dementia, and negative effects on the sensory system, cognition, emotion, and behavior [12, 14]. Thus, the impact of PD goes far beyond movement disorders, as especially seen with the linguistic performance and articulatory disorders in PD. The majority (almost 3/4) of patients with PD experience some voice and speech impairment, known as hypokinetic dysarthria, characterized by reduced loudness, monopitch, monoloudness, reduced stress, a breathy, hoarse voice quality, and imprecise articulation [15]. Hypokinetic dysarthria is due to a lesion in the substantia nigra; however, it can also result

from anti-psychotic medications, frequent blows to the head, and other etiologies described above. For normal muscle movement to occur, it is thought that the dopaminergic and cholinergic (ACh) pathways must be in balance [16]. Articulatory disorders in PD are often accompanied by impairments in grammar and pragmatics. Thus, it has been hypothesized that the waveform and linguistic features of these symptoms could be used as voice biomarkers for PD.

1.1.5 Previous Work in Parkinson's Diagnosis Using Voice

In recent years, the use of speech-based data in the classification of Parkinson's disease (PD) has been gaining credibility as a potential biomarker to provide an effective, non-invasive mode of disease detection. Thus, there has been increased interest and research in speech pattern analysis methods applicable to PD within the scientific community.

Harel et al. conducted longitudinal acoustic analyses surrounding the time of disease diagnosis, revealing that a decrease in fundamental frequency (F0) variability during free speech was detectable prior to clinical diagnosis [17, 18]. Changes in F0 variability and voice onset time (VOT) were also detected upon the initiation of symptomatic treatment. Clinically this is expected given what is seen in PD, which includes not only voice changes but also increased freezing. Freezing refers to delays in initiating a motor activity, such as talking. These observations suggest that early changes in speech may be prodromal signs detectable using certain acoustic measures. A decrease in F0 variability appears to be particularly sensitive to the early progression of the disease and to the initiation of pharmacologic intervention.

Cecchi and colleagues showed that it is possible to measure linguistic features, such as grammatical choices and word-level repetitions, in voice transcripts of brief monologues to classify PD patients vs. controls and infer the patients' level of motor impairment, with 75% and 77% accuracy, respectively [19]. Furthermore, the 2012 Tsanas and Little study used a speech sample database (n = 263) in which only 10 dysphonia features were enough to distinguish between PD and controls with 99% accuracy [20]. Here the focus was on features reflecting pathology in vocal fold closure and opening during phonation. Tsanas and Little also produced a minimalistic analysis of phonetic features to prove the effectiveness of machine learning applications in the use case of Parkinson's disease diagnosis [21]. The group achieved a tenfold cross-validated accuracy of 97.7 ± 2.8% with all 132 features using a support vector machine model. The group used a dataset from the National Center for Voice and Speech (NCVS) containing 263 samples from 43 subjects. The task recorded in the audio is the pronunciation of the /aa/ phoneme. The subjects were recorded in a controlled setting with 16-bit resolution audio sampled at 44.1 kHz. Tsanas used Random Forests and Support Vector Machines as the main machine learning models. Limited accuracy was found using standard telephone recordings due to signal compression, and much of this work was limited by the small number of participants, increasing the bias of the

data toward individuals rather than characterizing the disease itself. However, it is important to note that learning the specific phenotype of the disease for an individual once diagnosed, in order to track symptoms over time, is an important goal which this work demonstrated is feasible.

Khan further advanced this work by demonstrating that mel-frequency cepstral coefficients (MFCCs) incorporated into an SVM model were more accurate in classifying PD correctly when taken from running speech rather than from sustained phonation or a dysdiadochokinesia (DDK) voice task [22]. In this study, data was obtained from 60 PD patients and 20 normal controls. Running speech referred to a natural language sample of the subject speaking, while sustained phonation refers to the subject holding a sound with a single breath. DDK is traditionally used to test rapid alternating movements physically, such as asking the patient to flip their palms up and down quickly in their lap; analogously, to test voice, the subject is asked to rapidly repeat three alternating sounds such as "pa-ta-ka." Sustained phonation and DDK have traditionally been used as tests to detect signs of PD.

Arora, Dorsey, and Little et al. demonstrated an initial feasibility study, using a small group of 20 patients, to monitor PD symptoms accurately using a mobile application collecting data from five separate tasks, one of which was voice [23]. They were able to approximate the UPDRS score within 1–2 points, which was more accurate than the inter-rater reliability of the UPDRS. Dorsey and Little followed up on this work to expand their initial dataset to 129 individuals and explore the contribution of each task to the accuracy of a PD severity score [24]. Their goal was to demonstrate the feasibility of a PD severity score calculated automatically, using machine learning, from data collected on a mobile device. While voice was found to contribute only 17% to the composite score, details of their voice analysis were not shared, so their methods could not be evaluated.

1.2 mPower Voice Dataset

In this study we use the mPower dataset. The mPower voice dataset [25] was designed to demonstrate the use of the widely available sensors within smartphones for recording digital biomarkers to use in PD diagnosis. Ultimately, the goal of the study is to enhance the quality of life of patients with PD through telemonitoring of PD symptoms using widely available devices. While regular doctor visits typically occur on the scale of months [25], widely available phone sensors can provide disease severity monitoring that can lead to better health outcomes and more information for clinicians when prescribing medication. The study focused its analysis on motor-specific symptoms such as dexterity, voice production, and gait. The test was broken into four unique activities, as shown in Table 1.1, as well as a number of surveys that assessed the patients' health during the course of the study. Users could choose to perform as few or as many activities as they wished, but they were encouraged to complete the activities three times per day.

Table 1.1 Sage Bionetworks mPower study [25]

Module: Demographics
  In-app implementation: Participants responded to questions about general demographic topics and health history.
  Unique patients surveyed: 6805

Module: Voice
  In-app implementation: Participants first record the ambient noise level for 5 s and, if acceptable, record themselves saying "aaah" for 10 s.
  Unique patients surveyed: 5826

For those who self-identified as having a PD diagnosis, the app suggested that they perform the activities before medication, immediately after medication, and at some other time during the day. These medication time points were interpreted to mean: time of best symptom control; on medication, but not immediately before or after; time of worst symptoms; not on medications; and not applicable, respectively. This information, crossed with the clinical diagnosis responses from the demographics survey, led to three groups of patients and data, as shown in Fig. 1.2. Patients who had taken medication prior to the voice test were not used as participants in the analysis. This data cleaning was done so that the voice of the patient would depict the most extreme effects of PD without the effect of any medication; the assumption is that the voice features will then be noticeably different from those of the controls. In this experiment, participants who do not have a professional diagnosis of PD are used as controls.

The voice data available in the mPower dataset consists only of 10 s of sustained phonation of the /aa/ phoneme. This was primarily driven by the need to maintain the anonymity and confidentiality of the voice samples, as there was concern that voice samples would be identifiable if free speech were collected. This was not based on clinical relevance, although there was some validation, as mentioned above, by the work of Tsanas et al. Using vocal tests is not currently a part of the clinical diagnosis of PD beyond noting hypophonia, a general loss of voice volume, which is part of parkinsonism (and a question on the UPDRS).

There is a clear need to update the state of Parkinson's diagnosis given the poor average clinical diagnostic accuracy of non-experts (73.8%) and the average accuracy of movement disorder specialists (79.6% without follow-up, 83.9% after follow-up), using pathological post-mortem examination as ground truth [26]. This study considered parkinsonism due to something other than PD (e.g., essential tremor) as an error. When PD was diagnosed early in its course, accuracy approached chance, near 50%. Considering the heterogeneity in the biology driving parkinsonism, PD can be difficult to diagnose accurately.

The mPower dataset provides a means of demonstrating the effectiveness of automated machine learning based models in accurately diagnosing Parkinson's disease patients. The dataset provides many biomarkers which, combined, can offer reliable diagnoses for patients. The tool also provides two main advantages: more detailed data about the state of a patient for clinicians, and an easy-to-use app that can be completed within minutes. In total, the dataset contained 8320 patients who completed at least one survey. Of these, 5826 patients enrolled in the voice survey. Please reference the

Fig. 1.2 Flowchart showing the way the data was split according to health codes provided by mPower

mPower study for more information on the demographic breakdown of the dataset [25]. The dataset was not cross-sectional, so a single patient could contribute multiple voice recordings. Patients with a specific health code appeared exclusively in either the testing or training set in the initial splitting of the samples, so a patient in the training set did not appear in the testing set.
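This patient-exclusive split can be reproduced with a grouped splitter. Below is a minimal sketch, assuming a feature table with one row per recording and illustrative column names (healthCode for the mPower participant identifier; the file name is hypothetical):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# One row per voice recording; column names here are illustrative.
df = pd.read_csv("voice_features.csv")

# GroupShuffleSplit keeps all recordings that share a health code on
# exactly one side of the split, so no patient leaks across sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["healthCode"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no health code appears in both sets.
assert not set(train["healthCode"]) & set(test["healthCode"])
```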

1.3 Methods

Broadly, machine learning requires two steps: feature selection and model application. Due to the high dimensional nature of audio, the features had to first be

split into underlying components that reflect the physiological characteristics we wished to use for diagnosis. These features are correlated with the target class we were looking to measure (disease diagnosis). In order to use the audio collected from the mPower study, the data needed to first pass through audio feature selection algorithms, which produced two classes of features. In the case of diagnosis, the algorithms solve the problem of taking a set of features F = {f_1, …, f_n}, where n is the number of features, and finding the class c_target ∈ {c_1, …, c_m}, where m is the number of classes (here taken to be 2). In this case of PD diagnosis using audio, the target classes are either PD patient or control, and the features are the spectral and lower level information that can be extracted from the audio.

The advantage of a machine learning approach compared to conventional diagnostic approaches is that the result is readily reproducible and automated. An expert making a diagnosis may be influenced by subjective factors, while the algorithms can provide reliable and accurate prediction of disease severity and diagnosis. In addition, machine learning can detect and quantify features that are apparent to the human ear but to a much finer and more accurate degree, allowing, for example, detection of the same features much earlier in the disease course, before they are audible to an expert human observer.

The approach taken in this study was to first remove the background noise of the audio using voice activation detection. This preprocessing step was required in order to pass only raw voice into the audio feature extraction algorithms. The cleaned audio was then passed through two separate algorithms for feature extraction: the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) and the Audio Visual Emotion recognition Challenge 2013 (AVEC) features. The output of the AVEC feature extraction algorithm was then passed through the maximum relevance minimum redundancy algorithm for robust dimensionality reduction. These features were then passed into the machine learning models, as shown in Fig. 1.3.

Fig. 1.3 Algorithm for PD Diagnosis


1.3.1 Voice Activation Detection

The audio was first passed into an algorithm for Voice Activation Detection (VAD). This approach removes the background noise from the audio while preserving the audio from voice production, and was completed using the VOICEBOX toolkit. The method is based on an objective measurement definition from the Telecommunication Standardization Sector of the International Telecommunication Union (ITU) [27]. The algorithm for Voice Activation Detection is based on measuring the instantaneous power over the period of time that speech is occurring, called the speech level. The audio of speech is defined as a discrete signal x_0, …, x_n sampled at a rate of f Hz with a resolution of at least 12 bits per sample. Defining the time constant of smoothing as τ, the rectified signal can be smoothed using exponential averaging:

P_i = e^(−t/τ) · P_{i−1} + (1 − e^(−t/τ)) |x_i|    (1.1)

Q_i = e^(−t/τ) · Q_{i−1} + (1 − e^(−t/τ)) |P_i|    (1.2)

The process of exponential averaging is completed twice, producing the twice-smoothed signal Q_0, …, Q_n, where t = 1/f and:

P_0 = 0    (1.3)

Q_0 = 0    (1.4)

Because of the natural aspiration during running speech, there are gaps when the speaker is not producing any sounds. To account for this, voice activation detection defines a hangover, set in seconds, which ensures that the algorithm recognizes these aspirated regions as voiced and does not stop abruptly during a specific portion of the audio. Here it is defined as H, and the normalized value based on the sampling time is called I:

I = ⌈H / t⌉    (1.5)

I is the number of time steps to allow for before labeling a portion of audio an unvoiced region. M is defined as the margin in decibels, the difference between the threshold and the active speech level. The signal sums are:

x̂ = Σ_{i=0}^{n} x_i    (1.6)

x̂_sq = Σ_{i=0}^{n} x_i²    (1.7)

We then define a series of thresholds applied over the signal Q_i (also known as the envelope), called C_i for each time step. Beginning at the start of the audio, set the activity counts a_j equal to zero and the hangover counts h_j equal to the threshold I. If the value of Q_i is greater than or equal to the threshold C_i, then the activity count a_j is incremented and the hangover time h_j is reset to zero. If the threshold is not met and the hangover time has not expired (h_j < I), then the hangover count h_j is incremented. Otherwise, the variables are not changed. Then the values

A_j = 10 · log( x̂_sq · v / (n · a_j) ) − 20 · log(r)    (1.8)

C_j = 10 · log(c_j · v) − 20 · log(r)    (1.9)

can be computed, where A_j is the active level estimate and C_j is the threshold in decibels. An implementation of the voice activation detection algorithm used in this analysis is provided in the VOICEBOX toolkit as the activlev function by Brookes et al. [28].
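As an illustration of Eqs. (1.1)–(1.5), a simplified sketch of the envelope and hangover logic follows; this is not the full ITU-T P.56 procedure implemented by activlev, and the series of thresholds is reduced here to a single fixed level:

```python
import numpy as np

def smoothed_envelope(x, fs, tau=0.03):
    """Twice exponentially smoothed rectified envelope, Eqs. (1.1)-(1.4)."""
    g = np.exp(-1.0 / (fs * tau))       # e^(-t/tau) with t = 1/f
    p = q = 0.0                         # P_0 = Q_0 = 0
    env = np.empty(len(x))
    for i, xi in enumerate(x):
        p = g * p + (1 - g) * abs(xi)   # first smoothing pass, Eq. (1.1)
        q = g * q + (1 - g) * p         # second smoothing pass, Eq. (1.2)
        env[i] = q
    return env

def voiced_mask(x, fs, threshold, hangover=0.2):
    """Label samples as voiced while the envelope exceeds a fixed
    threshold, with a hangover of H seconds (Eq. (1.5)) so that brief
    aspiration gaps inside running speech stay labeled as voiced."""
    env = smoothed_envelope(x, fs)
    hang_limit = int(np.ceil(hangover * fs))   # I = ceil(H / t)
    mask = np.zeros(len(x), dtype=bool)
    hang = hang_limit
    for i, q in enumerate(env):
        hang = 0 if q >= threshold else hang + 1
        mask[i] = hang <= hang_limit
    return mask
```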

1.3.2 Feature Selection

In order to obtain useful information from the audio and perform dimensionality reduction, the data was passed through separate feature extraction algorithms. Each produced a separate range of features corresponding to high level and low level audio information. The output features correspond to the physiological state of the patient through voice production [22, 29]. The end goal of these feature selection algorithms was to produce the largest separation in feature space between the controls and the patients diagnosed with PD. The more sophisticated signal processing algorithms used descriptors and functionals to represent high level information about the audio recordings. Descriptors are information like the "jitter" and "loudness" of the audio, which are series of values over an audio signal. A functional is a function that takes in multiple values and maps them to one value [30]:

X ∈ R^n → x ∈ R    (1.10)

Common functionals are simple statistical functions like mean, median, mode, and standard deviation.


1.3.2.1 Mel-Frequency Cepstrum Coefficients

Mel-Frequency Cepstrum Coefficients (MFCCs) are a very strong feature found in nearly all speech models because of the logarithmic nature of human speech production and recognition [22]. MFCCs accurately relate the power and frequencies of perceived sounds to their true frequencies while encoding the structure of the vocal tract, which imposes changes to those frequencies as air passes from the lungs and out of the nose and mouth. For this reason, these coefficients have been among the most powerful drivers of voice technology improvements. MFCCs are calculated by first computing the discrete Fourier transform on windows of the speech signal; the short-term power spectrum is then extracted, showing the frequencies that make up the window [31]. The power spectrum is then transformed into a log mel-frequency spectrum based on the equation:

M(f) = 2595 · log_10(1 + f/700)    (1.11)

with the constants tuned such that the resulting spectrum simulates the perception of the human ear. The M(f) spectrum is then convolved with a triangular band-pass filter P(f) to produce a new spectrum θ(f):

θ(M_k) = Σ_M P(M − M_k) ψ(M),  k = 1, …, K    (1.12)

The final MFCCs are produced as shown in Eq. (1.13):

MFCC(d) = Σ_{k=1}^{K} X_k cos( d (k − 0.5) π / K ),  d = 1, …, D    (1.13)

where there are K different filters and ψ(M) is the circular critical band-pass function that simulates the frequency band perceptible by the human ear. The value of d is the coefficient index, where the 0th is often ignored [31].
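As a concrete illustration, the mel mapping of Eq. (1.11) takes one line, and a library such as librosa (an assumption; the chapter itself relies on openSMILE-based extractors) supplies the full MFCC computation of Eq. (1.13); the file name below is illustrative:

```python
import numpy as np
import librosa

def hz_to_mel(f):
    """Mel mapping of Eq. (1.11)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Load a 10-s sustained /aa/ phonation at the 44.1 kHz rate quoted above.
y, sr = librosa.load("phonation.wav", sr=44100)

# 14 coefficients, then drop the 0th (overall energy) as noted above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14)[1:]

# One functional per descriptor: the mean over time yields a
# fixed-length feature vector for the classifiers.
features = mfcc.mean(axis=1)
```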

1.3.2.2 GeMAPS Features

The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) was used to generate one of the two feature sets passed to the machine learning models. The GeMAPS features were developed in response to the large amount of variability in the voice features used in clinical and predictive analyses. The parameter set allows for clear replication of studies and a transparent means of demonstrating the effectiveness of models that use voice. These features were originally intended to form a minimal set corresponding to physiological voice production, and were designed with the hope of producing a standard for voice analysis grounded in each feature's theoretical relevance [29].

One distinct advantage of training machine learning models with a minimal set of parameters is that they allow for better generalization on unseen data. The central criteria Eyben et al. [29] used when choosing the parameters were the feature's relationship to the physiological production of voice and the successful use of the feature in past literature. For the low level descriptors, the model includes smoothing over the voiced regions. The functionals applied to the low level descriptors were the arithmetic mean and the standard deviation normalized by the arithmetic mean. The GeMAPS model also includes additional functionals for pitch and loudness: the 20th, 50th, and 80th percentiles, the 20th–80th percentile range, and the mean and standard deviation of the slope of rising/falling audio [29]. All the functionals are applied only to the voiced regions, except for the loudness descriptor. The low level descriptors account for 56 of the 62 total features within GeMAPS; the remaining six are temporal features. The full list of low level descriptors is provided in Table 1.2 and the list of temporal features in Table 1.3. All these features were passed directly to the machine learning models.

1.3.2.3 Audio Visual Emotion Recognition Challenge 2013 (AVEC) Features

The Audio Visual Emotion recognition Challenge 2013 (AVEC) feature set was also used in the Parkinson's disease diagnosis model. The AVEC set was composed of 2268 features in total: 32 energy and spectral descriptors × 42 functionals, 6 voice-related low level descriptors × 32 functionals, 32 delta coefficients of the energy/spectral low level descriptors × 19 functionals, 6 delta coefficients related to voicing × 19 functionals, as well as 10 other durational features describing the timing of voiced and unvoiced regions [32]. The descriptors are provided in Table 1.4 and the functionals in Table 1.5. The 32 energy and spectral features provided by AVEC 2013 include MFCCs 1–13 (d = 1, …, 13 as per Eq. (1.13)). The openSMILE toolkit provides the functions to extract these features directly [33]. The AVEC features extracted here were passed into the Maximum Relevance Minimum Redundancy algorithm in order to reduce the dimensionality of the data and provide the most pertinent features for Parkinson's disease diagnosis to the machine learning models.

1.3.3 Maximum Relevance Minimum Redundancy

The Maximum Relevance Minimum Redundancy (mRMR) algorithm [34] evaluates the redundancy among features and preserves the features that have the highest predictive correlation to the target class.


Table 1.2 GeMAPS 18 low level descriptors [32]

Frequency related parameters:
- Pitch: logarithmic F0 on a semitone frequency scale, starting at 27.5 Hz (semitone 0)
- Jitter: deviations in individual consecutive F0 period lengths
- Formant 1, 2, and 3 frequency: center frequency of the first, second, and third formants
- Formant 1 bandwidth: bandwidth of the first formant

Energy/amplitude related parameters:
- Shimmer: difference of the peak amplitudes of consecutive F0 periods
- Loudness: estimate of perceived signal intensity from an auditory spectrum
- Harmonics-to-noise ratio (HNR): relation of energy in harmonic components to energy in noise-like components

Spectral (balance) parameters:
- Alpha ratio: ratio of the summed energy from 50–1000 Hz and 1–5 kHz
- Hammarberg index: ratio of the strongest energy peak in the 0–2 kHz region to the strongest peak in the 2–5 kHz region
- Spectral slope 0–500 Hz and 500–1500 Hz: linear regression slope of the logarithmic power spectrum within the two given bands
- Formant 1, 2, and 3 relative energy: ratio of the energy of the spectral harmonic peak at the first, second, and third formant's center frequency to the energy of the spectral peak at F0
- Harmonic difference H1–H2: ratio of the energy of the first F0 harmonic (H1) to the energy of the second F0 harmonic (H2)
- Harmonic difference H1–A3: ratio of the energy of the first F0 harmonic (H1) to the energy of the highest harmonic in the third formant range (A3)

Table 1.3 GeMAPS 6 temporal features [32]

- Rate of loudness peaks, i.e., the number of loudness peaks per second
- Mean length and standard deviation of continuously voiced regions (F0 > 0)
- Mean length and standard deviation of unvoiced regions (F0 = 0; approximating pauses)
- Number of continuous voiced regions per second (pseudo syllable rate)

Table 1.4 AVEC 2013 32 low level descriptors [32]

Energy and spectral (32): loudness (auditory model based), zero crossing rate, energy in bands from 250–650 Hz and 1 kHz–4 kHz, 25%, 50%, 75%, and 90% spectral roll-off points, spectral flux, entropy, variance, skewness, kurtosis, psychoacoustic sharpness, harmonicity, flatness, MFCC 1–16

Voicing related (6): F0 (sub-harmonic summation, followed by Viterbi smoothing), probability of voicing, jitter, shimmer (local), jitter (delta: "jitter of jitter"), logarithmic harmonics-to-noise ratio (logHNR)

Table 1.5 Set of all AVEC 42 functionals

Statistical functionals (23): (positive^a) arithmetic mean, root quadratic mean, standard deviation, flatness, skewness, kurtosis, quartiles, inter-quartile ranges, 1% and 99% percentiles, percentile range 1–99%, percentage of frames contour is above minimum +25%, 50%, and 90% of the range, percentage of frames contour is rising, maximum, mean, minimum segment length^{b,c}, standard deviation of segment length^{b,c}

Regression functionals^b (4): linear regression slope and corresponding approximation error (linear), quadratic regression coefficients and approximation error (linear)

Local minima/maxima related functionals^b (9): mean and standard deviation of rising and falling slopes (minimum to maximum), mean and standard deviation of inter-maxima distances, amplitude mean of maxima, amplitude range of minima, amplitude range of maxima

Other^{b,c} (6): LP gain, LPC 1–5

^a For delta coefficients the mean of only positive values is applied; otherwise the arithmetic mean is applied
^b Not applied to delta coefficient contours
^c Not applied to voicing related LLD [32]

This approach has been shown to be very effective in a number of different applications, including depression assessment [35] and gene selection [36]. The output of the mRMR algorithm is a ranked set of features that minimizes the redundancy among the features and maximizes the relevance of the features with respect to the target labels.

The F statistic determines the strength of the relevance of a specific feature vector with respect to the target class. It is defined as:

F(x_i) = ( Σ_{k=1}^{K} n_k (x̄_k − x̄)² ) / ( σ² (K − 1) )    (1.14)

where x_k are the feature values within class k. The value x̄ is the mean of a specific feature across all the samples, n_k is the number of feature vectors of a specific class k, K is the number of classes, and σ² is the pooled variance, defined as:

σ² = ( Σ_{k=1}^{K} (n_k − 1) σ_k² ) / (n − K)    (1.15)

where σ_k is the variance of a specific feature across all the samples and n is the number of samples. The average relevance of the features can be defined as V_F:

V_F = (1/|S|) Σ_{i∈S} F(i)    (1.16)

where S is the set of feature vectors. The redundancy of the individual features is defined using the Pearson correlation:

W_c = (1/|S|²) Σ_{i∈S} Σ_{j∈S} |c_p(i, j)|    (1.17)

With redundancy and relevance defined, the goal is to minimize redundancy and maximize relevance. This is achieved with the optimization:

arg max_{x∈S} (V_F − W_F)    (1.18)

When W_F and V_F are defined as vectors with the index representing the feature number, the output of the mRMR algorithm is a set of ranked feature vectors based on minimized redundancy and maximized relevance.
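A simplified greedy implementation of this selection loop, using scikit-learn's F statistic for the relevance term of Eq. (1.14) and the absolute Pearson correlation for the redundancy term of Eq. (1.17) (a sketch; the reference implementation in [34] differs in its details):

```python
import numpy as np
from sklearn.feature_selection import f_classif

def mrmr_rank(X, y, n_select):
    """Greedy mRMR: at each step pick the feature maximizing relevance
    (F statistic) minus the mean absolute Pearson correlation with the
    features already selected, approximating Eq. (1.18)."""
    F, _ = f_classif(X, y)                    # relevance of each feature
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(F))]            # start from the most relevant
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < n_select and candidates:
        scores = {j: F[j] - corr[j, selected].mean() for j in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected
```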

1.3.4 Machine Learning

For our models we utilized machine learning to develop algorithms that accurately differentiate Parkinson's disease patients from controls. This is a binary classification problem because there are only two classes. Other types of models would aim to differentiate Parkinson's disease patients from patients with similar symptoms of other causes, but those models are not explored here. We used a series of standard classifiers, from decision trees to artificial neural networks, and compared the performance of each on the test set using tenfold cross validation. Each feature set, GeMAPS and AVEC, required separate models due to the difference in input size and feature meaning, so in total there were 12 models generated for the entire dataset. With the exception of the neural network models, all models were trained using scikit-learn's open-source machine learning toolbox [37].

1.3.4.1 Cross Validation and Grid Search

After generating the cleaned raw audio through voice activation detection and passing it through the feature extraction algorithms, as shown in Fig. 1.3, the features were fed into the machine learning models. In order to demonstrate the generality of the resulting models, the data needed to be split using stratified cross validation. This technique randomly separates the samples into 10 sets, each containing the same proportion of positives and controls. One of the ten splits is labeled the test set, and the process is repeated 10 times with a different split left out as the test set and the remaining splits used for training. Cross validation is used to ensure that the performance of the models is not a result of the particular distribution of training and testing samples.

The models were trained with grid search: the hyperparameters that need to be tuned are searched exhaustively over every combination of parameter values, and for each combination tenfold cross validation is performed. The resulting models were the ones with the highest average accuracy on the testing set.
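In scikit-learn this procedure corresponds to GridSearchCV over a StratifiedKFold splitter. A sketch, where X_train and y_train stand for a feature matrix and labels from the extraction stage, and the parameter grid is illustrative rather than the grid actually searched:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Ten stratified folds: each fold preserves the PD/control proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Every combination in the grid is evaluated with tenfold cross
# validation; the combination with the highest mean accuracy is kept.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```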

1.3.4.2 Decision Trees

Decision trees are a machine learning approach that takes the input feature vectors X = x_1, x_2, …, x_n and applies splitting thresholds, called decisions, along a specific dimension. A tree is formed by performing this operation recursively on each new split. Each decision tries to maximize the separation of the classes by optimizing a separation criterion. Common criteria are the Gini impurity, defined in Eq. (1.20), and the information entropy, defined in Eq. (1.21) [38–40]:

π_{kr} = p(k|r) = n_k / n    (1.19)

G(π_{kr}) = Σ_{k=1}^{K} π_{kr} (1 − π_{kr})    (1.20)

E(π_{kr}) = −Σ_{k=1}^{K} π_{kr} ln(π_{kr})    (1.21)


where π_{kr} is the proportion of members of class k in region r and there are K separate classes. The function G is the Gini impurity of the tree and the function E is the information entropy of the split. These functions have the property that they are maximal when there are equal numbers of both classes in a region (π_{kr} = 0.5) and zero when a region is completely homogeneous (π_{kr} = 0 or 1). Minimizing these criteria therefore has the effect of producing an accurate linear decision boundary across the inputs to separate the classes. Recursive splitting of the leaves of the tree produces finer and finer decision boundaries in order to reduce the overall error.

Generally, decision trees tend to overfit to the training set in order to minimize the cost criterion. To mitigate this, the trees are pruned after training so that they perform better in general cases. This is done by removing the branches of the tree that cause the smallest change in the overall error of the final model. In general, these models tend to be suboptimal because the decision boundaries they produce are aligned directly with the axes of the feature space.
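The two criteria of Eqs. (1.20) and (1.21) are easy to verify numerically; a small sketch for a vector of class proportions:

```python
import numpy as np

def gini(p):
    """Gini impurity, Eq. (1.20): maximal at p = 0.5, zero when pure."""
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    """Information entropy, Eq. (1.21), with 0*log(0) treated as 0."""
    p = np.asarray(p)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

print(gini([0.5, 0.5]), gini([1.0, 0.0]))        # 0.5  0.0
print(entropy([0.5, 0.5]), entropy([1.0, 0.0]))  # ~0.693  0.0
```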

1.3.4.3 Random Forest

Random forests attempt to reduce the overfitting error produced by greedy optimization in traditional decision trees by creating many different random samples of the data (called bags) and training individual decision trees on those different combinations of samples. The overall prediction is given by the average of the predictions from the ensemble of trees:

p(k|x) = ( Σ_{k=1}^{K} (T(x) = k) ) / K    (1.22)

for K decision trees, where T(x) is the predicted class for input features x. This process of "bagging" the inputs is known as bootstrap aggregation. In order to reduce the correlation across the individually, randomly trained trees, the "random subspace" method is used to also randomly select which features each model is trained on [39, 41].

1.3.4.4 Extra Trees

Initially introduced as extremely randomized trees, Extra Trees is a machine learning approach that applies stronger randomization in the tree-building process to generate better prediction and generalization. As opposed to the Random Forest approach, the Extra Trees model does not incorporate bootstrap aggregation, and instead uses the entire sample to generate each tree [42].

1.3.4.5 Gradient Boosted Decision Trees

The gradient boosted decision tree technique is an ensemble model technique in which weak classifiers are applied iteratively with different sampling distributions of the inputs, so that over time the model converges on a strong solution to the problem [43]. The model trained in this experiment used mean squared error as the loss for both feature sets.
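The four tree-based models of Sects. 1.3.4.2–1.3.4.5 map directly onto scikit-learn estimators. A sketch, with X and y standing for an extracted feature matrix and labels, and hyperparameters left near their defaults rather than the grid-searched values:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
)
from sklearn.model_selection import cross_val_score

models = {
    "decision_tree": DecisionTreeClassifier(criterion="gini"),
    "random_forest": RandomForestClassifier(n_estimators=300),  # bagging
    "extra_trees": ExtraTreesClassifier(n_estimators=300,
                                        bootstrap=False),       # whole sample
    "gbdt": GradientBoostingClassifier(),                       # boosting
}

# Tenfold cross-validated accuracy for each ensemble.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```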

1.3.4.6 Support Vector Machine

Support Vector Machines (SVMs) [44, 45] work on the principle of linearly separating the classes of points while maximizing the margin around the decision boundary. This is done by taking the sample points X = x_1, x_2, …, x_N and constructing a model of the form:

y(x) = sign( Σ_{k=1}^{N} α_k y_k Φ(x, x_k) + b )    (1.23)

This function is constructed by introducing a parameter w, which is perpendicular to the separating hyperplane, where φ() is the function which maps x to the transformed space:

y_k (w^T φ(x_k) + b) ≥ 1 − ξ_k    (1.24)

α_k is the Lagrange multiplier for the kth sample, and Φ is the projection into a space, via a kernel function, in which the points are linearly separable. Some different basis functions Φ are:

Linear: Φ(x, x_k) = x_k^T x
Polynomial of order D: Φ(x, x_k) = (x_k^T x + 1)^D
Exponential: Φ(x, x_k) = e^{−γ (x_k^T x + 1)² / σ²}

Training involves minimizing the function:

(1/2) w^T w + c Σ_{k=1}^{N} ξ_k    (1.25)


subject to the constraint outlined in Eq. (1.24). The variable ξ is introduced as a slack term to account for the fact that there is not always a hyperplane that completely separates the space [46, 47]. The exponential form is also known as the radial basis function (RBF) kernel, and this is the kernel used for both feature sets. Generating an SVM model with the RBF kernel involves tuning the parameters γ and c. The parameter c is similar to a regularization term: lower values permit more margin violations and generate simpler SVM models, while larger values generate models that are more complex but classify more training samples accurately. The parameter γ affects how the model deals with groups of data points: higher values of γ generate classifications on a smaller, more local scale, while lower values generate decision boundaries influenced by larger numbers of points. In this experiment, c was set to 1.0 and γ was set to the inverse of the number of features, as shown in Eq. (1.26). In the GeMAPS classification model the number of features is 62, and in the model based on the AVEC features it is 1200. The kernel function in both models was the radial basis function kernel, with γ defined as:

\gamma = \frac{1}{N_\mathrm{features}}    (1.26)
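This configuration maps directly onto scikit-learn's SVC, where gamma="auto" computes exactly Eq. (1.26); a minimal sketch with placeholder data names:

```python
# RBF-kernel SVM with c = 1.0 and gamma = 1 / N_features.
from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="auto")  # "auto" -> 1 / n_features
# svm.fit(X, y)   # 62 features for GeMAPS, 1200 for AVEC
```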

1.3.4.7 Artificial Neural Networks

Artificial neural networks, and by extension deep learning, are based on the principle that data is ordered hierarchically. Lower levels of the neural network can encode low-level descriptions of the data, and higher layers can act as detectors for very specific signals. For this reason, they have been very popular and successful in machine learning problems. Artificial neural networks take a number of forms, including Long Short Term Memory networks (LSTMs) [48], convolutional networks [49], and feed-forward networks, also known as fully connected feed-forward neural networks or multilayer perceptrons [50]. Our analysis only used feed-forward neural networks, but given the temporal nature of audio, there is reason to believe that LSTMs would also provide a strong model to predict disease outcomes for patients.

Feed-forward artificial neural networks take in an input x, multiply it by a weight matrix W, and add a bias. This simulates the process of the input passing through the neurons, with the 'weight' representing the connection strength to a given neuron. To account for nonlinearity in the data, a nonlinear activation a(\cdot) is applied to the output. This process is applied iteratively to generate the output of the network, as summed up in Eq. (1.27):

z_i = a_i \left( W_i z_{i-1} + b_i \right), \quad z_0 = x    (1.27)

where z_i represents the output of layer i.


The inputs were normalized with batch normalization, which keeps the layer outputs centered and reduces the impact of the vanishing gradient problem. The vanishing gradient problem occurs when the gradient update of the model is pushed to zero because the value passed into the activation function a(\cdot) is outside the usable range of the function; for example, the tanh activation function has derivatives close to zero outside of the input range [−2, 2]. For our implementation, we used the sigmoid activation function in the output layer and the scaled exponential linear unit (SELU) [51] shown in Eq. (1.28):

\mathrm{selu}(x) = \lambda \begin{cases} \alpha (e^x - 1) & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases}    (1.28)

where α = 1.6733 and λ = 1.0507.
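For reference, Eq. (1.28) is a one-liner in NumPy with the stated constants:

```python
import numpy as np

ALPHA, LAMBDA = 1.6733, 1.0507

def selu(x):
    # lambda * x for x > 0; lambda * alpha * (e^x - 1) otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))
```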

The model architectures are shown in Fig. 1.8. Artificial neural networks are optimized through gradient descent. There are a number of variations of gradient descent optimization algorithms; the one used in this analysis was Adagrad. Adagrad has the advantage of not requiring manual tuning of the learning rate, and it performs larger parameter updates for less frequent data, so it is well suited to sparse datasets [52]. Artificial neural networks are often plagued by overfitting. To combat this, regularization is used to add a cost to the machine learning model for overly complex models. Common regularizers are L_1 and L_2. Any L_n regularizer has the form:

L_{\lVert \theta \rVert_n} = \left( \sum_i \theta_i^n \right)^{1/n}    (1.29)

where θ are the free parameters of the model [53]. The loss of the model was the mean squared logarithmic error:

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2    (1.30)

where \hat{y} is the output of the model and y is the true target class. In addition to regularization, our model used 15% dropout on every layer but the output. Dropout randomly selects a set percentage of neurons and sets their output to zero; this is a way of reducing overfitting in neural network models. The neural networks built in this experiment were developed using the TensorFlow [54] and Keras [55] deep learning libraries. The mean squared logarithmic error loss and the native TensorFlow Adagrad optimizer were used for both the AVEC and GeMAPS models. The neural network models trained for the AVEC and GeMAPS datasets were feed-forward, fully connected deep neural networks.
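To make the architecture concrete, below is a minimal Keras sketch of the network just described, using the GeMAPS hidden-layer widths listed in Fig. 1.8. The L1 penalty strength, learning rate, and other settings not stated in the text are assumptions, not the study's values.

```python
# Feed-forward network: Dense -> BatchNorm -> 15% Dropout per hidden
# layer, SELU activations, L1 regularization, single sigmoid output,
# Adagrad optimizer, and mean squared logarithmic error loss.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(n_inputs=62, hidden=(62, 32, 16, 4, 2, 1)):
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_inputs,))])
    for width in hidden:
        model.add(layers.Dense(width, activation="selu",
                               kernel_regularizer=regularizers.l1(1e-4)))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.15))   # dropout on every layer but output
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=tf.keras.optimizers.Adagrad(),
                  loss="mean_squared_logarithmic_error",
                  metrics=["accuracy"])
    return model
```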

1.4 Results

Scores of accuracy, recall, precision, and F1 were used over the training set for model selection during the course of the study. These metrics are defined in Eqs. (1.31)–(1.34), where TP stands for true positives, TN for true negatives, FP for false positives, and FN for false negatives.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (1.31)

\mathrm{Recall} = \frac{TP}{TP + FN}    (1.32)

\mathrm{Precision} = \frac{TP}{TP + FP}    (1.33)

F_1 = \frac{2 \, (\mathrm{Recall} \cdot \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}}    (1.34)

The training and testing sets were split 90% training and 10% testing. All models used stratified tenfold cross validation in order to eliminate bias in splitting the testing and training sets. The data shown in Tables 1.6 and 1.7 are averages of the tenfold stratified cross validated results plus or minus one standard deviation. Bar graphs for these data are shown in Figs. 1.4 and 1.5. Figures 1.6 and 1.7 show the receiver operating characteristic curves for one split of the training and testing data, with the AUC metrics given in the legend. Classifiers described as performing well either outperformed the other classifiers on a given metric or scored higher than 75% on it; classifiers described as performing poorly under-performed the other classifiers or scored less than 75% on that metric.
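The evaluation protocol sketches as follows; synthetic data stands in for the voice features, and stratification of the 90/10 split is an assumption:

```python
# 90/10 train/test split plus stratified tenfold cross validation,
# reporting mean +/- one standard deviation as in Tables 1.6 and 1.7.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_validate

X, y = make_classification(n_samples=500, n_features=62, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10,
                                          stratify=y, random_state=0)

# With a classifier and an integer cv, scikit-learn uses stratified folds.
scores = cross_validate(GradientBoostingClassifier(), X_tr, y_tr, cv=10,
                        scoring=["accuracy", "f1", "precision", "recall"])
for m in ("accuracy", "f1", "precision", "recall"):
    vals = scores[f"test_{m}"]
    print(f"{m}: {vals.mean():.2f} +/- {vals.std():.2f}")
```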

Table 1.6 AVEC stratified tenfold cross validated results

Model                           Accuracy      F1            Precision     Recall
Decision trees                  0.75 ± 0.02   0.61 ± 0.04   0.65 ± 0.04   0.57 ± 0.05
Extra trees                     0.81 ± 0.02   0.65 ± 0.05   0.89 ± 0.04   0.52 ± 0.05
Gradient boosted decision tree  0.86 ± 0.02   0.79 ± 0.04   0.85 ± 0.03   0.73 ± 0.04
Artificial neural network       0.86 ± 0.01   0.78 ± 0.02   0.75 ± 0.03   0.82 ± 0.02
Random forest                   0.83 ± 0.03   0.72 ± 0.05   0.86 ± 0.04   0.62 ± 0.06
Support vector machine          0.85 ± 0.02   0.77 ± 0.03   0.84 ± 0.03   0.71 ± 0.04


Table 1.7 GeMAPS stratified tenfold cross validated results

Model                           Accuracy      F1            Precision     Recall
Decision trees                  0.72 ± 0.02   0.53 ± 0.05   0.64 ± 0.04   0.46 ± 0.06
Extra trees                     0.78 ± 0.02   0.57 ± 0.06   0.85 ± 0.04   0.43 ± 0.06
Gradient boosted decision tree  0.82 ± 0.03   0.71 ± 0.05   0.79 ± 0.04   0.65 ± 0.06
Artificial neural network       0.76 ± 0.02   0.54 ± 0.06   0.41 ± 0.05   0.79 ± 0.06
Random forest                   0.81 ± 0.03   0.67 ± 0.06   0.82 ± 0.04   0.56 ± 0.06
Support vector machine          0.80 ± 0.02   0.66 ± 0.05   0.78 ± 0.04   0.57 ± 0.05

Fig. 1.4 AVEC results

Fig. 1.5 GeMAPS results


Fig. 1.6 AVEC ROC curve

Fig. 1.7 GeMAPS ROC curve


Fig. 1.8 Neural network architectures. Each model passes the input layer (AVEC model: 1200 neurons; GeMAPS model: 62 neurons) through fully connected hidden layers, each followed by batch normalization and 15% dropout, with L1 regularization and a single output neuron for both models. Hidden layer neuron counts: AVEC model [512, 32, 8, 4, 2, 1]; GeMAPS model [62, 32, 16, 4, 2, 1]. Arrows in the figure represent fully connected layer transitions

The Random Forest model received a high overall area under the curve (AUC) score of 0.899 for the AVEC dataset and 0.880 for the GeMAPS dataset. For the AVEC dataset, the model showed a high accuracy of 83% but a poor recall of 62%; recall for the GeMAPS classifier was 56%. For the GeMAPS dataset, the model also performed with a high accuracy of 81% and precision of 82% but a low F1 of 67% and recall of 56% (Table 1.8).

The Artificial Neural Network performed well on the dataset, obtaining the highest overall accuracy of 86% with the smallest variance in cross validation. As Figs. 1.6 and 1.7 show, the artificial neural network also produced a very clear separation of the classes, with AUC scores of 0.915 and 0.823 for the AVEC and GeMAPS features respectively. The artificial neural network also achieved the best recall of 82% and a close second-best F1 score of 78%. Overall, however, the network did not perform well on the GeMAPS feature set, with a low F1 score of 54%, a low precision of 41%, and a poor accuracy of 76%.

The Decision Tree classifier performed with an accuracy of 75% and 72% on the AVEC and GeMAPS features respectively. The decision tree performed poorly on the metrics of precision, recall, and F1 on both feature sets, often scoring less than 70% and as low as an average recall of 46% on the GeMAPS features. The decision tree also performed the worst on the AUC metric, with scores of 0.78 and 0.745 for the AVEC and GeMAPS features respectively.


Table 1.8 Extracted voice features using PyAudio [3]

1. Zero crossing rate: The rate of sign-changes of the signal during the duration of a particular frame.
2. Energy: The sum of squares of the signal values, normalized by the respective frame length.
3. Entropy of energy: The entropy of sub-frames' normalized energies. It can be interpreted as a measure of abrupt changes.
4. Spectral centroid: The center of gravity of the spectrum.
5. Spectral spread: The second central moment of the spectrum.
6. Spectral entropy: Entropy of the normalized spectral energies for a set of sub-frames.
7. Spectral flux: The squared difference between the normalized magnitudes of the spectra of the two successive frames.
8. Spectral roll-off: The frequency below which 90% of the magnitude distribution of the spectrum is concentrated.
9–21. MFCCs: Mel-Frequency Cepstral Coefficients form a cepstral representation where the frequency bands are not linear but distributed according to the mel-scale.
22–33. Chroma vector: A 12-element representation of the spectral energy where the bins represent the 12 equal-tempered pitch classes of western-type music (semitone spacing).
34. Chroma deviation: The standard deviation of the 12 chroma coefficients.

The Gradient Boosted classifier performed well on nearly every metric. The classifier generated the best overall accuracy scores of 86% for the AVEC features and 82% for the GeMAPS features. The gradient boosted classifier also performed the best on the ROC AUC score, with 0.924 and 0.892 for AVEC and GeMAPS respectively, indicating that this model produces the best separation of the two classes (PD and control). The classifier also outperformed all models by achieving the highest average F1 score of 79% on the AVEC features.

The Extra Trees classifier achieved the best average precision of 89% on the AVEC features and a high precision of 85% on the GeMAPS dataset. It also performed with a high accuracy of 81% and 78% on AVEC and GeMAPS respectively. However, the model produced low recall and F1 scores: all were below 60% except the F1 score on the AVEC feature set.

The SVM outperformed many other classifiers, with high overall accuracy and high F1, precision, and recall scores on the AVEC features, but it tended to perform worse on the GeMAPS features. The SVM model was close to the artificial neural network and gradient boosted decision tree, with 85% accuracy and a high AUC score of 0.911 on the AVEC features. The SVM also achieved high precision on both the AVEC and GeMAPS features and a high F1 score of 0.77 for AVEC and 0.66 for GeMAPS.

Nearly all models performed better on the AVEC feature set than on the GeMAPS features. The AVEC features provided the highest overall accuracy for the Gradient Boosted Decision Tree and Artificial Neural Network. The ROC curves demonstrate a clear trend: all machine learning models gained higher AUC scores and generated better separations between the classes with the AVEC features, showing more easily separated classes than the GeMAPS dataset.

1.5 Discussion

The audio samples used in the machine learning models were very short, only 10 s. Given the high accuracy achieved by the models, we are optimistic about the use of voice as a dense biomarker for PD diagnosis, used along with other clinical measures. Our model only uses self-reported measures of clinical diagnosis, as opposed to the most widely accepted biomarkers for diagnosis such as DaT scans or clinician-scored motor evaluation on the Unified Parkinson's Disease Rating Scale (UPDRS). With better benchmarks for disease severity or diagnosis, better machine learning models can be constructed and implemented. In addition, the information in each sample was limited by the form of the data: a patient vocalizing /aa/ for 10 s is much less rich than a clinician visit where multiple symptoms can be assessed.

Previous studies, using strict clinical criteria and following patients to post-mortem brain autopsy as the gold standard for diagnosis, have shown the accuracy of PD diagnosis to be only 26–85%, depending on the stage of symptoms and responsiveness to medications [26, 56, 57]. Post-mortem biopsy confirmation is the ideal ground truth given the biological confirmation, but it is not feasible in practice at scale. For this reason, these computing approaches are likely best suited to quantifying patient symptoms, focusing on the individual and learning each phenotype to help with tracking and managing symptoms, rather than on achieving 100% accuracy in diagnosis.

Notably, when exploring performance across studies we see mention of sensitivity, specificity, accuracy, recall, bias, and F1 scores. When applying machine learning techniques to healthcare data, although there is a well-established trend of using sensitivity, specificity, and accuracy in describing the results of a test, these may not be best for interpreting results. Recall has a higher sensitivity to false negatives, while precision has a higher sensitivity to false positives. F1 provides a general assessment of model performance based on a weighted average of recall and precision. Although accuracy is more intuitive, basing performance on the number of correct predictions over the total observations, if a dataset is unbalanced, with an unequal number of false positives or false negatives, or if the cost of each error is not the same, it will provide a misleading performance estimate. For this reason we additionally provide F1 scores, which adjust for false negatives and positives and provide a better estimate of performance. Especially in human studies, including this dataset, we often have imbalanced data, and it is important to exercise caution when evaluating the performance of models. In this case our true positive is based on a patient's self-report of the existence of a diagnosis, which may not be an accurate representation of the disease.


F1 accounts for both type I (false positive) and type II (false negative) errors. In most clinical diagnoses, the aim is to minimize type II error, i.e., failing to detect a disease, as opposed to misdiagnosis, because the former is more detrimental to the patient. Therefore, recall is a good metric to compare to traditional diagnostic procedures, since it emphasizes minimizing false negatives; there should nonetheless be an effort to improve both recall and precision in order to improve overall patient quality of life. However, our paper proposes automated voice analysis as a validation of a clinician diagnosis, given that we rely here on patient self-report of their diagnosis; the algorithm's performance here is therefore limited to that of the clinician. We are confident that with more patient-driven data, the accuracy of these models using speech as a biomarker for disease can be improved, as shown in Tsanas et al. [20], especially when validated against currently available biomarkers such as the DaT scan.

Ultimately, PD diagnosis primarily relies on clinically observed ratings, and biomarker confirmation is not sought in the majority of cases because of the clear response most patients show to treatment. The goal of a digital biomarker then shifts toward not only accurately capturing the state of PD in a patient, but also learning the individual patient's symptoms and providing enhanced care by assisting with treatment management and assessing severity progression.

The models trained on the AVEC features often outperformed the models trained on GeMAPS features on metrics of accuracy, precision, recall, and F1. A possible reason for this trend is that more information correlating with PD diagnosis is encoded within the AVEC feature vectors: the AVEC features contain 1200 unique dimensions of information drawn from the audio recording, while the GeMAPS features contain only 62. This validates the concept that as more information is drawn from the patient regarding their health, better diagnostic accuracy can be acquired using automated machine learning models.

This issue also leads to the important topic of interpretability, which is a challenge in applying machine learning to health data. Specifically, as we incorporate more features, and ones which are less intuitive and more abstract, understanding which features of PD the model bases its predictions on becomes more difficult. This is an important consideration given the importance of clinical validation and the need to bridge clinical care with providers. It will be challenging to adopt technology as a "black box" if regulatory agencies and providers are unable to make sense of what the model is basing its decisions on. For these reasons it is clear why the GeMAPS feature set was created: in an attempt to simplify and reduce the important features for detecting illness. At the same time, given the complexity and breadth of illness pathology encoded in voice, we may need to harness larger feature sets such as AVEC. This places emphasis on the data science methodology to communicate and visualize the methods, making them accessible to a larger audience. We attempt to hold to this process here by drawing clinical correlations to the important features we found and by explaining variations in the performance of the different models.

Several extracted features have clinical significance, including loudness. People with PD tend to have loss of volume in their voice, and loudness changes and other


amplitude measures may demonstrate this. Other features, including formants and fundamental frequency measures, may also reflect typical changes in PD due to the change in vibration ability and stiffness of the vocal tract as dopamine loss continues. MFCCs also reflect the power spectrum of the voice, which is affected by PD.

The best classifier, based on the data provided in Tables 1.6 and 1.7, is the Gradient Boosted Decision Tree. This model is especially effective in classifying the dataset while maintaining high values of precision, recall, and F1. It shows the best separation of PD and control through the AUC metric of 0.924 and performs with an accuracy of 86% on the AVEC selected features. There are clear limitations to speech as a single biomarker for clinical diagnosis, but given the success of this model and others [20], we are optimistic that algorithms incorporating multiple modalities, such as speech, brain scans, or accelerometers, could be used in concert to create a robust clinical tool to aid neurologists in diagnosing PD and PD-like symptoms. Notably, PD has a variety of presentations, with as little as a third of patients presenting with resting tremor, making reliance on a single symptom likely to yield many false negatives. We believe for these reasons that tracking multiple symptoms will increase sensitivity and specificity for PD. Our earlier report using accelerometer data also showed promising results, separately from voice, in classifying PD from control using only movement data from an iPhone carried in a pants pocket [58]. Further work remains to demonstrate accuracy by pooling together different markers.

1.6 Conclusion

Disease diagnosis and prediction is possible through automated machine learning architectures using non-invasive voice biomarkers as features. Our analysis provides a comparison of the effectiveness of various machine learning classifiers in disease diagnosis with noisy and high dimensional data. After thorough feature selection, clinical-level accuracy is possible. Our aim was to provide a walk-through of our analysis such that anyone can recreate the same work or apply the same methods to a new dataset. We have built a feature extraction pipeline for this purpose, available at the following GitHub link (https://github.com/NeurolexDiagnostics/Voice-Analysis-Pipeline), which can be used out-of-the-box [59]. A new guide to voice computing in Python was also recently released by our collaborator, with an open-source library available via GitHub, to make voice computing accessible to all [60].

These results are promising because they may introduce novel means to assess patient health and neurological diseases using voice data. Due to the high accuracy achieved by the models with these short audio clips, there is reason to believe denser feature sets with spoken word, video, or other modalities would aid in disease prediction and clinical validation of diagnosis in the future, but much work remains. We are currently working on additional and more robust datasets which include voice data beyond a single spoken phoneme. We hope to demonstrate robust voice models to detect not only individual diseases but to help clarify difficult clinical boundaries, such as those between PD and other illnesses which involve parkinsonism, including Lewy Body Dementia, Multiple System Atrophy, Progressive Supranuclear Palsy, and Corticobasal Degeneration, among other PD mimickers.

Acknowledgements We would like to acknowledge Yasin Özkanca, Cenk Demiroglu, Dong Si and David C. Atkins for their help in the preparation and review of the experiments described in this research. Yasin Özkanca and Cenk Demiroglu helped in the feature extraction and generation of the mRMR algorithms. Dong Si and David C. Atkins provided their input in the original study and feedback about the content of this work. Data was contributed by users of the Parkinson mPower mobile application as part of the mPower study developed by Sage Bionetworks and described in Synapse [25]. https://doi.org/10.7303/syn4993293.

Disclosure Statement At the time of manuscript preparation, Dr. Hosseini Ghomi was an employee of NeuroLex Laboratories and owns stock in the company.

References 1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283). 2. Wood, M. (2017). Introducing Gluon: a new library for machine learning from AWS and Microsoft: Introducing Gluon. Amazon Web Services. https://aws.amazon.com/blogs/aws/ introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/. 3. Giannakopoulos, T. (2015). pyAudioAnalysis: An open-source python library for audio signal analysis. PLoS One 10(12), e0144610. 4. Bedi, G., Carrillo, F., Cecchi, G. A., Slezak, D. F., Sigman, M., Mota, N. B., et al. (2015). Automated analysis of free speech predicts psychosis onset in high-risk youths. npj Schizophrenia, 1(1). Article number: 15030. 5. Pestian, J. P., Sorter, M., Connolly, B., Bretonnel Cohen, K., McCullumsmith, C., Gee, J. T., et al. (2017). A machine learning approach to identifying the thought markers of suicidal subjects: A prospective multicenter trial. Suicide and Life-Threatening Behavior, 47(1), 112–121. 6. Khodabakhsh, A., Yesil, F., Guner, E., & Demiroglu, C. (2015). Evaluation of linguistic and prosodic features for detection of Alzheimer’s disease in Turkish conversational speech. EURASIP Journal on Audio, Speech, and Music Processing, 2015, 9. 7. Human voiceome project 2019. 8. Tysnes, O.-B., & Storstein, A. (2017). Epidemiology of Parkinson’s disease. Journal of Neural Transmission, 124(8), 901–905. 9. Parkinson’s Foundation. Statistics. https://www.parkinson.org/Understanding-Parkinsons/ Statistics. 10. Savitt, J. M., Dawson, V. L., & Dawson, T. M. (2006). Diagnosis and treatment of Parkinson disease: Molecules to medicine. Journal of Clinical Investigation, 116(7), 1744–1754. 11. Goetz, C. G., Tilley, B. C., Shaftman, S. R., Stebbins, G. T., Fahn, S., Martinez-Martin, P., et al. (2008). Movement disorder society-sponsored revision of the unified Parkinson’s disease rating scale (MDS-UPDRS): Scale presentation and clinimetric testing results. Movement Disorders, 23(15), 2129–2170. 12. Hoehn, M. M., & Yahr, M. D. (1967). Parkinsonism: Onset, progression and mortality. Neurology, 17(5), 427–442. 13. Magrinelli, F., Picelli, A., Tocco, P., Federico, A., Roncari, L., Smania, N., et al. (2016). Pathophysiology of motor dysfunction in Parkinson’s disease as the rationale for drug treatment and rehabilitation. Parkinson’s Disease, 2016, 9832839.


14. Uitti, R. J., Baba, Y., Wszolek, Z. K., & Putzke, D. J. (2005). Defining the Parkinson’s disease phenotype: Initial symptoms and baseline characteristics in a clinical cohort. Parkinsonism & Related Disorders, 11(3), 139–145. 15. Asgari, M., & Shafran, I. (2010). Extracting cues from speech for predicting severity of Parkinson’s disease. In 2010 IEEE International Workshop on Machine Learning for Signal Processing, pp. 462–467. 16. Bernheimer, H., Birkmayer, W., Hornykiewicz, O., Jellinger, K., & Seitelberger, F. (1973). Brain dopamine and the syndromes of Parkinson and Huntington clinical, morphological and neurochemical correlations. Journal of the Neurological Sciences, 20(4), 415–455. 17. Harel, B., Cannizzaro, M., & Snyder, P. J. (2004). Variability in fundamental frequency during speech in prodromal and incipient Parkinson’s disease: A longitudinal case study. Brain Cognition, 56(1), 24–29. 18. Harel, B. T., Cannizzaro, M. S., Cohen, H., Reilly, N., & Snyder, P. J. (2004). Acoustic characteristics of Parkinsonian speech: A potential biomarker of early disease progression and treatment. Journal of Neurolinguistics, 17(6), 439–453. 19. Garcia, A. M., Carrillo, F., Orozco-Arroyave, J. R., Trujillo, N., Vargas Bonilla, J. F., Fittipaldi, S., et al. (2016). How language flows when movements don’t: An automated analysis of spontaneous discourse in Parkinson’s disease. Brain and Language, 162, 19–28. 20. Tsanas, A., Little, M. A., McSharry, P. E., Spielman, J., & Ramig, L. O. (2012). Novel speech signal processing algorithms for high-accuracy classification of Parkinson’s disease. IEEE Transactions on Biomedical Engineering, 59(5), 1264–1271. 21. Tsanas, A., Little, M. A., & Ramig, L. O. (2010). Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4), p. 10. 22. Khan, T. (2014). Running-speech MFCC are better markers of Parkinsonian speech deficits than vowel phonation and diadochokinetic. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva24645. 23. Arora, S., Venkataraman, V., Zhan, A., Donohue, S., Biglan, K. M., Dorsey, E. R., et al. (2015). Detecting and monitoring the symptoms of Parkinson’s disease using smartphones: A pilot study. Parkinsonism & Related Disorders, 21(6), 650–653. 24. Zhan, A., Mohan, S., Tarolli, C., Schneider, R. B., Adams, J. L., Sharma, S., et al. (2018). Using smartphones and machine learning to quantify Parkinson disease severity: The mobile Parkinson disease score. JAMA Neurology, 75(7), 876–880. 25. Bot, B. M., Suver, C., Neto, E. C., Kellen, M., Klein, A., Bare, C., et al. (2016). The mPower study, Parkinson disease mobile data collected using ResearchKit. Scientific Data, 3, 160011. 26. Rizzo, G., Copetti, M., Arcuti, S., Martino, D., Fontana, A., & Logroscino, G. (2016). Accuracy of clinical diagnosis of Parkinson disease a systematic review and meta-analysis. Neurology, 86(6), 566–576. 27. ITU-T. Objective measurement of active speech level. Recommendation P.56. International Telecommunications Union, 2011. 28. Brookes, M. (1997). VOICEBOX: A speech processing toolbox for MATLAB. Software library, Imperial College, London, 1997–2018. 29. Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., et al. (2016). The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202. 30. 
Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F., et al. (2013). The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon. 31. Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582–589.


32. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., et al. (2013). AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (pp. 3–10). New York: ACM. 33. Eyben, F., Wöllmer, M., & Schuller, B. (2010). Opensmile: the munich versatile and fast opensource audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia (pp. 1459–1462). New York: ACM. 34. Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1226–1238. 35. Özkanca, Y., Demiroglu, C., Besirli, A., & Celik, S. (2018). Multi-lingual depression-level assessment from conversational speech using acoustic and text features. Proceedings of the INTERSPEECH 2018 (pp. 3398–3402). 36. Zhang, Y., Ding, C., & Li, T. (2008). Gene selection algorithm by combining ReliefF and mRMR. BMC Genomics, 9(2), S27. 37. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825– 2830. 38. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. 39. Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer. 40. Breiman, L. (2017). Classification and regression trees. Abingdon: Routledge. 41. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. 42. Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42. 43. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232. 44. Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300. 45. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. 46. Hsu, C.-W., Chang, C.-C., Lin, C.-J., et al. (2003). A Practical Guide to Support Vector Classification. 47. Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85. 48. Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to Forget: Continual Prediction with LSTM. 49. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097–1105). 50. Rosenblatt, F. (1961). Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Tech. rep., Cornell Aeronautical Lab Inc., Buffalo, NY. 51. Pedamonti, D. (2018). Comparison of nonlinear activation functions for deep neural networks on MNIST classification task. Preprint. arXiv:1804.02763. 52. Ruder, S. (2016). An overview of gradient descent optimization algorithms. Preprint. arXiv:1609.04747. 53. Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 78). New York: ACM. 54. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). Tensorflow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265–283). 55. Chollet, F., et al. (2015). Keras. 56. Adler, C. H., Beach, T. G., Hentz, J. G., Shill, H. A., Caviness, J. 
N., Driver-Dunckley, E., et al. (2014). Low clinical diagnostic accuracy of early vs advanced Parkinson disease. Neurology, 83, 406–412.


57. Schrag, A., Ben-Shlomo, Y., & Quinn, N. (2002). How valid is the clinical diagnosis of Parkinson’s disease in the community? Journal of Neurology, Neurosurgery, and Psychiatry, 73(5), 529–534. 58. Pittman, B., Hosseini Ghomi, R., & Si, D. (2018). Parkinson’s disease classification of mPower walking activity participants. In IEEE Engineering in Medicine and Biology Conference. 59. Zhang, L., Chen, X., Vakil, A., Byott, A., & Ghomi, R. H. (2019). DigiVoice: Voice biomarker featurization and analysis pipeline. Preprint. arXiv:1906.07222. 60. Schwoebel, J. (2019). Introduction to Voice Computing in Python. Scotts Valley: CreateSpace Independent Publishing Platform.

Chapter 2

Noninvasive Vascular Blood Sound Monitoring Through Flexible Microphone

Binit Panda, Stephanie Chin, Soumyajit Mandal, and Steve J. A. Majerus

2.1 Introduction and Background

Hemodialysis is a treatment to filter waste and excess water from the blood for individuals who have acute or chronic kidney disease. For those with end-stage renal disease (ESRD) (with no remaining kidney function), hemodialysis is a renal replacement therapy required to maintain survival. Despite the mortality risk associated with ESRD, successful hemodialysis can greatly prolong patient lifespans and increase the chance of receiving a donor transplant. During hemodialysis, arterial blood is pumped into a dialyzer and the filtered blood is returned to the venous system. To improve hemodialysis, permanent vascular access is often obtained using arteriovenous fistulas or grafts or central venous catheters (Fig. 2.1) [1].

B. Panda
Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA

S. Chin
Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH, USA
Advanced Platform Technology Center, Louis Stokes Cleveland Veterans Affairs Medical Center, Cleveland, OH, USA

S. Mandal
Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA

S. J. A. Majerus
Advanced Platform Technology Center, Louis Stokes Cleveland Veterans Affairs Medical Center, Cleveland, OH, USA
e-mail: [email protected]


Fig. 2.1 Hemodialysis (a) vascular access fistula and (b) vascular access graft during dialysis treatment. (From Mayo Foundation for Medical Education and Research. All rights reserved)

Long-term dialysis success is dependent on maintaining the patient’s vascular access—the so-called “Achilles Heel” of hemodialysis [2]. Maintenance of vascular access is a key objective of the National Kidney Foundation and dialysis care in general [3]. Access dysfunction accounts for two hospital visits/year [4, 5] for dialysis patients and the loss of access patency can double their mortality risk [6]. The predominant causes of access dysfunction are stenosis (vascular narrowing) and thrombosis (vascular occlusion), which have a combined incidence of 66–73% in arteriovenous fistulas (AVFs) and 85% in arteriovenous grafts (AVGs). Venous stenosis in the juxta-anastomotic regions occurs in 50–71% of grafts and fistulas, but stenoses can occur anywhere along the vascular access or central veins [7, 8]. Due to the prevalence of stenosis in AVFs and AVGs, clinical monitoring is mandated to identify at-risk accesses as early as possible for diagnostic imaging and treatment planning [9, 10]. Monitoring for vascular access dysfunction relies on data efficiently gathered in the dialysis center, including physical exam and the use of stethoscopes to detect pathologic blood sounds (bruits). When blood flows through a constricted vessel, the resulting flow jet induces a turbulent flow and a pressure fluctuation in


the blood vessel wall [11]. This produces distinct sounds known as arterial bruits, which can be heard with a stethoscope. Access surveillance, performed monthly using flow-measuring equipment, cannot detect fast-growing lesions or restenosis and is often a late indicator of access risk [12]. Late detection limits the utility of diagnostic imaging, reduces treatment planning, and increases thrombosis risk [12–14]. Higher-frequency monitoring for access dysfunction would be ideal for early detection of stenosis (because hemodialysis patients are typically seen thrice weekly), but it must be balanced between the labor required and the objectivity of a given test. Existing monitoring techniques have variable sensitivities (35–80%), in part due to the expertise-dependence of bruit interpretation and physical exam techniques [15]. However, since listening to bruits is an important aspect of physical exams, clinicians have sought to identify auditory features of bruits for quantitative analysis since the 1970s [16]. The recording and mathematical analysis of bruit sounds is called phonoangiography, and the recordings themselves are called phonoangiograms (PAGs) [11, 17–19]. Spectral analysis of PAGs offers an objective method of describing the properties of the turbulent blood flow which occurs near vascular stenoses. Recent advances in spectral and multiresolution analysis, autoregressive models, and machine learning techniques have renewed interest in PAG analysis because complex computations can be performed at the point of care for rapid testing of patients. The promise of PAG monitoring is widespread, objective screening of hemodialysis vascular access function for early detection of asymptomatic stenosis or accesses at risk of thrombosis. Because vascular access structures are surgically created and superficial to the skin surface, they are easy to locate for a physical exam performed at the bedside before treatment. However, there is currently no method (besides imaging) capable of describing the underlying interior anatomy of the access, or of detecting a change in the anatomy which could put the patient at risk. Because phonoangiography can detect the locations and occurrences of turbulent blood flow, it is one potential approach for rapid, quantitative assessment. This chapter will cover relevant signal de-noising and feature extraction of PAGs and the development of microphonic sensor arrays to enable stenosis localization in a vascular access based on site-to-site changes in PAG spectral features. The chapter is organized beginning with a brief summary of foundational work on PAG characterization and more recent work on stenosis classifiers based on spectral features. Next, signal processing methods for PAG de-noising and spectral analysis are presented. Third, the initial design of flexible microphone sensor arrays is described. Finally, proof-of-concept experiments on vascular access phantoms are presented to determine the performance of automated PAG analysis in describing the location and severity of vascular access stenosis.

2.2 Prior Work in Phonoangiographic Detection of Stenosis

Although PAGs have been studied over several decades, there is a relative sparsity of work and widely varying descriptions of relevant spectral properties in functional and dysfunctional vascular accesses. The underlying theme of prior work has been


to identify how PAG spectra change relative to the degree of stenosis (DOS) and the recording location relative to the stenosis. DOS is defined as the ratio of the stenosed cross-sectional area of the blood vessel to the proximal (non-stenosed) lumen cross-sectional area. The clinical standard to locate stenosis and calculate DOS is angiography, in which X-ray scans are used with radiocontrast in the blood for precise imaging of the vascular geometry [20]. DOS is typically estimated to within 10% due to the precision attainable in clinical environments. Previous studies have analyzed PAGs from humans, from vascular bench phantoms, and from computer simulations of blood flow. Here, we describe three topics which have been studied previously: the spectral properties of PAGs in normal and stenosed cases, the impact of recording location on PAG spectra, and the classification of DOS severity based on PAG analysis.

2.2.1 Effect of Stenosis and Blood Flow on Bruit Spectra and Intensity

Much early work in PAG analysis used the Fourier transform and other time-invariant analyses to quantify the frequency content of PAGs. These methods represented the combined frequencies generated during systolic and diastolic phases of turbulent blood flow. Because clinical interpretation of pathologic bruits relies on detecting a high-pitched whistling character, it was hypothesized that stenosis would shift spectral power within a certain frequency band [1]. Although all studies agree that the frequency range of interest is in the 0–1000 Hz band, identification of specific frequency bands varied widely (Table 2.1). More recent work has abandoned simple

Table 2.1 Frequency ranges affected by stenosis

References                    Frequency ranges affected by stenosis
Chen et al. [27]              Specific bands of 130–260, 250–500, and 550–800 Hz
Obando and Mandersson [28]    200–600 Hz
Vasquez et al. [29]           300–500, 200–800, 250–1000 Hz
Mansy et al. [24]             300–600 Hz
Wang et al. [30]              700–800 Hz after signal processing
Sato et al. [31]              20–800 Hz
Wang et al. [26]              200–600 Hz
Chen et al. [32]              40–800 Hz
Gram et al. [33]              625–750 and 875–1000 Hz
Sung et al. [1]               50–800 Hz
Todo et al. [34]              51–270 Hz
Chen et al. [35]              25–800 Hz
Mungia et al. 2009 [36]       200–1000 Hz
Grochowina et al. [37]        150 and 350–400 Hz
Rousselot [38]                70–800 Hz; specific bands at 180–300, 310–390, 440–700 Hz


spectral analysis in favor of more complex multi-feature classifiers (described below) [1, 21–26]. Moreover, although most studies agreed that DOS causes enhanced high-frequency spectral power [1, 26–28, 30, 31, 33–35, 37, 38], one prominent study described an opposite effect [24]. This was perhaps because DOS also reduces blood flow (since blood pressure is regulated); reduced flow would produce lower overall bruit amplitudes due to reduced turbulence at low blood flow [39, 40]. Studies have also demonstrated that stenosis can increase the intensity (amplitude) of PAGs [27, 43, 44]. Amplitude-based analysis, however, is affected by the mechanical-acoustical coupling of the skin and recording transducer and can be corrupted by skin scratch artifacts. Therefore, most works have focused on spectral analysis. Despite the disagreement on the precise effect of stenosis on bruit spectra, these prior studies confirmed that stenosis definitively changes PAG amplitude and pitch. The change, however, could be an enhancement or a suppression of certain frequencies, depending on the impact of the stenosis on blood flow. Other patient-dependent variables, such as PAG amplitude and the recording location relative to the stenosis, must also be accounted for and are described below.

2.2.2 Effect of Recording Location on Bruit Spectra

Compared to frequency analysis reports, there is significantly less disagreement on the effect of recording location on PAG spectra. This is partly due to the obvious property that a bruit cannot be detected very far away from the location of stenosis (for example, on the contralateral limb); bruits can only be detected against the background physiologic sounds when close to the source of turbulent blood flow. Fluid dynamic simulations have established to a high degree of precision that stenosis induces turbulent flow at physiologic blood pressures, flow rates, and nominal lumen diameters of vascular accesses [39, 40]. These simulations have been confirmed by Doppler ultrasound measurements, which agree that turbulent flow is present at a distance of 2–5 times the diameter of the unoccluded tube distal of the stenosis. The important distinction is that turbulence and increased pressure always occur on the downstream, or venous, side of the stenosis. Therefore, bruits recorded proximally and distally to a stenosis will have different frequency spectra due to stenosis turbulence. This effect was confirmed using dual-channel stethoscopes [21]. However, the acoustical influence of the biomechanical properties and thickness of tissue over the vascular access varies between patients. Because tissue acts as a low-pass filter at auditory frequencies, it is presumed that the most accurate bruit recordings would be obtained in the 1–3-cm region distal to the stenosis (assuming an unoccluded tube diameter of 6 mm), where turbulent flow is maximal [39, 40].


2.2.3 Stenosis Severity Classification from PAG Signal Analysis

Many recent studies have focused on the use of various classification methods based on spectro-temporal features. These techniques are promising because they more closely approximate how trained clinicians interpret bruits: through a combination of time-based and spectral-based observations. However, because of the complexity and wide dynamic range of PAGs, no single technique has been identified as superior. Some of these approaches leverage the pulsatile nature of blood flow, where a distinct change in spectra occurs between systole and diastole. Other approaches rely on clustering or machine learning strategies to combine several time-domain and spectral features. Because various techniques have been used, here we only summarize prominent examples (Table 2.2).

Several limitations of these early-stage studies should be noted. First, all of these studies recruited pathologic patients, i.e., patients who were already suspected of vascular access dysfunction and who were scheduled for surgical interventions. Therefore, there is a strong bias in the interpretation of results, and limited opportunity for confirmation of classifier algorithms on a randomly selected test set. Second, while classifiers were trained to determine DOS from PAG features, most studies have reported the performance of a binary classifier, i.e., detecting whether a vascular access has DOS > 50%. This detection level is clinically important because it is the level at which stenoses become hemodynamically significant and surgical intervention may be justified [1]. Therefore, although many methods sought to describe DOS directly from PAG features, the primary utility of classification would be to simply identify which patients have DOS > 50% to enable the use of clinical imaging for more detailed investigation of pathology.

2.3 Phonoangiogram Signal Processing

PAGs are processed in three stages, described separately below. First, signals are pre-processed using an autoregressive bruit-enhancing filter. Next, systole and diastole phases are determined from the spectral flux. Finally, spectral features are separately calculated for systole and diastole. After features are calculated, stenosis severity and location can be determined by classification methods, discussed in a later section. This section, therefore, only describes the process for extracting a set of analytical parameters from a PAG.

2.3.1 Bruit-Enhancing Filter

The systole portion of flow is enhanced automatically using a Bruit-Enhancing Filter (BEF) based on sub-band frequency-domain linear prediction (SB-FDLP) (Fig. 2.2). The BEF also reduces skin scratch and pop noise artifacts caused by movement of the recording transducer over the skin.

Table 2.2 Previously published PAG stenosis classifier methods

Chen et al. 2013 [27]: Burg autoregressive model of frequency spectrum. Stenosis causes an increase of the frequency of fitted poles in the model.

Chen et al. 2013 [35]: Burg autoregressive model of frequency spectrum and fractional-order chaotic system. Correlation (0.38) between chaotic model output parameter and degree of stenosis reduction from PAGs at the venous site of the AVF.

Chen et al. 2014 [32]: Burg autoregressive model of frequency spectrum and fuzzy Petri net. Stenosis can be graded by the Petri net into 1 of 3 classes.

Gram et al. 2010 [33]: Linear discriminant analysis. Power distribution in the highest scale of the discrete wavelet transform relative to power in other scales allows detection of stenosis.

Vesquez et al. 2009 [29]: Support vector machine (SVM) to classify between wavelet scale energies. SVM detected stenosis based on energy differences between the 500–1000 and 250–500 Hz bands with >98% accuracy (n = 1) and 83% specificity (n = 7).

Wang et al. 2011 [30]: Narrowband power threshold. Stenosis detectable if power in the 700–800 Hz band exceeds 1/15th of the maximum power.

Mungia et al. 2011 [36]: Principal component analysis and sequential forward selection algorithm. Best feature was the energy difference between wavelet scales at 500–1000 and 250–500 Hz.

Wang et al. 2014 [26]: Principal component analysis based on two-dimensional Gaussian distribution. S-transform used for feature detection; time characteristics of spectra during the systolic pulse are essential to analyze accurately.

Sung et al. 2014 [1]: Radial basis function neural network. Covariance between spectral centroid and spectral flux correlates most strongly with stenosis.


Fig. 2.2 PAGs are processed in frames using a bruit-enhancing filter to produce an enhanced PAG X_BEF[n] and the bruit power envelope E_BEF[n]. The BEF breaks the discrete cosine transform of x[n] into N sub-bands. Sub-bands are modeled using linear predictive coding and recombined with weights W_N to produce a bruit envelope E_BEF[n]

Table 2.3 Selected bands for frequency-domain modeling

Band     Sub-band width (Hz)   Sub-band weight (%)
Band 0   25–225                5
Band 1   350–700               50
Band 2   650–1000              30
Band 3   950–1200              15

The BEF was developed based on the nominal PAG spectra of over 1000 bruit recordings obtained over 12 months in patients with hemodialysis vascular access. Based on an analysis of spectral variance, four frequency bands were identified which showed the most variation between subjects and recording locations. During the filtering step, power contributions of each band are weighted to produce a time-domain enhancement envelope. Weights for each band were calculated based on the nominal spectral density and degree of variation within the band. Because low-frequency sounds dominate the PAG power spectrum, they were proportionally attenuated relative to higher-frequency bands, which had more patient variation (Table 2.3) [18]. The BEF segments the PAG into 25%-overlapping frames and models the time-domain envelope in sub-bands using SB-FDLP. Each sub-band envelope is multiplied by the sub-band weights W_0–W_3 and summed to produce a modeled temporal bruit envelope (E_BEF). The envelope E_BEF is multiplied by the original recording to produce a bruit-enhanced PAG (X_BEF). This approach is generalizable


to other signals through the selection of sub-bands, sub-band weights, and the pole rate (the number of poles used in linear prediction relative to frame length).

The core component of the BEF is frequency-domain linear prediction. When used on a broadband transform, FDLP models the Hilbert envelope of a signal [41, 42]. By controlling the number of poles used in the predictive model, significant smoothing of the modeled envelope is obtained. The FDLP envelopes are calculated from the real discrete cosine transform (DCT) values of the PAG, X_DCT[k], over 20,000 frequency coefficients, i.e.,

X_{DCT}[k] = a[k] \sum_{n=0}^{N-1} x[n] \cos\left( \frac{(2n+1)\pi k}{2N} \right),    (2.1)

where

a[k] = \begin{cases} \frac{1}{\sqrt{2}} & k = 0 \\ 1 & k = 1, 2, \ldots, N-1 \end{cases}    (2.2)

The DCT approximates the envelope of the discrete Fourier transform. This implies that the spectrogram of the DCT (treating the DCT as a time sequence) mirrors the time-domain spectrogram around the time/frequency axes. Just as time-domain linear predictive coding (LPC) models the frequency-domain envelope, FDLP estimates the temporal envelope when applied in the frequency domain. This implementation uses LPC to model the spectral envelope using a Pth-order, all-pole model:

\hat{x}[n] = \sum_{k=1}^{p} a_k x[n-k], \quad \text{i.e.} \quad P(z) = \sum_{k=1}^{p} a_k z^{-k},    (2.3)

where P is the order of the filter polynomial and P(z) is its z-transform. LPC uses least-squares iterative fitting to determine the coefficients a_k of the filter P(z) such that the error in predicting the next value of the series, \hat{x}[n], is minimized. The calculated filter is an autoregressive model with significantly lower variance than the Hilbert envelope or the computed ASF when used for PAG analysis [18]. When LPC is applied in sub-bands, we denote P(z) in the mth sub-band as H_m(z); its impulse response h_m[n] predicts the time-domain envelope produced by the frequencies of x[n] within the mth sub-band. The N sub-band envelopes h_m[n] are combined to produce a bruit-enhanced envelope E_BEF[n] using sub-band weights W_m:

E_{BEF}[n] = \sum_{m=1}^{N} W_m h_m[n].    (2.4)


Once E_BEF[n] is calculated, it is multiplied by x[n] to produce the enhanced signal X_BEF. Because E_BEF[n] models the systolic portions of the PAG (since that is where the frequency content in the selected sub-bands is concentrated), it can be used for feature extraction or systole-diastole segmentation. Otherwise, X_BEF alone may be further analyzed as described below.
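To make the pipeline concrete, here is a simplified single-frame Python sketch of the BEF, using librosa's LPC routine as the FDLP fit; the sampling rate, pole count, and envelope normalization are assumptions, and frame overlap handling is omitted.

```python
# Single-frame bruit-enhancing filter sketch: DCT (Eq. (2.1)), per-band
# LPC fit in the frequency domain (FDLP), and weighted recombination of
# the sub-band temporal envelopes (Eq. (2.4), weights from Table 2.3).
import numpy as np
from scipy.fft import dct
from scipy.signal import freqz
import librosa

BANDS_HZ = [(25, 225), (350, 700), (650, 1000), (950, 1200)]
WEIGHTS = [0.05, 0.50, 0.30, 0.15]

def bruit_enhance(x, fs, order=40):
    N = len(x)
    X = dct(x, type=2, norm="ortho")   # real DCT of the frame, Eq. (2.1)
    nyq = fs / 2.0
    env = np.zeros(N)
    for (lo, hi), w in zip(BANDS_HZ, WEIGHTS):
        k0, k1 = int(lo / nyq * N), int(hi / nyq * N)
        a = librosa.lpc(X[k0:k1].astype(float), order=order)  # FDLP fit
        _, h = freqz(1.0, a, worN=N)   # temporal envelope of 1/A(z)
        env += w * np.abs(h) ** 2      # weighted sum, Eq. (2.4)
    env /= env.max()                   # normalize the enhancement envelope
    return env * x, env                # X_BEF[n] and E_BEF[n]
```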

2.3.2 PAG Wavelet Analysis

After obtaining the enhanced PAG at the BEF output, the continuous wavelet transform (CWT) is used to describe the spectral variance over time. The wavelet transform over k scales, W[k, n], was computed as

W[k, n] = x_{PAG}[n] * \psi[n/k],    (2.5)

where \psi[n/k] is the analyzing wavelet at scale k. We used the complex Morlet wavelet because it has a good mapping from scale to frequency, defined as

\psi[n] = e^{-(n/2f_c)^2} e^{j 2\pi f_c n},    (2.6)

where f_c is the wavelet center frequency. In the limit f_c → ∞, the CWT with the Morlet wavelet becomes a Fourier transform. Because of the construction of the Morlet wavelet, when the wavelet ψ[n] is scaled to ψ[n/k] with k a factor of 2, the wavelet center frequency is shifted by one octave. Therefore, CWT analysis with the Morlet wavelet can be described by the number of octaves N_O being analyzed (the frequency span) and the number of voices per octave N_V (the divisions within each octave, i.e., frequency scales). Mathematically, the set of scale factors k can be expressed as

k[i_O, i_V] = 2^{(i_O + i_V / N_V)}, \quad i_O = k_0, k_0+1, k_0+2, \ldots, N_O, \quad i_V = k_0, k_0+1, k_0+2, \ldots, N_V    (2.7)

where k_0 is the starting scale and defines the smallest scale value, and the total number of scales is K = N_O N_V. For PAG analysis we computed the CWT with N_O = 6 octaves and N_V = 12 voices/octave, starting at k_0 = 3. The number of octaves was chosen based on an analysis of the aggregate power spectrum of PAGs from 3283 unique 10-s recordings from hemodialysis patients [18]. The same set of human data was used to determine the lowest number of coefficients per octave achieving a signal reconstruction accuracy below 5% RMS error. After computing the CWT, pseudofrequencies F[k] across all K scales were calculated as

F[k] = f_c / k.    (2.8)
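Under the stated configuration, the CWT can be sketched with PyWavelets; the complex Morlet bandwidth and center-frequency parameters encoded in the wavelet name are assumptions.

```python
# CWT over N_O = 6 octaves with N_V = 12 voices/octave starting at
# k0 = 3, with scales following Eq. (2.7).
import numpy as np
import pywt

def pag_cwt(x, fs, n_octaves=6, n_voices=12, k0=3):
    i = np.arange(n_octaves * n_voices)
    scales = 2.0 ** (k0 + i / n_voices)      # Eq. (2.7)
    W, freqs = pywt.cwt(x, scales, "cmor1.5-1.0",
                        sampling_period=1.0 / fs)
    return W, freqs                          # W[k, n] and F[k]
```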


2.3.3 Wavelet-Derived Auditory Signals

Because the CWT involves time-domain convolution, each time point n has a paired sequence of k CWT coefficients. Therefore, the CWT can be visualized like a spectrogram plotted against the pseudofrequency axis F[k]. However, several temporal-spectral waveforms can also be derived to reduce the dimensionality of the CWT back to n. These waveforms enable simple comparison between spectral and time-domain features, enabling simplified segmentation, for example. In our work, two fundamental n-point waveforms are calculated from the set of CWT coefficients: the Auditory Spectral Flux (ASF) and the Auditory Spectral Centroid (ASC).

ASF describes the rate at which the magnitude of the auditory spectrum changes and approximates a spectral first-order difference or first derivative. It is calculated from the variation between two adjacent time frames:

\mathrm{ASF}[n] = \sqrt{ \frac{1}{K} \sum_{k=1}^{K} \left( |W[k, n]| - |W[k, n-1]| \right)^2 },    (2.9)

where W[k, n] is the continuous wavelet transform of the PAG obtained over K total scales. In [1], the severity of the Degree of Stenosis (DOS) was correlated with the sharpness of the peak in the ASF. In Fig. 2.3, we demonstrate the spectral features that ASF describes using an artificially generated test waveform consisting of concatenated sine signals at six frequencies from 100 to 1500 Hz. The ASF curve shows a spike whenever there is a sudden change in

Fig. 2.3 Spectrogram of artificially generated test waveform with six single-tone frequencies from 100 to 1500 Hz. The ASF curve (lower) shows a spike at every change of frequency, approximating the spectral first derivative


frequency. The magnitude of the ASF waveform therefore describes how quickly spectral power shifts between bands. This is useful for mapping large variations in a PAG signal, such as the systole and diastole phases; segmentation of these phases therefore uses the ASF waveform (described below).

ASC describes the spectral "center of gravity" at each point in time. For spectrally flat (white) noise, with the same spectral power at all frequencies, ASC would be centered at pseudofrequency F[K/2]. Higher centroid values correspond to "brighter" textures with more high-frequency content [43]. ASC is commonly used to estimate the average pitch of audio recordings. It is calculated as

ASC[n] = \frac{ \sum_{k=1}^{K} |W[k,n]| \, f_c[k] }{ \sum_{k=1}^{K} |W[k,n]| },    (2.10)

where W[k, n] is the continuous wavelet transform obtained over K total scales of the PAG and f_c[k] is the center frequency of scale k. The spectral distribution of the bruit varies with DOS; therefore, the severity of vascular access stenosis can be evaluated by monitoring the variation of ASC [1]. To demonstrate how ASC describes spectral content, ASC was calculated for the same test waveform (Fig. 2.4). Because only a single tone is present at each time point, ASC tracks the frequency of the sine wave: as the frequency of the sine wave increases, the ASC feature also increases. Although F[k] represents pseudofrequencies, the properties of the Morlet wavelet provide a close mapping between pseudofrequency and real frequency. For PAG analysis, absolute frequency accuracy is not needed (as discussed below).

Fig. 2.4 Spectrogram of artificially generated test waveform with six single-tone frequencies from 100 to 1500 Hz. The ASC curve describes the frequency of the sine wave at each time point
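A minimal sketch of Eqs. (2.9) and (2.10), computed from the CWT coefficient magnitudes; the small constant guarding against division by zero is an addition made here, not part of the chapter's formulation.

```python
import numpy as np

def auditory_waveforms(W, F):
    """ASF (Eq. 2.9) and ASC (Eq. 2.10) from a K x N array of CWT
    coefficients W and the K pseudofrequencies F[k]."""
    mag = np.abs(W)
    diff = np.diff(mag, axis=1, prepend=mag[:, :1])   # |W[k,n]| - |W[k,n-1]|
    asf = np.sqrt(np.mean(diff ** 2, axis=0))         # spectral flux per time point
    asc = (mag * F[:, None]).sum(axis=0) / (mag.sum(axis=0) + 1e-12)
    return asf, asc
```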


Fig. 2.5 Segmentation is performed on ASF with a two-stage approach. Stage 1 uses threshold crossing to locate initial start and end of a segment. Stage 2 filters out spurious segments caused by recording noise peaks

2.3.4 PAG Systole/Diastole Segmentation

During pulsatile blood flow, the high-velocity flow in systole produces significant turbulence, causing the characteristic pitch of the bruit [1]. Therefore, the systolic portion of the bruit contains spectral content related to turbulent flow and should be analyzed separately from diastole to improve sensitivity. We implemented a two-stage technique to segment each pulse into systole and diastole periods based on the auditory spectral flux (ASF) (Fig. 2.5). First, 50% of the RMS value of the ASF waveform was calculated as a threshold. Time points corresponding to alternating positive- and negative-going crossings of the threshold were selected as potential start and end times for each systolic pulse, and pulse width was calculated as the time difference between alternating crossings. A second segmentation stage was used to filter out short pulse widths caused by threshold double-crossings or transient recording artifacts. Valid segments had to meet two conditions: a systolic segment length less than 1 s, and a length greater than 40% of the longest systolic segment (Fig. 2.5). The benefits of segmentation can be seen in an example (Fig. 2.6a), where DOS is low (30%) and the systole has more power than the diastole. In the other case (Fig. 2.6b), DOS is high (80%) and the flow is turbulent: the diastole is not as quiet and has a significant amount of power at frequencies between 700 and 1000 Hz, caused by turbulent flow persisting into diastole rather than being confined to systole. Segmentation permits separate calculation of features in systole and diastole, or comparison of relative changes between these phases, which are linked to DOS.
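A sketch of the two-stage segmentation in Python; the ordering of the stage-2 checks (length cap applied before the 40% rule) is a design choice made here, as the chapter does not specify it.

```python
import numpy as np

def segment_systole(asf, fs, min_frac=0.4, max_len_s=1.0):
    """Two-stage systole segmentation from the ASF waveform (Sect. 2.3.4 sketch)."""
    thr = 0.5 * np.sqrt(np.mean(asf ** 2))        # stage 1: 50% of ASF RMS
    above = np.concatenate(([False], asf > thr, [False]))
    rises = np.flatnonzero(~above[:-1] & above[1:])   # positive-going crossings
    falls = np.flatnonzero(above[:-1] & ~above[1:])   # negative-going crossings
    segs = [(r, f) for r, f in zip(rises, falls) if (f - r) / fs < max_len_s]
    if not segs:
        return []
    longest = max(f - r for r, f in segs)
    # stage 2: drop spurious short segments from double-crossings / noise spikes
    return [(r, f) for r, f in segs if (f - r) >= min_frac * longest]
```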

2.3.5 PAG Spectral Feature Extraction

We calculated 13 features for each PAG based on previous work [44]. All features were calculated in MATLAB from 10-s PAG recordings from the vascular access phantoms (described in detail in later sections). Some features were calculated over the whole waveform, and others were calculated separately for the systole and diastole phases. Our objective in feature selection was to determine


Fig. 2.6 When degree of stenosis is 30% (a), the power is confined to the systole phase (white box). For high DOS (b), turbulent flow increases the power in systole and diastole (black box) segments

features that were easily calculated and usable in a simple binary classifier (discussed below). The features most sensitive to DOS were determined using ANOVA and principal component analysis. Because many of the features are strongly correlated (e.g., mean and RMS), feature selection focused on the features that varied most independently. These included features 5, 6, 12, and 13 (Table 2.4). The performance of these features in a binary classification scheme was tested and is described below.

2.4 Skin-Coupled Recording Microphone Design and Assembly

This section describes the construction of a new type of skin-coupled microphone suitable for integration into large, flexible arrays. The sensor geometry can be customized for different applications. In this work, 5-channel sensor prototypes were built for testing on vascular phantoms. Sensor size and acoustic backings were first tested using frequency sweeps to ensure that the sensors had sufficient bandwidth for accurate bruit recording.

2.4.1 Sensor Construction

Sensors were constructed using 28-μm silver-ink-metallized polyvinylidene fluoride (PVDF) film (Measurement Specialties, MEAS) [45]. PVDF was used because it is a piezoelectric polymer that produces a differential charge when vibrated at acoustic frequencies. Films were cut to the required sizes by laser cutter (Versa LASER,


Table 2.4 List of features explored

| # | Feature equation | Feature description | Expected effect during turbulent blood flow |
|---|---|---|---|
| 1 | mean ASC | Average of ASC | Increased mean pitch |
| 2 | mean ASF | Average of ASF | Increased mean spectral rate of change |
| 3 | ASC_RMS | RMS of ASC | Increased mean pitch and pitch range |
| 4 | ASF_RMS | RMS of ASF | Increased range of spectral rates of change |
| 5 | ASC_S | Average of ASC in systole segment | Increased pitch during blood high-pressure period |
| 6 | ASC_D | Average of ASC in diastole segment | Increased pitch during blood low-pressure period |
| 7 | abs(ASC_S − ASC_D) | Difference in average ASC between systole and diastole | Increased absolute pitch difference between high- and low-pressure periods |
| 8 | ASC_S / ASC_D | Ratio of average ASC in systole and diastole segments | Increased relative pitch between high- and low-pressure periods |
| 9 | ASF_RMS,S | RMS of ASF in systole segment | Increased range of spectral rate of change during blood high-pressure period |
| 10 | ASF_RMS,D | RMS of ASF in diastole segment | Increased range of spectral rate of change during blood low-pressure period |
| 11 | ASF_RMS,S − ASF_RMS,D | Difference between RMS ASF in systole and diastole segments | Increased absolute range of spectral change between high- and low-pressure periods |
| 12 | ASF_RMS,S / ASF_RMS,D | Ratio of RMS ASF in systole and diastole segments | Increased relative range of spectral change between high- and low-pressure periods |
| 13 | ASC_S × ASF_RMS,S | Average of ASC in systole multiplied by RMS of ASF in systole | Increased mean pitch and increased range of spectral rate of change in blood high-pressure period |

Models VLS2.30 & VLS3.50) in two steps. First, the outer 0.5 mm of metallization was removed from one sensor surface by raster-scanning at 14% power and 30% speed to weaken the metal ink-PVDF bond. The weakened regions were exfoliated with tape prior to cutting through the PVDF film, preventing shorting of the metal across the cut film. A final through-thickness cut at 17% power and 100% speed released each transducer element. Transducers were sandwiched between two identical printed circuit boards with annular electrical contacts surrounding drilled holes, contacting each side of the film separately. Circuit board electrodes were attached to the PVDF film using silver conductive epoxy adhesive (MG Chemicals, Model 8331). The skin-facing side of the PVDF was electrically grounded and covered in a 1-mm PDMS


Fig. 2.7 (a) Schematic structure of a sensor showing the PVDF film sandwiched between two electrodes and a layer of PDMS on the skin-facing side. (b) Cross-sectional view of the sensor

film (Ecoflex 00-10) with a mechanical impedance similar to muscle. The opposite side of the PVDF film used one of three backing materials to constrain the mechanical resonant modes and control the acoustic response, as shown in Fig. 2.7. The three backings were: PDMS (Ecoflex 00-10), silicone gel (Dow Corning SYLGARD 527 dielectric gel) plus polyimide tape, or an open backing (air-backed). Both the PDMS and gel backings were applied at a thickness of 1 mm. Four microphone diameters (2, 4, 8, and 16 mm) were also tested to study the effect of diaphragm size on acoustic sensitivity and frequency response. Each combination of size and backing material was replicated across 36 test boards.

2.4.1.1 Frequency Response

A frequency generator (Agilent 33522A, Function/Arbitrary Waveform Generator) was used to generate a logarithmic frequency sweep from 20 Hz to 5 kHz over 60 s. To avoid harmonics due to overdriving, the driving voltage was kept at 50 mVpp. The output was connected to a contact speaker element to generate acoustic


Fig. 2.8 Test setup for frequency response measurements

Table 2.5 Measured SNR of PVDF microphones, contact microphone, and stethoscope

| Microphone type | Observations | SNR (dB), mean ± SD | 95% confidence interval for the mean (dB) |
|---|---|---|---|
| Stethoscope | 9 | 6.34 ± 5.45 | −0.90 to 13.57 |
| Contact microphone | 9 | 24.86 ± 9.95 | 9.55 to 40.17 |
| PVDF film, air-backed | 27 | 15.66 ± 2.97 | 13.30 to 18.02 |
| PVDF film, PDMS-backed | 27 | 22.16 ± 4.97 | 18.22 to 26.09 |
| PVDF film, silicone gel-backed | 27 | 24.40 ± 3.83 | 21.36 to 27.43 |

vibrations through a 6-mm layer of PDMS rubber (Fig. 2.8). This arrangement mimicked the typical thickness of tissue over a blood vessel in a vascular access. To account for variations in surface coupling pressure and gain differences, recordings from the microphones and the stethoscope were normalized to a −10 dB RMS level in MATLAB. This test showed that all four diameters had comparable frequency responses, with random error within a 5-dB margin. We therefore chose the smallest diameter (2 mm) to keep the microphone as small as possible.
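The normalization step can be sketched as follows; treating the −10 dB level as relative to unit RMS is an assumption here, since the chapter does not state the reference level.

```python
import numpy as np

def normalize_rms_db(x, target_db=-10.0):
    """Scale a recording so its RMS sits at target_db relative to unity."""
    rms = np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))
    return x * (10.0 ** (target_db / 20.0) / rms)
```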

2.4.1.2 Signal to Noise Ratio Calculation

Single-tone testing was performed at three frequencies (100, 1000, and 1500 Hz) using the same setup. PVDF film microphones were tested alongside a digital recording stethoscope (Littmann 3200) and a precision contact microphone (TE CM01B). Signal-to-noise ratio (SNR) was calculated for each microphone type and averaged across all tested frequencies and all microphone sizes. On average, PVDF film microphones outperformed the stethoscope and performed similarly to the contact microphone (Table 2.5). The microphones with silicone gel backing performed better than the other two backings. Thus, 2-mm microphones with silicone gel backing were chosen for array integration.
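One plausible way to estimate single-tone SNR is shown below; the FFT-based split between tone-band and noise power and the 20-Hz tone bandwidth are assumptions, as the chapter does not specify its exact SNR computation.

```python
import numpy as np

def tone_snr_db(x, fs, f0, bw=20.0):
    """SNR of a recorded single tone at f0 (Hz): tone-band power vs. the rest."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_tone = np.abs(freqs - f0) <= bw          # bins attributed to the tone
    return 10.0 * np.log10(spec[in_tone].sum() / spec[~in_tone].sum())
```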

2.4.1.3 Array of Microphones

The 2-mm-diameter microphone with silicone gel backing was used to build a recording array. The main objective was to use the array to localize the stenosis and analyze the acoustic characteristics around it, rather than performing a single-point analysis.


Fig. 2.9 Prototype PAG sensor array based on 2-mm-diameter PVDF microphones on polyimide substrate with silicone skin-coupling interface (a). PAG array is thinner than 1 mm (b) and flexible to conform to vascular access shape (c)

Therefore, a 1 × 5 array of microphones was designed that could be oriented along the direction of vascular flow. Testing PAGs at different locations around the stenosis on a vascular access phantom showed that a resolution of 1 cm is sufficient to characterize the varying acoustic properties: recordings at different locations (up to 3 cm proximal and distal to the stenosis) had different ASC and ASF_RMS values (detailed results below). Thus, the array consisted of five 2-mm microphones with silicone gel backing, spaced 1 cm apart. Microphones were integrated on a flexible polyimide substrate with laser-cut relief between recording sites (Fig. 2.9). The relief-cut polyimide substrate had reduced surface area and good vibration damping, allowing simultaneous recording from five locations without any measurable crosstalk. This microphone array was used on vascular bench phantoms to determine the sensitivity and specificity of stenosis detection and classification using PAG spectral feature analysis.


2.5 Detection of Vascular Access Stenosis Location and Severity In Vitro

To simulate how the sensor would perform on a real patient, but in a controlled environment, recordings were made on a vascular access phantom. The phantom consisted of a 6-mm silicone tube embedded in PDMS (Ecoflex 00-10) at a 6-mm depth. Stenosis was simulated in the center of the phantom with a band tied around the tube to produce an abrupt narrowing. The degree of stenosis (DOS) was controlled by tying the band around metal rods of fixed diameters. DOS was later confirmed by calculating the percentage of lumen diameter reduction from CT scans of the phantoms (Fig. 2.10). The phantom was connected to a pulsatile pumping system (Cole Parmer MasterFlex L/S, Shurflo 4008) to simulate human hemodynamic flows from 600 to 1200 mL/min. Pulsatile pressures and aggregate flow rate were measured with a pressure sensor (PendoTech N-038 PressureMAT) and a flow sensor (Omega FMG91PVDF), respectively. To validate the phantom's auditory signals, recordings were first made using a digital stethoscope (Littmann 3200) and compared to those from humans to

Fig. 2.10 Ten to 85% stenosis AVG phantoms used to reproduce bruits were constructed using banded 6-mm silicone tubing (a) embedded in tissue-mimicking silicone rubber to a depth of 6 mm (b). Contrast angiography validated recording site locations (c), with computed CT slices used to calculate degree of stenosis (d)


Fig. 2.11 Aggregate power spectrum of patient and phantom PAGs. The spectra varied by only 1.8% RMS, indicating that the phantom reproduces hemoacoustics as recorded from humans

ensure that they were physiologically relevant. An aggregate power spectrum was produced from 3283 unique 10-s recordings obtained from 24 hemodialysis patients over 18 months [18]. Spectral comparisons between human and phantom data were used to reduce time-domain variability, e.g., due to heart rate differences. Human and phantom spectra followed similar spectral trends and showed similar dynamic range in time-domain analysis. Quantitatively, the normalized RMS error was calculated over the range of the aggregate human PAG spectra (Fig. 2.11). The vascular phantoms matched the aggregate human PAG spectra with a 1.84% normalized RMS error, scaled to the total power spectrum range. This comparison suggested that the vascular stenosis phantoms can adequately replicate PAGs as measured from humans, but with control over vascular access flow rate and stenosis severity [44].

We investigated PAGs from ten phantoms with degrees of stenosis from 10 to 85%. Ten-second recordings were made with the flexible microphone array (discussed earlier) at six flow rates (700-1200 mL/min) to represent nominal physiological flow rates in functional vascular accesses [44]. PAGs were recorded at four locations simultaneously (one proximal, one on the stenosis, and two distal to the stenosis; Fig. 2.12) using LabVIEW software and National Instruments data acquisition hardware. After recording, signals were transferred to MATLAB for signal processing and calculation of the features based on auditory spectral flux (ASF) and auditory spectral centroid (ASC). Signals were segmented into systole and diastole phases and features were calculated separately for each. Because the ASF has a significant pulsatile component, RMS ASF values were used, while mean ASC values were calculated in each pulse phase. ASC and ASF values were analyzed with two primary outcomes: localization of the stenosis and differentiation of the degree of stenosis. Balanced Analysis


Fig. 2.12 Vascular stenosis phantom flow diagram. The pumping system produced pulsatile flows in a vascular access phantom within physiological ranges of flow and pressure. The recording sites are shown in the stenosis phantom diagram

of Variance (ANOVA) was conducted to detect differences in PAGs at different locations and at different degrees of stenosis over a range of flows. The effect of flow on ASC and ASF was not studied; variable flow was used to represent this confounding variable in humans [1]. ANOVA was used to test the differences between means for statistical significance. The total number of recordings was 50 per location, including all phantoms and flow rates. After segmentation there were approximately 350 systole and diastole segments for each phantom-location combination in this analysis.
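The 1.84% figure above comes from a normalized RMS error between aggregate spectra; a sketch of that comparison is below, where scaling by the range of the reference (human) spectrum follows the description in the text, and the percentage convention is an assumption.

```python
import numpy as np

def normalized_rms_error(human_spec, phantom_spec):
    """Normalized RMS error between two aggregate power spectra,
    scaled to the total range of the reference spectrum, in percent."""
    err = np.sqrt(np.mean((human_spec - phantom_spec) ** 2))
    return 100.0 * err / (human_spec.max() - human_spec.min())
```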

2.5.1 Feature Performance

2.5.1.1 Auditory Spectral Flux (ASF)

Systole ASF RMS (ASF_RMS,S) was different at all recording locations (p < 0.033 for locations 1-3 and 1-2, p < 0.001 for all others) (Fig. 2.13). Because array sites were located 1 cm apart, this confirms that multiple recording locations are needed for accurate detection of stenosis location. Based on this analysis, a sharp drop in ASF_RMS,S suggests the presence of stenosis within 1 cm proximal in the direction of flow. As the DOS increased, the ASF_RMS,S value also increased [1]. This trend is clearly visible: moving from mild to moderate to severe DOS produces a monotonic increase in ASF_RMS,S irrespective of location (Fig. 2.14). Measured results showed a mean ASF shift of 0.059 ± 0.0035 between mild and moderate DOS and a mean ASF shift of 0.143 ± 0.004 between moderate and severe DOS (n = 422, p < 0.001). This result suggests that simple threshold-based detection can be used to classify stenosis severity.


Fig. 2.13 The flexible PVDF sensor array was tested on 10-85% stenosis phantoms with variable flow. PAG systole spectral flux RMS (regardless of DOS or flow) was different at each recording site. A sharp drop in ASF_RMS,S at location 2 suggests the presence of stenosis within 1 cm proximal in the direction of the flow

Fig. 2.14 As the degree of stenosis increased, the ASF_RMS,S value also increased. Moving from mild to moderate to severe DOS produces a monotonic increase in ASF_RMS,S irrespective of location

2.5.1.2 Auditory Spectral Centroid (ASC)

Systole ASC mean (ASC_S) showed a similar dependence on recording location for stenosis localization. ANOVA revealed a significant difference between the means of PAGs from different locations (p = 0.022 for locations 2-3, p < 0.001 for all others) (Fig. 2.15). An interesting observation was that ASC_S decreased significantly (from 323 to 223 Hz) from location 3 to location 4. This can be explained by the formation of vortex flow in areas distal to the stenosis [46]. Prior work estimated that vortex flow occurs at a distance of 1-3 times the vessel diameter from the stenosis. In the tested phantoms, the vortex was observed at 2 cm from the stenosis, i.e., 3.3 times the vessel diameter [47]. Therefore, the large drop in ASC between these locations is likely due to turbulent flow, consistent with prior work [39, 40].


Fig. 2.15 The flexible PVDF sensor array was tested on 10–85% stenosis phantoms with variable flow. PAG spectra (regardless of DOS or flow) had distinct spectral centroid at 1-cm spacings proximal and distal to stenosis

To find a trend with the degree of stenosis, we computed the difference between the systole ASC mean (ASC_S) and the diastole ASC mean (ASC_D). As DOS increases, the flow becomes turbulent and power is present in both the systole and diastole segments, so the difference between the segment means is small. In a low-DOS phantom, by contrast, most of the power is present in the systole segment, so there is a large difference between the mean power levels in systole and diastole. In Fig. 2.16a, DOS is 40% and the difference between the peak in systole and the valley in diastole is greater than 500 Hz, while in Fig. 2.16b this difference is less than 500 Hz. Therefore, as DOS increases, the variation of ASC between systole and diastole diminishes and converges toward the mean ASC value of the total waveform. This converging behavior of ASC between systole and diastole was evaluated using ANOVA. Mean systole-diastole differences (95% confidence intervals) were 56.22 ± 2.35 Hz in mild (10-40% DOS), 44.43 ± 3.53 Hz in moderate (40-60% DOS), and 14.41 ± 2.65 Hz in severe (>60% DOS) stenosis (Fig. 2.17). Thus, the difference between the systole and diastole ASC values, |ASC_S − ASC_D|, correlated strongly with stenosis grade. Tukey pairwise comparisons revealed significant differences between all stenosis grades (p < 0.001).

2.5.2 Threshold-Based Phonoangiographic Detection of Vascular Access Stenosis

Based on the ASC and ASF values, we designed a binary classifier to classify DOS into two groups. This classifier would be useful to identify patients with a hemodynamically significant stenosis, defined clinically as DOS > 50%. Such patients might be selected for an imaging study or entered into a vascular surveillance program


Fig. 2.16 Low DOS (a) showed a greater peak-valley change in spectral centroid compared to high DOS (b). The difference in systole and diastole is computed after ASF-based segmentation

Fig. 2.17 The difference between systole ASC mean and diastole ASC mean is inversely related to the DOS. As the DOS increases, the flow becomes turbulent and the difference between systole and diastole ASC mean decreases

to reduce emergency interventions or for treatment planning. The following performance parameters were used to measure the performance of the classifier:

• True Positive (TP): phantom with DOS > 50% detected correctly
• True Negative (TN): phantom with DOS < 50% rejected correctly
• False Positive (FP): phantom with DOS < 50% detected incorrectly
• False Negative (FN): phantom with DOS > 50% rejected incorrectly


Accuracy: the percentage of correctly classified cases out of the total number of cases,

Accuracy = \frac{TP + TN}{\text{Total Cases}}.    (2.11)

Specificity: the percentage of cases with DOS < 50% that are recognized as DOS < 50%,

Specificity = \frac{TN}{TN + FP}.    (2.12)

Sensitivity: the percentage of cases with DOS > 50% that are recognized as DOS > 50%,

Sensitivity = \frac{TP}{TP + FN}.    (2.13)

We investigated four main features with the binary classifier: systole ASC mean (ASC_S), systole ASF RMS (ASF_RMS,S), the difference between systole and diastole ASC means (|ASC_S − ASC_D|), and the product of systole ASC mean and systole ASF RMS (ASC_S × ASF_RMS,S). These features were selected from the larger set described above based on maximum sensitivity to DOS. Each classifier was tested independently for each recording location. The binary classifier simply compares the feature value to a threshold; values exceeding the threshold are labeled as pathological (DOS > 50%). Optimized thresholds for each feature were selected to improve classifier performance and reduce false positive and false negative rates. Classifier performance was measured by the receiver operating characteristic (ROC) curve [48]. ROC curves were generated by testing classifier performance against varying thresholds and computing sensitivity and specificity. The optimized threshold was calculated by maximizing Youden's index (J). Youden's index is a function of sensitivity and specificity and is a commonly used measure of overall diagnostic effectiveness [49, 50]; it identifies the threshold at which the best differentiating ability is achieved when equal weight is given to sensitivity and specificity, i.e.,

J = \max_{c} \left\{ \text{sensitivity}(c) + \text{specificity}(c) - 1 \right\}    (2.14)

over c tested thresholds. For each ASC feature, 300 thresholds were tested; for ASF features, 50 thresholds were tested. Thresholds were evenly distributed over the range obtained across all blood flow rates and phantom degrees of stenosis. The optimum threshold, J_O, was calculated for each classifier and for each location (Table 2.6).
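A sketch of the threshold sweep and Youden-index optimization of Eq. (2.14); the 50% DOS cutoff follows the text, while the uniform threshold grid spanning the observed feature range is an assumption about the implementation.

```python
import numpy as np

def optimize_threshold(feature, dos, n_thresholds=300, cutoff=50.0):
    """Threshold sweep with Youden-index selection (Eq. 2.14); a sketch."""
    y = np.asarray(dos) > cutoff                   # positive class: DOS > 50%
    feature = np.asarray(feature, dtype=float)
    best_j, best_c = -np.inf, None
    for c in np.linspace(feature.min(), feature.max(), n_thresholds):
        pred = feature > c                         # above threshold -> pathological
        tp = np.sum(pred & y)
        tn = np.sum(~pred & ~y)
        fp = np.sum(pred & ~y)
        fn = np.sum(~pred & y)
        sens = tp / max(tp + fn, 1)                # Eq. (2.13)
        spec = tn / max(tn + fp, 1)                # Eq. (2.12)
        if sens + spec - 1 > best_j:               # Youden's index J at threshold c
            best_j, best_c = sens + spec - 1, c
    return best_j, best_c                          # (J, optimum threshold J_O)
```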


Table 2.6 Best-performing features for binary classification

| Feature | J_O range | Best J_O and location | Accuracy (%) | Specificity (%) | Sensitivity (%) |
|---|---|---|---|---|---|
| ASC_S | 222-323 Hz | 222 Hz at site 4 | 82 | 96 | 68 |
| abs(ASC_S − ASC_D) | 18-61 Hz | 56 Hz at site 3 | 84 | 76 | 92 |
| ASF_RMS,S | 80-134 kHz/s | 88 kHz/s at site 4 | 94 | 100 | 88 |
| ASC_S × ASF_RMS,S | 14.7-48.3 Hz²/s | 14.7 Hz²/s at site 4 | 90 | 92 | 92 |

Fig. 2.18 ROC curve of the systole ASC mean feature. The greatest area under the curve is achieved at location 3, just distal to the stenosis

ROC curves (Figs. 2.18, 2.19, 2.20, and 2.21), J_O values, and peak classifier performance differed across features and recording sites. If used clinically for detection of a failing vascular access, J_O could serve as a threshold: patients above threshold could be scheduled for imaging with elective surgery or enrolled in a behavioral intervention program to help reduce the likelihood of sudden vascular access thrombosis. The best performance at estimating DOS > 50% was achieved with the combined feature ASC_S × ASF_RMS,S. On its own, ASC_S had good specificity, while ASF_RMS,S had high sensitivity; the combined feature showed enhanced classifier accuracy. This result suggests that this simple classification would detect DOS > 50% with an accuracy of 90%, specificity of 92%, and sensitivity of 92% when using a threshold of 14.7 Hz²/s.


Fig. 2.19 ROC curve of the ASC difference between systole and diastole. The greatest area under the curve is achieved at location 3, just distal to the stenosis

Fig. 2.20 ROC curve of systole ASF RMS. The greatest area under the curve is achieved at location 4, distal to the stenosis


Fig. 2.21 ROC curve of systole ASC mean multiplied by systole ASF RMS. The greatest area under the curve is achieved at location 4, distal to the stenosis

2.6 Summary of Stenosis Detection and Classification Performance

Our in vitro tests on vascular phantoms spanned a range of simulated stenosis severities (10-85%) over blood flows ranging from 700 to 1200 mL/min. These values were chosen to represent adequate blood flow in vascular accesses to support dialysis treatment. Despite having sufficient blood flow, however, accesses with significant stenosis (>50%) represent patients at risk for clotting or rapid stenosis progression due to cell growth. The clinical challenge, in this case, is that patients with significant stenosis who are still receiving dialysis treatment might not be identified. Phonoangiographic detection of stenosis location and severity could help to screen at-risk patients as they receive treatment. Therefore, this work had two primary goals: to detect the location of a stenotic lesion, and to detect stenosis >50% using PAGs, regardless of the level of blood flow in the vascular access.

For stenosis localization, these results demonstrate two important outcomes. First, for all levels of stenosis and all blood flow rates, many ASC and ASF features were statistically different at recording sites 1 cm apart along the direction of blood flow. This suggests that multiple recordings are needed to accurately measure PAGs from a vascular access. Most previous work in this field has taken only a single recording at the site of a known lesion, or dual recordings at the arterial anastomosis and in the venous outflow tract [24, 33, 35, 37]. The need for multiple locations is highlighted by our second outcome, showing that stenoses can be easily localized within 1 cm by analyzing changes in ASC or ASF features between successive


recording sites. Because stenosis induces turbulent flow in the region 0.6-1.8 cm distal to the lesion (assuming a 6-mm vessel diameter), this outcome is supported by the current understanding of hemodynamics in vascular access [39, 40].

In terms of stenosis classification, we demonstrated a simple threshold-based binary classifier based on a combined feature calculated during systolic phases. A simple threshold-based classifier is potentially less susceptible to over-fitting and more generalizable to clinical practice. Further, these results provided insight into the influence of stenosis on the spectral character of PAGs through three findings. First, the best classification performance for all tested features was obtained at recording sites 3 and 4, which were distal to the stenosis. This is likely due to the presence of vortices and turbulent flow in the distal region, which produce higher acoustic frequencies due to local flow acceleration. Blood flow studies confirmed that this effect occurs with DOS > 50%, and turbulent flow persists in the region 1.2-4.8 cm beyond a lesion [39, 40]. Because recording sites 3 and 4 were over this region, the associated increase in systolic ASC likely resulted from local turbulent vortices occurring for DOS greater than 50%.

Second, although the classifier ASC_S > 222 Hz was 96% specific for ruling out DOS > 50%, it correctly identified pathologic DOS only 68% of the time. Because our test methods used flow rates over a wide physiologic range, low flow rates produced proportionally lower ASC regardless of DOS. While this suggests that ASC could be used as a marker for low flow rate, absolute spectral measures like ASC cannot distinguish between a low flow rate with no stenosis and a low flow rate with significant stenosis. Therefore, we also used ASF, which is a relative spectral measure, to improve classifier performance at low flow rates. Rather than using a two-stage classifier, we found that the ASC and ASF features could simply be multiplied to yield a single combined feature to compare against a threshold for stenosis classification.

Third, our analysis found significant acoustic spectral differences between the systolic and diastolic phases of blood flow. Segmentation prior to feature calculation permits relative comparison between these two phases, e.g., using the feature |ASC_S − ASC_D|. In our study, we found that when the pitch difference between systolic and diastolic phases exceeded 56 Hz, there was a 92% chance that a significant stenosis produced the effect. Because this was a relative measure, the spectral difference was not as dependent on flow rate. However, specificity was reduced because of the increased variance of the combined feature due to minor differences in acoustic pitch between pulsatile cycles. Nevertheless, this finding reinforces the concept that increased acoustic pitch in PAGs is produced by turbulent flow occurring during the forward (systolic) phase of blood flow.

The presented results have several limitations, including the use of vascular phantoms to produce controllable degrees of stenosis and adjustable flow rate. In a clinical monitoring setting, additional anatomic variability and the presence of other vascular malformations could produce multiple locations of turbulent flow without increasing the clotting risk. Due to the wide range of scenarios in the patient population, simple binary classification as described here may be more


robust due to its simplicity. However, the promise of PAG monitoring lies not only in the detection of flow disruptions, but in trend analysis to detect an underlying change in anatomy that produces spectral changes in PAGs. Our results indicate that stenosis can be localized to an accuracy of 1 cm, suggesting that a growing lesion could be detected through regular monitoring. This would enable the use of imaging or other preventative interventions to reduce the risk of thrombosis in patients with progressing stenosis or other vascular access malformations.

2.7 Conclusion

Failure of vascular access is one of the main causes of hospital visits for hemodialysis patients. Because PAGs can be collected noninvasively directly before dialysis treatment, they offer a potential method for screening vascular accesses at risk for sudden clotting. However, analysis of PAGs presents several challenges, including microphone array design and signal processing of complex auditory signals with wide dynamic range. Prior work in this field has established a link between PAG spectral content and degree of stenosis, but new tools and techniques are needed to enable clinical utility. In this chapter, we have described signal processing techniques to filter, analyze, and segment PAG signals to obtain several features correlated with degree of stenosis, location of turbulent blood flow, and flow rate. This method of analysis relied on multi-site recordings enabled by flexible miniaturized microphone arrays. Testing on vascular access phantoms at physiologic flows and pressures showed that stenosis could be localized within 1 cm of the true location and classified into three grades (mild, moderate, and severe). To simulate a clinical use case, a binary classification method was adopted in which features were simply compared to a threshold to identify access conditions at risk for thrombosis (stenosis > 50%). Regardless of flow rate, this analysis showed that an optimized feature achieved an accuracy of 90%, specificity of 92%, and sensitivity of 92% at detecting pathologic stenoses from 10-s PAG recordings. Future efforts to combine spectral- and time-domain features through a machine learning approach could further enhance the ability to automatically and rapidly detect vascular access dysfunction at the point of care.

Acknowledgements This work was supported in part by RX001968-01 from the US Dept. of Veterans Affairs Rehabilitation Research and Development Service, the Advanced Platform Technology Center of the Louis Stokes Cleveland Veterans Affairs Medical Center, and Case Western Reserve University. The contents do not represent the views of the US Government.


References

1. Sung, P. H., Kan, C. D., Chen, W. L., Jang, L. S., & Wang, J. F. (2015). Hemodialysis vascular access stenosis detection using auditory spectro-temporal features of phonoangiography. Medical & Biological Engineering & Computing, 53(5), 393-403.
2. Pisoni, R. L., Zepel, L., Port, F. K., & Robinson, B. M. (2015). Trends in US vascular access use, patient preferences, and related practices: An update from the US DOPPS Practice Monitor with international comparisons. American Journal of Kidney Diseases, 65(6), 905-915.
3. Feldman, H. I., Kobrin, S., & Wasserstein, A. (1996). Hemodialysis vascular access morbidity. Journal of the American Society of Nephrology, 7(4), 523-535.
4. Cayco, A. V., Abu-Alfa, A. K., Mahnensmith, R. L., & Perazella, M. A. (1998). Reduction in arteriovenous graft impairment: Results of a vascular access surveillance protocol. American Journal of Kidney Diseases, 32, 302-308.
5. Sehgal, A. R., Dor, A., & Tsai, A. C. (2001). Morbidity and cost implications of inadequate hemodialysis. American Journal of Kidney Diseases, 37(6), 1223-1231.
6. Lacson, E., Wang, W., Lazarus, J. M., & Hakim, R. M. (2010). Change in vascular access and hospitalization risk in long-term hemodialysis patients. Clinical Journal of the American Society of Nephrology, 5(11), 1996-2003.
7. Duque, J. C., Tabbara, M., Martinez, L., Cardona, J., Vazquez-Padron, R. I., & Salman, L. H. (2017). Dialysis arteriovenous fistula failure and angioplasty: Intimal hyperplasia and other causes of access failure. American Journal of Kidney Diseases, 69(1), 147-151.
8. Roy-Chaudhury, P., Sukhatme, V. P., & Cheung, A. K. (2006). Hemodialysis vascular access dysfunction: A cellular and molecular viewpoint. Journal of the American Society of Nephrology, 17(4), 1112-1127.
9. Medicare claims processing manual. Publication # 100-04. Retrieved June 1, 2019, from https://www.cms.gov/Regulations-and-Guidance/Guidance/Manuals/downloads/clm104c08.pdf
10. Hemodialysis | NIDDK. [Online]. Retrieved January 4, 2019, from https://www.niddk.nih.gov/health-information/kidney-disease/kidney-failure/hemodialysis
11. Seo, J. H., & Mittal, R. (2012). A coupled flow-acoustic computational study of bruits from a modeled stenosed artery. Medical & Biological Engineering & Computing, 50, 1025-1035.
12. Krivitski, N. (2014). Why vascular access trials on flow surveillance failed. The Journal of Vascular Access, 15(7_Suppl), 15-19.
13. White, J. J., Ram, S. J., Jones, S. A., Schwab, S. J., & Paulson, W. D. (2006). Influence of luminal diameters on flow surveillance of hemodialysis grafts: Insights from a mathematical model. Clinical Journal of the American Society of Nephrology, 1(5), 972-978.
14. Moist, L., & Lok, C. E. (2019). Con: Vascular access surveillance in mature fistulas: Is it worthwhile? Nephrology, Dialysis, Transplantation, 34, 1106-1111.
15. Tessitore, N., Bedogna, V., Verlato, G., & Poli, A. (2014). The rise and fall of access blood flow surveillance in arteriovenous fistulas. Seminars in Dialysis, 27(2), 108-118.
16. Duncan, G. W., Gruber, J. O., Dewey, C. F., Myers, G. S., & Lees, R. S. (1975). Evaluation of carotid stenosis by phonoangiography. The New England Journal of Medicine, 293(22), 1124-1128.
17. Chen, W.-L., Chen, T., Lin, C.-H., Chen, P.-J., & Kan, C.-D. (2013). Phonoangiography with a fractional order chaotic system: A new and easy algorithm in analyzing residual arteriovenous access stenosis. Medical & Biological Engineering & Computing, 51(9), 1011-1019.
18. Majerus, S. J. A., Knauss, T., Mandal, S., Vince, G., & Damaser, M. S. (2018). Bruit-enhancing phonoangiogram filter using sub-band autoregressive linear predictive coding. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 1416-1419).
19. Doyle, D. J., Mandell, D. M., & Richardson, R. M. (2002). Monitoring hemodialysis vascular access by digital phonoangiography. Annals of Biomedical Engineering, 30(7), 982.


20. Allon, M., & Robbin, M. L. (2009). Hemodialysis vascular access monitoring: Current concepts. Hemodialysis International, 13(2), 153-162.
21. Du, Y.-C., Chen, W.-L., Lin, C.-H., Kan, C.-D., & Wu, M.-J. (2015). Residual stenosis estimation of arteriovenous grafts using a dual-channel phonoangiography with fractional-order features. IEEE Journal of Biomedical and Health Informatics, 19(2), 590-600.
22. Du, Y.-C., Kan, C.-D., Chen, W.-L., & Lin, C.-H. (2014). Estimating residual stenosis for an arteriovenous shunt using a flexible fuzzy classifier. Computing in Science & Engineering, 16(6), 80-91.
23. Wu, M.-J., et al. (2015). Dysfunction screening in experimental arteriovenous grafts for hemodialysis using fractional-order extractor and color relation analysis. Cardiovascular Engineering and Technology, 6(4), 463-473.
24. Mansy, H. A., Hoxie, S. J., Patel, N. H., & Sandler, R. H. (2005). Computerised analysis of auscultatory sounds associated with vascular patency of haemodialysis access. Medical & Biological Engineering & Computing, 43(1), 56-62.
25. Shinzato, T., Nakai, S., Takai, I., Kato, T., Inoue, I., & Maeda, K. (1993). A new wearable system for continuous monitoring of arteriovenous fistulae. ASAIO Journal, 39(2), 137-140.
26. Wang, H.-Y., Wu, C.-H., Chen, C.-Y., & Lin, B.-S. (2014). Novel noninvasive approach for detecting arteriovenous fistula stenosis. IEEE Transactions on Biomedical Engineering, 61(6), 1851-1857.
27. Chen, W.-L., Lin, C.-H., Chen, T., Chen, P.-J., & Kan, C.-D. (2013). Stenosis detection using Burg method with autoregressive model for hemodialysis patients. Journal of Medical and Biological Engineering, 33(4), 356.
28. Obando, P. V., & Mandersson, B. (2012). Frequency tracking of resonant-like sounds from audio recordings of arterio-venous fistula stenosis. In 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (pp. 771-773).
29. Vesquez, P. O., Marco, M. M., & Mandersson, B. (2009). Arteriovenous fistula stenosis detection using wavelets and support vector machines. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1298-1301).
30. Wang, Y.-N., Chan, C.-Y., & Chou, S.-J. (2011). The detection of arteriovenous fistula stenosis for hemodialysis based on wavelet transform. International Journal of Advanced Computer Science, 1(1), 16-22.
31. Sato, T., Tsuji, K., Kawashima, N., Agishi, T., & Toma, H. (2006). Evaluation of blood access dysfunction based on a wavelet transform analysis of shunt murmurs. Journal of Artificial Organs, 9(2), 97-104.
32. Chen, W.-L., Kan, C.-D., & Lin, C.-H. (2014). Arteriovenous shunt stenosis evaluation using a fractional-order Fuzzy Petri net based screening system for long-term hemodialysis patients. Journal of Biomedical Science and Engineering, 7(5), 258-275.
33. Gram, M., et al. (2011). Stenosis detection algorithm for screening of arteriovenous fistulae. In K. Dremstrup, S. Rees, & M. Ø. Jensen (Eds.), 15th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics (NBC 2011). IFMBE Proceedings (Vol. 34). Berlin: Springer.
34. Todo, A., et al. (2012). Frequency analysis of shunt sounds in the arteriovenous fistula on hemodialysis patients. In The 6th International Conference on Soft Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems (pp. 1113-1118).
35. Chen, W.-L., Chen, T., Lin, C.-H., Chen, P.-J., & Kan, C.-D. (2013). Phonographic signal with a fractional-order chaotic system: A novel and simple algorithm for analyzing residual arteriovenous access stenosis. Medical & Biological Engineering & Computing, 51(9), 1011-1019.
36. Munguia, M. M., & Mandersson, B. (2011). Analysis of the vascular sounds of the arteriovenous fistula's anastomosis. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 3784-3787).


37. Grochowina, M., Leniowska, L., & Dulkiewicz, P. (2014). Application of artificial neural networks for the diagnosis of the condition of the arterio-venous fistula on the basis of acoustic signals. In Brain Informatics and Health (pp. 400-411). https://doi.org/10.1007/978-3-319-09891-3_37
38. Rousselot, L. (2014). Acoustical monitoring of model system for vascular access in haemodialysis. Master thesis. Retrieved from http://resolver.tudelft.nl/uuid:bcd26bbd-e2a1-48d9-9b9f-dbb1138b8009
39. Gaupp, S., Wang, Y., How, T. V., & Fish, P. J. (2000). Characterization of vortices using pulsed-wave Doppler ultrasound. Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine, 214(6), 677-684.
40. Gårdhagen, R. (2013). Turbulent flow in constricted blood vessels: Quantification of wall shear stress using large eddy simulation. PhD dissertation, Linköping. https://doi.org/10.3384/diss.diva-100918
41. Athineos, M., & Ellis, D. P. W. (2003). Frequency-domain linear prediction for temporal features. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), St. Thomas, VI (pp. 261-266). https://doi.org/10.1109/ASRU.2003.1318451
42. Athineos, M., & Ellis, D. P. W. (2007). Autoregressive modeling of temporal envelopes. IEEE Transactions on Signal Processing, 55(11), 5237-5245.
43. Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293.
44. Chin, S., Panda, B., Damaser, M. S., & Majerus, S. J. A. (2018). Stenosis characterization and identification for dialysis vascular access. In IEEE Signal Processing in Medicine and Biology Symposium 2018. https://doi.org/10.1109/SPMB.2018.8615597
45. Panda, B., Chin, S., Mandal, S., & Majerus, S. (2018). Skin-coupled PVDF microphones for noninvasive vascular blood sound monitoring. In 2018 IEEE Signal Processing in Medicine and Biology Symposium (SPMB) (pp. 1-4).
46. Gould, K. L. (2013). Effects of stenosis on coronary flow. Cleveland Clinic Journal of Medicine, 47(3), 140-144.
47. Bluestein, D., Gutierrez, C., Londono, M., & Schoephoerster, R. T. (1999). Vortex shedding in steady flow through a model of an arterial stenosis and its relevance to mural platelet deposition. Annals of Biomedical Engineering, 27(6), 763-773.
48. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
49. Ruopp, M. D., Perkins, N. J., Whitcomb, B. W., & Schisterman, E. F. (2008). Youden Index and optimal cut-point estimated from observations affected by a lower limit of detection. Biometrical Journal, 50(3), 419-430.
50. Schisterman, E. F., Perkins, N. J., Liu, A., & Bondell, H. (2005). Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology, 16(1), 73-81.

Chapter 3

The Temple University Hospital Digital Pathology Corpus

Nabila Shawki, M. Golam Shadin, Tarek Elseify, Luke Jakielaszek, Tunde Farkas, Yuri Persidsky, Nirag Jhala, Iyad Obeid, and Joseph Picone

3.1 Introduction

Pathology is a branch of medical science focused on the cause, origin, and nature of disease [1]. It involves the examination of tissues, organs, and bodily fluids to diagnose disease. A typical pathology laboratory workflow involves the preparation of a very thin tissue specimen mounted on a glass slide using either a frozen section or a paraffin wax agent [2]. A stain designed to enhance imaging and facilitate analysis by a board-certified pathologist is also applied [3]. Though hospitals are required by legislation such as the Clinical Laboratory Improvement Amendments (https://www.cdc.gov/clia/law-regulations.html) to archive these slides for a minimum period ranging from 2 to 10 years, most major hospitals house them for 10 years or more [4, 5]. Unfortunately, these extensive archives, which are organized only by case numbers, often exist offsite, making it extremely difficult for pathologists to query the available data for routine decision support. The slides also tend to degrade over time, which is another argument for the archival of high-resolution digital images [6].

Pathology laboratories are a vital, behind-the-scenes, round-the-clock operation that plays a critical role in a hospital's surgical mission. They provide analysis of various samples (e.g., surgical biopsies) in real time and contribute to a large percentage of the medical decisions made at a hospital that involve surgery or



Fig. 3.1 A textbook case of low-grade prostate cancer

emergency care [7]. Interpretation skills take years to develop (4 years of college, 4 years of medical school, 4 years of residency, and 1-2 years of fellowship). Each diagnosis can be life-altering, so accuracy is critical. Hence, skilled professionals trained to perform this task with precision and accuracy are in short supply, and this pool is expected to decrease in the next two decades [8], increasing the need for decision support technology that can increase a pathologist's productivity.

An example of a typical pathology slide is displayed in Fig. 3.1, showing a textbook case of low-grade prostate cancer [9]. This slide was determined to be cancerous because approximately 10% of the core from the left base (A) and about 1% of the core from the right mid portion of the gland (B), both highlighted with arrows in Fig. 3.1, were cancerous. Because pathologists must look for extremely small and subtle abnormalities in these slides, they have traditionally relied on a conventional analog optical microscope [10] that can be adjusted to a variety of magnification levels and focus points [11]. It has been difficult for pathologists to embrace digital technology because they require scanned slides that can be viewed at ultra-high resolutions. Digital technology has only recently made this feasible. However, this technology provides a much different user experience than they are accustomed to, and there are concerns that diagnostic skills developed for microscopic inspection may not carry over to the inspection of digital slides [12].


Digital pathology refers to the process of digitizing an analog image so that images can be retrieved, manipulated, and evaluated by computers. It is rapidly gaining momentum as a proven and essential technological tool now that the cost of digital storage has dropped significantly. In addition to enabling a broad array of research agendas, digital imaging is impacting primary diagnosis, diagnostic consultation, medical student training, peer review, and tumor boards. The latter two are important parts of a pathologist's workflow: regular evaluation and review play a crucial role in their ability to maintain certification.

Scanning slides at resolutions acceptable to pathologists can be problematic. There are currently two dominant vendors in the commercial marketplace: the Leica Biosystems Aperio AT2 used in this study [13] and the Philips IntelliSite Pathology Solution [14]. Images are routinely scanned at a resolution of 50K × 50K pixels per slice, and often these slices are stacked to produce a 3D image, a process known as z-stacking [15]. These digitized slides, which can exceed the limits of a JPEG representation [16] due to the large pixel count, can require gigabytes of disk space to store a single image, and large archives of these images can require petabytes of storage. Fortunately, creating a file server to host this amount of disk space is relatively inexpensive [17], enabling a transition to digital imaging.

Digital pathology can also have a positive impact on the practitioner's lifestyle. Pathologists spend a significant amount of their time on call and must travel to their worksite, often at inopportune hours or peak traffic periods, to review slides for emergencies (e.g., organ transplants). Digital pathology allows slides to be reviewed remotely, reducing the need for lengthy commutes and lifestyle disruptions while also lowering labor costs. This form of telemedicine is becoming increasingly popular in low-resource countries where there are no alternatives [18].

This growing demand for digital pathology is creating an enormous opportunity for the application of machine learning techniques to accelerate diagnostics. Digital pathology is one of several healthcare-related imaging fields poised to embrace a new generation of artificial intelligence-based decision support technology [19-21]. Over ten million pathology slides are produced and interpreted by experts annually in the United States alone. This suggests that there is an ample supply of data to support machine learning research if it can be acquired and curated cost-effectively. The digitization process also offers long-term benefits, including the prevention of physical slide decay, e.g., stain discoloration and tissue degradation [22].

Many medical arenas have adopted digital pathology in tissue-based research because it offers the convenience of image analysis techniques such as detection, segmentation, and classification [21]. Cancer research has applied digital pathology techniques along with machine learning and deep learning systems to yield very promising results [23, 24]. Though computer science research on this application currently focuses on simple problems such as a cancer/no-cancer decision [25], pathologists can diagnose thousands of conditions over the course of their career. Further, the severity of a disease, often quantified using a scale such as Gleason's index [26] or the more recent International Society of Urological Pathologists (ISUP) Grade Group [27], is


part of a typical diagnosis. The pathology that triggers such a diagnosis can often result from observations collected from a small percentage of the overall image. This type of unbalanced dataset, where one class assignment is much more likely than the rest, causes problems for machine learning algorithms [28]. In cases where one class occurs a disproportionately high number of times (e.g., the non-event class occurs 99% of the time while the event classes occur 1% of the time), machine learning algorithms tend to learn to guess the most frequently occurring class. This is the obvious way to maximize performance, especially if the level of performance (e.g., 50%) is significantly lower than the prior probability of the most frequently occurring class (e.g., 99%). It is not unusual for guessing based on priors to outperform a machine learning algorithm in the early stages of technology development.

We often use the analogy of finding a "needle in a haystack" to describe these kinds of problems, since the entire image must be searched for small, localized events that occur infrequently. Though pathologists can do this effortlessly, it is extremely challenging for a machine learning system. The ability to accurately segment the image plays a crucial role in high-performance classification. Reduction of false alarms, which we will discuss in detail later, is a very significant part of the algorithm design in these kinds of applications, since the system must be pressured to not always guess the most likely class.

Machine learning approaches for digital pathology, specifically deep learning approaches, are still in their infancy, lacking the necessary data to support complex model development. Hence, the goal of this chapter is twofold. First, we discuss the development of a digital pathology corpus that will support clinical decision support and medical student training. This corpus is also being created to energize deep learning research on the automatic interpretation of images; it is part of a National Science Foundation Major Research Instrumentation (NSF MRI) grant [29] focused on the development of a digital pathology resource. Second, we present some preliminary findings on the development of deep learning architectures to classify high-resolution digital pathology images.
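A tiny illustration of this failure mode, using hypothetical 99:1 class proportions:

```python
import numpy as np

# On a 99:1 unbalanced set, always guessing the majority class scores
# 99% accuracy yet detects none of the events of interest.
labels = np.array([0] * 990 + [1] * 10)        # hypothetical 99%/1% split
pred = np.zeros_like(labels)                   # classifier that always guesses "non-event"
accuracy = np.mean(pred == labels)             # 0.99
sensitivity = np.mean(pred[labels == 1] == 1)  # 0.0: every event is missed
print(accuracy, sensitivity)
```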

3.1.1 Digital Pathology

Annotation of a large archive of whole slide images (WSIs) on a cloud-based server provides an opportunity for pathologists to quickly retrieve WSIs through simple natural language queries that integrate both image data and electronic medical records. We have been developing such technology for many years in related areas of human language and medicine such as speech [30] and electroencephalograms [31, 32]. The ability to query data representing physical signals (e.g., digital images) symbolically can be quite powerful as both a decision support and teaching tool.

The workflow of a pathologist begins with fixation of biological tissue in order to prevent its deterioration via autolysis [33]. Fixation is the chemical process
by which biological tissues are preserved from decay through autolysis or putrefaction. Tissue fixation is a critical step in the preparation of histological sections since it allows for the assembly of thin, stained sections. Though the details of this process vary depending on the type of specimen, the general process for creating the samples can be summarized as follows. The tissues are trimmed with a scalpel and transferred to a cassette where they are processed following these steps: (1) dehydration with increasing concentrations of alcohol, (2) clearing with an organic solvent such as xylene, and (3) embedding with paraffin wax. Next, the embedded tissue is cut into thin slices, known as sections, about the thickness of a piece of paper (~5 μm for light microscopy). Finally, each section is transferred to a glass slide where a stain is applied to the sections, and a cover is placed on them to generate a specimen sample. At the end of this process, a board-certified pathologist inspects the specimen sample through an analog optical microscope to generate a diagnosis. The glass slides containing the sample are then sorted by their specimen type and archived at an offsite location.

In digital pathology, an image of an analog specimen is captured as a whole slide image (WSI) using a laser scanner that produces an image with a resolution of 0.2 μm/pixel. Though a large number of slides can often be scanned automatically, approximately 5% of our slides require selecting manual focus points to help the scanner properly focus. WSIs can contain one or more specimens, which further complicates the machine learning problem. In Fig. 3.2, several examples of WSIs from the breast specimen type are shown. The first (top) specimen was obtained from a lumpectomy of the right breast and exhibits a condition known as atypical lobular hyperplasia. The second (middle) specimen, obtained from an excision of the right breast duct, is an example of intraductal papilloma with ductal hyperplasia. The third (bottom) specimen, obtained from a biopsy of the right breast, exhibits fibroepithelial lesions consistent with fibroadenoma. These examples demonstrate that in real clinical data the number of specimens per slide ranges from one to six. These types of WSIs warrant additional time to study as the focal points and scanning area need manual adjustments. Such data requires the machine learning system to segment the image and determine the number of specimens as part of the classification process. This significantly complicates the problem.

Despite the existence of a large volume of pathology data in private institutions, there is no existing comprehensive public WSI corpus. Currently available resources, such as The Cancer Genome Atlas (TCGA) [34], contain only hundreds of slides per cancer type. Barker et al. [25] utilized the TCGA Corpus to create an automated classification system for types of brain tumors and claimed that the machine's performance surpassed that of human experts. This study utilized a subset of the TCGA Corpus that consisted of 604 WSIs of two types of brain cancer (240 lower grade glioma and 364 glioblastoma multiforme slides were used). These types of limited studies can commonly be found in the digital pathology literature. Performance estimates based on such small corpora can often be overly optimistic. Private corpora, on the other hand, contain WSIs on the scale of hundreds of thousands. One such example is a dataset of approximately 300,000 WSIs developed by Philips and LabPON [35].
Fig. 3.2 A range of WSIs of breast specimens

These private corpora, however, are built
and maintained for the development of proprietary commercial software and are not available to the general public as open source resources. Open source access usually requires release of the data in such a way that licenses or data sharing agreements are not mandatory. This creates an added level of complexity since unencumbered sharing of data in the bioengineering community is less common than one might imagine. The Neural Engineering Data Consortium (www.nedcdata.org) and the Institute for Signal and Information Processing (www.isip.piconepress.com), both at Temple University (TU), have a history of delivering such data (and related
resources such as software) for research and commercial use that dates back to the early 1980s. NEDC currently has over 2000 bioengineering researchers subscribed to its EEG resources (https://www.isip.piconepress.com/projects/tuh_eeg) [36]. These corpora are among the only resources of this scale that are truly available in an open source manner [37]. No licensing or data agreements are required to download the data. No research-only restrictions have been placed on the data. The data is freely available for research and commercial use. The corpus described in this chapter, known as the TUH Digital Pathology Corpus (TUDP), is being developed and released in a similar unencumbered manner (see https://www.isip.piconepress.com/projects/nsf_dpath for more details). The near-term goal of this project is to release 100,000 slides by the end of 2020. Our ultimate goal is to release one million slides by the end of the decade.
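A release of this size implies petabyte-scale storage. The back-of-the-envelope sketch below uses the per-slide sizes reported later in this chapter (roughly 200 MB for a typical slide, up to 5 GB for z-stacked images); the 80/15/5 mix of slide types is a hypothetical split chosen purely for illustration.

```python
# Back-of-the-envelope storage estimate for a one-million-slide corpus.
# Per-slide sizes follow Sect. 3.2.2; the slide-type mix is hypothetical.
MB, GB, PB = 1e6, 1e9, 1e15

n_slides = 1_000_000
mix = [
    (0.80, 200 * MB),  # typical single-specimen slide
    (0.15, 1 * GB),    # slide with multiple specimens
    (0.05, 5 * GB),    # z-stacked slide
]
avg_bytes = sum(frac * size for frac, size in mix)
print(f"average slide: {avg_bytes / GB:.2f} GB")             # 0.56 GB
print(f"total archive: {n_slides * avg_bytes / PB:.2f} PB")  # ~0.56 PB
```

Under these assumptions the archive lands at roughly half a petabyte, consistent with the 1 PB of storage provisioned in Sect. 3.2.1 once headroom and growth are accounted for.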

3.1.2 Deep Learning

Due to its success in diverse areas such as speech recognition, machine translation, and computer vision, deep learning has enjoyed great popularity in recent years. Previous neural network systems based on proven technologies such as a multilayer perceptron traditionally used three layers. A deep learning system [38] typically consists of a much larger number of layers, or nonlinear learning modules, that successively abstract the data until a final classification is made. These abstractions formed at one layer become the input to the next layer and are trained using a combination of supervised and unsupervised learning. Supervised training requires annotated data, which is often difficult to acquire. Unsupervised training is attractive because it does not require detailed annotations of the data. The goal of our corpus is to lessen the burdens of supervised training by providing consistent and correctly annotated digital pathology slides to further develop the machine learning and healthcare fields. In this chapter, we will discuss the challenges of annotating large amounts of pathology data.

Deep learning architectures are notoriously sensitive to minor changes in the input data and must often be optimized for a specific application. Hence, a second goal in this chapter is to provide a baseline architecture for classification of high-resolution pathology images. We leverage our previous work on developing deep learning systems for electroencephalograms (EEGs) [39]. Our work on EEGs and WSIs share a common theme: both applications require spatial context to achieve high performance. The framework we introduce can easily accommodate a large variety of deep learning algorithms.

Applications that require spatial or temporal context typically use an architecture based on Convolutional Neural Networks (CNNs) [40]. These have proven useful in many applications including EEG event detection [41], speech recognition [42], and image recognition [43]. CNN architectures have made great advances in the field of image classification. For example, in Ghiasi et al. [44], state-of-the-art
object recognition was achieved at Google using CNNs by introducing an approach, DropBlock, which selectively deletes features. Regular dropout discards features regardless of their spatial correlation among layers, which can result in overfitting. In Chen et al. [45], Facebook achieved state-of-the-art performance on facial recognition by introducing a "double attention block" that "aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently."

Deep CNN architectures have also been applied to the field of digital pathology. Cireşan et al. [46] employed deep CNN networks for detection of mitosis within breast cancer histology images on a centralized pixel basis. Cruz-Roa et al. [47] applied patch-based CNNs that automatically derived hierarchical features and showed them to be superior to hand-crafted counterparts in detecting invasive ductal carcinoma. Hua et al. [48] demonstrated that machine learning techniques based on CNNs and deep belief networks outperformed traditional computer-aided diagnosis schemes in lung nodule classification. Sirinukunwattana et al. [49] demonstrated that spatially constrained CNNs can be used not only to classify a nucleus in a cell, but also to analyze the nuclei's localities in the image. Litjens et al. [24] concluded in their survey that state-of-the-art machine learning methods are now pervasive in the medical field, providing a convenient and effective solution for a variety of classification problems. Bejnordi et al. [50] recently used stacked convolutional neural networks to effectively classify whole slide images of breast tissue. Finally, Wang et al. [51] showed that deep learning systems, combined with professional pathologist classification, reduce the human error rate in diagnosis by approximately 85%. Integrating human knowledge into deep learning systems is a major focus for the next generation of machine learning approaches.

High-resolution WSIs introduce some unique challenges with respect to deep learning systems because these images cannot be processed as a whole. They must be segmented and scanned using a process similar to what is used for temporal signals. Hence, in this chapter, we also discuss some preliminary results on deep learning architectures, developed for these high-resolution images, that sequentially scan images using small frames and classify the entirety of the image by integrating a series of local decisions.

3.2 The TUH Digital Pathology Corpus (TUDP)

An overview of the major components of our NSF MRI project is provided in Fig. 3.3. These kinds of corpus-centric projects often require a significant number of iterations before we can arrive at a final corpus design and implementation strategy. Hence, we often follow a concurrent design process that includes frequent input from our subject matter experts, the Temple University Hospital (TUH) Department of Pathology in this case, and the community of researchers who will consume this data. We include researchers in the design process by maintaining a
community-wide listserv that facilitates discussions about the data. Further, we run concurrent machine learning experiments on the data to ensure that it is useful to machine learning researchers. Concurrent experimentation on the data is particularly important in guiding the data development and annotation processes since we can identify and avoid systematic biases or inconsistencies in the data.

A long-term goal of our digital pathology project is to integrate digital imaging into TUH's clinical operation. This will give us continuous access to data generated at the hospital and allow us to continue augmenting the corpus over the next decade. There are five major components to this goal: (1) computing infrastructure, (2) image digitization, (3) data organization, (4) data anonymization, and (5) data annotation. We briefly describe each of these in this section.

Fig. 3.3 The three phases of our NSF MRI corpus development project

3.2.1 Computing Infrastructure

The implementation of a cost-effective storage architecture to accommodate the 1 petabyte (PB) of storage required to hold one million pathology images was an important goal in this project. The architecture developed for this project is described in detail in Campbell et al. [17]. Although cloud computing is an enticing technology for many problem spaces, it does not come without its drawbacks, including, most notably, data privacy and security issues. In addition to its virtualization overhead, the cost of cloud storage at scale can be exorbitant. One example would be Amazon's S3 storage, which was priced at $0.021 per gigabyte-month (using their US East Coast pricing) [52], or approximately $35K/month/PB. For the cost of 1 month of storage, we were able to build dedicated hardware from
low-cost commercial components that cost roughly $45K/PB. This is an order of magnitude lower in cost than most high-end commercial solutions using proprietary hardware.

For the pathology corpus, it was essential that the operations performed on the data, particularly viewing and annotation, did not experience low throughput and high latency. Latency, in particular, would result in the system failing to meet the needs of clinicians. Another essential requirement for the infrastructure housing the corpus, especially in clinical settings, was its robustness with respect to physical hardware failure. This was because the data stored in the corpus would have to be Health Insurance Portability and Accountability Act (HIPAA) protected for clinical or diagnostic purposes, hence any loss of data due to disk or hardware failure would be disastrous. Therefore, it was imperative that the system be capable of withstanding hardware failure events and preserving the integrity of the data.

The computing infrastructure had to address two issues: computation and storage. For this reason, several improvements and expansions to our high-performance computing (HPC) cluster, known as Neuronix [53], were implemented. To manage the increased computational capacity and job throughput, Slurm, an open source workload manager, was employed as the scheduler and resource manager [54]. For storage, a multilayer filesystem based on Gluster [55] and the Zettabyte File System (ZFS) [56] was implemented to distribute a filesystem across numerous machines.

Slurm provides numerous features for handling a number of diverse situations. Its most notable feature is the built-in capability to support Graphics Processing Unit (GPU) scheduling, including jobs that use multiple GPUs simultaneously. Moreover, Slurm's general resources (GRES) feature supports scheduling of such devices alongside the resources handled by traditional HPC schedulers, such as memory and wall clock time. Additionally, Slurm performs meticulous enforcement of resource allocation, providing the necessary resources as the number of jobs increases and preventing jobs from taking resources allocated to others. In GPU-enabled frameworks, it is common to query the CUDA runtime library/drivers and iterate over a list of GPUs in order to institute a context on every GPU. Slurm has the ability to influence the hardware discovery process of such jobs, allowing them to operate simultaneously even if the GPUs are in exclusive-process mode [17].

For the storage architecture of the corpus, a robust and extensible distributed storage device was devised which employed numerous open source tools for the creation of a single filesystem. This filesystem can be mounted by any machine on the network. Figure 3.4 shows the file server architecture that has been developed for the corpus. The lowest abstraction level of the infrastructure consisted of the hard drives, which were split into four 60-disk chassis. Each disk chassis contained 8 TB drives. These systems are maintained by two server units, each equipped with Intel Xeon CPUs and 128 GB of RAM. One server serves as the primary server, while the second serves as the backup, mirroring the primary. A multilayer file solution was implemented that provides the entire disk farm, nearly 2 PB of raw storage, from a single mount point. The distribution of the storage among a network of machines provides the benefit of a fault tolerant and extensible system.
Fig. 3.4 The file server architecture used to develop the TUDP Corpus

Each machine provides an independent copy of the data by taking advantage of
ZFS's disk-level awareness. The ZFS RAID implementation [56] protects data from corruption as well as the RAID write-hole bug by utilizing copy-on-write functionality and a journaling scheme (the ZFS intent log, or ZIL). The mirrored configuration enables data written to one machine to be automatically copied to another machine without running explicit backup software. This is a necessary feature given the amount of disk space involved.

Ensuring data security and privacy is a major issue for a project of this nature. Figure 3.5 provides an overview of the physical implementation of the systems. Networking issues are quite significant here since the scanner resides on the hospital's secure HIPAA network (TUHS-HIPAA), while the file servers reside on a university HIPAA network (TUMC-HIPAA). The latter can accommodate researchers and the computing cluster that resides on the university's main campus network (TUSecure). It is interesting to note that even though the latter two networks are geographically separated from TUHS-HIPAA by about 2 miles in
North Philadelphia, they do not pose serious issues in terms of bandwidth and latency. These machines were not allowed to sit on the same physical network due to concerns about the impact they would have on hospital operations. Creating the ability for these systems to communicate with one another in a secure but transparent way, without burdening pathologists with complex VPN interfaces, was no easy feat, requiring modification of several router tables and firewalls.

Fig. 3.5 An overview of the HIPAA-compliant network architecture
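The economics that drove the dedicated-hardware decision reduce to a short calculation. The sketch below simply restates the figures quoted above; cloud pricing is tiered and changes over time, so treat it as illustrative arithmetic rather than a procurement estimate.

```python
# Cloud vs. dedicated storage for ~1 PB, using the figures cited above.
monthly_cloud = 35_000   # $/month for ~1 PB of S3 storage (as quoted)
one_time_hw = 45_000     # $ for the dedicated ~1 PB file server

print(f"one year of cloud storage: ${12 * monthly_cloud:,}")  # $420,000
print(f"dedicated hardware:        ${one_time_hw:,} one-time")
print(f"break-even after ~{one_time_hw / monthly_cloud:.1f} months")
```

At these rates the dedicated file server pays for itself in well under 2 months.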

3.2.2 Image Digitization

We are using a Leica Biosystems Aperio AT2 high volume scanner [13], as shown in Fig. 3.6, to digitize our slides. This scanner is an industry leading unit that includes an autoloader consisting of 10 trays that can hold 40 slides each, resulting in a total capacity of 400 slides. The AT2 can scan slides at magnifications of 20X and 40X. It also provides a z-stacking feature that can stack a maximum of 25 layers. The AT2 has a throughput of 50 slides per hour at 20X resolution, which allows it to scan 400 slides within a period of 8 h. The operation of the scanner is controlled via Aperio's ScanScope Console software. The images and their metadata are managed by a web-based application called eSlide Manager (eSM).

Fig. 3.6 The Aperio AT2 high volume scanner

A typical scanned specimen slide, as shown in Fig. 3.7, requires approximately 200 megabytes (MB) of storage. However, this size can increase to 1 gigabyte (GB)
for slides containing multiple specimens, as shown in Fig. 3.2, and up to 5 GB for z-stacked images. Since scanning 400 slides requires about 8 h to complete, the scanning operation is run overnight, and the resulting images are then organized using eSM the next day. Pre-scan snapshots must be taken before the scanner is set to perform full scans overnight, prohibiting complete automation of the scanning operation.

An overview of the process is given in Fig. 3.8. The process begins with the scanner taking a low-resolution snapshot of the specimen, placing several focus points on the image, and marking the region to be scanned with a green rectangular box. This automated process takes around 2 h for the 400 slides loaded and allows the user to complete a quick review of the snapshot. The number of pixels in the final image will vary depending on the size of the bounding box identified during this snapshot process.

There are some cases where the scanner makes errors in its placement of the focus points and in determining the area of the scanning region. Both cases will cause a failure in the image processing of the specimen, and thus a failure in scanning. Another error modality occurs when the scanning region is too large, which can cause unnecessary white space to be included in the WSI. In this case, the scanner might produce an image which is larger than necessary. Similarly, there is a risk of producing invalid images if the scanning region only partially encloses the specimen of interest. These events tend to occur with slides that are lightly stained or slides with a significant amount of white space between tissue samples. In such cases, the focus points must be manually placed. This makes the pre-scan snapshot phase of the scanning procedure labor intensive. Among the 400 slides set to scan overnight, we find about 2% are likely to experience a scanning failure. However, this number varies according to the quality of the stain applied to the slides. These failed slides are reviewed, readjusted, and scanned again the next morning.

Fig. 3.7 A typical WSI for a specimen from a breast case

Fig. 3.8 An overview of the scanning process
The scanned image produced by the Aperio AT2 is stored in a file format known as ScanScope Virtual Slides (SVS) [57]. A discussion of imaging file formats is beyond the scope of this chapter because this is a fairly complex and well-developed area of science in the medical imaging community [58]. The SVS file produced by the AT2 can be stored in one of many user-defined formats [59]. Each pixel is represented as a red, green, and blue (RGB) triplet using 8 bits per color. An SVS file is a layered image representation that includes several thumbnails and the original source image. The source image is stored using JPEG compression with a quality factor of 70. (The specific image type in the Aperio ImageScope software is "SVS/JPEG 2." The parameter "Image Depth" is set to 1 and "Image Channels" is set to 3.) These parameters result in roughly an order of magnitude of compression over lossless compression with minimal image degradation.

A full resolution image is stored as the baseline image using a tile size of 240 × 240 pixels (an image is represented as a series of adjacent tiles, or blocks). The following three layers contain downsampled versions of the image at resolutions of 4:1, 16:1, and 32:1. The final layer is a low-resolution thumbnail. Each of these layers is an image encoded using lossy JPEG encoding. An SVS file also contains a low-resolution picture of the slide's label as metadata and stores other information such as the downsample and offset information. The number of layers generated depends on the size of the original image. Smaller images (e.g., where the maximum dimension is less than 50K pixels) will generally have only two layers. These files can be viewed and edited in the Aperio ImageScope software [60]. Other open source software tools are also available that can view and manipulate SVS files [61]. SVS files are used as the primary file type for the pathology corpus due to their efficiency and their ability to handle full resolution images. The number of pixels in the original image exceeds the limits of the JPEG standard, and hence these images are collected and distributed using the SVS format.

Our primary software for viewing SVS files is Aperio ImageScope. This software features a wide variety of tools for image editing, adjusting, and annotation. The image adjustment tools include brightness and contrast controls, color balance, and color curve adjustment. These can be applied to all channels or to the individual red, green, or blue (RGB) channels. These adjustments only apply to the viewed image and do not modify or overwrite the stored image. The settings applied to the current session can be saved and applied to other scanned images. The default presets for ImageScope are always applied to the scanned images. Then, the image adjustment tools can be utilized to calibrate the image features according to the pathologist's preferences. The stains applied to specimens cause specific structures to adopt a distinct color, and this color can be enhanced by adjusting the brightness, contrast, and the color channels of the image via the image adjustment tools. This is beneficial for pathology diagnosis as it can be used to bring specific areas of investigation (such as cancerous tissue) into sharper focus or to enhance the quality of lightly stained specimens.
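The layer pyramid described above can be inspected programmatically. Below is a minimal sketch using the open source OpenSlide library (pip install openslide-python), one of the kinds of tools alluded to in [61]; the filename is a hypothetical example following the convention of Table 3.1.

```python
# A minimal sketch using the open source OpenSlide library
# (pip install openslide-python); the filename is hypothetical.
import openslide

slide = openslide.OpenSlide("0s19_12345_0a001_0012345678_lvl0001_s000.svs")

# Walk the layer pyramid: level 0 is the full-resolution baseline image,
# higher levels are the downsampled versions described above.
for level in range(slide.level_count):
    w, h = slide.level_dimensions[level]
    print(f"level {level}: {w} x {h} "
          f"(downsample {slide.level_downsamples[level]:.0f}:1)")

# Scanner metadata, e.g., microns per pixel (~0.2 for our scans).
print(slide.properties.get("openslide.mpp-x"))

# Read one 240 x 240 tile from the upper-left corner of the baseline
# image; read_region returns an RGBA image whose alpha is always opaque.
tile = slide.read_region(location=(0, 0), level=0, size=(240, 240))
slide.close()
```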
As mentioned earlier, the Aperio AT2 scanner features a z-stacking option. The scanner can produce multiple images of a slide's tissue scanned at different focal depths. This generates a 3D image that allows navigation of the image through different focal depths, which is analogous to the process pathologists use with an analog microscope. ImageScope features a tool for adjusting the focal depth that is similar to using the fine and coarse focus adjustments of a microscope objective. This feature has yet to be explored in the Department of Pathology at the Temple University Hospital but is being used by other hospitals. The z-stacked images are very large in size, often several gigabytes, which poses additional challenges for machine learning research.

3.2.3 Data Organization

The Aperio AT2 scanner is configured to scan images directly to the petabyte file server. These images are organized and added to the Leica Aperio eSlide Manager (eSM) database, which is hosted on a Windows application server. The eSM software is a web-based application that provides a management system for digital pathology information. Scanned images (eSlides using eSM terminology) can be viewed, managed, and analyzed. In eSM, there are three types of data hierarchies:

1. Research: data is ordered with projects at the top followed by the deidentified specimens and then eSlides.
2. Clinical: data is ordered with the cases at the top followed by deidentified specimens and then eSlides.
3. Educational: data is ordered with courses at the top followed by lessons, deidentified specimens, and eSlides.

We are using the "Research" hierarchy for our project where the top level is organized by case, which includes terms such as breast (breast), gastrointestinal (gastro), gynecology (gyneco), and lymph (lymphn). There is also a miscellaneous category (miscel) for special cases which do not have adequate documentation to be classified into one of the above categories. Each case consists of more specific specimens which are characterized by a clinical case number called the specimen ID. The eSlides are stored using the specimen ID along with an associated report for that specimen. The images and reports are stored on the file server using a file naming convention that is best explained by the example shown in Table 3.1.

Table 3.1 TUDP filename conventions

Full pathname:
tudp/v1.0.0/svs/gastro/001234/00123456/2015_03_05/0s19_12345/0s19_12345_0a001_0012345678_lvl0001_s000.svs

Directory components:
  Field            Description                                    Template     Example
  database name    4-letter acronym                               NNNN         tudp
  version          version number                                 vx.x.x       v1.0.0
  file type        type of data stored                            fff          svs
  case             6-letter code for the type of specimen         cccccc       gastro
  sequential ID    6-digit directory ID (zero-padded)             ######       001234
  patient ID       8-digit (zero-padded) patient ID               ########     00123456
  date             date specimen was collected                    yyyy_mm_dd   2015_03_05
  specimen type    4-digit code for the specimen type             tt##_        0s19_
  sequence number  5-digit sequence number                        #####        12345

Filename components:
  specimen type    4-digit code for the specimen type             tttt_        0s19_
  sequence number  5-digit sequence number                        #####        12345
  block level ID   2-letter code followed by 3-digit sequence     ll###        0a001
  patient ID       8-digit (zero-padded) patient ID               ########     00123456
  block cut type   3-letter code followed by a 4-digit sequence   bbb####      lvl0001
  sequence number  3-digit sequence                               s###         s000
  file extension   three-letter filename extension                .ext         .svs

The file naming convention is designed so that every file in the corpus has a unique filename, and simple UNIX commands can be used to locate data. The full filename of a standard image in the digital pathology corpus includes the case of the specimen image (e.g., "gastro"), the 8-digit Medical Record Number (MRN), the specimen type (e.g., "0s"), and the type of block cut (e.g., "lvl"). The list of cases is the same as that used to organize slides in eSM. The list of valid block cut types includes deep level
cut (dep), decal (dec), frozen (frz), immunohistochemistry (ihc), recut (rct), and standard (lvl). The lvl code is a general code for the site that is applied to distinguish these slides from samples of the same tissue site. The dep code is intended for deep slides, which are created when the initial cut of the sample did not reveal adequate detail, and hence a deeper cut had to be used. If the deeper tissue cut was still not enough to produce the intended level of detail, then another cut would be applied; these slides are given a code of rct. The rct code is also used for specimens that were cut from previously cut blocks. Slides created from specimens extracted from frozen tissue are given the frz code. It should be noted that lvl, dep, and rct slides provide a higher quality of image than frz slides because the tissue from which they are extracted is paraffin embedded rather than frozen.

The lvl, dep, frz, and rct codes are intended for slides that have hematoxylin and eosin (H&E) stains, the most common staining procedure employed in pathology [2]. For slides where immunohistochemistry stains were applied, the naming convention is changed to include the code ihc followed by a 4-letter code that denotes the type of stain applied. For example, if the immunohistochemistry stain iron wet was applied, the image naming convention will be "0s19_12345_0a001_ihc0irw_s000.svs," where 0irw is the code for this particular
stain. Our current list of codes has grown to over 200 in number and includes these frequently occurring codes:

• 0irw: An iron stain kit is used in the detection of ferric iron in tissues, blood smears, or bone marrow.
• 0tcw: A trichrome stain is a three-color staining protocol which is used to distinguish cells from connective tissue.
• 00er: An estrogen receptor (ER) antibody stain kit is used to recognize protein and strongly stains the nucleus of epithelial cells in breast carcinoma.

The information regarding the patient's MRN and specimen ID is extracted from the patient's medical records. Reports are stored as Microsoft Word documents (*.docx) and as plain text files (*.txt) to facilitate command line searches of the latter for content in UNIX. MRNs and sample identifiers are, of course, randomized before the images are released to the public domain in order to protect the patient's private information. The data regarding the specimens are also recorded in a spreadsheet that includes the dates when the specimen was collected and scanned, the patient's name, MRN, the specimen ID, the specimen's case, and other notes made by our technicians. Since this spreadsheet contains confidential data regarding patient identity, it is only available on the TUHS-HIPAA network.
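Because the naming convention in Table 3.1 is fixed-format, the fields of a filename can be recovered mechanically. A minimal sketch follows; the helper and its field names are our own illustration, not part of the released tools.

```python
import re

# A minimal parser for the naming convention of Table 3.1. The helper
# and its field names are ours, not part of the released tools.
PATTERN = re.compile(
    r"^(?P<specimen_type>\w{4})"
    r"_(?P<sequence>\d{5})"
    r"_(?P<block_level>\w{5})"
    r"_(?P<patient_id>\d+)"           # zero-padded patient ID
    r"_(?P<block_cut>[a-z]{3}\w{4})"  # e.g., lvl0001 or ihc0irw
    r"_s(?P<slice>\d{3})"
    r"\.(?P<ext>\w{3})$"
)

def parse_tudp_filename(name: str) -> dict:
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a TUDP filename: {name}")
    return match.groupdict()

print(parse_tudp_filename("0s19_12345_0a001_0012345678_lvl0001_s000.svs"))
# {'specimen_type': '0s19', 'sequence': '12345', 'block_level': '0a001',
#  'patient_id': '0012345678', 'block_cut': 'lvl0001', 'slice': '000',
#  'ext': 'svs'}
```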

3.2.4 Data Anonymization

Any information regarding the patient's identity must be kept anonymous under the Health Insurance Portability and Accountability Act (HIPAA) [62]. Protocol No. 24943 was approved by TU's Institutional Review Board to aid in this process. This protocol ensures that all the essential measures are taken to ensure the anonymity of the research subject information. For example, due to this protocol, the scanner must physically reside at TUH on the TUHS-HIPAA network. The slides never physically leave the hospital. The scanned digital images are stored on the TUHS-HIPAA network where they reside until the data is ready to be anonymized.

The process of removing patients' identification data from the scanned images and slides is referred to as deidentification or anonymization. For this process, we utilized a process similar to the one applied to the Temple University Hospital EEG Seizure Corpus [37]. Each patient is assigned a unique randomized 8-digit MRN. The specimen ID is also randomized. The mapping file that links the anonymized data to the original is stored in a secure location on the TUHS-HIPAA network. Although there is no encryption security enabled, this information never leaves the hospital and is not transmitted via email, phones, or laptops, or viewed via VPN, videoconference, or other forms of remote access. Only student workers who physically work at the hospital see this mapping file. The hospital, like most hospitals in the USA, does not encrypt the raw data at rest. Access to the data is controlled via computer account privileges.
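The randomization step lends itself to a simple illustration. The sketch below is our own, not the production tool; it only shows the shape of the mapping (unique randomized 8-digit MRNs, with the mapping retained separately on the secure network), and the input MRNs are hypothetical.

```python
# Illustrative sketch of assigning unique randomized 8-digit MRNs.
# The real deidentification pipeline is more involved, and its
# mapping file never leaves the TUHS-HIPAA network.
import random

def build_mrn_map(original_mrns):
    rng = random.SystemRandom()   # non-reproducible by design
    mapping, used = {}, set()
    for mrn in original_mrns:
        candidate = f"{rng.randrange(10**8):08d}"
        while candidate in used:  # enforce a one-to-one mapping
            candidate = f"{rng.randrange(10**8):08d}"
        used.add(candidate)
        mapping[mrn] = candidate
    return mapping

# The mapping itself is the sensitive artifact and must stay secured.
mapping = build_mrn_map(["55511122", "55533344"])
```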

Fig. 3.9 A typical clinical case report

The report accompanying each clinical case provides information regarding the patient from whom the specimen was extracted. An example of a report is shown in Fig. 3.9. The report contains information such as the patient's name, sex, age, MRN, and the collection date of the sample. This report includes more specific details such as the clinical history of the patient, a gross and microscopic description of the specimen being investigated, the stain applied to the tissue, and a medical diagnosis completed by the pathologist. The reports are originally created using a third-party medical records software product known as Epic [63], which is used throughout TUH. These reports are extracted from Epic and deidentified manually. A number of novel software tools are used to analyze these documents and ensure that no words appear that could compromise patient privacy. The anonymized reports are converted to text files and exported with the image files when the data is prepared for release.

As stated previously, SVS files contain a low-resolution tiled image of the slide label. An example is shown in Fig. 3.10, which contains a visualization of an SVS
file in the Aperio ImageScope software. This label, highlighted in the upper right of the figure, also has to be manually removed as the label contains data such as the patient's initials and the specimen ID. Other metadata, such as the collection date and case information, remain in the released files.

Fig. 3.10 An overview of the anonymization process

3.2.5 Annotation

The eSM software plays a critical role in the integration of the digital pathology corpus into the TUH workflow. The eSM software is a web-based tool that connects to a back-end SQL database. Since it is web-based, no additional software needs to be downloaded and installed by the pathologists to access the scanned images (eSlides). The images are automatically uploaded to eSM after they are scanned by the Aperio AT2 scanner and can be accessed using the Leica Web Viewer tool that is included with eSM. This tool allows pathologists to view multiple slides at once, which is particularly useful when making comparisons between the different types of cuts or the features revealed by the various stains applied. Another advantage of eSM is its ability to establish user groups and assign workflows. This is particularly important because we use eSM to schedule and track annotation of data, a task that is shared across a group of pathologists. We can coordinate community reviews and inter-rater agreement studies using this tool as well.

The Leica Web Viewer is also an annotation tool that pathologists are using to annotate images. It provides an assortment of shapes such as rectangles, circles, and ellipses that can be used to identify a region of interest. Among these tools, the pen
tool is the most advantageous as the features to be annotated are usually irregular in their structure. The pen tool allows the most precise types of labels to be generated.

The SVS files created and stored on the petabyte server can also be viewed and annotated using ImageScope, which is free, proprietary software provided by Leica Biosystems [60]. Similar to the Leica Web Viewer, ImageScope allows multiple SVS files to be viewed simultaneously and possesses the same tools for selecting the areas of annotation, such as the rectangle, ellipse, and pen tools. In addition, ImageScope provides a few more tools for adjusting and inspecting the annotations implemented by the pathologists. One such tool is the negative pen tool, which complements the pen tool and allows a selected area to be ignored (not considered for annotation). This is particularly useful for annotating specimens that contain lumens, the inside spaces of tubular structures, as this area should not be included in the annotation region. A few other useful tools available in ImageScope include the ruler tool, which is used for measuring the size of a feature, and the counter tool, which is useful for numbering annotations. A full list of the tools available for annotation is shown in Fig. 3.11.

Fig. 3.11 Annotation tools available in ImageScope

Additionally, ImageScope provides a detailed view window which consists of the following sections: Layers (records the layers of the eSlide), Layer Attributes (used to add and delete attributes for the layers), and Layer Regions (used to add and delete attributes for an annotation). The Layers and Layer Attributes sections list the layers in ascending order, but also allow the user to add a description, which is especially useful for z-stacked images as they are generated with multiple image layers at different focal lengths. Attributes in the Layer Attribute and Layer Region sections
are text fields that are used to describe the layer or annotation. For the annotation window, both the Layers pane and the Layer Region have a Description attribute where comments regarding the slide layer or the annotation can be added. Further attributes can be added or deleted depending on the needs of the pathologist. The Layer Region provides additional details about the annotation, such as the length and area of the region covered by the general annotation tool, and displays their values in pixels. If the resolution of the image is known, then sizes will be displayed in microns rather than pixels. A Text attribute can also be added to the Layer Region section where further details about the annotations can be added. For example, if the object enclosed by the annotated area describes the cancer type ductal carcinoma in situ (DCIS), then the pathologist would input DCIS in the Text attribute dialog box. These annotations will allow prospective users to narrow their search in the corpora. The details of the annotations are written in the deidentified pathology reports that are also included in the dataset. Additional details, such as the nuclear grade of the cancer or whether it is benign or malignant, can be added to the Description attribute of the Layer Region section.

After an image has been annotated, an XML file is generated which contains the annotation details. For each annotation region, the XML file lists the vertices that define the annotated region along with the number of vertices associated with that region. The number of vertices varies depending on the shape and type of the annotation region. For example, a rectangle region generates four vertices, an ellipse region generates two vertices, and a free form region generates a number that depends on the complexity of the free form region created. It should be noted that in order for the annotation to appear on the image, the XML file must be present in the same folder as the original image and must have the same base name as the image.

Annotation is naturally a tedious process. According to federal regulations, the maximum number of slides that can be viewed and analyzed in an 8-h workday is 100, i.e., an approximate rate of 5 min per slide [64]. Hence, annotation of the proposed one million image corpus is a daunting task due to the large amount of labor involved (roughly 80,000 h at this rate). Additional methods are being explored to reduce this time and make the process more efficient. Unsupervised learning, which will be described in the next section, will ultimately play a key role in making effective use of this corpus. Exactly how we release the metadata and annotations associated with this corpus will be the subject of future community-wide discussions. We are still early in the process of deciding how to annotate the data, how to represent information in a way that is meaningful to researchers and clinicians, and how to release the data in some form that makes it easy to query.
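The XML files described above are straightforward to consume downstream. Below is a minimal reader; the tag and attribute names follow the Aperio ImageScope layout (Annotation, Region, and Vertex elements) as we understand it, and the filename is hypothetical.

```python
# A minimal reader for the ImageScope annotation XML described above;
# tag and attribute names follow the Aperio layout, and the filename
# is hypothetical.
import xml.etree.ElementTree as ET

tree = ET.parse("0s19_12345_0a001_0012345678_lvl0001_s000.xml")
for region in tree.getroot().iter("Region"):
    vertices = [
        (float(v.get("X")), float(v.get("Y")))
        for v in region.iter("Vertex")
    ]
    # A rectangle yields 4 vertices, an ellipse 2, and a free-form pen
    # region as many as its outline requires.
    print(region.get("Text"), len(vertices), "vertices")
```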

3.3 Deep Learning Experiments

The fundamental motivation behind developing the TUDP Corpus is to provide adequate amounts of data for training sophisticated deep learning systems. Most

research to date has been conducted on small datasets. For example, Cruz-Roa et al. [47] developed a system that can detect Invasive Ductal Carcinoma (IDC), which is the most common phenotypic subtype of breast cancer. The system uses a CNN trained on WSIs from a database of 162 images. The images were sampled using 100 × 100 pixel image patches and were manually annotated using a threshold to identify each patch as positive or negative. A three-layer CNN system (DLCNN) was constructed to classify these patches. A slight improvement over a baseline machine learning algorithm based on random forests (ML-RF) [65] was reported, as shown in Table 3.2. The F-ratio (F1) used in this table is computed as (2 × precision × recall)/(precision + recall). Balanced Accuracy (BAC) is measured as (sensitivity + specificity)/2.

Table 3.2 A comparison of RF to CNN for IDC

  Metric                    Machine learning (RF) (%)   Deep learning (CNN) (%)
  F-ratio (F1)              71.80                       67.53
  Balanced Accuracy (BAC)   84.23                       78.74

Lung cancer can be detected and treated if small and potentially cancerous lung nodules are detected early. Hua et al. [48] developed two deep learning systems that can classify pulmonary nodules. Chest computed tomography (CT) images were analyzed from a publicly available corpus known as the Lung Image Database Consortium and Image Database Resource Initiative (IDRI) dataset [66], which consists of 1018 thoracic CT scans. The first system was based on a Deep Belief Network (DBN) while the second used multiple CNN layers. Both systems outperformed a manual identification procedure. In other related work, Sirinukunwattana et al. [49] introduced a variant of the CNN, named a spatially constrained CNN (SC-CNN), that achieved better results than existing CNNs in classifying tumor nuclei in routine colon cancer for a dataset consisting of 100 H&E stained images. Cireşan et al. [46] developed a method for mitosis detection in breast cancer using 50 images from the MITOS dataset [67]. However, these studies were conducted on a small number of images and did not contain the variety of image types included in TUDP.

For our initial baseline system development, we selected a preliminary dataset of 1000 pathology slides averaging around 5000 × 2500 pixels in size. These slides were selected based on an initial screening process which determined whether a mark from a stray marker (a grease pen) existed on the pathology slide. These visible marks served as the event to be classified. An example of a typical image is shown in Fig. 3.12. Every 50 × 50 pixel patch was annotated by a team of nine annotators. Each patch was labeled as containing a mark if at least 2% of the patch contained a mark artifact. Of these 1000 pathology slides, 500 slides were identified as having one or more marks while the other 500 did not have any marks. A control set of ten marker slides was annotated by all nine annotators throughout the annotation process to track their corresponding inter-rater agreements. Agreements between annotators were calculated as the number of matched frames divided
by the total number of identified frames. This creates an accuracy calculation that penalizes mismatched annotated frames between annotators. Overall, the average inter-rater agreement between all annotators was 91.1% (at the patch level) with each annotator labeling an average of 51.2 cells per pathology slide.

This artifact corpus was used in all experiments described in this section. It was designed to allow us to quickly tune key system parameters such as the frame and window sizes, model complexity, and learning rates. It also allowed us to experiment with postprocessing strategies to convert frame-level classifications into an overall image classification.

Fig. 3.12 A typical example from the artifact corpus showing a grease pen mark
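One such postprocessing strategy, the confidence-count heuristic developed in Sect. 3.3.1 and tuned in Sect. 3.3.2, reduces to a few lines. In this sketch (our illustration), frame_confidences is a hypothetical list of per-frame posteriors for the mark class, and the defaults reflect the best operating point reported later (N = 6, C = 0.95).

```python
# Frame-to-slide postprocessing: call the slide "mark" if at least
# n_frames frames carry the mark class with confidence >= c.
# frame_confidences is a hypothetical list of per-frame posteriors.
def classify_slide(frame_confidences, n_frames=6, c=0.95):
    hits = sum(1 for p in frame_confidences if p >= c)
    return "mark" if hits >= n_frames else "no-mark"

print(classify_slide([0.99, 0.97, 0.98, 0.96, 0.99, 0.97, 0.10]))  # mark
print(classify_slide([0.99, 0.40, 0.20]))                          # no-mark
```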

3.3.1 Baseline System Architecture

An overview of the baseline system architecture is shown in Fig. 3.13. We leveraged our work on EEGs [39]. Perhaps the single biggest challenge in processing these high-resolution images is that they must be segmented into small sections because the entire image will not fit into memory. For example, an NVIDIA GTX 1070 with 8 GB of RAM can hold batches of a slide viewed at level 1 magnification, which is approximately 5K × 5K in size. However, as the magnification level increases, the RAM requirements become much more severe, with slides averaging around 50K × 50K in size. Our cluster currently has 16 GPUs with a combined memory of 184 GB. For the baseline system, we trained each model on a single NVIDIA RTX 2080 GPU with 8 GB of memory. For ultra-high-resolution images, we plan on training the models in parallel using multiple GPUs.

Fig. 3.13 An overview of our baseline deep learning architecture

Though it is also possible to convert the entire image to a feature vector, it is more common today to process pixels directly and let the initial levels of the deep learning system discover the best way to convert pixels into features. The segmentation of these images involves using a frame and window size to partition the images. Frames are non-overlapping regions of the image that determine the number of computations to be performed. Windows are regions that are larger than or equal to the size of a frame and that determine the amount of data to be used for each analysis. The pixel dimensions of a frame (width × height), which we denote as F, and of a window, W, are independent parameters that must be optimized through a grid search process. This is described in Sect. 3.3.2.

We typically scan images sequentially in a left-to-right fashion. However, deep learning systems often prefer inputs, in this case windows, to be randomly selected from the entire corpus. Care must be taken to balance I/O and processing time so that computational issues do not prevent large-scale experiments from completing in a reasonable amount of time. We do not find a significant difference in performance between randomly selecting windows and sequentially scanning images to compose a batch of data to be processed. However, sequential scanning of the images significantly reduces I/O requirements.

In the architecture shown in Fig. 3.13, there are a total of five convolutional layers along with a fully connected hidden layer and an output layer. All of the images are preprocessed to strip the alpha channel from the RGBA (red, green, blue, and alpha) representation that is common in standard SVS files. This reduces the computational complexity slightly and does not impact performance because the alpha channel is always opaque. A batch size of 1500, which is equivalent to 1500 windows, is passed through the network. The first layer of the model consists of 32 kernels of size (3, 3) and a stride length of (2, 2). Having a larger stride length prevents the first layer from becoming unnecessarily deep, as this would drastically increase the memory requirements and processing time for the system. This layer outputs 32 filters for each of the 1500 window inputs. Since a stride length of (2, 2) was chosen,
each filter will have a size of 49 × 49 (the result of convolving a 3 × 3 kernel over a 100 × 100 window with a stride of 2). Next, a max-pooling layer is applied and the batch is normalized, essentially halving the dimensions of the filters from the previous layer to 24 × 24 by choosing the maximum value in each 2 × 2 region of a filter. This three-step process, a convolutional layer followed by a max-pooling layer followed by a batch normalization layer, is repeated for the first two layers. The remaining layers use a convolutional layer followed by a batch normalization layer. All of the convolutional layers use a Rectified Linear Unit (ReLU) activation function.

The convolutional layers of the network are responsible for feature extraction. This is done by convolving multiple sliding kernels to create a feature map. The outputs of one layer are used as the inputs to the next in a feed-forward fashion. Max-pooling layers help reduce the dimensionality of the feature maps, mitigate the computational complexity, and reduce the number of system parameters. A dropout layer was added that randomly drops 20% of the nodes in the system to prevent overfitting. Finally, the inputs are flattened and are propagated through the hidden layer to the final output layer, which classifies the frame as a mark or a no-mark class.

Batch normalization layers allow neural networks to converge quickly. This is accomplished by normalizing each batch by both the mean and variance, thereby keeping the mean activation close to zero and the standard deviation of the activations close to one. This prevents large variances for inputs at each layer of the system, greatly improving the time it takes for the network to converge and simultaneously improving the system's performance [68]. To prevent the network from diverging, a low learning rate of 0.005 was used for an Adam optimizer. Adam was selected due to its memory and computational efficiency [69]. As suggested by the authors, and based on our previous experience with EEG signals, β1, β2, and ε were set to 0.9, 0.999, and 1e-08, respectively. A softmax activation function was used to convert the logits into probabilities that sum to one. Since we ultimately want to be able to classify multiple types of events, the softmax function is the most suitable conversion function. We chose to combine the softmax activation with a categorical cross-entropy loss for the same reason.

Because we are initially scoring at the frame level, we must convert these frame scores to an overall image score. In our initial baseline system, which is intended to establish a simple and replicable baseline, we used a heuristic approach that involves finding the optimal number of frames (N) containing the class with a specific confidence level (C). If an image has at least N frames classified with a confidence of at least C, the image is identified as belonging to that class. The parameters N and C were experimentally optimized as explained in Sect. 3.3.2.

Pathology images tend to contain a large number of frames classified as null, meaning the frame does not contain an event of interest (often referred to as the background class). This type of unbalanced dataset creates problems for machine learning algorithms since the obvious way to maximize performance is to always guess the most likely class assignment. This is often referred to as Bayesian guessing based only on priors, or intelligent guessing. In practice, when priors
are extremely unbalanced, it is difficult to train a machine learning system that outperforms Bayesian guessing for a variety of very practical reasons. Special care must be taken to train the system to avoid this type of degenerate behavior. To detect this pattern of guessing, the baseline system outputs a confidence for its hypothesis for every frame of the slide that can later be reviewed by the system developer. In addition, each slide and its corresponding annotation and hypothesis files can be loaded into the annotation viewing tool we developed. The frames that are classified as marked are highlighted in the GUI of the tool for easy detection and analysis of anomalies in the classification system. For both null and non-null classes, the hypothesis files contain the confidence of each prediction so pathologists can inspect the model's output.

To combat the tendency to guess the majority class, class weights are created for the dataset. Class weights are a mapping of each class to a corresponding assigned weight value [70, 71]. This value represents the factor by which to penalize the loss function when the model misclassifies a given class. To fairly assign this value, the largest recorded occurrence count among all the classes is divided by each class's occurrence count in the dataset. For example, if the dataset contained three classes with the occurrences {X0: 10, X1: 100, X2: 20}, then the class weights would be inversely proportional to the frequency of occurrence: {W0: 10, W1: 1, W2: 5}. This method of weighting the loss function prevents the model from ignoring under-represented classes.

Since this pilot corpus has a small number of images, we used a cross-validation approach to build and evaluate models. We created a fivefold validation test by randomly splitting the data into five training and five evaluation lists containing 800 and 200 pathology slides, respectively. Five independent models using the same algorithm were trained on each of the five training lists and performance was averaged across these sets. Because we did not have a large amount of data for these pilot experiments, we did not create a held-out set. An independent evaluation would be required to go further in this process. However, we conducted these tuning experiments as a way to determine some important basic parameters for the system. Once we have a model capable of accurately classifying real pathology data, we will release the software as open source from the project web site (https://www.isip.piconepress.com/projects/nsf_dpath).

The models were trained on an NVIDIA RTX 2080 GPU with 8 GB of memory. Training each model for 15 epochs took an average of 7 min per epoch. The average peak memory usage of the network was 7.3 GB. At the end of an epoch, the model outputs the weights of the system. Since there is no certain way to know which set of weights will perform best on the evaluation set, we decoded and evaluated each set of weights generated for every model. This also allows us to see if overfitting occurs and approximately at what epoch it began.

Scoring of the deep learning system was done using both a frame-level and a whole-slide methodology. Each of these techniques has its strengths as a performance metric. Frame-level scoring of the model allows for a strong visualization and understanding of the features that the system is learning. On the other hand, whole-slide scoring provides a more generalized view of the model's performance over the dataset. This, in turn, presents an opportunity to understand if the system
is struggling during classification and to identify the combinations of features and images which pose challenges to the system.

We adapted our standard open source scoring approach that we have used across a wide range of applications [72]. This scoring approach accounts for spatial alignments between the hypothesis and reference annotations and computes a large number of performance metrics. It should be noted that we were not able to get our best hybrid system for EEGs, which is based on a combination of CNNs and long short-term memory (LSTM) networks, to converge on this small training dataset. We expect that as we annotate more data that includes clinically relevant artifacts, we will need to revisit this hybrid architecture.
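For readers who want to reproduce the baseline while our software release is pending, the sketch below expresses the architecture described above in the Keras API. The layer sequence, kernel size, first-layer stride, 20% dropout, softmax output, and Adam settings follow the text; the filter counts of the deeper layers, their default strides, and the hidden layer width are our assumptions, as the text does not specify them.

```python
# A sketch of the baseline network described above, written with the
# Keras API. Deeper-layer filter counts, their (default) strides, and
# the hidden layer width are assumptions; everything else follows the
# description in the text.
from tensorflow.keras import layers, models, optimizers

def conv_block(x, filters, strides=(1, 1), pool=False):
    x = layers.Conv2D(filters, (3, 3), strides=strides, activation="relu")(x)
    if pool:
        x = layers.MaxPooling2D((2, 2))(x)
    return layers.BatchNormalization()(x)

inputs = layers.Input(shape=(100, 100, 3))    # RGB windows, alpha stripped
x = conv_block(inputs, 32, strides=(2, 2), pool=True)  # -> 49x49 -> 24x24
x = conv_block(x, 64, pool=True)              # second conv/pool/norm block
x = conv_block(x, 128)                        # remaining layers: conv + norm
x = conv_block(x, 128)
x = conv_block(x, 256)
x = layers.Dropout(0.2)(x)                    # drop 20% of the nodes
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)   # hidden layer (assumed width)
outputs = layers.Dense(2, activation="softmax")(x)  # mark / no-mark

model = models.Model(inputs, outputs)
model.compile(
    optimizer=optimizers.Adam(learning_rate=0.005, beta_1=0.9,
                              beta_2=0.999, epsilon=1e-8),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Class weights as described above: the largest class count divided by
# each class's count; the occurrence counts here are hypothetical.
counts = {0: 100, 1: 10}
class_weight = {c: max(counts.values()) / n for c, n in counts.items()}
# model.fit(windows, labels, batch_size=1500, epochs=15,
#           class_weight=class_weight)
```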

3.3.2 Experimental Results For a system to function as intended, the frame and window sizes are perhaps the most important parameters that must be optimized. Windows that are larger than the frame allow the model to gather contextual information of a given frame. However, the optimal values of the frame and window are a function of a number of operational conditions such as the typical size of the artifact to be detected. Furthermore, the computational requirements for the system are directly proportional to these parameters. To optimize these parameters, we performed an extensive sweep of reasonable values for these parameters. To find the optimal thresholds of these parameters, we split the data into 800 slides for training and 200 slides for evaluation for a held-out evaluation set. We used a cross-validation process on the training set to adjust parameters. We randomly selected five partitions of 700 training slides (train) and 100 development set slides (dev_test) out of the 800 training slides. This was done to ensure that none of the evaluation data was used to determine the best parameters while also demonstrating performance on a held-out set. A summary of the results for this frame-level classification experiment is shown in Table 3.3. We considered frame and window sizes of 50 × 50, 100 × 100, 200 × 200, and 400 × 400 pixels based on the results of a number of pilot experiments on the corpus. We have chosen sensitivity as our performance measure as opposed to error rate. For a few combinations, the system could not detect the mark class effectively, but it performed well on the null class. In such conditions, error rate fails to provide the appropriate insight into the system’s performance. Fortunately, for reasonable operating points, performance based on sensitivity tracks error rate fairly well. The best performing system had a frame size, F, of 50 × 50 pixels and a window size, W, of 100 × 100 pixels. This combination achieved 99.40% sensitivity for the mark class in the training set, 99.48% in the dev test set, and 99.82% in the evaluation set. The worst performing system had F = 50 × 50 pixels and W = 400 × 400 pixels, achieving 0% sensitivity on the mark class for all datasets. In general, the systems with smaller frame sizes outperformed those with larger frame


Table 3.3 Performance (sensitivity of the mark and null classes, in %) as a function of the frame and window sizes for frame-level classification

Frame      Window     | Train: Mark / Null | Dev test: Mark / Null | Eval: Mark / Null
50 × 50    50 × 50    |  96.77 /  99.78    |  98.09 /  99.37       |  97.75 /  99.88
50 × 50    100 × 100  |  99.40 /  99.61    |  99.48 /  99.45       |  99.82 /  99.75
50 × 50    200 × 200  |  98.53 /  99.48    |  99.52 /  99.45       |  99.21 /  99.67
50 × 50    400 × 400  |   0.00 / 100.00    |   0.00 / 100.00       |   0.00 / 100.00
100 × 100  100 × 100  |  97.11 /  99.59    |  99.12 /  99.41       |  98.02 /  99.78
100 × 100  200 × 200  |  98.97 /  99.57    |  99.60 /  99.42       |  99.45 /  99.78
100 × 100  400 × 400  |  99.25 /  98.07    |  99.68 /  96.53       |  99.94 /  98.55
200 × 200  200 × 200  |  90.56 /  99.47    |  95.93 /  98.62       |  91.18 /  99.61
200 × 200  400 × 400  |  98.99 /  99.25    |  99.29 /  99.06       |  99.31 /  99.41
400 × 400  400 × 400  |  92.65 /  99.01    |  95.97 /  98.62       |  91.54 /  99.17

This is partly due to the relationship between the size of pen marks and the number of pixels in an image. Since computation time varies quadratically with the frame and window sizes, a reasonable tradeoff between computational complexity and performance is a frame and window combination of 100 × 100 pixels and 200 × 200 pixels, respectively. However, for our subsequent experiments we used the combination of F = 50 × 50 and W = 100 × 100, since computational efficiency is not a major issue for this limited dataset.

Next, we conducted a parameter sweep over the threshold parameters N (1, 2, 4, 6, 8, 10) and C (0.80, 0.85, 0.90, 0.95) for all combinations of F and W for the classification of whole slide images. We used the dev test set only for this purpose. A few selected results of this experiment are shown in Table 3.4. We found that N = 6 and C = 0.95 yielded the best results for F = 50 × 50 and W = 100 × 100; this combination of parameters achieved 100% sensitivity for the mark class. Performance began to decrease sharply as we increased N and decreased C for this combination. Again, the optimal value of N depends somewhat on the size of the mark artifacts relative to the size of the frame and window.

In Table 3.4 we also verify that the overall system performance is optimal by evaluating a few combinations of F, W, N, and C in the neighborhood of our optimal operating point. Since the relationships between these parameters are nonlinear, there is no guarantee that a sequential optimization process will find the globally best operating point. We explored two other combinations: F = 100 × 100, W = 100 × 100 with N = 4 and C = 0.85, and F = 200 × 200, W = 200 × 200 with N = 2 and C = 0.80. We also see that F = 50 × 50, W = 50 × 50 with N = 2 and C = 0.80 and F = 100 × 100, W = 200 × 200 with N = 4 and C = 0.90 yield a 98.0% sensitivity for the mark class. We then postprocessed the whole slide images in all datasets using the highlighted parameters in Table 3.4.


Table 3.4 Performance (sensitivity of the mark and null classes, in %) on the dev test set for selected combinations of the tunable parameters for whole slide image classification

Frame      Window     N    C    | Mark / Null
50 × 50    50 × 50    4    0.95 |  98.0 / 100.0
                      6    0.90 |  96.0 / 100.0
                      6    0.95 |  98.0 / 100.0
                      8    0.95 |  94.0 / 100.0
                      10   0.90 |  92.0 / 100.0
50 × 50    100 × 100  4    0.95 | 100.0 /  96.0
                      6    0.90 | 100.0 /  98.0
                      6    0.95 | 100.0 / 100.0
                      8    0.95 |  98.0 /  98.0
                      10   0.90 |  96.0 /  98.0
50 × 50    200 × 200  4    0.95 | 100.0 /  98.0
                      6    0.90 | 100.0 /  98.0
                      6    0.95 | 100.0 /  98.0
                      8    0.95 |  96.0 / 100.0
                      10   0.90 | 100.0 /  96.0
50 × 50    400 × 400  10   0.95 |   0.0 / 100.0
100 × 100  100 × 100  2    0.95 | 100.0 / 100.0
                      4    0.85 | 100.0 / 100.0
                      6    0.90 |  96.0 / 100.0
100 × 100  200 × 200  2    0.95 |  96.0 / 100.0
                      4    0.90 |  98.0 / 100.0
                      6    0.90 |  94.0 / 100.0
                      8    0.90 |   0.0 / 100.0
100 × 100  400 × 400  2    0.80 |  98.0 / 100.0
                      4    0.85 |  88.0 / 100.0
                      6    0.90 |  76.0 / 100.0
200 × 200  200 × 200  2    0.80 | 100.0 / 100.0
                      6    0.85 |  74.0 / 100.0
                      8    0.90 |  56.0 / 100.0
200 × 200  400 × 400  2    0.85 |  92.0 / 100.0
400 × 400  400 × 400  4    0.90 |  60.0 / 100.0

Considering the data shown in Table 3.5 and the computational complexity, we find that the combination F = 50 × 50, W = 100 × 100 with N = 6 and C = 0.95 is indeed the best combination of parameters for the overall dataset. One of the other candidates for the optimal combination, F = 200 × 200, W = 200 × 200 with N = 2 and C = 0.80, achieved the same results as our best combination. The other combination, F = 100 × 100, W = 100 × 100 with N = 4 and C = 0.85, obtained only 96.9% sensitivity on the training set and 94.0% on the evaluation set.
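The whole-slide decision rule induced by the threshold parameters can be summarized in a few lines of code. The following is a minimal sketch of our reading of the rule (the function name and input format are our own; the released system may differ): a slide is labeled as containing a mark if at least N frames are classified as marked with confidence at or above C.

```python
def classify_slide(frame_preds, n_min=6, conf_min=0.95):
    """Postprocess frame-level hypotheses into a whole-slide decision.

    frame_preds: iterable of (label, confidence) pairs, one per frame,
                 where label is 'mark' or 'null'.
    Returns 'mark' if at least n_min frames are confidently marked.
    """
    confident_marks = sum(
        1 for label, conf in frame_preds
        if label == 'mark' and conf >= conf_min
    )
    return 'mark' if confident_marks >= n_min else 'null'

# Example: 5 confidently marked frames is below N = 6, so the slide is 'null'.
preds = [('mark', 0.99)] * 5 + [('null', 0.97)] * 395
print(classify_slide(preds))  # null
```

This rule also explains the behavior noted in the error analysis below: slides with fewer than N marked frames are rejected regardless of how confident the individual frame predictions are.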


Table 3.5 Performance (sensitivity of the mark and null classes, in %) as a function of the frame and window sizes for whole slide image classification

Frame      Window     | Train: Mark / Null | Dev test: Mark / Null | Eval: Mark / Null
50 × 50    50 × 50    |  98.9 /  99.1      | 100.0 /  98.0         |  99.0 / 100.0
50 × 50    100 × 100  |  98.9 /  98.9      | 100.0 / 100.0         |  99.0 / 100.0
50 × 50    200 × 200  |  99.4 /  98.9      |  98.0 / 100.0         |  99.0 / 100.0
50 × 50    400 × 400  |   0.0 / 100.0      |   0.0 / 100.0         |   0.0 / 100.0
100 × 100  100 × 100  |  96.9 /  98.6      | 100.0 / 100.0         |  94.0 / 100.0
100 × 100  200 × 200  |  98.9 /  98.3      | 100.0 / 100.0         |  99.0 /  99.0
100 × 100  400 × 400  |  96.0 /  96.3      |  98.0 / 100.0         |  98.0 /  99.0
200 × 200  200 × 200  |  96.6 /  99.1      |  98.0 /  98.0         |  96.0 / 100.0
200 × 200  400 × 400  |  99.1 /  98.3      | 100.0 / 100.0         |  99.0 /  98.0
400 × 400  400 × 400  |  91.7 /  98.9      |  90.0 /  94.0         |  92.0 / 100.0

Table 3.6 Performance (in %) for the frame-level classification on the cross-validation sets

Set   | Train: Mark / Null | Dev test: Mark / Null | Eval: Mark / Null
T1    | 99.40 / 99.61      | 99.48 / 99.45         | 99.82 / 99.75
T2    | 98.96 / 99.52      | 99.39 / 99.58         | 99.37 / 99.74
T3    | 99.02 / 99.60      | 98.81 / 99.78         | 99.23 / 99.77
T4    | 99.06 / 99.69      | 99.02 / 99.51         | 99.33 / 99.67
T5    | 98.89 / 99.64      | 99.29 / 99.46         | 99.11 / 99.72
Mean  | 99.07 / 99.61      | 99.20 / 99.56         | 99.37 / 99.73

Finally, in Tables 3.6 and 3.7 we show performance as a function of the cross-validation set for frame-level and whole slide image classification, respectively. This provides some insight into the variance of the performance. The mean sensitivity for the mark class was 99.3% for WSI classification. On the evaluation set, our system obtained 99.4% sensitivity for the mark class and 99.0% for the null class. For the same set, the unprocessed frame-level predictions had a mean sensitivity of 99.37% for the mark class and 99.73% for the null class.

Error analysis shows that the postprocessor, as expected, rejected images with very small marks. Since the minimum number of classified frames, N, was set to 6, the system ignored images with fewer than six marked frames even when the classification had high confidence. This is essentially a tradeoff between accuracy and false alarms due to small artifacts. The model also failed to distinguish marks of different colors: almost all the marks in the dataset are blue, green, or black, and the model failed to identify the only image in the dataset with a red mark. This is something that can likely be fixed with a significantly larger training corpus.

Table 3.7 Performance (in %) for the whole slide image classification on the cross-validation sets

Set   | Train: Mark / Null | Dev test: Mark / Null | Eval: Mark / Null
T1    | 98.9 / 98.9        | 100.0 / 100.0         |  99.0 / 100.0
T2    | 99.1 / 97.1        | 100.0 / 100.0         | 100.0 /  99.0
T3    | 98.3 / 99.7        | 100.0 / 100.0         |  99.0 /  98.0
T4    | 99.7 / 99.7        | 100.0 / 100.0         |  99.0 /  99.0
T5    | 99.1 / 99.1        |  98.0 / 100.0         | 100.0 /  99.0
Mean  | 99.0 / 98.9        |  99.6 / 100.0         |  99.4 /  99.0

3.4 Summary

Histology slides are required to be kept for a minimum of 10 years after their date of examination, as stated by the Clinical Laboratory Improvement Amendments [73]. Hence, these slide archives are substantial and constitute an extremely valuable resource for research and technology development. Digitizing pathology slides, annotating them for clinically relevant events, and organizing these data in a database that includes patient medical history is clearly beneficial. In this chapter, we have introduced an open source corpus being developed to enable research and clinical use of pathology data. The physical transfer of slides between sites in order to share clinical cases among pathologists is a time consuming and expensive endeavor that impedes the use of the slides for research and clinical decision support.

Using a single Leica Biosystems Aperio AT2 scanner, we are able to scan about 2000 slides per week with a small team of undergraduate student workers. On an evening shift, workers prep the slides by cleaning them with lens paper or alcohol prep pads to remove fingerprints, dust particles, or stain marks on the coverslip. The slides are trimmed to remove overhanging labels or protruding parts of the coverslip, then loaded onto racks and placed, two racks at a time, into the AT2 scanner carousel. To streamline the pre-scan snapshot process, we snapshot one rack while loading the other. This saves time because it takes around 40 min to load all eight racks, around 70 min to snapshot all racks, and around 20 min to adjust the snapshots. Workers on the morning shift review the status of the scanned slides, adjust focus points for those slides that failed, and re-scan them. The slides are scanned to disk, renamed, and then added to the eSM database.

We have currently scanned over 20,000 slides. The statistics for this pilot corpus are summarized in Table 3.8. The majority of the slides fall under the breast, gastrointestinal, and urinary prostate cases. Among them, urinary prostate cases comprise the highest slide count (5254 slides), followed by breast cases (3747 slides) and gastrointestinal cases (2375 slides). To date, we have scanned slides from 1900 patients and 2125 cases, with an average of 10.94 slides per patient and 9.78 slides per case. Although the WSIs usually consist of a single specimen, there have been cases where two, three, and even six specimens were observed in a single image.


Table 3.8 Preliminary statistics for the pilot corpus

Tissue type       Patients  Cases  Slides  Avg. slides/patient  Avg. slides/case
Breast                 303    398    3747                12.37              9.41
Gastrointestinal       148    254    2375                16.05              9.35
Gynecology              35     34     581                16.60             17.09
Head and neck            6      6      83                13.83             13.83
Lymph nodes             33     33    1293                39.18             39.18
Pulmonary               10     10     128                12.80             12.80
Soft tissue             10     10      51                 5.10              5.10
Spinal epidural          1      4      67                67.00             16.75
Urinary prostate       188    210    5254                27.95             25.02
Miscellaneous         1166   1166    6924                 5.94              5.94
Total                 1900   2125   20789                10.94              9.78

The images of the scanned specimens usually occupy around 200 MB of disk space, but depending on the complexity of the specimen this can increase to 1 GB. Each slide image has a resolution of 0.502 μm per pixel and is stored in a single-file pyramidal tiled TIFF format, where the tile height and width are 240 pixels, the image height varies from 20,000 to 50,000 pixels, and the width varies from 40,000 to 100,000 pixels.

We have also presented a baseline deep learning system to validate the data being collected and to ensure that the annotation process serves the needs of researchers. This system classifies whole slide pathology images by decomposing these ultra-high-resolution images into a sequence of frames and windows. It performs frame-level classification and then postprocesses those frame hypotheses into an overall image hypothesis. We achieved a mean sensitivity of 99.4% on the classification of slides containing a pen mark artifact using a frame/window combination of 50 × 50 pixels and 100 × 100 pixels, respectively (the corresponding error rate was 0.08%). We have optimized a number of important runtime parameters of this system, including I/O and memory usage, so that it will be feasible to process large numbers of detailed slides.

Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. To reach this very ambitious goal, we hope to incorporate data from other hospitals and scanning equipment; however, it has been extremely difficult to find sites willing to release unencumbered data. Those interested in the corpus should join our listserv at www.nedcdata.org to be kept informed about the status of the project.

Acknowledgments This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Open source libraries that were used to develop the deep learning model presented in this chapter include: Shapely v1.6.4, OpenSlide v1.1.1, the Abstract Syntax Tree library, OpenCV-Python v3.4.1, NumPy v1.14.2, PIL v4.2.1, TensorFlow v1.9.0, and Keras v2.2.4.
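Since OpenSlide appears in the list above, the following is a brief, hypothetical sketch of how a region of one of these pyramidal TIFFs can be read with it (the file name and coordinates are invented for illustration):

```python
import openslide

# Open a scanned whole slide image (hypothetical file name).
slide = openslide.OpenSlide("example_slide.tiff")
print(slide.dimensions)   # full-resolution (width, height) in pixels
print(slide.level_count)  # number of pyramid levels

# Read a 1000 x 1000 region at full resolution (level 0), starting at
# location (20000, 20000) in level-0 coordinates; returns a PIL RGBA image.
region = slide.read_region((20000, 20000), 0, (1000, 1000))
region.convert("RGB").save("region.png")
slide.close()
```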

References 1. Sattar, H. (2017). Fundamentals of pathology: Medical course and step 1 review (8th ed.). Chicago, IL: Pathoma, LLC.. Retrieved from https://www.pathoma.com/fundamentals-ofpathology. 2. Rolls, G. (2018). An introduction to specimen preparation. Retrieved from https:// www.leicabiosystems.com/pathologyleaders/an-introduction-to-specimen-preparation/. 3. Anderson, J. (2019). An introduction to routine and special staining. Retrieved from https:// www.leicabiosystems.com/pathologyleaders/an-introduction-to-routine-and-special-staining/. 4. American Cancer Society. (2019). What happens to biopsy and cytology specimens? Retrieved August 19, 2019, from https://www.cancer.org/treatment/understanding-your-diagnosis/tests/ testing-biopsy-and-cytology-specimens-for-cancer/what-happens-to-specimens.html. 5. Eiseman, E., & Haga, S. (2000). In E. Eiseman (Ed.) A handbook of human tissue sources: A national resource of human tissue samples (1st ed.). Washington, DC: Rand Publishing. Retrieved from https://www.rand.org/pubs/monograph_reports/MR954.html. 6. Kapila, S. N., Boaz, K., & Natarajan, S. (2016). The post-analytical phase of histopathology practice: Storage, retention and use of human tissue specimens. International Journal of Applied & Basic Medical Research, 6(1), 3–7. https://doi.org/10.4103/2229-516X.173982. 7. Hallworth, M. J. (2011). The ‘70% claim’: what is the evidence base? Annals of Clinical Biochemistry: International Journal of Laboratory Medicine, 48(6), 487–488. https://doi.org/ 10.1258/acb.2011.011177. 8. Jhala, N. (2017). Digital pathology: Advancing frontiers. In IEEE Signal Processing in Medicine and Biology Symposium (SPMB), Philadelphia, PA. Retrieved from https:// ieeexplore.ieee.org/document/8257013/. 9. Barry, M. J., Kaufman, D. S., & Wu, C.-L. (2008). Case 15-2008 :2008: A 55-year-old man with an elevated prostate-specific antigen level and early-stage prostate cancer. The New England Journal of Medicine, 358(20), 2161–2168. https://doi.org/10.1056/NEJMcpc0707057. 10. Bongaerts, O., Clevers, C., Debets, M., Paffen, D., Senden, L., Rijks, K., et al. (2018). Conventional microscopical versus digital whole-slide imaging-based diagnosis of thin-layer cervical specimens: A validation study. Journal of Pathology Informatics, 9(1), 29–37. https:// doi.org/10.4103/jpi.jpi_28_18. 11. The Medical Futurist. (2018). The digital future of pathology. Retrieved August 19, 2019, from https://medicalfuturist.com/digital-future-pathology. 12. Stathonikos, N., Veta, M., Huisman, A., & van Diest, P. J. (2013). Going fully digital: Perspective of a Dutch academic pathology lab. Journal of Pathology Informatics, 4, 15. https:/ /doi.org/10.4103/2153-3539.114206. 13. Leica Biosystems. (2019). Aperio AT2 – High volume, digital whole slide scanning. Retrieved from https://www.leicabiosystems.com/digital-pathology/scan/aperio-at2/. 14. Philips. (2019). Clinical digital pathology system. Retrieved August 19, 2019, from https:// www.usa.philips.com/healthcare/resources/landing/philips-intellisite-pathology-solution. 15. Hanna, M. G., Monaco, S. E., Cuda, J., Xing, J., Ahmed, I., & Pantanowitz, L. (2017). Comparison of glass slides and various digital-slide modalities for cytopathology screening and interpretation. Cancer Cytopathology, 125(9), 701–709. https://doi.org/10.1002/cncy.21880. 16. Joint Photographic Experts Group. (2019). Overview of JPEG. Retrieved from https://jpeg.org/ jpeg/.


17. Campbell, C., Mecca, N., Duong, T., Obeid, I., & Picone, J. (2018). Expanding an HPC cluster to support the computational demands of digital pathology. In I. Obeid & J. Picone (Eds.), IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–2). Philadelphia, PA: IEEE. Retrieved from https://ieeexplore.ieee.org/document/8615614. 18. Mahar, J. H., Rosencrance, J. G., & Rasmussen, P. A. (2018). Telemedicine: Past, present, and future. Cleveland Clinic Journal of Medicine, 85(12), 938–942. Retrieved from https://www.mdedge.com/ccjm/article/189759/practice-management/telemedicine-pastpresent-and-future. 19. Beam, A., & Kohane, I. S. (2016). Translating artificial intelligence into clinical care. JAMA, 316(22), 2368–2369. https://doi.org/10.1001/jama.2016.17217. 20. Hamilton, P. W., Bankhead, P., Wang, Y., Hutchinson, R., Kieran, D., McArt, D. G., et al. (2014). Digital pathology and image analysis in tissue biomarker research. Methods, 70(1), 59–73. https://doi.org/10.1016/j.ymeth.2014.06.015. 21. Janowczyk, A., & Madabhushi, A. (2016). Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics, 7. Retrieved from http://www.jpathinformatics.org/text.asp?2016/7/1/29/186902. 22. Bauer, D. R., Otter, M., & Chafin, D. R. (2018). A new paradigm for tissue diagnostics: Tools and techniques to standardize tissue collection, transport, and fixation. Current Pathobiology Reports, 6(2), 135–143. https://doi.org/10.1007/s40139-018-0170-1. 23. Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., & Fotiadis, D. I. (2014). Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13, 8–17. https://doi.org/10.1016/j.csbj.2014.11.005. 24. Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., et al. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42(December 2012), 60–88. https://doi.org/10.1016/j.media.2017.07.005. 25. Barker, J., Hoogi, A., Depeursinge, A., & Rubin, D. (2016). Automated classification of brain tumor type in whole-slide digital pathology images using local representative tiles. Medical Image Analysis, 30(1), 60–71. https://doi.org/10.1016/j.media.2015.12.002. 26. Gleason, D. F. (1992). Histologic grading of prostate cancer: a perspective. Human Pathology, 23(3), 273–279. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/1555838. 27. Gordetsky, J., & Epstein, J. (2016). Grading of prostatic adenocarcinoma: current state and prognostic implications. Diagnostic Pathology, 11, 25. https://doi.org/10.1186/s13000-0160478-2. 28. He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. https://doi.org/10.1109/TKDE.2008.239. 29. Picone, J., Farkas, T., Obeid, I., & Persidsky, Y. (2017). MRI: High performance digital pathology using big data and machine learning. Major Research Instrumentation (MRI), Division of Computer and Network Systems, January 11, 2017. Retrieved from https:// www.isip.piconepress.com/proposals/2017/nsf/mri/. 30. Harabagiu, S., Picone, J., & Moldovan, D. (2002). Voice activated question answering. In Proceedings of the International Conference on Computational Linguistics, Taipei, Taiwan (pp. 1–7). Retrieved from http://www.isip.piconepress.com/publications/conference_proceedings/ 2002/coling/vaqa/. 31. Capp, N., Campbell, C., Elseify, T., Obeid, I., & Picone, J. (2018). 
Optimizing EEG visualization through remote data retrieval. In IEEE Signal Processing in Medicine and Biology Symposium, Philadelphia, PA (pp. 1–2). Retrieved from https://ieeexplore.ieee.org/document/ 8615613. 32. Picone, J., Obeid, I., & Harabagiu, S. (2018). Automated cohort retrieval from EEG medical records. In 26th Conference on Intelligent Systems for Molecular Biology, Chicago, IL (pp. 1– 7). Retrieved from https://www.isip.piconepress.com/publications/conference_presentations/ 2018/ismb/cohort_retrieval/. 33. Ross, M. H., & Pawlina, W. (2019). Histology: A text and atlas: with correlated cell and molecular biology (8th ed.). Philadelphia, PA: Wolters Kluwer Health. Retrieved from https:// www.lww.co.uk/9781975115364/histology-a-text-and-atlas/.


34. Gutman, D., Cobb, J., Somanna, D., Park, Y., Wang, F., Kurc, T., et al. (2013). Cancer Digital Slide Archive: an informatics resource to support integrated in silico analysis of TCGA pathology data. Journal of the American Medical Informatics Association, 20(6), 1091–1098. https://doi.org/10.1136/amiajnl-2012-001469. 35. Drissen, H. (2017). Philips and LabPON plan to create world’s largest pathology database of annotated tissue images for deep learning. Retrieved from https://www.philips.com/a-w/ about/news/archive/standard/news/press/2017/20170306-philips-and-labpon-plan-to-createworlds-largest-pathology-database-of-annotated-tissue-images-for-deep-learning.html. 36. Ferrell, S., von Weltin, E., Obeid, I., & Picone, J. (2018). Open source resources to advance EEG research. In IEEE Signal Processing in Medicine and Biology Symposium, Philadelphia, PA (pp. 1–3). Retrieved from https://ieeexplore.ieee.org/document/8615622. 37. Obeid, I., & Picone, J. (2018). The Temple University Hospital EEG Data Corpus. In Augmentation of brain function: Facts, fiction and controversy. Volume I: Brain-machine interfaces (1st ed., pp. 394–398). Lausanne, Switzerland: Frontiers Media S.A.. https://doi.org/ 10.3389/fnins.2016.001967. 38. de Freitas, N., Reed, S., & Vinyals, O. (2017). Deep learning: Practice and trends. In Neural Information Processing Systems, Long Beach, CA. Retrieved from https://nips.cc/Conferences/ 2017/Schedule?showEvent=8730. 39. Golmohammadi, M., Shah, V., Obeid, I., & Picone, J. (2019). Deep learning approaches for automatic analysis of EEGs. In S.-M. Chan & W. Pedrycz (Eds.), Deep learning: Algorithms and applications (1st ed.). New York, NY: Springer. Retrieved from https:// isip.piconepress.com/publications/book_sections/2019/springer/deep_learning/. 40. LeCun, Y., & Bengio, Y. (1998). Convolutional networks for images, speech, and time series. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 255–258). Cambridge, MA: MIT Press. Retrieved from http://dl.acm.org/citation.cfm?id=303568.303704. 41. Golmohammadi, M., Ziyabari, S., Shah, V., Obeid, I., & Picone, J. (2018). Deep architectures for spatio-temporal modeling: Automated seizure detection in scalp EEGs. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL (pp. 1–6). https://doi.org/10.1109/ICMLA.2018.00118. 42. Saon, G., Sercu, T., Rennie, S., & Kuo, H.-K. J. (2016). The IBM 2016 English Conversational Telephone Speech Recognition System. In Proceedings of the Annual Conference of the International Speech Communication Association (Vol. 08–12–Sept, pp. 7–11). https://doi.org/ 10.21437/Interspeech.2016-1460. 43. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA (pp. 1–14). https://doi.org/10.1016/j.infsof.2008.09.005. 44. Ghiasi, G., Lin, T.-Y., & Le, Q. V. (2018). DropBlock: A regularization method for convolutional networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 31 (pp. 10727– 10737). Red Hook, NY: Curran Associates, Inc.. Retrieved from http://papers.nips.cc/paper/ 8271-dropblock-a-regularization-method-for-convolutional-networks.pdf. 45. Chen, Y., Kalantidis, Y., Li, J., Yan, S., & Feng, J. (2018). Aˆ2-nets: Double attention networks. In S. Bengio, H. Wallach, H. Larochelle, K. 
Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31, pp. 352–361). Red Hook, NY: Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/7318-a2-nets-doubleattention-networks.pdf. 46. Cire¸san, D. C., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (2013). Mitosis detection in breast cancer histology images with deep neural networks. In International Conference on Medical Image Computing and Computer-assisted Intervention. Haspolat, Turkey: Signal Processing and Communications Applications Conference. https://doi.org/10.1109/ SIU.2013.6531502. 47. Cruz-Roa, A., Basavanhally, A., Gonzalez, F., Gilmore, H., Feldman, M., Ganesan, S., et al. (2014). Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In Medical Imaging 2014: Digital Pathology (pp. 1–15). https:/ /doi.org/10.1117/12.2043872.


48. Hua, K. L., Hsu, C. H., Hidayati, S. C., Cheng, W. H., & Chen, Y. J. (2015). Computer-aided classification of lung nodules on computed tomography images via deep learning technique. OncoTargets and Therapy, 8, 2015–2022. https://doi.org/10.2147/OTT.S80733. 49. Sirinukunwattana, K., Raza, S. E. A., Tsang, Y. W., Snead, D. R. J., Cree, I. A., & Rajpoot, N. M. (2016). Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Transactions on Medical Imaging, 35(5), 1196–1206. https://doi.org/10.1109/TMI.2016.2525803. 50. Bejnordi, B. E., Zuidhof, G., Balkenhol, M., Hermsen, M., Bult, P., van Ginneken, B., et al. (2017). Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images. Journal of Medical Imaging, 4(4), 44504. https://doi.org/10.1117/1.JMI.4.4.044504. 51. Wang, D., Khosla, A., Gargeya, R., Irshad, H., & Beck, A. H. (2016). Deep learning for identifying metastatic breast cancer. ArXiv Preprint ArXiv, 1606, 05718. 52. Obeid, I., & Picone, J. (2016). The Neural Engineering Data Consortium: Building community resources to advance research. Philadelphia, PA: Temple University. https://doi.org/https:// www.isip.piconepress.com/publications/reports/2016/nsf/cri/. 53. Campbell, C., Mecca, N., Obeid, I., & Picone, J. (2017). The Neuronix HPC Cluster: Improving cluster management using free and open source software tools. In I. Obeid & J. Picone (Eds.), IEEE Signal Processing in Medicine and Biology Symposium (p. 1). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2017.8257042. 54. Yoo, A. B., Jette, M. A., & Grondona, M. (2003). SLURM: Simple Linux utility for resource management. In D. Feitelson, L. Rudolph, & U. Schwiegelshohn (Eds.), Job scheduling strategies for parallel processing (pp. 44–60). Berlin: Springer. 55. Red Hat Inc. (2019). What is Gluster? Retrieved from https://docs.gluster.org/en/v3/ AdministratorGuide/GlusterFS Introduction/. 56. Bonwick, J., Ahrens, M., Henson, V., Maybee, M., & Shellenbaum, M. (2003). The Zettabyte File System. In Proceedings of the 2nd Usenix Conference on File and Storage Technologies, San Francisco, CA (pp. 1–13). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/ download?doi=10.1.1.184.3704&rep=rep1&type=pdf. 57. Satyanarayanan, M., Goode, A., Gilbert, B., Harkes, J., & Jukic, D. (2013). OpenSlide: A vendor-neutral software foundation for digital pathology. Journal of Pathology Informatics, 4(1), 27. https://doi.org/10.4103/2153-3539.119005. 58. Clunie, D. (2019). DICOM whole slide imaging: Acquire, archive, view, annotate, download and transmit, Bangor, PA. Retrieved from https://www.dclunie.com/papers/ HIMA_2017_DICOMWSI_Clunie.pdf. 59. Leica Biosystems. (2008). Digital slides and third-party data interchange (MAN-0069, Revision B), Wetzlar, Germany. Retrieved August 22, 2019, from http://web.archive.org/web/ 20120420105738/http://www.aperio.com/documents/api/Aperio_Digital_Slides_and_Thirdparty_data_interchange.pdf. 60. Leica Biosystems. (2018). Aperio ImageScope - Pathology slide viewing software. Retrieved from https://www.leicabiosystems.com/digital-pathology/manage/aperio-imagescope/. 61. Rojo, M. G., Garcia, G. B., Mateos, C. P., Garcia, J. G., & Vicente, M. C. (2006). Critical comparison of 31 commercially available digital slide systems in pathology. International Journal of Surgical Pathology, 14(4), 285–305. https://doi.org/10.1177/1066896906292274. 62. Brzezinski, R. (2016). 
HIPAA privacy and security compliance - Simplified: Practical Guide for healthcare providers and managers 2016 Edition (3rd ed.). Seattle, WA: CreateSpace Independent Publishing Platform. 63. Epic Systems Corporation. (2019). EPIC outcomes. Retrieved from https://www.epic.com/. 64. The Cornell Law School. (2019). 42 CFR 493.1274 - Standard: cytology. Retrieved from https:/ /www.law.cornell.edu/cfr/text/42/493.1274. 65. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/ A:1010933404324.


66. Armato, S. G., III, McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., et al. (2011). The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2), 915–931. https://doi.org/10.1118/1.3528204. 67. Roux, L., & Capron, F. (2014). MITOS atypia 2014 grand challenge. Retrieved April 22, 2019, from https://mitos-atypia-14.grand-challenge.org/. 68. Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), Lille, France (pp. 448–456). Retrieved from https://arxiv.org/abs/1502.03167. 69. Ba, J., & Kingma, D. (2014). Adam: A method for stochastic optimization. In International Conference on Learning Representations, Banff, Canada (pp. 1–15). https://doi.org/ arXiv:1412.6980. 70. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii (pp. 1–8). https://doi.org/10.1109/CVPR.2017.195. 71. Fukunaga, K. (1990). Introduction to statistical pattern recognition. Computer science and scientific computing (2nd ed.). San Diego, CA: Academic Press. Retrieved from https://www.elsevier.com/books/introduction-to-statistical-pattern-recognition/fukunaga/ 978-0-08-047865-4. 72. Shah, V., von Weltin, E., Ahsan, T., Obeid, I., & Picone, J. (2019). On the use of nonexperts for generation of high-quality annotations of seizure events. Journal of Clinical Neurophysiology (in review). Retrieved from https://www.isip.piconepress.com/publications/ unpublished/journals/2019/elsevier_cn/ira. 73. CDC. (1988). Clinical laboratory improvement amendments. Retrieved from https:// wwwn.cdc.gov/clia/Regulatory/default.aspx.

Chapter 4

Transient Artifacts Suppression in Time Series via Convex Analysis

Yining Feng, Baoqing Ding, Harry Graber, and Ivan Selesnick

4.1 Introduction

Transients prevail in time series signals, e.g., biomedical signals and financial time series. Depending on the application and on how informative the transients are, they can be the signals of interest to be analyzed, or merely interference. Transient artifacts sometimes plague acquired biomedical signals, e.g., EEG/ECG [14], near-infrared spectroscopy (NIRS) [2, 13], and infrared oculography (IROG) [7]. Thus, the suppression of transient artifacts without distorting the underlying signal of interest is important. Traditional linear time-invariant (LTI) filtering fails at this task because it requires that the frequency bands of the artifacts and of the underlying signal of interest do not overlap, which is usually not the case.

This book chapter considers the suppression of transient artifacts in noisy signals, where the artifacts are of two types: spikes or brief waves, and step discontinuities. We model the observed time series as the superposition of three morphologically distinct components [27] in additive noise:

y = f + x_1 + x_2 + w, \qquad y, f, x_1, x_2, w \in \mathbb{R}^N.    (4.1)

Y. Feng (✉) · I. Selesnick
Tandon School of Engineering, New York University, Brooklyn, NY, USA
e-mail: [email protected]; [email protected]

B. Ding
Tandon School of Engineering, New York University, Brooklyn, NY, USA
School of Mechanical Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China

H. Graber
Photon Migration Technologies Corp., Glen Head, NY, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
I. Obeid et al. (eds.), Signal Processing in Medicine and Biology,
https://doi.org/10.1007/978-3-030-36844-9_4

Fig. 4.1 Illustration of transient artifacts x_1, x_2, low-pass signal f, clean signal x_1 + x_2 + f, and noisy observation y

Here f is a low-pass signal, x_1 comprises transient artifacts that are spikes and brief waves, x_2 comprises step discontinuity artifacts, and w is additive white Gaussian noise. We model x_1 as a sparse piecewise constant signal and x_2 as a piecewise constant signal. As an example, Fig. 4.1 illustrates the three components x_1, x_2, and f, the clean signal x_1 + x_2 + f, and the noisy observation y.

Given the time series y, we formulate the estimation of x_1, x_2, and f as the solution to an optimization problem in which x_1, x_2, and f are the optimization variables. The quality of the estimate of the underlying signal f depends on the quality of the estimates of the transient artifacts x_1 and x_2: the better x_1 and x_2 are estimated, the better f is estimated. We employ a sparse optimization approach, which defines a type of nonlinear filter. Such optimization problems are usually regularized by the ℓ1 norm to induce sparsity. The ℓ1 norm is classically adopted for this purpose; however, it has a tendency to underestimate the true values of sparse signals. Hence, various non-convex regularizers are often considered as alternatives to the ℓ1 norm. But then the objective function is generally non-convex and has extraneous suboptimal local minimizers [20]. To avoid such complications, it is advantageous to maintain the convexity of the objective function. This is possible, even when the regularizer is not convex, provided the non-convex regularizer is carefully defined [4, 5, 11, 21, 24].


In this chapter, we propose a new non-convex regularizer that improves upon the ℓ1 norm while maintaining the convexity of the objective function to be minimized. The proposed non-convex regularizer is designed specifically for the joint estimation of sparse piecewise constant signals and sparse signals. It is defined using a generalized Moreau envelope (GME), a generalization of the well-known Moreau envelope of convex analysis [24]. While the generalized Moreau envelope of a convex function is always convex, in this chapter it is used to construct a non-convex function. We can prescribe the proposed regularizer in such a way that the objective function is convex, and we provide a simple forward-backward splitting (FBS) algorithm to reliably obtain the global minimum of the resulting convex optimization problem.

The chapter is organized as follows. In Sect. 4.2, we describe the fused lasso penalty for modeling sparse piecewise constant signals and the total variation penalty for piecewise constant signals; we also define the generalized Moreau envelope of a function and provide a formula for its gradient. In Sect. 4.3, we formulate the transient suppression problem as a convex optimization problem using the conventional (convex) ℓ1 norm conjoint penalty for the joint estimation of sparse piecewise constant signals and sparse signals, introduce a parameter-setting strategy for this optimization problem, and present a forward-backward splitting (FBS) based optimization algorithm. In Sect. 4.4, we introduce a new non-convex generalized conjoint penalty, which we define using the generalized Moreau envelope, and discuss how to design the proposed penalty to achieve good performance. In Sect. 4.5, we formulate the transient artifact suppression problem as a convex optimization problem using the new non-convex generalized conjoint penalty, and provide an iterative algorithm, based on forward-backward splitting, to solve it. In Sect. 4.6, we demonstrate the proposed approach with simulated data and real NIRS data.

4.1.1 Related Work

Numerous methods have been proposed to suppress transient artifacts in biomedical time series such as NIRS data, including non-negative matrix factorization [10] (non-convex optimization), a method aided by acceleration data [17] (extra information), independent component analysis [1], and hybrid methods [15] (complex architecture). Most of these methods require multi-channel information or reference signals. When the artifacts behave distinctively among the channels, or in the absence of multi-channel information, we have to deal with single-channel time series. A thorough review and comparison of five classical single-channel methods (both linear and nonlinear) was presented in Ref. [9]: band-pass filtering, principal component analysis, Kalman filtering, spline interpolation, and wavelet thresholding. Linear filtering techniques fail, as expected; spline interpolation and wavelet thresholding were the best performing methods. Wavelet thresholding is a nonlinear filtering technique [14, 18, 19].


Wavelet transforms decompose signals into multiple subbands: the transient artifacts manifest as large wavelet coefficients across the subbands, while the low-pass signal f is contained in only the low-frequency subbands. By thresholding the wavelet coefficients and then reconstructing the data from the thresholded coefficients, the transient signal can be estimated. In the spline interpolation approach, after identification of the segments containing the transient artifacts in the time domain, each segment is modeled separately by cubic spline interpolation [23], and the artifacts are extracted accordingly.

The formulation of transient artifact suppression as an optimization problem was considered in [25, 26]. However, only separable non-convex regularizers were considered in those works, which rules out the possibility of maintaining the convexity of the objective function to be minimized. Recent work [11] developed a more sophisticated type of non-convex regularizer that maintains the convexity of the objective function, but the problem was formulated to address sparse and piecewise constant artifacts, only one of the two types of artifacts we consider here. In this book chapter, we consider the suppression of both types of artifacts with a non-convex regularizer that can be prescribed to preserve the convexity of the objective function.

4.2 Preliminaries

4.2.1 Difference Matrices

Let D_k be the order-k difference matrix. For example, the matrices D_1 ∈ R^{(N−1)×N} and D_2 ∈ R^{(N−2)×N} are given by

D_1 = \begin{bmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix},    (4.2)

and

D_2 = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & & \ddots & & \\ & & & 1 & -2 & 1 \end{bmatrix},    (4.3)

etc. The linear operator D_k is a discrete approximation of the k-th order derivative. The frequency response of D_k is given by

D_k^f(\omega) = (1 - e^{-j\omega})^k = \left( 2j\, e^{-j\omega/2} \sin(\omega/2) \right)^k.    (4.4)


We also denote the first-order discrete-time integrator by S ∈ R^{N×(N−1)},

S = \begin{bmatrix} 0 & & & \\ 1 & & & \\ 1 & 1 & & \\ \vdots & \vdots & \ddots & \\ 1 & 1 & \cdots & 1 \end{bmatrix},    (4.5)

and note that D_1 S = I. We can factor the order-k difference matrix into an order-d difference matrix and an order-(k−d) difference matrix (d < k); then \bar{D}_{k-d} ∈ R^{(N−k)×(N−d)} and

D_k = \bar{D}_{k-d} D_d.    (4.6)
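As a quick illustration of these operators (a minimal sketch using SciPy; the variable names are our own), the difference matrices and the integrator can be built as sparse banded matrices, and the identities D_1 S = I and (4.6) can be checked numerically:

```python
import numpy as np
from scipy import sparse

N = 8

# Order-1 and order-2 difference matrices, as in (4.2) and (4.3).
D1 = sparse.diags([-np.ones(N - 1), np.ones(N - 1)], [0, 1],
                  shape=(N - 1, N))
D2 = sparse.diags([np.ones(N - 2), -2 * np.ones(N - 2), np.ones(N - 2)],
                  [0, 1, 2], shape=(N - 2, N))

# First-order discrete-time integrator S in (4.5): strictly lower
# triangular matrix of ones (first row is zero).
S = sparse.tril(np.ones((N, N - 1)), k=-1)

# Verify D1 S = I, and the factorization D2 = D1bar D1 from (4.6).
print(np.allclose((D1 @ S).toarray(), np.eye(N - 1)))     # True
D1bar = sparse.diags([-np.ones(N - 2), np.ones(N - 2)], [0, 1],
                     shape=(N - 2, N - 1))
print(np.allclose((D1bar @ D1).toarray(), D2.toarray()))  # True
```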

4.2.2 Soft-Thresholding, Total Variation, and Fused Lasso Penalty

The soft-thresholding function with threshold parameter λ is defined as

\mathrm{soft}(y, \lambda) = \begin{cases} 0, & |y| \le \lambda \\ (|y| - \lambda)\,\mathrm{sign}(y), & |y| > \lambda. \end{cases}    (4.7)

The estimation of a piecewise constant signal x from its noisy observation y, known as the total variation denoising (TVD) problem, is

x^* = \arg\min_x \left\{ \tfrac{1}{2} \| y - x \|_2^2 + \lambda_1 \| D_1 x \|_1 \right\}.    (4.8)

If we further impose x to be sparse through an additional ℓ1-norm regularization, then the estimation of a sparse piecewise constant signal x from its noisy observation y, known as the fused lasso signal estimation problem [29], can be formulated as

x^* = \arg\min_x \left\{ \tfrac{1}{2} \| y - x \|_2^2 + \lambda_1 \| D_1 x \|_1 + \lambda_0 \| x \|_1 \right\},    (4.9)

where λ_0 > 0, λ_1 > 0, and D_1 is the matrix (4.2). The solution to problem (4.9) is given by [12]

x^* = \mathrm{soft}(\mathrm{tvd}(y, \lambda_1), \lambda_0),    (4.10)

where tvd(·, λ) is the solution to the total variation (TV) denoising problem (4.8). TV denoising can be solved exactly in finite time by fast solvers, e.g., [8]. In the context of convex analysis, TV denoising constitutes the proximal operator of λ‖D_1 · ‖_1. Likewise, the proximal operator of the fused lasso penalty is given by (4.10).
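The following minimal sketch illustrates (4.7)–(4.10) in NumPy. The TV denoising step here uses projected gradient ascent on the dual problem, a simple (if not the fastest) substitute for the exact finite-time solvers cited above; the function names are our own:

```python
import numpy as np

def soft(y, lam):
    """Soft-thresholding (4.7), applied elementwise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def tvd(y, lam, n_iter=500):
    """TV denoising (4.8) via projected gradient on the dual problem:
    minimize 0.5*||y - D1' z||^2 over |z| <= lam; then x = y - D1' z."""
    z = np.zeros(len(y) - 1)
    mu = 0.25  # step size; the max eigenvalue of D1 D1' is below 4
    for _ in range(n_iter):
        x = y - np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))  # y - D1'z
        z = np.clip(z + mu * np.diff(x), -lam, lam)                  # project
    return y - np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))

def fused_lasso_prox(y, lam0, lam1):
    """Solution of the fused lasso problem (4.9), via (4.10)."""
    return soft(tvd(y, lam1), lam0)

# Example: denoise a noisy sparse piecewise constant signal.
rng = np.random.default_rng(0)
x_true = np.concatenate([np.zeros(40), 2 * np.ones(20), np.zeros(40)])
y = x_true + 0.3 * rng.standard_normal(len(x_true))
x_hat = fused_lasso_prox(y, lam0=0.2, lam1=1.0)
```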

4.2.3 The Generalized Moreau Envelope

In this section, we define the generalized Moreau envelope of a convex function. This will be used to define the proposed non-convex penalty. For a description of the generalized Moreau envelope and its properties, see Refs. [16, 24]. We will use results from convex analysis [3]. We denote by Γ_0(R^N) the set of proper lower semicontinuous convex functions from R^N to R ∪ {+∞}. The Moreau envelope of a convex function f ∈ Γ_0(R^N), denoted f^M : R^N → R, is defined as

f^M(x) = \inf_{v \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| x - v \|_2^2 + f(v) \right\}.    (4.11)

Similarly, we define the generalized Moreau envelope (GME) of f as follows.

Definition 4.1 Let f ∈ Γ_0(R^N) and B ∈ R^{M×N}. We define the generalized Moreau envelope f_B^M : R^N → R as

f_B^M(x) = \inf_{v \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| B(x - v) \|_2^2 + f(v) \right\}.    (4.12)

The function is parameterized by the matrix B. The generalized Moreau envelope of a convex function f is differentiable, even if f itself is not.

Lemma 4.1 The generalized Moreau envelope of f ∈ Γ_0(R^N) is differentiable and its gradient is given by

\nabla f_B^M(x) = B^T B \left( x - \arg\min_{v \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| B(x - v) \|_2^2 + f(v) \right\} \right).    (4.13)

Proof Inasmuch as f is a convex function with a unique critical point, f is coercive (Corollary 8.7.1 in [22]) and ‖B · ‖_2^2 is bounded below; therefore f_B^M is exact in Γ_0(R^N) (Proposition 12.14 (ii) in [3]), that is,

f_B^M(x) = \min_{v \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| B(x - v) \|_2^2 + f(v) \right\}.

Since ‖B · ‖_2^2 is Fréchet differentiable everywhere, by Proposition 18.7 in [3], f_B^M is Fréchet differentiable. The gradient is given by (Theorem 3.8 (e) in [28])

\nabla f_B^M(x) = \nabla \left\{ \tfrac{1}{2} \| B(x - v) \|_2^2 \right\} = B^T B (x - v).

Because f_B^M is exact, v is achieved by

v = \arg\min_{v \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| B(x - v) \|_2^2 + f(v) \right\}.

This completes the proof. ∎

4.3 Transient Artifacts Suppression

4.3.1 Problem Formulation

We formulate the transient artifacts suppression (TAS) problem for the signal model (4.1) as a variation of morphological component analysis (MCA):

\arg\min_{x_1, x_2, f \in \mathbb{R}^N} \left\{ \tfrac{1}{2} \| y - x_1 - x_2 - f \|_2^2 + \tfrac{\alpha}{2} \| D_k f \|_2^2 + \lambda_0 \| x_1 \|_1 + \lambda_1 \| D_1 x_1 \|_1 + \lambda_2 \| D_1 x_2 \|_1 \right\},    (4.14)

where D_k is the order-k difference operator. The low-pass signal f is regularized using standard (quadratic) Tikhonov regularization. The transient artifact signal x_1 is penalized using the ℓ1 norm fused lasso penalty, and x_2 is regularized by the ℓ1 norm total variation penalty. Since the objective function in (4.14) is quadratic with respect to f, the solution for f can be written in closed form,

f^* = (I + \alpha D_k^T D_k)^{-1} (y - x_1 - x_2),    (4.15)

where (I + αD_k^T D_k)^{-1} represents a low-pass filter, with frequency response

\mathrm{LPF}^f(\omega) = \frac{1}{1 + \alpha\, 4^k \sin^{2k}(\omega/2)}.    (4.16)

The parameter α > 0 is related to the cut-off frequency f_c of the low-pass filter by

\alpha = \frac{1}{4^k \sin^{2k}(\pi f_c)}.    (4.17)

This relation is obtained by setting the frequency response equal to one half at the frequency ωc = 2πfc . We rewrite the objective function in (4.14) in terms of x1 , x2 by substituting f in (4.15). Then the objective function F is given by


F(x_1, x_2) = \tfrac{1}{2} \| y - x_1 - x_2 - f \|_2^2 + \tfrac{\alpha}{2} \| D_k f \|_2^2 + \lambda_0 \| x_1 \|_1 + \lambda_1 \| D_1 x_1 \|_1 + \lambda_2 \| D_1 x_2 \|_1    (4.18)
           = \tfrac{1}{2} \| (I - (I + \alpha D_k^T D_k)^{-1})(y - x_1 - x_2) \|_2^2 + \tfrac{\alpha}{2} \| D_k (I + \alpha D_k^T D_k)^{-1} (y - x_1 - x_2) \|_2^2 + \lambda_0 \| x_1 \|_1 + \lambda_1 \| D_1 x_1 \|_1 + \lambda_2 \| D_1 x_2 \|_1.    (4.19)

Proposition 4.1 The objective function F can be written more simply as

F(x_1, x_2) := \tfrac{1}{2} \| H(y - x_1 - x_2) \|_2^2 + \lambda_0 \| x_1 \|_1 + \lambda_1 \| D_1 x_1 \|_1 + \lambda_2 \| D_1 x_2 \|_1,    (4.20)

where

H^T H := \alpha D_k^T (I + \alpha D_k D_k^T)^{-1} D_k.    (4.21)

Proof Expanding the quadratic terms in (4.19), we have

H^T H = (I - (I + \alpha D_k^T D_k)^{-1})^2 + \alpha (I + \alpha D_k^T D_k)^{-1} D_k^T D_k (I + \alpha D_k^T D_k)^{-1}
      = I - 2(I + \alpha D_k^T D_k)^{-1} + (I + \alpha D_k^T D_k)^{-2} + \alpha (I + \alpha D_k^T D_k)^{-1} D_k^T D_k (I + \alpha D_k^T D_k)^{-1}
      = I - 2(I + \alpha D_k^T D_k)^{-1} + (I + \alpha D_k^T D_k)^{-1} (I + \alpha D_k^T D_k)(I + \alpha D_k^T D_k)^{-1}
      = I - (I + \alpha D_k^T D_k)^{-1}
      = \alpha D_k^T (I + \alpha D_k D_k^T)^{-1} D_k.

The last equality is obtained using the matrix inversion lemma. Hence we complete the proof. ∎
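The identity in Proposition 4.1 is easy to confirm numerically. The following sketch (our own check, with arbitrary small values of N, k, and α) compares the two expressions for H^T H:

```python
import numpy as np

N, k, alpha = 12, 2, 5.0

# Build the order-k difference matrix D_k by repeated first differences.
D = np.eye(N)
for _ in range(k):
    D = np.diff(D, axis=0)  # each pass applies one first difference

Q = np.linalg.inv(np.eye(N) + alpha * D.T @ D)

lhs = np.eye(N) - Q                                   # I - (I + a D'D)^-1
rhs = alpha * D.T @ np.linalg.inv(np.eye(N - k) + alpha * D @ D.T) @ D

print(np.allclose(lhs, rhs))  # True, by the matrix inversion lemma
```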

It turns out that we do not need to specify H itself; it is sufficient to work with H^T H for the algorithm implementation (the same holds for all the filters introduced in the rest of the chapter). Note that the filter H^T H and the first-order difference D_1 both have a notch at ω = 0; hence, x_2 is only uniquely defined up to an additive constant, i.e., adding or subtracting a constant from x_2 does not change the cost function value. The non-uniqueness of x_2 would drastically affect the estimation quality of x_1 + x_2. To achieve uniqueness of the solution, we substitute the variable u = D_1 x_2, and in turn the objective function F becomes

F(x_1, u) = \tfrac{1}{2} \| Hy - H x_1 - H_1 u \|_2^2 + \lambda_0 \| x_1 \|_1 + \lambda_1 \| D_1 x_1 \|_1 + \lambda_2 \| u \|_1,    (4.22)


where we factor D_1 out of H, and

H_1^T H_1 := \alpha \bar{D}_{k-1}^T (I + \alpha D_k D_k^T)^{-1} \bar{D}_{k-1}.    (4.23)

In turn, we obtain the estimate of x_2 as x_2 = Su, which follows from

u = D_1 S u = D_1 x_2.    (4.24)

For simplicity, we write the objective function in a compact form. Here we define

x := \begin{bmatrix} x_1 \\ u \end{bmatrix},    (4.25)

\hat{y} := H y,    (4.26)

A := \begin{bmatrix} H & H_1 \end{bmatrix},    (4.27)

W := \begin{bmatrix} \lambda_0 I \\ \lambda_1 D_1 \\ \lambda_2 I \end{bmatrix},    (4.28)

and we denote the conjoint fused lasso and ℓ1 norm penalty by the function φ : R^{2N−1} → R,

\varphi(x) := \| W x \|_1;    (4.29)

then the objective function F : R^{2N−1} → R is

F(x) := \tfrac{1}{2} \| \hat{y} - A x \|_2^2 + \| W x \|_1.    (4.30)

4.3.2 Optimization Algorithm

We propose an optimization algorithm for the problem formulation (4.30) as an application of the forward-backward splitting algorithm. The implementation of the forward-backward splitting algorithm for minimizing the objective function F requires the Lipschitz constant of the quadratic term in (4.30), given by the maximum eigenvalue of A^T A.

Lemma 4.2 The maximum eigenvalues of the linear operators H^T H and H_1^T H_1 in (4.21), (4.23) are upper bounded by, respectively,

\rho_0 \le \left( 1 + (\alpha 4^d)^{-1} \right)^{-1},    (4.31)

and

\rho_1 \le \frac{\alpha}{d} \left( (d-1)\,\frac{1}{\alpha} \right)^{1 - \frac{1}{d}}.    (4.32)

Proof The linear operators H^T H and H_1^T H_1 are the matrix forms of LTI filters; thus their maximum eigenvalues are upper bounded by the maximum values of their frequency responses. The corresponding frequency responses are

(H^T H)^f(\omega) = \frac{(D_k^T D_k)^f(\omega)}{1/\alpha + (D_k^T D_k)^f(\omega)} = \frac{[4 \sin^2(\omega/2)]^d}{1/\alpha + [4 \sin^2(\omega/2)]^d},

(H_1^T H_1)^f(\omega) = \frac{(\bar{D}_{k-1}^T \bar{D}_{k-1})^f(\omega)}{1/\alpha + (D_k^T D_k)^f(\omega)} = \frac{[4 \sin^2(\omega/2)]^{d-1}}{1/\alpha + [4 \sin^2(\omega/2)]^d}.

Setting the derivative to zero, we obtain the values in (4.31) and (4.32). ∎

An illustration of the aforementioned frequency responses and the corresponding impulse responses is given in Fig. 4.2.

Fig. 4.2 Frequency responses and impulse responses of the filters H^T H and H_1^T H_1 (shown for k = 4, f_c = 0.03)


Proposition 4.2 The maximum eigenvalue of the linear operator

A^T A = \begin{bmatrix} H^T H & H^T H_1 \\ H_1^T H & H_1^T H_1 \end{bmatrix}    (4.33)

is upper bounded by

\rho \le 2 \rho_1,    (4.34)

where ρ_1 is defined in (4.32).

Proof A^T A is upper bounded by a block diagonal matrix, i.e.,

\beta \begin{bmatrix} H^T H & \\ & H_1^T H_1 \end{bmatrix} - \begin{bmatrix} H^T H & H^T H_1 \\ H_1^T H & H_1^T H_1 \end{bmatrix} \succeq 0,

when β ≥ 2. Denoting

P = \beta \begin{bmatrix} H^T H & \\ & H_1^T H_1 \end{bmatrix} - \begin{bmatrix} H^T H & H^T H_1 \\ H_1^T H & H_1^T H_1 \end{bmatrix},

we can rewrite

P = \begin{bmatrix} H^T & \\ & H_1^T \end{bmatrix} \left( \begin{bmatrix} \beta - 1 & -1 \\ -1 & \beta - 1 \end{bmatrix} \otimes I \right) \begin{bmatrix} H & \\ & H_1 \end{bmatrix},

where ⊗ is the Kronecker product. P is positive semidefinite if the 2 × 2 matrix is positive semidefinite, i.e., if β ≥ 2. It is then immediate that the maximum eigenvalue of the block diagonal matrix is upper bounded by ρ ≤ β max{ρ_0, ρ_1}. Since ρ_0 < ρ_1, we have ρ ≤ 2ρ_1. Hence we complete the proof. ∎

Next, we propose the optimization algorithm, based on forward-backward splitting.

Proposition 4.3 Let y ∈ R^N, α > 0, and μ = 2/ρ with ρ given by (4.34); let H^T H and H_1^T H_1 be given by (4.21), (4.23), and define

H^T H_1 := \alpha D_k^T (I + \alpha D_k D_k^T)^{-1} \bar{D}_{k-1},    (4.35)

H_1^T H := \alpha \bar{D}_{k-1}^T (I + \alpha D_k D_k^T)^{-1} D_k.    (4.36)

Then the iteration

z_1^{(i)} = H^T H (x_1^{(i)} - y) + H^T H_1 u^{(i)}    (4.37a)
z_2^{(i)} = H_1^T H (x_1^{(i)} - y) + H_1^T H_1 u^{(i)}    (4.37b)
x_1^{(i+1)} = \mathrm{soft}(\mathrm{tvd}(x_1^{(i)} - \mu z_1^{(i)}, \mu \lambda_1), \mu \lambda_0)    (4.37c)
u^{(i+1)} = \mathrm{soft}(u^{(i)} - \mu z_2^{(i)}, \mu \lambda_2)    (4.37d)

converges to the minimizer of F in (4.22).

(4.38)

converges to the minimizer of J [3]. The parameter μ should be set such that 0  μ  2/ρ where ρ is given by (4.34). The gradient of f1 is given by ˆ ∇f1 (x) = AT(Ax − y), or more explicitly, ∇f1

! " ! T "! " ! T " x1 H H H TH1 x1 H Hy = − . u H1TH H1TH1 u H1THy

The non-smooth function f2 is ϕ defined in (4.29). The proximal operator of ϕ is two proximal operators—the proximal operator in (4.10) corresponding to the

4 Transient Artifacts Suppression in Time Series via Convex Analysis

119

fused lasso penalty, soft-thresholding corresponding to 1 norm—performed on x1 , u separately. In turns, (4.38) can be written as ˆ x (i+1) = proxμϕ (x (i) − μAT(Ax (i) − y)).

(4.39)

Write the FBS algorithm in (4.39) explicitly, we get the iteration steps in (4.37).

The matrix multiplications performed in the algorithm, namely H TH , H1TH1 , H1TH , H TH1 , only involve with operations of banded matrices, which has fast implementations. We can get the estimation of x2 by integrating u obtained from the proposed algorithm, i.e., x2 = Su.

4.3.3 Parameters Three regularization parameters λ0 , λ1 , λ2 must be specified properly to get a good estimation of the transient artifacts, i.e., x1 + x2 . Classically, the parameters are set according to the noise level σ . We first denote the optimal solutions by x1∗ , u∗ and rewrite the fixed point updates (FBS updates after convergence) in (4.37) as x1∗ = soft(tvd(x1∗ − μz(H TH (x1∗ − y) + H TH1 u∗ ), μλ1 ), μλ0 ),

(4.40)

u∗ = soft(u∗ − μ(H1TH (x1∗ − y) + H1TH1 u∗ ), μλ2 ).

(4.41)

If x1 , u are identically zero, and y only consists of white Gaussian noise w, then the λs must be chosen so that the above fixed point updates are 0 = soft(tvd(μH TH w, μλ1 ), μλ0 ),

(4.42)

0 = soft(μH1TH w, μλ2 ),

(4.43)

which means that the shrinkage operators (proximal operators) must annihilate the noise completely. In this sense, similar to 3-sigma rule, λ2 = β2 H1TH 2 σ in (4.43), √ λ0 = β0 H TH 2 σ and λ1 = β1 H TH 2 Nσ in (4.42) [21]. In the case where x1 , u are not identically zero, x1 , u are interdependent, which is obvious both from the MCA problem formulation and the fixed point updates in (4.40), (4.41). Because the two morphological components x1 , x2 have similar structures, changing the relative weights of λs, i.e., β0 , β1 , β2 , would not change the general structure of the estimation x1 + x2 , as shown in the four cases (iii), (vi), (ix), (xii) in Fig. 4.3. However, the micro-structures would morph between the two components according to the weights. It is evident in Fig. 4.3(x)–(xii) that the estimation quality of x1 + x2 (measured by RMSE) is better when the estimation of x1 and x2 agree with their respective prior - x1 being sparse and piecewise constant, x2 being piecewise constant.

120 3

Y. Feng et al. 3

(i) L1 Solution x1, RMSE:0.436

2

2

1

1

0

0

-1

-1

-2

(vii) L1 Solution x1, RMSE:0.61644 Clean L1

-2 0

4

20

40

60

80

100

120

140

160

180

200

0 4

(ii) L1 Solution x2, RMSE:0.529

3

3

2

2

1

1

0

0

-1

-1 0

4

20

40

60

(iii) L1 Solution x1+x2,

80

0

100

= 0,

1

120

= 0.06,

140

2

160

180

200

= 0.06, RMSE:0.467

0

3

2

2

1

1

0

0

-1

-1 0

4

20

40

60

80

100

120

140

160

180

200

(iv) L1 Solution x1, RMSE:0.139

60

20

40

60

(ix) L1 Solution x1+x2,

0 4

2

40

80

100

120

140

120

140

160

180

200

160

180

200

(viii) L1 Solution x2, RMSE:0.643

4

3

20

20

40

60

80

0

100

= 3e-4 ,

80

1

100

= 0.06,

2

= 0.06, RMSE:0.45

120

140

160

180

200

100

120

140

160

180

200

100

120

140

160

180

200

(x) L1 Solution x1, RMSE:0.117

2

0

0

-2

-2 0

20

40

60

80

100

120

140

160

180

200

0

(v) L1 Solution x2, RMSE:0.357

20

40

60

80

(xi) L1 Solution x2, RMSE:0.3

1

1

0

0

-1

-1 0

4

20

40

60

(vi) L1 Solution x1+x2,

80

0

100

= 0.06,

120

1

140

= 0.054,

2

160

180

200

= 0.063, RMSE:0.375

0 4

3

3

2

2

1

1

0

0

-1

-1 0

20

40

60

80

100

Time (n)

120

140

160

180

200

20

40

60

(xii) L1 Solution x1+x2,

0

20

40

60

80

0

= 0.105,

80

100

1

= 0.03,

120

140

2

= 0.063, RMSE:0.312

160

180

200

Time (n)

Fig. 4.3 Effects on the estimation results when changing relative weights of λs

Based on the principle mentioned above, we introduce a general strategy of tuning the parameters. First, we set β0 = 0 and tune β1 , β2 together, so that both of the estimations of x1 , x2 are piecewise constant, as shown in Fig. 4.3(i)–(iii). Then we increase β0 to force x1 to zero baseline, and adjust β1 , β2 at the same time to avoid over-penalization (case in (vii)–(ix)). The satisfactory results are presented in Fig. 4.3(x)–(xii).

4 Transient Artifacts Suppression in Time Series via Convex Analysis

121

4.4 The Generalized Conjoint Penalty In this section, we propose a non-convex generalization of the conjoint penalty W · 1 . We first define a special case of the generalized Moreau envelope, with f (v) = W v 1 in (4.12). Definition 4.2 We define SB : R2N −1 → R to be the generalized Moreau envelope of the fused lasso penalty, 1  B(x − v) 22 + W v 1 , (4.44) SB (x) := min v∈R2N−1 2 where W is given by (4.28). The function is parameterized by matrix B ∈ RM×2N −1 . As it is a generalized Moreau envelope, the function SB has several properties such as convexity, exactness, and differentiability. We now define the non-convex generalization of the conjoint penalty. Definition 4.3 We define the generalized conjoint penalty ψB : R2N −1 → R as ψB (x) := W x 1 − SB (x),

(4.45)

where SB is given by (4.44). The function is parameterized by matrix B ∈ RM×2N −1 . The generalized conjoint penalty generalizes the generalized minimax concave penalty [24], and enjoys similar properties. An effective sparse-inducing penalty function should be non-decreasing in any direction, i.e., penalize the large values more than or the same as the small values. Such important property can be stated as: Proposition 4.4 Let x ∈ R2N −1 with [W T sign(W x)]i = 0. The generalized conjoint penalty ψB has the property that [∇ψB ]i either has the same sign as [W T sign(W x)]i or is equal to zero. Proof It is straightforward that SB (x) =

min

v∈R2N−1

  W v 1 + 12 B(x − v) 22

 [ W v 1 + 12 B(x − v) 22 ]v=x = W x 1 . Since SB is convex and differentiable, we have SB (v) + [∇SB (v)]T (x − v)  SB (x), then naturally

∀x, v ∈ R2N −1 ,

122

Y. Feng et al.

SB (v) + [∇SB (v)]T (x − v)  W x 1 ,

∀x, v ∈ R2N −1 .

We also have W v 1 + [∇ W v 1 ]T (x − v)  W x 1 ,

∀x, v ∈ R2N −1 ,

where ∇ W v 1 = W T sign(W v). Since W · 1 cone consists of hyperplanes, it is linear within different hyperplanes W v 1 = [∇ W v 1 ]T v then we have [∇ W v 1 ]T x  W x 1 ,

∀x, v ∈ R2N −1 .

Specifically we have W x 1 = [∇ W v 1 ]T x,

∀x, v ∈ Ω,

where v lives in the same domain Ω of a hyperplane as x. We have the same assumption in the following derivation. Then SB (v) + [∇SB (v)]T (x − v)  [∇ W v 1 ]T x,

∀x, v ∈ Ω.

Let x = [0, 0, · · · , t, · · · , 0]T where t is at position n, then we have c(v) + [∇SB (v)]n t  [∇ W v 1 ]n t,

∀t ∈ R, v ∈ R2N −1 ,

where c(v) ∈ R does not dependent on t. It follows that |[∇SB (v)]n |  |[∇ W v 1 ]n |. Let x ∈ RN with [W T sign(W x)]i = 0, then we have ∂SB ∂ψB (x) = [W T sign(W x)]i − (x). ∂xi ∂xi Since |∂SB (x)/∂xi |  |[W T sign(W x)]i |, ∂ψB /∂xi always has the same sign as

[W T sign(W x)]i . Hence we complete the proof. Proposition 4.4 validates the proposed penalty as an effective sparsity-inducing regularizer.
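To make Definitions 4.2 and 4.3 concrete, the following sketch evaluates $S_B$ and $\psi_B$ in the scalar case ($N = 1$, so $W$ reduces to the identity and $B$ to a scalar $b$), performing the inner minimization by brute force on a grid; this is a didactic illustration, not the chapter's implementation.

```python
import numpy as np

# Scalar illustration of the generalized Moreau envelope (4.44) and the
# resulting penalty psi_B (4.45), with b playing the role of B.
b = 0.8
v = np.linspace(-10, 10, 20001)          # grid for the inner variable

def S(x):
    # S_B(x) = min_v { 0.5 * b^2 * (x - v)^2 + |v| }
    return np.min(0.5 * b**2 * (x - v)**2 + np.abs(v))

for x in [0.0, 0.5, 1.0, 2.0, 5.0]:
    psi = abs(x) - S(x)                  # psi_B(x) = |x| - S_B(x)
    print(f"x = {x:4.1f}   psi_B(x) = {psi:.4f}")
# psi_B grows like |x| near the origin and saturates for large |x|,
# the minimax-concave behavior described in the text.
```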


4.5 Transient Artifact Suppression Using the Generalized Conjoint Penalty

In this section, we formulate the problem of transient artifact suppression using the non-convex generalized conjoint penalty. We provide a condition to ensure the convexity of the objective function to be minimized. We investigate and compare different designs of the parametric matrix $B$ in the proposed penalty. We also derive an algorithm for its minimization, using forward-backward splitting.

We formulate the upgraded transient artifact suppression (TAS) problem as

$$x^* = \arg\min_{x \in \mathbb{R}^{2N-1}} \Big\{ J(x) = \tfrac{1}{2} \| \hat{y} - Ax \|_2^2 + \psi_B(x) \Big\}, \tag{4.46}$$

where $x$, $\hat{y}$, $A$ are given by (4.25)–(4.27) and $\psi_B$ is the generalized conjoint penalty defined in (4.45). In this formulation, we replace the $\ell_1$ norm conjoint penalty in (4.29) with the generalized conjoint penalty.

Theorem 4.1 Let $\hat{y} \in \mathbb{R}^{N-k}$ and $A \in \mathbb{R}^{(N-k) \times (2N-1)}$. Define $J : \mathbb{R}^{2N-1} \to \mathbb{R}$ as

$$J(x) = \tfrac{1}{2} \| \hat{y} - Ax \|_2^2 + \psi_B(x), \tag{4.47}$$

where $\psi_B$ is the generalized conjoint penalty (4.45). If

$$B^T B \preceq A^T A, \tag{4.48}$$

then $J$ is a convex function.

Proof We write $J(x)$ as

$$
\begin{aligned}
J(x) &= \tfrac{1}{2} \| \hat{y} - Ax \|_2^2 + \| Wx \|_1 - \min_{v \in \mathbb{R}^{2N-1}} \Big\{ \| Wv \|_1 + \tfrac{1}{2} \| B(x - v) \|_2^2 \Big\} \\
&= \tfrac{1}{2} \| \hat{y} - Ax \|_2^2 + \| Wx \|_1 - \tfrac{1}{2} \| Bx \|_2^2 - \min_{v \in \mathbb{R}^{2N-1}} \Big\{ \| Wv \|_1 + \tfrac{1}{2} \| Bv \|_2^2 - v^T B^T B x \Big\} \\
&= \tfrac{1}{2} x^T (A^T A - B^T B) x + \tfrac{1}{2} \| \hat{y} \|_2^2 - \hat{y}^T A x + \| Wx \|_1 + \max_{v \in \mathbb{R}^{2N-1}} \Big\{ - \| Wv \|_1 - \tfrac{1}{2} \| Bv \|_2^2 + v^T B^T B x \Big\}.
\end{aligned}
$$

Consider the final expression for $J$. The first term is convex if $A^T A - B^T B$ is positive semidefinite. The expression inside the curly braces is affine in $x$, hence convex in $x$; therefore, the entire last term is convex in $x$, because the pointwise maximum of a set of convex functions (here indexed by $v$) is convex (Proposition 8.14 in [3]). The remaining terms are convex in $x$.

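As a quick numerical sanity check of condition (4.48), one can verify that the smallest eigenvalue of $A^T A - B^T B$ is non-negative. In the sketch below, a random matrix merely stands in for $A$, and $B$ is built by simple scaling so that $B^T B = \gamma A^T A$, mirroring the kind of choice discussed in the next subsection; none of this is the chapter's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 99))        # stand-in for A in (4.27)
B = np.sqrt(0.4) * A                     # B^T B = 0.4 * A^T A

M = A.T @ A - B.T @ B
lam_min = np.linalg.eigvalsh(M).min()    # smallest eigenvalue of A^T A - B^T B
print(f"min eigenvalue: {lam_min:.3e}")
print("condition (4.48) holds" if lam_min >= -1e-10 else "condition violated")
```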

4.5.1 Design of Parametric Matrix B

We now describe how to design the parametric matrix $B$ so as to preserve the convexity of the objective function (4.47). Since we can factor $A$ as

$$A = H_1 \, [D \ \ I], \tag{4.49}$$

we require that $B$ have the form

$$B = C \, [D \ \ I]. \tag{4.50}$$

Then the convexity condition is given by

$$A^T A - B^T B = \begin{bmatrix} D^T \\ I \end{bmatrix} (H_1^T H_1 - C^T C) \, [D \ \ I] \succeq 0, \tag{4.51}$$

so we only require

$$C^T C \preceq H_1^T H_1 \tag{4.52}$$

to satisfy (4.48). We thus specify $B^T B$ through the choice of $C^T C$. Since $H_1^T H_1$ is the matrix form of a filter, to satisfy (4.52) we can design $C^T C$ as a filter as well; we only require that the frequency response amplitude of $C^T C$ be strictly smaller than that of $H_1^T H_1$. A simple choice for $C^T C$ is

$$C^T C = \gamma \, H_1^T H_1, \qquad 0 \le \gamma \le 1. \tag{4.53}$$

Then (4.52) is easily satisfied, and the parameter $\gamma$ controls the non-convexity of $\psi_B$. We can also design $C^T C$ as

$$C^T C = \frac{\alpha}{4} \, \bar{D}_k^T (I + \alpha \bar{D}_k \bar{D}_k^T)^{-1} \bar{D}_k. \tag{4.54}$$

Note that $\bar{D}_k \in \mathbb{R}^{(N-k-1) \times (N-1)}$, and this $C^T C$ is not identical to $H^T H$ in (4.21), but has a similar frequency response. An illustration of the frequency responses of the two proposed choices for $C^T C$ is in Fig. 4.4, from which we can verify that the frequency response amplitudes of the two designs are strictly smaller than that of $H_1^T H_1$. We also compare the estimation results using the proposed penalty with the two choices of parametric matrix $C^T C$, along with the $\ell_1$ solution shown in Fig. 4.5. The results are in Figs. 4.6 and 4.7. Overall, the estimation results using the proposed penalty with $C^T C$ given by (4.53) are better.

Fig. 4.4 Frequency responses of two designs of C^TC. [Figure: two panels plot C^TC(ω) against H1^T H1(ω) versus Normalized Frequency (0–0.5); left: Choice 1 for C^TC with γ = 0.4; right: Choice 2 for C^TC.]

The design in (4.54) provides more amplifying power for the sparse components in $x_1$, which can be seen from the “overshooting” phenomenon on the edges of the spikes in the estimate of $x_1$ in Fig. 4.7. We do not conclude that either design is better. The aforementioned two designs are merely two examples closely related to the form of $A^T A$; in fact, there are infinitely many choices of $C^T C$ that preserve the convexity of the objective function. We continue the discussion of how $C^T C$ controls the behavior of the proposed non-convex penalty in the next subsection.

4.5.2 Optimization Algorithm

The following iterative algorithm minimizes the convex function $J$. Similar to Proposition 4.3, it is derived as an application of the forward-backward splitting (FBS) algorithm.

Proposition 4.5 Let $\hat{y} \in \mathbb{R}^{N-k}$ be defined in (4.26), $\alpha > 0$, $\mu = 2/\rho$ be given by (4.34), $\varphi$ be defined in (4.29), and $A^T A$ be given by (4.33). With $B^T B$ specified, the iteration

$$v^{(i)} = \arg\min_{v \in \mathbb{R}^{2N-1}} \Big\{ \tfrac{1}{2} \| B(x^{(i)} - v) \|_2^2 + \| Wv \|_1 \Big\} \tag{4.55a}$$

$$z^{(i)} = A^T \big( A x^{(i)} - \hat{y} \big) - B^T B \big( x^{(i)} - v^{(i)} \big) \tag{4.55b}$$

$$x^{(i+1)} = \operatorname{prox}_{\mu \varphi} \big( x^{(i)} - \mu z^{(i)} \big) \tag{4.55c}$$

converges to the minimizer of $J$ in (4.47).

Fig. 4.5 Estimation results for x1, x2, f, x1 + x2 with RMSE via ℓ1 norm based formulation. [Figure: four panels versus Time (n), each overlaying the clean signal and the ℓ1 solution; (i) x1, RMSE 0.117; (ii) x2, RMSE 0.3; (iii) f, RMSE 0.28; (iv) x1 + x2, RMSE 0.312.]

Proof Similar to the proof of Proposition 4.3, we write $J$ as the sum of two convex functions, $J(x) = f_1(x) + f_2(x)$, where

$$f_1(x) = \tfrac{1}{2} \| \hat{y} - Ax \|_2^2 - S_B(x), \qquad f_2(x) = \| Wx \|_1.$$

Fig. 4.6 Estimation results for x1, x2, f, x1 + x2 with RMSE via CNC penalty with C^TC designed by (4.53). [Figure: four panels versus Time (n), each overlaying the clean signal and the CNC solution; (i) x1, RMSE 0.076; (ii) x2, RMSE 0.117; (iii) f, RMSE 0.169; (iv) x1 + x2, RMSE 0.159.]

Both $f_1$ and $f_2$ are convex, and $\nabla f_1$ is $\rho$-Lipschitz continuous (the gradient of the quadratic term is Lipschitz continuous, and subtracting the Lipschitz-continuous gradient of $S_B$ preserves the Lipschitz continuity of $\nabla f_1$); hence the FBS algorithm converges to the minimizer of $J$. It remains to determine the gradient of $f_1$ and the proximal operator of $f_2$. The gradient of $f_1$ is given by

$$\nabla f_1(x) = A^T (Ax - \hat{y}) - \nabla S_B(x).$$

Using (4.13), we have

$$\nabla S_B(x) = B^T B \Big( x - \arg\min_{v \in \mathbb{R}^{2N-1}} \Big\{ \tfrac{1}{2} \| B(x - v) \|_2^2 + \| Wv \|_1 \Big\} \Big).$$

Fig. 4.7 Estimation results for x1, x2, f, x1 + x2 with RMSE via CNC penalty with C^TC designed by (4.54). [Figure: four panels versus Time (n), each overlaying the clean signal and the CNC solution; (i) x1, RMSE 0.118; (ii) x2, RMSE 0.145; (iii) f, RMSE 0.175; (iv) x1 + x2, RMSE 0.194.]

Hence, $\nabla f_1(x^{(i)})$ is computed in (4.55b). The proximal operator of $\varphi$ is computed in (4.55c).

The minimization in (4.55a) can be solved by FBS (i.e., ISTA) or other algorithms (e.g., FISTA, FASTA). We can write the iterations in (4.55) in explicit form:

$$\begin{bmatrix} v_1^{(i)} \\ v_2^{(i)} \end{bmatrix} = \arg\min_{v_1 \in \mathbb{R}^N,\, v_2 \in \mathbb{R}^{N-1}} \left\{ \frac{1}{2} \left\| C \, [D \ \ I] \left( \begin{bmatrix} x_1^{(i)} \\ u^{(i)} \end{bmatrix} - \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \right) \right\|_2^2 + \left\| W \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \right\|_1 \right\} \tag{4.56a}$$

$$z_1^{(i)} = H^T H (x_1^{(i)} - y) + H^T H_1 u^{(i)} - D^T C^T C \big( D (x_1^{(i)} - v_1^{(i)}) + (u^{(i)} - v_2^{(i)}) \big) \tag{4.56b}$$

$$z_2^{(i)} = H_1^T H (x_1^{(i)} - y) + H_1^T H_1 u^{(i)} - C^T C \big( D (x_1^{(i)} - v_1^{(i)}) + (u^{(i)} - v_2^{(i)}) \big) \tag{4.56c}$$

$$x_1^{(i+1)} = \operatorname{soft}\big( \operatorname{tvd}(x_1^{(i)} - \mu z_1^{(i)}, \, \mu \lambda_1), \, \mu \lambda_0 \big) \tag{4.56d}$$

$$u^{(i+1)} = \operatorname{soft}\big( u^{(i)} - \mu z_2^{(i)}, \, \mu \lambda_2 \big) \tag{4.56e}$$
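The updates (4.56d)–(4.56e) compose soft-thresholding with total variation denoising (tvd). A minimal sketch of these two building blocks follows, with tvd implemented here by projected gradient on the dual problem rather than the direct algorithm of Condat [8]; the toy signal is illustrative only.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tvd(y, lam, n_iter=300):
    """1-D total variation denoising: argmin_x 0.5||x - y||^2 + lam*||Dx||_1,
    by projected gradient on the dual (step 1/||D D^T|| <= 1/4)."""
    z = np.zeros(len(y) - 1)                                    # dual, |z_i| <= lam
    for _ in range(n_iter):
        x = y - np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))  # x = y - D^T z
        z = np.clip(z + 0.25 * np.diff(x), -lam, lam)
    return y - np.concatenate(([-z[0]], -np.diff(z), [z[-1]]))

# Composed update as in (4.56d): a TV step followed by soft-thresholding,
# promoting a sparse, piecewise-constant estimate with a zero baseline.
rng = np.random.default_rng(0)
clean = np.zeros(200); clean[60:80] = 2.0; clean[120:125] = -1.5
y = clean + 0.3 * rng.standard_normal(200)
x1 = soft(tvd(y, 2.0), 0.5)
print("RMSE:", np.sqrt(np.mean((x1 - clean) ** 2)))
```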

To gain more insight, we regroup and rewrite the fixed-point updates, similar to (4.40), (4.41):

$$g_1^* = D^T C^T C \big( D (x_1^* - v_1^*) + (u^* - v_2^*) \big) \tag{4.57a}$$

$$g_2^* = C^T C \big( D (x_1^* - v_1^*) + (u^* - v_2^*) \big) \tag{4.57b}$$

$$w_1^* = x_1^* - \mu (H^T H x_1^* + H^T H_1 u^*) + \mu g_1^* \tag{4.57c}$$

$$w_2^* = u^* - \mu (H_1^T H x_1^* + H_1^T H_1 u^*) + \mu g_2^* \tag{4.57d}$$

$$x_1^* = \operatorname{soft}\big( \operatorname{tvd}(w_1^* + \mu H^T H y, \, \mu \lambda_1), \, \mu \lambda_0 \big) \tag{4.57e}$$

$$u^* = \operatorname{soft}\big( w_2^* + \mu H_1^T H y, \, \mu \lambda_2 \big) \tag{4.57f}$$

Note that, when all the parameters are set the same, the only difference between (4.57) and the fixed point of the $\ell_1$ norm problem formulation given by (4.40), (4.41) is the added terms $g_1^*$, $g_2^*$. The addition of $g_1^*$, $g_2^*$ gives the proposed penalty its power to amplify signal values, and how $g_1^*$, $g_2^*$ behave is dictated by the design of $C^T C$. Intuitively, $C^T C$ has to be chosen so that $g_1^*$, $g_2^*$ reflect and accentuate the characteristics of the original signal; since $C^T C$ is designed as a filter, it should be chosen according to the frequency components of the original clean signal. Figures 4.8 and 4.9 illustrate how the two designs of $C^T C$ in (4.53) and (4.54) affect the estimation of $x_1$ and $u$ through the intermediate variables given in (4.57). In particular, as shown in Fig. 4.8, the add-on term $g_1^*$ behaves quite differently between the two designs: while the proposed non-convex penalty with the design in (4.53) generally shows more amplifying power, the design in (4.54) performs better on the negative edge around samples 80–100. From Fig. 4.9, the design in (4.53) gives a better estimate of $u$. We maintain that, in order to achieve the best performance, one should choose $C^T C$ according to the frequency characteristics of the specific signal. For a simple implementation of the algorithm, we use the choice (4.53) in the numerical examples in the following section.

Fig. 4.8 Comparison of estimation results for x1* via CNC penalty with C^TC designed by (4.53) and (4.54), along with intermediate variables g1*, w1*, w1* + μH^THy. [Figure: panels (i)–(iv) use choice 1 for C^TC and (v)–(vii) use choice 2, plotting g1*, w1*, w1* + μH^THy, and x1* (RMSE 0.094 for choice 1, 0.064 for choice 2) against the clean signal versus Time (n).]

4.6 Numerical Examples

We show results on two numerical examples to demonstrate the performance of the proposed method: one on synthetic data and one on near-infrared spectroscopic (NIRS) data.

4.6.1 Example 1

We apply the proposed method to the synthetic data illustrated in Fig. 4.1. The noisy signal consists of two low-frequency sinusoids, several piecewise constant transients, and additive white Gaussian noise (standard deviation $\sigma = 0.3$). We show the solutions using both the $\ell_1$ norm conjoint penalty and the proposed non-convex generalized conjoint penalty (both of which are formulated as convex optimization problems). We set $k = 4$ in (4.21), (4.23), (4.35), (4.36), and set $\alpha$ using (4.17) with $f_c = 0.03$. We set $\lambda_0 = \beta_0 \|H^T H\|_2 \, \sigma$, $\lambda_1 = \beta_1 \|H^T H\|_2 \sqrt{N} \sigma$, and $\lambda_2 = \beta_2 \|H_1^T H\|_2 \, \sigma$, with $\beta_0 = 0.16$, $\beta_1 = 0.06$, and $\beta_2 = 0.084$, according to the strategy we introduced

Fig. 4.9 Comparison of estimation results for u* via CNC penalty with C^TC designed by (4.53) and (4.54), along with intermediate variables g2*, w2*, w2* + μH1^THy. [Figure: panels for choice 1 and choice 2 of C^TC plot g2*, w2*, w2* + μH1^THy, and u* (RMSE 0.044 for choice 1, 0.064 for choice 2) against the clean signal versus Time (n).]

in Sect. 4.3.3. Additionally, for the generalized conjoint penalty, we set $\gamma$, the index of non-convexity, to $\gamma = 0.4$. The proposed method is implemented using algorithm (4.55); the explicit formulas are in (4.56). Figures 4.5 and 4.6 show the transient artifacts estimated by transient artifact suppression with the $\ell_1$ norm penalty and with the generalized conjoint penalty, compared with the ground truth. We tuned the parameters for the $\ell_1$ norm problem with respect to RMSE, in order to have a fair comparison. As shown in Fig. 4.5(ii), the $\ell_1$ norm regularizer systematically underestimates the true signal values of the $x_2$ component, and the $x_1$ component does not always adhere to the zero baseline. The non-convex generalized conjoint penalty estimates the signal values more accurately (Fig. 4.6). The improvement attained by the generalized penalty is also reflected in the lower RMSE values.
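For reference, the parameter settings above translate to code as follows; the random matrices below merely stand in for $H$ and $H_1$ (whose actual construction is given in (4.21), (4.23)) so that the snippet is executable.

```python
import numpy as np

N, k, sigma = 200, 4, 0.3
rng = np.random.default_rng(0)
H = rng.standard_normal((N - k, N))        # placeholder for H in (4.21)
H1 = rng.standard_normal((N - k, N - 1))   # placeholder for H1 in (4.23)

beta0, beta1, beta2 = 0.16, 0.06, 0.084
lam0 = beta0 * np.linalg.norm(H.T @ H, 2) * sigma                 # lambda_0
lam1 = beta1 * np.linalg.norm(H.T @ H, 2) * np.sqrt(N) * sigma    # lambda_1
lam2 = beta2 * np.linalg.norm(H1.T @ H, 2) * sigma                # lambda_2
print(lam0, lam1, lam2)
```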

4.6.2 Example 2

In this example, we apply the proposed method to two pieces of real near-infrared spectroscopic (NIRS) time series data, and compare it to the wavelet thresholding and spline interpolation artifact suppression methods. The raw data shown in Figs. 4.10

Fig. 4.10 Transient artifacts suppression on NIRS data (measurement 38) with proposed method, estimation results for type 1, type 2, total artifact and corrected data. [Figure: five stacked panels versus Time (n): raw data, type 1 artifact, type 2 artifact, total artifact, and corrected data overlaid on the raw data.]

Fig. 4.11 Detail comparison of the transient artifact suppression with wavelet thresholding, spline interpolation and proposed method on NIRS data (measurement 38) on segment sample 100–700 and 900–1500. [Figure: panels (i)–(vii) cover samples 100–700 and (viii)–(xiv) cover samples 900–1500, showing the raw data and, for each method (wavelet thresholding, spline interpolation, proposed), the estimated artifacts and corrected data.]

and 4.12 were acquired using a pair of optodes (one source and one detector) placed on the back of the subject's head and around the subject's left ear, respectively. These recordings are susceptible to motion artifacts, manifested as abrupt shifts of the baseline (Fig. 4.10) and as more frequent, subtle muscle movements (Fig. 4.12). The artifacts are of variable amplitude, width, and shape. The time series are of length 1900. Figures 4.10 and 4.12 show the type 1, type 2, and total artifacts extracted from the raw data using the proposed method, along with the corrected data obtained by subtracting the estimated artifacts from the raw data. We set the $\lambda$ parameters as in Example 1, and we set $\gamma = 1$ in this example. For comparison of the different artifact suppression methods, we show two segments of 600 samples from each piece to better present the details in Figs. 4.11 and 4.13. For the

Fig. 4.12 Transient artifacts suppression on NIRS data (measurement 18) with proposed method, estimation results for type 1, type 2, total artifact and corrected data. [Figure: five stacked panels versus Time (n): raw data, type 1 artifact, type 2 artifact, total artifact, and corrected data overlaid on the raw data.]

Fig. 4.13 Detail comparison of the transient artifact suppression with wavelet thresholding, spline interpolation and proposed method on NIRS data (measurement 18) on segment sample 100–700 and 900–1500. [Figure: panels (i)–(vii) cover samples 100–700 and (viii)–(xiv) cover samples 900–1500, showing the raw data and, for each method, the estimated artifacts and corrected data.]

wavelet thresholding method [18], we use the undecimated Haar wavelet transform and the non-negative garrote threshold function. The artifacts are estimated by reconstructing all of the thresholded subbands (with the low-pass subband set to zero). For the spline interpolation artifact suppression method [23], we first identify the segments containing the transient artifacts using the moving standard deviation: because the transient artifacts surge over short time periods, segments containing artifacts have a large standard deviation. Then, each identified segment is modeled using cubic spline interpolation, and the strategy presented in Ref. [23] is used to smooth and reconstruct the complete signal. The estimation results for the proposed method are the same as those shown in Figs. 4.10 and 4.12.

For the NIRS data in Fig. 4.11, the corrected data are illustrated in (iii), (v), (vii), (x), (xii), (xiv). The result using the proposed method fixes the abrupt baseline shift ((vii), samples 450–500), eliminates the brief-wave artifacts, and preserves the general structure of the original data. In contrast, the wavelet thresholding method is unable to compensate for the baseline shift ((iii), samples 450–500) and changes the baseline in the neighborhood of artifacts. The spline interpolation method introduces additional artifacts in the corrected data due to imperfect artifact estimation ((xii), samples 900–1100). For the NIRS data in Fig. 4.13, the proposed method captures most of the subtle artifacts, while the spline interpolation method only captures the most prominent ones. The artifacts in samples 450–500 and samples 950–1050 are estimated with distinct pre- and post-artifact baseline values by the proposed method ((vi), (xiii)), while the wavelet-estimated artifact signal exhibits only a small change there, due to the slowly varying low-pass component ((ii), (ix)), causing baseline changes in the corrected data. The proposed method better estimates the transient artifacts, and therefore gives less distorted corrected data, compared to the wavelet thresholding and spline interpolation methods.
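For reference, a minimal sketch of the wavelet-thresholding baseline described above is given below. It uses PyWavelets with the decimated Haar transform for brevity (the chapter's comparison uses the undecimated transform), the non-negative garrote threshold, and the common universal threshold estimated from the finest-scale coefficients; these specifics are illustrative assumptions, not the exact configuration of [18].

```python
import numpy as np
import pywt  # assumes the PyWavelets package is installed

def wavelet_artifact_estimate(y, wavelet="haar", level=5):
    """Estimate transient artifacts: zero the low-pass subband,
    garrote-threshold the detail subbands, then reconstruct."""
    coeffs = pywt.wavedec(y, wavelet, level=level)
    # Universal threshold with a MAD noise estimate from the finest details.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    t = sigma * np.sqrt(2.0 * np.log(len(y)))
    coeffs[0] = np.zeros_like(coeffs[0])        # low-pass subband set to zero
    coeffs[1:] = [pywt.threshold(c, t, mode="garrote") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(y)]

rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 0.01 * np.arange(1900)) + 0.1 * rng.standard_normal(1900)
y[400:430] += 3.0                               # synthetic transient artifact
corrected = y - wavelet_artifact_estimate(y)
```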

4.7 Conclusion and Future Work

For the purpose of suppressing transient artifacts in biomedical time series data, we propose a sparsity-assisted optimization problem for the estimation of signals comprising a low-pass signal, a sparse piecewise constant signal, a piecewise constant signal, and additive white Gaussian noise. To better estimate the artifacts, and in turn benefit the suppression, we propose a non-convex generalized conjoint penalty that can be designed to preserve the convexity of the total cost function to be minimized, thereby realizing the benefits of a convex optimization framework (reliable, robust algorithms, etc.). The benefit of the proposed non-convex penalty, relative to the classical $\ell_1$ norm penalty, is that it overcomes the tendency of the $\ell_1$ norm to underestimate the true amplitude of signal values. We apply the proposed method to the suppression of artifacts in near-infrared spectroscopy (NIRS).

Tuning the parameters of the proposed method in the presence of noise with large variance is difficult. Although the two types of artifacts are distinguishable in real biomedical data, manual tuning of the regularization parameters is nevertheless needed. This is not satisfactory for real applications where there are large amounts of data to be processed and analyzed; a reliable, automatic way of setting the parameters is required. An adaptive strategy for setting the regularization parameters was proposed in [6]. In order to bring the proposed method to application, investigation and adaptation of such a strategy is of interest.

Acknowledgements The authors thank Randall Barbour for important discussions. This work was supported by NSF (grant CCF-1525398).


References

1. Akhtar, M. T., Mitsuhashi, W., & James, C. J. (2012). Employing spatially constrained ICA and wavelet denoising, for automatic removal of artifacts from multichannel EEG data. Signal Processing, 92(2), 401–416.
2. Ayaz, H., Izzetoglu, M., Shewokis, P. A., & Onaral, B. (2010). Sliding-window motion artifact rejection for functional near-infrared spectroscopy. In 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology (pp. 6567–6570).
3. Bauschke, H. H., & Combettes, P. L. (2011). Convex analysis and monotone operator theory in Hilbert spaces (Vol. 408). New York: Springer.
4. Bayram, I., Chen, P.-Y., & Selesnick, I. (2014, May). Fused lasso with a non-convex sparsity inducing penalty. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
5. Blake, A., & Zisserman, A. (1987). Visual reconstruction. Cambridge: MIT Press.
6. Bobin, J., Starck, J.-L., Fadili, J. M., Moudden, Y., & Donoho, D. L. (2007). Morphological component analysis: An adaptive thresholding strategy. IEEE Transactions on Image Processing, 16(11), 2675–2681.
7. Calkins, M. E., Katsanis, J., Hammer, M. A., & Iacono, W. G. (2001). The misclassification of blinks as saccades: Implications for investigations of eye movement dysfunction in schizophrenia. Psychophysiology, 38(5), 761–767.
8. Condat, L. (2013). A direct algorithm for 1-D total variation denoising. IEEE Signal Processing Letters, 20(11), 1054–1057.
9. Cooper, R., Selb, J., Gagnon, L., Phillip, D., Schytz, H. W., Iversen, H. K., et al. (2012). A systematic comparison of motion artifact correction techniques for functional near-infrared spectroscopy. Frontiers in Neuroscience, 6, 147.
10. Damon, C., Liutkus, A., Gramfort, A., & Essid, S. (2013). Non-negative matrix factorization for single-channel EEG artifact rejection. In IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1177–1181).
11. Feng, Y., Graber, H., & Selesnick, I. (2018). The suppression of transient artifacts in time series via convex analysis. In 2018 IEEE Signal Processing in Medicine and Biology Symposium (SPMB) (pp. 1–6).
12. Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
13. Graber, H. L., Xu, Y., & Barbour, R. L. (2011). Optomechanical imaging system for breast cancer detection. Journal of the Optical Society of America A, 28(12), 2473–2493.
14. Islam, M. K., Rastegarnia, A., Nguyen, A. T., & Yang, Z. (2014). Artifact characterization and removal for in vivo neural recording. Journal of Neuroscience Methods, 226, 110–123.
15. Jahani, S., Setarehdan, S. K., Boas, D. A., & Yücel, M. A. (2018). Motion artifact detection and correction in functional near-infrared spectroscopy: A new hybrid method based on spline interpolation method and Savitzky–Golay filtering. Neurophotonics, 5(1), 015003.
16. Lanza, A., Morigi, S., Selesnick, I., & Sgallari, F. (2017). Sparsity-inducing non-convex non-separable regularization for convex image processing. Preprint.
17. Metz, A. J., Wolf, M., Achermann, P., & Scholkmann, F. (2015). A new approach for automatic removal of movement artifacts in near-infrared spectroscopy time series by means of acceleration data. Algorithms, 8(4), 1052–1075.
18. Molavi, B., & Dumont, G. A. (2012). Wavelet-based motion artifact removal for functional near-infrared spectroscopy. Physiological Measurement, 33(2), 259–270.
19. Molla, M. K. I., Islam, M. R., Tanaka, T., & Rutkowski, T. M. (2012). Artifact suppression from EEG signals using data adaptive time domain filtering. Neurocomputing, 97, 297–308.
20. Nikolova, M. (2011). Energy minimization methods. In Handbook of mathematical methods in imaging (pp. 139–185). New York: Springer.
21. Parekh, A., & Selesnick, I. W. (2015). Convex fused lasso denoising with non-convex regularization and its use for pulse detection. In IEEE Signal Processing in Medicine and Biology Symposium (SPMB) (pp. 1–6).
22. Rockafellar, R. T. (1972). Convex analysis. Princeton: Princeton University Press.
23. Scholkmann, F., Spichtig, S., Muehlemann, T., & Wolf, M. (2010). How to detect and reduce movement artifacts in near-infrared imaging using moving standard deviation and spline interpolation. Physiological Measurement, 31(5), 649–662.
24. Selesnick, I. (2017). Sparse regularization via convex analysis. IEEE Transactions on Signal Processing, 65(17), 4481–4494.
25. Selesnick, I. W., Graber, H. L., Ding, Y., Zhang, T., & Barbour, R. L. (2014). Transient artifact reduction algorithm (TARA) based on sparse optimization. IEEE Transactions on Signal Processing, 62(24), 6596–6611.
26. Selesnick, I. W., Graber, H. L., Pfeil, D. S., & Barbour, R. L. (2014). Simultaneous low-pass filtering and total variation denoising. IEEE Transactions on Signal Processing, 62(5), 1109–1124.
27. Starck, J.-L., Donoho, D., & Elad, M. (2004). Redundant multiscale transforms and their application for morphological component separation. Advances in Imaging and Electron Physics, 132, 287–348.
28. Thomas, S. (1996). The operation of infimal convolution. Unpublished doctoral dissertation, University of Lund.
29. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.

Chapter 5

The Hurst Exponent: A Novel Approach for Assessing Focus During Trauma Resuscitation

Ikechukwu P. Ohu, Jestin N. Carlson, and Davide Piovesan

5.1 Introduction

Medical errors occur in over 60% of patients in the outpatient setting and in nearly one-third of all patients admitted to healthcare facilities [1, 2]. Many of these errors, and their resultant adverse effects, may be related to poor communication and poor teamwork. Post-graduate team-based training has been shown to improve outcomes; however, there are few pre-graduate team-based training initiatives [3, 4]. Successful critical procedures, including the resuscitation of trauma patients, are the result of effective teamwork incorporating three key components: (1) efficient movements, (2) effective communication, and (3) focused assessment. Previous works have shown that highly performing individuals complete critical maneuvers rapidly and with limited variability [5, 6]. Current assessment of resuscitation team performance is often based on evaluations using checklists that evaluate verbal communication. However, highly efficient teams may function with several non-verbal cues that may not be measured by current assessment methods. To perceive these non-verbal cues, individuals are required to divert their visual attention from the patient point of focus to their peers.

I. P. Ohu · D. Piovesan (✉) Biomedical Industrial and Systems Engineering Department, Gannon University, Erie, PA, USA Patient Simulation Center, Morosky College of Health Professions and Sciences, Gannon University, Erie, PA, USA e-mail: [email protected]; [email protected] J. N. Carlson Patient Simulation Center, Morosky College of Health Professions and Sciences, Gannon University, Erie, PA, USA Department of Emergency Medicine, Saint Vincent Health System, Erie, PA, USA e-mail: [email protected] © Springer Nature Switzerland AG 2020 I. Obeid et al. (eds.), Signal Processing in Medicine and Biology, https://doi.org/10.1007/978-3-030-36844-9_5


Despite the advances in our understanding of how movement patterns influence communication, there are limited data on how providers focus on various aspects of the scene. Although a moving visual cue can be tracked only with eye gaze, a strong correlation exists between head movements and eye saccades [7]. While other fields have assessed non-verbal cues by tracking head movements [8] and have used fractal statistics to assess motor skills [9], the use of these techniques in trauma teams is limited. Fractal statistics comprises non-traditional mathematical methods allowing the interpretation of patterns whose understanding is difficult if analyzed with traditional Euclidean concepts. Essentially, it provides a measurement of the complexity of a pattern. Simply put, the more a signal increases in complexity, the more it approximates random noise; at that point the signal is, by definition, not predictable. Thus, two individuals interacting to accomplish a common task, who rely on non-verbal cues to establish each other's intention, must identify a cue signal (e.g., a movement or posture) whose meaning is not too difficult to decipher. The simplest signal (i.e., the most predictable) is periodic; once its amplitude, phase, and period are determined, no additional cues are necessary.

Non-verbal cues have been studied extensively in the co-operation between two agents [10–13], specifically when haptic sensations are used as cues. This has great implications for tasks like dancing, or any activity wherein the co-operation between two agents requires some form of physical contact. When only two agents collaborate, they are often referred to as a dyad. In a dyad there is often the establishment of a simple hierarchy, where one agent becomes the leader and the other becomes the follower. These paradigms are predetermined for certain interaction tasks like ballroom dancing where, traditionally, the male is the leader and the female is the follower. For tasks that require multiple agents, visual cues are often preferred to haptic cues. In this case, there can be an authoritative relationship. A typical example can be seen in an orchestra, where the conductor leads, and all the players follow the cues provided by the conductor. In the case where no clear hierarchical structure is present, it is useful to provide an analogy by thinking about the behavior of a flock of birds, such as the common starling, Sturnus vulgaris. From the perspective of an external observer, the flock changes shape in many disparate geometrical patterns, yet collisions between the birds are rare, even though there appears to be no well-defined leader of the flock. This is possible because each bird follows three simple rules: maintain the same velocity, direction of flight, and distance from all the neighboring birds. We can see that the rules that each bird follows are remarkably simple; yet, the interaction between all the agents gives rise to a complex behavior of the entire flock.

A hierarchical behavior that lies in between the orchestration of a unique conductor and the reliance on the nearest neighbors can be found, for example, in a flock of Mallards (Anas platyrhynchos). Firstly, a flock of Mallards usually has a leader, which is the bird positioned at the cusp of the well-known V-shaped formation of the flock. The flock flight pattern is very predictable (i.e., exhibits low complexity), to the point that the followers are able to shut down one of their brain


hemispheres, commanding the vision of the eye pointing at the internal part of the V-formation [14], and just follow the bird in front of them. This most likely reflects the fact that Mallards are seldom subject to predatory attacks and therefore do not need to react very often to unexpected occurrences by breaking formation. On the other hand, starlings' behavior within a flock is more demanding. The birds in the internal part of the flock are required to be alert to avoid collisions, but this position allows them both to be protected by their companions and to enable rapid changes of direction and separation of the flock in case of attack from a predator.

In a trauma scenario, there is traditionally a leader (physician or physician assistant), but reliance on the nearest neighbor is also essential. The well-defined chain of command is predictable, making communication more reliable. On the other hand, complications can occur, and therefore the behavior of the trauma team is necessarily complex (i.e., less predictable). Even though more cognitively demanding, a complex interaction allows for fast action and adaptability to unexpected occurrences. We posit that, as each bird follows simple rules to be a successful participant within the flock, so does each caregiver. Specifically, the attention of the caregiver is divided between the patient and the colleagues with whom they are interacting. If visual cues are followed, the caregiver must look in the direction of the cue. While eye saccades are primarily used to shift gaze between cues, an associated head movement accompanies them [15], as it is required to maintain body balance [16]. The position of the patient with respect to the caregivers, and of the caregivers with respect to each other, allows for a convenient separation of two head movements that let us monitor how each caregiver divides their focus. Based on the previous considerations, we would expect that the movements of the head do not happen randomly (and therefore with maximal complexity), but at the same time they should not be periodic and therefore completely predictable. Furthermore, we would expect to find that, as caregivers gain experience and improve their performance, the complexity of their movement decreases, as they become increasingly able to predict the intentions of their colleagues.

Our positing a decrease of complexity comes from an extensive body of literature in both neuroscience and ergonomics. For example, studies that evaluated expert physical therapists trying to assess a physical impairment by manipulating a simulated patient found that the complexity of the executed motion, and the force experienced during physical manipulation, tended to be higher during incorrectly identified trials than in those correctly assessed [17]. This suggests that movements that are too complex are difficult to interpret. Furthermore, it was observed that the improvement in surgical motions as subjects gain experience is associated with a decrease in the complexity of the movements [9, 18], indicating that practice reduces unnecessary movements and increases the predictability of the actions.

Many tools exist to quantify the complexity of a signal. A common mathematical tool is the Approximate Entropy (ApEn), which is used to characterize the modality in which fluctuations occur within a signal. If the fluctuations follow a repetitive pattern within a time series, the signal is predictable.
ApEn quantifies the likelihood that patterns of observations that resemble each other are not followed by further similar observations [19]. If a signal is repeatable, its ApEn is small, ideally 0 for a perfectly periodic signal; signals that are less and less predictable have a higher ApEn, with ApEn = 2 for pure randomness. Sample entropy (SampEn) is a modification of approximate entropy (ApEn) [20]. SampEn has two advantages over ApEn: data-length independence and simple algorithmic implementation. ApEn and SampEn are widely used for temporal complexity analysis of real-world phenomena; however, their relationship with the Hurst exponent as a measure of self-similarity is not widely studied [21]. We decided to use the Hurst exponent as a measure of complexity because ApEn and SampEn are susceptible to signal amplitude changes. A common practice for addressing this issue is to normalize the input signal amplitude by its standard deviation, which can be a problem here, since the role of some healthcare providers might make them move their head with larger amplitude compared to others. Nevertheless, ApEn and SampEn are related to the Hurst exponent through their tolerance r and embedding dimension m parameters (a minimal SampEn sketch, for illustration, is given at the end of this section).

We sought to perform a preliminary, proof-of-concept study to assess the ability to perform head tracking during a simulated trauma scenario. The major contributions of this chapter are: (1) the quantitative determination of the change in focus/attention during a team-based healthcare intervention activity between the patient/task and the team members, (2) the observation that verbal instructions on team-based processes/activities have no direct influence on changing the point of focus of the providers, and (3) the finding that the H exponent is profession-independent, hence it can be applied to the pre- and post-training scenarios irrespective of the training received by the healthcare providers.
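The promised illustration follows: a minimal sample entropy sketch, where m and r play the embedding-dimension and tolerance roles mentioned above. This is a didactic simplification, not the study's code.

```python
import numpy as np

def sampen(s, m=2, r=0.2):
    """Sample entropy: -ln(A/B), where B counts template matches of
    length m and A those of length m+1 (Chebyshev distance <= tol)."""
    s = np.asarray(s, dtype=float)
    tol = r * s.std()                     # tolerance as a fraction of the SD

    def matches(mm):
        t = np.array([s[i:i + mm] for i in range(len(s) - mm)])
        n = 0
        for i in range(len(t) - 1):
            d = np.max(np.abs(t[i + 1:] - t[i]), axis=1)
            n += int(np.sum(d <= tol))
        return n

    b, a = matches(m), matches(m + 1)
    return -np.log(a / b) if a > 0 and b > 0 else np.inf

rng = np.random.default_rng(0)
print(sampen(np.sin(np.linspace(0, 60, 1200))))   # periodic: small SampEn
print(sampen(rng.standard_normal(1200)))          # noise: larger SampEn
```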

5.2 Method

5.2.1 Hurst Exponent (H)

In complexity theory, the trajectories of a system in state space may be constrained within an identifiable region and characterized by a set of numerical values (an attractor) [22]. Different types of dynamics lead to different attractor geometries, including points, cycles, and fractal structures (chaotic systems) [23]. In general, Hurst exponents can be used to estimate system attractors from time series data [24]. These exponents quantify both the strength of attraction exerted by the attractor on nearby points and the degree to which neighboring points within the attractor diverge from one another, and thus provide useful characterizations of the system [25]. Also borrowing from complexity theory, trajectories of different states of a system in an n-dimensional phase space can be reconstructed and constrained to a region within that phase space [26]. The representations of the differing states


(attractors) have varying geometries on account of dynamical influences, including points, cycles, and fractals, which are primary components of chaotic systems. However, these attractors may exhibit statistical dependencies that are either short range (exponentially decaying) or long range (auto-covariance functions decaying at a relatively slower rate) [27, 28]. In time series data, the rates of decay are deduced through the examination of the variance of partial sums of consecutive values. A popular measure of long-range dependence in a time series is the Hurst exponent, H [29]. These exponents quantify both the strength of attraction exerted by the attractor on nearby points, providing estimates of the degree of predictability of noisy data, and the degree to which neighboring points within an attractor diverge from one another, giving useful characterizations of a system.

Hurst exponents can be used to estimate system attractors from time series data, taking on values between 0 and 1. An H value of less than 0.5 is indicative of strong negative correlation between the variances of partial sums of successive data in a time series, which is reflective of anti-persistency and highly erratic motions, characteristically observed from novice MIS surgeons while performing fundamentals of laparoscopic surgery (FLS) tasks [9]. An H value of 0.5 represents a perfectly random motion, while a value closer to 1 is indicative of increased persistence and long-range dependence, reflective of increased motion coordination, confidence, and expertise.

The computational power of the Hurst exponent, and its ability to distinguish time series data that exhibit autocorrelation from data that show anti-persistence, lies in the existence of self-similarities of fractal structures in nature. Likewise in statistics, especially for data collected from human movement, self-similarities exist in the movement artifacts and are ordinarily obscured from observation by the apparent chaos of the overlying attractors, representative of perturbations brought about by overall joint interactions during movement, which may or may not be relevant to the goal for which the motion was initially enacted. The Hurst exponent can essentially "filter" the noisy artifacts and create a trimodal (anti-persistent, random, and persistent) characterization of the movement patterns.

Starting from the series $s(t) = [s_1, s_2, \ldots, s_n]$, the attractor of the underlying dynamics is reconstructed in a phase space by applying the time-delay vector method. The reconstructed trajectory $X$ can be expressed as a matrix where each row is a phase space vector:

$$X = [x_1, x_2, \ldots, x_m]^T, \tag{5.1}$$

where $x_i = [s_i, s_{i+\tau}, \ldots, s_{i+(D_E - 1)\tau}]$, $m = n - (D_E - 1)\tau$, $D_E$ is the embedding dimension, and $\tau$ is the delay time. The Hurst algorithm for obtaining this exponent from a time series is as follows:

$$H(\tau) = \log\!\left( \frac{R(\tau)}{S(\tau)\, c} \right) \Big/ \log(\tau), \tag{5.2}$$

where

$$S(\tau) = \sqrt{ \frac{1}{\tau} \sum_{i=1}^{\tau} \big( x_i - \langle x \rangle_\tau \big)^2 }, \tag{5.3}$$

$$R(\tau) = \max_{1 \le t \le \tau} X(t, \tau) - \min_{1 \le t \le \tau} X(t, \tau). \tag{5.4}$$

The constant $c$ is typically set to 1.0. A value of H in the range $0.5 < H < 1$ indicates a time series with long-term positive autocorrelation, whereas a value in the range $0 < H < 0.5$ indicates a time series with long-term switching between high and low values in adjacent pairs. A value of $H = 0.5$ can indicate a completely uncorrelated series.
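A minimal rescaled-range sketch of (5.2)–(5.4) follows; it estimates H as the slope of log(R/S) versus log(τ) over non-overlapping windows, and is a didactic simplification of the corrected estimator actually used in this study (see Sect. 5.2.4).

```python
import numpy as np

def hurst_rs(s):
    """Estimate H as the slope of log(R/S) vs log(tau), cf. (5.2)-(5.4)."""
    s = np.asarray(s, dtype=float)
    n = len(s)
    taus = np.unique(np.logspace(1, np.log10(n // 2), 20).astype(int))
    log_tau, log_rs = [], []
    for tau in taus:
        rs = []
        for start in range(0, n - tau + 1, tau):     # non-overlapping windows
            w = s[start:start + tau]
            x = np.cumsum(w - w.mean())              # mean-adjusted profile
            r = x.max() - x.min()                    # range R(tau), cf. (5.4)
            sd = w.std()                             # S(tau), cf. (5.3)
            if sd > 0:
                rs.append(r / sd)
        if rs:
            log_tau.append(np.log(tau))
            log_rs.append(np.log(np.mean(rs)))
    slope, _ = np.polyfit(log_tau, log_rs, 1)        # H = slope, cf. (5.2)
    return slope

rng = np.random.default_rng(1)
print(hurst_rs(rng.standard_normal(8000)))           # white noise: H near 0.5
```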

5.2.2 Experimental Protocol

Under an institutional review board (IRB) approved protocol, we enrolled a convenience sample of two simulated trauma teams utilizing undergraduate health professional students from four disciplines available at our institution: 2nd year Radiologic Science (RS), 4th year Physician Assistant (PA), 2nd year Respiratory Care (RT), and 4th year Registered Nurse (RN) students. Each team consisted of one member from each discipline, randomly assigned, and the PA students were the team leads for the resuscitations. All participants completed a customized 30-min online trauma resuscitation course, designed by the investigators, prior to the simulation. None had other formal trauma training (e.g., Advanced Trauma Life Support); however, all were certified in healthcare provider cardiopulmonary resuscitation. On the day of data collection, each team completed a simulated trauma resuscitation followed by a customized 30-min trauma teamwork education module designed by the authors and described previously [30, 31]. This module incorporated the tenets defined by the Agency for Healthcare Research and Quality (AHRQ) TeamSTEPPS material [3, 19], which is aimed at optimizing patient outcomes by improving communication and teamwork skills. The teams then completed a second simulated trauma/resuscitation exercise, identical to the first scenario. The details of the scenario have been published previously [32]. All data were collected on a single day at Gannon University's Patient Simulation Center, and all simulations were video recorded for offline review and analysis.

5.2.3 Measure of Head Movements

Head motion and orientation of the participants were measured using Xsens® MTw motion trackers (Xsens North America Inc.), affixed with color-coded headbands at the position of the external occipital protuberance. The reference frame of the motion sensor is defined as a right-handed orthogonal coordinate system. The origin is positioned at the head's center of rotation, which is assumed to coincide with the atlas. With the head parallel to the sagittal plane, the y-axis of the reference frame is negative in the rostral direction, while the positive z-axis points upwards and the positive x-axis points laterally toward the left ear (Fig. 5.1). Rotations of the head about the x-, y-, and z-axes are defined as roll, pitch, and yaw, respectively.

Given that in the simulated trauma scenario each participant in a team stood in a circle around the mannequin representing the patient (Figs. 5.1 and 5.2), a movement of the head up and down (rotation around the x-axis of the sensor, and thus a change in roll angle) is assumed to represent a transition of focus from the patient to other team participants, and vice versa. Rotations of the head to look left and right (rotation around the z-axis of the sensor, and thus a change in yaw angle) are indicative of a change in focus toward the team members on either side of a participant. To help coordinate movements during trauma resuscitations, team members are often assigned starting positions relative to the patient. We positioned the RT close to the head of the patient, since this provider needs to perform the intubation; the RN needs access to the thorax for use of the defibrillator; and the RS intervenes only if needed, and therefore generally stands toward the legs of the patient [33].

Head movement during gaze can have several levels of complexity. If the subject's gaze is fixated on a point they are paying visual attention to, the complexity of the head movement is low. Movement of the target, and a change of the target's trajectory from predictable (based on an obvious, recognizable pattern) to unpredictable, increase the complexity of the subject's head movement. Random movements are observed during visual scanning for a cue, when the subject is not paying attention to a single event.

Fig. 5.1 Spatial positioning and head orientation of providers during the simulated trauma/resuscitation exercise. (a) zeros of sensors; (b) focusing on the patient, with a rotation of the sensor around the x-axis (roll); (c) focusing on the colleagues, with a rotation of the sensor around the z-axis (yaw)


Fig. 5.2 Spatial positioning of providers during the simulated trauma/resuscitation exercise. RT respiratory therapy student, RN registered nursing student, PA physician assistant student, RS radiologic science student

5.2.4 Application of Hurst Exponent to Head Movements

The decrease in head movement complexity, and therefore the increase of attention on a well-defined task, can be quantified using the Hurst exponent (H). This study evaluated the H values of the Euler coordinates of head motions as an indicator of focus and attention to the patient (roll) and to the teammates (yaw) during a simulated trauma resuscitation using a unique team of interdisciplinary trainees. The H algorithm is a statistical measure used for analyzing nonlinear time series data. It provides an estimate of the persistence of data over time. Persistent data denote information that does not change frequently (fixating a point). Anti-persistent data indicate that, if a value in the time series had been up in the previous period, it is more likely to be down in the next period, and vice versa. With values lying between 0 and 1, an H value between 0 and 0.5 indicates anti-persistence: the activity from which the time series data were collected goes through a switching sequence between high and low values. An H value of 0.5 is indicative of a purely random motion, while an H value lying between 0.5 and 1 indicates persistent data.

The H estimate was applied in the study to determine whether there is a change in the direction of focus of the participants during trauma resuscitation scenarios after the online training and TeamSTEPPS. Repetitive changes of the head's Euler angles, irrespective of the complexity of the motions, are made evident by H values, which were computed from motion data collected during the trauma resuscitation scenarios. The time frame for collecting data on team interactions was limited to the portion of the experiment during which communication/interaction between team members was expected, which was below 5 min. At a sampling frequency of 40 Hz, a 4-min interaction generates 9600 data points, which is statistically more than sufficient for Hurst estimation, as appropriate classifications can be performed on significantly less data than that [34]. The computation of H estimates was done using a Microsoft Excel add-on (NumXL 1.59 by Spider Financial) applying the corrected Hurst exponent algorithm [35, 36]. We used the paired, two-sided t-test to compare pre and post values by provider discipline, and considered p < 0.05 to be significant.
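For reference, that comparison can be reproduced with a few lines of SciPy; the arrays below are placeholders, not the study's measurements.

```python
import numpy as np
from scipy import stats

# Paired, two-sided t-test comparing pre- vs post-TeamSTEPPS Hurst
# estimates for one discipline (placeholder values, not study data).
h_pre = np.array([0.91, 0.88, 0.90, 0.89])
h_post = np.array([0.93, 0.85, 0.90, 0.88])
t_stat, p_value = stats.ttest_rel(h_pre, h_post)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # significant if p < 0.05
```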

5.3 Results

The results of the H estimates in the pre- and post-TeamSTEPPS training scenarios are summarized in Tables 5.1 and 5.2. There was no statistically significant difference between the pre- and post-TeamSTEPPS H values for either roll or yaw when comparing data across all of the healthcare disciplines involved in the study. Both the pre- and post-TeamSTEPPS scenarios show high persistence (subjects mostly focusing on a specific scene), with H values ranging between 0.8 and 1 (Figs. 5.3 and 5.4). An example of raw data is presented in Fig. 5.5.

The limitations of this work lie in the small number of subjects available for the analysis and the limited range of healthcare disciplines. Also, the study was performed in a simulated setting, and future work will be needed to evaluate these techniques in the clinical setting. This work represents a stepping-stone for the use of fractal statistical analysis in the characterization of visual attention and non-verbal communication in emergency medicine scenarios. As a follow-up to this study, further analysis will be needed to determine the probability of the recurrence of positive autocorrelation in both the pre and post scenarios, and the predictability of H estimates in either scenario. While the H values deduced from this study indicate persistent focus of the providers in a specific direction, more insight could be gained on the specific subject of the persistent focus if the H results were compared with visual head orientation observations.

Table 5.1 Hurst estimates of Roll angles in the pre- and post-TeamSTEPPS training scenarios

          Pre                 Post
          Mean     St. dev.   Mean     St. dev.   p-value
RN        0.9114   0.05       0.9290   0.07       0.55
PA        0.8851   0.06       0.8474   0.02       0.35
RT        0.8897   0.05       0.8990   0.05       0.54
RS        0.9032   0.05       0.8799   0.04       0.20

Table 5.2 Hurst estimates of Yaw angles in the pre- and post-TeamSTEPPS training scenarios

          Pre                 Post
          Mean     St. dev.   Mean     St. dev.   p-value
RN        0.9184   0.10       0.9266   0.05       0.87
PA        0.926    0.07       0.8652   0.04       0.34
RT        0.9093   0.04       0.8750   0.05       0.17
RS        0.9014   0.05       0.9182   0.06       0.64


Fig. 5.3 Clustered column charts of H estimates of roll head motions (looking up and down)

Fig. 5.4 Clustered column charts of H estimates of yaw head motions (looking left and right)

5.4 Discussion

We used head movement to infer the focus of healthcare providers in a simulated trauma resuscitation activity. By positioning the lying simulated patient at the waist height of the providers, they were forced to rotate the head in the sagittal plane to look down. By positioning the team members around the bed and not in direct line of sight, the providers were forced to move the head left and right, rotating it about the coronal plane, in order to look at each other. We recorded and analyzed these two signals as a figure of merit of the focus of the provider either on the task or on the teammates. The Hurst exponent applied to the signals reliably measures the direction of focus of the participants during a simulated trauma resuscitation scenario. Future research will be needed to evaluate this analytic technique across providers and in the clinical setting. Application of H to the determination of orientation, independently of either direct in situ visual observations or review of

Fig. 5.5 Rescaled range (R/S) plots of roll (a) and yaw (b), with Hurst Exponent (H) values of 0.9392, 0.8457, and 0.8940 respectively, from 5-min sample data of a provider's head movement. Motion data were collected at a sampling frequency of 40 Hz. H is the slope of an R/S plot. Blue = pre-TeamSTEPPS training; Red = post-TeamSTEPPS training. [Figure: two panels, (a) and (b), plotting Log(R/S) against Log(tau).]

recorded videos, presents opportunities for markerless and non-video-dependent deduction of team efficiencies in various work and non-work scenarios, and for real-time analysis of the same in low-light conditions.

Previous works suggested that ocular tracking can identify eye movement patterns that differ between providers of different proficiency [31, 37–39]. Resuscitation of critically ill patients, including trauma patients, can often be chaotic, making ocular filtering through this chaos challenging for providers. Problems with scene perception during critical procedures and trauma resuscitations, as measured by visual tracking, have been linked to cognitive deficiencies [37, 39]. Proficient providers have longer fixation times on fewer aspects of the scene during trauma resuscitation scenarios, while less proficient providers have short fixation times, often haphazardly scanning the entire scene [37]. Identifying visual centers of attention may be one objective way to identify cognitive deficiencies in providers and allow impactful, focused feedback to trainees. Our preliminary work suggests that the H exponent may reliably measure how providers focus on the scene. Identifying areas where healthcare providers divert their focus from the patient could potentially lead to focused interventions to help improve performance.

When we think about the dynamics of deterministic systems, we often focus on the response of a system to specific perturbations. For example, we can observe the impulse response of an oscillating circuit and assess its resonant frequencies and damping (i.e., the attractors of the system). These frequencies are intrinsic characteristics of the system, and in specific engineering applications they can be sought after (e.g., resonator cavities or antennas) or intentionally avoided (e.g., resonant frequencies in mechanical systems where vibrations are undesired). It is well known that a complex system might have several of these


resonant frequencies that can act at multiple scales. However, a complex system is not necessarily deterministic. A fractal system is a complex, nonlinear, and interactive system which can adapt to a changing environment. Fractal systems have attractors that are "strange" and give rise to chaotic responses. A fractal system is interesting because, by its nature, it is able to resonate at multiple frequencies. A repeated geometrical pattern in a mechanical system (e.g., the branch of a tree) provides a specific natural frequency and mode of vibration. If the pattern is nested into a tree-like structure, the structure will be able to resonate at multiple frequencies that are dependent on the size of the initial pattern [40]. Such a tree-like structure is therefore "fractal," and each branch level from trunk to leaves has its own dynamics that interacts with the compound dynamics of the whole system. Each level of tree branch can be seen as a dimension embedded in the system. Because of the direct correlation between branch geometry and period of oscillation, we can clearly see in this example a connection between each embedded dimension and the time scale (i.e., time epoch) at which the vibrational phenomenon needs to be analyzed. A small twig will have a very short period of oscillation, on the order of a fraction of a second, whereas the whole trunk would complete an oscillation in seconds or even minutes if we think of a giant sequoia tree.

If we think about "group dynamics" as a fractal system, we can see that the interaction between individuals can be interpreted within different temporal dimensions and time scales, much like the oscillation of a branch within a tree. We can focus on group dynamics at four main temporal scales: the neural, communication, task practice, and training levels. Each of these phenomena occurs at a different time scale and can be elicited by perturbations with specific characteristics capable of exciting the proper response. For example, smiling at a teammate provides a neural reaction that prompts the person receiving the smile to smile back. The "perturbation" is in this case a change in the facial features of one member that can be very quick but that also has very small power as a signal (e.g., it has a very small amplitude as a movement and might go unnoticed by the observer). Smiling is contagious, it resonates with people, but it is not going to "travel far" within the team dynamics if the team is very big. This is comparable to having a light breeze blow on the leaves of a tree: it will excite the smallest twigs but would not shake the whole tree. On the other hand, a verbal comment occurs on a longer time scale, and cognitively it requires some time to sink in and be processed by individuals. This type of perturbation might propagate through a larger part of the team. It is a signal that is more powerful (i.e., longer in duration, exciting neural circuits that require conscious thought) and might be equivalent to a bird sitting on a branch: it will shake the branch and the sub-branches spanning from it, exciting higher-frequency dynamics such as the twigs and the leaves. Indeed, we can see that comments are connected to body reactions as well, such as frowning if the comment is bad and smiling if it is good.

Practicing a task has an even longer time scale. If a teammate not only says something bad but does something bad, the ramifications are larger. It affects an


even larger part of the group, but it usually takes some time before the malpractice is evident to everybody, much longer than it takes for rumors to spread. The malpractice of an individual could affect the group as a whole. Training, on the other hand, inevitably impacts the dynamics of the whole group. It is not uncommon to hear managers of organizations saying "that is how we do things here." A change in training is equivalent to changing the culture of an organization and "to shaking the whole tree." It requires a perturbation that is very long in time and high in power to be able to do that.

Current techniques for teaching acute care providers to run resuscitations are limited to general feedback after an encounter (simulated or live) and lack focused and objective detail. Previous work has suggested that experienced providers focus on different aspects of the environment with less variability in focal points than novice individuals. Data examining visual centers of attention in the simulated trauma resuscitation setting are lacking. It is thought that experienced acute care providers follow these same trends, where focused visual centers of attention are a marker of proficiency and may translate into safer patient care. It was thought that TeamSTEPPS education would encourage practitioners to have a more focused visual center of attention than providers who went through regular training. Past research evaluated the impact of the TeamSTEPPS method on the communication between teams of healthcare practitioners. We can see therefore that, since such a method affects dynamics within a specific timescale (i.e., communication), it ought to be able to influence phenomena at shorter time scales as well (i.e., the focal points of the communicators).

We previously found that TeamSTEPPS education improved team dynamics among undergraduate health professionals [32]. This was observed using the Team Performance Observation Tool (TPOT), a TeamSTEPPS curriculum instrument that can be utilized to evaluate the effectiveness of team performance. TPOT was designed based on the Agency for Healthcare Research and Quality TeamSTEPPS curriculum. TPOT can be used by a variety of reviewers, has good inter-rater reliability, and has been successfully used in the clinical setting to assess teamwork globally [41].

5.5 Conclusion

Research trying to quantify team dynamics using fractal analysis rather than questionnaires is flourishing [42, 43]. Team coordination in submarine teams was previously analyzed using entropy analysis of electroencephalographic signals [44]. Stability of communication was tested on an unmanned air vehicle (UAV) reconnaissance mission using the largest Lyapunov exponent, a measure of dynamical stability, to detect perturbations to team communication [30]. It should be noted that each of these studies focused on events at quite different scales, and that the scale is directly correlated with the time epoch size of the signal's time series.


Electroencephalography might be able to provide us with features that characterize the implicit recognition of a gesture, which can happen on the order of milliseconds; the understanding of an event, which can occur in hundreds of milliseconds; and the taking of a decision, on the order of seconds. These are events that occur at the neural scale. On the other hand, oral communication such as sentences and short dialogs can be on the order of seconds to minutes and requires explicit communication and cognition. Furthermore, monitoring the performance of a task and its coordination can require minutes to hours. And the training for a specific task can require days or weeks [45].

When looking at our Hurst exponent analysis, we found no statistically significant decrease in the complexity of the head movement. Various scenarios could have been observed. The head movement was partitioned into two independent movements: looking left or right (i.e., the yaw) and looking up or down (i.e., the roll). Looking up or down refers to the attention provided to the simulated patient, which forces each caregiver to look down. Looking left and right is correlated with the attention that the caregiver gives to the teammates to the left and right. An increase in permanency for the roll would have indicated increased attention to the patient. An increase in permanency for the yaw might have indicated an increase in attention to the teammates by reducing the number of focal points.

Indeed, what makes an expert is the ability to predict what is going to happen next, based on well-determined cues. A typical example is provided in many sports. Oftentimes the expert players are not faster or stronger than the novices. They are superior because they are able to develop a mental model that gives certain cues which provide a clear indication of what is going to happen next. The problem in our case is that these cues are very much task specific and cannot be individualized (consciously or not) without purposeful practice. If skills were transferable, great racquetball players would also be great squash players right away. This, however, is usually not the case. In fact, while the players might be confident of their position on the court, they expect the ball to land in stereotypical spots and to bounce in a specific manner. Since the coefficient of restitution of the ball in the two sports is dramatically different, playing one sport when we are used to the other tends to be confusing for the uninitiated.

The outcomes of our analysis might indicate that the team members were comfortable with their role in executing a plan of care and prioritizing tasks, but that they were not comfortable with the interaction with others. These findings may suggest that the siloed nature of undergraduate health professional education hampers a team dynamic in which there is more interaction between the agents. Thus, interprofessional learning experiences performed early in the curricula of health professional students might break down the barriers of a siloed education system and facilitate interprofessional communication and teamwork. An alternative scenario is that the students were not completely proficient with the task. Using the racquet sport metaphor once more, they did not know how either of the two balls would bounce to begin with. Experiences with the observations


of hundreds of interprofessional scenarios have indicated that, before directing and evaluating the work of others, students need to be competent with their specific task [46]. It can be seen that team dynamics can be studied by characterizing events that occur at different time scales and that assessment questionnaires are usually able to assess only the explicit interaction between members at a scale of minutes to hours. Using motion capture data in combination with the Hurst exponent would allow for the analysis of the interactions between team members at different time scales. Time series analysis could focus on phenomena that are primarily neural (i.e., saccadic movements of the eyes, facial expressions, or reflexive movements, just to name a few) to see how different individuals react to stimuli caused by perturbations in group dynamics. Communication could be studied through movements that are primarily related to cognitive processes, such as eye and head tracking of targets. Performance could be evaluated by tracking purposeful tasks to see coordination between agents. Finally, training can be quantified by comparing multiple sessions of the same task.

Acknowledgements This work was supported by an educational research grant from the Society for Academic Emergency Medicine (SAEM). We would like to thank SAEM for funding this work.

Appendix: A Simplified Approach for the Estimation of the Hurst Exponent

A significant number of applications have been developed for the estimation of the Hurst exponent, with varying degrees of complexity. Understanding the steps underlying the estimation of H is considerably aided by simplifying them. An attempt is made here to outline a chronological approach to the preparation of a time series data file for Hurst exponent estimation, using the rescaled range analysis approach (https://blogs.cfainstitute.org/investor/2013/01/30/rescaled-range-analysis-a-method-for-detecting-persistence-randomness-or-mean-reversion-in-financial-markets/) and a spreadsheet application (Microsoft Excel®). The following steps serve a dual purpose: outlining a stepwise approach to H estimation and illustrating the ability of the H algorithm to verify that numbers randomly generated in Excel are truly random.

(a) From a single time series requiring H estimation, define additional data segments created through the division of the original time series into constant, increasing multiples, ensuring that the smallest data epoch has enough observations for the computation of standard descriptive statistics (mean, median, mode, skewness, etc.). The choice of epoch size is made at the discretion of the analyst, driven primarily by the size of the time series available for analysis. The choice of the epoch size can be based on the time


Fig. 5.6 Microsoft Excel® random number generation function

Fig. 5.7 Mean epoch values

Fig. 5.8 Mean adjusted series data

scale of the phenomenon in question: a long epoch would correspond to events that repeat with a very low frequency or are characterized by fractal attractors of very small magnitude, and vice versa. Using Microsoft Excel®, generate random numbers using the function =RAND() and create 100 random observations (Fig. 5.6). Copy the generated random time series data and paste it into a different spreadsheet column as numeric values, to prevent the regeneration of the random values every time the spreadsheet recalculates. Create 4 segments of the original time series data, N. Segment 1 should have all of the variables in the N time series; segment 2 should have two epochs of size N/2 each; segment 3 should have four epochs, each of size N/4; while segment 4 should have six epochs, each of size N/6.
(b) Determine the average values of the data in each of the epochs (Fig. 5.7).
(c) For each of the defined epochs, subtract the mean determined in (b) from each observation in the corresponding epoch, and create a dataset of mean adjusted series, having the same number of observations as the original time series data (Fig. 5.8).


Fig. 5.9 Cumulative deviation

Fig. 5.10 Epoch range value determination

The mean adjusted series value for the observation in cell L3 is computed using the formula =B3-$G$3.
(d) Using data from step (c), create "one-step-behind" cumulative deviation (time) series data, retaining the same number of observations as in steps (a) and (c) (Fig. 5.9). The observation in cell Q6 is computed using the formula =L6+Q5.
(e) Determine the range values of each of the data epochs from the cumulative deviation series created in step (d) (Fig. 5.10).


Fig. 5.11 Epoch standard deviation determination


Fig. 5.12 Computation of rescaled range in Excel® with reference to already determined epoch values of range and standard deviation

(f) Determine the standard deviation values for each of the original data epochs (Fig. 5.11).
(g) Divide the range values derived in (e) by the standard deviation values determined in step (f) to compute the rescaled range values (Fig. 5.12).
(h) Compute the mean values of the rescaled range data and determine the log (to base 10) of the mean rescaled range values (Fig. 5.13).
(i) Determine the log (to base 10) of the number of observations in each of the data epochs.
(j) Create a plot of log(rescaled range) vs. log(n) and compute the slope of the curve (Fig. 5.14). The observed slope of 0.48999 is the Hurst exponent estimate of the generated random time series data; its closeness to 0.5 verifies the randomness of the data set (Fig. 5.15). A scripted equivalent of these steps is sketched below.
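For readers who prefer a scripted version of the spreadsheet walkthrough, the following Python sketch reproduces steps (a)–(j): it splits a series into 1, 2, 4, and 6 epochs, computes the mean rescaled range for each segmentation, and takes H as the slope of log10(R/S) versus log10(n). The series length of 120 and the use of the population standard deviation are assumptions chosen to mirror the Excel example, not requirements of the method.

```python
import numpy as np

def hurst_rs(series, epoch_counts=(1, 2, 4, 6)):
    """Estimate the Hurst exponent H by rescaled range (R/S) analysis."""
    x = np.asarray(series, dtype=float)
    log_n, log_rs = [], []
    for k in epoch_counts:
        n = len(x) // k                            # epoch size for this split
        rs = []
        for i in range(k):
            epoch = x[i * n:(i + 1) * n]
            dev = epoch - epoch.mean()             # step (c): mean-adjusted series
            cum = np.cumsum(dev)                   # step (d): cumulative deviation
            r = cum.max() - cum.min()              # step (e): range
            s = epoch.std()                        # step (f): (population) std. dev.
            rs.append(r / s)                       # step (g): rescaled range
        log_rs.append(np.log10(np.mean(rs)))       # step (h): log mean R/S
        log_n.append(np.log10(n))                  # step (i): log epoch length
    slope, _ = np.polyfit(log_n, log_rs, 1)        # step (j): slope of the fit = H
    return slope

# White noise should give H near 0.5, matching the Excel example's 0.48999.
print(hurst_rs(np.random.default_rng(1).random(120)))
```

A persistent series (H > 0.5) would return a correspondingly larger slope from the same function.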


Fig. 5.13 Computation of the mean rescaled range values for each epoch

Fig. 5.14 Log (rescaled range) vs. Log(n)

Fig. 5.15 Hurst estimation of generated random time series data in Excel®


References 1. Classen, D. C., Resar, R., Griffin, F., Federico, F., Frankel, T., Kimmel, N., et al. (2011). ‘Global trigger tool’ shows that adverse events in hospitals may be ten times greater than previously measured. Health Affairs (Millwood), 30, 581–589. https://doi.org/10.1377/hlthaff.2011.0190. 2. O’Connor, P. J., Sperl-Hillen, J. A. M., Johnson, P. E., & Rush, W. A. (2005). Clinical inertia and outpatient medical errors. In K. Henriksen, J. B. Battles, E. S. Marks, & D. I. Lewin (Eds.), Advances in patient safety: From research to implementation. Rockville, MD: Agency for Healthcare Research and Quality (US). 3. Neily, J., Mills, P. D., Young-Xu, Y., Carney, B. T., West, P., Berger, D. H., et al. (2010). Association between implementation of a medical team training program and surgical mortality. Journal of the American Medical Association, 304, 1693–1700. https://doi.org/10.1001/ jama.2010.1506. 4. Reagans, R., Argote, L., & Brooks, D. (2005). Individual experience and experience working together: Predicting learning rates from knowing who knows what and knowing how to work together. Management Science, 51, 869–881. 5. Carlson, J. N., Das, S., De la Torre, F., Callaway, C. W., Phrampus, P. E., & Hodgins, J. (2012). Motion capture measures variability in laryngoscopic movement during endotracheal intubation: A preliminary report. Simulation in Healthcare, 7, 255–260. https://doi.org/ 10.1097/SIH.0b013e318258975a. 6. Carlson, J. N., Quintero, J., Guyette, F. X., Callaway, C. W., & Menegazzi, J. J. (2012). Variables associated with successful intubation attempts using video laryngoscopy: A preliminary report in a helicopter emergency medical service. Prehospital Emergency Care, 16, 293–298. https://doi.org/10.3109/10903127.2011.640764. 7. Saeb, S., Weber, C., & Triesch, J. (2011). Learning the optimal control of coordinated eye and head movements. PLoS Computational Biology, 7(11), e1002253. 8. Ramseyer, F., & Tschacher, W. (2014). Nonverbal synchrony of head- and body-movement in psychotherapy: Different signals have different associations with outcome. Frontiers in Psychology, 5, 979. https://doi.org/10.3389/fpsyg.2014.00979. 9. Ohu, I., Cho, S., Zihni, A., Cavallo, J. A., & Awad, M. M. (2015). Analysis of surgical motions in minimally invasive surgery using complexity theory. International Journal of Biomedical Engineering and Technology, 17, 24–41. https://doi.org/10.1504/IJBET.2015.066966. 10. Melendez-Calderon, A., Komisar, V., & Burdet, E. (2015). Interpersonal strategies for disturbance attenuation during a rhythmic joint motor action. Physiology & Behavior, 147, 348–358. https://doi.org/10.1016/j.physbeh.2015.04.046. 11. Melendez-Calderon, A., Komisar, V., Ganesh, G., & Burdet, E. (2011, August 30-September 3). Classification of strategies for disturbance attenuation in human-human collaborative tasks. Paper presented at the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 12. Reed, K., Peshkin, M., Hartmann, M. J., Grabowecky, M., Patton, J., & Vishton, P. M. (2006). Haptically linked dyads: Are two motor-control systems better than one? Psychological Science, 17(5), 365–366. https://doi.org/10.1111/j.1467-9280.2006.01712.x. 13. Takagi, A., Beckers, N., & Burdet, E. (2016). Motion plan changes predictably in dyadic reaching. PLoS one, 11(12), e0167314. 14. Rattenborg Niels, C. (2017). Sleeping on the wing. Interface Focus, 7(1), 20160082. https:// doi.org/10.1098/rsfs.2016.0082. 15. Stahl, J. S. (2001). 
Eye-head coordination and the variation of eye-movement accuracy with orbital eccentricity. Experimental Brain Research, 136(2), 200–210. https://doi.org/10.1007/ s002210000593. 16. Kim, S.-Y., Moon, B.-Y., & Cho, H. G. (2016). Smooth-pursuit eye movements without head movement disrupt the static body balance. Journal of Physical Therapy Science, 28(4), 1335– 1338. https://doi.org/10.1589/jpts.28.1335.


17. Piovesan, D., Melendez-Calderon, A., & Mussa-Ivaldi, F. A. (2013). Haptic recognition of dystonia and spasticity in simulated multi-joint hypertonia. IEEE International Conference on Rehabilitation Robotics, 2013, 6650449. https://doi.org/10.1109/ICORR.2013.6650449. 18. Wang, H. E., Schmicker, R. H., Daya, M. R., Stephens, S. W., Idris, A. H., Carlson, J. N., et al. (2018). Effect of a strategy of initial laryngeal tube insertion vs endotracheal intubation on 72hour survival in adults with out-of-hospital cardiac arrest: A randomized clinical trial. JAMA, 320(8), 769–778. 19. Ho, K. K. L., Moody, G. B., Peng, C.-K., Mietus, J. E., Larson, M. G., Levy, D., & Goldberger, A. L. (1997). Predicting survival in heart failure case and control subjects by use of fully automated methods for deriving nonlinear and conventional indices of heart rate dynamics. Circulation, 96(3), 842–848. https://doi.org/10.1161/01.CIR.96.3.842. 20. Richman, J. S., & Moorman, J. R. (2000). Physiological time-series analysis using approximate entropy and sample entropy. American Journal of Physiology-Heart and Circulatory Physiology, 278(6), H2039–H2049. https://doi.org/10.1152/ajpheart.2000.278.6.H2039. 21. Omidvarnia, A., Mesbah, M., Pedersen, M., & Jackson, G. (2018). Range entropy: A bridge between signal complexity and self-similarity. Entropy, 20(12), 962. Retrieved from http:// www.mdpi.com/1099-4300/20/12/962. 22. Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2), 263–269. 23. Packard, N. H., Crutchfield, J. P., Farmer, J. D., & Shaw, R. S. (1980). Geometry from a time series. Physical Review Letters, 45(9), 712. 24. De la Fuente, I., Martinez, L., Aguirregabiria, J., & Veguillas, J. (1998). R/S analysis strange attractors. Fractals, 6(2), 95–100. 25. Torre, K., & Balasubramaniam, R. (2011). Disentangling stability, variability and adaptability in human performance: Focus on the interplay between local variance and serial correlation. Journal of Experimental Psychology: Human Perception and Performance, 37(2), 539. 26. Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Physical Review A, 45(6), 3516. 27. Crowley, P. (1992). Density dependence, boundedness, and attraction: Detecting stability in stochastic systems. Oecologia, 90(2), 246–254. 28. Radii, R., & Politi, A. (1985). Statistical description of chaotic attractors: The dimension function. Journal of Statistical Physics, 40(5–6), 725–750. 29. Cajueiro, D. O., & Tabak, B. M. (2005). The rescaled variance statistic and the determination of the Hurst exponent. Mathematics and Computers in Simulation, 70(3), 172–179. 30. Gorman, J. C., Hessler, E. E., Amazeen, P. G., Cooke, N. J., & Shope, S. M. (2012). Dynamical analysis in real time: Detecting perturbations to team communication. Ergonomics, 55(8), 825– 839. https://doi.org/10.1080/00140139.2012.679317. 31. Klauer, S. G., Olsen, E. C., Simons-Morton, B. G., Dingus, T. A., Ramsey, D. J., & Ouimet, M. C. (2008). Detection of road hazards by novice teen and experienced adult drivers. Transportation Research Record Journal, 2078, 26–32. https://doi.org/10.3141/2078-04. 32. Baker, V. O. T., Cuzzola, R., Knox, C., Liotta, C., Cornfield, C. S., Tarkowski, R. D., et al. (2015). Teamwork education improves trauma team performance in undergraduate health professional students. Journal of Educational Evaluation for Health Professions, 12, 36–36. https://doi.org/10.3352/jeehp.2015.12.36. 
33. Ohu, I. P., Piovesan, D., & Carlson, J. N. (2018, December 1). The Hurst exponent—A novel approach for assessing focus during trauma resuscitation. Paper presented at the 2018 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 34. Qian, B., & Rasheed, K. (2004). Hurst exponent and financial market predictability. Paper presented at the IASTED conference on Financial Engineering and Applications. 35. Anis, A. A., & Lloyd, E. H. (1976). The expected value of the adjusted rescaled Hurst range of independent Normal summands. Biometrika, 63(1), 111–116. 36. Peters, E. E. (1994). Fractal market analysis: Applying chaos theory to investment and economics. John Wiley & Sons. 37. Baylis, J., Fernando, S., Szulewski, A., & Howes, D. (2013). Data gathering in resuscitation scenarios: Novice versus expert physicians. Canadian Journal of Emergency Medicine, 15(1).


38. Chapman, P. R., & Underwood, G. (1998). Visual search of driving situations: Danger and experience. Perception, 27(8), 951–964. 39. Christenson, J., et al. (2009). Resuscitation outcomes consortium investigators: Chest compression fraction determines survival in patients with out-of-hospital ventricular fibrillation. Circulation, 120(13), 1241–1247. 40. Torab, P., & Piovesan, D. (2015). Vibrations of fractal structures: On the nonlinearities of damping by branching. Journal of Nanotechnology in Engineering and Medicine, 6(3), 034502. 41. Capella, J., Smith, S., Philp, A., Putnam, T., Gilbert, C., Fry, W., et al. (2010). Teamwork training improves the clinical care of trauma patients. Journal of Surgical Education, 67(6), 439–443. 42. Gorman, J. C., Dunbar, T. A., Grimm, D., & Gipson, C. L. (2017). Understanding and modeling teams as dynamical systems. Frontiers in Psychology, 8, 1053. 43. Mattei, T. A. (2014). Unveiling complexity: Non-linear and fractal analysis in neuroscience and cognitive psychology. Frontiers in Computational Neuroscience, 8, 17. 44. Likens, A. D., Amazeen, P. G., Stevens, R., Galloway, T., & Gorman, J. C. (2014). Neural signatures of team coordination are revealed by multifractal analysis. Social Neuroscience, 9(3), 219–234. https://doi.org/10.1080/17470919.2014.882861. 45. Stevens, R. (2014). Modeling the neurodynamics of submarine piloting and navigation teams. 46. Masters, C., Baker, V. O. T., & Jodon, H. (2013). Multidisciplinary, team-based learning: The simulated interdisciplinary to multidisciplinary progressive-level education (SIMPLE©) approach. Clinical Simulation in Nursing, 9(5), e171–e178.

Chapter 6

Gaussian Smoothing Filter for Improved EMG Signal Modeling

Ibrahim F. J. Ghalyan, Ziyad M. Abouelenin, Gnanapoongkothai Annamalai, and Vikram Kapila

6.1 Introduction

Electromyography (EMG) is a process of capturing the electric response of muscles by measuring the electric potential produced in muscle cells when they are activated and engaged in a movement [1]. The understanding of the nature of muscle cell activation has permitted the use of EMG signals for diverse application domains, such as neuromuscular monitoring for myasthenia gravis patients [2], a computer interface for a disabled limb [3], nerve function assessment using needle EMG [4], robot movement control [5], and human gait analysis [6], among others. In the majority of applications that are based on EMG signals, the statistical nature of captured EMG data is utilized to develop models that fit the needs of a given application, and EMG signal classification has emerged as a vital component for such applications [7]. One of the earliest efforts to classify EMG signals employed the auto-regressive-moving-average (ARMA) technique to build parametric classification models for interpreting the time series of EMG signals [8]. The use of various time-frequency representations, along with Fourier and wavelet transforms, was shown to yield promising classification performance, indicating the importance of studying multiple representations for a successful modeling of EMG signals [9]. For a prosthetic hand application, Nishikawa et al. [10] developed a real-time EMG signal-based approach that could simultaneously detect multiple hand motions and learn to adapt to an individual human operator. By combining hidden Markov and auto-regressive models, Ju et al. [11] developed efficient models of human hand gestures using sensed EMG signals. Yoshikawa et al. [12] employed a support vector machine (SVM) to

I. F. J. Ghalyan · Z. M. Abouelenin · G. Annamalai · V. Kapila
Department of Mechanical and Aerospace Engineering, NYU Tandon School of Engineering, Six Metrotech Center, Brooklyn, NY, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
I. Obeid et al. (eds.), Signal Processing in Medicine and Biology,
https://doi.org/10.1007/978-3-030-36844-9_6


classify real-time EMG signals for distinguishing human hand gestures, while estimating arm joint angles by developing simple linear models relating EMG signals to joint angles. Many other techniques were suggested to efficiently classify EMG signals, for example, k-nearest neighbor (k-NN) [13], linear discriminant analysis (LDA) [14], and deep neural networks (DNN) [15], among others. This chapter suggests a novel technique to improve the classification performance of EMG signals when using some of the aforementioned classification techniques. The performance of the classification process is susceptible to degradation when using EMG signals corrupted with noise. To address this factor, the use of a Gaussian smoothing filter (GSF) is suggested as a means to remove the noise from sensed EMG signals, which enhances the classification performance of such signals. The main features of the GSF are its simplicity of implementation, its equal support in both frequency and time domains, and its excellent noise suppression performance. These features provide a significant impetus for the applicability of the GSF to the EMG classification task. Moreover, employing the GSF is shown to reduce the computational cost required for training and testing various classification models. The chapter is an extension of the work reported in [16], but with further experimental validations, comparisons, and analysis.

The rest of the chapter is organized as follows. Section 6.2 comprises a comprehensive survey of works related to EMG signal analysis. Section 6.3 describes the classification problem of EMG signals and the noise encountered in them. Section 6.4 explains the GSF and several classification techniques, namely SVM, k-NN, naïve Bayes classifier (NBC), LDA, and the Gaussian mixture model (GMM), used for EMG signal classification. Section 6.5 details the experimental validation and the enhancement in classification performance with the use of the GSF in smoothing EMG signals. Finally, Sect. 6.6 discusses the results obtained in the experimental validations and Sect. 6.7 contains concluding remarks with recommendations for future work.

6.2 Related Works

Battye et al. [17] reported one of the earliest efforts to capture the activation of muscles to create a prosthetic hand capable of performing light grasping tasks. In this pioneering work, fluctuations of potential, recorded between two distinct points on the surface of the skin, were captured by two brass electrodes. The electrodes were firmly fixed on the skin surface and connected to amplifiers to magnify sensed fluctuations during a grasping process, to allow distinguishing between multiple phases of hand gestures. The prosthetic systems of Kobrinsky [18] were equipped with surface skin electrodes to measure myoelectric signals and opened the possibility of controlling prosthetic arms using these sensed signals. Bottomley [19] constructed a system containing two EMG channels wherein outputs of the biceps brachii and triceps muscles were utilized to generate an on-off signal to control a pneumatic elbow with flexion-extension phases. By relying on


sensed EMG signals captured from a group of muscles in the forearm, Bottomley [20] suggested a proportional controller to actuate a prosthetic hand. The bicep and tricep muscles are usually the natural muscles used by a human arm in flexing and relaxing the elbow joint. Thus, by using the residual bicep and tricep muscles, Rothchild and Mann built the Boston Arm, one of the earliest prosthetic arms, realizing a natural strategy to control the proportional motion of a prosthetic arm in a sophisticated, human-like manner [21–23]. With multiple case studies, Herberts [24] investigated the quantitative values and nature of variation of sensed EMG signals by tabulating the range of variations of EMG data and mapping them to the physical range of motion of limbs (or hands). The work of Herberts [24] allowed the determination of the motion of motors required to compensate the corresponding quantity of limb motion, which spurred the application of statistical analysis to EMG data to help engender a more thorough understanding of the nature of variation of EMG signals. By studying the spectral density of EMG signals, Scott [25] investigated statistical features of sensed EMG signals, marking a milestone in the literature on EMG signal analysis. Dorcas et al. [26] proposed an amplitude level coding technique wherein the level of EMG signals was analyzed for each amputee and possible control of a prosthetic arm was realized. Spectral density analysis has been suggested to compare muscle behavior for subjects before and after reaching a state of fatigue [27]. Specifically, the muscle contraction level was examined by using a cumulative power difference function and the mean frequency of sensed EMG signals, allowing identification of fatigue in muscles and yielding a deeper understanding of muscle behavior. In an early application of pattern recognition for prosthetic arms, Lawrence and Lin [28] developed pattern recognition and regression models by relying on data captured during the performance of daily tasks. Instead of using EMG signals, Lawrence and Lin [28] suggested using position signals to develop statistical models for efficiently mapping the motion of the human hand to elbow joint parameters. Although Lawrence and Lin [28] did not use EMG data, their work triggered the idea of employing statistical modeling for developing prosthetic arms. Despite the progress reported in the aforementioned works [26–28], their ad hoc approaches were not optimized to solve the problems of the underlying applications. To promote advanced statistical modeling of EMG signals, Graupe and Cline [8] suggested using ARMA models, with sensed EMG signals, to distinguish between multiple phases of a hand motion. This work represents one of the earliest efforts to effectively employ statistical modeling to obtain models that classify EMG signals without significant human intervention. The reception of a multistate EMG channel was improved in Parker et al. [29] by minimizing the probability of error in transferring EMG signals in a given channel. Specifically, extracting the statistical parameters of the source of signals allowed the computation of optimum receiver parameters for minimizing the probability of errors and permitted the use of EMG signals in sophisticated prosthetic arms. A structural interpretation of EMG signals was suggested in De Luca [30], where features and properties of contracting muscles were employed to develop mathematical models of captured EMG signals.
Although the majority of


EMG signals used for the modeling process were taken from human muscles, De Luca [30] occasionally referred to corresponding data from mammals to develop models. A myoprocessor development board was realized in Hogan and Mann [31] for the mathematical analysis and interpretation of EMG signals. The myoprocessor allowed obtaining optimal models of sensed signals, and excellent performance was obtained in experimental settings wherein EMG signals were captured from multiple channels. The emergence of bioinspired artificial intelligence (AI) techniques has contributed significantly to enhancing the modeling process and the development of efficient prosthetic arms. Hudgins et al. [32] successfully used an artificial neural network (ANN) to model EMG signals. A deterministic structure of features was extracted for a window of a trial of EMG samples, yielding a structure of patterns that was used to train the ANN. Moreover, the extracted structure of patterns assisted in the development of a control signal for a prosthetic arm to reduce future efforts of amputees in performing a certain task. To examine the EMG signal classification performance of the ANN, a comparative study [33] investigated the performance of the multi-layer perceptron (MLP), error back propagation-based ANN, and radial basis functions (RBF), and the results showed the superiority of RBF for such classification tasks. Merletti and Conte [34] characterized muscle behavior by plotting a classification graph that allows the capture of the conduction of muscle fibers and spectral variables. In addition to information obtained from linear electrode arrays, the resulting variables assisted in realizing efficient EMG models for multiple applications. The distribution of sensed EMG signals for several activities was investigated by Bilodeau et al. [35]. Step contraction, where a constant force is applied, and ramp contraction, where the force increases with time up to a certain level, were used, and results from 16 subjects showed the non-normal distribution of sensed EMG signals. Clancy and Hogan [36] showed that for EMG signals with Gaussian distribution, root-mean-square (RMS) processing is excellent when employing maximum likelihood (ML) to model the given signals. Alternatively, mean-absolute-value (MAV) processing is more suitable for the case of EMG signals with Laplacian distribution. The results of Clancy and Hogan [36] explicitly state the relationship between the distribution of sensed EMG signals and the choice of the metric used in model development, since this choice was shown to have an impact on the resulting signal-to-noise ratio (SNR) of the models produced. Englehart et al. [9] showed that ensemble time-frequency based features extracted from sensed signals yield a more reliable EMG classification process. Specifically, Fourier transform and wavelet techniques were used in estimating time and frequency features, enhancing the classification of EMG signals significantly. Farina and Merletti [37] addressed the issues of the quality of estimates of the amplitude of EMG signals, variables extracted from the frequency domain, and the conduction velocity of muscles. Moreover, they showed that pre-whitening EMG signals, using techniques like auto-regressive modeling, resulted in enhanced estimates of the amplitudes of EMG signals. They additionally investigated the effect of the order of autoregression for stationary and non-stationary EMG signals. Englehart et al.
[38] utilized wavelet transform to extract features of EMG signals along with


the principal component analysis (PCA) technique to reduce the dimensionality of features. Streams of EMG data were used to estimate classification boundaries online, based on extracted features, enhancing the classification performance. Rosen et al. [39] used sensed EMG signals and measured joint values to develop efficient models of the moments of the muscles of elbow joints. Myoprocessors were employed to develop models of EMG and joint signals for assisting in the control of a two-link mechanism that can perform tasks of upper limbs. Hussein and Granat [40] utilized a neuro-fuzzy classifier, tuned by a genetic algorithm (GA), to develop efficient EMG models. The neuro-fuzzy models reduced delay times while enhancing time spectral analysis, allowing the feasibility of detecting the intentions of patients. To enhance classification accuracy and reduce the time required for control and prediction, Englehart and Hudgins [41] relied on computational processors to classify EMG signals, resulting in online generation of classification boundaries to obtain dexterous and flexible control of prosthetic arms. Ajiboye and Weir [42] proposed a fuzzy inference engine to model EMG signals, where the fuzzy c-means (FCM) clustering technique was employed to construct fuzzy sets of if-then rules of a knowledge base. This methodology was applied to five subjects: data of four subjects were used for off-line training while data of the fifth subject were employed in real-time implementation, yielding an excellent modeling performance for captured EMG signals. Huang et al. [43] considered a Gaussian mixture model (GMM) to develop classifiers for EMG signals, sensed for multiple limb motions. Their GMM-based EMG classification technique was tested on 12 subjects and excellent results were reported. This work was later extended by employing hidden Markov models (HMM), resulting in a lower computational complexity that made it suitable for real-time applications with higher classification accuracy [44]. For controlling exoskeleton robots, Fleischer et al. [45] employed EMG signals that were mapped to corresponding force signals, allowing the development of models for human muscles and opening the possibility of applying such robots to replacing lost limbs, rehabilitation, and diagnostics, among others. Multiple classification techniques were benchmarked by Reaz et al. [46], such as auto-regressive, ANN, and fuzzy classifiers. This work also considered other signal processing aspects of EMG data, such as EMG signal detection and decomposition, among others. Oskoei and Hu [47] used a GA to estimate optimal sets of features of EMG signals, where LDA was employed to build classification models for the estimated features. This combination of GA and LDA resulted in enhanced classification performance. Subsequently, Oskoei and Hu [48] utilized SVM to classify sensed EMG signals corresponding to multiple motions of upper limbs and showed that SVM resulted in better classification accuracy, enhanced robustness, and lower computational cost. They additionally suggested online measures of classification correctness, wherein the output entropy of the developed classifier was used to tune models online. Wavelet transform was used to denoise EMG signals by Hussain et al. [49], wherein higher-order statistics (HOS) were employed to analyze the resulting signals. These authors examined several types of wavelet basis functions and multiple key performance indices (KPIs) to measure the overall


performance (e.g., root mean square and signal-to-noise ratio), and obtained an excellent understanding of EMG signals. Ahmad and Chappell [50] investigated several characteristics of EMG signals collected from 20 subjects as they moved their wrists. They studied the behavior of EMG signals when moving the wrist at multiple speeds by performing moving approximate entropy analysis and illustrated that sensed EMG signals can be partitioned into three regions: start, middle, and end regions. They additionally showed the regularity of EMG data at the start and end of contractions of muscles versus the data captured in the middle of contractions. To recognize six human hand gestures, Khezri and Jahed [51] employed an adaptive neuro-fuzzy inference system (ANFIS) wherein data from the time and frequency domains were used for feature extraction. The training of the ANFIS was conducted by using hybrid back propagation and least mean squares, while a number of neuro-fuzzy rules were optimized using subtractive clustering. By studying EMG signals in static and dynamic situations, Lorrain et al. [52] demonstrated the relevance of steady-state and transient portions of the given signals to the modeling process. A comprehensive study of feature extraction techniques for EMG signals was conducted by Phinyomark et al. [53], who illustrated that the power spectrum ratio (PSR) and median frequency (MNF) facilitate the use of the input EMG vector to perform classification and yield superior performance vis-à-vis other techniques. To develop EMG signal-based generic models applicable to multiple users, Matsubara and Morimoto [54] collected EMG signals of multiple users performing similar hand gestures. They suggested bilinear EMG models composed of two features: one specific to a user and the other dependent on a motion, indicating the possibility of applying the resulting motion-dependent component models to new users with slight tuning. To enhance the accuracy of classifying sensed EMG signals, Subasi [55] proposed particle swarm optimization (PSO) in the training process of SVM models, with a discrete wavelet transform (DWT) used for feature extraction of the sensed signals and the kernels of the SVM tuned by PSO. Phinyomark et al. [56] considered EMG data collected from subjects over 21 days and employed a hybrid technique, combining both time and frequency domains, aiming to extract important features of the given data. Specifically, their hybrid feature extraction technique combined sample entropy, fourth-order cepstrum coefficients, root mean square, and waveform length, while LDA was used to classify the sensed EMG signals, resulting in enhanced classification accuracy. To measure fatigue in muscles, Rogers and MacIsaac [57] collected EMG signals of nine healthy subjects (in fatigue tests of isometric, cyclic, and random contractions) and demonstrated that the generalized mapping index (GMI) and PCA produced excellent estimates of muscle fatigue. The distribution of EMG signals has been shown to exhibit super-Gaussian features during isometric and nonfatiguing contractions of muscles [58]. The value of kurtosis, the fourth cumulant of a random variable divided by the square of its second cumulant, was utilized to analyze the relationship between contraction forces and the probability distribution function (PDF) of sensed EMG signals. It was shown that an increase in contraction


forces renders the PDF of sensed EMG signals Gaussian. To estimate the PDFs of EMG signals collected for multiple hand motions, Thongpanja et al. [59] extended the work of Nazarpour et al. [58] by proposing the use of L-kurtosis, the value of kurtosis computed from L-moments rather than simple moments. Tsai et al. [60] utilized a short-time Fourier transform (STFT) ranking technique to extract features of sensed EMG signals. For dynamic and isometric contractions of muscles, they compared features extracted from time-domain analysis with features obtained from the frequency domain and achieved an excellent EMG modeling performance. Their work further showed that combining STFT ranking with PCA would result in enhanced performance due to an increase in data separability. In Siddiqi et al. [61], continuous measurement of the angle of a thumb was discretized and models were developed to map sensed EMG signals to the corresponding angle values. Moreover, the use of auto-regressive (AR) models and SVM revealed the superiority of SVM in an experimental setup. A tangible enhancement in classification performance was realized by considering wavelet transform-based multi-resolution analysis, with multiple wavelet basis functions, that addressed the nonstationarity and nonlinearity of sensed EMG signals [62]. These authors used each basis function to extract a group of features of the given EMG data samples, and the resulting groups of features were used in a back propagation neural network, yielding excellent modeling accuracy. For efficient control of a multi-degree-of-freedom prosthetic arm, Kasuya et al. [63] considered the problem of degradation in modeling performance arising from a large number of classes. Indeed, with an increase in the number of classes, the redundancy of sensed EMG signals increases, which reduces the classification accuracy. To enhance the modeling process of sensed EMG signals, Kasuya et al. [63] proposed an efficient post-processing of the temporal sequence of sensed EMG signals. To develop efficient models of EMG signals, Peng et al. [64] suggested the NeuCube technique, where the complex spatio-temporal nature of sensed signals is addressed by combining a spiking neural network (SNN) with time domain (TD) feature extraction. In Zhang et al. [65], two TD features and a PCA stage were utilized to preprocess sensed EMG signals for multiple hand motions. Next, by using least-squares support vector machine (LS-SVM), efficient models were developed that yielded an enhanced modeling performance for the classification of multiple motions of an arm. Modeling of force in human–environment interaction was studied in Pang et al. [66], where EMG signals were used to estimate musculotendon forces arising from human interaction with the surrounding environment. Then, a neural network was employed to distinguish between touch and push motions by using EMG signals. Moreover, they utilized Bayesian linear regression (BLR) to estimate optimal parameters of the interaction dynamics, resulting in promising modeling of human–environment interaction. With the aim of detecting neuromuscular disorders, Naik et al. [67] suggested an ensemble empirical mode decomposition technique to decompose a set of intrinsic modes in a single-channel EMG signal. Next, they utilized a TD methodology to extract important features from the resulting signal and LDA to successfully detect disorders. A log-likelihood-based LDA classification technique was suggested by Spanias et al. [68] to detect disturbances in EMG signals for use in


prosthetic limbs that utilize multiple sensors, including EMG sensors, for modeling the motion of limbs. In their approach, when a significant disturbance is detected, EMG samples are discarded and the decision relies on signals captured by other sensors. To accommodate possible changes that might cause drift in sensed EMG signals, a supervised adaptation technique was suggested in Vidovic et al. [69]. In this approach, a slight calibration is conducted on pre-trained models to accommodate possible non-stationarities encountered during changes like sweating, weight lifting, and variable arm positions, among others, that might alter sensed EMG signals. Multi-dimensional dynamic time warping (MD-DTW) was used by AbdelMaseeh et al. [70] as a measure of distance between trajectories of muscle activations, allowing distinction between multiple phases of human hand motions. This technique was applied to EMG signals collected from 40 subjects and excellent classification accuracy was reported. The transient behavior of motion was investigated by Samuel et al. [71], who considered motion containing static and dynamic components, revealing a significant impact on classification accuracy if the transient behavior is not properly accounted for. Zhai et al. [72] utilized spectrograms of sensed EMG signals to extract important features, specifically when using PCA to reduce the dimension of the considered signals, and reported success in classifying EMG signals collected for hand motions. The use of EMG signals was extended to gait recognition in Lee et al. [73], where non-parametric statistical models were developed to model important features of EMG signals, yielding efficient classification of human gait. The effect of arm position on the modeling process of EMG signals was studied by Jochumsen et al. [74], who showed that using EMG readings corresponding to more arm positions results in further robustness and enhancement of the classification performance. By using only two channels of EMG signals, Tavakoli et al. [75] classified five gestures of the human hand. They combined a high dimensional feature space with SVM to yield a reasonable classification performance for the given number of gestures and only two channels. Chow-Liu trees, combined with a forward feature selection technique, were employed by Camargo and Young [76] to select important features of the classification task corresponding to upper-limb motion. They showed that the selection of relevant features significantly enhances the performance of EMG signal classification. EMG signals are usually subject to noise that degrades EMG signal modeling performance if not well addressed. A review of the EMG literature shows that, despite the importance of filtering such noisy signals, relatively little research has focused on filtering the noise, compared with modeling and other aspects of EMG signal analysis. One of the earliest efforts to filter EMG signals was reported in [77], which used a recursive first-order Butterworth filter. In Conforto et al. [78], high-pass, moving average, moving median, and adaptive wavelet filters were used and compared in filtering sensed EMG signals contaminated with noise. The study reported in De Luca et al. [79] efficiently employed a Butterworth filter with a constant predefined corner frequency and slope to remove noise and artifacts arising when capturing EMG signals. In Ghalyan et al.
[16], the authors suggested employing the GSF to remove the noise of sensed signals


in an EMG-based hand gesture classification task. It was shown that the GSF would not only enhance the classification accuracy of the developed models, but also reduce the computational effort. EMG signal modeling and processing are data-driven tasks, where applications have played a vital role in the progress of both the modeling and the processing of such signals; this can be seen in the entanglement of the literature on the applications and the theory of EMG signals. Figure 6.1 shows a timeline of the main milestones of the related works detailed above, with a rough separation between the theoretical and applied aspects of EMG signal modeling and processing.

This chapter extends the results obtained by Ghalyan et al. [16], where it was suggested to use the GSF to filter out the noise of sensed EMG signals to enhance classification accuracy. The GSF technique is elaborated in further detail, and the classification techniques considered in Ghalyan et al. [16] are studied thoroughly. Furthermore, one additional classification technique, viz., the GMM, and another experiment are added to the set of experimental validations. The considered classification techniques and experimental validations reinforce the feasibility of employing the GSF for such a task and reveal further details about the suggested enhancement of the EMG classification process.

6.3 Problem Formulation

Let us consider two distinct gestures of a human hand, as shown in Fig. 6.2: (a) with the hand closed and (b) with the hand opened. The signals sensed by one of the MYO band sensors during the phases of Fig. 6.2a, b are shown in Fig. 6.3a, b, respectively. One can formulate the problem of classifying the phases of hand gestures (phase_l, l = 1, ..., L), where L denotes the total number of phases, with L = 2 for the case of the two phases shown in Fig. 6.2a, b, as

$$y_{\text{phase}_l}(t) = \begin{cases} 1 & \text{if } x_s(t) \in \text{phase}_l \\ 0 & \text{otherwise}, \end{cases} \qquad (6.1)$$

where y_{phase_l}(t) ∈ B ≜ {0, 1} is the binary desired classifier output of the lth phase at time instance t and x_s ∈ ℝ^8 is the corresponding EMG signal vector. It is worth mentioning that, throughout this chapter, we have x_s ∈ ℝ^8 since the MYO band considered in the experiments is composed of eight sensors. Alternatively, for cases where we consider r EMG sensors in the device, x_s would be an r-dimensional vector, i.e., x_s ∈ ℝ^r. The main objective of the classification process is to develop models that realize the nonlinear mapping given in Eq. (6.1) as accurately as possible. However, it is obvious from Fig. 6.3a, b that the given EMG signals are corrupted with noise that degrades the performance of the classification process. The source of the noise can be the sensors, the communication media, the human body, or their

Fig. 6.1 Timeline of the literature of EMG signal processing and applications, where it can be seen that applications brought about a significant impact on the development of EMG signal analysis and modeling


Fig. 6.2 MYO band mounted on an arm with: (a) hand closed and (b) hand opened. The MYO band is equipped with eight sensors, providing eight EMG signals and it is obvious that the gestures are distinct

Fig. 6.3 The EMG signals of Fig. 6.2: (a) hand closed and (b) hand opened. Despite the fact that they belong to distinct hand gestures, they are not easily distinguishable due to their contamination with noise


combinations. The existence of noise is inevitable in many sensors, deteriorating the process of capturing precise variations in measured quantities. For instance, in the example of EMG data given in Fig. 6.3, it is obvious that the noise level is relatively high, so that variations in muscle contraction cannot be distinguished precisely, even though the EMG signals shown in Fig. 6.3a, b belong to completely distinct hand gestures; this renders classifying EMG signals belonging to distinct hand gestures, or distinct contractions of muscles, a difficult objective. Furthermore, the noise is usually of a stochastic nature that has a direct impact on the modeling process of sensed EMG signals. More specifically, let x_s(t) be the sensed EMG signal, which can be described as

$$x_s(t) = x(t) + \omega(t), \qquad (6.2)$$

where x(t) is the signal component corresponding to muscles contractions and ω(t) is the noise component of sensed EMG signals. The stochastic nature of the noise ω(t) is reflected on sensed EMG signals xs (t). Thus, if one develops a model to classify sensed EMG signals, then the stochastic nature of noise makes the decision boundary imprecise in its distinguishing ability when employed in testing phases. Indeed, stochasticity of ω(t) renders features of sensed EMG signals to vary and a decision boundary of models trained on a certain data set might not be valid for another data set. Thus and as a summary, the noise has the following effects on sensed EMG signals. • Its high amplitude, compared with the signal component corresponding to muscle contractions, renders use of sensed EMG signals, to distinguish distinct muscle contractions, unreliable since noise dominance renders EMG signals for distinct muscle contractions to have similar shape. • Even if its amplitude is not relatively high, the stochastic nature of noise renders developing a decision boundary, to distinguish distinct contractions of muscles, susceptible to performance degradation. Therefore, modeling sensed EMG signals, in the presence of noise, is a challenging task. The main objective of this chapter is to enhance the classification process by filtering out noise from sensed EMG signals with the use of the GSF before performing the classification step. Employing GSF enables constructing an estimate of the signal component, corresponding to muscles contractions, say x(t), from sensed EMG signals xs (t) and helps in suppressing—and degrading—the influence of noise ω(t), for enhancing the EMG classification process. Hence, the main contributions of this chapter are: • Filter out the noise of sensed EMG signals by using GSF. • Enhance the EMG classification process by using GSF-filtered EMG signals. The following section will detail GSF and its application to enhance EMG classification process.


6.4 GSF-Based Enhanced Classification Process

To outline the suggested GSF-based enhanced classification approach, we begin with a review of several related topics, including the concept of the GSF and various existing classification techniques, to see the impact of the GSF on multiple classification methods. Then, the GSF and the classification techniques are used to enhance the process of distinguishing hand gestures from EMG signals. The enhancement of classification accuracy obtained with the GSF is compared with that of the MF to gain further insight into the performance of multiple filtering techniques.

6.4.1 Gaussian Smoothing Filter (GSF)

The Gaussian smoothing filter (GSF) has been shown to provide excellent filtering performance in varied applications, e.g., filtering robot signals [80, 81] and object area filtering in images [82], among others. The main reasons behind the use of the GSF in diverse applications are:

• Simple implementation, yet efficient performance: it has only one characterizing parameter, allowing the GSF to be applied with reasonable effort.
• Similarity of supports in the time and frequency domains, which makes understanding and predicting the behavior of the GSF on EMG signals, in both domains, an easy and straightforward task.
• Its Gaussian shape helps strike a compromise between filtering out noise and keeping part of the high frequency components of sensed EMG signals.

To demonstrate these three main features, let us start with the impulse response that characterizes a GSF, which can be described by [83]

$$ g(x_s(t)) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_s(t))^2}{2\sigma^2}\right), \qquad (6.3) $$

where x_s(t) is the sensed EMG signal that is required to be smoothed and σ is the standard deviation of the GSF. It is obvious that the GSF, given in Eq. (6.3), has only one characterizing parameter, σ, rendering the GSF simple to implement. To estimate the component of x_s(t) that corresponds to muscle contractions, say x(t), using the GSF, one can convolve the impulse response of the GSF, given in Eq. (6.3), with the sensed EMG signal, which yields

$$ \hat{x}(t) = x_s(t) * g(x_s(t)), \qquad (6.4) $$


Fig. 6.4 An example of the impulse response of a Gaussian smoothing filter (GSF) with σ = 1. The GSF has equal supports in both time and frequency domains

where x̂(t) is the estimate of x(t) and ∗ denotes the convolution operator. Next, using the convolution integral, Eq. (6.4) can be rewritten as

$$ \hat{x}(t) = \int_{-\infty}^{\infty} x_s(\tau)\, g(x_s(t-\tau))\, d\tau. \qquad (6.5) $$

Figure 6.4 shows the impulse response of the GSF. Taking the Fourier transform of Eq. (6.3), we obtain

$$ G(f) = \exp\!\left(-2\pi^2 f^2 \sigma^2\right), \qquad (6.6) $$

where f denotes the frequency. Equation (6.6) is also a Gaussian function, which reveals that the time and frequency domain responses of the GSF have a similar, Gaussian, support. Even though a GSF behaves like a low-pass filter (LPF), its Gaussian shape provides a good compromise by partially keeping the high frequency components of the original signal, reducing possible distortion of the signal during the filtering process. Thus, the GSF provides efficient filtering performance, is simple to realize, has similar supports in the time and frequency domains, and offers a compromise between filtering noise and preserving high frequency components of the signal. A minimal sketch of this filtering step is given below.

6.4.2 Filtered EMG Signals Classification

Using Eq. (6.5), one can rewrite Eq. (6.1) in terms of the filtered signal x̂(t) as

$$ y_{\mathrm{phase}_l}(t) = \begin{cases} 1 & \text{if } \hat{x}(t) \in \mathrm{phase}_l \\ 0 & \text{otherwise}. \end{cases} \qquad (6.7) $$


To realize Eq. (6.7), existing classification techniques can be employed to develop models that accurately approximate it. Before explaining the classification techniques employed in this chapter, we elaborate on the effect of employing filtered EMG signals in the framework of Eq. (6.7) compared with employing the non-filtered EMG signals of Eq. (6.1). The signal x̂(t) is an estimate of the component of the EMG signal reflecting muscle contractions, x(t). Though the noise might not be completely eliminated, the signal x̂(t) has the following features:

• potential dominance of x̂(t) with respect to the filtered noise component, obtained by selecting an appropriate value of the GSF parameter σ;
• reduction of the filtered noise, which reduces the effect of the stochasticity of the noise on the decision boundary and enhances classification performance, should the GSF parameter be judiciously selected.

Thus, the first stage is filtering out the noise from the sensed EMG signals. Once the sensed EMG signals are denoised, the resulting filtered EMG signals are used to develop models using one of the classification techniques; a minimal sketch of this two-stage process is given after this paragraph. The pattern classification literature offers many techniques for such a classification task. Below, we summarize several well-known classification techniques; further details can be found in the classification and statistical modeling literature [84, 85]. It is worth mentioning that each classifier detailed below is applied independently of the others, in order to gain a precise indication of the accuracy of each considered classification technique.
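The two-stage process just described might be organized as in the following minimal sketch, assuming each time sample is an eight-dimensional feature vector labeled with its phase; the classifier choice and all names are illustrative.

```python
# A sketch of the two-stage GSF-then-classify process, assuming each time
# sample is an 8-dimensional feature vector labeled with its phase.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.neighbors import KNeighborsClassifier

def train_on_filtered(emg, labels, sigma=2.0, k=50):
    # Stage 1: GSF denoising of every channel (Eq. (6.5)).
    filtered = gaussian_filter1d(emg, sigma=sigma, axis=0)
    # Stage 2: fit a classifier realizing the mapping of Eq. (6.7).
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(filtered, labels)
    return clf, filtered

# Synthetic stand-in data: 1000 samples of 8-channel EMG with binary phases.
rng = np.random.default_rng(1)
emg = rng.standard_normal((1000, 8))
labels = (emg.sum(axis=1) > 0).astype(int)
clf, filtered = train_on_filtered(emg, labels)
```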

6.4.3 Support Vector Machine (SVM)

The support vector machine (SVM) is considered one of the most efficient classification techniques and has been successfully utilized in many data-driven modeling tasks, e.g., healthcare [86, 87], computer vision [88, 89], and EMG signal modeling [48], to name a few. Consider a given data set D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i ∈ R^r is the input vector (the EMG signals in our case) and y_i ∈ {−1, 1} is a label indicating whether the vector x_i belongs to a certain class (y_i = 1) or not (y_i = −1). The main idea of the SVM is to develop models maximizing the hyperplane margins that separate the data x_i belonging to distinct classes, here the classes with labels −1 and 1. If the data set D is linearly separable, then one can imagine a hyperplane margin bounded by two parallel lines described by

$$ w^T x_i - b = 1 \qquad (6.8) $$

and

$$ w^T x_i - b = -1, \qquad (6.9) $$


where Eq. (6.8) gives the boundary of the data belonging to the class with label 1, Eq. (6.9) is the corresponding boundary of the data belonging to the class with label −1, w is a vector normal to the hyperplane of the line equations of Eqs. (6.8) and (6.9), and b is the intercept of the linear boundary equation. Thus, if the vector x_i satisfies w^T x_i − b ≥ 1, then it is deemed to belong to the class with label 1, while vectors achieving w^T x_i − b ≤ −1 are deemed to belong to the class with label −1. The margin between the lines of Eqs. (6.8) and (6.9) can be found to be 2/‖w‖, and maximization of such a margin can be achieved by minimizing (1/2)‖w‖². Thus, for a linearly separable data set, the SVM can be formulated as the solution of the following constrained optimization problem:

$$ \min_{w} J(w) = \frac{1}{2}\left\lVert w \right\rVert^2, \qquad (6.10) $$

subject to

$$ w^T x_i - b \ge 1 \qquad (6.11) $$

and

$$ w^T x_i - b \le -1. \qquad (6.12) $$

Since Eqs. (6.11) and (6.12) correspond to the cases y_i = 1 and y_i = −1, respectively, one can replace them with the single constraint

$$ y_i\left(w^T x_i - b\right) \ge 1. \qquad (6.13) $$

Thus, for linearly separable data, maximizing the marginal hyperplane can be achieved by solving the minimization problem of Eq. (6.10) subject to Eq. (6.13). Note that linearly separable data allow the establishment of linear boundaries separating the data belonging to distinct labels. For data that are not linearly separable, maximizing the marginal hyperplane can be established by plugging the following hinge loss function

$$ \xi_i = \max\!\left(0,\; 1 - y_i\left(w^T x_i - b\right)\right) \qquad (6.14) $$

into the cost function of Eq. (6.10), yielding a new constrained optimization problem of the form

$$ \min_{w,\,\xi_i} J(w, \xi_i) = \frac{1}{2}\left\lVert w \right\rVert^2 + C\sum_{i=1}^{N} \xi_i, \qquad (6.15) $$

subject to

$$ y_i\,\psi\!\left(w^T x_i\right) \ge 1 - \xi_i, \qquad (6.16) $$

where (w, ξ_i) are the SVM parameters with ξ_i ≥ 0, C is a constant regularization parameter, and ψ(·) characterizes the classifier. Thus, estimating w and ξ_i for a given training set produces the boundaries for class separation, which yields good classification performance. The optimization problem of Eqs. (6.15) and (6.16) can be solved using various techniques, for example, the frequently employed Lagrange multiplier method, which yields excellent classification performance (see [90] for further details about the SVM). A hedged training sketch follows.
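As a concrete illustration, the soft-margin SVM of Eqs. (6.15) and (6.16) can be trained with an off-the-shelf solver; the sketch below uses synthetic stand-in data, and the linear kernel follows the setting reported in Sect. 6.5.

```python
# A minimal soft-margin SVM sketch on synthetic stand-in "filtered EMG" data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))          # stand-in GSF-filtered 8-channel samples
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in binary phase labels

svm = SVC(kernel="linear", C=1.0)          # C weights the hinge-loss terms of Eq. (6.14)
svm.fit(X[:300], y[:300])
print(svm.score(X[300:], y[300:]))         # held-out accuracy
```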

6.4.4 k-Nearest Neighbor (k-NN)

Due to its efficient performance, the k-nearest neighbor (k-NN) technique has been successfully used to model data in diverse application domains, e.g., time-series analysis [91], classification of heart disease [92], and EMG signal classification [13], among others. In k-NN, one employs the k values nearest to a certain point to make a decision for that point of interest. More specifically, for the data set D, the k-NN technique estimates the output of a classifier using

$$ \hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \qquad (6.17) $$

where ŷ(x) is the classifier output, N_k(x) is the neighborhood containing the k closest points to x, namely x_i, i = 1, ..., k, and y_i is the class label corresponding to the point x_i. The neighborhood can be specified using any of the following distance metrics, among others:

$$ d\left(x_i, x_j\right) = \sqrt{\sum_{i=1,\,i \ne j}^{k} \left(x_i - x_j\right)^2}, \qquad \text{(Euclidean)} \qquad (6.18) $$

$$ d\left(x_i, x_j\right) = \sum_{i=1,\,i \ne j}^{k} \left|x_i - x_j\right|, \qquad \text{(Manhattan)} \qquad (6.19) $$

$$ d\left(x_i, x_j\right) = \left(\sum_{i=1,\,i \ne j}^{k} \left|x_i - x_j\right|^p\right)^{\frac{1}{p}}. \qquad \text{(Minkowski)} \qquad (6.20) $$


If ŷ(x) is greater than a certain threshold, e.g., 0.5 for binary classification with 0/1 outputs, then x is deemed to belong to one class; otherwise it belongs to the other class (see [85] for further details about the k-NN technique). The k-NN is one of the most popular classification techniques due to its efficient modeling ability, even when the boundary between data belonging to distinct classes is nonlinear, which renders the k-NN technique robust against noise and accurate, especially for large data sets. The k-NN modeling technique has only one parameter, k, specifying the model performance. Generally, larger values of k yield more accurate modeling results. However, increasing k adds to the computational effort, since more neighbors must be found and aggregated for each sample whose model output is to be predicted. Hence, the value of k plays a key role in determining the nature of the k-NN model, and its optimal estimation is not straightforward. Nevertheless, the value of k is usually chosen such that good accuracy is obtained while the computational cost remains reasonable. One more obstacle in the implementation of k-NN is the choice of the distance metric for a given data set, since this metric directly impacts the size of the k-neighborhood and the shape of the clusters. For simplicity, the Euclidean distance is employed in many applications of k-NN. However, the nature of the data itself might make another distance metric more suitable for a given modeling process. A short sketch follows.
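A minimal k-NN sketch follows; k = 50 and the Euclidean metric (Minkowski with p = 2) mirror the settings reported in Sect. 6.5, while the data are synthetic stand-ins.

```python
# A minimal k-NN sketch; p = 1 would select the Manhattan metric of Eq. (6.19).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.standard_normal((600, 8))                  # stand-in filtered EMG samples
y = (np.linalg.norm(X, axis=1) > 2.7).astype(int)  # stand-in nonlinear boundary

knn = KNeighborsClassifier(n_neighbors=50, metric="minkowski", p=2)
knn.fit(X[:450], y[:450])
print(knn.score(X[450:], y[450:]))                 # held-out accuracy
```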

6.4.5 Naïve Bayes Classification (NBC)

The naïve Bayes classification (NBC) technique has been efficiently employed in applications such as medical image analysis [93], abnormality detection in ECG signals [94], and EMG signal classification [95], to name a few. The main objective of NBC is to develop models that reduce the probability of misclassifying a given data set. Given the data set D, suppose that, when y takes the value c_k, the input x̂ is characterized by

$$ \hat{x} \mid y = c_k \sim p_{c_k}, \qquad (6.21) $$

where k = 1, ..., K and p_{c_k} is the corresponding probability distribution. The risk of misclassification can be described as

$$ R\left(C\left(\hat{X}\right)\right) = p\left(C\left(\hat{X}\right) \ne Y\right), \qquad (6.22) $$

where X̂ = [x̂_1, ..., x̂_N] is the vector of input features, C(X̂) is the output matrix, and Y is the label matrix. To reduce the risk function of Eq. (6.22), one needs to develop models for which the likelihood of a certain input x̂_i belonging to a certain class c_k is maximum. That is,

$$ c_k = \arg\max_{c_k \in \{c_1, \ldots, c_K\}} p\left(y = c_k \mid \hat{x} = \hat{x}_i\right), \qquad (6.23) $$

where p(y = c_k | x̂ = x̂_i) is the likelihood that x̂_i belongs to class c_k. To compute the value of this likelihood, let P_k(x) = p(x = x_i | y = c_k) denote the density function of x obtained from the kth class, where c_k is the label of the kth class of y. Using the Bayes rule, one can show that

$$ p\left(y = c_k \mid x = x_i\right) = \frac{\pi_k P_k(x)}{\sum_{j=1}^{K} \pi_j P_j(x)}, \qquad (6.24) $$

where π_k is the prior probability of the kth class, which can be computed as the ratio of the number of training samples belonging to the kth class to the total number of samples of the given data set. p(y = c_k | x = x_i) is also called the posterior probability that y = c_k given the predictor x_i. Thus, the class with the largest posterior probability for a predictor x_i is judged to be the class to which x_i corresponds; this is the naïve Bayes classification (NBC). Note that P_k(x) plays a key role in the accuracy of the classification process, and one of the simplest, yet efficient, schemes is to employ a Gaussian function to approximate P_k(x) (further details about NBC can be found in [85]), i.e.,

$$ P_k(x) = \frac{1}{(2\pi)^{\frac{n}{2}} \left|\Sigma\right|^{\frac{1}{2}}} \exp\!\left(-\frac{1}{2}\left(x - \mu\right)^T \Sigma^{-1} \left(x - \mu\right)\right), \qquad (6.25) $$

where μ and Σ are the mean and covariance matrix of the Gaussian function. A small usage sketch follows.
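A brief sketch of NBC with Gaussian class-conditional densities follows; the two synthetic classes are illustrative assumptions.

```python
# A minimal Gaussian naive Bayes sketch on two synthetic classes.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)),
               rng.normal(1.5, 1.0, (200, 8))])
y = np.repeat([0, 1], 200)

nbc = GaussianNB()                       # Gaussian class-conditionals, as in Eq. (6.25)
nbc.fit(X, y)
posteriors = nbc.predict_proba(X[:3])    # posterior probabilities of Eq. (6.24)
print(posteriors, nbc.predict(X[:3]))    # arg-max rule of Eq. (6.23)
```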

6.4.6 Linear Discriminant Analysis (LDA)

The linear discriminant analysis (LDA) classification technique [84, 85] is developed based on the NBC classification technique and has been widely used in many applications such as EMG signal modeling [96], cancer classification [97], and remote sensing [98], among others. Suppose that P_k(x) is approximated by a Gaussian function. Then, we have

$$ P_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{\left(x - \mu_k\right)^2}{2\sigma_k^2}\right), \qquad (6.26) $$

where σ_k and μ_k are the standard deviation and mean of the signals corresponding to the kth class. Substituting Eq. (6.26) into Eq. (6.24) yields


$$ p\left(y = c_k \mid x = x_i\right) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)}{\sum_{j=1}^{K} \pi_j \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x-\mu_j)^2}{2\sigma_j^2}\right)}, \qquad (6.27) $$

and taking the log of Eq. (6.27), we obtain the discriminant

$$ \delta_k(x) = x\,\frac{\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} + \log\left(\pi_k\right). \qquad (6.28) $$

Given a set of data, one can estimate μk and σk2 by using 1  xi , nk

(6.29)

K   1 xi , N −K

(6.30)

nk , N

(6.31)

μˆ k =

i:yi =ck

σˆ 2 =

k=1 i:yi =ck

and πˆ k =

where n_k is the number of training samples of the kth class. Hence, if one assumes that the data x are drawn from a Gaussian distribution, then by substituting these estimates into Eq. (6.28), one obtains

$$ \hat{\delta}_k(x) = x\,\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log\left(\hat{\pi}_k\right). \qquad (6.32) $$

Thus, a sample is deemed to belong to a certain class if Eq. (6.32) results in the largest value for that class. This is why Eq. (6.32) is called a discriminant and its linearity with respect to the predictor gives the linear attribute of the LDA classification process.
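The estimates of Eqs. (6.29)–(6.31) and the discriminant of Eq. (6.32) can be computed directly, as in the following one-dimensional sketch with synthetic data.

```python
# A one-dimensional LDA sketch implementing Eqs. (6.29)-(6.32) directly.
import numpy as np

rng = np.random.default_rng(3)
x0 = rng.normal(-1.0, 1.0, 100)    # class 0 samples
x1 = rng.normal(+1.0, 1.0, 100)    # class 1 samples
x = np.concatenate([x0, x1])
y = np.repeat([0, 1], 100)
N, K = len(x), 2

mu = np.array([x[y == k].mean() for k in range(K)])                          # Eq. (6.29)
sigma2 = sum(((x[y == k] - mu[k]) ** 2).sum() for k in range(K)) / (N - K)   # Eq. (6.30)
pi = np.array([(y == k).mean() for k in range(K)])                           # Eq. (6.31)

def delta(xi):
    # Linear discriminant of Eq. (6.32), evaluated for every class at once.
    return xi * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)

print(np.argmax(delta(0.8)))   # class with the largest discriminant value
```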

6.4.7 Gaussian Mixtures Model (GMM)-Based Classifier

Gaussian mixtures model (GMM)-based classification [84, 85] is an extension of NBC that has been employed in many applications involving data-driven modeling, e.g., robotics and automation [99–101] and EMG signal modeling [102, 103], among others. Instead of using a single Gaussian function to approximate the PDF, as in the NBC technique of Eq. (6.25), the GMM employs a superposition of

multiple Gaussian components to obtain the PDF of the classifier, improving performance in the data classification process. That is, the PDF is approximated as

$$ P_k(x) = \sum_{q=1}^{M} \omega_q\, N_q\!\left(x, \mu_q, \Sigma_q\right), \qquad (6.33) $$

where N_q(x, μ_q, Σ_q) is the qth component, described by the Gaussian function

$$ N_q\!\left(x, \mu_q, \Sigma_q\right) = \frac{1}{(2\pi)^{\frac{r}{2}} \left|\Sigma_q\right|^{\frac{1}{2}}} \exp\!\left(-\frac{1}{2}\left(x - \mu_q\right)^T \Sigma_q^{-1} \left(x - \mu_q\right)\right), \qquad (6.34) $$

μ_q, Σ_q, and ω_q are the mean, covariance, and weight, respectively, of the qth component, and M is the total number of components. Let θ_q represent the set of GMM parameters of the qth component, i.e., θ_q = (ω_q, μ_q, Σ_q), and let θ = [θ_1, ..., θ_M]^T. Then, the PDF P_k(x) of Eq. (6.33), in terms of the parameter set θ, can be rewritten as

$$ P_k(x \mid \theta) = \sum_{q=1}^{M} \omega_q\, N_q\!\left(x, \mu_q, \Sigma_q\right). \qquad (6.35) $$

Consider the log-likelihood below as an objective function:

$$ L(x, \theta) = \sum_{n=1}^{N} \ln\left(P_k\left(x_n \mid \theta\right)\right). \qquad (6.36) $$

Now, the set of parameters θ maximizing L(x, θ) is deemed the optimal parameter set θ*, i.e.,

$$ \theta^* = \arg\max_{\theta} L(x, \theta). \qquad (6.37) $$

The sum of the weights for a given sample must satisfy

$$ \sum_{q=1}^{M} \omega_q = 1, \qquad (6.38) $$

which serves as a constraint on the objective function of Eq. (6.37). An analytical solution to such a constrained optimization is usually intractable due to the high dimensionality of the given data. Iterative techniques are often employed to solve such problems, and expectation maximization (EM) is considered an efficient technique for estimating the optimal parameter set θ*. To summarize the EM algorithm, consider


a set of latent variables g = {g_1, ..., g_M}, with g_q ∈ {0, 1} and Σ_{q=1}^{M} g_q = 1. Then, the joint probability distribution p(x, g) can be written as

$$ p(x, g) = p(g)\, p(x \mid g), \qquad (6.39) $$

where p(g) is the marginal distribution over g and p(x | g) is the conditional probability. The marginal distribution can be characterized by

$$ p(g) = \prod_{i=1}^{M} \omega_i^{g_i}, \qquad (6.40) $$

while the conditional probability is given by

$$ p(x \mid g) = \prod_{q=1}^{M} N_q\!\left(x, \mu_q, \Sigma_q\right)^{g_q}. \qquad (6.41) $$

Using Eqs. (6.33) and (6.39)–(6.41), we obtain

$$ p(x \mid \theta) = \sum_{g} p(g)\, p(x \mid g) = \sum_{q=1}^{M} \omega_q\, N_q\!\left(x, \mu_q, \Sigma_q\right). \qquad (6.42) $$

An important quantity in the EM algorithm is the conditional probability of g_i given a sample x_j, i.e., p(g_i = 1 | x_j), which is denoted by γ(g_ij) and can be characterized as follows:

$$ \gamma\left(g_{ij}\right) = \frac{p\left(g_i = 1\right) p\left(x_j \mid g_i = 1\right)}{\sum_{m=1}^{M} p\left(g_m = 1\right) p\left(x_j \mid g_m = 1\right)}. \qquad (6.43) $$

γ(g_ij) of Eq. (6.43) can be rewritten as

$$ \gamma\left(g_{ij}\right) = \frac{\omega_i\, N_i\!\left(x_j, \mu_i, \Sigma_i\right)}{\sum_{m=1}^{M} \omega_m\, N_m\!\left(x_j, \mu_m, \Sigma_m\right)} \qquad (6.44) $$

and is called the responsibility that the ith component takes for explaining the sample x_j. Now, the EM algorithm can be summarized in the following steps.

• Step 1: Initialize the parameter vectors θ_q = (ω_q, μ_q, Σ_q) and the convergence tolerances ε (on the parameters) and δ (on the log-likelihood).
• Step 2: (E-Step) Given the current parameter vector θ_q, compute the responsibilities using Eq. (6.44).


• Step 3: (M-Step) Re-estimate the parameters using the current responsibilities:

$$ \mu_q^{\text{new}} = \frac{1}{N_q} \sum_{j=1}^{N} \gamma\left(g_{qj}\right) x_j, \qquad (6.45) $$

$$ \Sigma_q^{\text{new}} = \frac{1}{N_q} \sum_{j=1}^{N} \gamma\left(g_{qj}\right) \left(x_j - \mu_q^{\text{new}}\right) \left(x_j - \mu_q^{\text{new}}\right)^T, \qquad (6.46) $$

$$ \omega_q^{\text{new}} = \frac{N_q}{N}, \qquad (6.47) $$

with

$$ N_q = \sum_{j=1}^{N} \gamma\left(g_{qj}\right). \qquad (6.48) $$

• Step 4: Compute the log-likelihood using

$$ L^{\text{new}} = \sum_{j=1}^{N} \ln\left\{\sum_{q=1}^{M} \omega_q^{\text{new}}\, N_q\!\left(x_j, \mu_q^{\text{new}}, \Sigma_q^{\text{new}}\right)\right\}. \qquad (6.49) $$

• Step 5: Check for convergence. If |θ^new − θ| ≤ ε or |L^new − L| ≤ δ, then stop; otherwise, go to Step 2.

The EM-GMM classification technique can potentially accommodate non-normal distributions in sensed EMG signals, thus enhancing the classification performance. For a detailed exposition of the GMM technique, see Bishop [84]. A brief per-class usage sketch follows.
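A per-class GMM classifier in the spirit of this section might be sketched as follows; fitting one three-component mixture per class mirrors the setting reported in Sect. 6.5, the EM iterations are performed internally by the library, and equal class priors are an assumption of this sketch.

```python
# A per-class GMM classifier sketch: one mixture per class, fitted by EM
# (E-step: responsibilities of Eq. (6.44); M-step: Eqs. (6.45)-(6.47)).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X0 = rng.standard_normal((300, 8))          # class 0 samples
X1 = rng.standard_normal((300, 8)) + 1.2    # class 1 samples

gmm0 = GaussianMixture(n_components=3, random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=3, random_state=0).fit(X1)

# Classify new samples by the larger per-class log-likelihood (Eq. (6.36)),
# assuming equal class priors in this sketch.
Xnew = rng.standard_normal((5, 8)) + 1.2
scores = np.stack([gmm0.score_samples(Xnew), gmm1.score_samples(Xnew)])
print(scores.argmax(axis=0))
```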

6.5 Experimental Validations

To evaluate the performance of the GSF and its impact on the EMG classification problem for hand gesture recognition, we consider two experiments: the first addresses a hand gesture classification task, while the second addresses classifying the phases of motion of a hand during a grasping task. In both experiments, we use a Thalmic Labs MYO band, containing eight EMG sensors, to capture the EMG signals of the hand during task execution. The sampling rate of the EMG signals is 200 Hz, and all computations are performed using


MATLAB running on a 64-bit computer with an AMD 2.00 GHz CPU and 16.0 GB RAM operating under Microsoft Windows. In the experimental validations below, a linear kernel, three components, and k = 50 points are used for the SVM, GMM, and k-NN techniques, respectively. These values were found after several trials. A sketch of the cross-validated comparison is given below.
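A cross-validated comparison of raw versus GSF-filtered signals might be organized as in the sketch below; the data are synthetic stand-ins, and the GMM-based classifier is omitted here since it requires the per-class wrapper sketched earlier.

```python
# A 10-fold cross-validation sketch comparing raw and GSF-filtered signals.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 8))            # stand-in sensed EMG samples
y = (X.sum(axis=1) > 0).astype(int)           # stand-in phase labels
Xf = gaussian_filter1d(X, sigma=2.0, axis=0)  # GSF-filtered version

models = {"SVM": SVC(kernel="linear"),
          "LDA": LinearDiscriminantAnalysis(),
          "NBC": GaussianNB(),
          "k-NN": KNeighborsClassifier(n_neighbors=50)}
for name, model in models.items():
    raw = cross_val_score(model, X, y, cv=10).mean()    # no shuffling by default
    flt = cross_val_score(model, Xf, y, cv=10).mean()
    print(f"{name}: {raw:.3f} (raw) vs {flt:.3f} (filtered)")
```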

6.5.1 Experiment 1: Hand Gestures

Figure 6.5 shows the settings of experiment 1, where six distinct hand gestures are considered: flexion, extension, wrist flexion, wrist extension, pinching, and index extension, named in this experiment Phases 1, 2, 3, 4, 5, and 6, respectively (i.e., Phase_l, l = 1, ..., 6). Figure 6.6 shows the unfiltered EMG signals for all six phases of hand gestures considered in this section. For this experiment, the numbers of samples used in the six phases were as follows.

Fig. 6.5 Experiment 1: MYO band with multiple situations: (a) Flexion, (b) Extension, (c) Wrist flexion, (d) Wrist extension, (e) Pinching, and (f) Index extension


Fig. 6.6 Experiment 1: the unfiltered MYO band EMG signals. It is obvious that sensed EMG signals are contaminated with a significant amount of noise, degrading the distinguishability between signals belonging to distinct hand gestures

• Phase 1: 2570 samples
• Phase 2: 2121 samples
• Phase 3: 2147 samples
• Phase 4: 1483 samples
• Phase 5: 1161 samples
• Phase 6: 1229 samples

The samples of Phases 1–6 were taken from a single run of the hand, without repetition, to gain insight into the performance of employing the GSF in the classification of the given phases. Moreover, the sensed EMG signals correspond to isometric contractions of the muscles of the hand under study, and no dynamic contractions were imposed when collecting the signals. Tenfold cross validation was employed to evaluate the classification performance of the sensed EMG signals, without any random shuffling or selection applied before partitioning the data


considered in this experiment. The partitioning of the data into ten partitions was conducted automatically in the classification, with nine partitions taken as the training set and one as the test set. The overall classification performance is computed by estimating the mean over all ten possible training-test combinations of the resulting partitions, to give a more accurate evaluation of the considered classification techniques. We have a relatively small number of phases, and the ground truth of the classification task was generated manually. Employing the SVM, LDA, NBC, k-NN, and GMM to classify the EMG signals of Fig. 6.6 resulted in classification accuracies of 79.82%, 83.33%, 86.31%, 93.56%, and 90.97%, respectively. Using a GSF with σ = 2, which was manually tuned and selected, the obtained filtered EMG signals are shown in Fig. 6.7. It is worth mentioning that σ = 2 is not an optimal value of the filter parameter, though it was obtained after several trials.


Fig. 6.7 Experiment 1: the filtered MYO band EMG signals. The amount of noise was drastically reduced with the use of GSF, increasing the distinguishability of sensed EMG signals of distinct hand gestures


Table 6.1 Enhancement of EMG signal classification in experiment 1

| Scheme | Performance (%) without GSF | Performance (%) with GSF |
|--------|-----------------------------|--------------------------|
| SVM    | 79.82                       | 94.52                    |
| LDA    | 83.33                       | 93.19                    |
| NBC    | 86.31                       | 93.79                    |
| k-NN   | 93.56                       | 98.09                    |
| GMM    | 90.97                       | 95.92                    |


Fig. 6.8 The classification accuracy with and without using the GSF to remove noise from the sensed EMG signals of experiment 1, where it can be noticed that a significant enhancement in classification performance results when using the GSF to filter the sensed signals

Applied to the filtered EMG signals of Fig. 6.7, the SVM, LDA, NBC, k-NN, and GMM classification techniques yielded classification accuracies of 94.52%, 93.19%, 93.79%, 98.09%, and 95.92%, respectively. Thus, the use of the GSF in the EMG classification process results in a significant improvement in performance for all classification techniques considered in this experiment. Such enhancement in classification performance is a consequence of filtering out the noise from the EMG signals. Table 6.1 summarizes the classification accuracy of the aforementioned techniques with and without using the GSF to filter out the noise. Furthermore, Fig. 6.8 visualizes the classification performance with and without employing the GSF, given in Table 6.1; one can observe a noticeable enhancement when using the GSF to remove the noise from the EMG signals, despite the noise properties, such as its power spectral density, being unknown. Measuring the time required for developing and testing the models of the unfiltered EMG signals resulted in a total computational time of 402.96 s when using the SVM, 9.33 s for the LDA, 10.62 s when employing the NBC, 10.18 s in the case of k-NN, and 11.39 s for the GMM technique. The corresponding times for the filtered EMG signals were found to be 381.11 s when using the SVM, 4.14 s for the LDA classifier, 6.54 s when


Table 6.2 Cumulative computational time of experiment 1

| Scheme | Time (s) without GSF | Time (s) with GSF |
|--------|----------------------|-------------------|
| SVM    | 402.96               | 381.11            |
| LDA    | 9.33                 | 4.14              |
| NBC    | 10.62                | 6.54              |
| k-NN   | 10.18                | 5.14              |
| GMM    | 11.39                | 7.25              |


Fig. 6.9 Computational time, required for developing and testing the models, with and without employment of GSF in experiment 1. It is clear that a significant reduction in computational time results with the use of GSF

employing the NBC, 5.14 s in the case of k-NN, and 7.25 s for the GMM technique. Table 6.2 summarizes the computational times of the considered techniques with and without employing the GSF. Figure 6.9 visualizes the data of Table 6.2; the measured computational time is reduced significantly for all classification techniques when using the GSF to filter out the noise from the sensed EMG signals. Employing a tenth-order median filter (MF) to smooth the EMG signals of Fig. 6.6 results in 90.91%, 89.50%, 89.53%, 95.43%, and 94.08% classification performance with the SVM, LDA, NBC, k-NN, and GMM techniques, respectively. Clearly, the MF provides a significant enhancement in classifying the considered EMG signals. Using the previously obtained classification performance with raw EMG signals and GSF-filtered EMG signals (see Table 6.1 and Fig. 6.8), we computed the improvement in the classification performance of the five considered techniques for GSF versus raw signals and for MF versus raw signals. These improvements are shown in Fig. 6.10, which shows that the GSF-filtered EMG signals yield superior performance with all five classification techniques. A small filtering comparison sketch follows.
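The GSF-versus-MF comparison can be reproduced in spirit with the following sketch; the tenth-order median filter is approximated by an 11-sample window (the nearest odd length), and the signal is a synthetic stand-in.

```python
# Comparing Gaussian smoothing and median filtering on a synthetic signal.
import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

rng = np.random.default_rng(6)
t = np.linspace(0, 20, 2000)
clean = np.sin(t)                                   # underlying signal x(t)
noisy = clean + 0.8 * rng.standard_normal(t.size)   # sensed signal x_s(t)

gsf_out = gaussian_filter1d(noisy, sigma=2.0)       # Gaussian smoothing filter
mf_out = median_filter(noisy, size=11)              # ~tenth-order median filter

# Residual noise power after each filter (lower is better here).
print(np.var(gsf_out - clean), np.var(mf_out - clean))
```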



Fig. 6.10 Improvement in classification performance when using the Gaussian smoothing filter (GSF) and median filter (MF) in experiment 1

6.5.2 Experiment 2: Grasping Task

In this experiment, we consider another task: grasping, lifting, and re-orienting an object by hand. Figure 6.11 shows the settings of this experiment, which is composed of four phases: hand approaching, grasping, lifting, and re-orienting the object, denoted as Phases 1, 2, 3, and 4, respectively. Each of the four phases is executed four times, and the corresponding EMG signals are collected and graphed as shown in Fig. 6.12. The numbers of samples used for the four phases of the signals of Fig. 6.12 were as follows:

• Phase 1: 4517 samples
• Phase 2: 3905 samples
• Phase 3: 4407 samples
• Phase 4: 4298 samples

Unlike experiment 1, the sensed EMG signals of this experiment correspond to both isometric and dynamic contractions of the arm muscles during Phases 1–4 of the given grasping task. It is evident from Fig. 6.12 that the EMG signals are contaminated with noise. Using the SVM, LDA, NBC, k-NN, and GMM to classify the EMG signals of Fig. 6.12 yielded classification performances of 72.82%, 77.23%, 80.01%, 87.38%, and 83.91%, respectively. In this case, using the sensed EMG signals of the four repetitions of each phase, fourfold cross validation was employed to evaluate the classification performance. The use of four repetitions of each phase in the fourfold cross validation broadens the approach employed in experiment 1, where data from a single performance of each phase were used. With a GSF of σ = 2, the EMG signals of Fig. 6.12 were filtered, yielding the


Fig. 6.11 Experiment 2 hand situations: (a) approaching, (b) grasping, (c) lifting, and (d) re-orienting an object

filtered versions shown in Fig. 6.13. Using the filtered EMG signals of Fig. 6.13 with the SVM, LDA, NBC, k-NN, and GMM classification techniques resulted in classification performances of 88.17%, 88.94%, 90.52%, 96.05%, and 92.30%, respectively. Table 6.3 summarizes the classification performance of the considered classification techniques before and after using the GSF to filter the noise from the sensed EMG signals, and Fig. 6.14 provides a graphical representation of the results of Table 6.3. Table 6.3 and Fig. 6.14 demonstrate that the use of the GSF produces a significant improvement in the performance of all five considered classification techniques. This is attributed to the ability of the GSF to suppress noise in sensed EMG signals while preserving the high frequency components of the original signals. The computational times required for developing and testing the models of the various classification techniques of this experiment were also measured. For the unfiltered EMG signals of Fig. 6.12, the results were 539.71, 12.10, 14.08, 13.02, and 16.92 s with the use of the SVM, LDA, NBC, k-NN, and GMM techniques, respectively. In contrast, for the filtered EMG signals of Fig. 6.13, the computational times obtained were 441.29, 7.16, 9.47, 7.89, and 11.07 s, respectively. Table 6.4 summarizes the computational times of all classification techniques considered in this experiment, before and after the GSF filtering process, and the values of Table 6.4 are graphed in Fig. 6.15. Both Table 6.4 and Fig. 6.15 clearly demonstrate a tangible reduction in computational times when using the GSF to filter the sensed EMG signals. For a further comparison, a tenth-order median filter (MF) is used to smooth the unfiltered EMG signals of Fig. 6.12. For the EMG signals filtered using the MF, the


Fig. 6.12 Experiment 2: the unfiltered MYO band EMG signals. It is obvious that the sensed EMG signals are contaminated with a significant amount of noise, degrading the distinguishability between signals belonging to distinct phases

classification performance was found to be 82.58%, 82.93%, 83.09%, 88.31%, and 86.29% with the SVM, LDA, NBC, k-NN, and GMM techniques, respectively. It is evident that the use of the MF to filter the EMG signals of experiment 2 enhanced the performance of all classification techniques vis-à-vis the unfiltered EMG signals. Finally, Fig. 6.16 compares the improvements achieved with the considered classification techniques when using the GSF and the MF to filter out the noise of the EMG signals of experiment 2. It is clear from Fig. 6.16 that the improvement when using the GSF exceeds the corresponding improvement achieved with the MF.


Fig. 6.13 Experiment 2: the filtered MYO band EMG signals. The amount of noise was drastically reduced with the use of the GSF, increasing the distinguishability of the sensed EMG signals of distinct phases

Table 6.3 Enhancement of EMG signal classification in experiment 2

| Scheme | Performance (%) without GSF | Performance (%) with GSF |
|--------|-----------------------------|--------------------------|
| SVM    | 72.82                       | 88.17                    |
| LDA    | 77.23                       | 88.94                    |
| NBC    | 80.01                       | 90.52                    |
| k-NN   | 87.38                       | 96.05                    |
| GMM    | 83.91                       | 92.30                    |



Fig. 6.14 The classification accuracy with and without using the GSF to remove noise from the sensed EMG signals of experiment 2, where it can be noticed that a significant enhancement in classification performance results when using the GSF to filter the sensed signals

Table 6.4 Cumulative computational time of experiment 2

| Scheme | Time (s) without GSF | Time (s) with GSF |
|--------|----------------------|-------------------|
| SVM    | 539.71               | 441.29            |
| LDA    | 12.10                | 7.16              |
| NBC    | 14.08                | 9.47              |
| k-NN   | 13.02                | 7.89              |
| GMM    | 16.92                | 11.07             |

6.6 Discussions

By analyzing the results obtained in experiments 1 and 2, the following observations can be made.

• The performance of the GSF in filtering out the noise stems from the similarity of its supports in the time and frequency domains, as both are Gaussian functions. Thus, even as the frequency components corresponding to the noise are attenuated, the high frequency components of the original EMG signals are partially preserved, without significantly degrading the signal quality. Tables 6.1 and 6.3, with Figs. 6.8 and 6.14, provide evidence of the improved classification performance with GSF-filtered versus raw sensed EMG signals.
• The rate of approximation r_N in a learning process is related to the smoothness of a signal by the relation [104]

$$ r_N = N^{-\frac{s}{N_d}}, \qquad (6.50) $$



Fig. 6.15 Computational time, required for developing and testing the models, with and without employment of GSF in experiment 2. It is clear that a significant reduction in computational time results with the use of GSF


Fig. 6.16 Improvement in classification performance when using the Gaussian smoothing filter (GSF) and median filter (MF) in experiment 2

where s is a smoothness measure of the signal, N is the number of training samples, and N_d is the dimensionality of the input training space. Thus, for fixed N and N_d, according to Eq. (6.50) one can deduce that an increase in the smoothness s of the EMG signals results in a decrease of the rate of approximation r_N, leading to a faster


approximation of the empirical risk function of the EMG classification process (see [90] for more details about the empirical risk function). When employing the GSF, the value of s increases, resulting in reduced values of r_N, which speeds up the process of approximating and minimizing the empirical risk function of the classification process. Indeed, reviewing the computational times given in Tables 6.2 and 6.4, with Figs. 6.9 and 6.15, one can see that smoothing the sensed EMG signals produced remarkable reductions in computational cost in both experiments, indicating that employing the GSF to filter out the noise of EMG signals not only enhances the classification performance but reduces the computational burden as well.
• Figures 6.10 and 6.16 clearly show that a significant enhancement is obtained for all classification techniques, in both experiments, when using the GSF to remove noise, reflecting the sensitivity of the considered classification techniques to noise; noise in EMG signals results in a change in the decision boundary of the classifiers, degrading their performance. The improvements shown in Figs. 6.10 and 6.16 demonstrate that the SVM classification technique witnessed the highest improvement in both experiments, with both the GSF and the MF used to filter the noise out of the sensed EMG signals. Such high improvement indicates that the SVM has a higher sensitivity to noise. Even though the hinge loss of Eq. (6.14) was plugged into the constrained optimization problem of Eqs. (6.15) and (6.16), it was not enough to accommodate the stochastic nature of the noise of the sensed EMG signals. Reviewing the improvements of the LDA classification technique, shown in Figs. 6.10 and 6.16, LDA has the second highest improvement in both experiments, with both GSF and MF filtering. This indicates that the linear nature of LDA, given in Eq. (6.32), reduces robustness against the stochastic noise present in EMG signals, if not well filtered. Indeed, even though their formulations and decision-making processes are different, both SVM and LDA exploit the linearity of the decision boundary, which makes them susceptible to stochastic noise, rendering such linear decision boundaries inefficient for precise decision-making during the test phases of EMG signals.
• The classification results of Tables 6.1 and 6.3, for experiments 1 and 2, respectively, reveal that all classification techniques yielded higher accuracy in experiment 1 than in experiment 2, even though the number of samples in experiment 2 is larger. Such a slight degradation in the classification performance of experiment 2 indicates that employing multiple runs for training and a completely different run for testing may produce lower performance compared with the single-run situation of experiment 1, where both training and test samples are taken from the same run. Such degradation may result from a potential increase in the difference between signals captured in the training and testing phases when they are taken from completely different runs. Different runs engender imperfections and differences in hand motions that are reflected in the sensed EMG signals, resulting in a greater difference between the training and test samples of a certain phase. Nonetheless, the use of multiple


runs may also reduce possible uncertainty in the training process and hence lower real-world misclassification.
• In both experiments, as summarized in Tables 6.1 and 6.3, the GMM classification resulted in better performance than the NBC classification. In the GMM technique, the PDF is approximated by multiple Gaussian components, as given in Eq. (6.35), enabling it to model sensed signals even if they are not normally distributed. In the NBC technique, the PDF is approximated by a single Gaussian function, rendering the modeling process susceptible to significant performance degradation if the data are not normally distributed. Indeed, a non-normal data distribution renders the Gaussian PDF of the NBC classification unfit for the purpose, causing a significant reduction of the corresponding posterior probability, given in Eq. (6.24), and yielding a tangible performance degradation.
• In both experiments, the k-NN classification technique resulted in the highest performance, due to the fact that k-NN develops a nonlinear decision boundary, adding further robustness. From Figs. 6.10 and 6.16, the k-NN classification technique is seen to yield the lowest improvement in classification performance for unfiltered versus filtered signals. Combining this observation with the fact that k-NN resulted in the highest performance suggests that k-NN is more robust against the noise of EMG signals. Nevertheless, comparing the performance of k-NN before and after employing the GSF to filter out the noise, significant performance improvement is evidenced, because k-NN classification using filtered EMG signals reduces the effect of the stochastic nature of the noise.
• From Tables 6.2 and 6.4, it is observed that the SVM classification technique necessitates the largest computational effort in both experiments, vis-à-vis the alternative techniques. Note that in the SVM modeling technique, kernel functions are imposed to add a nonlinear component to the decision boundary, helping to improve SVM performance in classifying a given data set. Finding the parameters of such imposed kernel functions requires solving a constrained optimization problem that adds a significant burden to the computational cost. As an alternative, the use of linear kernels can reduce the computational cost so that the optimization can be solved in a shorter time. However, linear kernel functions can degrade the classification performance, hindering efficient implementation of such a technique for EMG signal classification. In this chapter, quadratic kernel functions were used, resulting in reasonable performance at the expense of increased computational time. Additionally, the computational times reported in Tables 6.2 and 6.4 give a clear indication that the time required to develop and test the models of experiment 2 exceeds the corresponding values of experiment 1. This increase is a natural result of the significantly larger data set used in experiment 2 compared with experiment 1. However, the ratios of data set sizes and computational times for experiments 1 and 2 do not follow a straightforward proportionality relationship, indicating that the computational time relies on multiple factors, including the size of the given data sets, which require further investigation.


• Figures 6.10 and 6.16 illustrate that the MF yields an improvement in classification performance in both experiments. However, the GSF results in higher improvement than the MF for all classification techniques considered in experiments 1 and 2. The main idea of the MF is to compute the signal median within a certain window and utilize it to filter out noise within that specific window. The MF is relatively effective for signals with relatively small noise magnitudes. However, for EMG signals corrupted with relatively large noise magnitudes, the MF suffers significant performance degradation. For experiment 1, as seen from Figs. 6.6 and 6.7, the GSF removed a significant amount of noise from the sensed EMG signals, indicating that the noise magnitude in the input signals was relatively large. For experiment 2, similar conclusions can be drawn from Figs. 6.12 and 6.13. For these reasons, the MF resulted in less improvement in classification performance than was achieved with the GSF.
• The implementation of the GSF for filtering sensed EMG signals required manually tuning its single free parameter σ to obtain acceptable performance for all considered classification techniques. In this sense, the implementation of the GSF was simple and did not require significant effort. Nevertheless, the selected value σ = 2 is not an optimal realization of the GSF for the considered classification techniques and the data sets of both experiments. Developing a closed-form algorithm in which the GSF parameter is optimized could enhance the performance of the various classification techniques. Furthermore, such a closed-form estimation of the GSF parameter could help reduce the computational time of the SVM technique: as an optimized GSF effectively filters the noise, the use of linear kernels, rather than quadratic ones, may suffice. Moreover, treating the computational time as another cost function and tuning the GSF so that the computational time is also reduced would represent a tangible contribution; multiobjective optimization could possibly solve such a task. However, this is left to future research.

6.7 Conclusion

This chapter has suggested the use of the Gaussian smoothing filter (GSF) to enhance the classification of sensed electromyography (EMG) signals. These EMG signals are shown to be contaminated with a significant amount of noise that degrades the performance of various classification methods. By employing the GSF, the noise of the sensed EMG signals is filtered to eliminate its undesirable and detrimental effects, enhancing the classification performance. The GSF has similar supports in the time and frequency domains, thus allowing a compromise between filtering the noise and partially keeping the high frequency components of the sensed EMG signals.


To evaluate the performance of the GSF, two experiments were conducted using a MYO band. In the first experiment, the classification of a scenario composed of six distinct hand gestures was addressed. An array of classification techniques was considered, viz., support vector machine (SVM), k-nearest neighbor (k-NN), linear discriminant analysis (LDA), naïve Bayes classifier (NBC), and Gaussian mixtures model (GMM). For the unfiltered EMG signals, the classification performance was found to be 79.82%, 83.33%, 86.31%, 93.56%, and 90.97% when using the SVM, LDA, NBC, k-NN, and GMM techniques, respectively. Filtering the EMG signals of the first experiment with a GSF of a suitable parameter resulted in improved classification performance of 94.52%, 93.19%, 93.79%, 98.09%, and 95.92% with the SVM, LDA, NBC, k-NN, and GMM classification techniques, respectively. The computational time required for developing and testing the models was measured for the unfiltered versus filtered EMG signals, and it was shown that the computational time was reduced from 402.96 to 381.11 s, 9.33 to 4.14 s, 10.62 to 6.54 s, 10.18 to 5.14 s, and 11.39 to 7.25 s when using the SVM, LDA, NBC, k-NN, and GMM classification techniques, respectively. In the second experiment, a grasping task composed of four phases was considered. In contrast to the unfiltered EMG signals, the use of the GSF to filter out the noise of the sensed EMG signals improved the classification performance from 72.82% to 88.17%, 77.23% to 88.94%, 80.01% to 90.52%, 87.38% to 96.05%, and 83.91% to 92.30% when using the SVM, LDA, NBC, k-NN, and GMM classification techniques, respectively. As in experiment 1, the use of filtered EMG signals reduced the computation time from 539.71 to 441.29 s, 12.10 to 7.16 s, 14.08 to 9.47 s, 13.02 to 7.89 s, and 16.92 to 11.07 s, respectively. The enhancement in classification accuracy reported in both experiments when using the GSF is a consequence of removing the noise, which has a stochastic nature; its removal renders the classification process more accurate. The reduction in computational time is a result of smoothing the sensed EMG signals, which was shown to have a direct impact on reducing the convergence time of the modeling process. Finally, a comparison was performed for classifying EMG signals smoothed using the median filter (MF) versus the GSF, and the GSF was shown to yield superior performance. Despite the excellent performance reported in this chapter with the use of the GSF, its parameter is not optimized, and this may affect the filtering process, producing non-optimal classification results. Thus, future work will focus on developing an enhanced GSF algorithm in which the optimal value of the standard deviation is estimated and integrated into the classification process while reducing the computational time.

Acknowledgements This work is supported in part by the National Science Foundation grants DRK-12 DRL: 1417769, ITEST DRL: 1614085, and RET Site EEC: 1542286, and NY Space Grant Consortium grant 76156-10488.


References

1. Kamen, G., & Gabriel, D. (2010). Essentials of Electromyography. Champaign, IL: Human Kinetics.
2. Botelho, S. Y. (1955). Comparison of simultaneously recorded electrical and mechanical activity in myasthenia gravis patients and in partially curarized normal humans. The American Journal of Medicine, 19(5), 693–696.
3. Choi, C., & Kim, J. (2007). A real-time EMG-based assistive computer interface for the upper limb disabled. In IEEE 10th International Conference on Rehabilitation Robotics, Noordwijk, The Netherlands (pp. 459–462).
4. Sandoval, A. E. (2010). Electrodiagnostics for low back pain. Physical Medicine and Rehabilitation Clinics of North America, 21(4), 767–776.
5. Sharma, S., & Dubey, A. K. (2012). Movement control of robot in real time using EMG signal. In 2nd International Conference on Power, Control and Embedded Systems, Allahabad, India (pp. 1–4).
6. Di Nardo, F., et al. (2015). Assessment of the ankle muscle co-contraction during normal gait: A surface electromyography study. Journal of Electromyography and Kinesiology, 25(2), 347–354.
7. Nazmi, N., et al. (2016). A review of classification techniques of EMG signals during isotonic and isometric contractions. Sensors, 16(8).
8. Graupe, D., & Cline, W. K. (1975). Functional separation of EMG signals via ARMA identification methods for prosthesis control purposes. IEEE Transactions on Systems, Man, and Cybernetics, SMC-5(2), 252–259.
9. Englehart, K., Hudgins, B., Parker, P. A., & Stevenson, M. (1999). Classification of the myoelectric signal using time-frequency based representations. Medical Engineering & Physics, 21(6–7), 431–438.
10. Nishikawa, D., Yu, W., Yokoi, H., & Kakazu, Y. (1999). EMG prosthetic hand controller using real-time learning method. In IEEE International Conference on Systems, Man, and Cybernetics (pp. 153–158).
11. Ju, P., Kaelbling, L. P., & Singer, Y. (2000). State-based classification of finger gestures from electromyographic signals. In Proceedings of the 7th International Conference on Machine Learning, Stanford, CA, USA (pp. 439–446).
12. Yoshikawa, M., Mikawa, M., & Tanaka, K. (2006). Real-time hand motion estimation using EMG signals with support vector machines. In SICE-ICASE International Joint Conference, Busan, South Korea (pp. 593–598).
13. Murugappan, M. (2011). Electromyogram signal based human emotion classification using KNN and LDA. In IEEE International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia (pp. 106–110).
14. Negi, S., Kumar, Y., & Mishra, V. M. (2016). Feature extraction and classification for EMG signals using linear discriminant analysis. In 2nd International Conference on Advances in Computing, Communication, and Automation (ICACCA), Bareilly, India (pp. 1–6).
15. Orjuela-Cañón, A. D., Ruíz-Olaya, A. F., & Forero, L. (2017). Deep neural network for EMG signal classification of wrist position: Preliminary results. In IEEE Latin American Conference on Computational Intelligence (LA-CCI), Arequipa, Peru (pp. 1–5).
16. Ghalyan, I. F., Abouelenin, Z. M., & Kapila, V. (2018). Gaussian filtering of EMG signals for improved hand gesture classification. In The IEEE Signal Processing in Medicine and Biology Symposium (SPMB 2018), Philadelphia, PA, USA (pp. 1–6).
17. Battye, C. K., Nightingale, A., & Willis, J. (1955). The use of myo-electric currents in the operation of prostheses. The Journal of Bone and Joint Surgery, British Volume, 37-B(3), 506–510.
18. Kobrinsky, A. (1960). Bioelectric control systems. Radio USSR (In Russian), 11, 37–39.
19. Bottomley, A. H. (1962). Working model of a Myo-electric control system. In Proceedings of the International Symposium on the Applications of Automatic Control Prosthetic Design, Belgrade, Yugoslavia (pp. 37–45).


20. Bottomley, A. H. (1963). Myo-electric control of powered prostheses. The Journal of Bone and Joint Surgery, British Volume, 47(3), 411–415.
21. Mann, R. W. (1968). Design criteria, development and pre-and post-fitting amputee evaluation of an EMG controlled, force sensing, proportional-rate, elbow prosthesis with cutaneous kinesthetic feedback. IFAC Proceedings Volumes, 2(4), 579–586.
22. Rothchild, R. D. (1965). Design of an externally powered artificial elbow for electromyographic control. Cambridge, MA: MIT.
23. Rothchild, R. D., & Mann, R. W. (1966). An EMG controlled, force sensing, proportional rate, elbow prosthesis. In Proceedings of the Symposium on Biomedical Engineering (pp. 106–109). Milwaukee, WI: Marquette University.
24. Herberts, P. (1969). Myoelectric signals in control of prostheses: Studies on arm amputees and normal individuals. Acta Orthopaedica Scandinavica, 40(Suppl 124), 1–83.
25. Scott, R. N. (1967). Myoelectric energy spectra. Medical and Biological Engineering, 3, 303–305.
26. Dorcas, D. S., Dunfield, V. A., & Scott, R. M. (1970). Improved myoelectric control system. Medical and Biological Engineering, 8, 333–341.
27. Kwatny, E., Thomas, D. H., & Kwatny, H. G. (1970). An application of signal processing techniques to the study of myoelectric signals. IEEE Transactions on Biomedical Engineering, BME-17(4), 303–313.
28. Lawrence, P. D., & Lin, W. (1972). Statistical decision making in the real-time control of an arm aid for the disabled. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(1), 35–42. https://doi.org/10.1109/TSMC.1972.5408554.
29. Parker, P. A., Stuller, J. A., & Scott, R. N. (1977). Signal processing for the multistate myoelectric channel. Proceedings of the IEEE, 65(5), 662–674.
30. De Luca, C. J. (1979). Physiological and mathematical basis of myoelectric signals. IEEE Transactions on Biomedical Engineering, BME-26(6), 313–325.
31. Hogan, N., & Mann, R. W. (1980). Myoelectric signal processing: Optimal estimation applied to electromyography-part I: Derivation of the optimal myoprocessor. IEEE Transactions on Biomedical Engineering, BME-27(7), 382–395.
32. Hudgins, B., Parker, P., & Scott, R. N. (1993). A new strategy for multifunction myoelectric control. IEEE Transactions on Biomedical Engineering, 40(1), 82–94.
33. Chaiyaratana, N., Zalzala, A. M. S., & Datta, D. (1996). Myoelectric signals pattern recognition for intelligent functional operation of upper-limb prosthesis (ACSE Research Report 621). Department of Automatic Control and Systems Engineering.
34. Merletti, R., & Conte, L. R. L. (1997). Surface EMG signal processing during isometric contractions. Journal of Electromyography and Kinesiology, 7(4), 241–250.
35. Bilodeau, M., Cincera, M., Arsenault, A. B., & Gravel, D. (1997). Normality and stationarity of EMG signals of elbow flexor muscles during ramp and step isometric contractions. Journal of Electromyography and Kinesiology, 7(2), 87–96.
36. Clancy, E. A., & Hogan, N. (1999). Probability density of the surface electromyogram and its relation to amplitude detectors. IEEE Transactions on Biomedical Engineering, 46(6), 730–739.
37. Farina, D., & Merletti, R. (2000). Comparison of algorithms for estimation of EMG variables during voluntary isometric contractions. Journal of Electromyography and Kinesiology, 10(5), 337–349.
38. Englehart, K., Hudgin, B., & Parker, P. A. (2001). A wavelet-based continuous classification scheme for multifunction myoelectric control. IEEE Transactions on Biomedical Engineering, 48(3), 302–311.
39. Rosen, J., Brand, M., Fuchs, M. B., & Arcan, M. (2001). A myosignal-based powered exoskeleton system. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 31(3), 210–222.
40. Hussein, S. E., & Granat, M. H. (2002). Intention detection using a neuro-fuzzy EMG classifier. IEEE Engineering in Medicine and Biology Magazine, 21(6), 123–129.

6 Gaussian Smoothing Filter for Improved EMG Signal Modeling

201

41. Englehart, K., & Hudgins, B. (2003). A robust, real-time control scheme for multifunction myoelectric control. IEEE Transactions on Biomedical Engineering, 50(7), 848–854. 42. Ajiboye, A. B., & Weir, R. F. (2005). A heuristic fuzzy logic approach to EMG pattern recognition for multifunctional prosthesis control. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 13(3), 280–291. 43. Huang, Y., Englehart, K. B., Hudgins, B., & Chan, A. (2005). A Gaussian mixture model based classification scheme for myoelectric control of powered upper limb prostheses. IEEE Transactions on Biomedical Engineering, 52(11), 1801–1811. 44. Chan, A., & Englehart, K. B. (2005). Continuous myoelectric control for powered prostheses using hidden Markov models. IEEE Transactions on Biomedical Engineering, 52(1), 121– 124. 45. Fleischer, C., Wege, A., Kondak, K., & Hommel, G. (2006). Application of EMG signals for controlling exoskeletonrobots. Biomedical Engineering, 51, 314–319. 46. Reaz, M. B., Hussain, M. S., & Mohd-Yasin, F. (2006). Techniques of EMG signal analysis: Detection, processing, classification and applications. Biological Procedures Online, 8(1), 11–35. 47. Oskoei, M. A., & Hu, H. (2006). GA-based feature subset selection for myoelectric classification. In 2006 IEEE International Conference on Robotics and Biomimetics, Kunming, China (pp. 1465–1470). 48. Oskoei, M. A., & Hu, H. (2008). Support vector machine-based classification scheme for myoelectric control applied to upper limb. IEEE Transactions on Biomedical Engineering, 55(8), 1956–1965. 49. Hussain, M. S., Reaz, M. B. I., Mohd.-Yasin, F., & Ibrahimy, M. I. (2008). Electromyography signal analysis using wavelet transform and higher order statistics to determine muscle contraction. The Journal of Knowledge Engineering, Expert Systems, 26(1), 35–48. 50. Ahmad, S. A., & Chappell, P. H. (2009). Surface EMG pattern analysis of the wrist muscles at different speeds of contraction. Journal of Medical Engineering & Technology, 33(5), 376– 385. 51. Khezri, M., & Jahed, M. (2011). A neuro–fuzzy inference system for sEMG-based identification of hand motion commands. IEEE Transactions on Industrial Electronics, 58(5), 1952–1960. 52. Lorrain, T., Jiang, N., & Farina, D. (2011). Influence of the training set on the accuracy of surface EMG classification in dynamic contractions for the control of multifunction prostheses. Journal of NeuroEngineering and Rehabilitation, 8(1), 25. 53. Phinyomark, A., Phukpattaranont, P., & Limsakul, C. (2012). Feature reduction and selection for EMG signal classification. Expert Systems with Applications, 39(8), 7420–7431. 54. Matsubara, T., & Morimoto, J. (2013). Bilinear modeling of EMG signals to extract userindependent features for multiuser myoelectric interface. IEEE Transactions on Biomedical Engineering, 60(8), 2205–2213. 55. Subasi, A. (2013). Classification of EMG signals using PSO optimized SVM for diagnosis of neuromuscular disorders. Computers in Biology and Medicine, 43(5), 576–586. 56. Phinyomark, A., et al. (2013). EMG feature evaluation for improving myoelectric pattern recognition robustness. Expert Systems with Applications, 40(12), 4832–4840. 57. Rogers, D. R., & MacIsaac, D. T. (2013). A comparison of EMG-based muscle fatigue assessments during dynamic contractions. Journal of Electromyography and Kinesiology, 23(5), 1004–1011. 58. Nazarpour, K., Al-Timemy, A. H., Bugmann, G., & Jackson, A. (2013). A note on the probability distribution function of the surface electromyogram signal. 
Brain Research Bulletin, 90, 88–91. 59. Thongpanja, S., et al. (2015). Analysis of electromyography in dynamic hand motions using L-kurtosis. Applied Mechanics and Materials, 781, 604–607. 60. Tsai, A.-C., et al. (2014). A comparison of upper-limb motion pattern recognition using EMG signals during dynamic and isometric muscle contractions. Biomedical Signal Processing and Control, 11, 17–26.

202

I. F. J. Ghalyan et al.

61. Siddiqi, A. R., Sidek, S. N., & Khorshidtalab, A. (2015). Signal processing of EMG signal for continuous thumb-angle estimation. In 41st Annual Conference of the IEEE Industrial Electronics Society (IECON 2015), Yokohama, Japan (pp. 374–379). 62. Yu, Y., Fan, L., Kuang, S., Sun, L., & Zhang, F. (2015). The research of sEMG movement pattern classification based on multiple fused wavelet function. In IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), Shenyang, China (pp. 487–491). 63. Kasuya, M., Yokoi, H., & Kato, R. (2015). Analysis and optimization of novel post-processing method for myoelectric pattern recognition. In 2015 IEEE International Conference on Rehabilitation Robotics (ICORR), Singapore, Singapore (pp. 985–990). 64. Peng, L., Hou, Z., Kasabov, N., Bian, G., Vladareanu, L., & Yu, H. (2015). Feasibility of NeuCube spiking neural network architecture for EMG pattern recognition. In 2015 International Conference on Advanced Mechatronic Systems (ICAMechS) (pp. 365–369). 65. Zhang, Q., Xiong, C., & Zheng, C. (2015). Intuitive motion classification from EMG for the 3-D arm motions coordinated by multiple DoFs. In 7th IEEE/EMBS International Conference on Neural Engineering (NER), Montpellier, France (pp. 836–839). 66. Pang, M., Guo, S., & Zhang, S. (2015). Prediction of interaction force using EMG for characteristic evaluation of touch and push motions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany (pp. 2099–2104). 67. Naik, G. R., Selvan, S. E., & Nguyen, H. T. (2016). Single-channel EMG classification with ensemble-empirical-mode-decomposition-based ICA for diagnosing neuromuscular disorders. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(7), 734– 743. 68. Spanias, J. A., Perreault, E. J., & Hargrove, L. J. (2016). Detection of and compensation for EMG disturbances for powered lower limb prosthesis control. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(2), 226–234. 69. Vidovic, M. M., Hwang, H., Amsüss, S., Hahne, J. M., Farina, D., & Müller, K. (2016). Improving the robustness of myoelectric pattern recognition for upper limb prostheses by covariate shift adaptation. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(9), 961–970. 70. AbdelMaseeh, M., Chen, T., & Stashuk, D. W. (2016). Extraction and classification of multichannel electromyographic activation trajectories for hand movement recognition. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(6), 662–673. 71. Samuel, O. W., Li, X., Fang, P., & Li, G. (2015). Examining the effect of subjects’ mobility on upper-limb motion identification based on EMG-pattern recognition. In 2016 Asia-Pacific Conference on Intelligent Robot Systems (ACIRS) (pp. 137–141). 72. Zhai, X., Jelfs, B., Chan, R. H. M., & Tin, C. (2016). Short latency hand movement classification based on surface EMG spectrogram with PCA. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA (pp. 327–330). 73. Lee, S. W., Yi, T., Jung, J., & Bien, Z. (2017). Design of a gait phase recognition system that can cope with EMG electrode location variation. IEEE Transactions on Automation Science and Engineering, 14(3), 1429–1439. 74. Jochumsen, M., Waris, A., & Kamavuako, E. N. (2018). The effect of arm position on classification of hand gestures with intramuscular EMG. 
Biomedical Signal Processing and Control, 43, 1–8. 75. Tavakoli, M., Benussi, C., Lopes, P. A., Osorio, L. B., & de Almeida, A. T. (2018). Robust hand gesture recognition with a double channel surface EMG wearable armband and SVM classifier. Biomedical Signal Processing and Control, 46, 121–130. 76. Camargo, J., & Young, A. (2019). Feature selection and non-linear classifiers: Effects on simultaneous motion recognition in upper limb. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27(4), 743–750. https://doi.org/10.1109/TNSRE.2019.2903986. 77. Zschorlich, V. (1989). Digital filtering of EMG-signals. Electromyography and Clinical Neurophysiology, 28(2), 81–86.

6 Gaussian Smoothing Filter for Improved EMG Signal Modeling

203

78. Conforto, S., D’Alessio, T., & Pignatelli, S. (1999). Optimal rejection of movement artefacts from myoelectric signals by means of a wavelet filtering procedure. Journal of Electromyography and Kinesiology, 9(1), 47–57. 79. De Luca, C. J., Gilmore, L. D., Kuznetsov, M., & Roy, S. H. (2010). Filtering the surface EMG signal: Movement artifact and baseline noise contamination. Journal of Biomechanics, 43(8), 1573–1579. 80. Ghalyan, I. F. J. (2016). Force-controlled robotic assembly processes of rigid and flexible objects: Methodologies and applications (1st ed.). Cham: Springer International Publishing. 81. Jasim, I. F., Plapper, P. W., & Voos, H. (2015). Gaussian filtering for enhanced impedance parameters identification in robotic assembly processes. In 20th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA 2015), Luxembourg, Luxembourg. https://doi.org/10.1109/ETFA.2015.7301611 82. Ghalyan, I. F., Jaydeep, A., & Kapila, V. (2018). Learning robot-object distance using Bayesian regression with application to a collision avoidance scenario. In 48th IEEE Applied Imagery Pattern Recognition Workshop (AIPR 2018), Washington, DC, USA. 83. Shapiro, L. G., & Stockman, G. (2001). Computer vision (1st ed.). Upper Saddle River, NJ: Prentice Hall PTR. 84. Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer. 85. Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer. 86. Khan, M., Ahamed, S. I., Rahman, M., & Yang, J. (2012). Gesthaar: An accelerometer-based gesture recognition method and its application in NUI driven pervasive healthcare. In 2012 IEEE International Conference on Emerging Signal Processing Applications, Las Vegas, NV, USA (pp. 163–166). 87. Rahulamathavan, Y., Veluru, S., Phan, R. C., Chambers, J. A., & Rajarajan, M. (2014). Privacy-preserving clinical decision support system using Gaussian kernel-based classification. IEEE Journal of Biomedical and Health Informatics, 18(1), 56–66. 88. Chen, S., Ouyang, Y., Lin, C., & Chang, C. (2018). Iterative support vector machine for hyperspectral image classification. In 25th IEEE International Conference on Image Processing (ICIP), Vancouver, BC, Canada (pp. 3309–3312). https://doi.org/10.1109/ICIP.2018.8451145 89. Ghalyan, I. F., Chacko, S. M., & Kapila, V. (2018). Simultaneous robustness against random initialization and optimal order selection in Bag-of-Words modeling. Pattern Recognition Letters, 116, 135–142. 90. Vapnik, V. (2000). The nature of statistical learning theory (2nd ed.). New York: Springer. 91. Yang, K., & Shahabi, C. (2007). An efficient k nearest neighbor search for multivariate time series. Information and Computation, 205(1), 65–98. 92. Jabbar, M. A., Deekshatulu, B. L., & Chandra, P. (2013). Classification of heart disease using k-nearest neighbor and genetic algorithm. Procedia Technology, 10, 85–94. 93. Krishna, A., Edwin, D., & Hariharan, S. (2017). Classification of liver tumor using SFTA based Naïve Bayes classifier and support vector machine. In 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), Kannur, India (pp. 1066–1070). 94. Padmavathi, S., & Ramanujam, E. (2015). Naïve Bayes classifier for ECG abnormalities using multivariate maximal time series motif. Procedia Computer Science, 47, 222–228. 95. Falih, A. D. I., Dharma, W. A., & Sumpeno, S. (2017). 
Classification of EMG signals from forearm muscles as automatic control using Naive Bayes. In 2017 International Seminar on Intelligent Technology and Its Applications (ISITIA), Surabaya, Indonesia (pp. 346–351). 96. Zhang, D., Zhao, X., Han, J., & Zhao, Y. (2014). A comparative study on PCA and LDA based EMG pattern recognition for anthropomorphic robotic hand. In IEEE International Conference on Robotics and Automation (ICRA 2014), Hong Kong (pp. 4850–4855). 97. Sharma, A., & Paliwal, K. K. (2008). Cancer classification by gradient LDA technique using microarray gene expression data. Data & Knowledge Engineering, 66(2), 338–347. 98. Bandos, T. V., Bruzzone, L., & Camps-Valls, G. (2009). Classification of hyperspectral images with regularized linear discriminant analysis. IEEE Transactions on Geoscience and Remote Sensing, 47(3), 862–873.

204

I. F. J. Ghalyan et al.

99. Jasim, I. F., & Plapper, P. W. (2014). Contact-state monitoring of force-guided robotic assembly tasks using expectation maximization-based Gaussian mixtures models. The International Journal of Advanced Manufacturing Technology, 73(5–8), 623–633. Retrieved from http:// link.springer.com/article/10.1007%2Fs00170-014-5803-x. 100. Jasim, I. F., & Plapper, P. W. (2014). Contact-state recognition of compliant motion robots using expectation maximization-based Gaussian Mixtures. In Joint 45th International Symposium on Robotics (ISR 2014) and 8th German Conference on Robotics (ROBOTIK 2014), Munich, Germany. 101. Jasim, I. F., Plapper, P. W., & Voos, H. (2017). Contact-state modelling in force-controlled robotic peg-in-hole assembly processes of flexible objects using optimised Gaussian mixtures. Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 231(8), 1448–1463. https://doi.org/10.1177/0954405415598945. 102. Chu, J., & Lee, Y. (2009). Conjugate-prior-penalized learning of Gaussian mixture models for multifunction myoelectric hand control. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 17(3), 287–297. 103. Vögele, A. M., Zsoldos, R. R., Krüger, B., & Licka, T. (2016). Novel methods for surface EMG analysis and exploration based on multi-modal Gaussian mixture models. PLoS One, 11(6), 1–28. 104. Lorentz, G. G. (1966). Approximation of functions. New York: Holt-Rinehart-Winston.

Chapter 7

Clustering of SCG Events Using Unsupervised Machine Learning

Peshala T. Gamage, Md Khurshidul Azad, Amirtaha Taebi, Richard H. Sandler, and Hansen A. Mansy

P. T. Gamage · M. K. Azad
Biomedical Acoustic Research Lab, University of Central Florida, Orlando, FL, USA
e-mail: [email protected]

A. Taebi
Department of Biomedical Engineering, University of California Davis, Davis, CA, USA

R. H. Sandler · H. A. Mansy
Biomedical Acoustic Research Lab, University of Central Florida, Orlando, FL, USA
Biomedical Acoustics Research Company, Orlando, FL, USA

© Springer Nature Switzerland AG 2020
I. Obeid et al. (eds.), Signal Processing in Medicine and Biology, https://doi.org/10.1007/978-3-030-36844-9_7

7.1 Introduction

Seismocardiography (SCG) is the process of measuring the chest surface vibrations resulting from cardiac activity. These vibrations are typically measured by an accelerometer and are believed to be primarily caused by valve closure and opening, blood momentum changes, and myocardial movements [1–3]. The method was initially introduced in 1957 by Eliot et al. [4], and the term "seismocardiography" was used by Bozhenko [5] in 1961, who studied the use of SCG in space flights. Although SCG is a promising method for the detection of important cardiac information, such as cardiac timing intervals and cardiac contractility, its use was initially limited by the heavy accelerometers available at the time, while more rapid advancement was made in electrocardiography (ECG) and other medical imaging methods [6, 7]. Later, progress in MEMS technology produced much lighter, highly sensitive, and low-cost accelerometers, which has promoted the use of SCG as a feasible method for cardiac monitoring systems [8]. SCG provides information about the mechanical cardiac activity and, when combined with ECG, which is indicative of the electrical activity, can provide a more complete picture of cardiac health. This includes determining electromechanical time intervals such as the pre-ejection period (PEP), which was found to correlate



with cardiac health [9]. Results from several other studies have also suggested correlations between SCG signal features and different cardiac pathologies [10], including heart failure [11, 12]. Hence, SCG has high potential utility for diagnosis and monitoring of cardiac conditions. In addition, SCG has many advantages, including being noninvasive, inexpensive, compatible with telemedicine, and not requiring highly trained operators. Analysis of SCG waveforms usually focuses on both time and frequency features to gain better understanding of heart function and for classification of SCG under different heart conditions [13, 14]. In addition, SCG can provide information about the interactions between the cardiovascular and pulmonary systems [12, 15]. Some studies have also employed SCG to monitor sleep apnea [16] and to estimate respiratory rate [17]. Fiducial points (i.e., certain SCG peaks) can be defined for SCG and can contain useful diagnostic information [18]. For instance, several studies have consistently used SCG to locate cardiac events (shown in Fig. 7.1) such as mitral valve closing (MC), aortic valve opening (AO), etc. [18–20]. However, some studies have reported inconsistencies in the location of certain fiducial points [18, 21]. The inconsistency and uncertainty of the correlation between fiducial points and cardiac events have posed limitations on the clinical application of SCG [22]. This may be mainly due to the low time resolution of the medical imaging modalities [18, 23] that are usually used to establish the above correlation, and also possibly due to the high variability of the SCG signal. SCG signals have high inter- and intra-subject variability. Signal morphology may also change among consecutive beats in the same data collection session of the same subject. Figure 7.2 shows an example of two different SCG morphologies observed in measurements performed on a healthy subject during a single recording

Fig. 7.1 The cardiac events identified in the SCG signal as proposed by Crow et al. [23]. AS atrial systole, MC mitral valve closure, IM isovolumic movement, AO aortic valve opening, IC isotonic contraction, RE rapid ejection, AC aortic valve closure, MO mitral valve opening, RF rapid filling


Fig. 7.2 Two different SCG morphologies observed in a healthy subject during a single recording session

session. Unlike ECG, SCG signals are associated with the mechanical movement (rather than the electrical activity) measured over the chest surface. Consequently, SCG signal morphology can be largely affected by different factors such as respiration (which involves lung volume and intrathoracic pressure changes), heart rate, and cardiac contractility [6, 24]. Effects of respiration on SCG morphology have been reported. An earlier study reported SCG waveform morphological changes between the inspiratory and expiratory phases [25]. A recent study [12] calculated waveform dissimilarities between SCG events and concluded that SCG morphology appeared to change with lung volume (which may correlate with intrathoracic pressure) more than with respiratory flow direction (inspiration vs. expiration). Interrelated mechanisms that may cause SCG changes with respiration include:

1. Heart position changes: During breathing, the diaphragm and lungs move, leading to changes in the heart position relative to the SCG sensor, which has a fixed location on the chest surface. Lung and heart movements also change the structural composition of the thorax and consequently affect the properties of the transmission path between the sensor and the sources of vibrations, which would affect the signal measured at the chest wall [26]. For different subjects, this effect may vary based on their breathing pattern and tissue dimensions and properties.

2. Intrathoracic pressure variation: This pressure affects the filling and ejection of blood from the different heart chambers. For example, the negative intrathoracic pressure induced during inspiration causes higher right heart filling and increases right heart output into the more compliant lungs [27]. Conversely, the positive expiratory intrathoracic pressure on the lungs inhibits right heart filling and ejection during expiration. The resulting changes in blood flow would, in turn, cause morphological differences in the measured SCG signal.

Other processes, such as chest wall and diaphragm contraction and relaxation, may further interact with these mechanisms, leading to complex variation in the SCG morphology that can be of diagnostic value. It is to be noted, however, that increased variability in the SCG waveform would reduce the precision of estimating


the mean SCG waveform, which may interfere with accurate determination of SCG features and reduce SCG diagnostic utility. To provide averages of SCG waveforms that are optimally representative of SCG events, it is helpful to separate SCG waveforms into groups with minimum intra-group and maximum inter-group dissimilarity. Hence, this study investigates the use of machine learning for optimal clustering of the SCG waveforms, which can help provide more accurate signal features and increase the diagnostic value of SCG. While the focus here will be on SCG time domain morphology, a similar analysis can be performed in the frequency domain.

Machine learning (ML) is a potential tool for grouping SCG events based on their temporal morphological features without a need for a full understanding of the underlying mechanisms of SCG variability [28]. A few studies have used supervised machine learning methods, such as support vector machines (SVM) and random forests (RF), to classify SCG events into respiratory phases such as inspiration vs. expiration or high lung volume vs. low lung volume [15, 29, 30]. In these supervised classification studies, the SCG morphology grouping is assumed a priori, and the algorithm iteratively minimizes a loss function (using training and validation datasets) to optimize the accuracy of the classifier. Hence, the accuracy of these classifiers is not necessarily an indication of optimal grouping into classes with minimum intra-class and maximum inter-class variability, but rather indicates how well the algorithm can classify SCG events into predefined groups (e.g., inspiratory/expiratory groups). In contrast to supervised ML, unsupervised ML is capable of clustering the input data into groups without defining the grouping a priori. Here, the unsupervised ML algorithms optimize a function that separates the input data into clusters such that the data within a cluster are internally similar while the dissimilarity between clusters is maximized.

In this study, we employ an unsupervised machine learning approach to cluster SCG beats based on their temporal morphology. The main objectives of our study are:

• Clustering SCG events to minimize intra-cluster variability: Unsupervised ML is used for clustering SCG temporal waveforms in healthy subjects. The optimum number of clusters is decided by analyzing the variance within and between the clusters.

• Finding relations between cluster boundaries, respiratory phases, and heart rate: The timing of the clustered SCG waveforms is compared with their respiratory phases (i.e., inspiratory vs. expiratory and high vs. low lung volume phases). The timing at which SCG beats switch from one cluster to another is compared with the respiratory phases and heart rate changes.

• Calculating a representative SCG event for each cluster: After the clusters are defined, a beat that is representative of the morphology of each cluster is calculated using an advanced shape averaging method. Ultimately, this representative SCG event may be used to define fiducial points for diagnostic purposes.


7.2 Methods

Figure 7.3 summarizes the methodology employed in this study; more details are provided in the following sections.

7.2.1 Experimental Measurements

SCG signals were acquired from 17 healthy subjects after Institutional Review Board (IRB) approval. Subject characteristics are listed in Table 7.1. SCG was measured using a tri-axial accelerometer (Model: 356A32, PCB Piezotronics, Depew, NY) affixed to the chest surface using double-sided medical-grade tape (B205-1, 3M, Minneapolis, MN) such that the measured z-component of the acceleration was normal to the chest surface (i.e., the dorso-ventral component). The sensor was placed at the 4th intercostal space at the left lower sternal border. The signal from the accelerometer was amplified using a charge amplifier (Model: 482C, PCB Piezotronics, Depew, NY) and then acquired using a data acquisition module (Model: IX-TA-220, iWorx Systems, Inc., Dover, NH). The current SCG sensor is sensitive to chest wall movement due to respiration. While this movement is an artifact that can corrupt SCG, it has a much lower frequency, which makes it easy to remove by low-pass filtering; this is the approach implemented in this study. Two other signals were acquired simultaneously: ECG (in the lead II arrangement, Model: IX-B3G, iWorx Systems, Inc., Dover, NH) and respiratory flowrate (via a mouthpiece using a spirometer, Model: A-FH-300, iWorx Systems, Inc., Dover, NH). A sampling rate of 10 kHz was used for data acquisition. Subjects rested comfortably on a 45-degree inclined bed during data collection. A diagram of the experimental setup is shown in Fig. 7.4.

Fig. 7.3 Methodology workflow: data acquisition (SCG, ECG, and respiratory flowrate measurement) → preprocessing (band-pass filtering; segmentation of SCG beats using ECG) → clustering SCG events (shape dissimilarity measurement; unsupervised ML; optimizing the number of clusters) → analyzing the cluster distribution in relation to respiratory phases

Table 7.1 Subject characteristics
Age (years)    23 ± 3.5
Height (cm)    168.5 ± 9
Weight (kg)    70 ± 13
BMI            24.5 ± 3.9


Fig. 7.4 Experimental setup

7.2.2 Preprocessing

7.2.2.1 Filtering

The code for all signal processing steps was written in MATLAB (R2017b, The MathWorks, Inc., Natick, MA). SCG and ECG signals were forward-backward filtered using a fourth-order Chebyshev type II band-pass filter (0.5–50 Hz) to reduce the background noise and baseline wandering (i.e., variation) due to respiration. In addition, a moving average filter of order 5 (low-pass with cutoff ~2 kHz) was employed to further smooth the signal. For each subject, the original and the filtered signals were compared in the time and frequency domains to make sure filtering had minimal distortion on the SCG event amplitudes. Figure 7.5 shows an example of the original and filtered SCG data in the time and frequency domains.
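A minimal MATLAB sketch of this stage is shown below. The 40 dB stopband attenuation and the variable name scgRaw are assumptions for illustration; the chapter specifies only the filter order and band edges:

```matlab
fs = 10000;                                    % sampling rate (Hz)
% cheby2 with a two-element band and n = 2 yields a fourth-order band-pass
[b, a] = cheby2(2, 40, [0.5 50]/(fs/2), 'bandpass');
scgFilt = filtfilt(b, a, scgRaw);              % forward-backward (zero-phase)
scgFilt = movmean(scgFilt, 5);                 % order-5 moving average smoothing
```

filtfilt applies the filter forward and then backward, which doubles the effective attenuation while canceling phase distortion, preserving the timing of SCG fiducial points.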

7.2.2.2 SCG Segmentation

The SCG signal was segmented into SCG beats (also called events in this manuscript) using the R peaks of the ECG signal, which were detected using the Pan–Tompkins algorithm [31]. Each SCG beat was selected to start 0.1 s before the R peak of the corresponding ECG complex, while the end point of the SCG beat was 0.1 s before the R peak of the following ECG complex. Since the R-R interval varies over


Fig. 7.5 Original and filtered SCG signal in (a) time domain, (b) zoomed in time domain, (c) frequency domain, (d) zoomed in frequency domain

time, this approach resulted in SCG beats with varying duration, which is different from SCG studies that fix the duration of SCG beats [13, 22, 30]. Figure 7.6 shows an example of the segmented SCG signal.
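A sketch of this segmentation is given below; R-peak detection here uses findpeaks as a simple stand-in for the Pan–Tompkins detector used in the study, and the peak-height and peak-distance thresholds are illustrative assumptions:

```matlab
fs = 10000;                                    % sampling rate (Hz)
% stand-in R-peak detector (the study used Pan-Tompkins)
[~, rLocs] = findpeaks(ecg, 'MinPeakHeight', 0.6*max(ecg), ...
                             'MinPeakDistance', 0.4*fs);
pre = round(0.1*fs);                           % 0.1 s before each R peak
beats = cell(numel(rLocs) - 1, 1);
for k = 1:numel(rLocs) - 1
    % beat k: from 0.1 s before R(k) to the sample before 0.1 s ahead of R(k+1)
    beats{k} = scgFilt(rLocs(k) - pre : rLocs(k+1) - pre - 1);
end
```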

7.2.3 Unsupervised Machine Learning

7.2.3.1 Clustering SCG Morphology

The morphology of an SCG beat may be best described by the signal amplitudes (at each data point of the beat), and the dissimilarity between SCG morphologies can be quantified by calculating the differences between the signal amplitudes (after signal alignment). Hence, the amplitude values of SCG beats are an appropriate first choice as the input feature vector for the clustering algorithm, an approach also called the raw-based method in time series clustering [32].


Fig. 7.6 Segmentation of the SCG signal using ECG beats

The clustering algorithm separates SCG beats into clusters by measuring the distance (i.e., dissimilarity) between the respective feature vectors using a distance measure. Traditional algorithms typically utilize the Euclidean distance in these calculations (Fig. 7.9a). For clustering, SCG beats need to be accurately aligned in time; otherwise, errors in the distance measurements will increase and beats with similar morphologies may be assigned to different clusters. However, alignment of SCG beats becomes more complicated when the signals have inherent morphologic variability. For example, SCG beat morphology may non-linearly stretch or compress due to heart rate variability. This phenomenon can be seen in the consecutive beats shown in Fig. 7.7a, b. Even if a fiducial point is selected for the alignment of SCG beats, changes in the heart rate and the associated intervals would cause misalignment of other fiducial points, which would lead to overestimating the dissimilarity calculated by the Euclidean distance. Furthermore, dissimilarity calculated using the classic Euclidean distance assumes equal lengths of the SCG beats. However, beats have different lengths due to heart rate variability, and longer beats would have to be trimmed or compressed (to the length of the shortest beat, which corresponds to the highest heart rate) to obtain a constant length among all SCG beats. This process may remove valuable information contained in the discarded part of the beat. Hence, for accurate clustering of SCG morphology, a distance measure that accounts for the aforementioned distortions in SCG beats is needed. Previous studies of clustering time series data (based on their temporal morphology) have encountered similar issues and proposed the use of dynamic time warping (DTW), a similarity measure that delivers superior accuracy compared with the Euclidean distance [32, 33].

7.2.4 Dynamic Time Warping (DTW)

DTW is a widely used measure of the similarity between two time series. It was originally designed for automatic speech recognition [34] to identify the same


Fig. 7.7 (a) Variations of the length of segmented SCG events due to heart rate (b) Zoomed view

word spoken at different speeds. DTW determines the optimal "global alignment" between two time sequences by exploiting the temporal distortions between them [34, 35]. DTW non-linearly "warps" the two sequences in the time domain to determine a measure of their similarity [34]. This dissimilarity measure is often used in time series clustering [32]. The steps for calculating the DTW distance between two time series with different lengths, X and Y, are as follows:

X = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}   (7.1)

Y = \{y_1, y_2, \ldots, y_j, \ldots, y_m\}   (7.2)

where n and m are the lengths of the two signals. A "distance matrix" for X and Y is then generated as shown in Fig. 7.8. This distance matrix is recursively filled using the following formula:

D(i, j) = \delta(x_i, y_j) + \min\{D(i, j-1),\ D(i-1, j),\ D(i-1, j-1)\}   (7.3)

where \delta(x_i, y_j) = |x_i - y_j| or (x_i - y_j)^2.


Fig. 7.8 Distance matrix and the optimum warping path for signals X and Y

An optimal alignment (warping path) W = \{w_1, w_2, \ldots, w_k, \ldots, w_N\} is to be found, where w_k = (i, j) represents the alignment between the ith point of X and the jth point of Y. The optimal warping path is the one that minimizes

\mathrm{DTW}(X, Y) = \min_{W} \sum_{k=1}^{N} D(w_k)   (7.4)

where the warping path should satisfy the following three conditions [36]:

Boundary constraint: w_1 = (1, 1), w_N = (n, m)
Monotonicity constraint: for w_k = (i, j) and w_{k+1} = (i', j'), i' \geq i and j' \geq j
Continuity constraint: for w_k = (i, j) and w_{k+1} = (i', j'), i' \leq i + 1 and j' \leq j + 1

The computed DTW(X, Y) reflects the dissimilarity between X and Y. Figure 7.9 shows the difference between using the Euclidean distance and DTW as a dissimilarity measure. As can be seen in Fig. 7.9, associated points are concurrent for the Euclidean distance, whereas in DTW, associated points are related non-linearly in time.
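A minimal MATLAB implementation of the recursion in Eqs. (7.3)–(7.4) is sketched below, using the absolute difference as the local cost δ (the chapter allows either the absolute or squared difference); the Signal Processing Toolbox also provides a built-in dtw function that additionally returns the warping path:

```matlab
function d = dtwDistance(x, y)
% DTW dissimilarity between sequences x and y per Eqs. (7.3)-(7.4)
    n = numel(x); m = numel(y);
    D = inf(n + 1, m + 1);                     % padded distance matrix
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            cost = abs(x(i) - y(j));           % delta(x_i, y_j)
            D(i+1, j+1) = cost + min([D(i+1, j), D(i, j+1), D(i, j)]);
        end
    end
    d = D(n + 1, m + 1);                       % DTW(X, Y)
end
```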

7.2.5 Averaging SCG Beats

Averaging of SCG beats is performed in order to find a representative beat for each set of similar SCG beats. The representative beat is then used to derive features (e.g., to determine fiducial points) for diagnostic purposes. Hence, it should accurately represent the morphology of the set of SCG beats. In previous studies, fixed-length SCG beats were averaged after aligning them relative to a certain peak (such as the R peak of ECG, the maximum peak of the systolic portion of SCG, or the S1 or S2 peaks of the phonocardiogram (PCG)) [13, 22, 28]. However, calculating an accurate


Fig. 7.9 Associated points between signals X and Y when the dissimilarity is measured with (a) Euclidean and (b) DTW measures [36]

average while conserving the original morphology (of the non-linearly stretched or truncated SCG beats) is not an easy task. For traditional averaging methods in Euclidean space, morphological features are better conserved close to the alignment location, while features may be lost in other regions of the beat. For example, if the SCG beats are aligned with respect to the R peaks, morphological features may be conserved in the systolic region close to the R peak, while the beat morphology may get smeared in the diastolic region due to decreased alignment accuracy. To overcome this issue, a recent study calculated two different averages by aligning the signals relative to the R peaks of the ECG and the S2 peaks of the PCG and analyzed the fiducial points in the systolic and diastolic regions separately [22]. Furthermore, if the average is calculated between SCG beats with different morphologies, the average beat may have a different morphology that can mask some important diagnostic features contained in the original SCG beats. In the current study, we propose the use of a shape-based averaging technique after the SCG beats are optimally clustered into different groups. For use with DTW, several shape-based averaging methods have been suggested in the literature, including Nonlinear Alignment and Averaging Filters (NLAAF), Prioritized Shape Averaging (PSA), and DTW barycenter averaging (DBA) [37, 38]. The first two methods use a pairwise averaging strategy that suffers from growth of the length of the resulting average (to almost double) at each step, and the process of reducing the length may lead to loss of information [37]. The recently introduced DBA method eliminates these drawbacks and appears to be the most accurate and efficient [32]. Therefore, DBA will be employed in this study.

7.2.5.1 DTW Barycenter Averaging (DBA)

DBA tries to find an optimum average for a set of time sequences (e.g., SCG beats) in DTW space. This average is such that it has minimum DTW distance from the set


of sequences. The method starts by selecting an arbitrary average and iteratively updating it to minimize the sum of the DTW distances from the set of sequences to the average. When calculating the DTW distance between the average beat and the set beats, many points of the set of beats may be associated with the same point of the average beat (and vice versa), as seen in Fig. 7.9. The higher the number of these points, the higher the DTW distance between the set of beats and the average beat (because these distances are added). To reduce the number of associated points, DBA updates each point of the average beat by taking the barycenter of the associated points (the barycenter is the center of mass of the points, assuming equal mass for each point). The same procedure is followed iteratively with the updated average until the sum of the DTW distances between the average and the set of beats converges within a predefined tolerance [37]. Figure 7.10 shows a DBA average calculated for a set of SCG beats. However, DBA has a time complexity of Θ(I·N·l²), where I, N, and l denote the number of iterations, the number of sequences, and the length of the sequences, respectively [37]. In addition, in some cases DBA may not converge to a smooth signal (which was observed when DBA was calculated at a sampling rate of 500 Hz in our study) due to non-linear distortions in the DTW calculations [37]. To avoid such limitations, the medoid can also be used as an alternative representative of the cluster morphology.
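As an illustration, a minimal MATLAB sketch of DBA under these definitions is given below; it relies on the Signal Processing Toolbox dtw to obtain the warping path (ix indexes the average, iy the beat) and, for brevity, runs a fixed number of iterations rather than testing a convergence tolerance:

```matlab
function avgBeat = dbaAverage(beats, avgInit, nIter)
% beats: cell array of SCG beats; avgInit: initial average (e.g., the medoid)
    avgBeat = avgInit;
    for it = 1:nIter
        sums = zeros(size(avgBeat));
        counts = zeros(size(avgBeat));
        for k = 1:numel(beats)
            [~, ix, iy] = dtw(avgBeat, beats{k});    % optimal warping path
            for p = 1:numel(ix)
                % accumulate the beat samples warped onto each average point
                sums(ix(p)) = sums(ix(p)) + beats{k}(iy(p));
                counts(ix(p)) = counts(ix(p)) + 1;
            end
        end
        avgBeat = sums ./ counts;                    % barycenter update
    end
end
```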

7.2.5.2 Clustering Algorithms

Unsupervised machine learning is employed in the current study to separate SCG morphologies into clusters with the highest intra-cluster similarity and the highest inter-cluster dissimilarity. In time series clustering, three main types of clustering methods are used, namely, hierarchical, spectral, and partitional clustering [33]. Hierarchical clustering generates a cluster hierarchy by (a) combining the most similar clusters pairwise (starting from the individual sequences) until all sequences

Fig. 7.10 DBA average calculated for a set of SCG beats


(i.e., SCG beats) are merged into a single cluster (agglomerative clustering) or (b) dividing clusters into pairs until each leaf cluster contains only one object (divisive clustering). However, hierarchical clustering operates at a local level, and no global objective function is directly minimized as in partitional methods (discussed below). In addition, hierarchical clustering is very sensitive to outliers. Furthermore, once two clusters are merged or split during the clustering process, the operation cannot be undone in a later step in favor of a more suitable merge or split [39]. Hence, hierarchical clustering may yield a suboptimal solution, and the dendrogram may not necessarily represent the natural clusters in the data.

Spectral clustering is typically used in graphical applications such as image processing, computer vision, and computational biology. It is a graph-based approach in which the sequences are treated as nodes of a graph that are mapped to a low-dimensional space using the spectrum (eigenvalues) of the similarity matrix of the input data [40, 41]. Spectral clustering is very sensitive to the initial conditions. In addition, it is sensitive to the similarity parameters used to define the "connectedness" in the similarity graph, which enter the Gaussian similarity function [40]. The method is also computationally expensive for large datasets.

In partitional clustering, each cluster is represented by the center of the cluster (e.g., k-means) or by the member of the cluster with the minimum average dissimilarity to all other members (e.g., k-medoids). These algorithms work by assigning the sequences to the closest cluster representative (mean or medoid) and then updating the mean or medoid. Ideally, this process is repeated until no changes in cluster assignments are observed. The convergence of partitional clustering is usually monitored by the "sum of distances" (SOD), which is the total summation of the distances between each observation and its cluster medoid/center (see Eq. 7.8). Partitional clustering may converge to a local minimum since the algorithms depend heavily on the initial conditions (e.g., the initial medoids). However, this issue can be controlled by tuning the initial conditions: by monitoring the convergence criterion for several randomly selected initial conditions, some initial conditions will converge to higher SOD values (local minima) while others will converge to a low SOD value (the global minimum). Furthermore, by monitoring the number of iterations needed for convergence, a good initial condition can be selected. In addition, partitional clustering yields a simple representation of the clustered results, since each cluster is represented by its medoid or average.

This study will use partitional clustering to cluster the SCG morphology. Partitional methods have been extensively used in shape-based time series clustering with DTW as a distance measure, where the k-medoid method is generally preferred over k-means to avoid the effect of outliers. In addition, k-means may require more computation, since it computes an artificial average sequence as the centroid at each iteration [42]. This centroid must be computed by a shape-based averaging method such as DBA, which requires additional computational expense. In contrast, k-medoid determines the medoid sequence, which can be easily implemented with the DTW measure.
A recent study [32] reported higher accuracy for k-medoid clustering when compared with other clustering methods for shape-based clustering of time series. The details of the proposed algorithm are given below.

7.2.5.3 k-Medoid Clustering with DTW as a Distance Measure

Let a single sequence (i.e., SCG beat) be defined by its amplitudes as

X_i = \{x_1, x_2, x_3, \ldots, x_{l_i}\}   (7.5)

where l_i is the length of the sequence X_i. A set of sequences (i.e., SCG beats) can be defined as

S = \{X_1, X_2, X_3, \ldots, X_i, \ldots, X_N\}   (7.6)

where N is the number of sequences (i.e., SCG beats) to be clustered. The algorithm can be described as follows.

Algorithm
Inputs: number of clusters K; set of sequences S = \{X_1, X_2, X_3, \ldots, X_i, \ldots, X_N\}, where each sequence is defined by its feature vector (amplitudes) X_i = \{x_1, x_2, x_3, \ldots, x_{l_i}\}.
Step 1: Initialize C_1, \ldots, C_j, \ldots, C_K as the medoids of the clusters.
Step 2: For each X_i, find the nearest C_j using DTW as the distance measure and assign X_i to cluster j.
Step 3: Update C_j based on the clustered events from step 2 using Eq. (7.7):

C_j = \operatorname{argmin}_{y \in \{X_{1j}, X_{2j}, \ldots, X_{ij}, \ldots, X_{n_j j}\}} \sum_{i=1}^{n_j} \mathrm{dtw}(y, X_{ij})   (7.7)

where X_{ij} is the ith sequence belonging to cluster j and n_j is the number of sequences belonging to C_j after step 2.
Step 4: Repeat steps 2 and 3 until none of the cluster assignments change.
The time complexity of DTW is Θ(l²), where l denotes the length of a sequence, and when DTW is calculated N times, the complexity becomes Θ(N·l²) [37]. Hence, to reduce the time complexity of clustering in the current study, SCG beats were downsampled to 500 Hz (after filtering and segmentation). The SCG beats were also normalized by their maximum amplitudes, which is not expected to affect the DTW measure of similarity.
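A compact MATLAB sketch of steps 1–4 is shown below; it assumes beats is a cell array of (variable-length) SCG beats, precomputes the pairwise DTW matrix with the dtwDistance function sketched earlier (the built-in dtw works equally well), and uses random initial medoids (in practice, several random initializations would be compared, as discussed above):

```matlab
function [labels, medoids] = kmedoidDTW(beats, K, maxIter)
    N = numel(beats);
    D = zeros(N);                                  % pairwise DTW distances
    for i = 1:N
        for j = i+1:N
            D(i, j) = dtwDistance(beats{i}, beats{j});
            D(j, i) = D(i, j);
        end
    end
    medoids = randperm(N, K);                      % Step 1: random initial medoids
    for it = 1:maxIter
        [~, labels] = min(D(:, medoids), [], 2);   % Step 2: assign to nearest medoid
        newMedoids = medoids;
        for k = 1:K
            members = find(labels == k);
            [~, best] = min(sum(D(members, members), 2));  % Step 3 / Eq. (7.7)
            newMedoids(k) = members(best);
        end
        if isequal(newMedoids, medoids), break; end        % Step 4: converged
        medoids = newMedoids;
    end
end
```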

7.3 Results and Discussion

In this section, we discuss the clustering results in relation to respiratory phases and heart rate. Cluster switching timing and the variability are also discussed.


7.3.1 Optimum Number of Clusters

The optimum number of clusters was decided using the elbow method and analysis of the average silhouette value. The elbow method is commonly used to determine the fewest number of clusters that optimizes the intra-cluster variance. The variance (or heterogeneity) of the clustering is often measured by calculating the sum of distances (SOD) between the observation points (i.e., SCG beats) and their cluster medoids using Eq. (7.8):

\mathrm{SOD} = \frac{1}{N} \sum_{j=1}^{k} \sum_{i=1}^{n_j} \mathrm{dtw}(C_j, X_{ij})   (7.8)

where X_{ij} is the ith sequence belonging to cluster medoid C_j, n_j is the number of sequences belonging to C_j, and N is the total number of sequences used for clustering. As the number of clusters increases, the observation points get closer to their centroids and the SOD decreases (e.g., for N clusters, SOD reaches zero). When SOD is plotted against the number of clusters, it declines rapidly at first and then at a slower rate, creating an elbow shape in the plot. The number of clusters at the elbow point can then be selected as the optimum number of clusters. Figure 7.11 shows the mean SOD of the 17 subjects for different numbers of clusters. The elbow shape can be seen around a cluster number of 2.

Fig. 7.11 Average SOD for different number of clusters. Two clusters were chosen using the elbow method
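Given the pairwise DTW matrix D, the labels, and the medoid indices from the k-medoid sketch above, the SOD of Eq. (7.8) reduces to a few lines; running the clustering for K = 1, 2, 3, ... and plotting sod against K produces the elbow curve:

```matlab
% SOD per Eq. (7.8), reusing D, labels, and medoids from kmedoidDTW
sod = 0;
for j = 1:K
    sod = sod + sum(D(medoids(j), labels == j));   % distances to own medoid
end
sod = sod / N;                                     % normalized by beat count
```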


The average silhouette value (Si) of the clustered data is another method to determine the optimum number of clusters. The Si value of an observation point (i.e., SCG beat) in the cluster is a relative measure of how well that observation point is placed within its own cluster. The silhouette value of point i in the clustering can be expressed as

S_i = (b_i - a_i) / \max(a_i, b_i)   (7.9)

where a_i denotes the average distance from the ith point to the other points in the same cluster (the cluster that point i belongs to), and b_i denotes the minimum over the other clusters of the average distance from the ith point to the points in that cluster. Si values range from −1 to 1, where a positive value close to 1 indicates that a point lies well inside its own cluster (away from the boundaries of other clusters) and a negative value close to −1 indicates that the point lies closer to another cluster. Figure 7.12 shows the average Si values calculated for the 17 subjects. The results indicated that the highest average Si value was observed when the data were clustered into two groups; as the number of clusters increased, the average Si value decreased. Based on the results from the elbow method and the average Si value, 2 clusters were selected for the study.

Fig. 7.12 Box-whiskers plot of average Silhouette value (calculated for all 17 subjects) for different number of clusters
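Under the same precomputed-distance setup, a per-beat implementation of Eq. (7.9) might look as follows (averaging si over all beats gives the values summarized in Fig. 7.12):

```matlab
function si = silhouetteDTW(D, labels)
% Per-beat silhouette values per Eq. (7.9), from the precomputed
% DTW distance matrix D and the cluster labels
    N = numel(labels); K = max(labels);
    si = zeros(N, 1);
    for i = 1:N
        same = (labels == labels(i)); same(i) = false;
        a = mean(D(i, same));                     % mean distance within own cluster
        b = inf;
        for k = 1:K
            if k ~= labels(i)
                b = min(b, mean(D(i, labels == k)));   % nearest other cluster
            end
        end
        si(i) = (b - a) / max(a, b);
    end
end
```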


7.3.2 Purity of Clustering with Labels HLV/LLV and INS/EXP

Some previous studies categorized the SCG signal into respiratory phases to minimize the variability of SCG beats prior to feature detection. While some studies chose to group SCG beats based on respiratory flow direction (inhale/exhale) [12, 15], others grouped SCG beats according to the lung volume [13, 28] or to the movement of the chest measured using a chest belt [30]. To study the efficiency of these grouping criteria, purity values of the clustered data were calculated for two labeling criteria. The labels were named inspiration/expiration (INS/EXP) and high/low lung volume (HLV/LLV). The positive/negative values of the flowrate signal (measured using a spirometer) indicated inspiration/expiration, respectively. The flowrate signal was integrated to obtain the lung volume signal, and the positive/negative values of that signal indicated high/low lung volume, respectively. SCG beats were labeled based on the timing of the corresponding R peak on the respiratory signals, as shown in Fig. 7.13. The purity value indicates how well the labeling criterion fits the clustering result and is defined as

\mathrm{Purity} = \frac{TP + TN}{TP + TN + FP + FN}   (7.10)

where TP is the number of true positives and TN is the number of true negatives; FP and FN indicate the numbers of false positives and false negatives, respectively. For example, if the labeling criterion is INS/EXP and the SCG beats are divided into two clusters, TP indicates the number of SCG beats correctly labeled as INS and TN the number correctly labeled as EXP. Similarly, FP and FN are the numbers of beats incorrectly labeled as INS and EXP, respectively.

Fig. 7.13 Labeling SCG beats, HLV high lung volume, LLV low lung volume, EXP expiration, INSP inspiration, Red trace lung volume, Blue trace respiratory flowrate
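For two clusters and a binary label, Eq. (7.10) is a direct count; in the sketch below, labels holds the cluster assignment (1 or 2) of each beat and isIns is a logical vector marking beats labeled INS at their R peaks (both names are illustrative):

```matlab
tp = sum(labels == 1 &  isIns);    % cluster-1 beats labeled INS
tn = sum(labels == 2 & ~isIns);    % cluster-2 beats labeled EXP
fp = sum(labels == 1 & ~isIns);
fn = sum(labels == 2 &  isIns);
purity = (tp + tn) / (tp + tn + fp + fn);
```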

Fig. 7.14 Purity values for labeling criteria INS/EXP and HLV/LLV. HLV/LLV labeling provided higher purity levels

Fig. 7.15 SOD values before clustering and when the clusters were separated based on HLV/LLV, INS/EXP, and the k-medoid solution. k-medoid had the least SOD in all subjects

Figure 7.14 shows the purity values for the 17 subjects. The mean purity values for INS/EXP and HLV/LLV were 0.66 and 0.77, respectively. The higher purity for HLV/LLV suggests that clustering using the HLV/LLV criterion would provide better separation of SCG beats than the INS/EXP criterion (except for subjects 5 and 8). Further, the intra-group variance (as an indicator of group heterogeneity) before clustering was compared with that obtained when grouping by flowrate direction (i.e., INS/EXP), lung volume (i.e., HLV/LLV), and k-medoid clustering. Here, SOD (Eq. 7.8) was used as a measure of the variance [33]. The mean SOD was 21.57 before clustering and 20.93, 20.22, and 18.24 for INS/EXP, HLV/LLV, and k-medoid, respectively. It can then be concluded that the highest variance existed before clustering and that it decreased by about 15% when k-medoid clustering was performed (Fig. 7.15). The variance for INS/EXP was slightly higher than for HLV/LLV grouping, suggesting that HLV/LLV grouping may provide better homogeneity than INS/EXP; Fig. 7.15 shows higher heterogeneity for INS/EXP compared with HLV/LLV grouping in most subjects.


7.3.3 Analyzing Cluster Distribution with Respiratory Phases

To analyze the possible relations between the cluster distribution and the respiratory cycle, the timing of the clustered SCG beats was located on the respiratory flowrate and lung volume waveforms. Here, the respiratory waveforms were normalized according to Eq. (7.11):

F_{\mathrm{norm}} = \frac{F}{F_{\max} - F_{\min}}   (7.11)

where F is the respiratory waveform and F_max, F_min are its maximum and minimum values. The locations of the SCG beats are shown in Fig. 7.16, where beats belonging to cluster 1 and cluster 2 are labeled as blue "o" circles and red "∇" triangles, respectively. The shown locations of the SCG beats coincide with the timing of their respective R peaks. Figure 7.16a shows the cluster distribution (of one subject) on the normalized flowrate waveform and Fig. 7.16b shows the cluster distribution on the lung volume waveform. Figure 7.16c shows the cluster distribution with both flowrate and lung volume in 3D.

Fig. 7.16 Cluster results (event locations) plotted on (a) respiratory flowrate cycle, (b) lung volume cycle, (c) both flowrate and lung volume cycles in 3D


Fig. 7.17 (a) Four respiratory phases identified in a simplified lung volume waveform, (b) Cluster distribution in lung volume and flowrate space in a typical subject

For a better analysis of the cluster distribution, the respiratory phases were divided into four groups, namely, HLV-INS, HLV-EXP, LLV-EXP, and LLV-INS, as shown in Fig. 7.17a. A recent study [28] used a variance minimization approach to analyze SCG clustering and observed similar trends. The cluster distribution (of one subject) with the corresponding respiratory flowrate and lung volume is presented in Fig. 7.17b, and the corresponding four respiratory phases are shown in Fig. 7.17a. As shown in Fig. 7.17b, the clusters are well separated in the LLV-EXP and HLV-INS regions, while they are mixed in the LLV-INS and HLV-EXP regions, where the cluster switching occurs. This trend was consistent among all subjects.

7.3.4 Cluster Switching

The cutoff timing at which an SCG beat switches between the two clusters was determined by employing linear support vector machine (SVM) theory [43]. Here, the SVM method was applied in a two-dimensional space using the values of respiratory flowrate and lung volume at the time of the R peak of each SCG beat. The linear SVM method finds a hyperplane (i.e., a decision boundary) in the flowrate-lung volume feature space such that the margin between the two clusters is maximized. Figure 7.18 shows the decision boundary plotted on flowrate vs. lung volume for one subject. The equation of the decision boundary is a linear function of flowrate and lung volume, which we define as the cluster cutoff equation. Similarly, cluster cutoff equations were calculated for all the subjects, and the results are shown in Fig. 7.19. These linear equations (of the form LV = m × FL + c) had a slope "m" of −0.63 ± 0.39 (mean ± std) and an intercept "c" of 0.05 ± 0.05 (mean ± std).
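A sketch of this step using fitcsvm (Statistics and Machine Learning Toolbox) is shown below; flowrate, lungVolume, and clusterLabels are illustrative names for the per-beat values at the R peaks and the k-medoid assignments:

```matlab
X = [flowrate(:), lungVolume(:)];                  % one row per SCG beat
mdl = fitcsvm(X, clusterLabels, 'KernelFunction', 'linear');
% decision boundary: Beta(1)*FL + Beta(2)*LV + Bias = 0, i.e., LV = m*FL + c
m = -mdl.Beta(1) / mdl.Beta(2);                    % cutoff slope
c = -mdl.Bias    / mdl.Beta(2);                    % cutoff intercept
```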


Fig. 7.18 The cutoff function between the two clusters calculated from linear SVM theory in a typical subject

7.3.5 Relation Between Heart Rate and Clustering

To investigate the relation between heart rate and clustering, the cluster distribution was plotted with the heart rate variation as a function of breathing, as shown in Fig. 7.20. As shown in Fig. 7.20b, the trend line of the heart rate showed patterns similar to those in previous studies of the well-known phenomenon of respiratory sinus arrhythmia (RSA) [44]. The data showed that the average heart rates for the two clusters were 76.95 and 73.36 beats/min, respectively. The heart rate for the cluster containing the HLV-INS phase was 6.03% ± 3.32% (mean ± std) higher than that of the other cluster. This change in heart rate was significant (p = 3.04 × 10⁻⁷).

7.3.6 Intra-cluster Variability

In k-medoid clustering, the distances between the observation points and the medoids are measured, and the observation points (i.e., SCG beats) are then assigned to the cluster with the closest medoid. Hence, within a single cluster, some points are closer to the medoid while others are farther from it. To demonstrate SCG beat variation within clusters, Fig. 7.21 shows the SCG beats of cluster 1 and cluster 2 separately. Here, it can be seen that the clustered SCG beats contain dominant


Fig. 7.19 Cluster distributions and the cluster cutoff equations plotted in flowrate vs. lung volume space for all study subjects


Fig. 7.20 (a) Defining respiratory phase angle, (b) cluster distribution with heart rate as a function of respiratory phase angle (for one subject)

Fig. 7.21 Clustered SCG beats plotted separately for cluster 1 and cluster 2 (for one subject)

morphological features inherent to each cluster while having noticeable variability within the clusters. By analyzing the distance from the cluster medoid to each point belonging to that cluster, the variance of the cluster can be quantified. In our study, such analysis will help detect outliers as well as locate SCG beats with the most similar morphologies. Figure 7.22a, b show a plot of the normalized distance distribution from the medoid to the beats in the same cluster. In Fig. 7.22, the distances are sorted from the closest to the farthest. The plot takes the form of a tangent-like function, which was consistent among all subjects. The first point in the plot represents the medoid itself (distance is zero), followed by the point closest to the medoid (point A). As the points move away from the medoid, the distance gradually increases at an approximately constant rate (e.g., point B). As the distance further increases, a steeper distance


Fig. 7.22 Distance plotted from cluster medoids to beats in the same cluster in ascending order for (a) cluster 1 and (b) cluster 2. Closest 10% of the beats to the medoid of (c) cluster 1 and (d) cluster 2. Farthest 10% of the beats from the medoid of (e) cluster 1 and (f) cluster 2

increase is seen. This steep gradient appears to continue until point C, where outliers may be found. These data suggest a relatively small number of outliers. Figure 7.22c–f show the 10% of SCG beats that are closest to and farthest from the medoids for both clusters. As expected, the results show that the beats closest to the medoids have more homogeneous morphology, while the beats farthest from the medoid appear to have more heterogeneity. In this study, we proposed and used DBA (Sect. 7.2.5.1) as an averaging method for calculating a representative event from the SCG beats that are closest to the medoid. Here, DBA was calculated for the 10% of beats closest to the medoid (including the medoid itself) using the medoid as the initial average. Figure 7.23 shows the calculated DBA averages (for a sampling frequency of 10 kHz using 10 iterations) of the two clusters for all subjects. These results suggest significant inter-subject variability of SCG beats as well as varying effects of respiration. The noticeable differences between the beats in the two clusters suggest that separating SCG beats into two clusters would provide a more precise estimate of SCG waveforms.


Fig. 7.23 Calculated DBA for 2 clusters for different subjects

Although the number of participants was relatively small (17 participants, with a few hundred heart beats recorded from each), which is a limitation of the study, the clustering results were consistent across all subjects and reached statistical significance. In addition, the clustering results were consistent with the findings of previous studies


that showed a similar dependence of SCG morphology on respiratory phases. To further confirm this result, future studies will consider a larger number of subjects, including those with cardiac conditions.

7.4 Conclusion

This study investigated the utility of unsupervised machine learning in clustering SCG beats to reduce their variability. Seventeen subjects participated in the study. k-medoid clustering was implemented, and dynamic time warping was chosen as the dissimilarity measure. The study results showed that SCG morphology can be optimally separated into two clusters based on the elbow method and the comparison of average silhouette values.

The relation between clusters and respiratory phases was investigated. The clusters had better agreement with lung volume phases (i.e., high vs. low lung volume) than with respiratory flowrate phases (i.e., inspiration vs. expiration). SCG switching from one cluster to the other consistently occurred during the first half of inspiration and expiration. The relation between SCG switching and heart rate was also investigated. The average heart rate of the first cluster (containing inspiration and high lung volume) was significantly higher than that of the other cluster. This suggests that the mechanisms that cause respiratory sinus arrhythmia may be involved in SCG variability.

Waveform differences between clusters were noticeable and varied among subjects. The proposed clustering significantly (p < 0.01) decreased SCG variability (by about 15%). The reduced variability can provide more precise average waveforms and, consequently, more accurate estimates of SCG features. This may yield stronger SCG correlation with heart function, which would enhance the clinical utility of SCG. While several studies have shown SCG utility for monitoring cardiac pathology, more studies are actively investigating additional clinically relevant SCG features. If successful, SCG would provide an inexpensive, portable, noninvasive tool for telehealth and precision medicine applications. It may also provide useful information for big data approaches that draw on large volumes of health-system and monitoring data.

Acknowledgements This study was supported by NIH R44HL099053. Hansen A. Mansy and Richard H. Sandler are part owners of Biomedical Acoustics Research Company, which is the primary recipient of the above grant; as such, they may benefit financially as a result of the outcomes of the research work reported in this publication.

References

1. Gurev, V., Tavakolian, K., Constantino, J., Kaminska, B., Blaber, A. P., & Trayanova, N. A. (2012). Mechanisms underlying isovolumic contraction and ejection peaks in seismocardiogram morphology. Journal of Medical and Biological Engineering, 32(2), 103.


2. Korzeniowska-Kubacka, I., Kuśmierczyk-Droszcz, B., Bilińska, M., Dobraszkiewicz-Wasilewska, B., Mazurek, K., & Piotrowicz, R. (2006). Seismocardiography—a non-invasive method of assessing systolic and diastolic left ventricular function in ischaemic heart disease. Cardiology Journal, 13(4), 319–325. 3. Taebi, A., Solar, B. E., Bomar, A. J., Sandler, R. H., & Mansy, H. A. (2019). Recent advances in seismocardiography. Vibration, 2(1), 64–86. 4. Mounsey, P. (1957). Praecordial ballistocardiography. British Heart Journal, 19(2), 259. 5. Bozhenko, B. (1961). Seismocardiography—a new method in the study of functional conditions of the heart. Terapevticheskiĭ Arkhiv, 33, 55. 6. Inan, O. T., Migeotte, P.-F., Park, K.-S., Etemadi, M., Tavakolian, K., Casanella, R., et al. (2015). Ballistocardiography and seismocardiography: a review of recent advances. IEEE Journal of Biomedical and Health Informatics, 19(4), 1414–1427. 7. Wilson, R. A., Bamrah, V. S., Lindsay, J., Jr., Schwaiger, M., & Morganroth, J. (1993). Diagnostic accuracy of seismocardiography compared with electrocardiography for the anatomic and physiologic diagnosis of coronary artery disease during exercise testing. The American Journal of Cardiology, 71(7), 536–545. 8. Di Rienzo, M., Meriggi, P., Rizzo, F., Vaini, E., Faini, A., Merati, G., et al. (2011). A wearable system for the seismocardiogram assessment in daily life conditions. In Paper presented at the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 9. Sahoo, P., Thakkar, H., Lin, W.-Y., Chang, P.-C., & Lee, M.-Y. (2018). On the design of an efficient cardiac health monitoring system through combined analysis of ECG and SCG signals. Sensors, 18(2), 379. 10. Wick, C. A., Su, J.-J., McClellan, J. H., Brand, O., Bhatti, P. T., Buice, A. L., et al. (2012). A system for seismocardiography-based identification of quiescent heart phases: implications for cardiac imaging. IEEE Transactions on Information Technology in Biomedicine, 16(5), 869–877. 11. Krishnan, K., Mansy, H., Berson, A., Mentz, R. J., & Sandler, R. H. (2018). Seismocardiographic changes with HF status change: observations from a pilot study. Journal of Cardiac Failure, 24(8), S54. 12. Taebi, A., & Mansy, H. A. (2017a). Grouping similar seismocardiographic signals using respiratory information. In Paper presented at the Signal Processing in Medicine and Biology Symposium (SPMB), 2017 IEEE. 13. Taebi, A. (2018). Characterization, classification, and genesis of seismocardiographic signals. Ph.D. Thesis, University of Central Florida, Orlando, FL, USA. 14. Taebi, A., & Mansy, H. A. (2017b). Time-frequency distribution of seismocardiographic signals: a comparative study. Bioengineering, 4(2), 32. 15. Taebi, A., Solar, B. E., & Mansy, H. A. (2018). An adaptive feature extraction algorithm for classification of seismocardiographic signals. arXiv preprint arXiv:1803.10343. 16. Morillo, D. S., Ojeda, J. L. R., Foix, L. F. C., & Jiménez, A. L. (2010). An accelerometer-based device for sleep apnea screening. IEEE Transactions on Information Technology in Biomedicine, 14(2), 491–499. 17. Reinvuo, T., Hannula, M., Sorvoja, H., Alasaarela, E., & Myllyla, R. (2006). Measurement of respiratory rate with high-resolution accelerometer and EMFit pressure sensor. In Paper presented at the Proceedings of the 2006 IEEE Sensors Applications Symposium. 18. Akhbardeh, A., Tavakolian, K., Gurev, V., Lee, T., New, W., Kaminska, B., et al. (2009).
Comparative analysis of three different modalities for characterization of the seismocardiogram. In Paper presented at the Conference Proceedings. 19. Zanetti, J., Poliac, M., & Crow, R. (1991). Seismocardiography: Waveform identification and noise analysis. In Paper presented at the [1991] Proceedings Computers in Cardiology. 20. Zanetti, J. M., & Tavakolian, K. (2013). Seismocardiography: Past, present and future. In Paper presented at the 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).


21. Khosrow-Khavar, F., Tavakolian, K., Blaber, A. P., Zanetti, J. M., Fazel-Rezai, R., & Menon, C. (2015). Automatic annotation of seismocardiogram with high-frequency precordial accelerations. IEEE Journal of Biomedical and Health Informatics, 19(4), 1428–1434. 22. Sørensen, K., Schmidt, S. E., Jensen, A. S., Søgaard, P., & Struijk, J. J. (2018). Definition of fiducial points in the normal seismocardiogram. Scientific Reports, 8(1), 15455. 23. Crow, R. S., Hannan, P., Jacobs, D., Hedquist, L., & Salerno, D. M. (1994). Relationship between seismocardiogram and echocardiogram for events in the cardiac cycle. American Journal of Noninvasive Cardiology, 8, 39–46. 24. Tavakolian, K., Portacio, G., Tamddondoust, N. R., Jahns, G., Ngai, B., Dumont, G. A., et al. (2012). Myocardial contractility: a seismocardiography approach. In Paper presented at the Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. 25. Tavakolian, K., Vaseghi, A., & Kaminska, B. (2008). Improvement of ballistocardiogram processing by inclusion of respiration information. Physiological Measurement, 29(7), 771. 26. Dai, Z., Peng, Y., Henry, B. M., Mansy, H. A., Sandler, R. H., & Royston, T. J. (2014). A comprehensive computational model of sound transmission through the porcine lung. The Journal of the Acoustical Society of America, 136(3), 1419–1429. 27. Cheuk, M. Y., & Sanderson, J. E. (1997). Right and left ventricular diastolic function in patients with and without heart failure: effect of age, sex, heart rate, and respiration on Doppler-derived measurements. American Heart Journal, 134(3), 426–434. 28. Gamage, P. T., Azad, M. K., Taebi, A., Sandler, R. H., & Mansy, H. A. (2018). Clustering seismocardiographic events using unsupervised machine learning. In Paper presented at the 2018 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). 29. Solar, B. E., Taebi, A., & Mansy, H. A. (2017). Classification of seismocardiographic cycles into lung volume phases. In Paper presented at the Signal Processing in Medicine and Biology Symposium (SPMB), 2017 IEEE. 30. Zakeri, V., Akhbardeh, A., Alamdari, N., Fazel-Rezai, R., Paukkunen, M., & Tavakolian, K. (2017). Analyzing seismocardiogram cycles to identify the respiratory phases. IEEE Transactions on Biomedical Engineering, 64(8), 1786–1792. 31. Pan, J., & Tompkins, W. J. (1985). A real-time QRS detection algorithm. IEEE Transactions on Biomedical Engineering, BME-32(3), 230–236. https://doi.org/10.1109/TBME.1985.325532. 32. Paparrizos, J., & Gravano, L. (2017). Fast and accurate time-series clustering. ACM Transactions on Database Systems (TODS), 42(2), 8. 33. Paparrizos, J., & Gravano, L. (2015). k-shape: Efficient and accurate clustering of time series. In Paper presented at the Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 34. Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 43–49. 35. Silva, D. F., & Batista, G. E. (2016). Speeding up all-pairwise dynamic time warping matrix calculation. In Paper presented at the Proceedings of the 2016 SIAM International Conference on Data Mining. 36. Zhang, Z., Tang, P., Huo, L., & Zhou, Z. (2014). MODIS NDVI time series clustering under dynamic time warping. International Journal of Wavelets, Multiresolution and Information Processing, 12(05), 1461011. 37. Petitjean, F., Forestier, G., Webb, G. I., Nicholson, A.
E., Chen, Y., & Keogh, E. (2014). Dynamic time warping averaging of time series allows faster and more accurate classification. In Paper presented at the 2014 IEEE International Conference on Data Mining (ICDM). 38. Petitjean, F., Ketterlin, A., & Gançarski, P. (2011). A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3), 678–693. 39. Rokach, L., & Maimon, O. (2005). Clustering methods. In Data mining and knowledge discovery handbook (pp. 321–352). Berlin: Springer. 40. Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.


41. Zelnik-Manor, L., & Perona, P. (2005). Self-tuning spectral clustering. In Paper presented at the Advances in neural information processing systems. 42. Liao, T. W. (2005). Clustering of time series data—A survey. Pattern Recognition, 38(11), 1857–1874. 43. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. 44. Angelone, A., & Coulter, N. A., Jr. (1964). Respiratory sinus arrhythmia: a frequency dependent phenomenon. Journal of Applied Physiology, 19(3), 479–482.

Chapter 8

Deep Learning Approaches for Automated Seizure Detection from Scalp Electroencephalograms

Meysam Golmohammadi, Vinit Shah, Iyad Obeid, and Joseph Picone

8.1 Introduction

An electroencephalogram (EEG) records the spontaneous electrical activity of the brain along the scalp. The signals measured along the scalp can be correlated with brain activity, which makes the EEG a primary tool for the diagnosis of brain-related illnesses [1, 2]. EEGs are used in a broad range of healthcare institutions to monitor and record electrical activity in the brain. They are essential in the diagnosis of clinical conditions such as epilepsy, depth of anesthesia, coma, encephalopathy, and brain death, and even in tracking the progression of Alzheimer's disease [3, 4]. Manual interpretation of EEGs is time-consuming since these recordings may last hours or days. It is also an expensive process because it requires highly trained experts. Therefore, high-performance automated analysis of EEGs can reduce the time to diagnosis and enhance real-time applications by flagging sections of the signal that need further review. Many methods have been developed over the years [5], including time-frequency digital signal processing techniques [6, 7], autoregressive spectral analysis [8], wavelet analysis [9], nonlinear dynamical analysis [10], multivariate techniques based on simulated leaky integrate-and-fire neurons [11–13], and expert systems that attempt to mimic a human observer [14]. In spite of recent research progress in this field, the transition of automated EEG analysis

M. Golmohammadi
Internet Brands, El Segundo, CA, USA
Department of Electrical and Computer Engineering, The Neural Engineering Data Consortium, Temple University, Philadelphia, PA, USA

V. Shah · I. Obeid · J. Picone ()
Department of Electrical and Computer Engineering, The Neural Engineering Data Consortium, Temple University, Philadelphia, PA, USA

© Springer Nature Switzerland AG 2020
I. Obeid et al. (eds.), Signal Processing in Medicine and Biology, https://doi.org/10.1007/978-3-030-36844-9_8


technology to commercial products in operational use in clinical settings has been limited, mainly because of unacceptably high false alarm rates [15–17]. In recent years, progress in machine learning and big data resources has enabled a new generation of technology that is approaching acceptable levels of performance for clinical applications. The main challenge in this task is to operate with an extremely low false alarm rate. A typical critical care unit contains 12 to 24 beds. Even a relatively low false alarm rate of 5 false alarms (FAs) per patient per 24 h translates to between 60 and 120 false alarms per day across the unit and would overwhelm the healthcare staff servicing these events. This is especially true when one considers the amount of other equipment that frequently triggers alerts [18]. In this chapter, we discuss the application of deep learning technology to the automated EEG interpretation problem and introduce several promising architectures that deliver performance close to the requirements for operational use in clinical settings.

8.1.1 Leveraging Recent Advances in Deep Learning

Machine learning has made tremendous progress over the past three decades due to rapid advances in low-cost, highly parallel computational infrastructure, powerful machine learning algorithms, and, most importantly, big data. Although contemporary approaches for automatic interpretation of EEGs have employed more modern machine learning approaches such as neural networks [19, 20] and support vector machines [21], state-of-the-art machine learning algorithms have not previously been utilized in EEG analysis because of a lack of big data resources. A significant big data resource known as the TUH EEG Corpus (TUEG) is now available, creating a unique opportunity to evaluate high-performance deep learning approaches [22]. This database includes detailed physician reports and patient medical histories, which are critical to the application of deep learning. However, transforming physicians' reports into meaningful information that can be exploited by deep learning paradigms is proving to be challenging because the mapping of reports to underlying EEG events is nontrivial. Though modern deep learning algorithms have generated significant improvements in performance in fields such as speech and image recognition, it is far from trivial to apply these approaches to new domains, especially applications such as EEG analysis that rely on waveform interpretation. Deep learning approaches can be viewed as a broad family of neural network algorithms that use a large number of layers of nonlinear processing units to learn a mapping between inputs and outputs. These algorithms are usually trained using a combination of supervised and unsupervised learning. The best overall approach is often determined empirically and requires extensive experimentation for optimization. There is no universal theory on how to arrive at the best architecture, and the results are almost always heavily data dependent. Therefore, in this chapter we present a variety of approaches and establish some well-calibrated benchmarks of performance. We explore two general classes of deep neural networks in detail.


The first class is the Convolutional Neural Network (CNN), a class of deep neural networks that has revolutionized fields like image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing through end-to-end learning from raw data [23]. An interesting characteristic of CNNs that was leveraged in these applications is their ability to learn local patterns in data by using convolutions (more precisely, cross-correlations) as their key component. This property makes them a powerful candidate for modeling EEGs, which are inherently multichannel signals. Each channel in an EEG possesses some spatial significance with respect to the type and locality of a seizure event [24]. EEGs also have an extremely low signal-to-noise ratio, and events of interest such as seizures are easily confused with signal artifacts (e.g., eye movements) or benign variants (e.g., slowing) [25]. The spatial property of the signal is an important cue for disambiguating these types of artifacts from seizures. These properties make modeling EEGs more challenging than more conventional applications like image recognition of static images or speech recognition using a single microphone. In this study, we adapt well-known CNN architectures to be more suitable for automatic seizure detection. Leveraging a high-performance time-synchronous system that provides accurate segmentation of the signal is also crucial to the development of these kinds of systems. Hence, we use a hidden Markov model (HMM) based approach as a non-deep learning baseline system [26]. Optimizing the depth of a CNN is crucial to achieving state-of-the-art performance. Best results are achieved on most tasks by exploiting very deep structures (e.g., thirteen layers are common) [27, 28]. However, training deeper CNN structures is more difficult since they are prone to degradation in performance with respect to generalization and suffer from convergence problems. As the depth of a CNN is increased incrementally, sensitivity often saturates and then decreases rapidly. Increasing the number of layers also often increases the error on the training data due to convergence issues, indicating that the degradation in performance is not created by overfitting. We address such degradations in performance by designing deeper CNNs using a deep residual learning framework (ResNet) [29]. We also extend the CNN approach by introducing an alternate structure, a deep convolutional generative adversarial network (DCGAN) [30], to allow unsupervised training. Generative adversarial networks (GANs) [31] have emerged as powerful techniques for learning generative models based on game theory. Generative models use an analysis-by-synthesis approach to learn the essential features of data required for high-performance classification using an unsupervised approach. We introduce techniques to stabilize the training of DCGAN for spatio-temporal modeling of EEGs. The second class of network that we discuss is the Long Short-Term Memory (LSTM) network. LSTMs are a special kind of recurrent neural network (RNN) architecture that can learn long-term dependencies. This is achieved by introducing a new structure called a memory cell and by adding multiplicative gate units that learn to open and close access to the constant error flow [32]. It has been shown


that LSTMs are capable of learning to bridge minimal time lags in excess of 1000 discrete time steps. To overcome the problem of learning long-term dependencies in modeling of EEGs, we describe a few hybrid systems composed of LSTMs that model both spatial relationships (e.g., cross-channel dependencies) and temporal dynamics (e.g., spikes). In an alternative approach for sequence learning of EEGs, we propose a structure based on gated recurrent units (GRUs) [33]. A GRU is a gating mechanism for RNNs that is similar in concept to what LSTMs attempt to accomplish. It has been shown that GRUs can outperform many other RNNs, including LSTMs, on several datasets [33].
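The practical difference between the two recurrent cell types can be made concrete with a short sketch. Keras layer names are used purely for illustration, and the shapes are placeholders rather than the systems described later in this chapter:

```python
# Sketch: swapping an LSTM for a GRU changes only the layer class; the gating
# internals differ, but the interface is the same. Shapes are illustrative.
from tensorflow.keras import layers, models

def make_rnn(cell="lstm", timesteps=210, features=26):
    rnn = layers.LSTM if cell == "lstm" else layers.GRU
    model = models.Sequential([
        layers.Input(shape=(timesteps, features)),
        rnn(128),                                # memory cells / gating units
        layers.Dense(1, activation="sigmoid"),   # seizure vs. background
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

lstm_model = make_rnn("lstm")
gru_model = make_rnn("gru")
```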

8.1.2 Big Data Enables Deep Learning Research

Recognizing that deep learning algorithms require large amounts of data to train complex models, especially when one attempts to process clinical data with a significant number of artifacts using specialized models, we have developed a large corpus of EEG data to support this kind of technology development. The TUEG Corpus is the largest publicly available corpus of clinical EEG recordings in the world. The most recent release, v1.1.0, includes data from 2002 to 2015 and contains over 23,000 sessions from over 13,500 patients—over 1.8 years of multichannel signal data in total [22]. This dataset was collected at the Department of Neurology at Temple University Hospital. The data includes sessions taken from outpatient treatments, Intensive Care Units (ICU), Epilepsy Monitoring Units (EMU), and Emergency Rooms (ER), as well as several other locations within the hospital. Since TUEG consists entirely of clinical data, it contains many real-world artifacts (e.g., eye blinking, muscle artifacts, head movements). This makes it an extremely challenging task for machine learning systems and differentiates it from most research corpora currently available in this area. Each of the sessions contains at least one EDF file and one physician report. These reports are generated by a board-certified neurologist and are the official hospital record. The reports consist of unstructured text that describes the patient, relevant history, medications, and clinical impression. The corpus is publicly available from the Neural Engineering Data Consortium (www.nedcdata.org). EEG signals in TUEG were recorded using several generations of Natus Medical Incorporated's Nicolet™ EEG recording technology [34]. The raw signals consist of multichannel recordings in which the number of channels varies between 20 and 128 [24, 35]. A 16-bit A/D converter was used to digitize the data. The sample frequency varies from 250 to 1024 Hz. In our work, we resample all EEGs to a sample frequency of 250 Hz. The Natus system stores the data in a proprietary format that was exported to EDF using NicVue v5.71.4.2530. The original EEG records are split into multiple EDF files depending on how the session was annotated by the attending technician. For our studies, we use the 19 channels associated with a standard 10/20 EEG configuration and apply a Transverse Central Parasagittal (TCP) montage [36, 37].

Table 8.1 An overview of the corpora used to develop the technology described in this chapter

Description       TUSZ Train   TUSZ Eval   DUSZ Eval
Patients          64           50          45
Sessions          281          229         45
Files             1028         985         45
Seizure (s)       17,686       45,649      48,567
Non-seizure (s)   596,696      556,033     599,381
Total (s)         614,382      601,682     647,948

A portion of TUEG was annotated manually for seizures. This corpus is known as the TUH EEG Seizure Detection Corpus (TUSZ) [38]. TUSZ is also the world's largest publicly available corpus of annotated data for seizure detection that is unencumbered; no data sharing or IRB agreements are needed to access the data. TUSZ contains a rich variety of seizure morphologies. Variation in onset and termination, frequency and amplitude, and locality and focality protects the corpus from a bias towards one type of seizure morphology. TUSZ, which reflects a seizure detection task, is the focus of the experiments presented in this chapter. For related work on six-way classification of EEG events, see [26, 39, 40]. We have also included an evaluation on a held-out dataset based on the Duke University Seizure Corpus (DUSZ) [41]. The DUSZ database is collected solely from adult ICU patients exhibiting non-convulsive seizures. These are continuous EEG (cEEG) records [42] in which most seizures are focal and slower in frequency. TUSZ, in contrast, contains records from a much broader range of patients and morphologies. A comparison of these two corpora is shown in Table 8.1. The evaluation sets are comparable in terms of the number of patients and the total amount of data, but TUSZ contains many more sessions collected from each patient. It is important to note that TUSZ was collected using several generations of Natus Incorporated EEG equipment [34], while DUSZ was collected at a different hospital, Duke University, using a Nihon Kohden system [43]. Hence, using DUSZ as a held-out evaluation set is an important benchmark because it tests the robustness of the models to variations in the recording conditions. Deep learning systems are notoriously prone to overtraining, so this second dataset represents important evidence that the results presented here are generalizable and reproducible on other tasks.

8.2 Temporal Modeling of Sequential Signals

The classic approach to machine learning, shown in Fig. 8.1, involves an iterative process that begins with the collection and annotation of data and ends with an open-set, or blind, evaluation. Data is usually sorted into a training set (train), a development test set (dev_test), and an evaluation set (eval). Evaluations on the dev_test data are used to guide system development. One cannot adjust system parameters based on the


Fig. 8.1 An overview of a typical design cycle for machine learning

outcome of evaluations on the eval set but can use these results to assess overall system performance. We typically iterate on all aspects of this approach, including expansion and repartitioning of the training and dev_test data, until overall system performance is optimized. We often leverage previous stages of technology development to seed, or initialize, models used in a new round of development. Further, there is often a need to temporally segment the data, for example, automatically labeling events of interest, to support further exploration of the problem space. Therefore, it is common when exploring new applications to begin with a familiar technology. As previously mentioned, EEG signals have a strong temporal component. Hence, a likely candidate for establishing good baseline results is an HMM approach, since this algorithm is particularly strong at automatically segmenting the data and localizing events of interest. HMM systems typically operate on a sequence of vectors referred to as features. In this section, we briefly introduce the feature extraction process we have used, and we describe a baseline system that integrates hidden Markov models for sequential decoding of EEG events with deep learning for decision-making based on temporal and spatial context.


8.2.1 A Linear Frequency Cepstral Coefficient Approach to Feature Extraction

The first step in our machine learning systems consists of converting the signal to a sequence of feature vectors [44]. Common EEG feature extraction methods include temporal, spatial, and spectral analysis [45, 46]. A variety of methodologies have been broadly applied for extracting features from EEG signals, including wavelet transforms, independent component analysis, and autoregressive modeling [47, 48]. In this study, we use a methodology based on mel-frequency cepstral coefficients (MFCCs), which have been successfully applied to many signal processing applications including speech recognition [44]. In our systems, we use linear frequency cepstral coefficients (LFCCs), since a linear frequency scale provides some slight advantages over the mel scale for EEG signals [40]. A block diagram summarizing the feature extraction process used in this work is presented in Fig. 8.2. Though it is increasingly popular to operate directly from sampled data in a deep learning system and let the system learn the best set of features automatically, for applications in which there is limited annotated data, it is often more beneficial to begin with a specific feature extraction algorithm. Experiments with different types of features [49] or with using sampled data directly [50] have not shown a significant improvement in performance. Harati et al. [40] did an extensive exploration of many of the common parameters associated with feature extraction and optimized the process for six-way event classification. We have found that this approach, which leverages a popular technique in speech recognition, is remarkably robust across many types of machine learning applications. The LFCCs are computed by dividing raw EEG signals into shorter

Fig. 8.2 Base features are calculated using linear frequency cepstral coefficients


frames using a standard overlapping window approach. A high-resolution fast Fourier transform (FFT) is computed next. The spectrum is downsampled with a filter bank composed of an array of overlapping bandpass filters. Finally, the cepstral coefficients are derived by computing a discrete cosine transform of the filter bank's output [44]. In our experiments, we discarded the zeroth-order cepstral coefficient and replaced it with a frequency domain energy term, which is calculated by summing the output of the oversampled filter bank after it is downsampled:

$$E_f = \log \sum_{k=0}^{N-1} |X(k)|^2. \qquad (8.1)$$

We also introduce a new feature, called differential energy, that is based on the long-term differentiation of energy. Differential energy can significantly improve the results of spike detection, which is a critical part of seizure detection, because it amplifies the differences between transient pulse shape patterns and stationary background noise. To compute the differential energy term, we compute the energy of a set of consecutive frames, which we refer to as a window, for each channel of an EEG:

$$E_d = \max_m\left(E_f(m)\right) - \min_m\left(E_f(m)\right). \qquad (8.2)$$

We used a window of 9 frames, each 0.1 s in duration (a total duration of 0.9 s), to calculate the differential energy term. Even though this term is a relatively simple feature, it resulted in a statistically significant improvement in spike detection performance [40]. Our experiments have also shown that using regression-based derivatives of features, which is a popular method in speech recognition [44], is effective in the classification of EEG events. We use the following definition for the derivative:

$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}. \qquad (8.3)$$

Equation (8.3) is applied to the cepstral coefficients, c_t, to compute the first derivatives, which are referred to as delta coefficients. Equation (8.3) is then reapplied to the first derivatives to compute the second derivatives, which are referred to as delta-delta coefficients. Again, we use a window length of 9 frames (0.9 s) for the first derivative and a window length of 3 frames (0.3 s) for the second derivative. The introduction of derivatives helps the system discriminate between steady-state behavior, such as that found in a periodic lateralized epileptiform discharge (PLED) event, and impulsive or nonstationary signals, such as those found in spikes (SPSW) and eye movements (EYEM). Through experiments designed to optimize feature extraction, we found that the best performance can be achieved using a feature vector length of 26. This


vector includes nine absolute features: seven cepstral coefficients, one frequency domain energy term (Eq. 8.1), and one differential energy term (Eq. 8.2). Nine delta terms are added for these nine absolute features. Eight delta-delta terms are added because we exclude the delta-delta term for differential energy [40].
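A compact sketch of this feature pipeline is given below. It follows Eqs. (8.1)–(8.3) and the window sizes stated in the text, but several simplifications are assumed: non-overlapping frames, an illustrative triangular filter bank, and hypothetical parameter values.

```python
# Sketch of the LFCC pipeline: framing, FFT, linear filter bank, DCT,
# frequency-domain energy (8.1), differential energy (8.2), deltas (8.3).
import numpy as np
from scipy.fftpack import dct

def lfcc_features(x, fs=250, frame_s=0.1, n_filt=24, n_ceps=7):
    frame_len = int(frame_s * fs)
    n_frames = len(x) // frame_len
    # Non-overlapping frames for brevity; the text uses overlapping windows.
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2
    # Linear-frequency (not mel) triangular filter bank -- assumed shape.
    edges = np.linspace(0, spec.shape[1] - 1, n_filt + 2).astype(int)
    fbank = np.zeros((n_filt, spec.shape[1]))
    for m in range(1, n_filt + 1):
        lo, c, hi = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, lo:c] = np.linspace(0, 1, c - lo, endpoint=False)
        fbank[m - 1, c:hi] = np.linspace(1, 0, hi - c, endpoint=False)
    log_fb = np.log(spec @ fbank.T + 1e-10)
    # Keep coefficients 1..7; the 0th coefficient is discarded per the text.
    ceps = dct(log_fb, type=2, axis=1, norm="ortho")[:, 1:n_ceps + 1]
    # Eq. (8.1): frequency-domain energy term.
    Ef = np.log(spec.sum(axis=1) + 1e-10)
    # Eq. (8.2): differential energy over a 9-frame (0.9 s) window.
    w = 9
    Ed = np.array([Ef[max(0, t - w // 2):t + w // 2 + 1].ptp()
                   for t in range(n_frames)])
    return np.column_stack([ceps, Ef, Ed])   # nine absolute features

def deltas(feats, N=4):
    # Eq. (8.3): regression derivative over +/- N frames (9-frame window).
    denom = 2 * sum(n * n for n in range(1, N + 1))
    pad = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    return sum(n * (pad[N + n:len(feats) + N + n] -
                    pad[N - n:len(feats) + N - n])
               for n in range(1, N + 1)) / denom

# A 26-dim vector: 9 absolute features, 9 deltas (N=4), and 8 delta-deltas
# (N=1, i.e., a 3-frame window), dropping the differential-energy delta-delta.
```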

8.2.2 Temporal and Spatial Context Modeling

HMMs are among the most powerful statistical modeling tools available today for signals that have both a time and frequency domain component [51]. HMMs have been used quite successfully in sequential decoding tasks like speech recognition [52], cough detection [53], and gesture recognition [54] to model signals that have sequential properties such as temporal or spatial evolution. Automated interpretation of EEGs is a problem like speech recognition since both time domain (e.g., spikes) and frequency domain information (e.g., alpha waves) are used to identify critical events [55]. EEGs have a spatial component as well.

A left-to-right channel-independent GMM-HMM, as illustrated in Fig. 8.3, was used as a baseline system for sequential decoding [26]. HMMs are attractive because training is much faster than for comparable deep learning systems, and HMMs tend to work well when moderate amounts of annotated data are available. We divide each channel of an EEG into 1 s epochs and further subdivide these epochs into a sequence of 0.1 s frames. Each epoch is classified using an HMM trained on the subdivided epoch. These epoch-based decisions are postprocessed by additional statistical models in a process that parallels the language modeling component of a speech recognizer. Standard three-state left-to-right HMMs [51] with 8 Gaussian mixture components per state were used. The covariance matrix for each mixture component was assumed to be diagonal—a common assumption for cepstral-based features. Though we evaluated both channel-dependent and channel-independent models, channel-independent models were ultimately used because channel-dependent models did not provide any improvement in performance.

Supervised training based on the Baum–Welch reestimation algorithm was used to train two models—seizure and background. Models were trained on segments of data containing seizures based on manual annotations. Since seizures comprise a small percentage of the overall data (3% in the training set; 8% in the evaluation set), the amount of non-seizure data was limited to be comparable to the amount of seizure data, and the non-seizure data was selected to include a rich variety of artifacts such as muscle and eye movements. Twenty iterations of Baum–Welch were used, though performance is not very sensitive to this value. Standard Viterbi decoding (no beam search) was used in recognition to estimate the model likelihoods for every epoch of data. The entire file was not decoded as one stream because of the imbalance between the seizure and background classes—decoding was restarted for each epoch.

The output of the epoch-based decisions was postprocessed by a deep learning system. Our baseline system used a Stacked denoising Autoencoder (SdA) [56, 57]


Fig. 8.3 A hybrid architecture based on HMMs

as shown in Fig. 8.3. SdAs are an extension of stacked autoencoders and are a class of deep learning algorithms well suited to learning knowledge representations that are organized hierarchically [58]. They also lend themselves to problems involving training data that is sparse, ambiguous, or incomplete. Since inter-rater agreement is relatively low for seizure detection [16], it made sense to evaluate this type of algorithm as part of a baseline approach. An N-channel EEG was transformed into N independent feature streams. The hypotheses generated by the HMMs were postprocessed using a second stage of processing that examines temporal and spatial context. We apply a third pass of postprocessing that uses a stochastic language model to smooth hypotheses involving sequences of events so that we can suppress spurious outputs. This third stage of postprocessing provides a moderate reduction in false alarms. Training of SdA networks is done in two steps: (1) pre-training in a greedy layer-wise approach [58] and (2) fine-tuning by adding a logistic regression layer on top of the network [59]. The output of the first stage of processing is a vector of two likelihoods for each channel at each epoch. Therefore, if we have 22 channels and 2 classes (seizure and background), we will have a vector of dimension 2 × 22 = 44 for each epoch. Each of these scores is independent of the spatial context (other EEG channels) or temporal context (past or future epochs). To incorporate context, we form a supervector consisting of N epochs in time using a sliding window approach. We find it beneficial to make N large—typically 41. This results in a vector of dimension 41 × 44 = 1804 that needs to be processed each epoch. The input dimensionality is too high considering the amount of manually labeled data available for training


and the computational requirements. To deal with this problem, we used principal components analysis (PCA) [60, 61] to reduce the dimensionality to 20 before applying the SdA postprocessing. The parameters of the SdA model are optimized to minimize the average reconstruction error using a cross-entropy loss function. In the optimization process, a variant of stochastic gradient descent called minibatch stochastic gradient descent (MSGD) [62] is used. MSGD works identically to stochastic gradient descent, except that more than one training example is used to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient and often makes better use of the hierarchical memory organization in modern computers. The SdA network has three hidden layers with corruption levels of 0.3 for each layer. The number of nodes per layer is: first layer (connected to the input) = 800, second layer = 500, third layer (connected to the output) = 300. The parameters for pre-training are: learning rate = 0.5, number of epochs = 150, batch size = 300. The parameters for fine-tuning are: learning rate = 0.1, number of epochs = 300, batch size = 100. The overall result of the second stage is a probability vector of dimension two containing a likelihood that each label could have occurred in the epoch. A soft decision paradigm is used rather than a hard decision paradigm because this output is smoothed in the third stage of processing. A more detailed explanation of the third pass of processing is presented in [63].
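The GMM-HMM portion of this baseline can be sketched with the hmmlearn package, assuming hypothetical feature arrays. The left-to-right topology, three states, eight mixture components, diagonal covariances, and 20 Baum–Welch iterations follow the text; everything else is an assumption:

```python
# Sketch: two-model (seizure vs. background) GMM-HMM epoch classifier.
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_hmm(n_states=3, n_mix=8, n_iter=20):
    m = GMMHMM(n_components=n_states, n_mix=n_mix,
               covariance_type="diag", n_iter=n_iter,
               init_params="mcw")  # we set startprob/transmat ourselves
    m.startprob_ = np.array([1.0, 0.0, 0.0])
    # Left-to-right topology: zero entries stay zero under Baum-Welch.
    m.transmat_ = np.array([[0.5, 0.5, 0.0],
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])
    return m

seiz, bckg = left_to_right_hmm(), left_to_right_hmm()
# Hypothetical training data: (n_frames, 26) feature arrays per class, with
# per-segment lengths, as hmmlearn expects.
# seiz.fit(X_seizure, lengths_seizure)
# bckg.fit(X_background, lengths_background)
# For each 1 s epoch (10 frames), pick the model with the higher likelihood:
# label = "seiz" if seiz.score(epoch) > bckg.score(epoch) else "bckg"
```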

8.3 Improved Spatial Modeling Using CNNs

Convolutional Neural Networks (CNNs) have delivered state-of-the-art performance on highly challenging tasks such as speech [64] and image recognition [28]. These early successes played a vital role in stimulating interest in deep learning approaches. In this section, we explore modeling of spatial information in the multichannel EEG signal to exploit our knowledge that seizures occur on a subset of channels [2]. The identity of these channels also plays an important role in localizing the seizure and identifying the type of seizure [65].

8.3.1 Deep Two-Dimensional Convolutional Neural Networks

CNNs are usually composed of convolutional layers and subsampling layers followed by one or more fully connected layers. Consider an image of dimension W × H × N, where W and H are the width and height of the image in pixels, and N is the number of channels (e.g., in an RGB image, N = 3 since there are three colors). Two-dimensional (2D) CNNs commonly used in sequential decoding problems such as speech or image recognition typically contain a convolutional layer with K filters (or kernels) of size M × N × Q, where M and N are smaller than the dimensions of the data and Q is smaller than the number of


Fig. 8.4 A two-dimensional decoding of EEG signals using a CNN/MLP hybrid architecture

channels. The image can be subsampled by skipping samples as the kernel is convolved over the image. This is known as the stride, which is essentially a decimation factor. CNNs have a large learning capacity that can be controlled by varying their depth and breadth to produce K feature maps of size (W − M + 1) × (H − N + 1) for a stride of 1, and proportionally smaller maps for larger strides. Each map is then subsampled using a technique known as max pooling [66], in which a filter is applied to reduce the dimensionality of the map. An activation function, such as a rectified linear unit (ReLU), is applied to each feature map either before or after the subsampling layer to introduce nonlinear properties to the network. Nonlinear activation functions are necessary for learning complex functional mappings.

In Fig. 8.4, a system that combines a CNN and a multi-layer perceptron (MLP) [28] is shown. Drawing on our image classification analogy, each image is a signal where the width of the image (W) is the window length multiplied by the number of samples per second, the height of the image (H) is the number of EEG channels, and the number of image channels (N) is the length of the feature vector. This architecture includes six convolutional layers, three max pooling layers, and two fully connected layers. A rectified linear unit (ReLU) nonlinearity is applied to the output of every convolutional and fully connected layer [67]. In our optimized version of this architecture, a window duration of 7 s is used. The first convolutional layer filters the input of size 70 × 22 × 26 using 16 kernels of size 3 × 3 with a stride of 1. The input feature vectors have a dimension of 26, and there are 22 EEG channels. The window length is 70 because the features are computed every 0.1 s, or 10 times per second, and the window duration is 7 s. These kernel sizes and strides were experimentally optimized [26]. The second convolutional layer filters its input using 16 kernels of size 3 × 3 with a stride of 1. The first max pooling layer takes as input the output of the second convolutional layer and applies a pooling size of 2 × 2. This process is repeated two times with 32 and 64 kernels, respectively. Next, a fully connected layer with 512


neurons is applied, and the output is fed to a 2-way sigmoid function which produces a two-class decision. This two-class decision is the final label for the given epoch, which is 1 s in duration. Neurologists usually review EEGs using 10 s windows, so we attempt to use a similar amount of context in this system. Pattern recognition systems often subdivide the signal into small segments during which the signal can be considered quasi-stationary. A simple set of preliminary experiments determined that a reasonable tradeoff between computational complexity and performance was to split a 10 s window, which is what neurologists use to view the data, into 1 s epochs [40]. In our experiments, we found that structures composed of two consecutive convolutional layers before a pooling layer perform better than structures with one convolutional layer before a pooling layer. Pooling layers decrease the dimensions of the data and can thereby result in a loss of information; using two convolutional layers before pooling mitigates this loss. We also find that using very small fields throughout the architecture (e.g., 3 × 3) performs better than using larger fields (e.g., 5 × 5 or 7 × 7) in the first convolutional layer.
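A Keras sketch of this architecture is shown below. The layer sizes follow the text; the padding choice and activation placement are assumptions:

```python
# Sketch: the CNN/MLP architecture described above. Input is 70 x 22 x 26
# (frames x channels x features); two 3 x 3 convolutions precede each pooling
# layer, as the text recommends.
from tensorflow.keras import layers, models

def cnn_mlp():
    model = models.Sequential([layers.Input(shape=(70, 22, 26))])
    for n_kernels in (16, 32, 64):
        model.add(layers.Conv2D(n_kernels, (3, 3), strides=1,
                                padding="same", activation="relu"))
        model.add(layers.Conv2D(n_kernels, (3, 3), strides=1,
                                padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(2, activation="sigmoid"))  # 2-way decision
    return model

model = cnn_mlp()
model.compile(optimizer="adam", loss="binary_crossentropy")
```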

8.3.2 Augmenting CNNs with Deep Residual Learning

The depth of a CNN plays an instrumental role in its ability to achieve high performance [27, 28]. As many as thirteen layers are used for challenging problems such as speech and image recognition. However, training deeper CNN structures is more difficult since convergence and generalization become issues. Increasing the depth of CNNs, in our experience, tends to increase the error on the evaluation dataset. As we add more convolutional layers, sensitivity first saturates and then degrades quickly. We also see an increase in the error on the training data when increasing the depth of a CNN, indicating that overfitting is not actually occurring. Such degradations in performance can be addressed by using a deep residual learning framework known as a ResNet [29]. ResNets introduce an "identity shortcut connection" that skips layers. Denoting the desired underlying mapping as H(x), we map the stacked nonlinear layers using F(x) = H(x) − x, where x is the input. The original mapping is recast into F(x) + x. It can be shown that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping [29].

The deep residual learning structure mitigates two important problems: vanishing/exploding gradients and saturation of accuracy when the number of layers is increased. As the gradient is backpropagated to earlier layers, repeated multiplication of numbers less than one can make the gradient vanishingly small. Performance saturates and can rapidly degrade due to numerical precision issues. Our structure addresses these problems by reformulating the layers as learning residual functions with reference to the layer inputs instead of learning unreferenced functions. An architecture for our ResNet approach is illustrated in Fig. 8.5. The shortcut connections between the convolutional layers make training of the model tractable


Fig. 8.5 A deep residual learning framework, ResNet, is shown

by allowing information to propagate effectively through this very deep structure. The network consists of 6 residual blocks with two 2D convolutional layers per block. These convolutional layers are followed by a fully connected layer and a single dense neuron as the last layer, bringing the total number of layers in this modified CNN structure to 14. The 2D convolutional layers all have a filter size of 3 × 3. The first 7 layers of this architecture have 32 filters, while the last layers have 64 filters. We increase the number of filters from 32 to 64 because the initial layers represent generic features, while the deeper layers represent more detailed features. In other words, the richness of the data representation increases because each additional layer forms new kernels using combinations of the features from the previous layer. Except for the first and last layers of the network, we apply a rectified linear unit (ReLU) as an activation function before each convolutional layer [68]. ReLU is the most commonly used activation function in deep learning models; it returns 0 for any negative input, and for any positive value it returns that value (i.e., f(x) = max(0, x)). To overcome the problem of overfitting in deep learning structures with a large number of parameters, we use dropout [69] as our regularization method between the convolutional layers and after ReLU. Dropout is


a regularization technique for addressing overfitting by randomly dropping units, along with their connections, from the deep learning structure during training. We use the Adam optimizer [70], an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments. After parameter tuning, we apply Adam optimization using the following parameters (following the notation in the original paper): α = 0.00005, β1 = 0.9, β2 = 0.999, ε = 10^−8, and decay = 0.0001.

The deep learning systems described thus far have incorporated fully supervised training and discriminative models. Next, we introduce a generative deep learning structure based on convolutional neural networks that leverages unsupervised learning techniques. These are important for biomedical applications where large amounts of fully annotated data are difficult to find.
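One residual block of the kind described above might be sketched as follows. This uses the Keras functional style; the pre-activation ordering, dropout rate, and projection shortcut are assumptions rather than the authors' exact configuration:

```python
# Sketch: a pre-activation residual block (ReLU before each conv, dropout
# between convs, identity shortcut). Filter counts of 32 then 64 follow the
# text; the block computes F(x) + x.
from tensorflow.keras import layers

def residual_block(x, n_filters, dropout=0.2):
    shortcut = x
    y = layers.ReLU()(x)
    y = layers.Conv2D(n_filters, (3, 3), padding="same")(y)
    y = layers.Dropout(dropout)(y)                 # regularization
    y = layers.ReLU()(y)
    y = layers.Conv2D(n_filters, (3, 3), padding="same")(y)
    if shortcut.shape[-1] != n_filters:
        # 1x1 projection where the filter count changes (32 -> 64).
        shortcut = layers.Conv2D(n_filters, (1, 1), padding="same")(shortcut)
    return layers.Add()([shortcut, y])             # F(x) + x

# Usage sketch: x = layers.Input(shape=(70, 22, 26)); y = residual_block(x, 32)
```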

8.3.3 Unsupervised Learning

Machine learning algorithms can generally be split into two categories: generative and discriminative. A generative model learns the joint probability distribution P(X, Y), where X is an observable variable and Y is the target variable. These models learn the statistical distributions of the input data rather than simply classifying the data as one of C output classes. Hence the name, generative, since these methods learn to replicate the underlying statistics of the data. GMMs trained using a greedy clustering algorithm and HMMs trained using the Expectation Maximization (EM) algorithm [71] are well-known examples of generative models. A discriminative model, on the other hand, learns the conditional probability of the target Y given an observation X, which we denote P(Y|X) [72]. Support vector machines [73] and Maximum Mutual Information Estimation (MMIE) [74] are two well-known discriminative models.

Generative adversarial networks (GANs) [31] have emerged as a powerful technique for learning generative models for high-dimensional unstructured data. GANs use a game theory approach to find the Nash equilibrium between a generator and a discriminator network [75]. A basic GAN structure consists of two neural networks: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. These two networks are trained simultaneously via an adversarial process. In this process, the generative network, G, transforms the input noise vector z to generate synthetic data G(z). The training objective for G is to maximize the probability of D making a mistake about the source of the data. The output of the generator is a synthetic EEG—data that is statistically consistent with an actual EEG but is fabricated entirely by the network. The second network, the discriminator, D, takes as input either the output of G or samples from real-world data. The output of D is a probability distribution over possible input sources. The output of the discriminator in a GAN determines whether the signal is a sample from real-world data or synthetic data from the generator.


Fig. 8.6 An unsupervised learning architecture is shown that uses DCGANs

The generative model, G, and the discriminative model, D, compete in a two-player minimax game with a value function, V(G, D), such that D is trained to maximize the probability of assigning the correct label to both the synthetic and real data, while G is trained to fool the discriminator by minimizing log(1 − D(G(z))) [31]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \qquad (8.4)$$

During the training process, our goal is to find a Nash equilibrium of a nonconvex two-player game that minimizes both the generator's and discriminator's cost functions [75]. A deep convolutional generative adversarial network (DCGAN) is shown in Fig. 8.6. The generative model takes 100 random inputs and maps them to a matrix of size [21, 22, 250], where 21 is the window length (corresponding to a 21 s duration), 22 is the number of EEG channels, and 250 is the number of samples per second. Recall that in our study, we resample all EEGs to a sample frequency of 250 Hz [40]. The generator is composed of transposed CNNs with upsamplers. Transposed convolution, also known as fractionally strided convolution, can be implemented by swapping the forward and backward passes of a regular convolution [31]. We need transposed convolutions in the generator since we want to go in the opposite direction of a normal convolution; in this case, we want to compose the [21, 22, 250] output from 100 random inputs. Using transposed convolutional layers, we can transform feature maps to a higher-dimensional space. Leaky ReLUs [68] are used for the activation function, and dropout layers are used for regularization. Adam is used as the optimizer, and binary cross-entropy [76] is used as the loss function.


In this architecture, the discriminative model accepts vectors from two sources: the synthetic data generator and real data (raw EEGs in this case). It is composed of strided convolutional neural networks [77], which are like regular CNNs but with a stride greater than one. In the discriminator, we replace the usual combination of convolutional layers and max pooling layers with strided convolutional layers. This is based on our observation that using convolutional layers with max pooling makes the training of a DCGAN unstable: with strided convolutional layers, the network learns its own spatial downsampling, whereas convolutional layers with max pooling tend to conflict with striding. Finding the Nash equilibrium, which is a key part of the GAN approach, is a challenging problem that impacts convergence during training. Several recent studies address the instability of GANs and suggest techniques to increase their training stability [77]. We conducted a number of preliminary experiments and determined that the following techniques were appropriate.

In the discriminator:

• pre-training of the discriminator
• one-sided label smoothing
• eliminating fully connected layers on top of convolutional features
• replacing deterministic spatial pooling functions (such as max pooling) with strided convolutions

In the generator:

• using a ReLU activation for all layers except for the output
• normalizing the input to [−1, 1] for the discriminator
• using a tanh() activation in the last layer
• using leaky ReLU activations in the discriminator for all layers except for the output
• freezing the weights of the discriminator during the adversarial training process
• unfreezing the weights during discriminative training
• eliminating batch normalization in all the layers of both the generator and discriminator

The GAN approach is attractive for a number of reasons, including creating an opportunity for data augmentation. Data augmentation is common in many state-of-the-art deep learning systems today [78], allowing the size of the training set to be increased as well as exposing the system to previously unseen patterns during training.
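The following sketch illustrates the two DCGAN components discussed above in Keras. The shapes are simplified placeholders rather than the exact [21, 22, 250] EEG tensor, and the layer counts are assumptions:

```python
# Sketch: a generator built from transposed (fractionally strided)
# convolutions with a tanh output, and a discriminator built from strided
# convolutions with leaky ReLUs (no max pooling), per the lists above.
from tensorflow.keras import layers, models

def generator(latent_dim=100):
    return models.Sequential([
        layers.Input(shape=(latent_dim,)),         # 100 random inputs
        layers.Dense(6 * 6 * 64), layers.Reshape((6, 6, 64)),
        layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same"),
        layers.ReLU(),                             # ReLU everywhere but output
        layers.Conv2DTranspose(1, (3, 3), strides=2, padding="same",
                               activation="tanh"),  # output in [-1, 1]
    ])

def discriminator():
    return models.Sequential([
        layers.Input(shape=(24, 24, 1)),
        # Strided convolutions let the network learn its own downsampling.
        layers.Conv2D(32, (3, 3), strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Conv2D(64, (3, 3), strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ])

d = discriminator()
d.compile(optimizer="adam", loss="binary_crossentropy")
# One-sided label smoothing: train D on real labels of 0.9 instead of 1.0.
```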

8.4 Learning Temporal Dependencies

The duration of events such as seizures can vary dramatically, from a few seconds to minutes. Further, neurologists use significant amounts of temporal context and adaptation in manually interpreting EEGs. They are very familiar with their patients


and often can identify the patient by examining the EEG signal, especially when there are certain types of anomalous behaviors. In fact, they routinely use the first minute or so of an EEG to establish baseline signal conditions [65], or to normalize their expectations, so that they can more accurately determine anomalous behavior. Recurrent neural networks (RNNs), which date back to the late 1980s [79], have been proposed as a way to learn such dependencies. Prior to this, successful systems were often based on approaches such as hidden Markov models, or used heuristics to convert frame-level output into longer-term hypotheses. In this section, we introduce several architectures that model long-term dependencies.

8.4.1 Integration of Incremental Principal Component Analysis with LSTMs

In the HMM/SdA structure proposed in Sect. 8.2.2, PCA was used prior to the SdA for dimensionality reduction. Unlike HMM/SdA, applying LSTM networks directly to features requires approaches that are more memory efficient than PCA, or the memory requirements of the network can easily exceed the available computational resources (e.g., low-cost graphics processing units such as the Nvidia 1080ti have a limited amount of memory—typically 8 GB). Incremental principal components analysis (IPCA) is an effective technique for dimensionality reduction [61, 80]. This algorithm is often more memory efficient than PCA. IPCA has constant memory complexity proportional to the batch size, and it enables the use of large datasets without a need to load the entire file or dataset into memory. IPCA builds a low-rank approximation for the input data using an amount of memory that is independent of the number of input data samples. It is still dependent on the dimensionality of the input data features but allows more direct control of memory usage by changing the batch size.

In IPCA, the first k dominant principal components, y_1(n), y_2(n), ..., y_k(n), are computed directly from the input, x(n), as follows. For n = 1, 2, ..., do the following:

1. x_1(n) = x(n).
2. For i = 1, 2, ..., min(k, n), do:
   (a) if i = n, initialize the ith principal component as y_i(n) = x_i(n);
   (b) otherwise compute:

$$y_i(n) = \frac{n-1-p}{n}\, y_i(n-1) + \frac{1+p}{n}\, x_i(n)\, x_i^T(n)\, \frac{y_i(n-1)}{\|y_i(n-1)\|}, \qquad (8.5)$$

$$x_{i+1}(n) = x_i(n) - x_i^T(n)\, \frac{y_i(n)}{\|y_i(n)\|}\, \frac{y_i(n)}{\|y_i(n)\|}, \qquad (8.6)$$


Fig. 8.7 An architecture that integrates IPCA and LSTM

where the positive parameter p is called the amnesic parameter and typically ranges from 2 to 4. The eigenvectors and eigenvalues are then given by:

e_i = \frac{y_i(n)}{\|y_i(n)\|} \quad \text{and} \quad \lambda_i = \|y_i(n)\|    (8.7)

In Fig. 8.7, we present an architecture that integrates IPCA and LSTM [26]. In this system, samples are converted to features, and the features are delivered to an IPCA layer that performs spatial context analysis and dimensionality reduction. The output of the IPCA layer is delivered to a one-layer LSTM for the seizure classification task. The input to the IPCA layer is a vector whose dimension is the product of the number of channels, the number of features per frame, and the number of frames of context. Preliminary experiments have shown that 7 s of temporal context performs well. The corresponding dimension of the vector input to IPCA is 22 channels × 26 features × 7 s × 10 frames/s, or a total of 4004 elements. A batch size of 50 is used, and the dimension of the IPCA output is 25 elements per frame at 10 frames/s. In order to learn long-term dependencies, one LSTM with a hidden layer size of 128 and a batch size of 128 is used along with Adam optimization and a cross-entropy loss function.
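As a rough sketch of how such a layer can be realized, scikit-learn's IncrementalPCA implements batch-wise updates of this kind. The dimensions below follow the text (4004-dimensional inputs, 25 output components, a batch size of 50), while the random input is a stand-in for real feature vectors.

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_features, n_components, batch_size = 4004, 25, 50
ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size)

# Fit batch by batch so memory use is bounded by the batch size,
# not by the size of the corpus.
for _ in range(100):
    batch = np.random.randn(batch_size, n_features)  # placeholder feature vectors
    ipca.partial_fit(batch)

reduced = ipca.transform(np.random.randn(batch_size, n_features))
print(reduced.shape)  # (50, 25): reduced frames that feed the LSTM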

8.4.2 End-to-End Sequence Labeling Using Deep Architectures

In machine learning, sequence labeling is defined as assigning a categorical label to each member of a sequence of observed values. In automatic seizure detection, we assign one of two labels: seizure or non-seizure. This decision is made every epoch, which is typically a 1 s interval. The proposed structures are trained in an end-to-end fashion, requiring no pre-training and no preprocessing beyond the feature extraction process that was explained in Sect. 8.2.1. For example, for an architecture composed of a combination of CNN and LSTM, we do not train the CNN independently from the LSTM; we train both jointly. This is challenging because there are typically convergence issues when attempting this.


Fig. 8.8 A deep recurrent convolutional architecture

In Fig. 8.8, we integrate 2D CNNs, 1D CNNs, and LSTM networks, which we refer to as CNN/LSTM, to better exploit long-term dependencies [26]. Note that the way we handle data in CNN/LSTM is different from the CNN/MLP system presented in Fig. 8.4. The input EEG feature vector sequence can be thought of as being composed of frames distributed in time, where each frame is an image of width (W) equal to the length of a feature vector. The height (H) equals the number of EEG channels, and the number of image channels (N) equals one. The input to the network consists of T frames, where T is equal to the window length multiplied by the number of frames per second. In our optimized system, where features are available 10 times per second, a window duration of 21 s is used. The first 2D convolutional layer filters 210 frames (T = 21 × 10) of EEGs distributed in time with a size of 26 × 22 × 1 (W = 26, H = 22, N = 1) using 16 kernels of size 3 × 3 with a stride of 1. The first 2D max pooling layer takes as input the 210 frames distributed in time with a size of 26 × 22 × 16 and applies a pooling size of 2 × 2. This process is repeated two times with two 2D convolutional layers with 32 and 64 kernels of size 3 × 3, respectively, and two 2D max pooling layers with a pooling size of 2 × 2. The output of the third max pooling layer is flattened to 210 frames with a size of 384 × 1. Then a 1D convolutional layer filters the output of the flattening layer using 16 kernels of size 3, which decreases the dimensionality in space to 210 × 16. Next, we apply a 1D max pooling layer with a size of 8 to decrease the dimensionality to 26 × 16. This is the input to a deep bidirectional LSTM network where the dimensionality of the output space is 128 and 256. The output of the last bidirectional LSTM layer is fed to a 2-way sigmoid function which produces the final classification of an epoch. To overcome the problem of overfitting and force the system to learn more robust features, dropout and Gaussian noise layers are used between layers [69]. To increase nonlinearity, Exponential Linear Units (ELU) are used [81].


Adam is used in the optimization process along with a mean squared error loss function.

More recently, Chung et al. [33] proposed another type of recurrent neural network known as a gated recurrent unit (GRU). A GRU architecture is similar to an LSTM but without a separate memory cell. Unlike an LSTM, a GRU does not include output activation functions and peephole connections. It also integrates the input and forget gates into an update gate that balances the previous activation against the candidate activation. The reset gate allows it to forget the previous state. It has been shown that the performance of a GRU is on par with an LSTM, but a GRU can be trained faster [26]. The architecture is similar to that of Fig. 8.8, but we simply replace the LSTM with a GRU, such that the output of the 1D max pooling layer is the input to a GRU where the dimensionality of the output space is 128 and 256. The output of the last GRU is fed to a 2-way sigmoid function which produces the final classification of an epoch. These two approaches, LSTM and GRU, are evaluated as part of a hybrid architecture that integrates CNNs with RNNs [82].
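A sketch of the CNN/LSTM architecture of Fig. 8.8 in Keras follows. The dimensions are taken from the description above (210 frames of 26 × 22 × 1 features, three 2D convolution/pooling stages, a 1D convolution and pooling stage, and a two-layer bidirectional LSTM); the dropout and Gaussian noise layers mentioned in the text are omitted for brevity, and any unstated hyperparameters are placeholder assumptions.

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(210, 26, 22, 1))  # T x W x H x N

# Three 2D convolution / max pooling stages applied frame by frame.
x = layers.TimeDistributed(layers.Conv2D(16, 3, padding="same", activation="elu"))(inputs)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="elu"))(x)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.Conv2D(64, 3, padding="same", activation="elu"))(x)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)                # 210 frames x 384 features

# 1D convolution and pooling across time.
x = layers.Conv1D(16, 3, padding="same", activation="elu")(x)  # 210 x 16
x = layers.MaxPooling1D(8)(x)                                  # 26 x 16

# Deep bidirectional LSTM followed by a 2-way sigmoid output.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(256))(x)
outputs = layers.Dense(2, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")  # MSE loss per the text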

8.4.3 Temporal Event Modeling Using LSTMs

A final architecture we wish to consider is a relatively straightforward variation of an LSTM network. LSTMs are a special type of recurrent neural network that contains forget and output gates to control the information flow during its recurrent passes. LSTM networks have proven to outperform conventional RNNs, HMMs, and other sequence learning methods in numerous applications such as speech recognition and handwriting recognition [83, 84].

Our first implementation of LSTM was a hybrid network of both HMM and LSTM networks. A block diagram of the HMM/LSTM system is shown in Fig. 8.9. Similar to the HMM/SdA model discussed before, the input to the second layer of the system, which is the first layer of LSTMs, is a vector of dimension 2 × 22 × window length. We use PCA to reduce the dimensionality of the input vector to 20 and pass it to the LSTM model. A window size of 41 s (41 epochs at 1 s per epoch) is used for a 32-node single hidden layer LSTM network. The final layer uses a dense neuron with a sigmoid activation function. The parameters of the models are optimized to minimize the error using a cross-entropy loss function and Adam [70].

Next, we use a 3-layer LSTM network model. Identification of a seizure event is based on the observation of a specific type of epileptiform activity called “spike and wave discharges” [85]. The evolution of these activities across time helps identify a seizure event. These events can be observed on individual channels. Once observed, the seizures can be confirmed based on their focality, signal energy, and polarity across spatially close channels. The architecture is shown in Fig. 8.10. In the preprocessing step, we extract a 26-dimensional feature vector for an 11-frame context centered around the current frame. The output dimensionality for each frame is 10 × 26 (left) + 26 (center) + 10 × 26 (right) = 546. Static LSTM cells are used with a fixed batch size of 64 and a window size of 7 s.


Fig. 8.9 A hybrid architecture that integrates HMM and LSTM

Fig. 8.10 A channel-based long short-term memory (LSTM) architecture

The data is randomly split into subsets where 80% is used for training and 20% is used for cross-validation during optimization. The features are normalized and scaled down to a range of [0, 1] on a file basis, which helps the gradient descent algorithm (and its variants) converge much faster [86]. Shuffling is performed on batches to avoid training biases. The network includes 3 LSTM layers with hidden layer sizes of (256, 64, 16) followed by a 2-cell dense layer. The activation function used for all LSTM layers is a hyperbolic tangent function, tanh(), except for the final layer, which uses a softmax function to compress the range of output values to [0, 1] so they resemble posterior


probabilities. A cross-entropy function is used for calculating the loss. Stochastic gradient descent with Nesterov momentum is used for optimization. Nesterov momentum attempts to increase the speed of training by introducing a momentum term based on the accumulated gradients of previous steps and a correction term in the direction of the current gradient [87]. This tends to reduce the amount of overshoot during optimization. The optimization is performed on the training data at a very high learning rate of 1.0 for the first five epochs. Cross-validation is performed after each epoch. After five epochs, if the cross-validation loss stagnates for three consecutive epochs (referred to as “patience = 3”), the learning rate is halved after each subsequent epoch until it anneals to zero (a sketch of this schedule using standard callbacks appears at the end of this section). If the model fails to show consistent performance on the cross-validation set, it reverts to the previous epoch’s weights and restarts training until convergence. This method helps models avoid overfitting on the training data as long as the training and cross-validation sets are equally diverse.

The outputs of the models are fed to a postprocessor, which is described in more detail in Sect. 8.5. This postprocessor is designed based on domain knowledge and observed system behavior to remove spurious and misleading detections, and it is implemented to incorporate spatial context. The postprocessor sets a threshold for hypothesis confidence, the minimum number of channels for target event detection, and a duration constraint which must be satisfied for detection. For example, if multiple channels consistently detect spike and wave discharges in the same 9 s interval, this event is permitted as a valid output. Outputs from a smaller number of channels or over a smaller duration of time are suppressed.

We have now presented a considerable variety of deep learning architectures. It is difficult to predict which architecture performs best on a given task without extensive experimentation. Hence, in the following section, we review a wide-ranging study of how these architectures perform on the TUSZ seizure detection task.
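The annealing schedule described above can be approximated with standard Keras callbacks: ReduceLROnPlateau halves the learning rate when the cross-validation loss stagnates for three consecutive epochs, and checkpointing stands in for the weight-reverting behavior. The momentum value below is an assumption; the text does not specify it.

from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=1.0, momentum=0.9, nesterov=True)

callbacks = [
    # Halve the learning rate when val_loss stagnates ("patience = 3").
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                      patience=3, min_lr=0.0),
    # Keep the best weights so training can fall back to them.
    keras.callbacks.ModelCheckpoint("best_weights.h5", monitor="val_loss",
                                    save_best_only=True, save_weights_only=True),
]

# model.compile(optimizer=sgd, loss="categorical_crossentropy")
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=64, epochs=100, callbacks=callbacks)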

8.5 Experimentation

Machine learning is, at its core, an experimental science when addressing real-world problems of scale. Real-world data is complex and poses many challenges that require a wide variety of technologies to solve and that can mask the benefits of any one specific algorithm. Therefore, it is important that a rigorous evaluation paradigm be used to guide architecture decisions. In this chapter, we focus on the TUSZ Corpus because it is a very comprehensive dataset that offers a very challenging task.

The evaluation of machine learning algorithms in biomedical fields for applications involving sequential data lacks standardization. Common quantitative scalar evaluation metrics such as sensitivity and specificity can often be misleading depending on the requirements of the application. Evaluation metrics must ultimately reflect the needs of users yet be sufficiently sensitive to guide algorithm development.


Feedback from critical care clinicians who use automated event detection software in clinical applications has been overwhelmingly emphatic that a low false alarm rate, typically measured in units of the number of errors per 24 h, is the single most important criterion for user acceptance. Though a single metric is often not as insightful as examining performance over a range of operating conditions, there is a need for a single scalar figure of merit. Shah et al. [88] discuss the deficiencies of existing metrics for a seizure detection task and propose several new metrics that offer a more balanced view of performance. In this section, we compare the architectures previously described using one of these measures, the Any-Overlap Method (OVLP). We also provide detection error tradeoff (DET) curves [89].

8.5.1 Evaluation Metrics

Researchers in biomedical fields typically report performance in terms of sensitivity and specificity [90]. In a two-class classification problem such as seizure detection, we can define four types of errors:

• True Positives (TP): the number of “positives” detected correctly
• True Negatives (TN): the number of “negatives” detected correctly
• False Positives (FP): the number of “negatives” detected as “positives”
• False Negatives (FN): the number of “positives” detected as “negatives”

Sensitivity (TP/(TP+FN)) and specificity (TN/(TN+FP)) are derived from these quantities. There are a large number of auxiliary measures that can be calculated from these four basic quantities and that are used extensively in the literature. For example, in information retrieval applications, systems are often evaluated using accuracy ((TP+TN)/(TP+FN+TN+FP)), precision (TP/(TP+FP)), recall (another term for sensitivity), and F1 score ((2•Precision•Recall)/(Precision+Recall)). However, none of these measures address the time scale on which the scoring must occur or how one scores situations where the mapping of hypothesized events to reference events is ambiguous. These kinds of decisions are critical in the interpretation of scoring metrics such as sensitivity for many sequential decoding tasks such as automatic seizure detection [89, 91, 92].

In some applications, it is preferable to score every unit of time. With multichannel signals, such as EEGs, scoring each channel for each unit of time might be appropriate since significant events such as seizures occur on a subset of the channels present in the signal. However, it is more common in the literature to simply score a summary decision per unit of time, such as every 1 s, that is based on an aggregation of the per-channel inputs (e.g., a majority vote). We refer to this type of scoring as epoch-based [93, 94]. An alternative that is more common in speech and image recognition applications is term-based [50, 95], in which we consider the start and stop time of the event, and each event identified in the reference annotation is counted once. There are fundamental differences between the two conventions.
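Before turning to the question of timing, the scalar measures above can be computed directly from the four counts; the following is a generic sketch, not the scoring package referenced later in this section.

def scalar_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)          # a.k.a. recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                accuracy=accuracy, precision=precision, f1=f1)

print(scalar_metrics(tp=40, tn=900, fp=25, fn=35))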


Fig. 8.11 OVLP scoring is very permissive about the degree of overlap between the reference and hypothesis. For example, in Example 1, the TP score is 1 with no false alarms. In Example 2, the system detects 2 out of 3 seizure events, so the TP and FN scores are 2 and 1, respectively

For example, one event containing many epochs will count more heavily in an epoch-based scoring scenario. Epoch-based scoring generally weights the duration of an event more heavily since each unit of time is assessed independently. Term-based metrics score on an event basis and do not count individual frames. A typical approach for calculating errors in term-based scoring is the Any-Overlap Method (OVLP) [92], illustrated in Fig. 8.11. TPs are counted when the hypothesis overlaps with the reference annotation. FPs correspond to situations in which a hypothesis does not overlap with the reference.

OVLP is a permissive metric that tends to produce much higher sensitivities. If an event is detected in close proximity to a reference event, the reference event is considered correctly detected. If a long event in the reference annotation is detected as multiple shorter events in the hypothesis, the reference event is also considered correctly detected. Multiple events in the hypothesis annotation corresponding to the same event in the reference annotation are not typically counted as false alarms. Since the FA rate is a very important measure of performance in critical care applications, this is another cause for concern. However, since the OVLP metric is the most popular choice in the neuroengineering community, we present our results in terms of OVLP.

Note that results are still reported in terms of sensitivity, specificity, and false alarm rate. But, as previously mentioned, how one measures the errors that contribute to these measures is open to interpretation. Shah et al. [92] studied this problem extensively and showed that many of these measures correlate and are not significantly different in terms of the rank ordering and statistical significance of scoring differences for the TUSZ task. We provide a software package that allows researchers to replicate our metrics and that reports many of the most popular metrics [91].
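A simplified sketch of OVLP scoring illustrates these conventions: a reference event is a true positive if any hypothesis overlaps it, and a hypothesis that overlaps no reference event is a false positive. Events are (start, stop) pairs in seconds; the tie-breaking details of the full scoring package [91] are omitted.

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def ovlp_score(ref_events, hyp_events):
    tp = sum(any(overlaps(r, h) for h in hyp_events) for r in ref_events)
    fn = len(ref_events) - tp
    fp = sum(not any(overlaps(h, r) for r in ref_events) for h in hyp_events)
    return tp, fn, fp

ref = [(10.0, 35.0), (60.0, 80.0), (120.0, 150.0)]
hyp = [(12.0, 20.0), (25.0, 33.0), (62.0, 75.0), (200.0, 210.0)]
print(ovlp_score(ref, hyp))  # (2, 1, 1): two hits, one missed seizure, one FP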


8.5.2 Postprocessing with Heuristics Improves Performance

Because epoch-based scoring produces a hypothesis every epoch (1 s in this case), and these are scored against annotations that are essentially asynchronous, there is an opportunity to improve performance by examining sequences of epochs and collapsing multiple events into a single hypothesis. We have experimented with heuristic approaches to this as well as deep learning-based approaches and have found no significant advantage for the deep learning approaches. As is well known in machine learning research, a good heuristic can be very powerful. We apply a series of heuristics, summarized in Fig. 8.12, to improve performance. These heuristics are very important in reducing the false alarm rate to an acceptable level.

The first heuristic we apply is a popular method that focuses on a model’s confidence in its output. Probabilistic filters [96] are implemented to only consider target events which are detected above a specified probability threshold. This method tends to suppress spurious long duration events (e.g., slowing) and extremely short duration events (e.g., muscle artifacts). This decision function is applied to the seizure (target) labels only. We compare each seizure label’s posterior with the threshold value. If the posterior is above the threshold, the label is kept as is; otherwise, it is changed to the non-seizure label, which we denote “background.”

Our second heuristic was developed after performing extensive error analysis. The most common types of errors we observed were false detections of background events as seizures (FPs), which tend to occur in bursts. Usually these erroneous bursts occur for a very short duration of time (e.g., 3 to 7 s). To suppress these, any seizure event whose duration is below a specified threshold is automatically considered a non-seizure, or background, event.

Finally, we also implement a smoothing method that collapses sequences of two seizure events separated by a background event into one long seizure event.

Fig. 8.12 An illustration of the postprocessing algorithms used to reduce the FA rate


This smoothing is typically used to eliminate spurious background events. If seizures are observed in clusters separated by small intervals of time classified as background events, these isolated background events are most likely part of one longer seizure event. In this method, we apply a nonlinear function that computes a pad time to extend the duration of an isolated event. If the modified endpoint of that event overlaps with another seizure event, the intervening background event is eliminated. We used a simple regression approach to derive a quadratic function that produces a padding factor as a function of the event duration d:

w(d) = -0.0083d^2 + 0.45d - 0.66.

This method tends to reduce isolated background events when they are surrounded by seizure events, thereby increasing the specificity. The combination of these three postprocessing methods tends to decrease sensitivity slightly and reduce false alarms by two orders of magnitude, so their impact is significant. The ordering in which these methods are applied is important; we apply them in the order described above to achieve optimal performance.
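A sketch of the three heuristics, applied in the order described, is shown below. Events are (start, stop, label, posterior) tuples, and the threshold values are illustrative placeholders rather than the tuned operating points.

def postprocess(events, p_min=0.8, min_dur=5.0):
    # 1) Probabilistic filter: relabel low-confidence seizures as background.
    filtered = []
    for start, stop, label, p in events:
        if label == "seiz" and p < p_min:
            label = "bckg"
        filtered.append((start, stop, label))

    # 2) Duration filter: very short seizure events become background.
    filtered = [(s, e, "bckg" if lbl == "seiz" and e - s < min_dur else lbl)
                for s, e, lbl in filtered]

    # 3) Smoothing: pad each seizure by the quadratic factor w(d) and merge
    #    with the next seizure if the padded endpoint reaches it.
    merged = []
    for s, e, lbl in filtered:
        if lbl != "seiz":
            continue
        if merged:
            d = merged[-1][1] - merged[-1][0]
            pad = max(0.0, -0.0083 * d * d + 0.45 * d - 0.66)
            if merged[-1][1] + pad >= s:
                merged[-1][1] = e       # absorb the intervening background
                continue
        merged.append([s, e])
    return merged   # surviving seizure intervals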

8.5.3 A Comprehensive Evaluation of Hybrid Approaches

A series of experiments was conducted to optimize the feature extraction process. These are described in detail in [40]. Subsequent attempts to replace feature extraction with deep learning-based approaches have resulted in a slight degradation in performance. A reasonable tradeoff between computational complexity and performance was to split the 10 s window, popular with neurologists who manually interpret these waveforms, into 1 s epochs, and to further subdivide these into 0.1 s frames. Hence, features were computed every 0.1 s using a 0.2 s overlapping analysis window. The output of the feature extraction system is 22 channels of data, where in each channel a feature vector of dimension 26 corresponds to every 0.1 s. This type of analysis is very compatible with the way HMM systems operate, so it was a reasonable starting point for this work.

We next evaluated several architectures using these features as inputs on TUSZ. These results are presented in Table 8.2, and the related DET curve is illustrated in Fig. 8.13. An expanded version of the DET curve, comparing the performance of these architectures in the region where the false positive rate, also known as the false alarm (FA) rate, is low, is presented in Fig. 8.14. Since our focus is achieving a low false alarm rate, behavior in this region of the DET curve is very important. As previously mentioned, these systems were evaluated using the OVLP method, though results are similar for a variety of these metrics.

It is important to note that the accuracy reported here is much lower than what is often published in the literature on other seizure detection tasks. This is due to a combination of factors, including (1) the neuroscience community has favored a more permissive method of scoring that tends to produce much higher sensitivities and lower false alarm rates; and (2) TUSZ is a much more difficult task than any corpus previously released as open source.


Table 8.2 Performance of the proposed architectures on TUSZ

System              Sensitivity (%)  Specificity (%)  FA/24 h
HMM                 30.32            80.07            244
HMM/SdA             35.35            73.35            77
HMM/LSTM            30.05            80.53            60
IPCA/LSTM           32.97            77.57            73
CNN/MLP             39.09            76.84            77
CNN/GRU             30.83            91.49            21
ResNet              30.50            94.24            13
CNN/LSTM            30.83            97.10            6
Channel-based LSTM  39.46            95.20            11

Fig. 8.13 A DET curve comparison of the proposed architectures on TUSZ

Fig. 8.14 An expanded comparison of performance in a region where the FP rate is low


The evaluation set was designed to be representative of common clinical issues and includes many challenging examples of seizures. We have achieved much higher performance on other publicly available tasks, such as the Children’s Hospital Boston MIT (CHB-MIT) Corpus, and demonstrated that the performance of these techniques exceeds that of published or commercially available technology. TUSZ is simply a much more difficult task and one that better represents the clinical challenges this technology faces.

Also, note that the HMM baseline system, which is shown in the first row of Table 8.2, and channel-based LSTM, which is shown in the last row of Table 8.2, operate on each channel independently. The other methods consider all channels simultaneously by using a supervector that is a concatenation of the feature vectors for all channels. The baseline HMM system only classifies epochs (1 s in duration) using data from within that epoch. It does not look across channels or across multiple epochs when performing epoch-level classification.

From Table 8.2 we can see that adding a deep learning structure for temporal and spatial analysis of EEGs can decrease the false alarm rate dramatically. Further, by comparing the results of HMM/SdA with HMM/LSTM, we find that a simple one-layer LSTM performs better than 3 layers of SdA due to the LSTM’s ability to explicitly model long-term dependencies. Note that in this case the complexity and training time of these two systems are comparable.

The best overall systems shown in Table 8.2 are CNN/LSTM and channel-based LSTM. CNN/LSTM is a doubly deep recurrent convolutional structure that models both spatial relationships (e.g., cross-channel dependencies) and temporal dynamics (e.g., spikes). For example, CNN/LSTM does a much better job rejecting artifacts that are easily confused with spikes because these appear on only a few channels, and hence can be filtered based on correlations between channels. The depth of the convolutional network is important since the top convolutional layers tend to learn generic features while the deeper layers learn dataset-specific features. Performance degrades if a single convolutional layer is removed. For example, removing any of the middle convolutional layers results in a loss of about 4% in sensitivity. However, it is important to note that the computational complexity of the channel-based systems is significantly higher than that of the systems which aggregate channel-based features into a single vector, since the channel-based systems decode each channel independently.

As shown in Figs. 8.13 and 8.14, we find that CNN/LSTM has a significantly lower FA rate than CNN/GRU. We speculate that this is due to the fact that while a GRU unit controls the flow of information like the LSTM unit, it does not have a memory unit. LSTMs can remember longer sequences better than GRUs. Since seizure detection requires modeling long distance relationships, we believe that this explains the difference in performance between the two systems. The time required to train CNN/GRU was 10% less than that for CNN/LSTM; the training times are comparable since most of the cycles are spent training the convolutional layers. We also observe that the ResNet structure improves the performance of CNN/MLP, but the best overall system is still CNN/LSTM.


Table 8.3 A comparison of several CNN and LSTM architectures on DUSZ

System              Data  Sensitivity (%)  Specificity (%)  FA/24 h
CNN/LSTM            TUSZ  30.83            97.10            6
CNN/LSTM            DUSZ  33.71            70.72            40
Channel-based LSTM  TUSZ  39.46            95.20            11
Channel-based LSTM  DUSZ  42.32            86.93            14

Fig. 8.15 Performance of CNN/LSTM and channel-based LSTM on TUSZ and DUSZ

We have also conducted an open-set evaluation of the best systems, CNN/LSTM and channel-based LSTM, on a completely different corpus—DUSZ. These results are shown in Table 8.3, and a DET curve is shown in Fig. 8.15. This is an important evaluation because none of these systems were exposed to DUSZ data during training or development testing; parameter optimizations were performed only on TUSZ data. As can be seen, at high FA rates, performance between the two systems is comparable. At low FA rates, however, CNN/LSTM performance on DUSZ is lower than on TUSZ. For channel-based LSTM, in the region of low FA rates, performance on TUSZ and DUSZ is very similar. This is reflected by the two middle curves in Fig. 8.15. The differences in performance for channel-based LSTM when the data changes are small. However, for CNN/LSTM, which gives the best overall performance on TUSZ, performance decreases rapidly on DUSZ. Recall that we did not train these systems on DUSZ—this is true open-set testing. Hence, we can conclude from this limited study that channel-based LSTM generalized better than the CNN/LSTM system.


8.5.4 Optimization of Core Components

Throughout these experiments, we observed that the choice of optimization method had a considerable impact on performance. The CNN/LSTM system was evaluated using a variety of optimization methods, including stochastic gradient descent (SGD) [70], RMSprop [97], Adagrad [98], Adadelta [99], Adam [70], Adamax [70], and Nadam [100]. These results are shown in Table 8.4. The best performance is achieved with Adam, a learning rate of α = 0.0005, a learning rate decay of 0.0001, exponential decay rates of β1 = 0.9 and β2 = 0.999 for the moment estimates, and a fuzz factor of ε = 10^−8. The parameters follow the notation described in [70]. Table 8.4 also illustrates that Nadam delivers comparable performance to Adam. Adam combines the advantages of Adagrad, which works well with sparse gradients, and RMSprop, which works well in nonstationary settings.

Table 8.4 Comparison of optimization algorithms

System    Sensitivity (%)  Specificity (%)  FA/24 h
SGD       23.12            72.24            44
RMSprop   25.17            83.39            23
Adagrad   26.42            80.42            31
Adadelta  26.11            79.14            33
Adam      30.83            97.10            6
Adamax    29.25            89.64            18
Nadam     30.27            92.17            14
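The best Adam configuration above might be written as follows in tf.keras; the InverseTimeDecay schedule stands in for the per-step learning rate decay implemented by the legacy `decay` argument.

from tensorflow import keras

# Time-based decay of the step size (decay of 0.0001 per step).
schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=5e-4, decay_steps=1, decay_rate=1e-4)

adam = keras.optimizers.Adam(learning_rate=schedule,
                             beta_1=0.9, beta_2=0.999, epsilon=1e-8)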

Similarly, we evaluated our CNN/LSTM using different activation functions, as shown in Table 8.5. ELU delivers a small but measurable increase in sensitivity and, more importantly, a reduction in false alarms. The ELU activation function is defined as:

f(x) = \begin{cases} x & x > 0 \\ \alpha (e^x - 1) & x \le 0 \end{cases}    (8.8)

where α is the slope of the negative section. The derivative of the ELU activation function is:

f'(x) = \begin{cases} 1 & x > 0 \\ \alpha e^x & x \le 0 \end{cases}    (8.9)

Table 8.5 A comparison of activation functions

System    Sensitivity (%)  Specificity (%)  FA/24 h
Linear    26.46            88.48            25
Tanh      26.53            89.17            21
Sigmoid   28.63            90.08            19
Softsign  30.05            90.51            18
ReLU      30.51            94.74            11
ELU       30.83            97.10            6


The ReLU activation function is defined as:

f(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}    (8.10)

The corresponding derivative is:

f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}    (8.11)
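As a quick check, Eqs. (8.8)–(8.11) can be written directly in numpy:

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # Eq. (8.8)

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))         # Eq. (8.9)

def relu(x):
    return np.maximum(x, 0.0)                              # Eq. (8.10)

def relu_grad(x):
    return (x > 0).astype(float)                           # Eq. (8.11)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(elu(x))   # smooth saturation for negative inputs
print(relu(x))  # hard zero for negative inputs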

ELU is very similar to ReLU except for negative inputs. ReLUs and ELUs accelerate learning by decreasing the gap between the normal gradient and the unit natural gradient [81]. ELUs push the mean activation towards zero but with a significantly smaller computational footprint. In the region where the input is negative (x < 0), a ReLU’s gradient is zero, so the weights will not get adjusted; neurons that settle into that state stop responding to variations in error or input. This is referred to as the dying ReLU problem. Unlike ReLUs, ELUs have a clear saturation plateau in their negative region, allowing them to learn a more robust and stable representation.

Determining the proper initialization strategy for the parameters in a model is part of the difficulty in training. Hence, we investigated a variety of initialization methods using the CNN/LSTM structure introduced in Fig. 8.8. These results are presented in Table 8.6, and the related DET curve is illustrated in Fig. 8.16. In our experiments, we observed that proper initialization of weights in a convolutional recurrent neural network is critical to convergence. For example, initialization of tensor values to zero or one completely stalled the convergence process. Also, as we can see in Table 8.6, the FA rate of the system in the range of 30% sensitivity can change from 7 to 40 for different initialization methods. This decrease in performance and deceleration of convergence arise because some initializations can result in the deeper layers receiving inputs with small variances, which in turn slows down back propagation and retards the overall convergence process.

Table 8.6 A comparison of initialization methods

System            Sensitivity (%)  Specificity (%)  FA/24 h
Orthogonal        30.8             96.9             7
Lecun uniform     30.3             96.5             8
Glorot uniform    31.0             94.2             13
Glorot normal     29.5             92.4             18
Variance scaling  31.8             92.1             19
Lecun normal      31.8             92.1             19
He normal         31.3             91.1             22
Random uniform    30.2             90.0             25
Truncated normal  31.6             87.8             31
He uniform        29.2             85.1             40
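Orthogonal initialization, the best performer in Table 8.6 and described in more detail below, amounts to taking the QR decomposition of a random normal matrix and using the orthonormal factor as the initial weights; a minimal numpy sketch:

import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))   # sign correction makes the factorization unique
    return gain * q

W = orthogonal_init(64, 64)
print(np.allclose(W.T @ W, np.eye(64)))  # True: columns are orthonormal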


Fig. 8.16 A comparison of different initialization methods for CNN/LSTM

Best performance is achieved using orthogonal initialization [101]. This method is a simple yet effective way of combatting exploding and vanishing gradients. In orthogonal initialization, the weight matrix is chosen as a random orthogonal matrix, i.e., a square matrix W for which W^T W = I. Typically, the orthogonal matrix is obtained from the QR decomposition of a matrix of random numbers drawn from a normal distribution. Orthogonal matrices preserve the norm of a vector, and their eigenvalues have an absolute value of one. This means that no matter how many times we perform repeated matrix multiplication, the resulting matrix does not explode or vanish. Also, in orthogonal matrices, the columns and rows are all orthonormal to one another, which helps the weights learn different input features. For example, if we apply orthogonal initialization to a CNN architecture, in each layer each channel has a weight vector that is orthogonal to the weight vectors of the other channels.

Overfitting is a serious problem in deep neural nets with many parameters. We have explored five popular regularization methods to address this problem. The techniques collectively known as L1, L2, and L1/L2 [102] prevent overfitting by adding a regularization term to the loss function. The L1 regularization technique, also known as Lasso regression, adds the sum of the absolute values of the weights to the loss function:

\text{Cost Function} = \text{Loss Function} + \lambda \sum_{i=1}^{k} |w_i|,    (8.12)

where w is the weight vector and λ is a regularization parameter. The L2 technique, also known as ridge regression, adds the sum of the squares of the weights to the loss function:

\text{Cost Function} = \text{Loss Function} + \lambda \sum_{i=1}^{k} w_i^2.    (8.13)

The L1/L2 technique is a combination of both:

\text{Cost Function} = \text{Loss Function} + \lambda \sum_{i=1}^{k} |w_i| + \lambda \sum_{i=1}^{k} w_i^2.    (8.14)
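In Keras, these penalties are typically attached per layer through the regularizer interface; the sketch below uses placeholder λ values. The dropout and Gaussian noise regularizers discussed next are likewise standard layers.

from tensorflow.keras import layers, regularizers

# L1/L2 penalty on the weights of a single layer (Eq. (8.14));
# regularizers.l1(...) and regularizers.l2(...) give Eqs. (8.12) and (8.13).
dense = layers.Dense(128, activation="elu",
                     kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4))

drop = layers.Dropout(0.5)          # dropout rate is a placeholder
noise = layers.GaussianNoise(0.2)   # zero-centered noise, sigma = 0.2 per the text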

In an alternative approach, we used dropout to prevent units from excessively co-adapting by randomly dropping units and their connections from the neural network during training. We also studied the impact of introducing zero-centered Gaussian noise to the network. In this regularization method, which can be considered a random data augmentation method [103], we add zero-centered Gaussian noise with a standard deviation of 0.2 to all hidden layers in the network as well as the visible or input layer. The results of these experiments are presented in Table 8.7 along with a DET curve in Fig. 8.17. While L1/L2 regularization has the best overall performance, in the region where FA rates are low, the dropout method delivers a lower FA rate.

Table 8.7 A comparison of performance for different regularizations

System    Sensitivity (%)  Specificity (%)  FA/24 h
L1/L2     30.8             97.1             6
Dropout   30.8             96.9             7
Gaussian  30.8             95.8             9
L2        30.2             95.6             10
L1        30.0             43.7             276

Fig. 8.17 A comparison of different regularization methods for CNN/LSTM


Fig. 8.18 Synthetic EEG waveforms generated using DCGAN

The primary error modalities observed were false alarms generated during brief delta range slowing patterns such as intermittent rhythmic delta activity [25]. Our closed-loop experiments demonstrated that all regularization methods presented in Table 8.7, unfortunately, tend to increase the false alarm rate for slowing patterns.

Finally, in Fig. 8.18, an example of an EEG generated by the DCGAN structure of Fig. 8.6 is shown. Note that to generate these EEGs, we use a generator block in DCGAN in which each EEG signal has a 7 s duration. We apply a 25 Hz low-pass filter to the output of DCGAN, since most of the cerebral signals observed in scalp EEGs fall in the range of 1–20 Hz (in standard clinical recordings, activity below or above this range is likely to be an artifact). Unfortunately, in a simple pilot experiment in which we randomly mixed actual EEGs with synthetic EEGs, expert annotators could easily detect the synthetic EEGs, which was a bit discouraging. Seizures in the synthetic EEGs were sharper and more closely resembled a slowing event. Clearly, more work is needed with this architecture. However, our expert annotators also noted that the synthetic EEGs did exhibit focality. An example of focality is that when activity is observed on the CZ-C4 channel, we would expect to observe the inverse of this pattern on the C4-T4 channel. As can be seen in Fig. 8.18, in the last two seconds of the generated EEG, we observe slowing activity on the CZ-C4 channel and the inverse pattern of the same slowing activity on the C4-T4 channel.


Hence, it is possible to generate synthetic multichannel EEG signals with DCGAN that resemble clinical EEGs. However, DCGAN is not yet at the point where the data it generates results in an improvement in the performance of our best systems.
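The 25 Hz low-pass filtering applied to the DCGAN output can be sketched with scipy; the sampling rate and filter order below are assumptions, since neither is specified in the text.

import numpy as np
from scipy.signal import butter, filtfilt

fs = 250.0                                         # assumed EEG sampling rate
b, a = butter(N=4, Wn=25.0 / (fs / 2.0), btype="low")

synthetic_eeg = np.random.randn(22, int(7 * fs))   # 22 channels x 7 s placeholder
filtered = filtfilt(b, a, synthetic_eeg, axis=1)   # zero-phase low-pass per channel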

8.6 Conclusions

EEGs remain one of the main clinical tools that physicians use to understand brain function. New applications of EEGs are emerging, including the diagnosis of head trauma-related injuries, which offer the potential to vastly expand the market for EEGs. A board-certified EEG specialist is required by law to interpret an EEG and produce a diagnosis. Since it takes several years of additional training post-medical school for a physician to qualify as a clinical specialist, the ability to generate data far exceeds the available expertise to interpret these data, creating a critical bottleneck.

Despite rapid advances in deep learning in recent years, automatic interpretation of EEGs is still a very challenging problem. We have introduced a variety of deep learning architectures for automatic classification of EEGs, including a hybrid architecture that integrates CNN and LSTM technology. Two systems are particularly promising: CNN/LSTM and channel-based LSTM. While these architectures deliver better performance than other deep structures, their performance still does not meet the needs of clinicians. Human performance on similar tasks is in the range of 75% sensitivity with a false alarm rate of 1 per 24 h [16]. The false alarm rate is particularly important in critical care applications since it impacts the workload experienced by healthcare providers.

The primary error modalities for our deep learning-based approaches were false alarms generated during brief delta range slowing patterns such as intermittent rhythmic delta activity. A variety of these types of artifacts have been observed during inter-ictal and post-ictal stages. Training models on such events with diverse morphologies is potentially one way to reduce the remaining false alarms. This is one reason we are continuing our efforts to annotate a larger portion of TUSZ.

We are also exploring the potential of supervised GAN frameworks for spatio-temporal modeling of EEGs. Most of the research on GANs is focused on either unsupervised learning or supervised learning using conditional GANs. Given that the annotation process to produce accurate labels is expensive and time-consuming, we are exploring semi-supervised learning in which only a small fraction of the data has labels. GANs can be used to perform semi-supervised classification by using a generator-discriminator pair to learn an unconditional model of the data and then tuning the discriminator using the small amount of labeled data for prediction. We are also continuing to manually label EEG data. We invite you to register at our project web site, www.isip.piconepress.com/projects/tuh_eeg, to be kept aware of the latest developments.


References 1. Ilmoniemi, R., & Sarvas, J. (2019). Brain signals: Physics and mathematics of MEG and EEG. Boston, MA: MIT. 2. Ebersole, J. S., & Pedley, T. A. (2014). Current practice of clinical electroencephalography. Philadelphia, PA: Wolters Kluwer. 3. Yamada, T., & Meng, E. (2017). Practical guide for clinical neurophysiologic testing: EEG. Philadelphia, PA: Lippincott Williams & Wilkins. 4. Ercegovac, M., & Berisavac, I. (2015). Importance of EEG in intensive care unit. Clinical Neurophysiology, 126, e178–e179. https://doi.org/10.1016/j.clinph.2015.04.027. 5. Ney, J. P., van der Goes, D. N., Nuwer, M. R., & Nelson, L. (2016). Continuous and routine EEG in intensive care: utilization and outcomes, United States 2005-2009. Neurology, 81, 2002–2008. https://doi.org/10.1212/01.wnl.0000436948.93399.2a. 6. Boashash, B. (2015). Time-frequency signal analysis and processing: A comprehensive reference. London: Academic. 7. Gotman, J. (1999). Automatic detection of seizures and spikes. Journal of Clinical Neurophysiology, 16, 130–140. 8. Li, P., Wang, X., Li, F., Zhang, R., Ma, T., Peng, Y., et al. (2015). Autoregressive model in the Lp norm space for EEG analysis. Journal of Neuroscience Methods, 240, 170–178. https:/ /doi.org/10.1016/j.jneumeth.2014.11.007. 9. Li, Y., Luo, M.-L., & Li, K. (2016). A multiwavelet-based time-varying model identification approach for time–frequency analysis of EEG signals. Neurocomputing, 193, 106–114. https:/ /doi.org/10.1016/j.neucom.2016.01.062. 10. Rodrıguez-Bermudez, G., & Garcıa-Laencina, P. J. (2015). Analysis of EEG signals using nonlinear dynamics and chaos: A review. Applied Mathematics & Information Science, 9, 2309–2321. https://doi.org/10.12785/amis/090512. 11. Eichler, M., Dahlhaus, R., & Dueck, J. (2017). Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38, 225–242. https://doi.org/10.1111/jtsa.12213. 12. Schad, A., Schindler, K., Schelter, B., Maiwald, T., Brandt, A., Timmer, J., et al. (2008). Application of a multivariate seizure detection and prediction method to non-invasive and intracranial long-term EEG recordings. Clinical Neurophysiology, 119, 197–211. 13. Schindler, K., Wiest, R., Kollar, M., & Donati, F. (2001). Using simulated neuronal cell models for detection of epileptic seizures in foramen ovale and scalp EEG. Clinical Neurophysiology, 112, 1006–1017. https://doi.org/10.1016/S1388-2457(01)00522-3. 14. Deburchgraeve, W., Cherian, P. J., De Vos, M., Swarte, R. M., Blok, J. H., Visser, G. H., et al. (2008). Automated neonatal seizure detection mimicking a human observer reading EEG. Clinical Neurophysiology, 119, 2447–2454. https://doi.org/10.1016/j.clinph.2008.07.281. 15. Baumgartner, C., & Koren, J. P. (2018). Seizure detection using scalp-EEG. Epilepsia, 59, 14–22. https://doi.org/10.1111/epi.14052. 16. Haider, H. A., Esteller, R. D., Hahn, C. B., Westover, M. J., Halford, J. W., Lee, J. M., et al. (2016). Sensitivity of quantitative EEG for seizure identification in the intensive care unit. Neurology, 87, 935–944. https://doi.org/10.1212/WNL.0000000000003034. 17. Varsavsky, A., & Mareels, I. (2006). Patient un-specific detection of epileptic seizures through changes in variance. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 3747–3750). New York: IEEE. 18. Bridi, A. C., Louro, T. Q., & Da Silva, R. C. L. (2014). 
Clinical alarms in intensive care: implications of alarm fatigue for the safety of patients. Revista Latino-Americana de Enfermagem, 22, 1034. https://doi.org/10.1590/0104-1169.3488.2513. 19. Ahmedt-Aristizabal, D., Fookes, C., Denman, S., Nguyen, K., Sridharan, S., & Dionisio, S. (2019). Aberrant epileptic seizure identification: A computer vision perspective. Seizure European Journal of Epilepsy, 65, 65–71. https://doi.org/10.1016/j.seizure.2018.12.017.


20. Ramgopal, S. (2014). Seizure detection, seizure prediction, and closed-loop warning systems in epilepsy. Epilepsy & Behavior, 37, 291–307. https://doi.org/10.1016/j.yebeh.2014.06.023. 21. Alotaiby, T., Alshebeili, S., Alshawi, T., Ahmad, I., & Abd El-Samie, F. (2014). EEG seizure detection and prediction algorithms: a survey. EURASIP Journal on Advances in Signal Processing, 2014, 1–21. https://doi.org/10.1186/1687-6180-2014-183. 22. Obeid, I., & Picone, J. (2016). The Temple University Hospital EEG data corpus. Frontiers in Neuroscience. Section Neural Technology, 10, 1–8. https://doi.org/10.3389/fnins.2016.00196. 23. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436. https://doi.org/ 10.1038/nature14539. 24. Shah, V., Golmohammadi, M., Ziyabari, S., von Weltin, E., Obeid, I., & Picone, J. (2017). Optimizing channel selection for seizure detection. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE signal processing in medicine and biology symposium (pp. 1–5). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2017.8257019. 25. von Weltin, E., Ahsan, T., Shah, V., Jamshed, D., Golmohammadi, M., Obeid, I., et al. (2017). Electroencephalographic slowing: A primary source of error in automatic seizure detection. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE signal processing in medicine and biology symposium (pp. 1–5). Philadelphia, PA: IEEE. https://doi.org/10.1109/ SPMB.2017.8257018. 26. Golmohammadi, M., Ziyabari, S., Shah, V., Obeid, I., & Picone, J. (2018). Deep architectures for spatio-temporal modeling: Automated seizure detection in scalp EEGs. In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA). 1–6, Orlando, Florida, USA. https://doi.org/10.1109/ICMLA.2018.00118. 27. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–9). Boston, MA: IEEE. 28. Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the international conference on learning representations (ICLR) (pp. 1–14). San Diego, CA: ICLR. 29. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778). Las Vegas, NV. 30. Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the International Conference on Learning Representations (ICLR). San Juan, Puerto Rico. 31. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. Proceedings of the Conference on Neural Information Processing Systems, 2672–2680. https://doi.org/10.1017/CBO9781139058452. 32. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735. 33. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv Prepr. arXiv1412.3555 (pp. 1–9). 34. Natus Medical: Nicolet® NicVue Connectivity Solution. Retrieved from https:// neuro.natus.com/products-services/nicolet-nicvue-connectivity-solution. 35. Harati, A., Lopez, S., Obeid, I., Jacobson, M., Tobochnik, S., & Picone, J. (2014). 
The TUH EEG corpus: A big data resource for automated EEG interpretation. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE signal processing in medicine and biology symposium (pp. 1–5). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2014.7002953. 36. Lopez, S., Golmohammadi, M., Obeid, I., & Picone, J. (2016). An analysis of two common reference points for EEGs. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE signal processing in medicine and biology symposium (pp. 1–4). Philadelphia, PA: IEEE. https:// doi.org/10.1109/SPMB.2016.7846854. 37. Hirsch, L. J., Laroche, S. M., Gaspard, N. T., Gerard, E. F., Svoronos, A., & Herman, S. T. (2013). American Clinical Neurophysiology Society’s Standardized Critical Care EEG Terminology: 2012 version. Journal of Clinical Neurophysiology, 30, 1–27. https://doi.org/ 10.1097/WNP.0b013e3182784729.


38. Shah, V., von Weltin, E., Lopez, S., McHugh, J. R., Veloso, L., Golmohammadi, M., et al. (2018). The Temple University Hospital seizure detection corpus. Frontiers in Neuroinformatics, 12, 83. https://doi.org/10.3389/fninf.2018.00083. 39. Shah, V., Anstotz, R., Obeid, I., & Picone, J. (2018). Adapting an automatic speech recognition system to event classification of electroencephalograms. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB) (p. 1). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2016.7846854. 40. Harati, A., Golmohammadi, M., Lopez, S., Obeid, I., & Picone, J. (2015). Improved EEG event classification using differential energy. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–4). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2015.7405421. 41. Swisher, C. B., White, C. R., Mace, B. E., & Dombrowski, K. E. (2015). Diagnostic accuracy of electrographic seizure detection by neurophysiologists and non-neurophysiologists in the adult ICU using a panel of quantitative EEG trends. Journal of Clinical Neurophysiology, 32, 324–330. https://doi.org/10.1097/WNP.0000000000000144. 42. Kubota, Y., Nakamoto, H., Egawa, S., & Kawamata, T. (2018). Continuous EEG monitoring in ICU. Journal of Intensive Care, 6, 39. https://doi.org/10.1186/s40560-018-0310-z. 43. Nihon Kohden Corporation. Retrieved from https://us.nihonkohden.com/products/eeg-1200. 44. Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81, 1215–1247. https://doi.org/10.1109/5.237532. 45. Thodoroff, P., Pineau, J., & Lim, A. (2016). Learning robust features using deep learning for automatic seizure detection. In: Machine Learning and Healthcare Conference. 46. Mirowski, P., Madhavan, D., Lecun, Y., & Kuzniecky, R. (2009). Classification of patterns of EEG synchronization for seizure prediction. Clinical Neurophysiology, 120, 1927–1940. https://doi.org/10.1016/j.clinph.2009.09.002. 47. Subasi, A. (2007). EEG signal classification using wavelet feature extraction and a mixture of expert model. Expert Systems with Applications, 32, 1084–1093. https://doi.org/10.1016/ j.eswa.2006.02.005. 48. Jahankhani, P., Kodogiannis, V., & Revett, K. (2006). EEG signal classification using wavelet feature extraction and neural networks. In IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing (pp. 120–124). https://doi.org/10.1109/JVA.2006.17. 49. Da Rocha Garrit, P. H., Guimaraes Moura, A., Obeid, I., & Picone, J. (2015). Wavelet analysis for feature extraction on EEG signals. In NEDC Summer Research Experience for Undergraduates (p. 1). Philadelphia: Department of Electrical and Computer Engineering, Temple University. 50. Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2017). The Microsoft 2017 conversational speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 1–5). Calgary. 51. Picone, J. (1990). Continuous speech recognition using hidden Markov models. IEEE ASSP Magazine, 7, 26–41. https://doi.org/10.1109/53.54527. 52. Huang, K., & Picone, J. (2002). Internet-accessible speech recognition technology. In Proceedings of the IEEE midwest symposium on circuits and systems (pp. III-73–III-76). Tulsa, OK. 53. Parker, D., Picone, J., Harati, A., Lu, S., Jenkyns, M., & Polgreen, P. (2013). 
Detecting paroxysmal coughing from pertussis cases using voice recognition technology. PLoS One, 8, e82971. https://doi.org/10.1371/journal.pone.0082971. 54. Lu, S., & Picone, J. (2013). Fingerspelling gesture recognition using a two-level hidden Markov model. In Proceedings of the International Conference on image processing, computer vision, and pattern recognition (ICPV) (pp. 538–543). Las Vegas, NV. 55. Obeid, I., & Picone, J. (2018). Machine learning approaches to automatic interpretation of EEGs. In E. Sejdik & T. Falk (Eds.), Signal processing and machine learning for biomedical big data (p. 30). Boca Raton, FL: Taylor & Francis Group. https://doi.org/ 10.1201/9781351061223.


56. Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 1096–1103). New York, NY. 57. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408. https://doi.org/10.1111/1467-8535.00290. 58. Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153–160). Vancouver, BC. 59. Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. https://doi.org/10.1162/neco.2006.18.7.1527. 60. van der Maaten, L., Postma, E., & van den Herik, J. (2009). Dimensionality reduction: A comparative review. 61. Ross, D. A., Lim, J., Lin, R. S., & Yang, M. H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77, 125–141. https://doi.org/10.1007/s11263-007-0075-7. 62. Zinkevich, M., Weimer, M., Smola, A., & Li, L. (2010). Parallelized stochastic gradient descent. In Proceedings of neural information processing systems (pp. 2595–2603). Vancouver, BC. 63. Golmohammadi, M., Harati Nejad Torbati, A. H., de Diego, S., Obeid, I., & Picone, J. (2019). Automatic analysis of EEGs using big data and hybrid deep learning architectures. Frontiers in Human Neuroscience, 13, 76. https://doi.org/10.3389/fnhum.2019.00076. 64. Saon, G., Sercu, T., Rennie, S., & Kuo, H.-K. J. (2016). The IBM 2016 English Conversational Telephone Speech Recognition System. In Proceedings of the Annual Conference of the International Speech Communication Association (pp. 7–11). 65. Lopez, S. (2017). Automated identification of abnormal adult electroencephalograms. Department of Electrical and Computer Engineering, Temple University, Philadelphia, PA, USA. Retrieved from https://digital.library.temple.edu/digital/collection/p245801coll10/id/463223/rec/1. 66. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. 67. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the International Conference on Machine Learning, 807–814. 68. Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML workshop on deep learning for audio, speech and language processing (p. 6). Atlanta, Georgia, USA. 69. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958. https://doi.org/10.1214/12-AOS1000. 70. Kingma, D. P., & Ba, J. L. (2015). Adam: a method for stochastic optimization. In Proceedings of the International conference on learning representations (pp. 1–15). San Diego, CA. 71. Jelinek, F. (1997). Statistical methods for speech recognition. Boston, MA: MIT. 72. Bishop, C. (2011). Pattern recognition and machine learning. New York, NY: Springer. 73. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. https://doi.org/10.1007/BF00994018. 74. Bahl, L., Brown, P., de Souza, P., & Mercer, R. (1986).
Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 49–52). Tokyo. 75. Pandey, P. Deep generative models. Retrieved from https://towardsdatascience.com/deepgenerative-models-25ab2821afd3


76. Day, M. Y., Tsai, C. C., Chuang, W. C., Lin, J. K., Chang, H. Y., Fergus, R., et al. (2016). NIPS 2016 Tutorial: generative adversarial networks. EMNLP. https://doi.org/10.1007/9783-319-10590-1_53. 77. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Proceedings of neural information processing systems (NIPS) (pp. 1–9). Barcelona. 78. Yang, S., López, S., Golmohammadi, M., Obeid, I., & Picone, J. (2016). Semi-automated annotation of signal events in clinical EEG data. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE signal processing in medicine and biology symposium (SPMB) (pp. 1–5). Philadelphia, PA: IEEE. https://doi.org/10.1109/SPMB.2016.7846855. 79. Lang, K. J., Waibel, A., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3, 23–43. https://doi.org/10.1016/08936080(90)90044-L. 80. Levy, A., & Lindenbaum, M. (2000). Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Transactions on Image Processing, 9, 1371–1374. https://doi.org/ 10.1109/ICIP.1998.723422. 81. Clevert, D., Unterthiner, T., & Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In International conference on learning representations (ICLR) (pp. 1–14). San Juan, Puerto Rico. 82. Golmohammadi, M., Ziyabari, S., Shah, V., Obeid, I., & Picone, J. (2017). Gated recurrent networks for seizure detection. In I. Obeid & J. Picone (Eds.), Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (pp. 1–5). Philadelphia, PA: IEEE. https:// doi.org/10.1109/SPMB.2017.8257020. 83. Hermans, M., & Schrauwen, B. (2013). Training and analyzing deep recurrent neural networks. Advances in Neural Information Processing Systems, 190–198. Retrieved from http://dl.acm.org/citation.cfm?id=2999611.2999633. 84. Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. International Conference on Acoustics, Speech and Signal Processing, 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947. 85. Krauss, G. L., & Fisher, R. S. (2011). The Johns Hopkins Atlas of Digital EEG: An interactive training guide. Baltimore, MD: Johns Hopkins University Press. 86. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International conference on machine learning (ICML) (pp. 448–456). Lille, France. 87. Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning (ICML) (pp. 1139–1147). Atlanta, Georgia. 88. Shah, V., von Weltin, E., Ahsan, T., Obeid, I., & Picone, J. (2019). A cost-effective method for generating high-quality annotations of seizure events. Journal of Clinical Neurophysiology. (in review). Retrieved from www.isip.piconepress.com/publications/unpublished/journals/ 2017/jcn/ira. 89. Fiscus, J., Ajot, J., Garofolo, J., & Doddingtion, G. (2007). Results of the 2006 spoken term detection evaluation. In Proceedings of the SIGIR 2007 workshop: searching spontaneous conversational speech (pp. 45–50). Amsterdam, The Netherlands. 90. Japkowicz, N., & Shah, M. (2014). Evaluating learning algorithms: A classification perspective. Retrieved from https://www.amazon.com/Evaluating-Learning-AlgorithmsClassification-Perspective/dp/1107653118. 91. 
Shah, V., & Picone, J. NEDC Eval EEG: A comprehensive scoring package for sequential decoding of multichannel signals. Retrieved from https://www.isip.piconepress.com/projects/tuh_eeg/downloads/nedc_eval_eeg/. 92. Shah, V., Golmohammadi, M., Obeid, I., & Picone, J. (2018). Objective evaluation metrics for automatic classification of EEG events. Journal of Neural Engineering, 1–21. (in review). Retrieved from www.isip.piconepress.com/publications/unpublished/journals/2018/iop_jne/ metrics/.

276

M. Golmohammadi et al.

93. Liu, A., Hahn, J. S., Heldt, G. P., & Coen, R. W. (1992). Detection of neonatal seizures through computerized EEG analysis. Electroencephalography and Clinical Neurophysiology, 82, 32–37. https://doi.org/10.1016/0013-4694(92)90179-L. 94. Navakatikyan, M. A., Colditz, P. B., Burke, C. J., Inder, T. E., Richmond, J., & Williams, C. E. (2006). Seizure detection algorithm for neonates based on wave-sequence analysis. Clinical Neurophysiology, 117, 1190–1203. https://doi.org/10.1016/j.clinph.2006.02.016. 95. Fiscus, J. G., & Chen, N. (2013). Overview of the NIST Open Keyword Search 2013 Evaluation Workshop. Bethesda, Maryland, USA. 96. Sundermeyer, M., Ney, H., & Schluter, R. (2015). From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23, 517–529. https://doi.org/10.1109/TASLP.2015.2400218. 97. Bottou, L., & Lecun, Y. (2004). Large scale online learning. Advances in Neural Information Processing Systems, 217–225. Retrieved from https://papers.nips.cc/paper/2365-large-scaleonline-learning. 98. Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Networks for Machine Learning. 99. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159. Retrieved from https://dl.acm.org/citation.cfm?id=2021068. 100. Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv. abs/1212.5 (pp. 1–6). 101. Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the International conference on learning representations (ICLR) (pp. 1–22). Banff, Canada. 102. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT. 103. Perez, L., & Wang, J. (2017). The effectiveness of data augmentation in image classification using deep learning. arXiv.

Correction to: The Temple University Hospital Digital Pathology Corpus

Nabila Shawki, M. Golam Shadin, Tarek Elseify, Luke Jakielaszek, Tunde Farkas, Yuri Persidsky, Nirag Jhala, Iyad Obeid, and Joseph Picone

Correction to: Chapter 3 in: I. Obeid et al. (eds.), Signal Processing in Medicine and Biology, https://doi.org/10.1007/978-3-030-36844-9_3

This book was inadvertently published with sensitive patient information in Figure 3.9 of this chapter, which we did not have permission to display. The figure has since been revised to redact the patient's date of birth.

The updated original version of this chapter can be found at https://doi.org/10.1007/978-3-030-36844-9_3


Index

A
Approximate entropy (ApEn), 141, 142
Artifact reduction
  difference matrices, 110–111
  fused lasso penalty, 111–112
  optimization algorithm, 115–119
  parameters, 119–120
  problem formulation, 113–115
  related works, 109–110
  soft-thresholding, 111–112
  spline interpolation, 135
  total variation, 111–112
Audio analysis
  AVEC, 13, 15
  machine learning models, 27
  mPower study, 9
  PD diagnosis (see Parkinson's disease (PD) diagnosis)
  phoneme, 5
  recordings, 11
  VAD, 10
  voice collection, 3
Audio Visual Emotion recognition Challenge 2013 (AVEC), 9, 13, 15, 17, 20–29
Auditory spectral centroid (ASC), 45, 54, 56–57
  and ASF, 54
  center of gravity, 46
  recording location, 56
  ROC curve, 61
  systole and diastole, 58
Auditory spectral flux (ASF), 55–56
  calculation, 45
  features list, 49, 59
  PAG analysis, 43
  ROC curve, 61
  waveform, 46

B
Big data, 230, 236, 238–239
Bruit
  AVG phantoms, 53
  BEF (see Bruit-enhancing filter (BEF))
  expertise-dependence, 37
  pathologic blood sounds, 36
  recording location, 39
  stenosis and blood flow, 38–39
Bruit-enhancing filter (BEF), 40, 42–44

C
Classification
  EMG signals, 174–175
  GMM, 180–183
  GSF, 173–174
  k-NN, 177–178
  LDA, 179–180
  NBC, 178–179
  PAG spectral feature analysis, 52
  slide image, 99
  stenosis
    detection, 62–64
    severity, 40
  SVM, 175–177
Clinical support tools
  AVFs and AVGs, 36
  binary classification method, 64
  biomarker, 29
  case report, 87
  demographics survey, 7
  diagnoses, 28
  dialysis treatment, 62
  EEGs, 269, 270
  SCG correlation, 230
Conjoint penalty, 121–122
  designing parametric matrix, 124–125
  GME, 109
  optimization algorithm, 125–130
  transient artifact suppression (see Transient artifacts)
Convex optimization, 109, 130, 136
Convolutional neural networks (CNN)
  digital pathology (see Digital pathology)
  EEG signals, 245
  initialization methods, 266
  and LSTM, 253, 264
  performance, 264
  residual learning, 247–249
  revolutionized fields, 237
  RF, 91
  spatial/temporal context, 75
  state-of-the-art performance, 237
  2D model, 245–247
  unsupervised learning, 249–251

D
Deep learning, 75–76
  advances, 236–238
  baseline system architecture, 92–96
  big data, 238–239
  core components, 265–270
  EEG, 235
  evaluation metrics, 258–259
  experimental results, 96–100
  hybrid approaches, 261–264
  lung cancer, 91
  machine learning, 236
  postprocessing, 260–261
  SC-CNN, 91
Digital pathology
  analog image, 71
  annotation, 88–90
  bioengineering community, 74
  data
    anonymization, 86–88
    organization, 84–86
  deep learning, 75–76
  image digitization, 80–84
  ISUP, 71
  laboratories, 69
  low-grade prostate cancer, 70
  machine learning techniques, 71, 72
  NEDC, 75
  scanning slides, 71
  textbook case, 70
  tissue fixation, 73
  WSI, 72–74
Dynamic time warping (DTW), 212–214
  and DBA, 215–216
  k-medoid clustering, 218
  multi-dimensional, 168
  temporal morphology, 212

E
Electromyography (EMG)
  adaptation technique, 168
  analysis and modeling, 170
  classification
    accuracy, 187
    performance, 168
  filtered classification, 174–175
  GMM, 165
  Laplacian distribution, 164
  mapping, 163
  mathematical analysis, 164
  nerve function assessment, 161
  noise component, 172
  unfiltered MYO, 191
End-stage renal disease (ESRD), 35–36

F
Flexible microphone, 37, 54
Frequency domain linear prediction (FDLP), 40, 42, 43
Fused lasso penalty, 109, 111–113, 115, 119, 121, 123

G
Gaussian filter
  EMG, 161, 162
  grasping task, 189–193
  hand gestures, 184–188
  problem formulation, 169–172
  related works, 162–169
Gaussian mixtures model (GMM), 162, 165, 180–183, 196, 198
Gaussian smoothing filter (GSF), 173–174
  classification performance, 189, 193
  sensed EMG signals, 162
  See also Gaussian filter

Generalized Moreau envelope (GME), 109, 112–113, 121
Generative adversarial networks (GANs), 237, 249
The Geneva minimalistic acoustic parameter set (GeMAPS), 9, 12–14, 17, 20, 23–28

H
Hidden Markov models (HMM), 93, 94, 245, 253, 255, 256, 268
Hurst exponent (H), 142–144
  ApEn, 141
  applications, head movements, 146–147
  discussion, 148–151
  experimental protocol, 144
  fractal
    statistics, 140
    system, 150
  group dynamics, 150
  head movements, 144–146
  hierarchical behavior, 140
  medical errors, 139
  rescaled range, 149
  results, 147–148
  resuscitation, 139
  SampEn, 142
  simplified approach, 153–157
  visual cues, 141

K
k-medoids, 217, 218, 222, 225, 230
k-nearest neighbor (k-NN), 162, 177–178, 187–193, 196, 198

L
Linear discriminant analysis (LDA), 41, 162, 165, 179–180, 186–193, 196
Long short-term memory (LSTM) networks, 96, 237, 256
  bidirectional network, 254
  channel-based, 263
  CNN, 253, 264, 265, 268, 270
  HMM, 256, 263
  IPCA, 253

M
Machine learning
  ANN, 20–22
  clustering, 40
  cross validation, 17
  decision trees, 17–18
  design cycle, 240
  extra trees, 19
  gradient boosted decision trees, 19
  grid search, 17
  healthcare data, 27
  PD diagnosis (see Parkinson's disease (PD) diagnosis)
  random forests, 18
  research, 260
  SVMs, 19–20
  and voice analysis techniques, 1
Maximum relevance minimum redundancy (mRMR) algorithm, 13, 15, 16
Mel-frequency cepstrum coefficients (MFCCs), 6, 12, 26, 29, 241
Microphone
  AVFs and AVGs, 36
  ESRD, 35
  mathematical analysis, 37
  recording, 37
  skin-coupled (see Skin-coupled recording)
  speech recognition, 237
  static images, 237
Morphological component analysis (MCA), 113, 119

N
Naïve Bayes classification (NBC) technique, 162, 178–180, 186–193, 196, 198
Near-infrared spectroscopic (NIRS) data, 107, 109, 130–136
Non-convex regularization
  conjoint penalty, 121
  ICU patients, 239
  signal values, 131
  sparse signals, 109
  suboptimal local minimizers, 108

P
Parkinson's disease (PD) diagnosis
  AI, 1
  algorithm, 9
  detection, 4
  discussion, 27–29
  health-related applications, 1
  mPower voice dataset, 6–8
  neurodegenerative disease, 3
  pathophysiology, 4–5
  results, 22–27
  VAD, 10–11
  voice, 2–3, 5–6
Phonoangiography (PAG)
  BEF, 40, 42–44
  blood flow, 38–39
  classifier methods, 41
  mathematical analysis, 37
  recording location, 37
  signal analysis, 40
  spectral feature extraction, 47–48, 63
  stenosis (see Stenosis)
  systole/diastole segmentation, 47
  vascular access stenosis, 57–62
Polyvinylidene fluoride (PVDF), 48–51, 53, 56, 57

R
Recurrent neural networks (RNN), 237, 252, 255, 266
Resuscitation
  Hurst exponent, 148
  PA students, 144
  spatial positioning, 145
  TeamSTEPPS, 146
  trauma patients, 139, 145

S
Sample entropy (SampEn), 142
Seismocardiography (SCG)
  accelerometer, 205
  averaging beats
    clustering algorithms, 216–217
    DBA, 215–216
    k-medoid clustering, 218
    PCG, 214
  cardiac events, 206
  clusters
    distribution, 223–224
    heart rate, 225
    optimum number, 219–220
    switching, 224, 225
  experimental measurements, 209, 210
  fiducial points, 206
  HLV/LLV and INS/EXP, 221–222
  interrelated mechanisms, 207
  intra-cluster variability, 225–230
  mechanical cardiac activity, 205
  morphologies, 207
  preprocessing
    filtering, 210
    signal, 210–211
  supervised and unsupervised ML, 208
  temporal morphology, 208
  waveforms, 206, 208
Seizure
  detection, 242, 244
  identification, 255
  LSTM, 253
  spatial significance, 237
  TUEG, 239
Sensors
  array, 51–52
  bruit recording, 48
  frequency response, 50–51
  PVDF film, 50
  single-tone testing, 51
  in smart phones, 6
  transducers, 49
Sequential signals
  design cycle, 240
  HMM, 240
  LFCCs, 241–243
  temporal and spatial context modeling, 243–245
Simulated trauma, 142, 144–146, 148
Skin-coupled recording, 48–52
Sparse signal processing, 108, 109
Spectral centroid, 26, 41, 57, 58
Spectral flux, 15, 26, 40, 41, 45, 56
Stenosis
  ASC and ASF values, 54–55
  detection and classification, 62–64
  phantoms
    auditory signals, 53
  phonoangiographic detection (see Phonoangiography (PAG))
  ten-second recordings, 54
Support vector machines (SVMs), 19–20, 175–177
  GeMAPS, 23
  machine learning approaches, 236
  neurocognitive disorders, 3
  RF, 208

T
TeamSTEPPS, 144, 146–148, 151
Temporal dependencies
  end-to-end sequence labeling, 253–255
  incremental principal component analysis, 252–253
  using LSTMs, 255–257
Transient artifacts
  convex optimization, 109
  EEG/ECG, 107
  example 1, 130–131
  example 2, 131–136
  optimization approach, 108
  sparsity, 108
  See also Artifact reduction
TUH digital pathology corpus (TUDP), 75, 77–80

U
Unsupervised machine learning
  averaging beats, 214–218
  clustering SCG morphology, 211–212
  DTW, 212–214

V
Variability, 5, 12, 54, 63, 151, 206–208, 212, 218, 225–230
Vascular access
  arteriovenous fistulas, 35
  central veins, 37
  detection (see Stenosis)
  dysfunction, 37
  hemodialysis, 36, 37, 42
  nominal lumen diameters, 39
  phantoms, 47
  phonoangiographic detection, 57–62
  physiological ranges, 55
  proof-of-concept experiments, 37

W
Wavelet
  auditory signals, 45–47
  EEG signals, 241
  Fourier transform, 164
  multi-resolution analysis, 167
  PAG analysis, 44
  sub-bands, 110
  thresholding method, 136