Intelligent Fault Diagnosis and Health Assessment for Complex Electro-Mechanical Systems
ISBN: 9819935369, 9789819935369

Based on AI and machine learning, this book systematically presents the theories and methods for complex electro-mechanical system fault diagnosis, prognostics, and health assessment.


English · Pages: 478 [474] · Year: 2023

Table of contents:
Preface
Contents
1 Introduction
1.1 Intelligent Fault Diagnosis, Prognostics, and Health Assessment
1.2 The Significance of Complex Electro-Mechanical System Fault Diagnosis, Prognostics, and Health Assessment
1.3 The Contents of Complex Electro-Mechanical System Fault Diagnosis, Prognostics, and Health Assessment
1.4 Overview of Intelligent Fault Diagnosis, Prognostics, and Health Assessment (IFDPHA)
1.4.1 Shallow Machine Learning-Based Methods
1.4.2 Deep Learning-Based Methods
1.5 Organization and Characteristics of the Book
References
2 Supervised SVM Based Intelligent Fault Diagnosis Methods
2.1 The Theory of Supervised Learning
2.1.1 The General Model of Supervised Learning
2.1.2 Risk Minimization Problem
2.1.3 Primary Learning Problem
2.2 Support Vector Machine
2.2.1 Linear Support Vector Machine
2.2.2 Nonlinear Support Vector Machine
2.2.3 Kernel Function
2.2.4 The Applications of SVM in Machinery Fault Diagnosis
2.3 The Parameters Optimization Method for SVM
2.3.1 Ant Colony Optimization
2.3.2 Ant Colony Optimization Based Parameters Optimization Method for SVM
2.3.3 Verification and Analysis by Exposed Datasets
2.3.4 The Application in Electrical Locomotive Rolling Bearing Single Fault Diagnosis
2.4 Feature Selection and Parameters Optimization Method for SVM
2.4.1 Ant Colony Optimization Based Feature Selection and Parameters Optimization Method for SVM
2.4.2 The Application in Rotor Multi Fault Diagnosis of Bently Testbench
2.4.3 The Application in Electrical Locomotive Rolling Bearing Multi Fault Diagnosis
2.5 Ensemble-Based Incremental Support Vector Machines
2.5.1 The Theory of Ensemble Learning
2.5.2 The Theory of Reinforcement Learning
2.5.3 Ensemble-Based Incremental Support Vector Machines
2.5.4 The Comparison Experiment Based on Rolling Bearing Incipient Fault Diagnosis
2.5.5 The Application in Electrical Locomotive Rolling Bearing Compound Fault Diagnosis
References
3 Semi-supervised Learning Based Intelligent Fault Diagnosis Methods
3.1 Semi-supervised Learning
3.2 Fault Detection and Classification Based on Semi-supervised Kernel Principal Component Analysis
3.2.1 Kernel Principal Component Analysis
3.2.2 Semi-supervised Kernel Principal Component Analysis
3.2.3 Semi-supervised KPCA Classification Algorithms
3.2.4 Application of Semi-supervised KPCA Method in Transmission Fault Detection and Classification
3.3 Outlier Detection Based on Semi-supervised Fuzzy Kernel Clustering
3.3.1 Correlation of Outlier Detection and Early Fault
3.3.2 Semi-supervised Fuzzy Kernel Clustering
3.3.3 Semi-supervised Hypersphere-Based Fuzzy Kernel Clustering Method
3.3.4 Transmission Early Fault Detection
3.4 Semi-supervised SOM Based Fault Diagnosis
3.4.1 Semi-supervised SOM Fault Diagnosis
3.4.2 Semi-supervised GNSOM Fault Diagnosis
3.4.3 Semi-supervised DPSOM Fault Diagnosis
3.4.4 Example Analysis
3.5 Relevance Vector Machine Diagnosis Method
3.5.1 Introduction to RVM
3.5.2 RVM Classifier Construction Method
3.5.3 Application of RVM in Fault Detection and Classification
References
4 Manifold Learning Based Intelligent Fault Diagnosis and Prognosis
4.1 Manifold Learning
4.2 Spectral Clustering Manifold Based Fault Feature Selection
4.2.1 Spectral Clustering
4.2.2 Spectral Clustering Based Feature Selection
4.2.3 DSTSVM Based Feature Extraction
4.2.4 Machinery Incipient Fault Diagnosis
4.3 LLE Based Fault Recognition
4.3.1 Local Linear Embedding
4.3.2 Classification Based on LLE
4.3.3 Dimension Reduction Performance Comparison Between LLE and Other Manifold Methods
4.3.4 LLE Based Fault Diagnosis
4.3.5 VKLLE Based Bearing Health State Recognition
4.4 Fault Classification Based on Distance Preserving Projection
4.4.1 Locality Preserving Projections
4.4.2 NFDPP
4.4.3 Experiment Analysis for Engine Misfire
4.4.4 Local and Global Spectral Regression Method
4.4.5 Application of Method Based on Distance Preserving Projections and Its Spectral Regression in Fault Classification
References
5 Deep Learning Based Machinery Fault Diagnosis
5.1 Deep Learning
5.2 DBN Based Machinery Fault Diagnosis
5.2.1 Deep Belief Network
5.2.2 DBN Based Vibration Signal Diagnosis
5.2.3 DBN Based Fault Classification
5.3 CNN Based Fault Classification
5.3.1 Convolutional Neural Network
5.3.2 CNN Based Fault Diagnosis Method
5.3.3 Transmission Fault Diagnosis Under Variable Speed
5.4 Deep Learning Based Equipment Degradation State Assessment
5.4.1 Stacked Autoencoder
5.4.2 Recurrent Neural Network
5.4.3 DAE-LSTM Based Tool Degradation State Assessment
References
6 Phase Space Reconstruction Based on Machinery System Degradation Tracking and Fault Prognostics
6.1 Phase Space Reconstruction
6.1.1 Takens Embedding Theorem
6.1.2 Determination of Delay Time
6.1.3 Determination of Embedding Dimensions
6.2 Recurrence Quantification Analysis Based on Machinery Fault Recognition
6.2.1 Phase Space Reconstruction Based RQA
6.2.2 RQA Based on Multi-parameters Fault Recognition
6.3 Kalman Filter Based Machinery Degradation Tracking
6.3.1 Standard Deviation Based RQA Threshold Selection
6.3.2 Selection of Degradation Tracking Threshold
6.4 Improved RQA Based Degradation Tracking
6.5 Kalman Filter Based Incipient Fault Prognostics
6.6 Particle Filter Based Machinery Fault Prognostics
6.7 Particle Filter
6.8 Enhanced Particle Filter
6.9 Enhanced Particle Filter Based Machinery Components Residual Useful Life Prediction
References
7 Complex Electro-Mechanical System Operational Reliability Assessment and Health Maintenance
7.1 Complex Electro-Mechanical System Operational Reliability Assessment
7.1.1 Definitions of Reliability
7.1.2 Operational Reliability Assessment
7.2 Reliability Assessment and Health Maintenance of Turbo Generator Set in Power Plant
7.2.1 Condition Monitoring and Vibration Signal Acquisition
7.2.2 Vibration Signal Analysis
7.2.3 Operational Reliability Assessment and Health Maintenance
7.2.4 Analysis and Discussion
7.3 Reliability Assessment and Health Maintenance of Compressor Gearbox in Steel Mill
7.3.1 Condition Monitoring and Vibration Signal Acquisition
7.3.2 Vibration Signal Analysis
7.3.3 Operational Reliability Assessment and Health Maintenance
7.3.4 Analysis and Discussion
7.4 Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance
7.4.1 The Structure Characteristics of Aero-Engine Rotor
7.4.2 Aero-Engine Rotor Assembly Reliability Assessment Test System
7.4.3 Experiment and Analysis
7.4.4 In-Service Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance
References

Weihua Li · Xiaoli Zhang · Ruqiang Yan

Intelligent Fault Diagnosis and Health Assessment for Complex Electro-Mechanical Systems

Weihua Li School of Mechanical and Automotive Engineering South China University of Technology Guangzhou, Guangdong, China

Xiaoli Zhang School of Construction Machinery Chang’an University Xi’an, Shaanxi, China

Ruqiang Yan School of Mechanical Engineering Xi’an Jiaotong University Xi’an, Shaanxi, China

ISBN 978-981-99-3536-9    ISBN 978-981-99-3537-6 (eBook)
https://doi.org/10.1007/978-981-99-3537-6

Jointly published with National Defense Industry Press

The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: National Defense Industry Press.

© National Defense Industry Press 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Production equipment in process industries such as energy, petrochemicals, and metallurgy, and in other sectors of the modern national economy such as national defense, usually runs under complex and harsh working conditions, including high temperature, heavy load, severe corrosion, and serious fatigue under alternating stress. Faults of varying severity inevitably occur in the key components of such complex electro-mechanical systems during operation, posing hidden dangers to production safety and even risking disastrous casualties. Fault diagnosis and prognostics are therefore of great significance for ensuring the safe operation of equipment. Liangsheng Qu, an academician of the Chinese Academy of Sciences, pointed out that equipment fault diagnosis is essentially a pattern recognition problem. Accordingly, important development directions such as intelligent fault diagnosis, prognostics, and health assessment have grown out of the successful application of pattern recognition methods based on artificial intelligence (AI) and machine learning in the field of fault diagnosis.

Starting from this point, this book combines the authors' latest research achievements in intelligent fault diagnosis, prognostics, and health assessment, and mainly focuses on the application of novel machine learning methods, such as enhanced support vector machines, semi-supervised learning, manifold learning, and deep belief networks, to signal feature extraction and fault diagnosis. In addition, the application of performance degradation assessment methods for mechanical components based on phase space reconstruction theory is introduced and illustrated through simulation analysis and typical engineering cases.

The book consists of seven chapters. Chapter 1 introduces the definition of a complex electro-mechanical system and the research contents and development status of intelligent fault diagnosis, prognostics, and health assessment. Chapter 2 presents supervised support vector machine (SVM)-based algorithms and their applications in machinery fault diagnosis. Semi-supervised intelligent fault diagnosis methods, such as kernel principal component analysis (KPCA), fuzzy kernel clustering algorithms, self-organizing map (SOM) neural networks, and relevance vector machines (RVM), are systematically described in Chap. 3. In Chap. 4, fault feature selection and dimension reduction algorithms based on manifold learning, including spectral clustering, locally linear embedding (LLE), and distance-preserving projection, are addressed, followed by an introduction to deep learning theories and deep belief network (DBN)-based signal reconstruction and fault diagnosis methods in Chap. 5.


Chapter 6 introduces the basic theory of phase space reconstruction, along with degradation performance assessment and remaining useful life (RUL) prediction for electro-mechanical systems based on recurrence quantification analysis (RQA) and the Kalman filter (KF). Finally, Chap. 7 discusses the reliability assessment of typical complex electro-mechanical systems, such as turbine generator sets, compressor gearboxes, and aero-engine rotors.

This book is a summary of the authors' long-term research, and most of the examples presented are research findings on the intelligent fault diagnosis and prognostics of complex electro-mechanical systems. Chapters 1, 3, 4, and 5 were written and compiled by Prof. Weihua Li; Prof. Xiaoli Zhang mainly contributed Chaps. 2 and 7; and Prof. Ruqiang Yan wrote Chap. 6.

The research presented in this book was supported by a number of projects and individuals. I would like to express sincere appreciation for the support of the National Natural Science Foundation of China under Grants 50605021, 51075150, 51175080, 51405028, and 51475170 (in order of funding time) and of the China Postdoctoral Science Foundation. I am grateful to Prof. Shuzi Yang, an academician of the Chinese Academy of Sciences, and to Prof. Tielin Shi, for their guidance and encouragement. I would also like to acknowledge Prof. Rui Kang from Beihang University and Editor Tianming Bai from National Defense Industry Press for their great support and help. Last, I would like to thank the graduate students who contributed greatly to the proofreading and typesetting of the book, including Yixiao Liao, Can Pan, Bin Zhang, Lanxin Liu, and Qiuli Chen.

Nothing is perfect, and this book inevitably has shortcomings; we welcome criticism and generous advice from readers.

Guangzhou, China
March 2020

Weihua Li


Chapter 1

Introduction

1.1 Intelligent Fault Diagnosis, Prognostics, and Health Assessment

Intelligent fault diagnosis, prognostics, and health assessment (IFDPHA) refers to technology that leverages artificial intelligence and machine learning algorithms to perform an array of tasks for a complex electro-mechanical system, such as health monitoring, fault diagnosis, and remaining useful life prediction. From the perspective of artificial intelligence, the intelligent fault diagnosis and prognostics of a complex electro-mechanical system is a typical pattern recognition process, in which machine learning algorithms learn pattern knowledge from historical data and establish an end-to-end model whose predictions guide fault diagnosis, prognostics, and health assessment.

The main contents of IFDPHA include three aspects: (1) fault detection and identification (fault diagnosis); (2) fault severity assessment (health condition assessment); (3) remaining useful life prediction (prognosis).

Fault detection and fault identification are the two major steps of fault diagnosis. Fault detection judges whether a fault has occurred and ensures that faults are detected in time, avoiding the serious consequences of fault evolution. Fault identification identifies and locates faulty components so that downtime and maintenance costs can be reduced. Different from fault detection and identification, health condition assessment focuses on quantitatively analyzing the fault severity and performance degradation of the system. An incipient fault usually has little impact on the performance and operation of a complex electro-mechanical system, so maintenance at that stage may waste maintenance resources and increase costs; however, when the fault evolves to an extent that may cause serious production accidents, maintenance should be scheduled and conducted on time. Based on the results of health condition assessment, fault prognosis forecasts the future fault types, fault severity, and degradation trends of the complex electro-mechanical system, and predicts the remaining useful life of components or systems.


IFDPHA is an extremely valuable asset for improving the overall maintenance and reliability of complex electro-mechanical systems, as it guides predictive maintenance, extends remaining useful life, and minimizes unplanned downtime.
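To make the three tasks above concrete, the sketch below (ours, not the book's) frames each as a standard machine-learning problem using scikit-learn estimators; the estimator choices and hyperparameters are purely illustrative assumptions.

```python
# Illustrative mapping (an assumption, not the book's prescription) of the
# three IFDPHA tasks onto standard machine-learning problem types.
from sklearn.svm import OneClassSVM, SVC
from sklearn.ensemble import RandomForestRegressor

# (1) Fault detection: has a fault occurred? Often posed as anomaly
#     detection, trained only on healthy-condition feature vectors.
detector = OneClassSVM(kernel="rbf", nu=0.05)

# (2) Fault identification / severity assessment: which component failed,
#     and how severe? A multi-class classification over fault-mode labels.
classifier = SVC(kernel="rbf", C=10.0, gamma="scale")

# (3) Prognosis: remaining useful life (RUL) as a regression target.
regressor = RandomForestRegressor(n_estimators=100)

# detector.fit(healthy_features)            # X: (n_samples, n_features)
# classifier.fit(features, fault_labels)    # y: fault mode per sample
# regressor.fit(features, rul_hours)        # y: remaining life per sample
```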

1.2 The Significance of Complex Electro-Mechanical System Fault Diagnosis, Prognostics, and Health Assessment

Complex electro-mechanical systems are the epitome of the human pursuit of product performance and engineering design, and they arise inevitably in the development of mechanical equipment. With the rapid development of information technology, computer science, and artificial intelligence, complex electro-mechanical systems have been endowed with richer connotations and more complex functions. Modern industrial production imposes strict requirements on the production process and product quality, so traditional mechanical systems are gradually being replaced by various complex electro-mechanical systems.

According to the definition in [1], a modern complex electro-mechanical system is a physical system that integrates mechanical, electrical, hydraulic, and optical processes. A variety of functional units are integrated into different electro-mechanical systems, such as aero engines, high-speed trains, precision machine tools, and modern production equipment, through the fusion and driving of information flows. Complex electro-mechanical systems usually consist of a large number of parts and components with complex coupling relationships among them; it is therefore necessary to consider both the independent behavior of each subsystem and the coupling relationships between subsystems. Owing to the complexity of functions, structures, and coupling relationships, and to the diversity of physical processes, the fault diagnosis, prognostics, and health assessment of complex electro-mechanical systems faces great challenges.

Complex electro-mechanical systems show an inevitable trend toward larger scale, higher complexity, and higher precision. In industrial fields such as petrochemicals, metallurgy, electric power, and machinery, equipment typically operates in harsh and varying environments with high temperatures, high speeds, and heavy loads. Mechanical and electrical system failures are a main cause of catastrophic accidents. For example, in 2011, the driving chain of an escalator in the Beijing subway broke and the escalator ran in the reverse direction, resulting in a stampede of passengers. In 2012, transmission system faults occurred in wind power plants in both Hebei and Jilin Provinces. The National Transportation Safety Board (NTSB) report on the Harrison Ford plane crash, issued on August 6, 2015, indicated that a mechanical failure of the carburetor caused the plane's engine to lose power. Similar examples are too numerous to enumerate here. The failure of an electro-mechanical system may cause huge economic losses, environmental pollution, and even casualties.


It is therefore urgent to conduct health condition monitoring and assessment for electro-mechanical systems over their long-term service.

The fault diagnosis, prognosis, and health assessment of complex electro-mechanical systems has become an important technical tool for ensuring safe, stable, and reliable production, and has attracted increasing attention from industry, academic institutions, and government departments. The technology for ensuring the reliability, safety, and maintainability of major products, major equipment, and key parts has been listed as a key technology for breakthrough both in The National Medium- and Long-Term Plan for the Development of Science and Technology (2006-2020) and in the Mechanical Engineering Disciplines Development Strategy Report (2011-2020). According to the China Intelligent Manufacturing Engineering Implementation Guide (2016-2020) released by the Ministry of Industry and Information Technology, developing online fault diagnosis and analysis methods based on big data is a crucial technology deserving more attention: it can help avoid catastrophic accidents, improve equipment utilization, shorten downtime and maintenance time, and ensure product quality.

In industrial production, the ultimate goal of mechanical maintenance is to keep machines operating safely and efficiently. Maintenance strategy has developed through four stages: (1) corrective maintenance; (2) planned maintenance; (3) condition-based maintenance; (4) predictive maintenance.

Corrective maintenance refers to the strategy in which maintenance is carried out only after an anomaly or failure is uncovered. This strategy may be cost-effective in regular operation, but once a catastrophic fault occurs, the costs of downtime and of the repairs needed to bring the machine back to normal are very high. Planned maintenance follows a plan of action, created by the equipment manufacturer according to prescribed criteria, to replace and overhaul key parts of the equipment regularly. Compared with corrective maintenance, planned maintenance reduces the risk of failure or performance degradation, but it may increase production costs because still-serviceable parts are wasted.

With the development of fault diagnosis technology, equipment maintenance is changing from corrective or planned maintenance to condition-based maintenance, which is performed after one or more observed indicators show that the equipment is going to fail or that its performance is deteriorating. Ideally, condition-based maintenance is scheduled only when necessary and, in the long term, drastically reduces maintenance costs and avoids the occurrence of serious faults, thereby minimizing system downtime and time spent on maintenance. Thanks to the development of prognosis techniques, a more powerful and advanced maintenance strategy, called predictive maintenance, can be achieved by continuously monitoring the condition parameters of the system, establishing a health assessment model, recognizing fault types, fault severity, and degradation trends, and predicting the remaining useful life of key components. Predictive maintenance guides maintainers in selecting the appropriate maintenance strategy and scheduling the plan at an optimized cost.
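As a small illustration of the condition-based idea, the sketch below (ours, with purely illustrative parameter values) raises a maintenance alarm once a smoothed health indicator crosses a preset threshold.

```python
# A minimal condition-based maintenance trigger (illustrative sketch).
import numpy as np

def cbm_alarm(indicator, threshold, alpha=0.1):
    """Return the first sample index at which the exponentially smoothed
    indicator exceeds the threshold, or None if it never does."""
    ewma = indicator[0]
    for t, x in enumerate(indicator):
        ewma = alpha * x + (1 - alpha) * ewma  # exponential smoothing
        if ewma > threshold:
            return t
    return None

# Toy health indicator: vibration RMS drifting upward as a fault develops.
rng = np.random.default_rng(0)
rms = 1.0 + 0.01 * np.arange(500) + rng.normal(0, 0.1, 500)
print(cbm_alarm(rms, threshold=3.0))  # sample index that triggers maintenance
```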


Overall, fault diagnosis, prognosis, and health assessment is of great significance for ensuring equipment reliability, improving system utilization, reducing maintenance costs, and avoiding safety accidents in electro-mechanical systems with extremely complex functions, structures, and coupling relationships.

1.3 The Contents of Complex Electro-Mechanical System Fault Diagnosis, Prognostics, and Health Assessment

A complex electro-mechanical system works under complicated conditions in which various physical processes are coupled with one another. It is generally hard to establish a precise physical model for such a system, since not only must each physical process be modeled but the coupling relationships between processes must also be considered; such modeling is therefore ill-suited to the fault diagnosis and prognosis of complex systems. Fortunately, IFDPHA methods, which leverage machine learning algorithms to learn pattern knowledge from historical data, show powerful performance and solid advantages for establishing fault diagnosis and prognostics models in complex systems engineering.

The IFDPHA methodology mainly includes four steps: (1) signal acquisition; (2) signal processing; (3) feature extraction and selection; (4) fault diagnosis, prognosis, and health assessment. Signal acquisition captures measured signals related to the health condition of the equipment, including vibration, pressure, speed, temperature, and acoustic signals. Since the actual working environment of machines is harsh, the collected signals are often noisy; signal processing is therefore applied to preprocess the collected signals to reduce or even eliminate the effect of noise. After signal acquisition and processing, feature extraction and selection is another crucial step, in which various features, such as time-domain, frequency-domain, and time-frequency-domain features, are extracted from the preprocessed signals using advanced signal processing techniques. Discriminative features are then selected to reduce feature redundancy, which would otherwise degrade the final performance; the selected features serve as a comprehensive representation of the health condition of the complex electro-mechanical system. Taking such features as inputs, the fault diagnosis, prognosis, and health assessment step establishes a classification or regression model trained on the historical data. With all the steps above, monitoring signals can be fed into the trained model to obtain the corresponding health status of the complex electro-mechanical system. (A minimal end-to-end sketch of these four steps is given at the end of this section.)

On the one hand, IFDPHA methods can be categorized into three groups according to whether labeled and unlabeled data are available for model training: supervised learning, unsupervised learning, and semi-supervised learning. In supervised learning, samples annotated with labels are used to train the model.


Supervised learning works well when the labeled samples are sufficient, and a model with high accuracy and good generalization performance can then be obtained; however, the trained model may overfit when the labeled samples are insufficient. In unsupervised learning, only unlabeled samples are available for model training. Unsupervised methods learn useful information about health conditions by analyzing the inherent similarity relationships between samples; clustering algorithms, for example, divide the samples into clusters of similar samples. Generally, since there is no supervision information, the performance of unsupervised models is often not good enough, and at present unsupervised learning is mainly used for anomaly detection. In semi-supervised learning, plentiful unlabeled data and a few labeled data are assumed to be available for model training; this has been proven an effective way to improve model performance when sufficient labeled samples are lacking. All three kinds of algorithms can be used to develop IFDPHA methods, but they suit different application scenarios. Specifically, supervised learning yields a model with satisfying performance when labeled samples are sufficient; unsupervised learning is a good choice when labeled samples are unavailable; and semi-supervised learning addresses the overfitting caused by the sparsity of labeled data.

On the other hand, IFDPHA methods can be categorized into two groups according to model architecture: shallow machine learning-based and deep learning-based. Shallow machine learning methods, such as artificial neural networks (ANN), support vector machines (SVM), clustering algorithms (CA), hidden Markov models (HMM), random forests (RF), and manifold learning, are widely applied for IFDPHA. These methods apply only one or two nonlinear transformations to the input data, so their computational cost is small. In addition, their simple structure means fewer parameters, and good generalization performance can be achieved even with few training samples. However, the feature extraction ability of shallow structures is limited, so the input features must be extracted and selected manually. In contrast, deep learning, a branch of machine learning that has become increasingly popular since Hinton and Salakhutdinov [2] used a greedy layer-wise learning algorithm to solve the vanishing-gradient problem in deep neural network training, attempts to abstract data at a higher level using multiple processing layers (deep neural networks) consisting of complex structures and multiple nonlinear transformations. Features can be extracted and selected automatically by deep learning algorithms, eliminating manual feature engineering. However, a deep neural network with multiple layers has a large number of parameters to optimize, so a large amount of fault data is required for training; overfitting may occur when the training samples are insufficient. Deep learning methods also require much more training time than shallow machine learning methods.
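As a concrete illustration of the semi-supervised idea described above, the following sketch (ours, on synthetic toy data) wraps an SVM in scikit-learn's self-training scheme, which iteratively pseudo-labels confident unlabeled samples; every value here is an illustrative assumption.

```python
# Semi-supervised diagnosis via self-training (illustrative sketch).
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
# Toy setup: 20 labeled feature vectors (two fault modes) and 200 unlabeled.
X_lab = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(3, 1, (10, 5))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(3, 1, (100, 5))])

X = np.vstack([X_lab, X_unl])
y = np.hstack([y_lab, -np.ones(len(X_unl), dtype=int)])  # -1 marks "unlabeled"

# The base SVM must expose class probabilities so that only confident
# pseudo-labels are added to the training set on each round.
model = SelfTrainingClassifier(SVC(kernel="rbf", probability=True), threshold=0.9)
model.fit(X, y)
print(model.predict(X_lab[:3]))
```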
In conclusion, the content of the intelligent diagnosis, prognostics, and health assessment of complex electro-mechanical systems is to use machine learning methods to learn relevant knowledge from data and to establish the corresponding diagnosis and prognostics models for evaluating the health status of equipment.
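To tie the four steps together, here is a minimal end-to-end sketch (ours, on synthetic signals): band-pass filtering as the signal-processing step, a few common time-domain condition indicators as features, univariate selection, and an SVM classifier. The sampling rate, filter band, feature set, and model are illustrative assumptions rather than the book's prescriptions.

```python
# Four-step IFDPHA pipeline on synthetic data (illustrative sketch):
# (1) acquisition -> (2) processing -> (3) features + selection -> (4) model.
import numpy as np
from scipy.signal import butter, filtfilt
from scipy.stats import kurtosis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

fs = 12_000  # sampling rate in Hz (assumed)

def preprocess(sig):
    # Step 2: band-pass filter to suppress drift and out-of-band noise.
    b, a = butter(4, [50 / (fs / 2), 4000 / (fs / 2)], btype="band")
    return filtfilt(b, a, sig)

def extract_features(sig):
    # Step 3a: common time-domain condition indicators.
    rms = np.sqrt(np.mean(sig ** 2))
    crest = np.max(np.abs(sig)) / rms
    return np.array([rms, kurtosis(sig), crest, np.mean(np.abs(sig)), np.std(sig)])

rng = np.random.default_rng(1)
signals = rng.normal(0, 1, (60, fs))  # Step 1: sixty one-second records (toy)
labels = rng.integers(0, 3, 60)       # toy fault-mode labels
X = np.array([extract_features(preprocess(s)) for s in signals])

# Steps 3b + 4: keep the most discriminative features, then classify.
model = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=3), SVC())
model.fit(X, labels)
```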


1.4 Overview of Intelligent Fault Diagnosis, Prognostics, and Health Assessment (IFDPHA)

IFDPHA is of great significance for improving production efficiency and reducing accident rates. Industry and academia have paid much attention to related methods and applications and have proposed a large number of IFDPHA methods. At present, research on the IFDPHA of complex electro-mechanical systems mainly focuses on key components such as gears and bearings. In this section, the research status of intelligent diagnosis, prognosis, and health assessment is reviewed for shallow machine learning-based and deep learning-based methods, respectively.

1.4.1 Shallow Machine Learning-Based Methods

Many scholars have carried out research on intelligent diagnosis, prognosis, and health assessment based on shallow machine learning methods. Lei et al. [3] extracted two features from multi-sensor signals measured at different positions on a planetary gearbox: the root mean square value of the normal meshing components of the signal, and the normalized sum of all positive values of the spectrum difference between the measured and healthy signals. An adaptive neuro-fuzzy inference system was then used to fuse these features and diagnose the fault mode and fault degree of the planetary gearbox. Unal et al. [4] used envelope analysis (EA), the Hilbert transform (HT), and the fast Fourier transform (FFT) to extract features from vibration signals as the input of an ANN for fault diagnosis; the structure of the network was then optimized by a genetic algorithm (GA). You et al. [5] proposed a wind turbine gearbox fault diagnosis method based on ensemble empirical mode decomposition (EEMD) and a back-propagation (BP) neural network. The wavelet transform was first adopted to denoise the collected vibration signals; EEMD was then utilized to decompose the denoised signal and extract energy characteristic parameters from the selected intrinsic mode functions (IMFs). These features were finally normalized and fed into the BP neural network for gearbox fault diagnosis. Chang et al. [6] proposed a fault diagnosis method for rotating machinery based on shaft orbit analysis and fractal theory, which extracted the shaft orbit from vibration signals and then used fractal theory to extract features as the input of a BP neural network for fault diagnosis. Tian et al. [7] proposed a bearing fault diagnosis method based on manifold dynamic time warping (DTW), which measures the similarity between test and template samples; compared with the traditional DTW-based method, it replaces the Euclidean distance (ED)-based similarity with manifold similarity.

In addition, shallow machine learning methods are widely used in signal denoising, dimensionality reduction, and feature extraction. Widodo et al. [8] used principal component analysis (PCA), independent component analysis (ICA), kernel principal component analysis (KPCA), and kernel independent component analysis (KICA) to extract features from acoustic emission and vibration signals, and constructed classifiers with the relevance vector machine (RVM) and support vector machine (SVM), respectively; these feature extractors and classifiers were tested on a six-class bearing fault diagnosis task, and the diagnostic performance of the different combinations was compared.


Zarei et al. [9] trained a neural network on normal bearing data to establish a filter for removing non-bearing fault components (RNFC). The filtered signal was subtracted from the original signal to remove the non-bearing fault components; time-domain features were then extracted from the residual signal and fed into another neural network for bearing health state recognition in induction motors. Jiang et al. [10] extracted 29 commonly used features from the vibration signal and selected useful ones with nuisance attribute projection (NAP); an HMM was then used to process the selected features for bearing degradation assessment. Yu [11] extracted 14 time-domain and 5 time-frequency-domain features from the vibration signals and reduced the feature dimension with PCA. A series of historical HMMs was established adaptively, and the health state of the bearing was evaluated by the overlap rate between the historical HMMs and the current one.

To improve the generalization ability of machine learning models, ensemble learning (EL), which completes learning tasks by constructing multiple learners, is also widely used in the IFDPHA of complex electro-mechanical systems. For example, Khazaee et al. [12] fused vibration and acoustic data according to Dempster-Shafer theory and proposed an effective EL-based fault diagnosis method for planetary gearboxes. First, the vibration and acoustic signals were transformed from the time domain to the time-frequency domain by wavelet transform, and time-frequency features were extracted as network inputs. Two neural network classifiers were then constructed, with the vibration and acoustic features fed into separate networks. Finally, the outputs of the two networks were fused to obtain the final classification result. Wang et al. [13] proposed particle swarm optimization-based selective ensemble learning (PSOSEN) for the fault diagnosis of rotating machinery. First, time- and frequency-domain features were extracted from vibration signals, and a series of probabilistic neural networks (PNNs) was trained. An adaptive particle swarm optimization (APSO) algorithm then selected the networks suitable for fault diagnosis from these PNNs, and singular value decomposition (SVD) was used to obtain the best weighting vector for their outputs. The final diagnosis result is the inner product of the PNN output vectors and the best weighting vector.

Shallow machine learning methods have simple structures, require few parameters to be trained, and consume little computation. Therefore, when training samples or computing power are insufficient, IFDPHA models with good accuracy and generalization ability can be established quickly and effectively by shallow machine learning. However, because of their simple structures and limited feature extraction capability, manual feature extraction and selection are usually required.
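The general ensemble idea behind methods like those above can be sketched in a few lines (ours, on toy data): several heterogeneous base learners are trained on the same features, and their class-probability outputs are averaged. This illustrates plain soft voting only, not the cited PSOSEN or Dempster-Shafer fusion schemes.

```python
# Plain soft-voting ensemble for fault classification (illustrative sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 8))   # toy feature vectors
y = rng.integers(0, 4, 120)     # toy fault-mode labels

ensemble = VotingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)),
                ("rf", RandomForestClassifier(n_estimators=50))],
    voting="soft",              # average the predicted class probabilities
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```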


1.4.2 Deep Learning-Based Methods

With the rise of deep learning techniques and the rapid development of computing facilities, intelligent diagnosis, prognosis, and health assessment methods based on deep learning are emerging in the field of fault diagnosis. For example, Shao et al. [14] extracted time-domain features from vibration signals as the input of a deep neural network (DNN), which was trained by a particle swarm-based optimization algorithm. Qi et al. [15] used ensemble empirical mode decomposition and an autoregressive model to extract features from vibration signals; the extracted features were the input of a stacked sparse autoencoder network for the fault diagnosis of rotating machinery. Chen and Li [16] extracted time- and frequency-domain features from vibration signals collected by different sensors and fed these features into separate two-layer autoencoder networks for further feature extraction; finally, the outputs of all autoencoder networks were concatenated as the input of a deep belief network for bearing fault diagnosis. Guo et al. [17] proposed a bearing remaining useful life (RUL) prediction method based on a long short-term memory recurrent neural network (LSTM-RNN). Six proposed similarity-based features were first combined with eight traditional time-frequency-domain features to form the original feature space, and monotonicity and correlation measures were then used for feature selection. Finally, the selected features were used as the input of the LSTM-RNN to predict the RUL of the bearing.

It can be concluded that in early deep learning-based diagnosis models, the input features were manually extracted, and the deep network served only as the classifier that identifies fault modes or evaluates health status. To make full use of the feature learning ability of deep learning algorithms and extract fault features automatically, scholars have carried out further research. Heydarzadeh et al. [18] preprocessed the collected vibration acceleration, torque, and acoustic signals by wavelet transform, and used the wavelet coefficients to train three different DNNs for gear fault diagnosis; experimental results showed that all three kinds of signals enable effective gear fault diagnosis with the proposed method. Janssens et al. [19] proposed a fault diagnosis method for rotating machinery based on a convolutional neural network (CNN), in which the spectra of vibration signals collected at two different locations were obtained by the Fourier transform; the two spectra were placed in the same two-dimensional matrix, which served as the input of the CNN for rotating machinery fault diagnosis. Zhang [20] used a sparse autoencoder network to fuse the vibration signals collected by multiple sensors and evaluated the operating status of the equipment using the squared prediction error (SPE). Guo et al. [21] proposed a CNN-based method with adaptive learning rate adjustment to identify the type and severity of bearing faults: the vibration signal was first input into one convolutional neural network to identify the fault type and then into a second to identify the fault degree. To address bearing fault diagnosis under load fluctuation and in noisy environments, Lu et al. [22] exploited the powerful feature extraction capability of deep learning by using the original signal directly as the input of a stacked denoising autoencoder (SDA); the extracted features were then used as the input of a classifier to diagnose bearing faults, and different feature extraction methods were compared and analyzed.


Jing et al. [23] proposed a fault diagnosis method for planetary gearboxes based on data fusion and CNN: standardized vibration, sound, current, and speed signals were used as the input of a deep CNN, which extracted features, selected the degree of data fusion, and realized the fault diagnosis of the planetary gearbox. Shao et al. [24] proposed a deep autoencoder network for the fault diagnosis of gearboxes and electric locomotive bearings. The method used raw data as the input of the deep autoencoder network, optimized the key parameters with an artificial fish swarm algorithm rather than by manual adjustment, and introduced a new loss function based on maximum cross-entropy to avoid the negative impact of noise. Zhang et al. [25] proposed a bearing fault diagnosis method combining a one-dimensional CNN and EL, which realized an end-to-end mapping from the original data to the fault state and avoided the uncertainty introduced by artificial feature extraction.

DL-based methods automatically extract and select appropriate features from the original data for IFDPHA, which overcomes the shortcomings of manual feature extraction and enhances the intelligence of the methods. Effective fault diagnosis, prognosis, and health assessment models, which hardly rely on expert knowledge and place low demands on users, can be established when the training samples are sufficient and the computing power is strong enough. However, the complex structures and large-scale training parameters of DNNs usually require large numbers of training samples and long training times; DL-based methods cannot meet actual needs when the training samples are insufficient, since overfitting will occur.

Research on the fault diagnosis, prognosis, and health assessment of complex electro-mechanical systems has achieved great success in industry. In particular, in data-driven intelligent diagnosis, various machine learning algorithms have been widely applied to the fault diagnosis, prognosis, and health assessment of mechanical systems, and research on deep learning continues to deepen.
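In the spirit of the end-to-end methods above, the following sketch (ours, in PyTorch) shows a one-dimensional CNN that maps a raw vibration segment directly to a fault class; the layer sizes and segment length are illustrative assumptions, not taken from any cited paper.

```python
# Minimal 1-D CNN for end-to-end fault classification (illustrative sketch).
import torch
import torch.nn as nn

class Conv1DDiagnosis(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # global average pooling over time
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                   # x: (batch, 1, signal_length)
        return self.classifier(self.features(x).squeeze(-1))

model = Conv1DDiagnosis()
segment = torch.randn(8, 1, 2048)            # a batch of raw vibration segments
logits = model(segment)                      # shape: (8, n_classes)
```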

1.5 Organization and Characteristics of the Book

From the perspective of machine learning, this book details applications of supervised learning, semi-supervised learning, manifold learning, phase space reconstruction, and related algorithms to fault feature extraction and selection, incipient fault prediction, fault mode classification, and performance degradation evaluation. In addition, the applications of deep learning, one of the current research hotspots of machine learning, to intelligent prognosis and health assessment are explored and analyzed.

Focusing on IFDPHA, this book is divided into seven chapters and structured as follows. Chapter 2 introduces supervised SVM-based algorithms and their applications in machinery fault diagnosis. To fully improve the generalization performance of SVM, problems such as parameter optimization, feature selection, and ensemble-based incremental methods are discussed. The effectiveness of the SVM-based algorithms is validated in several fault diagnosis tasks on electric locomotive rolling bearings, a Bently rotor testbench, and motor bearings.

10

1 Introduction

is validated in several fault diagnosis tasks on electrical locomotive rolling bearings, Bently rotor, and motor bearings testbench. Chapter 3 discusses semi-supervised intelligent fault diagnosis methods. Considering the difficulty of obtaining labeled data for supervised learning and the inadequate generalization ability of traditional unsupervised learning, the thought of semisupervised learning is integrated into mature supervised and unsupervised learning algorithms. Semi-supervised intelligent fault diagnosis methods, such as Kernel Principal Component Analysis (KPCA), fuzzy kernel clustering algorithms, Selforganizing Map (SOM) neural networks, and Relevance Vector Machines (RVM), are introduced. These methods are validated in incipient fault diagnosis of transmissions and bearings and have achieved successful results. Chapter 4 addresses a variety of intelligent fault diagnosis and prognosis methods based on manifold learning, including spectral clustering manifold-based fault feature selection, locally linear embedding (LLE)-based fault recognition, and distance-preserving projection-based fault classification. These methods are applied to the fault diagnosis and prognosis of gearbox gears, rolling bearings, engines, etc., and the effectiveness is verified. Chapter 5 mainly introduces four deep learning-based network models, including convolutional neural network (CNN), deep belief network (DBN), stacked autoencoder (SAE), and recurrent neural network (RNN). With cases of automotive transmission fault diagnosis and tool degradation assessment, this chapter gives detailed descriptions of how to apply deep neural networks (DNNs) for machinery fault diagnosis and equipment degradation assessment, and verifies the effectiveness of DNNs. Chapter 6 analyzes the application prospects of Recurrent Quantitative Analysis (RQA), Kalman Filter (KF), Particle Filter (PF), and other algorithms in fault identification and prognosis based on the space reconstruction theory. Furthermore, this chapter introduces KF-based incipient fault prediction and enhanced PF-based remaining useful life (RUL) prediction methods. Experiments show the great performance of the proposed algorithms in multi-parameter identification of bearing faults, degradation tracking, and RUL prediction of transmission systems. Chapter 7 proposes an operation reliability assessment method that realizes health monitoring-based reliability assessment and condition-based maintenance for complex electro-mechanical systems, such as the turbine generator set, compressor gearbox, and aero-engine rotor.

References

1. Zhong, J.: Coupling Design Theory and Methods of Complex Electromechanical Systems (in Chinese). China Machine Press, Beijing (2007)
2. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
3. Lei, Y., Lin, J., He, Z., et al.: A method based on multi-sensor data fusion for fault detection of planetary gearboxes. Sensors 12(2), 2005–2017 (2012)
4. Unal, M., Onat, M., Demetgul, M., et al.: Fault diagnosis of rolling bearings using a genetic algorithm optimized neural network. Measurement 58, 187–196 (2014)
5. You, Z., Wang, N., Li, M., et al.: Method of fan fault diagnosis of gearbox based on EEMD and BP neural network (in Chinese). J. Northeast Dianli Univ. 35(01), 64–72 (2015)
6. Chang, H.C., Lin, S.C., Kuo, C.C., et al.: Using neural network based on the shaft orbit feature for online rotating machinery fault diagnosis. In: Proceedings of the 2016 IEEE International Conference on System Science and Engineering (ICSSE), 07–09 July 2016
7. Tian, Y., Wang, Z., Lu, C.: Self-adaptive bearing fault diagnosis based on permutation entropy and manifold-based dynamic time warping. Mech. Syst. Signal Process. 114, 658–673 (2019)
8. Widodo, A., Kim, E.Y., Son, J.D., et al.: Fault diagnosis of low speed bearing based on relevance vector machine and support vector machine. Expert Syst. Appl. 36(3), 7252–7261 (2009)
9. Zarei, J., Tajeddini, M.A., Karimi, H.R.: Vibration analysis for bearing fault detection and classification using an intelligent filter. Mechatronics 24(2), 151–157 (2014)
10. Jiang, H., Chen, J., Dong, G.: Hidden Markov model and nuisance attribute projection based bearing performance degradation assessment. Mech. Syst. Signal Process. 72, 184–205 (2016)
11. Yu, J.: Adaptive hidden Markov model-based online learning framework for bearing faulty detection and performance degradation monitoring. Mech. Syst. Signal Process. 83, 149–162 (2017)
12. Khazaee, M., Ahmadi, H., Omid, M., et al.: Classifier fusion of vibration and acoustic signals for fault diagnosis and classification of planetary gears based on Dempster–Shafer evidence theory. Proc. Inst. Mech. Eng. Part E J. Process Mech. Eng. 228(1), 21–32 (2014)
13. Wang, Z.Y., Lu, C., Zhou, B.: Fault diagnosis for rotary machinery with selective ensemble neural networks. Mech. Syst. Signal Process. 113, 112–130 (2018)
14. Shao, H., Jiang, H., Zhang, X., et al.: Rolling bearing fault diagnosis using an optimization deep belief network. Meas. Sci. Technol. 26(11), 115002 (2015)
15. Qi, Y., Shen, C., Wang, D., et al.: Stacked sparse autoencoder-based deep network for fault diagnosis of rotating machinery. IEEE Access 5, 15066–15079 (2017)
16. Chen, Z., Li, W.: Multisensor feature fusion for bearing fault diagnosis using sparse autoencoder and deep belief network. IEEE Trans. Instrum. Meas. 66(7), 1693–1702 (2017)
17. Guo, L., Li, N., Lei, Y., et al.: A recurrent neural network based health indicator for remaining useful life prediction of bearings. Neurocomputing 240, 98–109 (2017)
18. Heydarzadeh, M., Kia, S.H., Nourani, M., et al.: Gear fault diagnosis using discrete wavelet transform and deep neural networks. In: Proceedings of the 42nd Annual Conference of the IEEE Industrial Electronics Society (IECON), Florence, Italy, 23–26 Oct 2016
19. Janssens, O., Slavkovikj, V., Vervisch, B., et al.: Convolutional neural network based fault detection for rotating machinery. J. Sound Vib. 377, 331–345 (2016)
20. Zhang, S.: Bearing condition dynamic monitoring based on multi-way sparse autocoder (in Chinese). J. Vib. Shock 35(19), 125–131 (2016)
21. Guo, X., Chen, L., Shen, C.: Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis. Measurement 93, 490–502 (2016)
22. Lu, C., Wang, Z.Y., Qin, W.L., et al.: Fault diagnosis of rotary machinery components using a stacked denoising autoencoder-based health state identification. Signal Process. 130, 377–388 (2017)
23. Jing, L., Wang, T., Zhao, M., et al.: An adaptive multi-sensor data fusion method based on deep convolutional neural networks for fault diagnosis of planetary gearbox. Sensors 17(2), 414 (2017)
24. Shao, H., Jiang, H., Zhao, H., et al.: A novel deep autoencoder feature learning method for rotating machinery fault diagnosis. Mech. Syst. Signal Process. 95, 187–204 (2017)
25. Zhang, W., Li, C., Peng, G., et al.: A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load. Mech. Syst. Signal Process. 100, 439–453 (2018)

Chapter 2

Supervised SVM Based Intelligent Fault Diagnosis Methods

2.1 The Theory of Supervised Learning

An important aspect of human intelligence is the ability to learn from examples: to predict facts that cannot be directly observed by analyzing known facts and summarizing their patterns [1]. In this kind of learning, it is important to be able to draw inferences, that is, to use the rules learned from sample data not only to better explain known examples but also to make correct predictions and judgments about future or unobservable phenomena. Usually, this kind of learning ability is called generalization ability. With the advent of the information age, data and information fill every corner of human production and life, yet the human ability to process and use these data is very limited and has not reached the point of deeply mining the laws hidden in them. In the research of machine intelligence, we hope to simulate this generalization ability with machines (computers): by studying sample data, a machine should discover the latent laws hidden in the data and make predictions about future data or unobservable phenomena.

Statistical reasoning theory infers a functional dependency from an empirical sample of data and plays a fundamental role in solving machine intelligence problems [1]. There are two kinds of statistical reasoning theory: traditional statistical theory, which studies the asymptotic behavior of infinitely large samples, and statistical learning theory, which operates under the condition of finite samples. Traditional statistical theory studies the statistical properties of large samples under the asymptotic premise that the sample size tends to infinity: the form of the parameters is assumed known, and empirical samples are used to estimate the parameter values. This is the most basic and commonly used analytical method in the absence of a theoretical physical model. In practical applications, however, the premise that the sample size tends to infinity is difficult to satisfy, so traditional statistical theory can hardly achieve ideal results when the sample data are limited. Statistical learning theory was developed by Vapnik of AT&T Bell Labs, starting from work in the 1960s; after three decades of painstaking research, it matured in the mid-1990s into a new theory of finite-sample statistics and learning methods, making up for the deficiency of traditional statistical theory. Statistical reasoning under this theoretical system not only considers the requirement of asymptotic performance, but also seeks the optimal solution under the currently available, limited information. Because it treats the finite-sample case more systematically and is more practical than traditional statistical theory, it has been widely recognized by the machine learning community worldwide [1]. The support vector machine, developed within this rigorous theoretical framework, uses kernel methods and mathematical optimization to transform real problems into a high-dimensional feature space through nonlinear transformation: a linear decision function is constructed in the high-dimensional space in order to realize a nonlinear decision function in the original space, which skillfully solves the dimensionality problem. The structural risk minimization principle is used to find a compromise between approximation accuracy and the complexity of the approximating function, which guarantees generalization ability.

Machine learning is not only a core research field of artificial intelligence, but also one of the most active and promising areas in computer science, and it plays an increasingly important role in human production and life. In recent years, many countries in Europe and elsewhere have devoted themselves to research on machine learning theory and applications, and companies such as GE, Intel, IBM, Microsoft, and Boeing are also active in this field. Supervised learning is a machine learning task that learns pattern recognition or regression knowledge from labeled training data. This chapter mainly introduces supervised learning-based support vector machine intelligent diagnosis methods and their applications.

2.1.1 The General Model of Supervised Learning

A general model of a supervised learning problem consists of three components, as shown in Fig. 2.1: a data (instance) generator G, a target operator S, and a learning machine LM.

Fig. 2.1 A general model learned from examples

(1) Data (instance) generator G. The generator is the source that determines the environment in which the trainer and the learning machine work: it generates random vectors x ∈ R^n independently and identically distributed according to some unknown (but fixed) probability distribution function F(x).

(2) Target operator S (sometimes called the trainer operator, or simply the trainer). The target operator S returns an output value y for each input vector x according to a likewise fixed but unknown conditional distribution function F(y|x).

(3) Learning machine LM, which implements a set of functions f(x, α), α ∈ Λ (where α is a vector of real parameters and Λ is the set of such parameter vectors) and produces an output value ỹ that approximates the value y generated by the target operator S.

The problem of supervised learning is to select, from the function set f(x, α), α ∈ Λ, the function that best approximates the trainer's response y. This selection is based on a training set of l independent identically distributed observations drawn from the joint distribution F(x, y) = F(x)F(y|x):

(x_1, y_1), (x_2, y_2), …, (x_l, y_l)    (2.1)

In the supervised learning process, the learning machine observes the pairs (x_i, y_i), i = 1, 2, …, l, and constructs an operator to predict the trainer's response y_i on a specific vector x_i produced by the generator G. The goal of the learning machine is to construct an appropriate approximation so that, after training, it returns a value ỹ very close to the trainer's response y for any given x.

2.1.2 Risk Minimization Problem

In order to obtain the best approximation to the trainer's response, we measure the loss, or discrepancy, L(y, f(x, α)) between the trainer's response y and the response f(x, α) given by the learning machine for a given input x, and consider the mathematical expectation of the loss:

R(α) = ∫ L(y, f(x, α)) dF(x, y)    (2.2)

where R(α) is the risk functional, α ∈ Λ is a vector of real parameters, and Λ is the parameter set. The goal of machine learning is to find the function f(x, α_0) that minimizes the risk functional R(α) over the function set f(x, α), α ∈ Λ, when the joint probability distribution function F(x, y) is unknown but the training set (x_1, y_1), (x_2, y_2), …, (x_l, y_l) is given.
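Since F(x, y) is unknown, in practice the risk functional (2.2) is approximated by its empirical mean over the training set. The following minimal sketch, with invented names and a Gaussian-noise target as illustrative assumptions, instantiates the three components G, S, and LM and minimizes the empirical counterpart of (2.2) over a small function family:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_G(l):
    """G: draw l i.i.d. input vectors x from a fixed (here uniform) F(x)."""
    return rng.uniform(-1.0, 1.0, size=(l, 1))

def target_S(x):
    """S: return y for each x according to a fixed but unknown F(y|x)."""
    return np.sin(3.0 * x) + 0.1 * rng.normal(size=x.shape)

def f(x, alpha):
    """Learning machine's function set f(x, alpha): degree-3 polynomials."""
    return alpha[0] + alpha[1] * x + alpha[2] * x**2 + alpha[3] * x**3

def empirical_risk(alpha, x, y):
    """Finite-sample stand-in for the risk functional (2.2), squared loss."""
    return np.mean((y - f(x, alpha)) ** 2)

x_train = generator_G(200)
y_train = target_S(x_train)

# Choose alpha minimizing the empirical risk (least squares over the family).
X = np.hstack([x_train**k for k in range(4)])
alpha_hat, *_ = np.linalg.lstsq(X, y_train, rcond=None)
print("empirical risk:", empirical_risk(alpha_hat.ravel(), x_train, y_train))
```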


2.1.3 Primary Learning Problem

There are three main learning problems in supervised learning: pattern recognition, regression estimation, and probability density estimation. They are formulated as follows.

2.1.3.1 Pattern Recognition

Let the output y of the trainer take only two values, y ∈ {0, 1}, and let f(x, α), α ∈ Λ be a set of indicator functions (functions taking only the values 0 or 1). Consider the following loss function:

L(y, f(x, α)) = 0 if y = f(x, α), and 1 if y ≠ f(x, α)    (2.3)

For this loss function, the functional (2.2) gives the probability that the answer of the indicator function f(x, α) differs from the trainer's output; such a disagreement is called a classification error. The learning problem thus becomes one of finding the function that minimizes the probability of classification error when the probability measure F(x, y) is unknown but the data (2.1) are given.

2.1.3.2 Regression Estimation

Let the output y of the trainer be a real value, and let f(x, α), α ∈ Λ be a set of real functions that contains the regression function

f(x, α_0) = ∫ y dF(y|x)    (2.4)

The regression function is the function that minimizes the functional (2.2) under the loss function

L(y, f(x, α)) = (y − f(x, α))²    (2.5)

In this way, the problem of regression estimation is to minimize the risk functional (2.2) with (2.5) as the loss function when the probability measure F(x, y) is unknown but the data (2.1) are known.

2.1.3.3 Density Estimation

For the problem of estimating a density function from the set p(x, α), α ∈ Λ of density functions, consider the following loss function:

L(p(x, α)) = − log p(x, α)    (2.6)

Therefore, the problem of estimating the density function from data is to minimize the risk functional (2.2) with (2.6) as the loss function when the corresponding probability measure F(x) is unknown and the independent, identically distributed data (2.1) are given.
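For reference, the three canonical loss functions translate directly into code; this minimal sketch (the function names are mine) computes the empirical averages that stand in for (2.2) under the losses (2.3), (2.5), and (2.6):

```python
import numpy as np

def zero_one_loss(y, f_x):
    """0-1 loss (2.3): counts a classification error when y != f(x, alpha)."""
    return np.mean(y != f_x)

def squared_loss(y, f_x):
    """Squared loss (2.5) used for regression estimation."""
    return np.mean((y - f_x) ** 2)

def neg_log_likelihood(p_x):
    """Loss (2.6): average of -log p(x, alpha) over the observed sample."""
    return -np.mean(np.log(p_x))
```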

2.2 Support Vector Machine

Statistical learning theory is a theory of statistical estimation and prediction from finite samples. It adopts the principle of structural risk minimization to strike a compromise between the empirical risk and the confidence interval, so as to minimize the actual risk. How to construct learning machines that realize the structural risk minimization principle, however, is one of the key problems in statistical learning theory. The support vector machine (SVM) [2] is a powerful tool developed within statistical learning theory to realize the principle of structural risk minimization. It does so mainly by keeping the empirical risk fixed and minimizing the confidence interval, which makes it well suited to small-sample learning. In 2001, the American magazine Science described the support vector machine as "a very popular method and successful example in the field of machine learning, and a very impressive development direction" [3].

The support vector machine integrates techniques such as the maximum-margin hyperplane, Mercer kernels, convex quadratic programming, sparse solutions, and slack variables, and has the following remarkable characteristics:

(1) Based on statistical learning theory and the inductive principle of structural risk minimization, the support vector machine seeks a learning machine that minimizes the sum of the empirical risk and the confidence interval. Compared with traditional learning machines based on the inductive principle of empirical risk minimization, the support vector machine is better adapted to small samples and can obtain the global optimal solution under the condition of limited information, rather than the optimal solution as the sample size tends to infinity.

(2) The support vector machine algorithm transforms learning into a quadratic optimization problem through duality theory, which guarantees that the global optimal solution is obtained and overcomes the tendency of neural networks and other methods to fall into local extrema.

(3) The support vector machine uses the kernel method to implicitly map the training data from the original space to a high-dimensional space and constructs a linear discriminant function there to realize a nonlinear discriminant function in the original space. The complexity of the algorithm is independent of the sample dimension, which solves the dimensionality problem and overcomes the traditional curse of dimensionality.

(4) The complexity of the machine obtained by the SVM depends on the number of support vectors rather than on the dimension of the transformed space, so the phenomenon of "over-learning" is avoided.

2.2.1 Linear Support Vector Machine

The support vector machine is developed from the optimal classification surface in the linearly separable case, and its basic idea can be illustrated in the two-dimensional plane shown below. In Fig. 2.2, the stars and the dots represent two classes of samples; the thick solid line in the middle is the optimal classification hyperplane, and the two adjacent lines are those parallel to the classification hyperplane that pass through the samples closest to it. The distance between them is the classification interval (margin).

Fig. 2.2 The optimal classification hyperplane that separates data at maximum intervals

The optimal classification hyperplane is required not only to separate the two classes correctly, that is, with a training error rate of 0, but also to maximize the classification interval. Maximizing the margin between the categories is in fact a way of controlling generalization ability, which is one of the core ideas of the support vector machine. The linear support vector machine applies only to linearly separable samples; for the many practical problems that are not linearly separable, a way must be found to transform the original linearly inseparable problem into a simpler linearly separable one. See [2] for the algorithm.
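As an illustration of the maximum-margin idea, the sketch below fits a nearly hard-margin linear SVM to two separable point clouds and reports the resulting classification interval. The data and the large value of C are illustrative assumptions, and scikit-learn is used purely for convenience:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds standing in for the stars and dots of Fig. 2.2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# A very large C approximates the hard-margin optimal classification hyperplane.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]      # hyperplane: w.x + b = 0
print("number of support vectors:", len(clf.support_vectors_))
print("classification interval (margin width):", 2.0 / np.linalg.norm(w))
```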


2.2.2 Nonlinear Support Vector Machine

The basic idea of the nonlinear support vector machine is to map the input vectors x into a high-dimensional space through a nonlinear transformation and then find the optimal classification surface in that space. Such a transformation is in general complex and difficult to realize directly. Note, however, that finding the optimal classification surface involves only inner-product operations between training samples; that is, only inner products are needed in the high-dimensional space, and these inner products can be computed by functions defined in the original space, without even knowing the form of the transformation. According to functional theory, as long as a kernel function satisfies the Mercer condition, it corresponds to an inner product in some transformed space. See [4] for the algorithm.

2.2.3 Kernel Function

The basic principles of kernel space theory are shown in Fig. 2.3. For a classification problem P, let X denote the classification sample set, X ⊂ R, where R is called the input space or measurement space. In this space, P is a nonlinear or linearly inseparable problem (Fig. 2.3a). By finding an appropriate nonlinear mapping function φ(x), the sample set X in the input space can be mapped to a high-dimensional space F in which the classification problem P becomes linearly separable (Fig. 2.3b). In essence this is the same as the optimal classification hyperplane of the support vector machine (Fig. 2.2). F is called the feature space and can have arbitrarily large, even infinite, dimension.

Fig. 2.3 Basic principles of kernel space theory

Using kernels is an attractive approach to computation: the feature space can be defined implicitly through kernel functions, so that the explicit mapping can be avoided both in computing inner products and in designing the support vector machine. Using

different kernel functions and their corresponding Hilbert spaces is equivalent to using different criteria to evaluate the similarity of data samples. To construct different types of support vector machines, one needs different kernels that satisfy the Mercer theorem, so it is very important to construct kernel functions that reflect the properties of the approximating functions. An important feature of the support vector algorithm is attributed to the Mercer condition on the kernel, which makes the corresponding optimization problem convex, thereby ensuring that the optimal solution is global.

First, consider a finite input space X = {x_1, …, x_n} and assume that K(x, z) is a symmetric function on X. Consider the matrix K = (K(x_i, x_j))_{i,j=1}^n. Since it is symmetric, there must exist an orthogonal matrix V such that K = V Λ V′, where Λ is the diagonal matrix of eigenvalues of K, and each eigenvalue λ_t corresponds to the eigenvector v_t = (υ_ti)_{i=1}^n, the t-th column of V. Now, assuming that all eigenvalues are non-negative, consider the feature map:

φ : x_i ↦ (√λ_t υ_ti)_{t=1}^n ∈ R^n, i = 1, …, n    (2.7)

Then

⟨φ(x_i) · φ(x_j)⟩ = Σ_{t=1}^n λ_t υ_ti υ_tj = (V Λ V′)_ij = K_ij = K(x_i, x_j)    (2.8)

This means that K(x, z) is indeed the kernel corresponding to the feature map φ. The condition that the eigenvalues of K are non-negative is necessary: if there were a negative eigenvalue λ_s with eigenvector v_s, then the point in the feature space

z = Σ_{i=1}^n υ_si φ(x_i) = √Λ V′ v_s    (2.9)

would have squared norm

‖z‖² = ⟨z · z⟩ = v_s′ V √Λ √Λ V′ v_s = v_s′ V Λ V′ v_s = v_s′ K v_s = λ_s < 0    (2.10)

which contradicts the geometry of the space. This yields the following Mercer theorem, which characterizes when a function is a kernel function.

Mercer's theorem: Let X be a finite input space and K(x, z) a symmetric function on X. Then K(x, z) is a kernel if and only if the matrix

K = (K(x_i, x_j))_{i,j=1}^n    (2.11)

is positive semidefinite (i.e., has no negative eigenvalues).

We can generalize the inner product in the Hilbert space by introducing a weight λ_i for each feature, obtaining

⟨φ(x) · φ(z)⟩ = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(z) = K(x, z)    (2.12)

The feature vector then becomes

φ(x) = (φ_1(x), φ_2(x), …, φ_n(x), …)    (2.13)

Mercer's theorem gives an admissible representation of a continuous symmetric function K(x, z):

K(x, z) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(z)    (2.14)

where the λ_i are non-negative. This is equivalent to K(x, z) being an inner product in a feature space F ⊇ φ(X), where F is the l₂ space of all sequences

ψ = (ψ_1, ψ_2, …, ψ_i, …)    (2.15)

for which Σ_{i=1}^∞ λ_i ψ_i² < ∞. The kernel implicitly defines a space spanned by the feature vectors, and the decision function of the support vector machine is then expressed as

f(x) = Σ_{j=1}^l α_j y_j K(x, x_j) + b    (2.16)

The four most common kernels are:

(1) Polynomial kernel function

K(x_i, x_j) = [(x_i · x_j) + R]^d    (2.17)

where R is a constant and d is the order of the polynomial.

(2) Exponential radial basis function

K(x_i, x_j) = exp(−‖x_i − x_j‖ / (2σ²))    (2.18)

where σ is the width of the exponential radial basis kernel.

(3) Gaussian radial basis function

K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))    (2.19)

where σ is the width of the Gaussian radial basis kernel.

(4) Sigmoid kernel function

K(x_i, x_j) = tanh[υ(x_i · x_j) + θ]    (2.20)

where υ > 0 and θ < 0 are the sigmoid kernel parameters.

Vapnik et al. argue that in the support vector machine, merely changing the kernel changes the type of the learning machine (that is, the type of approximating function) [1], and that the kernel is a key factor in the generalization performance of the support vector machine. By using an appropriate kernel function, the training data can be implicitly mapped to a high-dimensional space in which their regularities can be discovered. So far, however, the roles and scopes of application of the various kernel functions have been neither rigorously proved nor generally agreed upon.
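As a quick check of these definitions, the sketch below implements three of the kernels and tests the finite-sample Mercer condition of Eq. (2.11) by inspecting the eigenvalues of the Gram matrix. The function names, parameter defaults, and sample data are illustrative assumptions:

```python
import numpy as np

def polynomial_kernel(xi, xj, R=1.0, d=3):        # Eq. (2.17)
    return (np.dot(xi, xj) + R) ** d

def gaussian_rbf_kernel(xi, xj, sigma=1.0):       # Eq. (2.19)
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, v=0.5, theta=-1.0):    # Eq. (2.20)
    return np.tanh(v * np.dot(xi, xj) + theta)

def is_mercer_on_sample(kernel, X, tol=1e-10):
    """Finite-sample Mercer check (2.11): the Gram matrix must be
    positive semidefinite, i.e. have no eigenvalue below -tol."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(0).normal(size=(30, 4))
print(is_mercer_on_sample(gaussian_rbf_kernel, X))   # True: the RBF kernel is PSD
print(is_mercer_on_sample(sigmoid_kernel, X))        # may be False: the sigmoid
                                                     # kernel is not Mercer for
                                                     # all parameter choices
```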

2.2.4 The Applications of SVM in Machinery Fault Diagnosis

The fault diagnosis of mechanical equipment is a typical small-sample problem, and traditional machine learning methods cannot obtain good generalization performance on small samples; the lack of sufficient learning samples has therefore always been a bottleneck in intelligent mechanical fault diagnosis. Statistical learning theory studies the statistical laws of, and learning methods for, limited sample data, making up for the deficiency of traditional statistical theory, and the support vector machine method developed within this rigorous theoretical system is well suited to small-sample fault diagnosis. At present, support vector machines are widely used in mechanical condition monitoring and fault diagnosis of bearings, gears, motors, and so on, as shown in Table 2.1.

Table 2.1 Application of support vector machine in mechanical fault diagnosis

Diagnostic object | Equipment status | Method | Author
Bearing | Normal, outer ring fault, inner ring fault | Empirical mode decomposition + support vector machine | Yang et al. [5]
Gear | Normal, cracked, broken teeth | Empirical mode decomposition + support vector machine | Cheng et al. [6]
Gear | Normal, crack, broken tooth, tooth surface wear | Morlet wavelet + support vector machine | Saravanan et al. [7]
Motor | Rotor bar broken, rotor cage end ring broken, coil short circuit, stator winding coil short circuit | Welch method + support vector machine | Poyhonen et al. [8]
Pump | Structural resonance, rotor radial contact friction, rotor axial contact friction, shaft crack, gear breakage, bearing breakage, blade breakage, rotor eccentricity, shaft bending, main body connection loosening, bearing loosening, rotor partial loosening, air pressure pulsation, cavitation phenomenon | Principal component analysis + support vector machine | Chu and Yuan [9]
Aero-engine | Bearing failure, oil failure, reducer failure | Support vector machine | Sun et al. [10]
Diesel engine | Normal, fuel injector nozzle enlarged, fuel injector nozzle plugged, fuel pump plugged, fuel system leakage, fuel mixed with impurities | Support vector machine | Zhu and Liu [11]

Although the support vector machine has made great progress in mechanical fault diagnosis applications, it has not yet achieved ideal results in practice, for three main reasons:

(1) Optimization of support vector machine parameters. Parameter optimization, also known as parameter selection, has been a key factor restricting the support vector machine from delivering its good generalization performance in engineering practice, since the parameter values directly determine generalization performance. It is therefore a key problem to study the mechanism by which the SVM parameters affect generalization performance and to propose an effective parameter optimization method for SVMs in mechanical fault diagnosis applications.

(2) Fault feature selection. In engineering practice, once the kernel and algorithm structure of the support vector machine are determined, the two key factors restricting its generalization performance are how to select the optimal parameters and how to select the features relevant to the sample attributes for learning. Most current research analyzes feature selection and parameter optimization separately, without considering their joint influence on generalization performance; attending to one while neglecting the other inevitably restricts the generalization performance of the support vector

machine. Therefore, how to optimize feature selection and parameters simultaneously, improving the generalization ability of the support vector machine by obtaining mutually matched optimal features and parameters, is another problem faced in mechanical fault diagnosis applications.

(3) Improvement of the support vector machine algorithm structure. Establishing an SVM-based intelligent fault diagnosis model requires an algorithm structure with good generalization performance, that is, the ability to predict future samples and unknown events. The generalization performance of an SVM classifier (or predictor) depends mainly on the availability of limited samples and on how adequately the algorithm mines the sample information. However, collecting representative training samples in engineering practice is expensive and time-consuming, and it is not easy to accumulate a sufficient number of training samples within a given period. How to fully mine and exploit the knowledge and laws contained in limited sample information, improving the generalization ability of the support vector machine through both its theory and its algorithmic construction, is a problem that needs further study in mechanical fault diagnosis applications.

Mechanical fault diagnosis technology can improve the reliability and uptime of mechanical products, but currently available fault diagnosis techniques usually demand substantial professional knowledge and accumulated experience. In practical applications, technicians on the production line without professional training or expertise find it difficult to analyze the monitored data, let alone make accurate diagnoses. In addition, because of heavy noise interference in the industrial field, the fault conclusions obtained by conventional diagnosis methods are sometimes insufficient, so intelligent diagnosis technology needs to be developed.

2.3 The Parameters Optimization Method for SVM

The performance of a support vector machine mainly refers to its ability, established by learning the attributes of the training samples, to predict unknown samples, usually called generalization ability. The parameter values of the support vector machine profoundly affect this generalization performance. In 2002, Chapelle, Vapnik, et al. published an article in the internationally renowned academic journal Machine Learning describing the important influence of the SVM parameters (the penalty factor C and the kernel parameters) [12]: the penalty factor C determines the compromise between maximizing the margin and minimizing the error, while the kernel parameters (such as the order d of the polynomial kernel, the exponential radial basis kernel width σ, the Gaussian radial basis kernel width σ, and the sigmoid kernel parameters υ and θ) determine the nonlinear mapping characteristics and the distribution of the samples mapped from the input space to the high-dimensional feature space. If the selected parameter values are inappropriate, the predictive performance of a binary SVM can even fall below the 50% error probability of random guessing. Parameter optimization is therefore an indispensable step for achieving good SVM generalization performance; it has long been a focus of scholars at home and abroad, and a key factor restricting the excellent generalization performance of the support vector machine in engineering practice.

The SVM parameter optimization methods proposed at home and abroad can be classified into four categories: the trial and error procedure, the cross validation method, generalization error estimation with gradient descent, and artificial intelligence and evolutionary algorithms.

The trial and error procedure, also known as the "cut-and-try" method, is used when the user has little or no prior knowledge: a limited number of parameter values are tested, and the parameters with the minimum test error are retained as the optimal ones. Although it is widely used in practice because of its simplicity, the trial and error procedure explores the SVM parameter space insufficiently, so the "optimal" parameters it selects are neither rigorous nor convincing.

Another commonly used method is cross validation, which divides the sample data set into k equal subsets, uses k − 1 subsets for training and the remaining subset for testing, and takes the mean of the resulting k test errors as the test error under the current parameter combination. The parameters are then adjusted and the above steps repeated to obtain the test error of the adjusted parameters; when a satisfactory test error is obtained, the corresponding parameter combination is taken as the optimal one. Cross validation is evidently laborious and time-consuming, and it can only explore a limited region of the SVM parameter space.

Generalization error estimation with gradient descent uses an error-bound estimate of SVM generalization performance, combined with gradient descent, to search the parameter space for the optimal parameters. However, since error estimation of SVM generalization performance is a complex mathematical problem, and the method requires that the gradients of the error bounds with respect to the SVM parameters be computable, it is not commonly used in practical engineering applications.

Currently, the optimization of SVM parameters is strongly supported by artificial intelligence techniques and evolutionary algorithms.
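As a concrete illustration of the cross validation method described above, the sketch below grid-searches the penalty factor C and the Gaussian kernel parameter using k-fold cross validation; the dataset, grid, and k = 5 are illustrative assumptions, and note that scikit-learn's gamma corresponds to 1/(2σ²) in Eq. (2.19):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

best_err, best_params = 1.0, None
for C in 2.0 ** np.arange(-5, 6, 2):          # candidate penalty factors
    for gamma in 2.0 ** np.arange(-7, 4, 2):  # candidate kernel parameters
        # k-fold cross validation: train on k-1 folds, test on the held-out
        # fold, and score this (C, gamma) pair by its mean test accuracy.
        acc = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
        if 1.0 - acc < best_err:
            best_err, best_params = 1.0 - acc, (C, gamma)

print("best (C, gamma):", best_params, "CV error:", round(best_err, 4))
```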
The process of natural evolution is in essence an optimization process of survival of the fittest, following remarkable laws and rules; for thousands of years it has inspired human beings to learn from nature and to turn imitations of natural and biological evolution into inventions and practical activities (as shown in Table 2.2).

Table 2.2 Biological revelation and human invention

Creature | Invention | Creature | Invention
Bird | The plane | The tail fin of a fish | Rudder
Bats | Sonar and radar | Pectoral fins of fishes | Oars
Dragonflies | Helicopters | Frogs | Electronic frog eyes
Cobwebs | Fishing nets | Jellyfish | Storm forecasters
Spider silk | New fibers | Flying squirrels | Parachutes
Fireflies | Artificial luminescence | Shells | Tank
Flies | Smell detector | Butterflies | Satellite temperature control system

Scholars at home and abroad have proposed various intelligent and evolutionary algorithms for SVM parameter optimization, such as the covariance matrix adaptation evolution strategy, genetic algorithms, artificial immune algorithms, and particle swarm optimization. This section focuses on the SVM parameter optimization problem. Based on the ant colony optimization algorithm, the mechanism by which the SVM parameters influence the SVM is analyzed, and an ant colony optimization based SVM parameter optimization method is proposed. Then, using international standard databases, the mechanism by which the parameters of the ant colony optimization algorithm influence the SVM parameter optimization process is analyzed, and the feasibility and effectiveness of the proposed method are verified by comparison with other existing methods. Finally, the proposed method is applied to an electric locomotive bearing fault case. This bearing fault diagnosis method does not depend on accurately extracting the bearing fault characteristic frequencies; instead, it simply extracts time-domain and frequency-domain statistical features of the signal and then applies the proposed ant colony optimization based SVM parameter optimization method for fault pattern recognition, so as to highlight the effectiveness of parameter optimization. The results show that this method can improve the generalization performance of the SVM and successfully identify the common single fault modes of electric locomotive bearings.

2.3.1 Ant Colony Optimization

Dorigo et al. proposed the ant colony optimization algorithm in 1991 [13] and elaborated its model and theory in internationally renowned journals such as Nature, laying a solid foundation for the construction of the theoretical system of the ant colony optimization algorithm. Aiming at the SVM parameter optimization problem, this section describes the algorithm design and example analysis of the ant colony optimization algorithm; for the detailed derivation, see [13].

2.3.1.1 Algorithm Design

For continuous-domain optimization problems, ant colony optimization algorithms usually first discretize the continuous domain so that the artificial ants can conveniently move among the abstract points. The flow of the ant colony optimization algorithm is shown in Fig. 2.4 and mainly includes the following five key steps:

(1) Variable initialization: set the number of ants in the colony, the pheromone concentration at the initial moment, the pheromone intensity, and the initial values of other variables.

(2) Discretization of the continuous domain: the value range x_i^lower ≤ x_i ≤ x_i^upper (i = 1, 2, …, n) of each variable to be optimized is discretized into N equal parts, so the interval between adjacent discrete points is

h_i = (x_i^upper − x_i^lower) / N, i = 1, 2, …, n    (2.21)

where x_i^upper and x_i^lower are the upper and lower bounds of the value range of variable x_i, and n is the number of variables.

(3) Pheromone construction and management: let τ_ij be the pheromone concentration on node (i, j), with a constant initial value so that every node has an equal probability of being selected by the ants at the initial moment. In the subsequent search, after all ants complete a traversal, the pheromone on each node is modified according to the pheromone update equation

τ_ij^new = (1 − ρ) τ_ij^old + Q / e^(f_ij)    (2.22)

where τ_ij^new is the current pheromone concentration on node (i, j), ρ is the pheromone volatility coefficient, τ_ij^old is the previous pheromone concentration on node (i, j), Q is the pheromone intensity (usually a constant), e is the mathematical constant, and f_ij is the value of the objective function f on node (i, j).

(4) Calculate the probability of each ant's next target node according to the state transition probability equation, which determines the moving direction of the ants:

P_ij = τ_ij / Σ_{i=1}^N τ_ij    (2.23)

Fig. 2.4 Flow chart of ant colony optimization algorithm

When the number of colony movements N_c is less than the preset value N_c^max, the colony continues to optimize according to the pheromone update equation (2.22) and the state transition probability equation (2.23). When N_c reaches N_c^max, the coordinate m_i (i = 1, 2, …, n) corresponding to the maximum pheromone concentration τ_ij^new on each node is found, and the value range of each variable is narrowed:

x_i^lower ← x_i^lower + (m_i − Δ) ∗ h_i    (2.24)

x_i^upper ← x_i^upper + (m_i + Δ) ∗ h_i    (2.25)

where Δ is a constant. The optimization process then restarts from step (2) above.

(5) Algorithm termination condition: if the maximum interval max(h_i) between the discrete points in step (2) is less than the given precision ε, the optimization stops and outputs the optimal solution

x_i∗ = (x_i^lower + x_i^upper) / 2, i = 1, …, n    (2.26)
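The five steps translate into a compact search loop. The sketch below is a minimal, simplified rendering for a single variable (ant count, move count, ρ, Q, and Δ are illustrative values, and pheromone is deposited per visit rather than strictly once per full traversal):

```python
import numpy as np

def aco_minimize(f, lo, hi, n_ants=30, N=10, n_moves=50, rho=0.5, Q=1.0,
                 delta=1, eps=0.01, rng=np.random.default_rng(0)):
    """Continuous-domain ant colony optimization following steps (1)-(5)."""
    while (hi - lo) / N >= eps:                      # step (5): precision test
        h = (hi - lo) / N                            # step (2): Eq. (2.21)
        nodes = lo + h * np.arange(N + 1)
        tau = np.ones(N + 1)                         # step (1): equal pheromone
        for _ in range(n_moves):
            p = tau / tau.sum()                      # step (4): Eq. (2.23)
            picks = rng.choice(N + 1, size=n_ants, p=p)
            for i in picks:                          # step (3): Eq. (2.22)
                tau[i] = (1 - rho) * tau[i] + Q / np.exp(f(nodes[i]))
        m = int(np.argmax(tau))                      # narrow range: (2.24)-(2.25)
        lo, hi = (lo + max(m - delta, 0) * h,
                  lo + min(m + delta, N) * h)
    return (lo + hi) / 2                             # Eq. (2.26)

x_best = aco_minimize(lambda x: (x - 1) ** 2 + 1, 0.0, 8.0)
print(x_best)   # should land near the theoretical minimizer x* = 1
```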

2.3.1.2 Example Analysis

To verify the effectiveness of the ant colony optimization algorithm in a simple setting, it is first applied to three basic mathematical examples.

Example 1: for the following typical univariate continuous-domain optimization problem

min f(x) = (x − 1)² + 1, x ∈ [0, 8]    (2.27)

This univariate function has the theoretical minimum f(x∗) = 1 at x∗ = 1. Applying the ant colony optimization algorithm, the experimental results shown in Fig. 2.5 indicate that the algorithm attains the optimal objective value f(x̃) = 1.0003 at x̃ = 0.9396, with a computation time of t = 0.2190 s.

Fig. 2.5 Example 1 curve and ant colony optimization algorithm results

Example 2: for the following typical univariate continuous-domain optimization problem

min f(x) = 5 sin(30x) e^(−0.5x) + sin(20x) e^(0.2x) + 6    (2.28)

This univariate function has several local extrema. The optimization result of the ant colony optimization algorithm (shown in Fig. 2.6) is that the function attains the minimum value f(x̃) = 1.2728 at x̃ = 0.5754, with a computation time of t = 0.0940 s.

Fig. 2.6 Example 2 curves and results of ant colony optimization algorithm

Example 3: for the following typical two-variable continuous-domain optimization problem

min f(x_1, x_2) = x_1 e^(−x_1² − x_2²)    (2.29)

The function has the theoretical minimum f(x_1∗, x_2∗) = −0.4289 at x_1∗ = −1/√2, x_2∗ = 0. The optimization result of the ant colony optimization algorithm, shown in Fig. 2.7, is f(x̃_1, x̃_2) = −0.4288 at x̃_1 = −0.7022, x̃_2 = 0.0058, with a computation time of t = 0.3750 s.

Fig. 2.7 Example 3 curves and ant colony optimization algorithm results

2.3.2 Ant Colony Optimization Based Parameters Optimization Method for SVM

2.3.2.1 Support Vector Machine Parameters

The performance of a support vector machine is its ability to predict unknown samples after learning the attributes of the training samples, often referred to as generalization ability. The penalty factor C and the kernel parameters


have a significant impact on the generalization performance of the support vector machine. The penalty factor C determines the trade-off between minimizing the fitting error and maximizing the classification interval. The kernel parameters, such as the width σ of the Gaussian radial basis function, affect the mapping transformation of the data space and change the complexity of the sample distribution in the high-dimensional feature space. If either kind of parameter is set too large or too small, the performance of the support vector machine is impaired. In practical engineering applications, it is therefore very important to obtain good generalization performance by optimizing the SVM parameters. This section takes the optimization of the penalty factor C and the Gaussian radial basis function parameter σ as an example to illustrate the proposed ant colony optimization based SVM parameter optimization method.

2.3.2.2 Objective Function for Optimizing Support Vector Machine Parameters

The goal of optimizing the support vector machine parameters is to use an optimization program to explore a finite subset of the possible solutions and find the parameter values that minimize the generalization error of the support vector machine. Because the true error on test samples is unbiased, and its variance decreases as the validation sample set grows, this section estimates the true error on test samples. Suppose a test sample set S′ = {(x_i′, y_i′) | x_i′ ∈ H, y_i′ ∈ Y, i = 1, …, l}, where H is the feature set and Y is the label set; then the objective function for parameter optimization of the support vector machine is:

min T = (1/l) Σ_{i=1}^l ψ(−y_i′ f(x_i′))    (2.30)

where ψ is the step function (ψ(x) = 1 when x > 0, and ψ(x) = 0 otherwise), and f is the decision function of the support vector machine.
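Read directly, Eq. (2.30) is simply the misclassification rate of the trained decision function on the held-out test set; the sketch below (an illustrative reading with invented names) makes this explicit:

```python
import numpy as np

def objective_T(decision_f, X_test, y_test):
    """Objective (2.30): the fraction of test samples for which the step
    function fires, i.e. -y_i' * f(x_i') > 0 (a misclassification).
    Labels are assumed to be +1 / -1; with a trained scikit-learn SVC,
    decision_f can be clf.decision_function."""
    return np.mean(-y_test * decision_f(X_test) > 0)
```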

2.3.2.3 Pheromone Model of Ant Colony Optimization Algorithm

In the ant colony optimization algorithm, the artificial ants construct solutions probabilistically by measuring dynamic artificial pheromones, so the main component of the algorithm is the pheromone construction and management mechanism. The pheromone model for optimizing the support vector machine parameters is as follows.

(1) State transition rule

The state transition rule gives the ants the ability to find the optimal parameters through pheromones. In the ant colony optimization algorithm for SVM parameter optimization, the role of each ant is to establish a candidate solution, built by applying probabilistic decisions to move the ant through adjacent states. The state transition probability equation is:

P_ij = τ_ij / Σ_{i=1}^N τ_ij    (2.31)

where i indexes the candidate values of one parameter to be optimized, j indexes the candidate values of the other parameter, τ_ij is the pheromone value of the parameter value combination (i, j), and P_ij is the probability that combination (i, j) is selected by the ant colony.

(2) State update rule

The state update rule is designed to drive the ant colony toward the optimal solution. When all ants have established their own solutions, the state update rule is applied only to the subset of locally optimal parameter combinations obtained in the current iteration, whose pheromone is thereby increased. Together, the state update rule and the state transition rule guide the colony to search near the good solutions of the current iteration. Applying the state update rule, the updated pheromone is:

τ_ij^new = (1 − ρ) τ_ij^old + Q / e^T    (2.32)

where T is the objective function value of Eq. (2.30), ρ is the volatility coefficient, and Q is the pheromone intensity. The state update rule assigns more pheromone to SVM solutions with smaller generalization error, making those parameters more likely to be selected by other ants and in subsequent iterations.
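Both rules translate into a few lines of code. In the sketch below (illustrative names, a 2-D pheromone grid over candidate C and σ values), Eq. (2.31) drives selection and Eq. (2.32) drives reinforcement:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_node(tau):
    """State transition rule (2.31): sample a grid node with probability
    proportional to its pheromone."""
    p = tau.ravel() / tau.sum()
    return np.unravel_index(rng.choice(tau.size, p=p), tau.shape)

def update_pheromone(tau, node, T, rho=0.5, Q=1.0):
    """State update rule (2.32): evaporate, then reward node (i, j) more
    strongly the smaller its test error T."""
    i, j = node
    tau[i, j] = (1 - rho) * tau[i, j] + Q / np.exp(T)
    return tau

tau = np.ones((11, 11))            # equal pheromone over an 11x11 (C, sigma) grid
node = select_node(tau)
tau = update_pheromone(tau, node, T=0.12)
```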

2.3.2.4 Ant Colony Optimization Algorithm Based Support Vector Machine Parameter Optimization Algorithm Steps

The flow chart of the ant colony optimization based SVM parameter optimization method is shown in Fig. 2.8. It consists of the following three main steps:

(1) Initialize the parameters and variables, divide the interval of each variable to be optimized into N grid cells, and calculate the grid interval size:

Fig. 2.8 Flow chart of support vector machine parameter optimization based on ant colony optimization algorithm

h_j = (v_j^upper − v_j^lower) / N, j = 1, …, m    (2.33)

where v_j^upper and v_j^lower are the upper and lower bounds of each parameter to be optimized, h_j is the grid interval after division, and m is the number of parameters to be optimized. The grid interval is the same for every parameter, so each grid node represents one combination of parameter values. The larger N is, the denser the grid and the more ants are needed, which increases the amount of computation; if N is too small, the convergence rate of the ant colony optimization algorithm decreases. Therefore, weighing computation time against complexity, this section sets v_j^upper = 2^10, v_j^lower = 2^−10, and N = 10 in the following numerical examples.

(2) At the initial moment, the pheromone levels on all parameter combination nodes are equal, that is, all grid nodes have the same distribution, so each ant randomly chooses its initial position; thereafter, the colony selects positions by the state transition rule of Eq. (2.31). The support vector machine is then trained with the selected parameter combinations, and the objective function of Eq. (2.30) is evaluated. When all ants have completed the iteration, the state update rule of Eq. (2.32) is applied to the parameter combination (grid node) that produces the minimum SVM generalization error. The grid nodes with the smallest error in each iteration are rewarded with more pheromone, which attracts more ants to select them, forming a positive feedback search mechanism. Step (2) loops until the maximum number of iterations is reached.

(3) Find the node with the largest pheromone concentration and search near it in the next round. The range of the parameters to be optimized is narrowed to:

v_j^lower ← v_j^lower + (m_j − Δ) ∗ h_j    (2.34)

v_j^upper ← v_j^upper + (m_j + Δ) ∗ h_j    (2.35)

where Δ is a constant and m_j is the node index of the maximum pheromone concentration.

The above three steps loop until the grid interval is less than the given precision ε, which is the termination condition of the ant colony optimization based SVM parameter optimization method. Generally, the smaller ε is, the more accurate the optimal solution, but the longer the computation time. To find a compromise between solution accuracy and computational complexity, ε is set to 0.01 in the subsequent experiments of this section. The optimal parameters obtained are:

v_j∗ = (v_j^lower + v_j^upper) / 2, j = 1, …, m    (2.36)

where v_j∗ denotes the optimal value of the j-th parameter.
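Putting steps (1) through (3) together gives the search loop sketched below; this is a simplified illustration under stated assumptions (log2-scale grid over C and gamma, scikit-learn's SVC as the learner, per-iteration reward of the single best node, and illustrative counts for ants and iterations), not a definitive implementation of the method:

```python
import numpy as np
from sklearn.svm import SVC

def aco_svm_search(X_tr, y_tr, X_te, y_te, N=10, n_ants=20, n_iter=15,
                   rho=0.5, Q=1.0, delta=1, eps=0.01, seed=0):
    """Sketch of steps (1)-(3): optimize log2(C) and log2(gamma) on a grid
    that is repeatedly narrowed around the most pheromone-rich node."""
    rng = np.random.default_rng(seed)
    lo = np.array([-10.0, -10.0])          # log2 lower bounds for (C, gamma)
    hi = np.array([10.0, 10.0])            # log2 upper bounds

    def test_error(c_exp, g_exp):          # objective (2.30)
        clf = SVC(C=2.0 ** c_exp, gamma=2.0 ** g_exp).fit(X_tr, y_tr)
        return np.mean(clf.predict(X_te) != y_te)

    while max(hi - lo) / N >= eps:
        h = (hi - lo) / N                  # step (1): Eq. (2.33)
        tau = np.ones((N + 1, N + 1))      # equal initial pheromone
        for _ in range(n_iter):            # step (2)
            picks = [np.unravel_index(
                rng.choice(tau.size, p=(tau / tau.sum()).ravel()), tau.shape)
                for _ in range(n_ants)]    # state transition rule (2.31)
            errs = [test_error(lo[0] + i * h[0], lo[1] + j * h[1])
                    for i, j in picks]
            i, j = picks[int(np.argmin(errs))]   # best node this iteration
            tau[i, j] = (1 - rho) * tau[i, j] + Q / np.exp(min(errs))  # (2.32)
        m = np.array(np.unravel_index(np.argmax(tau), tau.shape))
        lo, hi = (lo + np.maximum(m - delta, 0) * h,   # step (3): (2.34)-(2.35)
                  lo + np.minimum(m + delta, N) * h)
    c_exp, g_exp = (lo + hi) / 2           # Eq. (2.36)
    return 2.0 ** c_exp, 2.0 ** g_exp
```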

Table 2.3 Dataset properties

Name          | Number of groups | Number of training samples | Number of test samples | Dimensionality
Breast cancer | 100              | 200                        | 77                     | 9
Diabetes      | 100              | 468                        | 300                    | 8
Heart         | 100              | 170                        | 100                    | 13
Thyroid       | 100              | 140                        | 75                     | 5
Titanic       | 100              | 150                        | 2051                   | 3

2.3.3 Verification and Analysis by Exposed Datasets

To verify the effectiveness of the proposed ant colony optimization based SVM parameter optimization method, this section validates it on datasets from commonly used international standard databases and compares it with other common SVM parameter optimization methods.

2.3.3.1 Data Description

The experimental data are the Breast Cancer, Diabetes, Heart, Thyroid, and Titanic data sets from the well-known UCI and IDA standard databases. Table 2.3 describes the attributes of these data sets: the name of the data set, the number of data groups it contains, the number of training samples in each group, the number of test samples in each group, and the dimensionality of the data.

2.3.3.2 Support Vector Machine Analysis

To analyze the effect of the parameters on the generalization performance of the support vector machine, surface charts and contours of the test errors on the above five data sets are drawn, taking the Gaussian radial basis kernel function as an example, as shown in Figs. 2.9, 2.10, 2.11, 2.12 and 2.13. The interval of the SVM parameters C and σ is [2^−10, 2^10].

Fig. 2.9 Test error surface (a) and test error contour (b) of the Breast Cancer data set
Fig. 2.10 Test error surface (a) and test error contour (b) of the Diabetes data set
Fig. 2.11 Test error surface (a) and test error contour (b) of the Heart data set
Fig. 2.12 Test error surface (a) and test error contour (b) of the Thyroid data set
Fig. 2.13 Test error surface (a) and test error contour (b) of the Titanic data set

The (a) panels of Figs. 2.9, 2.10, 2.11, 2.12 and 2.13 show the test error surfaces, with the x-axis and y-axis representing log2 σ and log2 C, respectively; each node in the (x, y)-plane of a test error surface represents a combination of parameters, and the z-axis represents the percentage test error of the SVM under that combination. The (b) panels show the test error contours for the five data sets, again with the x-axis and y-axis representing log2 σ and log2 C. As can be seen from Figs. 2.9, 2.10, 2.11, 2.12 and 2.13, there are many local minima on the test error surfaces (contour lines), and it is difficult to find the parameter combination that minimizes the error. Moreover, the parameters of the Gaussian radial basis


The parameters of the Gaussian radial basis kernel function that make the test error of the Support vector machine smaller generally lie near the point (0, 0); if the Gaussian radial basis function parameter is too large or too small, the test error of the Support vector machine on the five data sets increases. As a result, the test error surfaces of the five data sets all present a bowl shape: the errors on both sides are large and the error in the middle is small. Such a bowl-shaped error surface makes it possible for an optimization algorithm to find the minimum error point.

Fig. 2.12 Test error surface (a) and test error contour (b) of the Thyroid data set
Fig. 2.13 Test error surface (a) and test error contour (b) of the Titanic data set
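For reference, this kind of error surface can be reproduced with a short script. The synthetic dataset below is a stand-in for the UCI/IDA splits, so the exact surface will differ from the figures, but the bowl shape and the local minima are typically visible.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

log2_sigma = np.linspace(-10, 10, 21)        # x-axis of the contour plots
log2_C = np.linspace(-10, 10, 21)            # y-axis of the contour plots
err = np.empty((log2_C.size, log2_sigma.size))
for i, lc in enumerate(log2_C):
    for j, ls in enumerate(log2_sigma):
        gamma = 1.0 / (2.0 * (2.0 ** ls) ** 2)   # K(x,x') = exp(-||x-x'||^2/(2 sigma^2))
        clf = SVC(C=2.0 ** lc, gamma=gamma).fit(X_tr, y_tr)
        err[i, j] = 100.0 * (1.0 - clf.score(X_te, y_te))  # percentage test error

plt.contourf(log2_sigma, log2_C, err, levels=15)
plt.xlabel("log2 sigma"); plt.ylabel("log2 C")
plt.colorbar(label="test error (%)")
plt.show()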

2.3.3.3 Parameter Influence Analysis of Ant Colony Optimization Algorithm

When an optimization algorithm is used to optimize the parameters of the Support vector machine, it inevitably introduces parameters belonging to the optimization algorithm itself. Whether these parameters are set properly has a great effect on the final optimal solution; setting the parameters of an optimization algorithm is at least as important as designing the algorithm. The main parameters of the ant colony optimization algorithm are the number of ants, the volatility coefficient $\rho$, the pheromone intensity $Q$, and the initial pheromone value $\tau$. In order to set these parameters reasonably, their effect on the Support vector machine optimization results was analyzed in detail, taking the Thyroid data set as an example.

Fig. 2.14 The effect of the number of ants on the support vector machine parameter optimization method based on ant colony optimization algorithm (Thyroid data set): (a) test error versus the number of ants; (b) calculation time versus the number of ants

(1) Number of ants

In the ant colony optimization algorithm, setting a reasonable number of ants has an important impact on the results. In order to explore the potential optimal solution in the feasible solution space in a short time, a sufficient number of ants is required. Therefore, the performance of the ant colony optimization algorithm is tested with 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 ants. The effect of the number of ants on the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm is shown in Fig. 2.14. As can be seen from the graph, when the number of ants is 20, 30, 70, 80 or 90, the performance of the Support vector machine is the best (the test error is the smallest), and increasing the number of ants increases the running time of the program.

(2) Volatility coefficient

In the ant colony algorithm, the volatility coefficient $\rho$ uniformly reduces the pheromone of all the state points. From the point of view of practical application, the volatility coefficient is needed to prevent the ant colony optimization algorithm from converging to a local optimal region too quickly; the volatilization effect encourages the search to explore new regions of the solution space. Figure 2.15 shows the effect of the volatility coefficient $\rho$ on the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm. The figure shows that when the volatility coefficient $\rho$ is 0.2, 0.4 or 0.6-0.7, the test error on the Thyroid data set is the smallest, and the calculation time is the shortest when $\rho$ is 0.5.

(3) Pheromone intensity

In the ant colony algorithm, the pheromone intensity $Q$ is used to keep the evolution toward the global optimal solution at a suitable speed during the positive


feedback process. Figure 2.16 shows the effect of the pheromone intensity $Q$ on the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm. The figure shows that when the pheromone intensity $Q$ is 20-100, 140 or 190-200, the test error on the Thyroid data set is small.

(4) Initial value of pheromone

Because the ant colony optimization algorithm obtains suitable solutions mainly through indirect pheromone communication between the artificial ants, the setting of the initial pheromone value is very important. Figure 2.17 shows the complex effect of the initial pheromone value $\tau$ on the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm. The figure shows that the test error on the Thyroid data set is smallest when the initial pheromone value $\tau$ is 160, and the computing time is shortest when it is 80.

Considering the effect of the parameters of the ant colony optimization algorithm, Samrout et al. [14], Duan et al. [15] and others have systematically studied their effect mechanisms and settings. Samrout et al. [14] suggested that the ratio of the number of ants to the number of grid nodes should be approximately 1:0.7, and Duan et al. [15] suggested that the ant colony optimization algorithm achieves good global convergence when the volatilization coefficient $\rho = 0.5$. Based on these research results, Table 2.4 gives the parameters of the ant colony optimization algorithm set in this section.

(5) Numerical results and analysis

First, the first training set and test set of Breast Cancer, Diabetes, Heart, Thyroid and Titanic were analyzed respectively.

Fig. 2.15 Effect of the volatility coefficient ρ on the support vector machine parameter optimization method based on ant colony optimization algorithm (Thyroid data set): (a) test error versus ρ; (b) calculation time versus ρ

Fig. 2.16 Effect of the pheromone intensity Q on the support vector machine parameter optimization method based on ant colony optimization algorithm (Thyroid data set): (a) test error versus Q; (b) calculation time versus Q
Fig. 2.17 Effect of the initial pheromone value τ on the support vector machine parameter optimization method based on ant colony optimization algorithm (Thyroid data set): (a) test error versus τ; (b) calculation time versus τ

Table 2.4 Parameter settings

Parameter                      Value
Number of ants                 80
Coefficient of volatilization  0.5
Pheromone intensity            100
Initial pheromone value        100

The parameter optimization method of the Support vector machine based on the ant colony optimization algorithm proposed in this section is compared with the parameter optimization method of the Support vector machine based on the grid algorithm [16] in terms of the optimal parameter values obtained, the test error rate and the calculation time, as shown in Table 2.5. The results show that the ant colony optimization based method takes much less time than


Table 2.5 Comparison of the ant colony optimization method for support vector machine parameters and the grid support vector machine parameter optimization method

                 Grid method                                         Ant colony optimization algorithm
The name of      Optimal   Optimal   Test        Calculating   Optimal   Optimal   Test        Calculating
the dataset      C         σ         error (%)   time (s)      C         σ         error (%)   time (s)
Breast cancer    8.03      5.04      25.97       2547.30       1.98      14.41     25.97       1437.80
Diabetes         174.30    151.74    23.33       29,078.00     176.23    151.58    23.00       19,298.00
Heart            799.55    150.29    19.00       1446.40       1043.80   2.00      16.00       519.58
Thyroid          89.84     3.31      4.00        702.89        37.38     2.22      2.67        666.20
Titanic          1024.50   60.10     22.57       639.95        1045.00   54.50     22.57       429.22

the grid algorithm on all five data sets, while achieving the same or lower test errors than the grid optimization algorithm on Breast Cancer, Diabetes, Heart, Thyroid and Titanic. This shows that the proposed parameter optimization method of the Support vector machine based on the ant colony optimization algorithm obtains the desired optimal parameters more easily than the grid-based method.

Second, the same experiment as in the literature [6, 17, 18] was set up using the Gaussian radial basis function, with the first five training sets and test sets of each data set used for the experiment. The experimental results of the proposed parameter optimization method of the Support vector machine based on the ant colony optimization algorithm are shown in Table 2.6 and compared with those in Refs. [6, 17, 18]; the result of each method is reported as the mean ± variance of the test error. Compared with the other methods, the proposed method achieves the minimum average test error on the Breast Cancer, Diabetes, Thyroid and Titanic data sets, and on the Heart data set its mean test error is close to the minimum obtained with the radius-margin bound. The variance of the test error describes the deviation of the individual test errors from the mean. Besides the minimum average test error, the proposed method also achieves the minimum variance on the Titanic data set, and on Breast Cancer, Diabetes and Thyroid the variance of its test error is also competitive.

The mean test errors of the numerical experiments (Table 2.6) are represented by the histogram in Fig. 2.18, from which the test errors of the various methods on the five commonly used data sets of the international standard databases can be compared directly. In the figure, CV represents the five-fold cross-validation method of [12], RM the radius-margin bound method of [12], SB the span bound method of [12], M a fast algorithm for optimizing the Support vector machine parameters based on

Table 2.6 Support vector machine parameter optimization method test error comparison

The name of     Five-fold cross-      Radius-margin     Span bound      Method proposed     CMA-ES [18]     Ant colony optimization
the dataset     validation [12] (%)   bound [12] (%)    [12] (%)        in literature [17] (%)   (%)        algorithm (%)
Breast cancer   26.04 ± 4.74          26.84 ± 4.71      25.59 ± 4.18    25.48 ± 4.38        26.00 ± 0.08    23.38 ± 4.00
Diabetes        23.53 ± 1.73          23.25 ± 1.70      23.19 ± 1.67    23.41 ± 1.68        23.16 ± 0.11    22.80 ± 1.33
Heart           15.95 ± 3.26          15.92 ± 3.18      16.13 ± 3.11    15.96 ± 3.13        16.19 ± 0.04    16.00 ± 5.70
Thyroid         4.80 ± 2.19           4.62 ± 2.03       4.56 ± 1.97     4.70 ± 2.07         3.44 ± 0.08     3.20 ± 2.02
Titanic         22.42 ± 1.02          22.88 ± 1.23      22.50 ± 0.88    22.90 ± 1.16        –               21.63 ± 0.54


Fig. 2.18 Comparison of test errors of support vector machine parameter optimization methods

empirical error gradient estimation, ES the covariance matrix adaptation evolution strategy (CMA-ES) [18], and ACO the ant colony optimization based method of the Support vector machine proposed in this section. The experimental comparisons on the international standard databases in Table 2.6 and Fig. 2.18 show that the proposed parameter optimization method of the Support vector machine based on the ant colony optimization algorithm achieves satisfactory results compared with other common Support vector machine parameter optimization methods. The numerical results show that it is feasible and effective to use the ant colony optimization algorithm to optimize the parameters of the Support vector machine.

2.3.4 The Application in Electrical Locomotive Rolling Bearing Single Fault Diagnosis

2.3.4.1 Fault Diagnosis Method for Rolling Bearing of Electric Locomotive

In the fault diagnosis of rolling bearings, the fault characteristic frequencies of the bearing are usually analyzed by various signal processing methods. Although the fault characteristic frequencies can be calculated from the bearing parameters, in many cases it is very difficult to find them in the signal:

(1) Changes in bearing geometry and assembly make it difficult to determine the characteristic frequencies of the bearing accurately.
(2) Bearing faults at different positions cause different instantaneous responses in the signal; moreover, the instantaneous response is easily submerged in the


wideband response signal and noise, which makes it difficult to extract the bearing fault features.
(3) Even for the same fault, the signal characteristics differ at different damage stages (different degrees of damage).
(4) The running speed and load of the rotating shaft greatly affect the vibration of the machine, so the monitored vibration signal shows different characteristics.
(5) The signals and parameters of special bearings are difficult to obtain, so analyzing the bearing vibration signal on the basis of the bearing characteristic frequencies is not feasible in such situations.

In this section, the mechanical fault diagnosis method based on ant colony optimization of the Support vector machine parameters is therefore not built on the accurate extraction of the bearing fault characteristic frequencies; instead, it simply extracts the time-domain and frequency-domain statistical features of the signal and then uses the proposed ant colony optimization algorithm to optimize the parameters of the Support vector machine for fault pattern recognition, which highlights the effectiveness of the proposed parameter optimization method.

(1) Feature extraction of signal

The time-domain and frequency-domain features of the signal are extracted as shown in Table 2.7. Among them, feature $F_1$ is the waveform index, $F_2$ the peak index, $F_3$ the pulse index, $F_4$ the margin index, $F_5$ the kurtosis index and $F_6$ the skewness index; feature $F_7$ represents the vibration energy in the frequency domain; features $F_8$-$F_{10}$, $F_{12}$ and $F_{16}$-$F_{19}$ describe the concentration of the signal energy in the frequency domain; and features $F_{11}$ and $F_{13}$-$F_{15}$ describe changes in the position of the dominant frequency [19].

(2) Fault pattern recognition

Because mechanical failures involve multiple fault patterns, a multi-classification strategy of the Support vector machine is needed to identify them. Let the training set $S = \{(x_i, y_i) \mid x_i \in H, y_i \in \{1, 2, \ldots, M\}, i = 1, \ldots, l\}$ be known, where $x_i$ is a training sample, $H$ is the Hilbert space, $y_i$ is the attribute label of the training sample $x_i$, $M$ is the number of classes, and $l$ is the number of samples. The classification strategy of the Support vector machine usually adopts one of two methods: the "one-to-many" or the "one-to-one" multi-classification algorithm.

(1) The basic principle of the "one-to-many" multi-classification algorithm of the Support vector machine is as follows. For $j = 1, 2, \ldots, M-1$, the following operations are carried out: the $j$th class is regarded as the positive class and the remaining classes as the negative class, and the decision function of the $j$th Support vector machine is given by Eq. (2.37).


Table 2.7 Statistical characteristics in time domain and frequency domain

Feature 1 (waveform index): $F_1 = \sqrt{\frac{1}{N}\sum_{n=1}^{N} x(n)^2} \Big/ \left(\frac{1}{N}\sum_{n=1}^{N} |x(n)|\right)$

Feature 2 (peak index): $F_2 = \max|x(n)| \Big/ \sqrt{\frac{1}{N}\sum_{n=1}^{N} x(n)^2}$

Feature 3 (pulse index): $F_3 = \max|x(n)| \Big/ \left(\frac{1}{N}\sum_{n=1}^{N} |x(n)|\right)$

Feature 4 (margin index): $F_4 = \max|x(n)| \Big/ \left(\frac{1}{N}\sum_{n=1}^{N} \sqrt{|x(n)|}\right)^2$

Feature 5 (kurtosis index): $F_5 = \frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^4 \Big/ \left(\sqrt{\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^2}\right)^4$

Feature 6 (skewness index): $F_6 = \frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^3 \Big/ \left(\sqrt{\frac{1}{N}\sum_{n=1}^{N}(x(n)-\bar{x})^2}\right)^3$

Feature 7: $F_7 = \frac{1}{K}\sum_{k=1}^{K} s(k)$

Feature 8: $F_8 = \sum_{k=1}^{K}(s(k)-F_7)^2 / (K-1)$

Feature 9: $F_9 = \sum_{k=1}^{K}(s(k)-F_7)^3 / \left(K(\sqrt{F_8})^3\right)$

Feature 10: $F_{10} = \sum_{k=1}^{K}(s(k)-F_7)^4 / (K F_8^2)$

Feature 11: $F_{11} = \sum_{k=1}^{K} f_k s(k) \Big/ \sum_{k=1}^{K} s(k)$

Feature 12: $F_{12} = \sqrt{\sum_{k=1}^{K}(f_k-F_{11})^2 s(k) / K}$

Feature 13: $F_{13} = \sqrt{\sum_{k=1}^{K} f_k^2 s(k) \Big/ \sum_{k=1}^{K} s(k)}$

Feature 14: $F_{14} = \sqrt{\sum_{k=1}^{K} f_k^4 s(k) \Big/ \sum_{k=1}^{K} f_k^2 s(k)}$

Feature 15: $F_{15} = \sum_{k=1}^{K} f_k^2 s(k) \Big/ \sqrt{\sum_{k=1}^{K} s(k) \sum_{k=1}^{K} f_k^4 s(k)}$

Feature 16: $F_{16} = F_{12}/F_{11}$

Feature 17: $F_{17} = \sum_{k=1}^{K}(f_k-F_{11})^3 s(k) / (K F_{12}^3)$

Feature 18: $F_{18} = \sum_{k=1}^{K}(f_k-F_{11})^4 s(k) / (K F_{12}^4)$

Feature 19: $F_{19} = \sum_{k=1}^{K}(f_k-F_{11})^{1/2} s(k) / (K\sqrt{F_{12}})$

$x(n)$, $n = 1, 2, \ldots, N$, represents the time series, where $N$ is the number of data points and $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x(n)$ denotes its mean; $s(k)$, $k = 1, 2, \ldots, K$, represents the spectrum, where $K$ is the number of spectral lines and $f_k$ is the frequency value at the $k$th spectral line of the spectrum.
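The 19 features can be computed with a few lines of NumPy. The sketch below is our own reading of Table 2.7, not code from the book: the amplitude spectrum is taken from an FFT, and the fractional power in $F_{19}$ is applied only to non-negative deviations, which is one pragmatic interpretation.

import numpy as np

def table_2_7_features(x, fs):
    x = np.asarray(x, dtype=float)
    N = x.size
    xbar = x.mean()
    rms = np.sqrt(np.mean(x ** 2))
    abs_mean = np.mean(np.abs(x))
    peak = np.max(np.abs(x))
    var = np.mean((x - xbar) ** 2)                      # biased variance
    F = {}
    F[1] = rms / abs_mean                               # waveform index
    F[2] = peak / rms                                   # peak index
    F[3] = peak / abs_mean                              # pulse index
    F[4] = peak / np.mean(np.sqrt(np.abs(x))) ** 2      # margin index
    F[5] = np.mean((x - xbar) ** 4) / var ** 2          # kurtosis index
    F[6] = np.mean((x - xbar) ** 3) / var ** 1.5        # skewness index
    s = np.abs(np.fft.rfft(x))                          # amplitude spectrum s(k)
    f = np.fft.rfftfreq(N, d=1.0 / fs)                  # frequency values f_k
    K = s.size
    F[7] = s.mean()
    F[8] = np.sum((s - F[7]) ** 2) / (K - 1)
    F[9] = np.sum((s - F[7]) ** 3) / (K * F[8] ** 1.5)
    F[10] = np.sum((s - F[7]) ** 4) / (K * F[8] ** 2)
    F[11] = np.sum(f * s) / np.sum(s)                   # frequency center
    F[12] = np.sqrt(np.sum((f - F[11]) ** 2 * s) / K)
    F[13] = np.sqrt(np.sum(f ** 2 * s) / np.sum(s))
    F[14] = np.sqrt(np.sum(f ** 4 * s) / np.sum(f ** 2 * s))
    F[15] = np.sum(f ** 2 * s) / np.sqrt(np.sum(s) * np.sum(f ** 4 * s))
    F[16] = F[12] / F[11]
    F[17] = np.sum((f - F[11]) ** 3 * s) / (K * F[12] ** 3)
    F[18] = np.sum((f - F[11]) ** 4 * s) / (K * F[12] ** 4)
    # F19: the square root is applied to the non-negative part of f_k - F11
    F[19] = np.sum(np.clip(f - F[11], 0.0, None) ** 0.5 * s) / (K * np.sqrt(F[12]))
    return np.array([F[i] for i in range(1, 20)])

features = table_2_7_features(np.random.randn(2048), fs=12800)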

Fig. 2.19 The “one-to-many” classification flow chart of support vector machine

$$f^j(x) = \operatorname{sgn}\left[\sum_{i=1}^{l} y_i \alpha_i^j K(x_i \cdot x) + b^j\right] \tag{2.37}$$

where $\alpha_i^j$ and $b^j$ are the coefficients of the $j$th Support vector machine and $K(x_i \cdot x)$ is the kernel function. If $f^j(x) = 1$, then $x$ belongs to class $j$; otherwise, $x$ is input to the next Support vector machine, until all Support vector machines have been tried. Figure 2.19 shows this classification strategy.


(2) The basic principle of the "one-to-one" multi-classification algorithm of the Support vector machine is as follows. Among $M$ classes of samples, every two classes are used to construct a Support vector machine, so that a total of $C_M^2 = M(M-1)/2$ Support vector machines are constructed. For the Support vector machine constructed from the class $i$ and class $j$ samples, the decision function is

$$f^{ij}(x) = \operatorname{sgn}\left[\sum_{n=1}^{m} y_n \alpha_n^{ij} K(x_n \cdot x) + b^{ij}\right] \tag{2.38}$$

where $m$ is the total number of samples of class $i$ and class $j$; $x_n$ is a sample of class $i$ or class $j$; $y_n \in \{1, 2, \ldots, M\}$ is the attribute label of the sample $x_n$; $\alpha_n^{ij}$ and $b^{ij}$ are the coefficients of the Support vector machine constructed from the class $i$ and class $j$ data; and $K(x_n \cdot x)$ is the kernel function. When identifying an unknown sample, the $C_M^2 = M(M-1)/2$ Support vector machines constructed above are applied in turn. If the Support vector machine for classes $i$ and $j$ assigns the sample to class $i$, the number of votes for class $i$ is increased by 1; otherwise, the number of votes for class $j$ is increased by 1. Finally, the sample is assigned to the class with the most votes; the decision principle is the maximum voting method.

In this section, the "one-to-one" multi-classification Support vector machine is used as the basic classifier for mechanical fault pattern recognition, and the proposed ant colony optimization algorithm is then used to optimize its parameters to realize intelligent mechanical fault diagnosis.
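The maximum-voting decision just described can be sketched compactly. The helper below is a minimal illustration with assumed default parameters, not the book's code; scikit-learn's SVC, for reference, applies the same one-versus-one strategy internally.

import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def one_vs_one_predict(X_train, y_train, X_test, C=1.0, gamma=0.1):
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)))
    for a, b in combinations(range(len(classes)), 2):
        # train one pairwise SVM per class pair, Eq. (2.38)
        mask = np.isin(y_train, [classes[a], classes[b]])
        clf = SVC(C=C, gamma=gamma).fit(X_train[mask], y_train[mask])
        pred = clf.predict(X_test)
        votes[pred == classes[a], a] += 1    # each pairwise SVM casts one vote
        votes[pred == classes[b], b] += 1
    return classes[votes.argmax(axis=1)]     # maximum-voting decision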

2.3.4.2 Description of Rolling Bearing Experimental System for Electric Locomotive

At present, the condition monitoring of operating train equipment in our country relies mostly on human senses and a few quantitative monitoring systems (such as the axle temperature alarm system). In this way, many safety-critical items of train equipment, such as the running gear system, the braking system and the train electrical equipment, cannot be monitored in real time; faults are detected only after a major accident, when the damage is already done. Therefore, this section takes locomotive rolling bearings as an example to verify the effectiveness of the proposed parameter optimization method of the Support vector machine based on the ant colony optimization algorithm in mechanical fault diagnosis.

The experimental system mainly comprises the locomotive rolling bearing test platform and the sensors. The test platform consists of a hydraulic motor, two supports (on which two normal bearings are mounted), a test bearing (52732QT), a tachometer for measuring the rotating speed, and a loading module for loading the test bearing. A 608A11-type ICP accelerometer is mounted below the loading


Fig. 2.20 Structure diagram of rolling bearing test platform for locomotive

Table 2.8 Locomotive rolling bearing state description

Status                  Normal    Outer ring fault   Inner ring fault   Rolling element fault
Label                   State 1   State 2            State 3            State 4
Rotating speed (r/min)  490       490                500                530

module adjacent to the outer ring of the test bearing; the sampling frequency is 12,800 Hz. The structure diagram of the test platform is shown in Fig. 2.20, and the load applied to the test bearing is 9800 N.

The experimental data of the locomotive rolling bearing include four classes: the normal state, the outer ring fault state, the inner ring fault state and the rolling element fault state. Table 2.8 briefly describes these four classes of experimental data, and the photographs in Fig. 2.21 record the bearing conditions under the three single-fault conditions. Each of the four classes contains 30 samples, and each sample is composed of 2048 data points; 20 samples per class are used to train the Support vector machine and the remaining 10 are used to test its performance. For each sample, the time-domain and frequency-domain statistical features shown in Table 2.7 are extracted, the Gaussian radial basis function is used as the kernel function of the Support vector machine, and the proposed ant colony optimization algorithm is used to optimize the parameters of the "one-to-one" multi-classification Support vector machine to achieve accurate fault pattern recognition.
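Putting the protocol together, a sketch of the whole experiment might look as follows; it reuses the hypothetical table_2_7_features and one_vs_one_predict helpers from the earlier sketches, the signals are random placeholders for the measured samples, and the default (C, gamma) stand where the ACO-tuned parameters would go.

import numpy as np

signals = np.random.randn(4, 30, 2048)   # placeholder: 4 states x 30 samples x 2048 points
fs = 12800                               # sampling frequency from the text

X = np.array([[table_2_7_features(s, fs) for s in state] for state in signals])
y = np.repeat(np.arange(1, 5), 30).reshape(4, 30)    # labels: states 1..4

X_train = X[:, :20].reshape(-1, 19); y_train = y[:, :20].ravel()   # 20 samples per state
X_test = X[:, 20:].reshape(-1, 19); y_test = y[:, 20:].ravel()     # remaining 10 per state
accuracy = np.mean(one_vs_one_predict(X_train, y_train, X_test) == y_test)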

2.3.4.3 Fault Diagnosis Results and Analysis

In order to analyze the effect of the Support vector machine parameters on the fault diagnosis results, the range of the Support vector machine parameters (penalty factor $C$ and Gaussian radial basis kernel parameter $\sigma$) is set to $[2^{-10}, 2^{10}]$. A "one-to-one" multi-classification Support vector machine without parameter optimization is used as the

Fig. 2.21 Three classes of faults of the locomotive rolling bearing: (a) outer ring fault; (b) inner ring fault; (c) rolling element fault

basic classifier, and the resulting fault diagnosis error surface and contour lines are shown in Fig. 2.22. Figure 2.22a shows the fault diagnosis error surface, where the x-axis and y-axis are $\log_2 \sigma$ and $\log_2 C$ respectively; each node in the (x, y)-plane represents a parameter combination, and the z-axis represents the percentage diagnosis error of the Support vector machine under that combination. Figure 2.22b shows the diagnosis error contour lines, with the same axes. As can be seen from the figure, the diagnosis error is high on the left and low on the right, so parameter optimization is an important step when the "one-to-one" multi-classification Support vector machine is used for fault diagnosis.

The proposed parameter optimization method of the Support vector machine based on the ant colony optimization algorithm is then applied to the fault diagnosis experiments. The range of the Support vector machine parameters (penalty factor $C$ and Gaussian radial basis function parameter $\sigma$) is still $[2^{-10}, 2^{10}]$. A total of five experiments are carried out, and the optimal parameter values, fault diagnosis accuracy and running time obtained in each experiment are shown in Table 2.9.

Fig. 2.22 Analysis of support vector machine parameters in rolling bearing fault diagnosis: (a) test error surface; (b) test error contour


The optimal parameters (penalty factor $C$ and Gaussian radial basis function parameter $\sigma$) obtained in the five experiments are rather dispersed: the ant colony optimization algorithm is a probabilistic optimization algorithm based on multiple ant agents, the randomly selected starting points of the ants differ from run to run, and the test error surface of the four locomotive rolling bearing states contains a large zero-error region (as shown in Fig. 2.22), so the optimized parameters $C$ and $\sigma$ differ in each experiment. As can be seen from the diagnosis results, the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm accurately identifies the four common bearing states in the locomotive rolling bearing experiments (normal, outer ring fault, inner ring fault and rolling element fault) with little computation time.

In this section, the effect of the Support vector machine parameters on its generalization performance was analyzed, the method was validated on five commonly used data sets from international standard databases, and the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm was then applied to locomotive rolling bearing fault diagnosis. The main conclusions are as follows:

(1) The parameter optimization method of the Support vector machine based on the ant colony optimization algorithm is a probabilistic global optimization algorithm that does not depend on strict mathematical properties of the optimization problem itself; it is an intelligent algorithm based on multiple ant agents with parallelism, convergence, evolution and robustness. Compared with the gradient method and traditional evolutionary parameter optimization methods for the Support vector machine, it has the following advantages: continuity of the problem definition is not required; the algorithm is simple and easy to implement, involving only basic mathematical operations; only the output value of the objective function is needed, without gradient information; and data processing is fast.

(2) Validation analysis on five commonly used data sets from internationally common standard databases shows that the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm is more effective than other common parameter optimization methods of the Support vector machine; the research results show the feasibility and effectiveness of the algorithm.

Table 2.9 Support vector machine fault diagnosis results based on ant colony optimization algorithm

Number of experiments   Optimal C   Optimal σ   Accuracy (%)   Calculating time (s)
1                       2.0481      20.4810     100            84.5717
2                       630.2929    85.5254     100            104.5814
3                       929.7921    224.4616    100            100.5674
4                       105.6777    14.3367     100            127.4912
5                       700.0886    421.8886    100            97.3418
Average                 –           –           100            102.9107


(3) The parameter optimization method of the Support vector machine based on the ant colony optimization algorithm provides excellent parameters for the fault diagnosis of electric locomotive rolling bearings and reduces the blindness of setting the Support vector machine parameters; the four common rolling bearing states (normal, outer ring fault, inner ring fault and rolling element fault) are accurately identified. The parameter optimization method of the Support vector machine based on the ant colony optimization algorithm is thus proved effective in mechanical fault diagnosis.

2.4 Feature Selection and Parameters Optimization Method for SVM

In engineering practice, once the kernel and the algorithmic structure of the Support vector machine are determined, the two key factors that restrict its generalization performance are how to select the optimal parameters of the Support vector machine and how to select the features related to the sample attributes for it to learn. In 2002, Chapelle and Vapnik, while studying the parameter selection problem of the Support vector machine, pointed out three reasons why feature selection is important: it can improve the generalization performance of the Support vector machine, determine the features related to the attributes, and reduce the dimension of the input space [12].

Most parameter optimization problems and feature selection problems of the Support vector machine have been studied and solved separately. For example, trial-and-error procedures, generalization error estimation with gradient descent, and artificial intelligence and evolutionary algorithms have been proposed for the optimization of the Support vector machine parameters [15], while feature selection methods for the Support vector machine include the minimum error upper bound method, genetic algorithms and particle swarm optimization. When the two key factors (feature selection and parameter optimization) that restrict the generalization performance of the Support vector machine are analyzed and solved separately, their coupled effect on the generalization performance is not taken into account, so one side is inevitably sacrificed for the other, which prevents the generalization performance of the Support vector machine from being fully exploited.

In addition, there is no simple one-to-one correspondence between the faults of mechanical equipment and their characteristic signs. Different faults can share the same signs, the characteristics of the same fault under different conditions are not exactly the same, and even for the same equipment the fault symptoms vary greatly under different installation and operating conditions. Usually, the sample features input to the Support vector machine are redundant and


interrelated, which weakens the generalization performance of the Support vector machine. Selecting the optimal fault features and optimizing the Support vector machine parameters simultaneously is therefore a powerful way to improve its generalization performance. The ant colony optimization algorithm-based fusion method of Support vector machine feature selection and parameter optimization uses the ant colony algorithm to solve the feature selection and parameter optimization problems of the Support vector machine in one pass; by obtaining the best matching features and parameters synchronously, the generalization ability of the Support vector machine can be further improved and multi-class fault diagnosis of mechanical equipment can be realized.

2.4.1 Ant Colony Optimization Based Feature Selection and Parameters Optimization Method for SVM

The ant colony optimization algorithm-based fusion method of Support vector machine feature selection and parameter optimization mainly uses the heuristic information within the ant colony to find the optimal feature subset and parameters. The algorithm consists of four parts: initialization, ant colony solution of the feature subset, feature evaluation, and pheromone update. The block diagram of the algorithm is shown in Fig. 2.23.

2.4.1.1 Initialization

Input the original feature set and set the initial parameters of the ant colony optimization algorithm-based fusion method of Support vector machine feature selection and parameter optimization; for example, the size of the ant colony (the number of ants) should be selected according to the size of the input feature set.

2.4.1.2 Ant Colony Algorithm for Feature Subset

Solving for the feature subset with the ant colony is an important part of the fusion method. At the initial moment of the algorithm, each initialized ant freely selects a feature subset from the original feature set containing $N$ features according to the random selection criterion; at all later moments, the ant colony selects feature subsets according to the state transition criterion. The feature subsets solved by the ants are $s_1, s_2, \ldots, s_r$, where $r$ is the number of ants, and the subsets contain $n_1, n_2, \ldots, n_r$ features, respectively. Solving the feature subset with the ant colony involves two main elements: the random selection criterion and the state transition criterion.


Fig. 2.23 Flow chart of ant colony optimization algorithm and support vector machine fusion method

(1) Random selection criterion

At the initial moment, because every feature has the same pheromone level and the same distribution, all ants select features randomly to construct their feature subsets.

(2) State transition criterion

Except at the initial moment, when the solution subsets are constructed according to the random selection criterion, the ant colony constructs the feature subsets using a probabilistic decision-making strategy called the state transition criterion:


$$s_i = \arg\max_u \{\tau(u)\}, \quad i = 1, 2, \ldots, r \tag{2.39}$$

where $s_i$ is the feature subset constructed by the $i$th ant, and $\tau(u)$ is the pheromone concentration on feature $u$. Equation (2.39) guides the ants to construct a feature subset $s_i$ by selecting features $u$ with higher pheromone concentrations. In general, a high pheromone concentration is associated with the optimal target solution, which ensures that the features selected by the ant colony are the optimal features that can produce the desired optimal target solution; this is explained in more detail in the pheromone update section below.
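As a toy illustration of the two criteria, a single ant's subset construction might be sketched as below; the fixed subset size and the top-n reading of Eq. (2.39) are simplifying assumptions of ours, not the book's implementation.

import numpy as np

rng = np.random.default_rng(0)

def construct_subset(tau, n_select, first_iteration):
    n_features = tau.shape[0]
    if first_iteration:
        # random selection criterion: uniform choice at the initial moment
        return rng.choice(n_features, size=n_select, replace=False)
    # state transition criterion, Eq. (2.39): prefer high-pheromone features
    return np.argsort(tau)[-n_select:]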

2.4.1.3 Feature Assessment

The feature subsets obtained by the ant colony algorithm need to be input into the Support vector machine to further evaluate their merits. At the same time, since the parameters (such as the penalty factor $C$ and the Gaussian radial basis function parameter $\sigma$) affect the performance of the Support vector machine, even for the same input feature subset, different parameter settings lead to different generalization performance. Therefore, while the feature subsets constructed by the ant colony are evaluated, the parameters of the Support vector machine are also optimized to ensure its optimal generalization performance. The feature evaluation part consists of the following three main elements:

(1) Input the feature subsets

The feature subsets $s_1, s_2, \ldots, s_r$, containing $n_1, n_2, \ldots, n_r$ features and solved by the ants, are input into the Support vector machine respectively, where $r$ is the number of ants.

(2) Ant colony optimization of the Support vector machine parameters

For each feature subset $s_i$, $i = 1, 2, \ldots, r$, the optimal parameters of the Support vector machine under that subset are obtained. The detailed implementation principles and methods were described in Sect. 2.3.2. The optimal parameters obtained are:

$$v_j^* = \frac{v_j^{upper} + v_j^{lower}}{2}, \quad j = 1, \ldots, m \tag{2.40}$$

where $v_j^*$ represents the optimal parameters obtained.

(3) Evaluation of feature subsets and parameters by the Support vector machine

The feature subsets $s_i = (e_1^i, e_2^i, \ldots, e_{n_i}^i)$, $i = 1, 2, \ldots, r$, constructed by the ant colony and the corresponding optimal parameters $v_j^*$, $j = 1, \ldots, m$, are input into the Support vector machine. Assume that the test sample set is


$V' = \{(x_i', y_i') \mid x_i' \in s_r, y_i' \in Y, i = 1, \ldots, q\}$, where $Y$ is the attribute label set and $q$ is the number of samples in the test sample set. Then the evaluation error of the $i$th ant, based on its feature subset and the corresponding optimal parameters $v_j^*$, $j = 1, \ldots, m$, is:

$$T_{ant}^i = \frac{1}{q} \sum_{j=1}^{q} \psi\left(-y_j' f_i(x_j')\right) \tag{2.41}$$

where $\psi$ is the step function ($\psi(x) = 1$ when $x > 0$, otherwise $\psi(x) = 0$) and $f_i$ is the decision function of the Support vector machine constructed by ant $i$. The resulting error value $T_{ant}^i$ is the result of the Support vector machine's evaluation of the feature subset and parameters.
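In code, Eq. (2.41) reduces to a misclassification rate, since $\psi(-y f(x)) = 1$ exactly when $y f(x) < 0$. The sketch below assumes binary $\pm 1$ labels for a single pairwise Support vector machine.

import numpy as np

def evaluation_error(decision_values, labels):
    # decision_values: f_i(x'_j); labels: y'_j in {-1, +1}
    # psi(-y f) = 1 when y f < 0, i.e. when the sample is misclassified
    return np.mean(labels * decision_values < 0)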

2.4.1.4 Pheromone Update

The pheromone must be updated once the colony has completed the construction and evaluation of its solutions. Pheromone updating follows two criteria: the global update criterion and the local update criterion.

(1) Global update criterion

The global update criterion is applied if and only if all ants have completed the solution of their feature subsets and the feature evaluation task. The goal of the global update is to encourage the ants to produce the optimal feature subset and the optimal parameters of the Support vector machine: the pheromone concentration of each feature in the optimal feature subset is enhanced so as to attract more ants to select the features that produce the optimal solution. The global pheromone update criterion is:

$$\tau(k+1) = (1 - \rho)\tau(k) + Q T_{max} \tag{2.42}$$

where $\tau(k+1)$ is the pheromone concentration at time $k+1$; $\rho$ is the volatility coefficient; $\tau(k)$ is the pheromone concentration at time $k$; $Q$ is the pheromone intensity; and $T_{max}$ is the optimal solution obtained by the ant colony:

$$T_{max} = \max_i \{T_{ant}^i\} \tag{2.43}$$

where $T_{ant}^i$ is the evaluation error, given by Eq. (2.41), of the feature subset $s_i = (e_1^i, e_2^i, \ldots, e_{n_i}^i)$, $i = 1, 2, \ldots, r$, and the corresponding optimal parameters $v_j^*$, $j = 1, \ldots, m$, as evaluated by the Support vector machine.

(2) Local update criterion


The goal of the local update criterion is to reduce the pheromone concentration of features that were selected by the ant colony but did not achieve good results, while maintaining the pheromone concentration of features not selected by the colony. The local update criterion thus reduces the probability that ants select features that have not achieved good results, and at the same time keeps the pheromone of unselected features from decaying, which increases the probability that ants will try features that have not been selected yet. The local pheromone update criterion is:

$$\tau(k+1) = (1 - \alpha_0)\tau(k) + \alpha_0 \tau_0 \tag{2.44}$$

where $\alpha_0$ ($0 < \alpha_0 < 1$) is the local pheromone update coefficient and $\tau_0$ is the initial pheromone value. By using the global and local update criteria, the pheromone concentration of each feature in the optimal feature subset is increased, that of each feature selected by the ant colony but not producing the optimal solution is decreased, and that of features not selected by the colony remains constant. This encourages the ant colony to keep selecting the features that have produced the optimal solution in subsequent iterations, while still constructing new candidate subsets from features that have not yet been selected.
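The two update criteria can be sketched over a pheromone vector with one entry per feature; $\rho$, $Q$, $\alpha_0$ and $\tau_0$ follow the notation above, while the array bookkeeping is our own.

import numpy as np

def update_pheromone(tau, best_subset, selected, rho, Q, T_max, alpha0, tau0):
    tau = tau.copy()
    best = np.asarray(best_subset)
    # global criterion, Eq. (2.42): reinforce the features of the optimal subset
    tau[best] = (1 - rho) * tau[best] + Q * T_max
    # local criterion, Eq. (2.44): decay features selected but not optimal
    others = np.setdiff1d(np.asarray(selected), best)
    tau[others] = (1 - alpha0) * tau[others] + alpha0 * tau0
    return tau    # features never selected keep their pheromone unchanged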

2.4.1.5 Termination Conditions

The ant colony optimization algorithm-based fusion method of Support vector machine feature selection and parameter optimization terminates when a feature subset and a set of optimal parameters make the generalization performance of the Support vector machine reach 100% accuracy, or when all features have been selected by the ant colony. When the termination condition is reached, the optimal feature subset and the optimal parameter combination are output.
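One plausible shape for the complete fusion loop, tying the sketches above together, is given below. The construct_subset and update_pheromone helpers are the hypothetical ones sketched earlier, the inner ACO tuning of $(C, \sigma)$ from Sect. 2.3.2 is stubbed out with a random score, the colony sizes and coefficients are illustrative, and rewarding with the best accuracy instead of the raw error of Eq. (2.43) is a deliberate simplification.

import numpy as np

rng = np.random.default_rng(1)

def evaluate_subset(subset):
    # stub: the full method trains an SVM on the selected features with
    # ACO-tuned (C, sigma) (Sect. 2.3.2) and returns its test accuracy
    return rng.uniform(0.8, 1.0)

tau = np.ones(19)                        # one pheromone level per feature
best_subset, best_acc = None, 0.0
for it in range(50):
    subsets = [construct_subset(tau, n_select=8, first_iteration=(it == 0))
               for _ in range(20)]       # 20 ants per iteration
    accs = [evaluate_subset(s) for s in subsets]
    i_best = int(np.argmax(accs))
    if accs[i_best] > best_acc:
        best_subset, best_acc = subsets[i_best], accs[i_best]
    selected = np.unique(np.concatenate(subsets))
    tau = update_pheromone(tau, best_subset, selected,
                           rho=0.5, Q=1.0, T_max=best_acc, alpha0=0.1, tau0=1.0)
    if best_acc >= 1.0:                  # Sect. 2.4.1.5: 100% accuracy terminates
        break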

2.4.2 The Application in Rotor Multi Fault Diagnosis of Bently Testbench

In order to verify the effectiveness of the proposed ant colony optimization algorithm-based fusion method of Support vector machine feature selection and parameter optimization in mechanical fault diagnosis, a Bently rotor multi-class fault diagnosis experiment is carried out.

2.4.2.1 Description of Test System

The Bently rotor is a general and concise model of rotating machinery that can simulate multiple types of vibration-related faults in large rotating machinery. This section uses the Bently rotor experimental platform to carry out the simulation experiment of rotor multi-class faults. The Bently rotor experiment platform is shown in Fig. 2.24: Fig. 2.24a is the physical picture of the platform, and Fig. 2.24b is its structure diagram. The Bently rotor experiment system mainly consists of a Bently rotor test bench (composed of a motor, two sliding bearings, a spindle, a rotor mass disk, a speed regulator and a signal regulator), sensors, and a Sony EX data acquisition system. The diameter of the shaft is 10 mm and its length is 560 mm; the mass of the rotor mass disk is 800 g and its diameter is 75 mm. The eddy current displacement sensor is installed on the mounting frame in the radial direction of the spindle, and the sampling frequency is 2000 Hz. The Bently rotor experiment is carried out under six different operating conditions and speeds (as shown in Table 2.10): mass unbalance (0.5 g eccentric mass), oil film whirl, slight radial friction of the rotor (the rotor is slightly rubbed near the right bearing by the friction rod supplied with the test bench), rotor crack (crack depth 0.5 mm), the compound fault of mass unbalance (0.5 g eccentric mass) and radial friction of the rotor, and the normal state.

Fig. 2.24 Bently rotor test stand: (a) physical diagram; (b) schematic diagram


Table 2.10 Bently rotor test bench fault state table

State type                                                      Speed (r/min)   Abbreviated name   Label
Mass unbalance                                                  1800            U                  1
Oil film whirl                                                  1900            W                  2
Radial friction of rotor                                        1800            R                  3
Shaft crack                                                     2000            S                  4
Compound fault of mass unbalance and radial friction of rotor   1800            C                  5
Normal                                                          2000            N                  6

2.4.2.2 Bently Rotor Fault Diagnosis Method

(1) Signal acquisition

The sensor and the Sony EX data acquisition system are used to collect the signals of the Bently rotor under the six different running states. The acquired time-domain signals are shown in Fig. 2.25. Although the time-domain waveforms reflect some abnormal characteristic information of the fault states, they are not enough to accurately reveal the fault characteristics of each state. In the experiment, 32 vibration signal samples are collected for each state, each containing 1024 sampling points.

(2) Feature extraction

From the rotor vibration signals collected on the Bently rotor experimental platform under the various running conditions (Fig. 2.25), it can be seen that the amplitude of the time-domain waveform under normal operation is small, while under the other fault conditions the amplitude increases or the waveform changes to a certain extent. From these vibration signals, the 19 time-domain and frequency-domain statistical features shown in Table 2.7 are extracted to characterize the vibration behavior of the Bently rotor under the different running states and to verify the effectiveness of the proposed fusion method of SVM feature selection and parameter optimization based on the ant colony optimization algorithm in rotor intelligent fault diagnosis.

(3) Fault diagnosis

The 19 statistical features shown in Table 2.7 are extracted from the vibration signal samples of each state. The first 16 samples of each state are taken as training samples of the Support vector machine, and the remaining 16 samples as test samples. A "one-to-one" multi-classification Support vector machine is constructed as the basic learner, and the proposed ant colony optimization algorithm is then used to carry out the feature selection and parameter optimization of the Support vector machine.

Fig. 2.25 Time-domain waveforms of the Bently rotor vibration signals under the six operating states: (a) mass unbalance; (b) oil film whirl; (c) radial friction of rotor; (d) normal; (e) compound fault of mass unbalance and radial friction of rotor; (f) shaft crack


2.4.2.3 Results and Analysis

The experimental results of the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm (Method 1), the feature selection method of the Support vector machine based on the ant colony optimization algorithm (Method 2), and the fusion method of feature selection and parameter optimization of the Support vector machine based on the ant colony optimization algorithm (Method 3) are shown in Table 2.11.

Since the fault classes are numerous and difficult to distinguish, the Support vector machine that only optimizes its parameters with the ant colony optimization algorithm (Method 1) obtains the optimal parameters ($C = 63.49$, $\sigma = 37.64$) within the parameter interval $C, \sigma \in [2^{-10}, 2^{10}]$; its accuracy in identifying the six operating states of the Bently rotor experiment is 97.92%, and the normal state of the rotor is easily misdiagnosed. The Support vector machine that uses the ant colony optimization algorithm for feature selection (Method 2) uses the optimal parameters obtained in Method 1 and selects the optimal features, but there is no consistent optimal matching between the two, so the generalization performance of the Support vector machine is not improved and its fault diagnosis accuracy equals that of Method 1. This shows that the sample features and the Support vector machine parameters jointly affect the performance of the Support vector machine: optimizing only one of them cannot bring the generalization performance to its optimum, and only by obtaining the best mutually matching features and parameters synchronously can the generalization performance be further improved. It is therefore necessary to use an optimization algorithm to optimize the features and parameters of the Support vector machine together. The fusion method of feature selection and parameter optimization of the Support vector machine based on the ant colony optimization algorithm (Method 3) obtains the mutually matching best features ($F_1, F_3, F_6, F_9, F_{10}$-$F_{14}, F_{19}$) and best parameters ($C = 50.69$, $\sigma = 0.39$) at the same time, so the test accuracy is 100%, and its fault diagnosis ability is better than that of Methods 1 and 2.

Table 2.11 Comparison of the Bently rotor multi-class fault diagnosis experiment results

Name       Optimal characteristics              Optimal parameters (C, σ)   Accuracy of each failure (%)           Average accuracy (%)
                                                                            U     W     R     S     C     N
Method 1   –                                    63.49, 37.64                100   100   100   100   100   87.5     97.92
Method 2   F1, F3, F5, F7, F9, F14, F17, F19    –                           100   100   100   100   100   87.5     97.92
Method 3   F1, F3, F6, F9, F10-F14, F19         50.69, 0.39                 100   100   100   100   100   100      100


Table 2.12 Bently rotor multi-class fault diagnosis experimental results Method name

Accuracy of each failure (%) U

W

R

S

C

N

Average accuracy (%)

C4.5 decision tree (principal component analysis for feature selection)

95

100

100

100

95

100

98.3

C4.5 decision tree (no feature selection)

100

100

95

100

95

100

98.3

Back propagation neural network (PCA for feature selection)

100

95

100

85

95

100

95.8

Back propagation neural network (UNFEATURE selection)

100

85

100

90

95

100

95

In 2007, Sun et al. from Shanghai Jiao Tong University used a C4.5 decision tree and principal component analysis to carry out fault diagnosis research on six rotor operating states similar to those in this section (normal, mass unbalance, oil film whirl, radial friction of the rotor, compound fault of mass unbalance and radial friction of the rotor, and shaft crack) on the Bently rotor test bench [20]. Seven time-domain statistics (peak-to-peak value, waveform index, pulse index, peak index, margin index, skewness index and kurtosis index) and eleven frequency-domain features of the test signals under the various states were extracted, and a C4.5 decision tree and a back propagation neural network were used to realize intelligent fault diagnosis; the experimental results are shown in Table 2.12 [21]. Comparing the experimental results in Table 2.12 with those in Table 2.11 shows that the fault diagnosis method based on the C4.5 decision tree also obtains good results on the six rotor operating states, with an accuracy of 98.3%, although some diagnostic errors appear in identifying mass unbalance and the compound fault of mass unbalance and rotor radial friction. The method proposed in this section accurately identifies all six common rotor operating states and obtains the best diagnostic results.

2.4.3 The Application in Electrical Locomotive Rolling Bearing Multi Fault Diagnosis

The schematic diagram of the experimental test platform for the rolling bearing of the electric locomotive is shown in Fig. 2.20, and the experimental system is described in Sect. 2.3.4. The vibration signals of the locomotive bearings are collected by an accelerometer mounted under the loading module adjacent to the outer ring of the test bearing in


nine states: normal, minor abrasion fault on the outer ring, outer ring damage fault, inner ring abrasion fault, rolling element abrasion fault, compound fault of outer ring damage and inner ring abrasion, compound fault of outer ring damage and rolling element abrasion, compound fault of inner ring abrasion and rolling element abrasion, and compound fault of outer ring damage, inner ring abrasion and rolling element abrasion. These nine states cover the common fault classes, including complex compound faults and different degrees of damage for the same fault type; they are listed in Table 2.13. For each of the nine states, 32 vibration signal samples are collected, each containing 2048 points. Figure 2.26 shows the time-domain signals of the locomotive rolling bearings in the nine states. Observing the time-domain waveforms of the tested signals, we can find the following:

(1) The test signals of the locomotive bearing in all nine states are disturbed by noise, the fault information is submerged to different degrees, and the amplitude of the vibration signal in the normal state is smaller than in the eight fault states.
(2) When the outer ring of the bearing fails, vibration shocks occur as the rolling elements pass over the damaged position of the outer ring; the time-domain waveforms in Fig. 2.26b, c therefore show certain shock characteristics, and the amplitude for the minor outer ring abrasion (Fig. 2.26b) is smaller than that for the severe outer ring damage fault (Fig. 2.26c).
(3) The time-domain waveform in Fig. 2.26d shows some impact characteristics when the inner ring of the bearing is damaged, but the amplitude modulation phenomenon is not obvious.

Table 2.13 Locomotive rolling bearing nine types of fault state description

State type                                                                       Abbreviated name   Label
Normal                                                                           N                  1
Minor abrasion fault on outer ring                                               O                  2
Outer ring damage fault                                                          S                  3
Inner ring abrasion fault                                                        I                  4
Rolling element abrasion fault                                                   R                  5
Compound fault of outer ring damage and inner ring abrasion                      OI                 6
Compound fault of outer ring damage and rolling element abrasion                 OR                 7
Compound fault of inner ring abrasion and rolling element abrasion               IR                 8
Compound fault of outer ring damage, inner ring abrasion and rolling element
abrasion                                                                         OIR                9


(4) When the rolling element fails, the vibration signal waveform in Fig. 2.26e shows shock characteristics in the time domain, because in each rotation cycle the rolling element contacts the raceway surface of the inner ring and the rolling surface of the outer ring once each, and amplitude modulation occurs.
(5) When compound faults occur among the outer ring, inner ring and rolling element of the bearing, the time-domain waveform of the vibration signal shows different impact characteristics. When the compound fault of outer ring damage and rolling element abrasion or the compound fault of inner ring abrasion and rolling element abrasion occurs, amplitude modulation appears in the time-domain waveform, as shown in Fig. 2.26f-i.

The spectra of the vibration signals of the locomotive rolling bearings in the nine states are shown in Fig. 2.27. The characteristic frequency information of the different bearing fault classes is completely submerged in the signal, and it is not easy to recognize the characteristics of each state from the spectrum.

The data sets obtained in the nine states (normal; minor abrasion fault on the outer ring; outer ring damage fault; inner ring abrasion fault; rolling element abrasion fault; compound fault of outer ring damage and inner ring abrasion; compound fault of outer ring damage and rolling element abrasion; compound fault of inner ring abrasion and rolling element abrasion; and compound fault of outer ring damage, inner ring abrasion and rolling element abrasion) are taken as the diagnosis objects. According to Table 2.7, 19 time-domain and frequency-domain statistical features are extracted from the vibration signals, and a "one-to-one" multi-classification Support vector machine is constructed as the basic learning machine; 16 feature sample sets of each state are used to train the Support vector machine, and the remaining 16 are used to test its performance.

The fault diagnosis results of the parameter optimization method of the Support vector machine based on the ant colony optimization algorithm (Method 1), the feature selection method of the Support vector machine based on the ant colony optimization algorithm (Method 2), and the fusion method of feature selection and parameter optimization of the Support vector machine based on the ant colony optimization algorithm (Method 3) are shown in Table 2.14. Method 1 uses the ant colony optimization algorithm to obtain the optimal parameters of the Support vector machine ($C = 57.60$, $\sigma = 64.26$) within the parameter interval $C, \sigma \in [2^{-10}, 2^{10}]$, giving an average recognition accuracy of 89.58% for the nine locomotive rolling bearing fault states. Although the optimal parameters obtained in Method 1 are used in Method 2, the generalization performance of the Support vector machine is not improved, because these parameters do not match the features optimized in Method 2 synchronously. The average fault diagnosis accuracy of Method 2 is the same as that of Method 1, indicating that the sample features and the parameters of the Support vector machine jointly affect its performance, and the best features and parameters


Fig. 2.26 Time domain waveforms of vibration signals of locomotive rolling bearings in nine states: (a) normal; (b) minor abrasion fault on outer ring; (c) outer ring damage fault; (d) inner ring abrasion fault; (e) rolling element abrasion fault; (f) compound fault of outer ring damage and inner ring abrasion; (g) compound fault of outer ring damage and rolling element abrasion; (h) compound fault of inner ring abrasion and rolling element abrasion; (i) compound fault of outer ring damage, inner ring abrasion and rolling element abrasion

The best features and parameters, matched to each other, should therefore be obtained synchronously to comprehensively improve the generalization performance of the support vector machine. Accordingly, the proposed fusion method of feature selection and parameter optimization of the support vector machine based on the ant colony optimization algorithm (Method 3) improves the diagnostic capability of the support vector machine by simultaneously selecting the optimal features F1 ∼ F18 and the optimal parameters (C = 1.02, σ = 0.04), achieving an average fault diagnosis accuracy of 95.83%.

Table 2.14 also shows the following: Method 1, Method 2 and Method 3 identify the normal state, the inner ring fault state and the rolling element fault state of the locomotive bearings with the same capability, all reaching 100% test accuracy, which indicates that the three support vector machine based methods have the same diagnostic capability for simple fault states.


Fig. 2.27 Vibration signal spectrum of locomotive rolling bearings in nine states: (a) normal; (b) minor abrasion fault on outer ring; (c) outer ring damage fault; (d) inner ring abrasion fault; (e) rolling element abrasion fault; (f) compound fault of outer ring damage and inner ring abrasion; (g) compound fault of outer ring damage and rolling element abrasion; (h) compound fault of inner ring abrasion and rolling element abrasion; (i) compound fault of outer ring damage, inner ring abrasion and rolling element abrasion

The proposed fusion method (Method 3) also effectively improves the identification of the compound faults of the outer and inner rings, of the outer ring and rolling element, of the inner ring and rolling element, and of the outer ring, inner ring and rolling element together. This shows that, by solving for the matching optimal sample features and optimal parameters of the support vector machine in one pass, the fusion method of feature selection and parameter optimization based on the ant colony optimization algorithm can further improve the generalization performance of the support vector machine and thus enhance the diagnosis of complex faults.

Table 2.14 Comparison of multi-fault diagnosis results of locomotive rolling bearings

| Name | Optimal parameters (C, σ) | Optimal features | N | O | S | I | R | OI | OR | IR | OIR | Average accuracy (%) |
| Method 1 | (57.60, 64.26) | – | 100 | 100 | 87.5 | 100 | 100 | 75 | 75 | 75 | 93.75 | 89.58 |
| Method 2 | (57.60, 64.26) | F2, F4, F6, F13, F14, F19 | 100 | 100 | 87.5 | 100 | 100 | 75 | 75 | 75 | 93.75 | 89.58 |
| Method 3 | (1.02, 0.04) | F1 ∼ F18 | 100 | 93.75 | 93.75 | 100 | 100 | 87.5 | 93.75 | 93.75 | 100 | 95.83 |

(Columns N through OIR give the accuracy of each failure state in %.)


However, it is worth noting that while the fusion method (Method 3) improves the identification of the severe outer ring fault and of the compound faults of the outer and inner rings, the outer ring and rolling element, and the outer ring, inner ring and rolling element, the support vector machine's ability to identify the minor outer ring abrasion fault decreases, because that fault is similar to the fault types above.

To address the coupled effect of sample features and parameters on the support vector machine, a fusion method of support vector machine feature selection and parameter optimization based on the ant colony optimization algorithm has been proposed, its algorithm structure and flow have been constructed, and it has been applied to the fault diagnosis of the Bently rotor and of the rolling bearings of an electric locomotive. The following conclusions can be drawn from the experimental results:

(1) The fusion method uses the ant colony optimization algorithm to solve the feature selection and parameter optimization problems of the support vector machine at the same time; the optimal features and parameters are obtained synchronously, which improves the generalization performance of the support vector machine and yields better fault diagnosis results.

(2) The fusion method can more effectively identify multiple complex fault states, including compound faults, whereas support vector machines that only optimize their parameters or only select their features with the ant colony optimization algorithm have limited diagnostic ability for complex fault states.

(3) Because the time- and frequency-domain statistical features of the vibration signal require little specialist knowledge and experience and are easy to compute, the statistical features of the Bently rotor and locomotive bearing vibration signals in the time domain and frequency domain are extracted as the input sample features. If more advanced fault feature extraction techniques (such as wavelet analysis) were used to provide more effective fault features, the ability to diagnose complex fault states could be further improved.
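The conclusions above hinge on solving feature selection and parameter optimization jointly rather than separately. As a rough illustration of that coupling (not the book's ant colony implementation), the sketch below samples a feature mask and a parameter pair (C, σ) together over the same interval [2^{-10}, 2^{10}] and keeps the combination with the best cross-validated accuracy; the synthetic data, the random subset-sampling scheme, and the trial budget are all assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def joint_random_search(X, y, n_trials=100):
    # Sample a feature subset and (C, sigma) *together* and keep the pair
    # that maximizes cross-validated accuracy (a toy stand-in for ACO).
    best_score, best_mask, best_params = -np.inf, None, None
    for _ in range(n_trials):
        mask = rng.random(X.shape[1]) < 0.5          # candidate feature subset
        if not mask.any():
            continue
        C = 2.0 ** rng.uniform(-10, 10)              # interval [2^-10, 2^10]
        sigma = 2.0 ** rng.uniform(-10, 10)
        gamma = 1.0 / (2.0 * sigma ** 2)             # sklearn RBF: exp(-gamma*||u-v||^2)
        clf = SVC(C=C, gamma=gamma, decision_function_shape="ovo")  # one-vs-one
        score = cross_val_score(clf, X[:, mask], y, cv=3).mean()
        if score > best_score:
            best_score, best_mask, best_params = score, mask, (C, sigma)
    return best_score, best_mask, best_params

# synthetic stand-in: 9 bearing states x 16 training samples, 19 features each
y = np.repeat(np.arange(9), 16)
X = rng.normal(size=(len(y), 19)) + y[:, None] * 0.5
score, mask, (C, sigma) = joint_random_search(X, y)
print(f"best CV accuracy {score:.3f} with {mask.sum()} features, "
      f"C={C:.2f}, sigma={sigma:.2f}")
```

The key design point mirrored here is that the feature mask and the kernel parameters are scored as a single candidate solution, so only mutually matched combinations can win.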

2.5 Ensemble-Based Incremental Support Vector Machines

The generalization performance of a support vector machine classifier (or predictor) depends mainly on the availability of limited samples and on the adequacy of the sample information. In practical applications, collecting representative training samples is expensive and time-consuming, and it is not easy to accumulate a sufficient number of training samples within a given period. At the same time, how to fully mine and use the knowledge and rules contained in limited sample information to improve the generalization performance of a classifier (or predictor) is a perennial goal of machine learning.


According to statistical learning theory, when the training samples are very limited, there is a risk that the classifier (or predictor) will have low prediction accuracy on unknown samples. To improve the generalization ability of support vector machines, the theory of ensemble learning, which trains and combines multiple classifiers (or predictors), and the theory of reinforcement learning, which learns adaptively, have become main research directions in machine learning in recent years; studying how to improve the structure of the support vector machine algorithm is therefore an essential route to better generalization performance.

Research results on ensemble learning continue to emerge. The main construction strategies of ensemble learning algorithms include Bagging, AdaBoost, Grading and Decorrelated ensembles. Regarding the type of basic classifier, there are two different ensemble strategies: ensembles of the same classifier and ensembles of different classifiers. The basic classifiers can also be composed in different ways, mainly selection-oriented and combiner-oriented ensembles, and the construction mode can be serial, parallel or hierarchical. Research shows that ensemble learning can improve the generalization performance of classifiers (predictors) and, to some extent, overcome the following problems:

(1) The few-shot learning problem. When the number of training samples is sufficient, many machine learning methods can construct an optimal classifier (predictor) with excellent generalization performance. But when the training samples are limited, machine learning algorithms can only construct many classifiers (predictors) whose prediction accuracies are poorly consistent; although the structural complexity of such a classifier (predictor) is low, there is a great risk of poor performance on unknown samples. When multiple single basic classifiers (predictors) are integrated by an ensemble learning method, however, the generalization performance can exceed that of any single basic classifier (predictor).

(2) The generalization performance problem. When data are limited, the search space of a machine learning algorithm is a function of the available training data, which may be much smaller than the hypothesis space considered in the asymptotic case. Ensemble learning can expand the function space and obtain a more accurate approximation and prediction of the target function, thus promoting the generalization performance of the classifier (predictor).

Since the support vector machine algorithm ultimately reduces to solving a linearly constrained quadratic programming (QP) problem, a kernel function matrix whose size grows with the square of the number of training samples must be computed and stored. When an ensemble learning algorithm trains several single basic classifiers (predictors) and constructs ensemble-based classifiers (predictors), the training and learning tasks become more complex and larger in scale; this calls for learning new knowledge and updating the trained classifier (predictor) in a reinforcement learning manner, so as to ensure good generalization performance with high learning efficiency.

Reinforcement learning is an adaptive learning method that takes feedback as input and, through interaction with the environment, obtains uncertain rewards and eventually an optimal behavior strategy. Because of its online and adaptive learning characteristics, reinforcement learning generalizes well in large spaces and complex nonlinear systems; it is therefore becoming an effective tool for intelligent strategy optimization and is increasingly widely used in practice. In 2007, the American scholar Parikh et al. [20, 22] proposed a data fusion method based on ensemble-based incremental learning, which searches for the most discriminative information across multiple datasets through ensemble-based incremental classifiers and builds the ability to continuously learn new knowledge from a variety of data sources. Therefore, combining the advantages of ensemble learning and reinforcement learning, and aiming to improve the generalization performance of support vector machines, a method of ensemble-based incremental support vector machines is proposed on the basis of ensemble learning and reinforcement learning theory, so as to improve the generalization of support vector machines from the perspective of the machine learning framework and the algorithm construction.
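As a side note on the kernel-matrix scaling mentioned above, the short Python sketch below (illustrative only, not from the book) builds an RBF Gram matrix and prints how its float64 storage grows quadratically with the sample count m.

```python
import numpy as np

def rbf_gram(X, sigma):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

# memory for the full Gram matrix grows with the *square* of the sample count
for m in (1_000, 10_000, 50_000):
    print(f"m = {m:>6}: {8 * m * m / 1e9:6.2f} GB of float64 storage")

K = rbf_gram(np.random.randn(500, 19), sigma=1.0)    # 500 samples, 19 features
print(K.shape)                                       # (500, 500)
```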

2.5.1 The Theory of Ensemble Learning

There is an objective fact in the field of machine learning: it is much easier to find a large number of rough empirical rules than to find a single highly accurate prediction rule. Although it is difficult to establish a highly accurate prediction rule directly, inducing a more accurate rule indirectly from a large number of rough empirical rules is an achievable goal. The basic idea of ensemble learning is therefore to first find a large number of rough empirical rules with a weak learning algorithm; the weak learning algorithm is then applied repeatedly, each time to a training set with a different weight distribution, generating a new empirical rule on each cycle; after several cycles, the ensemble learning algorithm combines the empirical rules generated over the cycles into a final rule. In recent years, ensemble learning theory has been successfully applied in machine learning to improve the performance of a single classifier (learner).

Ensemble support vector machines were first proposed by Vapnik [1]: boosting is used to train each single support vector machine, and these single machines are then combined by another support vector machine. Suppose there is a set of n single support vector machines {f_1, f_2, ..., f_n}. If every single support vector machine performs identically, the ensemble performs the same as each individual machine. However, if their performances differ and their errors are uncorrelated, then for a sample x misclassified by one machine f_i(x), the recognition results of most of the other machines may still be correct. More precisely, for a binary classification problem, since the error probability of a random guess is 1/2, assume that the error probability of each single support vector machine is p < 1/2. Then the error of the ensemble support vector machines established by the "majority voting method" is

P_E = \sum_{k=\lceil n/2 \rceil}^{n} p^k (1 - p)^{n-k}    (2.45)

Since p < 1/2, P_E < \sum_{k=\lceil n/2 \rceil}^{n} (1/2)^k (1/2)^{n-k} = \sum_{k=\lceil n/2 \rceil}^{n} (1/2)^n. When the number of single support vector machines is large, the error of the ensemble therefore becomes very small. Since a single support vector machine that guesses at random has error probability 1/2, any single support vector machine better than a random guess drives the error of the final ensemble down substantially.

To overcome the limitation of a single support vector machine on multi-class problems in engineering applications, the ensemble support vector machines take multi-class support vector machines composed of k(k − 1)/2 binary support vector machines as the basic classifiers, and use an ensemble learning algorithm to integrate multiple multi-class support vector machines so as to improve generalization performance. The algorithm diagram of the ensemble support vector machines is shown in Fig. 2.28.

The ultimate goal of ensemble learning is to improve the generalization performance of learning algorithms. Because of its huge potential and application prospects, ensemble learning has become one of the most important research directions in machine learning, and was ranked by the international scholar Dietterich as the first of the four main research directions in the field of machine learning [23].

Fig. 2.28 The algorithm diagram of ensemble support vector machines


However, how to explore effective new ensemble learning methods and apply them in engineering practice remains one of the key open problems in ensemble learning of support vector machines.
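To see the effect of majority voting numerically, the sketch below evaluates the ensemble error for several ensemble sizes. It uses the standard binomial form, which also weights each term in Eq. (2.45) with the coefficient C(n, k); that coefficient and the illustrative value p = 0.3 are additions relative to the formula as printed, and independence of the individual errors is assumed.

```python
from math import comb

def majority_vote_error(p, n):
    # probability that more than half of n independent classifiers,
    # each wrong with probability p, are wrong simultaneously
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 11, 21, 51):
    print(f"n = {n:2d}: ensemble error = {majority_vote_error(0.3, n):.6f}")
# with p = 0.3 < 1/2 the ensemble error shrinks rapidly as n grows
```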

2.5.2 The Theory of Reinforcement Learning

The real world is full of massive data and information containing rich potential knowledge waiting to be explored; at the same time, data and information are updated at an astonishing rate, so information technology must overcome the curse of dimensionality while still showing excellent generalization ability. Reinforcement learning is an adaptive learning method that takes feedback as input: through interaction with the environment, uncertain rewards are obtained, and an optimal behavior strategy is finally learned. Owing to its online and adaptive learning characteristics, reinforcement learning is an effective tool for intelligent strategy optimization, and is gradually becoming one of the key information technologies for building a "smart earth".

The structural framework of the standard reinforcement learning algorithm is shown in Fig. 2.29; it consists mainly of a state perceptron, a learner and a behavior selector. The state perceptron completes the mapping from the external environment to the agent's internal perception. The learner updates the agent's strategy knowledge according to the observed environment state and the reward value. The behavior selector chooses behaviors based on the agent's current policy knowledge and acts on the external environment. The basic principle of reinforcement learning is: if a behavior leads to a positive reward from the environment, the tendency to produce this behavior is strengthened in the future; conversely, the tendency weakens [24].

According to this principle, the goal of reinforcement learning is to learn a behavioral strategy that obtains the maximum reward from the environment.

Fig. 2.29 The algorithm diagram of reinforcement learning


Therefore, its objective function is usually expressed in one of the following three forms:

V^d(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}, \quad 0 < \gamma \le 1    (2.46)

V^d(s_t) = \sum_{i=0}^{h} r_{t+i}    (2.47)

V^d(s_t) = \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}    (2.48)

where γ is the discount factor and r_t is the reward received from the environment after the transition from state s_t to s_{t+1}. Equation (2.46) is the infinite discounted reward model; Eq. (2.47) is the finite-horizon reward model, which only considers the rewards of the next h steps; and Eq. (2.48) is the average reward model. According to the objective function, the optimal behavior strategy can be determined as

d^* = \arg\max_{d} V^d(s), \quad \forall s \in S    (2.49)

where d is a behavior strategy, V^d is the objective function, s is an environment state, and S is the set of environment states.
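A one-line numeric illustration of the discounted objective in Eq. (2.46), truncated to a finite horizon (the truncation is the only assumption):

```python
def discounted_return(rewards, gamma=0.9):
    # finite-horizon evaluation of Eq. (2.46): sum_i gamma^i * r_{t+i}
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0 + 0.81 * 2 = 2.62
```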

A simple and common reinforcement learning scheme proceeds as follows (see Fig. 2.30). According to the initial conditions, the training samples are randomly divided into independent training subsets sub_1, sub_2, ..., sub_N. Subset sub_1 is used to train the learner, yielding the current objective function V_1 and the current optimal behavior strategy (behavior 1). Using V_1 and behavior 1, the learner is retrained together with subset sub_2, yielding the objective function V_2 and the optimal behavior strategy (behavior 2).

Fig. 2.30 The algorithm diagram of a reinforcement support vector machine


The process is repeated in the same way, combining subset sub_3 with V_2 and behavior 2, and so on, until subset sub_N is reached: retraining the learner with V_{N−1}, behavior N − 1 and subset sub_N yields the final learner, which is the terminal target.
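The subset-by-subset retraining loop can be sketched as follows. Since standard SVMs in scikit-learn do not support incremental fitting, an SGDClassifier with hinge loss stands in for the learner (an assumption made purely for illustration); the subsets sub_1, ..., sub_N and the synthetic data are likewise illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 19))
y = rng.integers(0, 3, size=300)

# randomly split the training data into N independent subsets sub_1..sub_N
N = 5
subsets = np.array_split(rng.permutation(len(X)), N)

# a linear learner trained incrementally: each subset refines the model
# obtained from the previous ones, mirroring the loop described above
clf = SGDClassifier(loss="hinge")            # hinge loss ~ linear SVM
classes = np.unique(y)
for k, sub in enumerate(subsets, start=1):
    clf.partial_fit(X[sub], y[sub], classes=classes)
    print(f"after sub_{k}: training accuracy = {clf.score(X, y):.3f}")
```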

2.5.3 Ensemble-Based Incremental Support Vector Machines

Combining the advantages of ensemble learning and reinforcement learning, and aiming at improving the generalization performance of support vector machines, this section proposes the Ensemble-based Incremental Support Vector Machines (EISVM) built on the theoretical frameworks of ensemble learning and reinforcement learning. By fully mining the knowledge contained in the limited sample space, the method improves the generalization ability of support vector machines at the level of both the machine learning framework and the algorithm construction.

The goal of ensemble reinforcement learning is to improve the generalization performance of support vector machines. In the process of ensemble reinforcement learning, a single support vector machine serving as the basic learner is regarded as a hypothesis h from input space X to output space Y. The parameters of each single support vector machine can be obtained with the ant colony optimization algorithm proposed in Sect. 2.3. For each iteration t = 1, 2, ..., T_k, the dataset S_k (k = 1, 2, ..., n) is divided into a training subset TR_t and a testing subset TE_t according to the current distribution D_t. Using TR_t, a single support vector machine generates a hypothesis h_t: X → Y. The distribution D_t is obtained from the weights assigned to the samples according to the classification performance of the single support vector machine on each sample: in general, samples that are difficult to classify correctly receive higher weights, which increases the probability that they are selected into the next training subset. The distribution D_1 of the initial iteration is set to 1/m (m is the number of samples in dataset S_k), giving every sample the same probability of entering the first training subset; if prior knowledge suggests otherwise, the initial distribution can be customized. The error of the single support vector machine h_t on dataset S_k (k = 1, 2, ..., n) is [20, 22]

\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)    (2.50)

where the error ε_t is the sum of the distribution weights of the misclassified samples. If ε_t > 1/2, the hypothesis h_t: X → Y generated by the current single support vector machine must be discarded;


a new training subset TR_t and testing subset TE_t are then constructed, and the support vector machine is retrained. This requirement means that a single support vector machine only has to achieve an error of 50% (or less) on the dataset. Since an error of one half in a binary classification problem corresponds to random guessing, this is the loosest possible condition for a binary problem. For a classification problem with N classes, however, the error probability of a random guess is (N − 1)/N, so guaranteeing an error of 50% or less becomes harder as N grows. If ε_t < 1/2 is satisfied, the regularized error β_t is calculated as [20, 22]

\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}    (2.51)

All hypotheses h_t: X → Y generated by single support vector machines in the previous t iterations are then combined by the maximum-weight voting method. The voting weight equals the logarithm of the reciprocal of the regularized error β_t, so hypotheses that perform well on their own training and testing subsets receive more votes. The composite classification hypothesis H_t is obtained from the combination of the single classification hypotheses h_t: X → Y [20, 22]:

H_t = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log \frac{1}{\beta_t}    (2.52)

The classification performance of the composite hypothesis H_t depends on the hypotheses obtaining the largest vote among the t single classification hypotheses. The error of the composite hypothesis H_t is [20, 22]

E_t = \sum_{i:\, H_t(x_i) \neq y_i} D_t(i) = \sum_{i=1}^{m} D_t(i) \, [\![ H_t(x_i) \neq y_i ]\!]    (2.53)

where [\![ \cdot ]\!] equals 1 when the condition holds and 0 otherwise. If E_t > 1/2, the current hypothesis h_t is discarded, a new training subset and testing subset are reconstructed, and a new hypothesis h_t: X → Y is generated by a single support vector machine. Note that when a dataset S_{k+1} carrying new information arrives, the composite error E_t may exceed 1/2; in other cases, since all single hypotheses h_t: X → Y constituting the composite hypothesis H_t have already been checked through Eq. (2.50) to ensure an error of at most 50% on dataset S_k, the condition E_t < 1/2 can almost always be satisfied. If E_t < 1/2, the regularized composite error is calculated as [20, 22]

B_t = \frac{E_t}{1 - E_t}    (2.54)


The weights ω_{t+1}(i) are then updated and the next distribution D_{t+1} is computed according to the composite hypothesis H_t generated in the ensemble learning process. This distribution update rule is the key to ensemble reinforcement learning [20, 22]:

\omega_{t+1}(i) = \omega_t(i) \times \begin{cases} B_t, & \text{if } H_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases} = \omega_t(i) \times B_t^{\,1 - [\![ H_t(x_i) \neq y_i ]\!]}    (2.55)

D_{t+1} = \frac{\omega_{t+1}(i)}{\sum_{i=1}^{m} \omega_{t+1}(i)}    (2.56)

According to this rule, if sample x_i is correctly classified by the composite hypothesis H_t, its weight is multiplied by a factor B_t less than 1; if it is misclassified, its weight remains unchanged. This update reduces the probability that correctly classified samples are selected into the next training subset TR_{t+1} and increases the probability that currently misclassified samples are selected, so the ensemble-based incremental support vector machines concentrate on the samples that are repeatedly misclassified. In particular, when samples of new types enter the sample set, the current composite hypothesis H_t is especially prone to misclassifying them, and Eq. (2.55) guarantees that such samples are selected into the next training set, which ensures the feasibility of ensemble reinforcement learning. After the T_k classification hypotheses of each data subset S_k have been generated, the final classification hypothesis of the ensemble-based incremental support vector machines is output from all composite hypotheses [20, 22]:

H_{final}(x) = \arg\max_{y \in Y} \sum_{k=1}^{K} \left( \sum_{t:\, H_t(x) = y} \log \frac{1}{\beta_t} \right)    (2.57)

It can be seen that during ensemble reinforcement learning, no original knowledge is lost when new samples arrive, because all historical classification hypotheses h_t: X → Y of the support vector machines are retained. The ensemble-based incremental support vector machines can thus inherit previously learned knowledge while continuing to learn new knowledge from new samples, comprehensively improving the generalization performance of support vector machines. The algorithm structure of the ensemble-based incremental support vector machines is shown in Fig. 2.31, and the algorithm flow is given in Table 2.15.


Fig. 2.31 The algorithm structure diagram of the ensemble-based incremental support vector machines

Table 2.15 The algorithm flow of the ensemble-based incremental support vector machines

Input: datasets S_k = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, k = 1, 2, ..., n; integers T_k, the number of support vector machines to be generated.
Do for k = 1, 2, ..., n:
  Initialize ω_1(i) = 1/m, where m is the total number of samples in sample set S_k.
  Do for t = 1, 2, ..., T_k:
    (1) Let D_t = ω_t(i) / Σ_{i=1}^m ω_t(i); D_t is a distribution.
    (2) Select training subset TR_t and testing subset TE_t according to distribution D_t.
    (3) Use a single support vector machine to generate a hypothesis h_t: X → Y and compute its error on the dataset S_t = TR_t + TE_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i). If ε_t > 1/2, go to step (2); otherwise compute the regularized error β_t = ε_t / (1 − ε_t).
    (4) Use the maximum-weight voting method to obtain the composite hypothesis H_t = arg max_{y∈Y} Σ_{t: h_t(x)=y} log(1/β_t), and compute the composite error E_t = Σ_{i: H_t(x_i) ≠ y_i} D_t(i) = Σ_{i=1}^m D_t(i) [[H_t(x_i) ≠ y_i]].
    (5) Let B_t = E_t / (1 − E_t) (the regularized composite error) and update the sample weights: ω_{t+1}(i) = ω_t(i) × B_t if H_t(x_i) = y_i, and ω_{t+1}(i) = ω_t(i) otherwise.
  End
End
Output: the final classification hypothesis H_final(x) = arg max_{y∈Y} Σ_{k=1}^K Σ_{t: H_t(x)=y} log(1/β_t).
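For concreteness, the following is a minimal Python sketch of the flow of Table 2.15, in the spirit of the Learn++-style algorithm of [20, 22]. The RBF base learner, sampling with replacement from D_t, the numerical floors on β_t and B_t, and the synthetic test data are simplifying assumptions rather than the book's exact implementation, and the discard-and-retry branches are not capped.

```python
import numpy as np
from sklearn.svm import SVC

def weighted_vote(hyps, votes, X, classes):
    # composite hypothesis H_t / H_final: class with the largest total log(1/beta) vote
    scores = np.zeros((len(X), len(classes)))
    for h, v in zip(hyps, votes):
        pred = h.predict(X)
        for ci, c in enumerate(classes):
            scores[pred == c, ci] += v
    return classes[np.argmax(scores, axis=1)]

def eisvm_fit(datasets, T=3, seed=0):
    rng = np.random.default_rng(seed)
    hyps, votes = [], []
    classes = np.unique(np.concatenate([y for _, y in datasets]))
    for X, y in datasets:                       # incoming datasets S_1 .. S_n
        m = len(X)
        w = np.full(m, 1.0 / m)                 # initial weights omega_1(i) = 1/m
        t = 0
        while t < T:
            D = w / w.sum()                     # distribution D_t (step 1)
            tr = rng.choice(m, size=m, replace=True, p=D)   # draw TR_t from D_t
            h = SVC(kernel="rbf", gamma="scale").fit(X[tr], y[tr])
            eps = D[h.predict(X) != y].sum()    # error eps_t, Eq. (2.50)
            if eps > 0.5:                       # discard h_t and resample (step 3)
                continue
            beta = max(eps / (1.0 - eps), 1e-10)            # Eq. (2.51), floored
            hyps.append(h); votes.append(np.log(1.0 / beta))
            Ht = weighted_vote(hyps, votes, X, classes)     # Eq. (2.52)
            E = D[Ht != y].sum()                # composite error, Eq. (2.53)
            if E > 0.5:                         # discard composite and retry
                hyps.pop(); votes.pop()
                continue
            B = max(E / (1.0 - E), 1e-10)       # Eq. (2.54), floored
            w = np.where(Ht == y, w * B, w)     # weight update, Eq. (2.55)
            t += 1
    return lambda Xq: weighted_vote(hyps, votes, Xq, classes)

# two synthetic, roughly separable "incremental" datasets
rng = np.random.default_rng(1)
def make(n):
    y = rng.integers(0, 3, n)
    return rng.normal(size=(n, 19)) + 2.0 * y[:, None], y

predict = eisvm_fit([make(120), make(120)])
Xt, yt = make(60)
print("accuracy:", (predict(Xt) == yt).mean())
```

Note how the final predictor keeps every past hypothesis: this is the mechanism by which old knowledge is preserved while new datasets are absorbed.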


2.5.4 The Comparison Experiment Based on Rolling Bearing Incipient Fault Diagnosis

Since the ensemble-based incremental support vector machines are designed to improve the generalization performance of support vector machines, the rolling bearing early-fault experiments of the internationally known Case Western Reserve University (CWRU) bearing data center [25] are used for method comparison to verify their effectiveness. In addition, the ensemble-based incremental support vector machines are further applied to the fault diagnosis of locomotive rolling bearings, covering compound faults and different damage degrees of the same fault type.

2.5.4.1 Introduction to Experimental System

The rolling bearing fault data provided by the Case Western Reserve University (CWRU) bearing data center in the USA are often used by scholars around the world to verify proposed methods; the experimental data of this center are therefore used as a standard benchmark for method verification and comparative analysis. As shown in Fig. 2.32, the rolling bearing test bench contains a 1.5 kW motor (left), a torque sensor (center) and a dynamometer (right). The experimental bearings (the drive end bearings and the fan end bearings) support the motor shaft; their parameters and other information are listed in Table 2.16. The single-point early faults of the test bearings are produced by EDM. The fault diameters are 0.18 mm, 0.36 mm, 0.53 mm and 0.71 mm, and the fault depth is 0.28 mm; the fault degree is weak, so these are early faults. Table 2.17 lists the detailed fault parameters of the drive end and fan end bearings, covering different fault types under different speeds and fault diameters. Two acceleration sensors are installed on the motor at the drive end and fan end to measure the bearing vibration signals under the different fault states. The data acquisition system includes a high-frequency signal amplifier, and the sampling frequency is 12,000 Hz. Signal samples are collected under each fault state, the statistical characteristic quantities (19 in total) in the time domain and frequency domain are extracted from each sample according to Table 2.7, and the feature sets are input into the ensemble-based incremental support vector machines for training. The experimental results of the ensemble-based incremental support vector machines are then compared with those of other methods.
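The 19 statistical features of Table 2.7 are not reproduced here, but the sketch below computes a representative subset of time-domain and frequency-domain statistics from one 1024-point vibration segment; the exact feature set chosen is an assumption for illustration.

```python
import numpy as np

def stat_features(x, fs=12_000):
    # a representative subset of time/frequency statistical features
    # (the book's Table 2.7 lists 19; this selection is illustrative)
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "mean": np.mean(x),
        "rms": rms,
        "variance": np.var(x),
        "skewness": np.mean((x - x.mean()) ** 3) / np.std(x) ** 3,
        "kurtosis": np.mean((x - x.mean()) ** 4) / np.std(x) ** 4,
        "peak_factor": np.max(np.abs(x)) / rms,            # crest indicator
        "spectral_centroid": np.sum(freqs * spec) / np.sum(spec),
        "spectral_rms": np.sqrt(np.mean(spec ** 2)),
    }

segment = np.random.randn(1024)       # one 1024-point vibration segment
print(stat_features(segment))
```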


Fig. 2.32 The CWRU bearing test bench: (a) photograph of the test bench; (b) structural diagram of the test bench

Table 2.16 The bearing experimental parameters (size unit: mm)

| Parameter | Drive end bearing | Fan end bearing |
| Bearing designation | 6205-2RS JEM SKF, deep groove ball bearing | 6203-2RS JEM SKF, deep groove ball bearing |
| Diameter of inner ring | 25 | 17 |
| Diameter of outer ring | 52 | 40 |
| Thickness | 15 | 12 |
| Diameter of rolling element | 8 | 7 |
| Diameter of pitch circle | 39 | 29 |

2.5.4.2 Comparative Analysis of Experiment I

The same experimental analysis is carried out on the CWRU bearing data following the experimental process and parameter settings in [26]. Three types of fault signals, including rolling element fault, inner ring fault and outer ring fault (with the loading area concentrated in the 12:00 direction, i.e., vertically upward), are collected from the drive end bearings; the fault size is 0.18 mm.


Table 2.17 The list of bearing faults

| Bearings | Fault location | Fault diameter (mm) | Fault depth (mm) | Speed (r/min) | Motor load (kW) |
| – | No fault (normal) | – | – | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Drive end bearings | Outer ring | 0.18, 0.36, 0.53 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Drive end bearings | Inner ring | 0.18, 0.36, 0.53, 0.71 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Drive end bearings | Rolling element | 0.18, 0.36, 0.53, 0.71 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Fan end bearings | Outer ring | 0.18, 0.36, 0.53 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Fan end bearings | Inner ring | 0.18, 0.36, 0.53 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |
| Fan end bearings | Rolling element | 0.18, 0.36, 0.53 | 0.28 | 1797/1772/1750/1730 | 0/0.74/1.48/2.21 |

(Each listed fault diameter corresponds to one fault case; every case was tested at all four speed/load combinations.)


The experimental parameters are shown in Table 2.18. The sampling frequency is 12,000 Hz, and the time domain waveforms of the vibration signals are shown in Fig. 2.33. The frequency spectrums of the bearing vibration signals under the three fault states, obtained by the fast Fourier transform, are shown in Fig. 2.34.

The same data samples are used as in [26]: 50 signal subsamples are formed by intercepting 1024 points at a time from each signal under the three fault states, of which 70% are used as training samples and the remaining 30% as testing samples. Table 2.19 shows the intelligent fault diagnosis results of the ensemble-based incremental support vector machines and compares them with the results of seven methods given in [26] (discrete cosine transform, Daubechies wavelet, Symlets wavelet, Walsh transform, FFT, Walsh transform + rough set theory, and FFT + rough set theory).

Table 2.18 The description of experimental data including three types of fault states

| Bearing | Fault location | Fault diameter (mm) | Motor speed (r/min) | Motor load (kW) | Label |
| Drive end bearings | Outer ring | 0.18 | 1750 | 1.48 | 1 |
| Drive end bearings | Inner ring | 0.18 | 1750 | 1.48 | 2 |
| Drive end bearings | Rolling element | 0.18 | 1750 | 1.48 | 3 |
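The subsample construction described above can be sketched as follows; the synthetic records and the helper name segment are illustrative assumptions, while the 1024-point length, 50 subsamples per state and the 70/30 split follow the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def segment(signal, length=1024, n_segments=50):
    # cut a long vibration record into consecutive fixed-length subsamples
    assert len(signal) >= length * n_segments
    return signal[: length * n_segments].reshape(n_segments, length)

# one long record per fault state (synthetic stand-ins here)
records = {label: np.random.randn(60_000) for label in (1, 2, 3)}
X = np.vstack([segment(sig) for sig in records.values()])
y = np.repeat(list(records), 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7,
                                          stratify=y, random_state=0)
print(X_tr.shape, X_te.shape)   # (105, 1024) (45, 1024)
```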

Fig. 2.33 The time domain vibration signals of rolling bearings in three types of fault states: (a) 0.18 mm outer ring fault; (b) 0.18 mm inner ring fault; (c) 0.18 mm rolling element fault


Fig. 2.34 The frequency domain vibration signals of rolling bearings in three types of fault states: (a) 0.18 mm outer ring fault; (b) 0.18 mm inner ring fault; (c) 0.18 mm rolling element fault

It can be seen from Table 2.19 that the intelligent fault diagnosis methods based on the ensemble-based incremental support vector machines and on the general support vector machine both achieve the same good generalization performance (100% accuracy) on the three fault states (outer ring fault, inner ring fault and rolling element fault), i.e., they completely and effectively identify the three types of simple early bearing faults. The comparison with the other seven methods in the table also shows that, under the same experimental environment and data, the ensemble-based incremental support vector machines and the general support vector machine have better fault identification capability.

2.5.4.3 Comparative Analysis of Experiment II

To further verify the intelligent fault diagnosis method based on the ensemble-based incremental support vector machines proposed in this section, the experimental process and parameters of [27] are followed (see Table 2.20). Three types of fault signals, including rolling element fault, inner ring fault and outer ring fault (with the loading area concentrated in the 12:00 direction, i.e., vertically upward), collected from the CWRU drive end bearings, together with the signal data under the normal state of the bearing, are used for analysis.


Table 2.19 The comparison of fault diagnosis results of rolling bearings in three types of states

Bearing fault states: (1) outer ring fault; (2) inner ring fault; (3) rolling element fault

| Methods | Accuracy (%) |
| Discrete cosine transform | 85 [26] |
| Daubechies wavelet | 78 [26] |
| Symlets wavelet | 74 [26] |
| Walsh transform | 78 [26] |
| FFT | 84 [26] |
| Walsh transform + rough set theory | 80 [26] |
| FFT + rough set theory | 86 [26] |
| Support vector machine | 100 |
| Ensemble-based incremental support vector machines | 100 |

The time domain waveforms and frequency spectrums of the vibration signals collected under the four bearing states are shown in Figs. 2.35 and 2.36 respectively. The fault size of the three bearing fault types is 0.36 mm; these are the same fault types as in experiment I, but the faults are more severe than the 0.18 mm faults of experiment I, the motor speed is slightly higher, and the motor load is correspondingly lower. The vibration signals under the four bearing states are intercepted every 1024 points to form 50 samples, of which 70% are used as training samples and the remaining 30% as testing samples. The experimental data and parameters shown in Table 2.20 are consistent with [27], so as to ensure the fairness and reliability of the method validation and result comparison.

Reference [27] proposed a fault diagnosis method based on an improved fuzzy ARTMAP method and a modified distance evaluation technology for the above four bearing states (normal, outer ring fault, inner ring fault and rolling element fault): nine time domain statistical features (mean, root mean square, variance, skewness, kurtosis, peak index, margin index, waveform index, pulse index), seven frequency domain statistical features and the First-order Continuous Wavelet Grey Moment feature are extracted, the modified distance evaluation technology is used to select the optimal features, and the improved fuzzy ARTMAP method is then used to identify the fault types.

Table 2.20 The description of experimental data including four types of fault states

| Bearings | Fault location | Fault diameter (mm) | Motor speed (r/min) | Motor load (kW) | Label |
| Drive end bearings | Normal | 0 | 1772 | 0.74 | 1 |
| Drive end bearings | Outer ring fault | 0.36 | 1772 | 0.74 | 2 |
| Drive end bearings | Inner ring fault | 0.36 | 1772 | 0.74 | 3 |
| Drive end bearings | Rolling element fault | 0.36 | 1772 | 0.74 | 4 |


Fig. 2.35 The time domain vibration signals of rolling bearings in four types of fault states: (a) normal; (b) 0.36 mm outer ring fault; (c) 0.36 mm inner ring fault; (d) 0.36 mm rolling element fault

The experimental results of the method proposed in [27] are compared with three other similar methods: the first uses the improved fuzzy ARTMAP method for fault diagnosis without feature optimization; the second uses the modified distance evaluation technology to extract the optimal features and the (unimproved) fuzzy ARTMAP method for fault diagnosis; the third uses the fuzzy ARTMAP method without feature optimization. In this section, the time domain and frequency domain statistical features of each sample signal are extracted as shown in Table 2.7, and the proposed ensemble-based incremental support vector machines are then used for fault diagnosis and compared with the above methods and with the general support vector machine.


Fig. 2.36 The frequency domain vibration signals of rolling bearings in four types of fault states: (a) normal; (b) 0.36 mm outer ring fault; (c) 0.36 mm inner ring fault; (d) 0.36 mm rolling element fault

It can be seen from Table 2.21 that, under the current experimental conditions, the ensemble-based incremental support vector machines proposed in this section achieve better results than the general support vector machine and the three fuzzy-ARTMAP-based methods, although they are slightly worse than the fault diagnosis method based on the improved fuzzy ARTMAP method and the modified distance evaluation technology.


Table 2.21 The comparison of fault diagnosis results of rolling bearings in four types of states

Bearing fault states: (1) normal; (2) outer ring fault; (3) inner ring fault; (4) rolling element fault

| Methods | Accuracy (%) |
| Improved fuzzy ARTMAP method + modified distance evaluation technology | 99.541 [27] |
| Improved fuzzy ARTMAP method | 89.382 [27] |
| Fuzzy ARTMAP method + modified distance evaluation technology | 91.185 [27] |
| Fuzzy ARTMAP method | 79.228 [27] |
| Support vector machine | 91.67 |
| Ensemble-based incremental support vector machines | 98.33 |

One reason for this gap may be that the method based on the improved fuzzy ARTMAP and the modified distance evaluation technology uses the First-order Continuous Wavelet Grey Moment feature, which is more advanced than the time domain and frequency domain statistical features, and selects the optimal features on that basis, thereby improving the fault diagnosis accuracy.

2.5.4.4 Comparative Analysis of Experiment III

To further verify the generalization performance of the ensemble-based incremental support vector machines, experiment III includes multiple fault types as well as different fault degrees of the same fault type: normal, rolling element fault, outer ring fault, and four inner ring faults of different degrees (minor, 0.18 mm; medium, 0.36 mm; serious, 0.53 mm; severe, 0.71 mm). The experimental data are described in Table 2.22. Because distinguishing different fault degrees of the same fault type is very difficult for many signal analysis methods based on extracting fault characteristic frequencies, this experiment further tests the intelligent diagnosis ability of the ensemble-based incremental support vector machines for different fault degrees of the same fault type.

Table 2.22 The description of experimental data including seven types of fault states

| Bearings | Fault location | Fault diameter (mm) | Motor speed (r/min) | Motor load (kW) | Label |
| Drive end bearings | Normal | – | 1772 | 0.74 | 1 |
| Drive end bearings | Outer ring fault | 0.36 | 1772 | 0.74 | 2 |
| Drive end bearings | Inner ring fault | 0.18 | 1772 | 0.74 | 3 |
| Drive end bearings | Inner ring fault | 0.36 | 1772 | 0.74 | 4 |
| Drive end bearings | Inner ring fault | 0.53 | 1772 | 0.74 | 5 |
| Drive end bearings | Inner ring fault | 0.71 | 1772 | 0.74 | 6 |
| Drive end bearings | Rolling element fault | 0.36 | 1772 | 0.74 | 7 |


The time domain waveforms of the bearing vibration signals under the seven states are shown in Fig. 2.37, and the corresponding frequency spectrums in Fig. 2.38. It is difficult to distinguish the seven states from the time domain signals and frequency spectrums alone, so further analysis is needed. The fault diagnosis process based on the ensemble-based incremental support vector machines is consistent with [28]: the vibration signal under each fault state is intercepted every 1024 points to form 80 samples, 50% of which are used as training samples and the remainder as testing samples. The time domain and frequency domain statistical features of each sample signal are extracted as shown in Table 2.8, and the proposed ensemble-based incremental support vector machines are then used for fault diagnosis.

For the states described in Table 2.22 (normal, rolling element fault, outer ring fault, and four inner ring faults of different degrees: 0.18 mm minor, 0.36 mm medium, 0.53 mm serious and 0.71 mm severe), reference [28] extracts nine time domain statistical features (mean, root mean square, variance, skewness, kurtosis, peak index, margin index, waveform index, pulse index), eight frequency domain statistical features and the First-order Continuous Wavelet Grey Moment feature, and uses a fuzzy ARTMAP network model based on feature weight learning to identify bearing fault states containing multiple fault types and different fault degrees of the same fault type.

Table 2.23 shows the experimental comparison between the fault diagnosis method based on the ensemble-based incremental support vector machines, the general support vector machine, and the three fuzzy ARTMAP methods proposed in [27, 28]. Among these methods, the ensemble-based incremental support vector machines achieve the highest diagnosis accuracy of 96.42%. The results show that, when faced with the complicated problem of recognizing different fault degrees of the same fault type, the fault diagnosis ability of the general support vector machine is limited, while the ensemble-based incremental support vector machines, built on the machine learning framework of ensemble theory and reinforcement theory, improve the generalization ability of support vector machines and enhance the identification of multiple early fault states of rolling bearings, including different fault degrees of the same fault type; their experimental results are better than those of the three fuzzy-ARTMAP-based methods.
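Per-state accuracies like those reported in Table 2.23 (and in the earlier comparison tables) are simply class-wise recalls. A small sketch, with entirely synthetic predictions standing in for real diagnosis output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred, labels):
    # recall per class: the diagonal of the row-normalized confusion matrix,
    # matching the per-state accuracies reported in the comparison tables
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return cm.diagonal() / cm.sum(axis=1)

labels = [1, 2, 3, 4, 5, 6, 7]             # the seven states of Table 2.22
y_true = np.repeat(labels, 40)             # 40 test samples per state
y_pred = y_true.copy()
y_pred[::25] = 1                           # a few illustrative errors
acc = per_class_accuracy(y_true, y_pred, labels)
print({l: round(a * 100, 2) for l, a in zip(labels, acc)})
print("average:", round(acc.mean() * 100, 2))
```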

2.5.5 The Application in Electrical Locomotive Rolling Bearing Compound Fault Diagnosis

The experimental testing process and platform for the electric locomotive rolling bearings are described in Sect. 2.3.4, and the experimental data are described in detail in Sect. 2.4.3. The vibration signal of each state is intercepted every 2048 sampling points to form one sample, giving 32 samples in total per state.


Fig. 2.37 The time domain vibration signals of rolling bearings in seven types of fault states: (a) normal; (b) 0.36 mm outer ring fault; (c) 0.18 mm inner ring fault; (d) 0.36 mm inner ring fault; (e) 0.53 mm inner ring fault; (f) 0.71 mm inner ring fault; (g) 0.36 mm rolling element fault


Fig. 2.38 The frequency domain vibration signals of rolling bearings in seven types of fault states: (a) normal; (b) 0.36 mm outer ring fault; (c) 0.18 mm inner ring fault; (d) 0.36 mm inner ring fault; (e) 0.53 mm inner ring fault; (f) 0.71 mm inner ring fault; (g) 0.36 mm rolling element fault


Table 2.23 The comparison of fault diagnosis results of rolling bearings in seven types of states

Bearing fault states (fault diameter in mm): (1) normal; (2) outer ring fault (0.36); (3) inner ring fault (0.18); (4) inner ring fault (0.36); (5) inner ring fault (0.53); (6) inner ring fault (0.71); (7) rolling element fault (0.36)

| Method | Accuracy (%) |
| Improved fuzzy ARTMAP method | 77.551 [27] |
| Improved fuzzy ARTMAP method + modified distance evaluation technology | 84.898 [27] |
| Improved fuzzy ARTMAP method + feature weight learning | 87.302 [28] |
| Support vector machine | 89.29 |
| Ensemble-based incremental support vector machines | 96.42 |

The time-domain and frequency-domain statistical characteristics of each sample are extracted according to Table 2.7. Sixteen samples per state are used to train the support vector machines, and the remaining 16 samples are used for testing. The fault types of the experimental locomotive rolling bearing samples are identified using the ensemble-based incremental support vector machines, and the experimental results are shown in Table 2.24. There, SVM1 denotes the proposed parameter optimization method of the support vector machine based on the ant colony optimization algorithm, SVM2 denotes the feature selection and parameter optimization fusion method of the support vector machine based on the ant colony optimization algorithm, and EISVM denotes the ensemble-based incremental support vector machines; EISVM* denotes the ensemble-based incremental support vector machines whose features and parameters are optimized with the ant colony optimization algorithm. The results show that the parameter optimization method (SVM1) achieves 89.58% fault diagnosis accuracy for the locomotive bearings by finding the optimal parameters (C = 57.60, σ = 64.26).

Table 2.24 Comparison of experimental results of compound fault diagnosis of locomotive bearings based on support vector machines

| Method | N | O | S | I | R | OI | OR | IR | OIR | Average accuracy (%) |
| SVM1 | 100 | 100 | 87.5 | 100 | 100 | 75 | 75 | 75 | 93.75 | 89.58 |
| SVM2 | 100 | 93.75 | 93.75 | 100 | 100 | 87.5 | 93.75 | 93.75 | 100 | 95.83 |
| EISVM | 100 | 100 | 93.75 | 100 | 100 | 87.5 | 93.75 | 93.75 | 100 | 96.53 |
| EISVM* | 100 | 100 | 100 | 100 | 100 | 100 | 93.75 | 100 | 100 | 99.31 |

(Columns N through OIR give the accuracy of each fault type in %.)


The feature selection and parameter optimization fusion method (SVM2) improves the diagnostic ability of the support vector machine by simultaneously selecting the optimal features F1 ∼ F18 and the optimal parameters (C = 1.02, σ = 0.04), obtaining 95.83% accuracy. The ensemble-based incremental support vector machines (EISVM) proposed in this section improve the generalization performance of the single support vector machine at the level of algorithm construction, raising the fault diagnosis accuracy for the locomotive rolling bearings to 96.53%. Further using the ant colony optimization algorithm to simultaneously optimize the features and parameters of the ensemble-based incremental support vector machines (EISVM*) improves their generalization ability again, raising the average fault diagnosis accuracy to 99.31%.

For the three common simple fault types (normal state, inner ring fault, and rolling element fault), the support vector machine parameter optimization method based on the ant colony optimization algorithm (SVM1) proposed in Sect. 2.3, the support vector machine feature selection and parameter optimization fusion method based on the ant colony optimization algorithm (SVM2) proposed in Sect. 2.4, and the ensemble-based incremental support vector machines (EISVM) proposed in this section all achieve 100% identification, which shows that the three support vector machine methods perform equally well on simple fault types. For the other, more complicated fault states (severe outer ring fault, minor outer ring fault, compound fault of the outer and inner rings, compound fault of the outer ring and rolling element, and compound fault of the outer ring, inner ring and rolling element), the ensemble-based incremental support vector machines, by improving the generalization ability of the single support vector machine, effectively further improve the accuracy of identifying compound faults and different damage degrees of the outer ring.

The locomotive rolling bearing results show that the ensemble-based incremental support vector machines improve the generalization performance of support vector machines through the machine learning framework and the algorithm construction. The method is applied to the field of mechanical fault diagnosis, validated on the standard experiments of the CWRU bearing data center in the USA, and finally applied to the fault diagnosis of locomotive rolling bearings covering various compound faults; it can effectively identify multiple fault types, including compound faults and different fault degrees of the same fault type. In summary, the ensemble-based incremental support vector machines, built on the theories of ensemble learning and reinforcement learning, fully mine the knowledge contained in the limited sample space and thereby improve the generalization performance of support vector machines at the level of both the machine learning framework and the algorithm construction; they can effectively identify multiple types of early faults and different damage degrees of rolling bearings.


Through three bearing fault test cases from the CWRU bearing data center in the USA, comparison with other methods under the same experimental parameters and process conditions shows that the ensemble-based incremental support vector machines obtain satisfactory results in early fault diagnosis of rolling bearings, effectively identifying multiple early fault types and different damage degrees of the same fault type. Since the generalization performance of support vector machines depends strongly on their parameters and sample features, the feature selection and parameter optimization fusion method based on the ant colony optimization algorithm is applied to the ensemble-based incremental support vector machines to further improve their diagnostic ability for complicated fault types, including compound faults and different damage degrees of the same fault type. The research results show that the ensemble-based incremental support vector machines improve the generalization performance of the single support vector machine and can effectively identify the various compound faults and different damage degrees of the same fault type in locomotive rolling bearings.

References

1. Vapnik, V.: The Nature of Statistic Learning (in Chinese). Tsinghua University Press, Beijing (2000)
2. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
3. Mjolsness, E., DeCoste, D.: Machine learning for science: state of the art and future prospects. Science 293(5537), 2051–2055 (2001)
4. Bian, Z., Zhang, X., et al.: Pattern Recognition (in Chinese), 2nd edn., pp. 296–301. Tsinghua University Press, Beijing (2000)
5. Yang, Y., Yu, D.J., Cheng, J.S.: A fault diagnosis approach for roller bearing based on IMF envelope spectrum and SVM. Measurement 40(9–10), 943–950 (2007)
6. Cheng, J.S., Yu, D.J., Yang, Y.: A fault diagnosis approach for gears based on IMF AR model and SVM. EURASIP J. Adv. Signal Process. (2008)
7. Saravanan, N., Siddabattuni, V.N.S.K., Ramachandran, K.I.: A comparative study on classification of features by SVM and PSVM extracted using Morlet wavelet for fault diagnosis of spur bevel gear box. Expert Syst. Appl. 35(3), 1351–1366 (2008)
8. Poyhonen, S., Arkkio, A., Jover, P., et al.: Coupling pairwise support vector machines for fault classification. Control Eng. Pract. 13(6), 759–769 (2005)
9. Chu, F.L., Yuan, S.F.: Fault diagnosis based on support vector machines with parameter optimisation by artificial immunisation algorithm. Mech. Syst. Signal Process. 21(3), 1318–1330 (2007)
10. Sun, C., Liu, L., Liu, C., et al.: Boosting-SVM based aero engine fault diagnosis (in Chinese). J. Aerosp. Power 11(25), 2584–2588 (2010)
11. Zhu, Z., Liu, W.: Fault diagnosis of marine diesel engine based on support vector machine (in Chinese). Ship Eng. 5(28), 31–33 (2006)
12. Chapelle, O., Vapnik, V., Bousquet, O., et al.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1–3), 131–159 (2002)
13. Colorni, A., Dorigo, M., Maniezzo, V.: Distributed optimization by ant colonies. In: Proceedings of the First European Conference on Artificial Life, vol. 142 (1991)
14. Samrout, M., Kouta, R., Yalaoui, F., et al.: Parameter's setting of the ant colony algorithm applied in preventive maintenance optimization. J. Intell. Manuf. Autom. Technol. 18, 663–677 (2007)
15. Duan, H.B., Wang, D.B., Yu, X.F.: Research on the optimum configuration strategy for the adjustable parameters in ant colony algorithm. J. Commun. Comput. 2(9), 32–35 (2005)
16. Chen, C.-W.: Modeling, control, and stability analysis for time-delay TLP systems using the fuzzy Lyapunov method. Neural Comput. Appl. 20(4), 527–534 (2011)
17. Adankon, M.M., Cheriet, M.: Optimizing resources in model selection for support vector machine. Pattern Recogn. 40(3), 953–963 (2007)
18. Friedrichs, F., Igel, C.: Evolutionary tuning of multiple SVM parameters. Neurocomputing 64, 107–117 (2005)
19. Rényi, A.: On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561 (1961)
20. Polikar, R., Topalis, A., Parikh, D., et al.: An ensemble based data fusion approach for early diagnosis of Alzheimer's disease. Inf. Fusion 9(1), 83–95 (2008)
21. Sun, W.X., Chen, J., Li, J.Q.: Decision tree and PCA-based fault diagnosis of rotating machinery. Mech. Syst. Signal Process. 21(3), 1300–1317 (2007)
22. Parikh, R., Polikar, R.: An ensemble-based incremental learning approach to data fusion. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 32(2), 437–450 (2007)
23. Dietterich, T.G.: Machine learning research: four current directions. AI Mag. 18(4), 97–136 (1997)
24. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
25. Bearing Data Center Seeded Fault Test Data. The Case Western Reserve University Bearing Data Center Website. http://csegroups.case.edu/bearingdatacenter/pages/welcome-case-western-reserve-university-bearing-data-center-website
26. Li, Z., He, Z.J., Zi, Y.Y., et al.: Rotating machinery fault diagnosis using signal-adapted lifting scheme. Mech. Syst. Signal Process. 22(3), 542–556 (2008)
27. Xu, Z.B., Xuan, J.P., Shi, T.L., et al.: A novel fault diagnosis method of bearing based on improved fuzzy ARTMAP and modified distance discriminant technique. Expert Syst. Appl. 36(9), 11801–11807 (2009)
28. Xu, Z.B., Xuan, J.P., Shi, T.L., et al.: Application of a modified fuzzy ARTMAP with feature-weight learning for the fault diagnosis of bearing. Expert Syst. Appl. 36(6), 9961–9968 (2009)

Chapter 3

Semi-supervised Learning Based Intelligent Fault Diagnosis Methods

3.1 Semi-supervised Learning

Machine learning requires large amounts of labeled training data as input to improve the generalization of supervised learning. However, labeled data are more difficult to obtain than unlabeled data, especially in the area of fault diagnosis. On the contrary, unsupervised learning is an automatic learning method that does not require class labels for the training data. But without supervised information, the trained model is often not accurate enough, and the consistency and generalization of the learning results are also difficult to meet usage requirements. Semi-supervised learning is situated between supervised and unsupervised learning. The key to semi-supervised learning is to consider how to take advantage of the data structure and the automatic learning ability on unknown data, and to design algorithms that combine the features of labeled and unlabeled data. The emphasis of semi-supervised learning is not on the methodological approaches themselves but on the learning mechanism in which the supervised paradigm and the unsupervised paradigm learn collaboratively from the samples. Semi-supervised classification can be considered a classification algorithm that incorporates labeled data with specific unlabeled data to perform classification and recognition tasks [1]. Semi-supervised learning focuses primarily on how to obtain a learning machine with state-of-the-art performance and generalization ability when part of the training data is deficient, for example when class labels are missing, noisy data appear, or feature dimensions of the data are lost. The theoretical research of semi-supervised learning has very important guiding significance for deeply understanding many important theoretical issues in machine learning, such as the relationship between the data manifold and the data class label, the reasonable processing of missing data, the effective use of labeled data, the relationship between supervised and unsupervised learning, and the design of active learning algorithms.



3.2 Fault Detection and Classification Based on Semi-supervised Kernel Principal Component Analysis

3.2.1 Kernel Principal Component Analysis

Principal component analysis (PCA) is a commonly used feature extraction method that obtains the principal component variables carrying the most variation information of the original data, and thereby realizes feature extraction for complex system information. Kernel principal component analysis (KPCA) introduces the kernel method into PCA, maps the input data to a high-dimensional feature space, and extracts nonlinear features by linear PCA in that feature space. PCA is a linear method based on the Gaussian statistical assumption that each principal component is a linear combination of the original variables. Linear PCA therefore cannot effectively extract the nonlinear features of mechanical fault signals, which in turn affects the accuracy of fault diagnosis, so nonlinear PCA is required to deal with fault diagnosis signals. The main differences between linear PCA and nonlinear PCA are as follows. First, in nonlinear PCA a nonlinear function must be introduced to map the original variables into the nonlinear principal components. Second, a linear principal component is a linear combination of the original variables that minimizes the sum of the distances from the data points to the line it represents, whereas nonlinear PCA minimizes the sum of the distances from the data points to the curve or surface it represents.

KPCA provides a new idea for solving the nonlinear problem by using the kernel trick, mapping nonlinear problems in a low-dimensional space to linear problems in a high-dimensional space and thus extending PCA to the nonlinear field. The method maps the input data matrix X to a high-dimensional feature space F through a pre-selected nonlinear mapping and obtains more separable input data. Linear PCA is then used to analyze the mapped data in the high-dimensional space to obtain the nonlinear principal components of the input data. The nonlinear map is implemented through the inner product operation: one only needs to compute the kernel function corresponding to the inner product in the original space, without attending to the specific implementation of the nonlinear mapping. Feature extraction for any test dataset sample Z can be achieved by computing the projections of the mapped data matrix ϕ(Z) on the eigenvectors of the normalized correlation coefficient matrix. The KPCA method can be summarized by the following steps:

(1) Select the training dataset $\{x_i\}_{i=1}^{M}$ and the test dataset $\{z_i\}_{i=1}^{N}$.
(2) Compute the kernel matrix $K$ by $K_{ij} = (\phi(x_i) \cdot \phi(x_j))$, where the dimension of $K$ is $M \times M$.
(3) Normalize the kernel matrix $K$ by $\tilde{K} = K - 1_M K - K 1_M + 1_M K 1_M$, where $(1_M)_{ij} = 1/M$ $(i, j = 1, 2, \ldots, M)$.
(4) Compute the eigenvalues $\tilde{\lambda}$ and eigenvectors $\tilde{\alpha}$ from the eigenequation $\tilde{\lambda} \tilde{\alpha} = \tilde{K} \tilde{\alpha}$.
(5) Compute the normalized eigenvectors $\tilde{\alpha}_b^k$ by requiring $\lambda_k (\tilde{\alpha}^k \cdot \tilde{\alpha}^k) = 1$.
(6) Compute the kernel matrix $K^{test}$ by $K^{test}_{ij} = (\phi(z_i) \cdot \phi(x_j))$, where the dimension of $K^{test}$ is $N \times M$.
(7) Normalize the kernel matrix $K^{test}$ by $\tilde{K}^{test} = K^{test} - 1_N K - K^{test} 1_M + 1_N K 1_M$.
(8) Extract the features $F_k^{test}$ by $F_k^{test} = \tilde{K}^{test} \tilde{\alpha}_b^k$, where $F_k^{test}$ is the kth nonlinear principal component of the test data.
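To make the above steps concrete, the following is a minimal NumPy sketch of the KPCA procedure, assuming a Gaussian RBF kernel; the function names and the small eigenvalue floor are illustrative choices, not part of the original method description.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between the row-sample sets A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma ** 2))

def kpca_fit_transform(X_train, Z_test, sigma, n_components):
    """Steps (1)-(8): nonlinear principal components of training and test data."""
    M, N = len(X_train), len(Z_test)
    # Step (2): training kernel matrix K_ij = (phi(x_i) . phi(x_j)), M x M
    K = rbf_kernel(X_train, X_train, sigma)
    # Step (3): centering, K~ = K - 1_M K - K 1_M + 1_M K 1_M, (1_M)_ij = 1/M
    one_M = np.full((M, M), 1.0 / M)
    Kc = K - one_M @ K - K @ one_M + one_M @ K @ one_M
    # Step (4): eigendecomposition of the centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Step (5): rescale so that lambda_k (alpha_k . alpha_k) = 1
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))  # floor guards round-off
    # Steps (6)-(7): test kernel matrix (N x M) and its centering
    K_test = rbf_kernel(Z_test, X_train, sigma)
    one_NM = np.full((N, M), 1.0 / M)
    K_test_c = K_test - one_NM @ K - K_test @ one_M + one_NM @ K @ one_M
    # Step (8): projections on the nonlinear principal components
    return Kc @ alpha, K_test_c @ alpha
```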

3.2.2 Semi-supervised Kernel Principal Component Analysis

Semi-supervised learning emphasizes the fusion of labeled and unlabeled data to improve the performance of the learning machine. The kernel function of KPCA is used to achieve nonlinear feature extraction. The lack of prior information about the different pattern types during the diagnosis process will influence the reliability of fault detection and diagnosis. Therefore, in this section we build the kernel function in the semi-supervised pattern for fault diagnosis.

3.2.2.1 Separability Index

(1) Intraclass distance and interclass distance for the separability criterion

KPCA is a feature extraction method, so quantitative indicators and criteria are required to evaluate the effectiveness of the extracted features for classification. Generally, the accuracy rate of the classifier is used to evaluate the effectiveness of classification. However, a large amount of prior information and labeled data is required to calculate the accuracy rate, so it is necessary to introduce criteria to evaluate the quality of the feature extraction method. The feature samples located in different areas of the feature space correspond to their fault patterns; consequently, the samples of different classes are separable. If the interclass scatter is large and the intraclass scatter is small in the sample clustering process, the separability of the samples and the clustering effect of KPCA are excellent. It is easy to see that the distances between sample points reflect the separability of the sample classes. To start, the distance between different classes is studied under the assumption that there are two sample classes. Let the two classes be $\omega_1$ and $\omega_2$. Any point in $\omega_1$ has a distance to every point in $\omega_2$, and the average obtained by summing these point distances represents the distance between the two classes.

For the case of clustering multiple classes, let $x_k^{(i)}$ and $x_l^{(j)}$ be D-dimensional eigenvectors of class $\omega_i$ and class $\omega_j$, and let $\delta(x_k^{(i)}, x_l^{(j)})$ be the distance between the two eigenvectors. The average distance between the eigenvectors is:

$J_d(x) = \frac{1}{2} \sum_{i=1}^{c} P_i \sum_{j=1}^{c} P_j \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} \delta\left(x_k^{(i)}, x_l^{(j)}\right)$  (3.1)

where c is the number of classes, $n_i$ is the number of samples in class $\omega_i$, $n_j$ is the number of samples in class $\omega_j$, and $P_i$ and $P_j$ are the prior probabilities of the corresponding classes. When the prior probability is unknown, it can be estimated from the training sample data:

$\tilde{P}_i = \frac{n_i}{n}$  (3.2)

There are multiple distance measures to compute $\delta(x_k^{(i)}, x_l^{(j)})$ for two vectors in a multidimensional space. In this section, the Euclidean distance is mainly used:

$\delta\left(x_k^{(i)}, x_l^{(j)}\right) = \left(x_k^{(i)} - x_l^{(j)}\right)^T \left(x_k^{(i)} - x_l^{(j)}\right)$  (3.3)

Let $m_i$ denote the mean vector of the ith class:

$m_i = \frac{1}{n_i} \sum_{k=1}^{n_i} x_k^{(i)}$  (3.4)

Let $m$ denote the mean vector of all classes:

$m = \sum_{i=1}^{c} P_i m_i$  (3.5)

By substituting Eqs. (3.4) and (3.5) into Eq. (3.1), the result is:

$J_d(x) = \sum_{i=1}^{c} P_i \left[ \frac{1}{n_i} \sum_{k=1}^{n_i} \left(x_k^{(i)} - m_i\right)^T \left(x_k^{(i)} - m_i\right) + (m_i - m)^T (m_i - m) \right]$  (3.6)

where $(m_i - m)^T (m_i - m)$ denotes the squared distance between the ith class mean vector and the population mean vector. After weighting by the prior probabilities, it represents the average squared distance between all pairs of class mean vectors:

$\sum_{i=1}^{c} P_i (m_i - m)^T (m_i - m) = \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} P_i P_j \left(m_i - m_j\right)^T \left(m_i - m_j\right)$  (3.7)

Accordingly, the interclass and intraclass scatter matrices appearing in $J_d(x)$ can be defined as:

$\tilde{S}_b = \sum_{i=1}^{c} P_i (m_i - m)(m_i - m)^T$  (3.8)

$\tilde{S}_\omega = \sum_{i=1}^{c} P_i \frac{1}{n_i} \sum_{k=1}^{n_i} \left(x_k^{(i)} - m_i\right) \left(x_k^{(i)} - m_i\right)^T$  (3.9)

The above derivation is based on a finite number of samples, where $m_i$ and $m$ denote the mean of the ith class and the mean of all classes, and $\tilde{S}_b$ and $\tilde{S}_\omega$ denote the interclass scatter and the intraclass scatter. In terms of expectations over the class distributions, the corresponding formulations are as follows:

$\mu_i = E_i[x]$  (3.11)

$\mu = E[x]$  (3.12)

$S_b = \sum_{i=1}^{c} P_i \left(\mu_i - \mu\right) \left(\mu_i - \mu\right)^T$  (3.13)

$S_\omega = \sum_{i=1}^{c} P_i E_i\left[\left(x - \mu_i\right) \left(x - \mu_i\right)^T\right]$  (3.14)

The mean square distance of all classes can then also be defined as:

$J_d(x) = tr(S_\omega + S_b)$  (3.15)

(2) Separability index

A distance metric criterion can be obtained by Eq. (3.7):

$J_1(x) = tr(S_\omega + S_b)$  (3.16)

To improve the KPCA effect, the interclass scatter should be as large as possible and the intraclass scatter as small as possible. Therefore, the following criteria are proposed:

$J_2 = tr\left(S_\omega^{-1} S_b\right)$  (3.17)

$J_3 = \ln\left[\frac{|S_b|}{|S_\omega|}\right]$  (3.18)

$J_4 = \ln\left[\frac{tr\, S_b}{tr\, S_\omega}\right]$  (3.19)

$J_5 = \frac{|S_b + S_\omega|}{|S_\omega|}$  (3.20)

The actual diagnosis working condition is mostly a few-shot case, for which the parameters in $J_d(x)$ can be obtained directly by calculating the sample points. Therefore, the separability criterion $J_{bw}(x)$ can be constructed by combining the above separability criteria $J_1 \sim J_5$:

$J_{bw}(x) = \frac{S_{cb}}{S_{cw}} = \frac{\sum_{i=1}^{c} P_i (m_i - m)^T (m_i - m)}{\sum_{i=1}^{c} P_i \frac{1}{n_i} \sum_{k=1}^{n_i} \left(x_k^{(i)} - m_i\right)^T \left(x_k^{(i)} - m_i\right)}$  (3.21)

where $S_{cw}$ is the index of intraclass scatter and denotes the mean distance of intraclass vectors, and $S_{cb}$ is the index of interclass scatter and denotes the mean distance of interclass vectors. The values of the separability criterion $J_{bw}(x)$ are normalized to obtain the separability evaluation index $J_b(x)$:

$J_b(x) = \frac{S_{cb}}{S_{cb} + S_{cw}}$  (3.22)

$J_b(x)$ lies in the interval [0, 1] and denotes the similarity of all samples. $J_b(x) = 0$ means all samples belong to the same class, that is, there is no interclass mean distance. In contrast, $J_b(x) = 1$ means each sample belongs to a different class, that is, there is no intraclass mean distance. A larger value of $J_b$ means higher intraclass sample aggregation, larger interclass mean distance, and better separability of the clustered samples. $J_b$ is important for feature extraction: it can quickly measure the effectiveness of feature extraction under few-shot conditions, effectively guide the feature index selection in pattern recognition and classification, and support reasonable setting of the kernel function parameters.
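As a sketch of how Eqs. (3.21)–(3.22) can be evaluated on labeled feature samples, the following Python function estimates the priors by Eq. (3.2) and returns $J_b$; the function name is illustrative.

```python
import numpy as np

def separability_index(X, y):
    """Separability evaluation index J_b of Eqs. (3.21)-(3.22).

    X: (n, d) feature matrix; y: (n,) integer class labels.
    The priors P_i are estimated from the sample as n_i / n (Eq. 3.2).
    """
    n = len(y)
    m = X.mean(axis=0)                # overall mean vector, Eq. (3.5)
    S_cb = S_cw = 0.0
    for c in np.unique(y):
        Xi = X[y == c]
        Pi = len(Xi) / n              # estimated prior of class c
        mi = Xi.mean(axis=0)          # class mean vector, Eq. (3.4)
        S_cb += Pi * float((mi - m) @ (mi - m))               # interclass
        S_cw += Pi * float(((Xi - mi) ** 2).sum()) / len(Xi)  # intraclass
    return S_cb / (S_cb + S_cw)       # J_b in [0, 1], Eq. (3.22)
```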

3.2.2.2 Nearest Neighbor Function Rule Algorithms and Feature Classification

Nearest neighbor function rule algorithms [2], which are based on similarity measurement rules, can categorize samples according to the clustering distribution of the feature data. After KPCA processing, the original samples present different distributions of feature points on the feature surface. Nearest neighbor function rules can obtain clear detection and classification results by effectively classifying the clustered samples.


1. Unsupervised category separation method

In the pattern classification area, due to the lack of prior class label information, or due to the difficulty of labeling samples under practical working conditions, the learning machine can often only be trained using samples without class labels. This is the motivation for unsupervised learning methods. Unsupervised learning is divided into two categories: direct methods based on the probability density function, which decompose a mixed probability density into a number of component densities, each corresponding to a class; and indirect clustering methods based on a similarity measure between samples, which divide the sample set into subsets such that some criterion function representing the quality of the clustering is maximized. Distance is usually used as the similarity measure between samples.

Iterative dynamic clustering algorithms are commonly used among the indirect clustering methods. They have the following three main features: (1) some distance measure is selected as the similarity measure between samples; (2) a criterion function is determined to evaluate the quality of the clustering; (3) given an initial classification, an iterative algorithm is used to find the best clustering, which takes the extreme value of the criterion function. Common dynamic clustering algorithms include C-means clustering algorithms, dynamic clustering algorithms based on similarity measures between samples and kernels, and nearest neighbor function rule algorithms.

C-means clustering algorithms take the sum of squared errors (SSE) as the clustering criterion. Only when the natural distribution of the classes is spherical or nearly spherical, in other words when the variances of the components within each class are close to equal, is a good classification effect obtained. The C-means algorithm usually does not work well for normal distributions with elliptical shapes caused by unequal component variances [3]. As Fig. 3.1 shows, $m_1$ and $m_2$ are the clustering centers of class 1 and class 2 respectively, but due to this defect of C-means clustering, point A is classified into class 2.

Dynamic clustering algorithms based on similarity measures between samples and kernels can fix this shortcoming of the C-means clustering algorithms. Such an algorithm defines a kernel $K_j = K(y, V_j)$ to represent the class $\Gamma_j$, then judges whether a sample y belongs to the class $\Gamma_j$ by establishing a measurement function $\Delta(y, K_j)$ between the sample point y and the kernel $K_j$. The algorithm enables the clustering results to fit a priori assumed data structures of different shapes. But the algorithm has trouble clustering samples when the form of the defining kernel function cannot be determined or cannot be represented by a simple function. For the case of several differently shaped data structures in Fig. 3.2, dynamic clustering algorithms based on similarity measures between samples and kernels often fail to properly select the defining kernel functions, and the clustering results obtained are still hardly satisfactory.


Fig. 3.1 Classification effect of C-means clustering algorithms with elliptical distribution

Fig. 3.2 Examples of several different shapes of data construction

2. Nearest neighbor function rule algorithms

To solve the clustering problem in the above cases, the nearest neighbor function rule algorithms are considered for classification. The specific steps are as follows:

(1) Calculate the distance matrix $\Delta$ such that its element $\Delta_{ij}$ represents the distance between the samples $y_i$ and $y_j$:

$\Delta_{ij} = \Delta(y_i, y_j)$  (3.23)

(2) Use the above distance matrix to construct the nearest neighbor matrix M, in which the element $M_{ij}$ is the nearest neighbor coefficient of the sample $y_j$ with respect to $y_i$. Generally speaking, M is a positive definite matrix. The number of nearest neighbors of a sample point can only take values in the series $1, 2, \ldots, N-1$, so each element of the matrix must be an integer.

(3) Construct the nearest neighbor function matrix L, whose elements are:

$L_{ij} = M_{ij} + M_{ji} - 2$  (3.24)

where $L_{ij}$, the value of the nearest neighbor function, represents the connection relation between $y_j$ and $y_i$. Set the diagonal entries $L_{ii}$ equal to 2N, $i = 1, 2, \ldots, N$.

(4) As the initial clustering condition, connect each point with the points that minimize the value of the nearest neighbor function in the matrix L.

(5) Calculate the parameter $\gamma_i$ for each class i obtained in step 4 and compare its value with $\alpha_{i\max}$ and $\alpha_{k\max}$. If it is less than or equal to either $\alpha_{i\max}$ or $\alpha_{k\max}$, then classes i and k are combined and considered connected.

(6) Repeat step 5 until no $\gamma_i$ meets the above condition, then stop.
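A minimal Python sketch of steps (1)–(4) follows, assuming that "connect" in step (4) means joining each point to the partner minimizing the nearest neighbor function; the union-find grouping is an implementation choice, and the γ/α merging of steps (5)–(6) is omitted here.

```python
import numpy as np

def nearest_neighbor_function(Y):
    """Steps (1)-(3): build the nearest neighbor function matrix L.

    M[i, j] is the rank (1 .. N-1) of sample j among the neighbors of
    sample i; L[i, j] = M[i, j] + M[j, i] - 2 (Eq. 3.24), so mutual
    nearest neighbors get L = 0. The diagonal is set to 2N as in the text.
    """
    N = len(Y)
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))  # Eq. (3.23)
    M = np.empty((N, N), dtype=int)
    for i in range(N):
        M[i, np.argsort(D[i])] = np.arange(N)  # rank 0 is the point itself
    L = M + M.T - 2
    np.fill_diagonal(L, 2 * N)
    return L

def initial_clusters(L):
    """Step (4): join each point to its minimum-L partner and return the
    connected components as initial cluster labels (simple union-find)."""
    N = len(L)
    parent = list(range(N))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(N):
        j = int(np.argmin(L[i]))      # diagonal is 2N, so j != i
        parent[find(i)] = find(j)
    return np.array([find(i) for i in range(N)])
```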

3.2.2.3 Semi-supervised KPCA Detection Algorithms

1. Semi-supervised abnormal detection

In the fault diagnosis area, whether minor faults or fault trends can be detected accurately and in time by pattern recognition is of wide concern. Equipment and systems in industrial environments operate mostly in normal conditions, so it is difficult to obtain prior information about particular faults; there may even be a lack of relevant sample data to train the learning machine when the equipment has minor faults or insignificant fault trends. There are thus many practical difficulties in performing fault diagnosis using supervised learning. Unsupervised detection, which enables the learning machine to detect abnormal patterns using unlabeled samples, provides the basis for condition monitoring and degradation trend analysis in industrial environments.

Abnormal detection in the unsupervised mode is essentially a process of pattern classification. The method emphasizes the separation of unknown abnormal patterns from normal patterns to realize fault detection and advance warning. This section on unsupervised abnormal detection is based on the following two basic assumptions:

(1) The number of normal data in the training set far exceeds the number of abnormal data.
(2) Abnormal data are different in nature from normal data.

Therefore, abnormal data differ from normal data in terms of both "quality" and "quantity". The basic idea of unsupervised abnormal detection is to use a dataset of unknown fault types as the training and test datasets and to map the detected data to feature points in the feature space by the selected algorithm. The detection boundary is then determined according to the distribution characteristics of the feature points, and the points in sparse areas of the feature space are marked as abnormal data.

The method of unsupervised abnormal detection, which maps the data in the original space to the feature space, is usually unable to determine the probability distribution of the data points to be detected. Therefore, the feature points in areas with


sparse distribution (i.e., low density) in the feature space are identified as abnormal data.

The first step in constructing a known dataset is to map the original space composed of all its data elements into the feature space. The high dimensionality of the feature space easily leads to the "curse of dimensionality", which can make this first step difficult. The kernel function uses the data elements in the original space to directly calculate the inner product between two points in the feature space. The kernel method does not require knowing the specific form of the mapping function ϕ(x), and thus the dimension n of the input space does not affect the kernel matrix [4]. The use of kernel functions avoids the curse of dimensionality and greatly reduces the computational effort. Therefore, the kernel method can perform the task of feature mapping in unsupervised abnormal detection. Another task of unsupervised abnormal detection is to determine the detection boundaries in the feature space, which can be done by different unsupervised learning algorithms, such as the k-means clustering algorithm and the support vector machine algorithm. But every coin has two sides. On the one hand, unsupervised methods take longer to learn because the training data are mostly high-dimensional. On the other hand, unsupervised abnormal detection is often less effective than supervised abnormal detection due to the lack of a priori knowledge to guide it.

2. Improved nearest neighbor function rule algorithms

The nearest neighbor function rule algorithms are widely used for unsupervised classification. However, they cannot be directly applied to pattern detection due to their own limitations. In step 5, the parameter $\gamma_i$ of each class i obtained in step 4 (refer to Eq. 3.23) is compared with the maximum connection loss $\alpha_{i\max}$ between two points in class $\omega_i$ and the maximum connection loss $\alpha_{k\max}$ between two points in class $\omega_k$ to judge whether a connection should be constructed. The calculation of the connection loss is based on the nearest neighbor coefficient and not on the distance between the actual feature points, which may lead to the situation in Fig. 3.3. As shown in the figure, $\omega_i$ is a cluster group with many samples, formed by several "connections" of the initial clusters. Obviously, $\omega_i$ and $\omega_k$ are two different classes of samples. The two furthest points in class $\omega_i$ are points 1 and 2, so the maximum connection loss is $\alpha_{i\max} = \alpha_{12} = 32$. Similarly, the maximum connection loss between points 3 and 4 in class $\omega_k$ is $\alpha_{k\max} = \alpha_{34} = 6$. The value of the nearest neighbor function between $\omega_i$ and $\omega_k$ is $\gamma_i = \alpha_{23} = 21$. Since $\gamma_i < \alpha_{i\max}$, the algorithm combines $\omega_i$ and $\omega_k$ into one class, which is clearly unreasonable. The algorithm makes such mistakes because of the difference in the number of samples per cluster: the nearest neighbor function rule algorithms mainly consider the nearest neighbor coefficient but ignore the actual distance between clusters when establishing a "connection".

Fig. 3.3 Incorrectly connected two different classes

In the detection process, the nearest neighbor function rule algorithms are likely to make a wrong "connection" and classify the abnormal clustering points into the


normal class when the number of normal data far exceeds the number of abnormal data, which in turn leads to detection errors.

After the initial clusters are formed in step 4 of the nearest neighbor function rule algorithms, each cluster group has its own cluster center, and the actual distance between clusters can be measured by the distance between cluster centers. Starting from this, the nearest neighbor function rule algorithms are improved by applying the concepts of the nearest neighbor coefficient and the nearest neighbor function to the cluster centers. By analyzing the nearest neighbor functions of the cluster centers to measure the similarity between the initial clusters, the algorithm judges which cluster group differs most from the majority of the initial clusters. The identified clusters are discriminated as abnormal clusters, in which the sample points are abnormal data; thus abnormal detection is achieved. The improved nearest neighbor function rule algorithms are calculated as follows. Steps (1) to (4) are the same as in the nearest neighbor function rule algorithms above.

(5) Calculate the coordinates of each initial cluster center and the distance matrix $\Delta_c$ of the initial cluster centers, whose element $\Delta_{c_{ij}}$ denotes the distance between the cluster centers $c_i$ and $c_j$:

$\Delta_{c_{ij}} = \Delta(c_i, c_j)$  (3.25)

(6) Use the distance matrix $\Delta_c$ to construct the nearest neighbor matrix $M_c$, whose element $M_{c_{ij}}$ denotes the nearest neighbor coefficient of the cluster center $c_j$ with respect to the cluster center $c_i$.

(7) Construct the nearest neighbor function matrix $L_c$, whose elements are:

$L_{c_{ij}} = M_{c_{ij}} + M_{c_{ji}} - 2$  (3.26)

(8) Calculate the sum of the values of the nearest neighbor functions on the ith row of the nearest neighbor function matrix $L_c$:

$T_i = \sum_{j=1}^{n} L_{c_{ij}}, \quad i = 1, \ldots, n$  (3.27)

where n is the number of initial cluster groups. If

$T_k = \max_i(T_i), \quad i = 1, \ldots, n$  (3.28)

then the sample points in the kth cluster group are determined to be suspected abnormal data.

(9) Assume that the sums of the values of the nearest neighbor functions $T_i$ follow a normal distribution, and that the $T_i$ of all initial clusters (except $T_k$) belong to the same normal distribution. Calculate the mean $\mu$ and variance $\sigma^2$:

$\mu = \frac{1}{n-1} \sum_{i=1, i \neq k}^{n} T_i$  (3.29)

$\sigma^2 = \frac{1}{n-1} \sum_{i=1, i \neq k}^{n} (T_i - \mu)^2$  (3.30)

If $T_k > \mu + a\sigma$, the suspected abnormal cluster group is judged to be an abnormal cluster group, and the samples in it are abnormal data; a is a constant coefficient. On the contrary, if $T_k \le \mu + a\sigma$, the suspected abnormal cluster group is judged to be a normal cluster group, and the samples in it are normal data.

The setting of the coefficient a is related to the detection rate and the false alarm rate. The detection rate is the ratio of detected abnormal data to the total number of abnormal data, and the false alarm rate is the ratio of normal data falsely reported as abnormal to the total number of normal data. The detection rate reflects the accuracy of the detection model, while the false alarm rate reflects its stability. However, a high detection rate and a low false alarm rate are often contradictory, and a practical detection model must find a balance between accuracy and stability. A larger coefficient a means that the false alarm rate of the algorithm will decrease, but at the same time the detection rate will also decrease; a smaller coefficient a has the opposite effect. According to probability theory, samples from a normal distribution are mainly concentrated around the mean, and their dispersion can be characterized by the standard deviation: about 95% of samples fall in the interval (μ − 2σ, μ + 2σ). Therefore, we initialize the coefficient a = 2 and adjust it according to the actual situation.
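Steps (5)–(9) can be sketched as below, reusing nearest_neighbor_function() from the earlier sketch on the cluster centers; zeroing the diagonal before the row sums is an implementation choice (the diagonal only shifts every $T_i$ by the same constant).

```python
import numpy as np

def detect_abnormal_cluster(centers, a=2.0):
    """Steps (5)-(9): flag the initial cluster least similar to the rest.

    centers: (n, d) array of initial cluster centers; a: coefficient of
    the threshold mu + a*sigma. Returns the index of the abnormal cluster
    group, or None if the suspected group passes the normality test.
    """
    Lc = nearest_neighbor_function(centers)      # steps (5)-(7) on centers
    np.fill_diagonal(Lc, 0)                      # drop constant self terms
    T = Lc.sum(axis=1)                           # Eq. (3.27)
    k = int(np.argmax(T))                        # Eq. (3.28): suspect
    rest = np.delete(T, k)
    n = len(T)
    mu = rest.sum() / (n - 1)                    # Eq. (3.29)
    sigma = np.sqrt(((rest - mu) ** 2).sum() / (n - 1))  # Eq. (3.30)
    return k if T[k] > mu + a * sigma else None  # abnormal-cluster test
```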


After constructing the nearest neighbor function matrix $L_c$ in step 7, any row of $L_c$ represents the nearest neighbor functions of one initial cluster group with respect to the other initial cluster groups. According to the similarity measurement rules, a large nearest neighbor function indicates that the cluster groups are far away from each other, that is, the initial clusters are different from each other. The nearest neighbor function can thus be used as an evaluation criterion to measure the similarity between the initial cluster groups. Therefore, the sum $T_i$ of the values of the nearest neighbor functions on each row represents the overall similarity of the cluster group corresponding to the ith row with the other cluster groups. From the perspective of abnormal detection, the cluster group with the lowest similarity degree is the class most different from the other initial clusters. Step 8 assumes that abnormal detection is a process of detecting a certain abnormal class of samples among mostly normal class samples, and that such abnormal samples self-cluster in the feature space after KPCA. The purpose of detection is to find abnormal states for prediction and alarm. The improved nearest neighbor function rule algorithms described above are equally capable of handling test samples that contain multiple classes of data; however, the KPCA fault classification method is needed to further examine the types of the various anomalous patterns. From the above analysis, it is obvious that the improved nearest neighbor function rule algorithms combine the effect of the nearest neighbor coefficients of the clustering points and the actual distance between the clustering points in categorization, and are therefore more suitable for abnormal detection.

3. Semi-supervised KPCA detection algorithms

Kernel functions can be combined with different algorithms to form different kernel-based methods, and the two parts can be designed separately. Combining the property that the output of one feature extraction process can be used as the input of another feature extraction process with the idea of unsupervised pattern abnormal detection, we propose the improved nearest neighbor function rule KPCA abnormal detection method. The method uses the eigenvalues of the principal direction mapping derived from KPCA as the input of the improved nearest neighbor function rule algorithms, and achieves abnormal detection by using those algorithms to categorize and analyze the samples. This covers the abnormal detection process from feature mapping to determining the detection boundary. By combining semi-supervised learning methods with unsupervised detection, the limited labeled sample information is incorporated into the testing process to guide the final pattern recognition. The steps of the semi-supervised KPCA detection algorithms are as follows:

(1) Calculate the input characteristic indexes and set the parameters of the kernel function.
(2) Train the learning machine with the training data, which contain the unlabeled data in the training set and part of the labeled normal data.
(3) Test the learning machine with the test data, which contain the labeled data in the test set and the other part of the labeled normal data.


(4) Calculate the intraclass scatter $S_{cw}$ based on the labeled data in the test set and determine whether it reaches its minimum value. If not, the kernel function parameters are adjusted and the KPCA detection model is reconstructed.
(5) Detect the feature distribution points generated by the KPCA algorithm by performing the improved nearest neighbor function rule algorithms.

Fig. 3.4 The flow chart of the semi-supervised KPCA detection method

The flow chart of the semi-supervised KPCA detection method is shown in Fig. 3.4. The algorithm feeds a limited amount of labeled data into the training process in step 2 and the testing process in step 3. The purpose of involving labeled normal class samples in step 3 is to calculate the intraclass scatter $S_{cw}$ in step 4. After the feature indexes are determined, the kernel function parameter settings have a significant effect on the clustering: a higher interclass scatter and a lower intraclass scatter of the clusters indicate a better clustering effect. In the abnormal detection process, the labeled data usually belong to the normal class, which occupies the majority of the feature distribution. The process of evaluating the clustering effect for a single class of samples involves no interclass scatter, so the clustering effect can be evaluated by the intraclass scatter alone, calculated on a sample basis by referring to Eq. (3.21). For a given kernel function parameter setting, the distribution of the sample features in each principal component direction is obtained by KPCA. Step 4 analyzes the clustering effect of the labeled normal data by calculating their intraclass scatter $S_{cw}$. The algorithm reconstructs the KPCA by adjusting the kernel function parameters according to the exhaustive search method until the smallest $S_{cw}$ is obtained, which indicates that the samples have achieved the best clustering effect. Meanwhile, the prerequisite for this judgment is that there is no overlap between the normal data points (i.e., $S_{cw} \neq 0$). Step 5 applies the improved nearest neighbor function rule algorithms to categorize and judge the feature distribution points generated by KPCA, determining whether the test set contains abnormal data.
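Putting the pieces together, a hedged sketch of the whole detection flow in Fig. 3.4 might look as follows; it reuses the earlier kpca_fit_transform, nearest_neighbor_function, initial_clusters, and detect_abnormal_cluster sketches, and the choice of two retained components is an illustrative assumption.

```python
import numpy as np

def semi_supervised_kpca_detect(X_train, X_test, normal_idx, sigmas, a=2.0):
    """Sketch of the Fig. 3.4 flow. normal_idx indexes the labeled normal
    samples inside X_test; they are used only to compute S_cw, which steers
    the exhaustive search over the kernel width sigma."""
    best = None
    for sigma in sigmas:
        _, F_test = kpca_fit_transform(X_train, X_test, sigma, 2)
        Fn = F_test[normal_idx]
        S_cw = float(((Fn - Fn.mean(axis=0)) ** 2).sum()) / len(Fn)
        if S_cw > 0 and (best is None or S_cw < best[0]):
            best = (S_cw, sigma, F_test)        # smallest nonzero S_cw wins
    _, sigma, F_test = best
    # Step (5): improved nearest neighbor rule on the KPCA feature points
    labels = initial_clusters(nearest_neighbor_function(F_test))
    centers = np.array([F_test[labels == c].mean(axis=0)
                        for c in np.unique(labels)])
    return sigma, labels, detect_abnormal_cluster(centers, a)
```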

3.2.3 Semi-supervised KPCA Classification Algorithms

3.2.3.1 Supervised KPCA Classification Algorithms

1. Algorithm steps

The supervised KPCA classification algorithm is built by training on known sample information; the flow of the proposed method is shown in Fig. 3.5. The steps of the supervised KPCA classification algorithms are as follows:

(1) Set the initial feature indexes and kernel function parameters.
(2) Train the classifier with the training data, which contain labeled data, and pre-test the samples in the training set.
(3) Calculate the separability evaluation index $J_b$ based on the labeled data in the training set and determine whether it reaches its maximum value. If not, the kernel function parameters and the feature index combination are adjusted and KPCA is reconstructed to generate clusters until the maximum $J_b$ is obtained.
(4) Train the classifier using the combination of feature indexes and kernel function parameters corresponding to the maximum $J_b$, and test the representative samples of each class.
(5) Categorize the samples using the nearest neighbor function rule algorithms combined with the labeled data.

Fig. 3.5 The flow chart of the supervised KPCA classification method

2. Algorithm analysis

The separability evaluation index $J_b$ is calculated by analyzing the labeled sample information. Therefore, the testing process must include the labeled data in the training set, which serve as the basis for calculating $J_b$, and the samples in the training set are pre-tested in step 2. The separability evaluation index $J_b$ can guide the selection of the feature indexes and the determination of the kernel function parameters, such as the setting of the kernel width $\sigma^2$ in the Gaussian radial basis function (RBF) kernel. After pre-testing the samples in the training set and calculating the separability evaluation index $J_b$, if a certain combination of feature indexes corresponds to a larger $J_b$, this group of feature indexes is more suitable as the original input of the classifier. A preliminary analysis of the device to be measured can be performed before classification, and its pattern types can be pre-evaluated to generate the original set of feature indexes. In pattern classification, the algorithm automatically selects feature indexes based on the specific fault modes combined with the actual clustering effects.

In KPCA, the nonlinear transformation mapping from the input space to the feature space is determined by the kernel function. The type and parameters of the kernel function determine the nature of the feature space, which in turn affects the classification performed in that space. Related studies have shown that the Gaussian radial basis function (RBF) kernel gives better results when labeled information is lacking in the pattern recognition process [5]. Therefore, the KPCA method in this section uses a Gaussian RBF kernel. The RBF kernel requires the value of its parameter $\sigma^2$ to be determined. However, the classification process often lacks accurate known pattern information to guide this choice; instead, the maximum $J_b$ is obtained and the corresponding kernel parameter value is selected from the perspective of the final classification effect. The input parameters of KPCA consist of a combination of feature indexes and kernel function parameters. Once the initial set of feature indexes and the range of kernel parameter values are determined, the candidate input parameters of the classifier are known. In the initial parameter selection stage, the separability evaluation index $J_b$ can be used as the criterion to evaluate the clustering effect.


In step 3, combining the above two factors, the feature indexes and kernel function parameters used in the classification method are determined by exhaustive search. Initially, a certain combination of feature indexes and kernel parameter values is set to construct KPCA, generate clusters, and calculate the separability evaluation index $J_b$. The combination of feature indexes and kernel function parameters is then adjusted, KPCA is reconstructed, and the training samples are pre-tested to generate feature clusters; $J_b$ is calculated and compared with the previous value. After several iterations, the combination of feature indexes and kernel parameter values corresponding to the maximum $J_b$ is retained and used as the best input parameters to construct the test-specific KPCA, as sketched after this paragraph.

In step 4, a representative sample of each category from the training set is added to the test set to guide the nearest neighbor function rule classification. The nearest neighbor function rule classification algorithms are unsupervised clustering algorithms used to classify the test dataset: the samples can be divided into different groups according to the principle of similarity measurement, but the algorithm itself cannot determine the type of the divided data. Therefore, information about labeled samples is introduced to guide the nearest neighbor function rule algorithms in categorization.

In step 5, the feature values projected in the directions of the first two principal components with the largest cumulative contribution in the feature set generated by the test samples are taken as the x and y coordinates on the classification plane. Feature projection points representing the different samples are thus formed on the classification plane. The nonlinear principal components obtained by KPCA represent the directions of maximum variation of the sample data; the feature map reveals only the spatial distribution of the samples, and the axes of the map do not correspond to a specific physical meaning [6]. The nearest neighbor function rule algorithms make it possible to effectively classify and identify data of different shapes appearing on the feature distribution map. The labeled data information in the test sample set is used to identify and analyze the data in each group after categorization, and the other data in a group containing known samples are assigned to the corresponding labeled class. When the same type of labeled samples is divided into different groups, the group containing the majority of the labeled samples is considered to have the same label as the known labeled samples. When different types of labeled samples are divided into the same group, the type of known sample accounting for the larger number is taken as the type of this group of data. When a group or groups contain no labeled sample data, these samples are considered to belong to a new pattern type.
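The exhaustive search of steps 2–3 can be sketched as a plain grid search over candidate feature subsets and kernel widths, reusing the earlier kpca_fit_transform and separability_index sketches; the candidate grids themselves are caller-supplied assumptions.

```python
import itertools
import numpy as np

def select_inputs_by_jb(X, y, feature_sets, sigmas):
    """Exhaustive search of steps 2-3: pre-test every combination of a
    candidate feature subset and a kernel width and keep the pair whose
    KPCA clusters of the labeled training data maximize J_b."""
    best = (-np.inf, None, None)
    for feats, sigma in itertools.product(feature_sets, sigmas):
        Xs = X[:, list(feats)]
        F_train, _ = kpca_fit_transform(Xs, Xs, sigma, n_components=2)
        jb = separability_index(F_train, y)    # Eq. (3.22) on the clusters
        if jb > best[0]:
            best = (jb, feats, sigma)
    return best                                # (max J_b, features, sigma)
```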


3. The setting of connection adjustment coefficients for the nearest neighbor function rule algorithms

The nearest neighbor function rule algorithm is likely to cause false connections when the numbers of samples in different cluster groups differ widely. Consider further the process by which the nearest neighbor function rules classify the initial clusters and establish connections. As shown in Fig. 3.6, after the initial connection, three initial clusters are formed on the distribution map: $\omega_i$, $\omega_k$, and $\omega_m$. It is obvious that $\omega_k$ and $\omega_m$ should belong to the same type of sample, while $\omega_i$ is another type of sample. The nearest neighbor function rule algorithm analyzes each initial cluster to establish connections and group the clusters. Assume that the algorithm first analyzes the class $\omega_i$ and that there is a minimum connection loss $\gamma_i$ between class $\omega_k$ and class $\omega_i$. The algorithm compares $\gamma_i$ with the maximum connection loss $\alpha_{i\max}$ between two points in class $\omega_i$ and the maximum connection loss $\alpha_{k\max}$ between two points in class $\omega_k$. The numbers of samples in $\omega_k$ and $\omega_i$ differ widely, so $\gamma_i < \alpha_{i\max}$, causing a false connection between class $\omega_k$ and class $\omega_i$. The new cluster $\omega_{i-k}$ will continue to seek connections with other initial clusters after the wrong connection. Meanwhile, the numbers of samples in $\omega_{i-k}$ and $\omega_m$ differ even more, which easily leads to a wrong connection between $\omega_{i-k}$ and $\omega_m$. Things thus get worse and may lead to the eventual failure of the classification.

Fig. 3.6 Consequences of a wrong connection

Avoiding incorrect connections requires optimizing the algorithm's criteria for determining connections. In step 5, the algorithm groups $\omega_i$ and $\omega_k$ into one class if $\gamma_i \le \alpha_{i\max}$ or $\gamma_i \le \alpha_{k\max}$. Therefore, the following adjustment factor is added to the inequalities:

$\gamma_i \le r_a \times \alpha_{i\max}, \quad r_a \in (0, 1]$  (3.31)

$\gamma_i \le r_a \times \alpha_{k\max}, \quad r_a \in (0, 1]$  (3.32)

The adjustment coefficient $r_a$ serves to tighten the connection rules in the above inequalities. The smaller the value of $r_a$, the stricter the conditions for establishing a "connection", which can effectively avoid wrong connections between different classes. However, it also increases the risk that samples of the same type cannot be grouped together. In classification, the value of the adjustment coefficient $r_a$ should be set according to the specific clustering situation and data distribution characteristics to improve the accuracy of categorization.
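For illustration, the adjusted connection test of Eqs. (3.31)–(3.32) reduces to a one-line predicate; the default value of $r_a$ below is only an example, not a recommendation from the text.

```python
def should_merge(gamma_i, alpha_i_max, alpha_k_max, ra=0.8):
    """Connection test with the adjustment coefficient ra, Eqs. (3.31)-(3.32).
    ra must lie in (0, 1]; ra = 0.8 here is only an illustrative default."""
    return gamma_i <= ra * alpha_i_max or gamma_i <= ra * alpha_k_max
```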


3.2.3.2 Semi-supervised KPCA Classification Algorithms

1. Algorithm steps

Supervised KPCA requires completely labeled samples to train the classifier, and only with a sample database covering multiple pattern types can the reliability of the classification be guaranteed. However, the generally limited number of labeled data greatly restricts the application of supervised KPCA. In response to the limited labeled data and the large amount of unlabeled data in actual situations, it is necessary to ask: can unlabeled data play as active a role in detection and classification as labeled data? The core idea of semi-supervised learning, which considers the data composition and the self-learning ability of the samples to be classified, can solve this problem. Combining the performance characteristics of supervised KPCA with the semi-supervised co-training mechanism of labeled and unlabeled data, the semi-supervised KPCA classification algorithm is proposed; its flow is shown in Fig. 3.7. The steps of the semi-supervised KPCA classification algorithms are as follows:

(1) Train the classifier with the training data, which contain labeled data.
(2) Pre-test the samples in the training set using the classifier generated in step 1.
(3) Calculate the separability evaluation index $J_b$ while adjusting the feature index combinations and kernel function parameters by exhaustive search until the maximum is obtained. The corresponding combination of feature indexes and kernel parameters is taken as the classifier input parameters.
(4) Train the classifier using all training samples, and test it using all test samples.
(5) Categorize the samples using the nearest neighbor function rule algorithms and fine-tune the kernel function parameters according to the effectiveness of the clustering.
(6) Test the classifier using all the test samples together with the representative labeled training samples of each class.
(7) Categorize the samples using the nearest neighbor function rule algorithms combined with the labeled data.

Fig. 3.7 The flow chart of the semi-supervised KPCA classification method

2. Algorithm analysis

In step 1, the labeled data in the training set are used to train the classifier, the initial kernel function parameter values are set, and a certain combination from the initial set of feature indexes is selected as the original input parameters of the samples. This operation is essentially the pre-training of the classifier in supervised learning. In step 2, the training samples with labeled and unlabeled data are pre-tested, and the separability evaluation index $J_b$ is calculated from the labeled data. In step 3, after evaluating the different input parameters by exhaustive search, the combination of feature indexes and kernel parameter values corresponding to the maximum of the separability evaluation index is selected. As a result, the algorithm determines the best combination of feature indexes in the pre-generated set of feature indexes while selecting the values of the kernel parameters.

In step 4, the classifier is trained using all training samples and tested using all test samples. The test samples in step 4 are all unlabeled data. Therefore, the values obtained in step 3 provide only a certain degree of separability evaluation in step 4; the kernel function parameters determined in step 3 are not necessarily the best choice for the unlabeled test samples. This may reduce the classification accuracy in step 4 or even lead to incorrect classification. The nearest neighbor function rule algorithms directly classify the test samples by analyzing the distribution of the sample features, which can be used as a classification performance evaluation index to optimize the values of the kernel function parameters in the classifier [7].


In step 5, the kernel parameter values are fine-tuned by analyzing the classification effect of the nearest neighbor function rules. This is done by varying the kernel parameter values within a small range around the values of the initial classification: a series of kernel parameter values is generated according to a set interval, and the KPCA method is performed with each parameter value to classify the samples in the test set. The number of groups produced by the nearest neighbor function rule algorithms reflects the final classification effect; in a certain sense, a large number of groups means that the classification is more fragmented and the classification accuracy is not high enough. Therefore, the principle of fine-tuning is to make the number of groups after nearest neighbor function rule categorization converge to a value as small as possible but greater than 1. The fine-tuned kernel function parameters are the secondarily optimized values.

In step 6, the fine-tuned parameters are used to construct KPCA, and the labeled representative samples from the training set are added to the test set to form a new test sample set. The classifier is trained with all training samples and classifies the new test samples. The nearest neighbor function rule algorithm is an unsupervised clustering algorithm that requires labeled data to guide pattern recognition; therefore, samples of known classes are introduced into the test set to aid in labeling the sample categories. This embodies the semi-supervised co-training idea of labeled and unlabeled data.

In step 7, the connection adjustment coefficient $r_a$ is set, the new test data are classified by the nearest neighbor function rule algorithms, each group of data after categorization is analyzed in combination with the labeled data in the test set, and finally the pattern of the samples is determined.
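The fine-tuning of step 5 can be sketched as a scan over kernel widths near the initial value, keeping the width whose categorization yields the smallest number of groups greater than 1; the ±20% interval and the step count are illustrative assumptions, and the sketch reuses the earlier helper functions.

```python
import numpy as np

def fine_tune_sigma(X_train, X_test, sigma0, span=0.2, steps=9):
    """Step 5: scan kernel widths near sigma0 and keep the one whose
    nearest-neighbor-rule categorization of the test set yields the
    fewest groups while still producing more than one group."""
    best_groups, best_sigma = np.inf, sigma0
    for sigma in np.linspace((1 - span) * sigma0, (1 + span) * sigma0, steps):
        _, F_test = kpca_fit_transform(X_train, X_test, sigma, 2)
        labels = initial_clusters(nearest_neighbor_function(F_test))
        n_groups = len(np.unique(labels))
        if 1 < n_groups < best_groups:
            best_groups, best_sigma = n_groups, sigma
    return best_sigma
```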

3.2.4 Application of Semi-supervised KPCA Method in Transmission Fault Detection and Classification

3.2.4.1 Experiment and Characteristic Analysis of Typical Transmission Failure

1. The structure of the experimental system

The structure of the experimental system is shown in Fig. 3.8, and the transmission experiment table and console are shown in Fig. 3.9a.

(1) The components of the test bench

Traction motor: maximum output power 75 kW, maximum speed 4500 r/min.
Loading motor: maximum output power 45 kW, maximum speed 3000 r/min, maximum output torque 150 N·m.
Experimental console: controls the speed of the traction motor and the torque of the loading motor; collects and displays the speed and torque at the input and output ends of the transmission; calculates and displays the transmission input and output power, etc.
Input and output speed and torque sensors: collect the input and output speed and torque values respectively.
Accompanying test transmission: changes the speed and torque to realize the transmission coordination between the loading motor and the tested transmission.
Tested transmission: Dongfeng SG135-2 transmission with three shafts and five gears.

Fig. 3.8 The structure of the experimental system

Fig. 3.9 The experimental transmission system


(2) Experimental transmission route

The console controls the speed of the traction motor, which is used as the power input. The power passes through the tested transmission, the speed and torque sensors at the output end, and the accompanying test transmission to the loading motor. The loading motor outputs a reverse torque, which is applied to the tested transmission through the torque converter of the accompanying test transmission to provide different loads for the transmission operation. The electrical energy generated by the loading motor is fed back to the traction motor through the reverser to close the circuit.

(3) Experimental transmission

The transmission sketch and sensor arrangement of the SG135-2 transmission used in the experiment are shown in Fig. 3.9b. The transmission has three shaft systems, comprising the input, intermediate, and output shafts, and five gear meshing pairs. The transmission ratio of each gear is shown in Table 3.1.

2. The specific details of the test system and signal acquisition

The specific details of the test system and signal acquisition are as follows:

(1) Test and analysis system

Sensor: piezoelectric three-way acceleration sensor, which collects acceleration and velocity signals.
Charge amplifier: filters and amplifies the collected vibration signal. The amplification can be set to 1, 3.16, 10, 31.6, 100, 316, and 1000.
Multi-functional interface box: 16 channels can be collected at the same time, with filtering and amplifying functions. The amplification can be set to 1, 3.16, 10, 31.6, and 100.
Acquisition card: 12-bit A/D signal conversion.
Signal acquisition and analysis system: DASC signal acquisition and analysis system with signal acquisition and online and offline time domain waveform analysis, spectrum analysis, spectrum correction, refined spectrum analysis, demodulation analysis, and other functions, as shown in Fig. 3.10a.

(2) Test method

In this experiment, as shown in Figs. 3.9 and 3.10b, four three-way vibration acceleration sensors are arranged near the input shaft bearing housing, at the two ends of the intermediate shaft, and at the output shaft bearing housing of the tested transmission. The vibration signals in the horizontal (X), vertical (Y), and axial (Z) directions are collected at the same time.

Table 3.1 Gear ratio for each gear

Gear:               1st gear  2nd gear  3rd gear  4th gear  5th gear
Transmission ratio: 5.29      2.99      1.71      1         0.77


Fig. 3.10 Signal acquisition

The vibration signal collected by the sensors is amplified by the B&K2653 charge amplifier and input to the DAS multi-functional interface box, then input to the portable computer after A/D conversion. The DASC signal acquisition and analysis system is used for recording and joint time–frequency domain analysis. At the same time, the speed and torque sensors placed at both ends of the transmission under test acquire the speed and load information at the input and output of the transmission.

(3) Data acquisition

For the fault detection experiment, the 5th gear of the transmission was used as the experimental object. The transmission was designed to run under three conditions: normal, slight pitting of the tooth surface, and severe pitting of the tooth surface; the faulty gears are shown in Fig. 3.11a, b. For the classification experiment, the output shaft cylindrical roller bearing of the Dongfeng SG135-2 transmission was used as the experimental object. The measurement point is placed on the output shaft bearing housing (i.e., the position of measurement point 1 in Fig. 3.9). The experiment was designed to run the transmission under three conditions: normal, spalling of the inner ring, and spalling of the outer ring; Fig. 3.11c, d show the faulty bearing components. The transmission is set to 3rd gear, with the tooth number ratio of the normally meshing gears at the input end being 38/26 and the tooth number ratio of the meshing gears in 3rd gear being 35/30.

The operating conditions of the pitting detection experiment are as follows.
Rotational speed: 1200 r/min (input shaft), 820 r/min (intermediate shaft), 1568 r/min (output shaft).
Output torque: 100.2 N·m; output power: 16.5 kW.
Rotational frequency: 20 Hz (input shaft), 13.7 Hz (intermediate shaft), 26 Hz (output shaft).
5th gear meshing frequency: 574 Hz; constant-mesh gear meshing frequency: 520 Hz.


Fig. 3.11 Faulty bearing components

As shown in Fig. 3.12, the difference between the time domain waveforms of the normal signal and the slight pitting signal is very small, and it is impossible to determine from the waveforms whether a fault is present. The waveform of the severe pitting signal, on the other hand, contains many shock components, and its amplitude increases significantly.
3. The work conditions of bearing
(1) Bearing classification experimental transmission operating conditions
Speed: 2400 r/min (input shaft), 1642 r/min (intermediate shaft), 1370 r/min (output shaft).
Output torque: 105.5 N·m. Output power: 15 kW.
(2) Sampling parameter setting
Acquisition signal: vibration acceleration, vibration velocity.
Test direction: horizontal radial, vertical radial, and axial.
Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-aliasing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
Sampling length: 1024 × 90 points.
(3) Transmission characteristic frequencies
Rotational frequency: 40 Hz (input shaft), 27.4 Hz (intermediate shaft), 22.8 Hz (output shaft).
3rd gear meshing frequency: 798 Hz.


Fig. 3.12 The time domain waveform for each working condition

Normally meshing gear meshing frequency: 1040 Hz.
Output shaft rolling bearing parameters: model NUP311EN; see Table 3.2.
Output shaft rolling bearing characteristic frequencies: as shown in Table 3.2.

Table 3.2 Output shaft rolling bearing parameters and characteristic frequency

Pitch diameter D (mm) | Rolling element diameter d0 (mm) | Rolling element number m | Contact angle α (°) | Inner ring passing frequency fi (Hz) | Outer ring passing frequency fo (Hz) | Rolling element passing frequency fg (Hz) | Cage passing frequency fb (Hz)
85 | 18 | 13 | 0 | 179.6 | 116.8 | 51.4 | 10.9

Gear surface pitting is a minor fault. The vibration signal extracted under the minor pitting condition usually shows a change in vibration energy in the time domain relative to the normal signal, but the impact phenomenon is not obvious. In the frequency domain, the energy in the rotational-frequency band increases to varying degrees without significant modulation. Such a signal is difficult to distinguish from normal signals by signal processing methods alone, which makes minor gear faults of this kind harder to detect.
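As a hedged cross-check of Table 3.2, the standard rolling-bearing defect-frequency formulas reproduce three of the four tabulated values from the geometry and the 22.8 Hz output shaft rotational frequency quoted above; the cage frequency convention is not stated here, so fb is left as tabulated. The snippet is an illustration, not the book's code.

```python
# Hedged cross-check of Table 3.2 with the standard bearing defect formulas;
# f_r = 22.8 Hz is the output shaft rotational frequency quoted above.
import numpy as np

def bearing_frequencies(fr, D, d0, z, alpha_deg=0.0):
    """Inner ring (BPFI), outer ring (BPFO), and rolling element (BSF)
    passing frequencies in Hz for a rolling bearing."""
    ratio = (d0 / D) * np.cos(np.radians(alpha_deg))
    bpfi = 0.5 * z * fr * (1 + ratio)              # inner ring, f_i
    bpfo = 0.5 * z * fr * (1 - ratio)              # outer ring, f_o
    bsf = 0.5 * (D / d0) * fr * (1 - ratio ** 2)   # rolling element, f_g
    return bpfi, bpfo, bsf

print(bearing_frequencies(22.8, D=85, d0=18, z=13))
# -> approximately (179.6, 116.8, 51.4), matching f_i, f_o, f_g in Table 3.2
```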

A semi-supervised KPCA method, combined with the gear surface pitting experiments, is applied to detect mild pitting faults in gears. The semi-supervised KPCA detection model is evaluated in terms of both correctness and stability, with the detection rate and false alarm rate of the detection results as the evaluation indices. The samples with normal and minor gear surface pitting conditions are used as combination A; its detection rate is analyzed to evaluate the ability of the semi-supervised KPCA method to detect minor faults and to test the correctness of the model. The samples that are all normal are used as combination B; its false alarm rate is analyzed, which is equivalent to testing the stability of the semi-supervised KPCA detection model when no abnormal data are present.
The vibration acceleration signals were collected separately from the experimental transmission under two operating conditions: normal and minor gear surface pitting. Five time domain statistical characteristics, namely mean, mean square, variance, skewness, and root mean square amplitude, and one frequency domain characteristic, namely the amplitude at the rotational frequency of the shaft carrying the 5th-gear pair, are computed to describe the gear running status; each set of data after feature extraction constitutes one sample. For the normal condition, a total of 48 samples are obtained from the data collected in the x and y directions, of which 30 are used for training and the other 18 for testing. To simulate the actual inspection situation, 12 normal-class samples are set as known samples in the training set. For the minor gear surface pitting condition, 4 samples are collected in the x direction, of which 2 are used for training and the other 2 for testing. As a result, in combination A the training set contains 32 samples and the test set contains 20 samples, giving a 32 × 6 feature data matrix for training and a 20 × 6 matrix for testing. Similarly, in combination B the training set contains 30 samples and the test set contains 18 samples.
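A minimal sketch of the six-feature extraction just described is given below, assuming plain NumPy/SciPy. The reading of "root mean square amplitude" as the square-root amplitude is an assumption (plain RMS is an alternative reading), and the function name is hypothetical.

```python
# Hedged sketch of the 6-dimensional feature vector described above.
import numpy as np
from scipy.stats import skew

def extract_features(x: np.ndarray, fs: float, shaft_hz: float) -> np.ndarray:
    """Five time-domain statistics plus the spectral amplitude at the
    rotational frequency of the shaft carrying the 5th-gear pair."""
    mean = x.mean()
    mean_square = np.mean(x ** 2)
    variance = x.var()
    skewness = skew(x)
    # Read here as the square-root amplitude (mean of sqrt|x|, squared);
    # use np.sqrt(mean_square) instead if plain RMS is intended.
    rms_amplitude = np.mean(np.sqrt(np.abs(x))) ** 2
    # Single-sided amplitude spectrum; take the bin nearest shaft_hz.
    spectrum = np.abs(np.fft.rfft(x)) * 2.0 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    shaft_amp = spectrum[np.argmin(np.abs(freqs - shaft_hz))]
    return np.array([mean, mean_square, variance, skewness,
                     rms_amplitude, shaft_amp])
```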

3.2.4.2 The Detection Result of Semi-supervised KPCA Algorithms

The kernel function used in KPCA is the Gaussian radial basis function (RBF) kernel. The detection method applies an improved nearest neighbor function rule algorithm to classify the samples, with the standard deviation coefficient a = 2 in the discriminant inequality of the algorithm. The following is the semi-supervised KPCA detection analysis of the experimental data; all figures show the distribution of the feature samples projected onto the 1st-2nd principal component directions.
(1) Experimental data detection under normal and minor gear surface pitting conditions
The samples with normal and minor gear surface pitting are used as combination A and are detected with the semi-supervised KPCA detection algorithm.


Fig. 3.13 Combined A data semi-supervised KPCA detection effect

The kernel function parameter is set to σ² = 4.9. Figure 3.13a shows the semi-supervised KPCA feature distribution with 26 labeled sample points, where the horizontal coordinate represents the 1st principal direction and the vertical coordinate the 2nd principal direction. The “*” samples represent the six labeled normal data added to the test sample set, which are used to calculate the intra-class scatter Scw and optimize the detector performance. The “Δ” samples represent the fault data detected by the algorithm, and the “◯” samples represent the other data in the test set. Figure 3.13a shows a densely distributed cluster with the known normal samples distributed within it. A good clustering effect is achieved because the detection algorithm optimizes the kernel function parameter by calculating the intra-class scatter evaluation index. Notice that two sample points marked with “Δ” lie far from the cluster; the improved nearest neighbor function rule algorithm identifies these two points, numbered 25 and 26, as abnormal fault samples.
To check the correctness of the results, the different types of test samples are first marked with different icons: the “*” samples represent the normal data and the “Δ” samples the minor surface pitting data, as shown in Fig. 3.13b. Figure 3.13b exactly matches the results detected by the algorithm in Fig. 3.13a: the normal sample points cluster into one class, and the 2 faulty sample points, numbered 25 and 26, fall outside the cluster. This indicates that, for the combination A data containing mild tooth surface pitting samples, the detection rate of the semi-supervised KPCA method is 100% and the false alarm rate is 0%, reflecting a high detection performance.
(2) Experimental data detection under normal conditions
The samples with all normal data are used as combination B. The distribution of the feature samples is shown in Fig. 3.14. A relatively dispersed cluster is shown in Fig. 3.14a
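For orientation, here is a minimal stand-in for the projection-plus-flagging step, using scikit-learn's KernelPCA. The simple distance criterion below replaces the book's improved nearest neighbor function rule and, like the placeholder data, is only an assumption.

```python
# Minimal sketch: RBF kernel PCA projection and a simple outlier flag.
import numpy as np
from sklearn.decomposition import KernelPCA

sigma2 = 4.9                            # kernel parameter from the text
X = np.random.randn(26, 6)              # placeholder for the 26 feature samples

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0 / (2.0 * sigma2))
Z = kpca.fit_transform(X)               # projection on the 1st-2nd principal directions

center = Z.mean(axis=0)
d = np.linalg.norm(Z - center, axis=1)
flagged = np.where(d > d.mean() + 2.0 * d.std())[0]   # a = 2, as in the text
```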


Fig. 3.14 Combined B data semi-supervised KPCA detection effect

with the labeled normal-class sample points distributed within it. The kernel function parameter is set to σ² = 5.2 after algorithm optimization. The algorithm detection results show that there are no abnormal sample data. As shown in Fig. 3.14b, all sample points belong to the normal class, which verifies the correctness of the detection results in Fig. 3.14a. This indicates that, for the all-normal experimental data in combination B, the detection rate of the semi-supervised KPCA method is 100% and the false alarm rate is 0%. The above results demonstrate the effectiveness and high performance of the algorithm.

3.2.4.3 Application Analysis of Semi-supervised KPCA Algorithms

The semi-supervised kernel principal component analysis method is applied to classify sample data of the transmission under the normal, bearing inner ring spalling, and bearing outer ring spalling work conditions.
1. Feature index extraction
The vibration acceleration signals collected from the experimental transmission under the normal, bearing inner ring spalling, and bearing outer ring spalling operating conditions were processed separately to extract the following features:
(1) Time domain statistical indicators: mean, mean square, kurtosis, variance, skewness, peak value, root mean square amplitude.
(2) Dimensionless feature indicators: waveform indicator, pulse indicator, peak indicator, margin indicator.
(3) Frequency domain characteristics: frequency of the highest peak of the spectrum, the amplitude at the bearing inner ring passing frequency in the refined spectrum, and the amplitude at the bearing outer ring passing frequency in the refined spectrum.


The above feature indices constitute the feature set for the experimental analysis, which serves as raw information for further feature selection, feature extraction, and pattern classification. From the time domain sampling sequences collected in the two directions (x and y), a total of 44 sets of sample data were obtained for each state, of which 24 sets were selected as training samples and another 20 sets as test samples. For each sample, 14 feature indicators are extracted, so a 72 × 14 training data matrix and a 60 × 14 test data matrix are formed for the 3 types of samples. To simulate insufficient prior information about known categories in actual classification, the training sample set contains three types of data: the normal samples and the bearing inner ring spalling fault samples are taken as known-type samples, while the bearing outer ring spalling fault samples are treated as unknown samples. The three types of samples are identified and classified in the test set.
2. Classify the experiment data with supervised KPCA algorithms
The supervised KPCA uses the labeled normal and bearing inner ring spalling data from the training samples to train the classifier, which then classifies and identifies the samples in the test set. The RBF kernel function is used for KPCA, and the connection adjustment coefficient is set to ra = 0.8. The supervised KPCA method automatically selects the following features from the feature set: variance, peak value, root mean square amplitude, frequency of the highest peak of the spectrum, bearing outer ring passing frequency amplitude, and bearing inner ring passing frequency amplitude.
The nonlinear principal components in KPCA (as in PCA) have an associated contribution rate, which measures how strongly the eigenvalues projected onto a principal component direction explain the sample variance. In this experiment, the cumulative contribution of the first 2 principal components generated by KPCA is 94.98%, indicating that the first two principal components carry enough of the sample variation information to be used for transmission fault classification and identification. The clustering effect of supervised KPCA in the experiment, i.e., the distribution of the feature samples projected onto the first 2 principal directions, is shown in Fig. 3.15.
Figure 3.15 shows the effect of supervised KPCA clustering; each panel contains 80 labeled test sample points. The “⛛” and “✩” labels in Fig. 3.15a represent the 10 normal and 10 bearing inner ring spalling data extracted from the training set and added to the test sample set to guide the final categorization of the nearest neighbor function rule algorithm. The “◯” labels represent the 60 test data. The labeled normal and bearing inner ring spalling data added to the test samples are used to calculate the separability evaluation index Jb = 0.87868, and the optimized RBF kernel parameter is σ² = 10. As seen in the figure, the data points form three cluster groups, and the “⛛” samples and “✩” samples are located in two of them.
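The contribution-rate computation mentioned above can be made concrete in a few lines: each kernel principal component's contribution is its eigenvalue divided by the eigenvalue sum of the centered kernel matrix. The sketch below uses the standard kernel-centering formula; the example call is hypothetical.

```python
# Sketch: per-component contribution rates from a centered kernel matrix.
import numpy as np

def contribution_rates(K: np.ndarray) -> np.ndarray:
    """Contribution rate of each kernel principal component =
    eigenvalue / sum of eigenvalues of the centered kernel matrix."""
    n = K.shape[0]
    J = np.ones((n, n)) / n
    Kc = K - J @ K - K @ J + J @ K @ J        # center the data in feature space
    lam = np.clip(np.linalg.eigvalsh(Kc)[::-1], 0.0, None)  # descending, >= 0
    return lam / lam.sum()

# e.g. rates = contribution_rates(K); rates[:2].sum() -> 0.9498 in this experiment
```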


Fig. 3.15 Supervised KPCA clustering effect

To check the correctness of the clustering, the following settings are made: the “⛛” labels represent the normal samples, the “✩” labels the bearing inner ring spalling samples, and the “Δ” labels the bearing outer ring spalling samples. Figure 3.15b shows the supervised KPCA clustering effect: the three cluster groups represent the three types of samples, and the clustering matches exactly what is seen in Fig. 3.15a. The separability evaluation index Jb = 0.85378 indicates that the actual clustering achieves good separability.
3. Classify the experiment data with semi-supervised KPCA algorithms
The semi-supervised KPCA uses the RBF kernel function, with the connection adjustment coefficient ra = 0.8 and the fine-tuning range c_var = 1. The semi-supervised KPCA method automatically selects the following features from the feature set: variance, peak value, root mean square amplitude, frequency of the highest peak of the spectrum, bearing outer ring passing frequency amplitude, and bearing inner ring passing frequency amplitude.
Figure 3.16 shows the effect of semi-supervised KPCA clustering; each panel contains 80 labeled test sample points. The “⛛” and “✩” samples in Fig. 3.16a represent the 10 normal and 10 bearing inner ring spalling data extracted from the training set and added to the test sample set to guide the final categorization of the nearest neighbor function rule algorithm. The two types of labeled data in the test set are used to calculate the separability evaluation index Jb = 0.94289, and after two optimization passes of the semi-supervised KPCA algorithm the kernel parameter is set to σ² = 10. As seen in the figure, the data points form three cluster groups whose centers are triangularly distributed, with a small intra-class scatter and a large inter-class scatter for each cluster.
To check the correctness of the clustering, the same settings are made: the “⛛” labels represent the normal samples, the “✩” labels the bearing inner ring spalling samples, and the “Δ” labels the bearing outer ring spalling samples. Figure 3.16b shows the semi-supervised KPCA clustering effect: the


Fig. 3.16 Semi-supervised KPCA clustering effect

three cluster groups represent the three types of samples, and the clustering matches exactly what is seen in Fig. 3.16a. The separability evaluation index Jb = 0.91823 indicates that the actual clustering achieves good separability. Comparing Figs. 3.15 and 3.16, the semi-supervised KPCA method achieves a better clustering effect on the bearing class data than the supervised KPCA method.
4. Classification results under the two models
Before classification, the test samples of each category are numbered: normal samples 1–30, bearing inner ring spalling fault samples 31–60, and bearing outer ring spalling fault samples 61–80. KPCA forms a series of feature projection points on the classification plane spanned by the 1st-2nd principal directions, and the nearest neighbor function rule algorithm classifies the sample points according to a similarity metric. The following are the results after categorization by the nearest neighbor function rule algorithm in both modes.
(1) Supervised KPCA final classification results
Normal samples: 24, 1, 29, 23, 21, 3, 2, 30, 16, 7, 20, 18, 8, 19, 11, 27, 14, 5, 26, 22, 12, 4, 28, 17, 13, 9, 6, 25, 15, 10.
Bearing inner ring spalling fault samples: 58, 38, 31, 56, 39, 37, 59, 34, 54, 46, 44, 35, 33, 55, 47, 36, 50, 49, 60, 53, 51, 57, 48, 52, 42, 32, 45, 43, 41, 40, 79, 65, 64, 72, 71, 63, 80, 76, 69, 75, 70, 62, 78, 74, 68, 66, 73, 61, 77, 67.
The above results indicate that the supervised KPCA algorithm correctly classified the normal samples while misclassifying the bearing outer ring spalling fault samples (numbers 61–80) into the bearing inner ring spalling fault type. The misclassification rate was calculated to evaluate the classification correctness of KPCA accurately: it is the ratio of the number of misclassified samples plus the number of samples not correctly classified to the total number of


tested samples. The calculation shows that the supervised KPCA method has a 25% misclassification rate in the bearing class classification experiment (20 of the 80 test samples).
(2) Semi-supervised KPCA final classification results
Normal samples: 24, 19, 15, 11, 3, 1, 23, 2, 29, 21, 25, 10, 30, 16, 7, 27, 20, 18, 14, 8, 26, 22, 5, 4, 28, 17, 13, 12, 9, 6.
Bearing inner ring spalling fault samples: 58, 38, 31, 56, 39, 37, 59, 34, 54, 46, 44, 35, 33, 55, 47, 36, 50, 49, 60, 53, 57, 51, 48, 52, 42, 32, 45, 43, 41, 40.
Unknown type 1 samples: 74, 68, 64, 62, 80, 76, 69, 79, 75, 70, 78, 66, 72, 71, 65, 63, 77, 67.
Unknown type 2 samples: 73, 61.
The classification results are clearly consistent with the actual sample categories: the semi-supervised KPCA method separates the normal samples from the bearing inner ring spalling fault samples, and the unknown type 1 samples are exactly the bearing outer ring spalling fault samples. Since the classifier has no a priori information about the bearing outer ring spalling fault, these data are assigned to unknown type 1. The unknown type 2 samples are the misclassified part of the data and also belong to the bearing outer ring spalling fault; as seen in Fig. 3.16b, the two data points located near the bearing outer ring spalling cluster correspond to the unknown type 2 samples. The calculation shows that the misclassification rate of the semi-supervised KPCA method is 2.5% (2 of the 80 test samples), which demonstrates a good classification performance.
5. Comparison of KPCA classification results under the two models
Table 3.3 compares the classification results of the supervised and semi-supervised KPCA methods. As seen from the table, for the bearing class fault classification experiments, the semi-supervised KPCA achieves more desirable classification results than the supervised KPCA when a priori sample information is insufficient.

Table 3.3 Comparison of KPCA classification in two models

                                                           | Supervised KPCA | Semi-supervised KPCA
Clustering effect in the figure                            | Normal          | Good
Separability evaluation index Jb (detection experiment)    | 0.87868         | 0.94289
Separability evaluation index Jb (control experiment)      | 0.85378         | 0.91823
Nearest neighbor function rule misclassification rate (%)  | 25              | 2.5
Kernel function parameter σ²                               | 10              | 10


3.3 Outlier Detection Based on Semi-supervised Fuzzy Kernel Clustering

3.3.1 Correlation of Outlier Detection and Early Fault

The problem of early fault detection in the field of mechanical equipment fault diagnosis has many similarities to the outlier data problem in data mining. Early fault detection separates faults from normal signals, while outlier data mining identifies abnormal data in a dataset. Therefore, outlier data mining methods are also applicable to the early fault detection problem. At the same time, applying artificial intelligence methods to machinery fault diagnosis provides a new research path for the field.

3.3.1.1 Outlier Detection

The outlier detection problem was first widely studied in the field of statistics in the 1980s [8]. As research developed, many outlier detection methods emerged, which can be broadly classified into statistics-based, distance-based, density-based, clustering-based, and deviation-based methods.
(1) Statistics-based method
Statistics-based methods are inconsistency tests, most of which were developed from inconsistency testing methods for data sets with different distributions. The data are assumed to follow some distribution, such as a normal or gamma distribution; the dataset is then fitted with this distribution, and data that deviate from the fitted model are identified as outliers [9]. This assumes that the underlying distribution and its parameters are known. In many situations, however, we do not know whether a particular attribute follows a standard distribution, and finding a distribution that fits the attribute requires extensive testing at great cost. Moreover, most distribution models can only be applied directly to the original feature space of the data and do not accommodate transformed spaces, so statistics-based methods cannot detect outliers in high-dimensional data.
(2) Distance-based method
Knorr et al. [9] first proposed a distance-based method and summarized it systematically. An outlier is defined as follows: if a data point in the data set P lies at a distance greater than r from at least a fraction β of the other data points, then the point is a distance-based outlier. Outliers are thus defined by a single global criterion determined by the parameters r and β. This definition contains and extends the idea of distance-based approaches and overcomes the main drawback of distribution-based approaches.


It remains effective in finding outliers when the data set does not follow any standard distribution. The method requires no a priori knowledge of the data distribution, and its definition covers the detection of outliers by statistical methods based on the normal distribution, the Poisson distribution, and other distributions. However, both statistics-based and distance-based methods measure the inconsistency of the data from a global perspective, which is not effective when the dataset contains multiple types of distributions or a mixture of subsets with different densities.
(3) Density-based method
Breunig et al. [10] first proposed a density-based method, which uses the local outlier factor (LOF) of each data object to characterize the local density of the object's neighborhood. The density of an object's neighborhood can be described either by the radius of the nearest neighborhood containing a fixed number of objects or by the number of objects contained within a neighborhood of a specified radius. The LOF of an object is based on a single parameter, MinPts, the number of nearest neighbors used to define the object's local neighborhood. A data point is considered an outlier if it has a high LOF value. The density-based method detects outliers by comparing an object's density with the average density of its nearest neighbors; it overcomes the weakness of the distance-based method on data sets mixing subsets of different densities and achieves higher detection accuracy. (A short sketch of both the distance-based and LOF criteria is given after this list.)
(4) Clustering-based method
Many clustering methods, such as DBSCAN [11], can perform the task of outlier detection. However, the main goal of these methods is to produce meaningful clusters, and outlier detection is completed only incidentally, so methods not specifically optimized for outlier detection have difficulty producing satisfactory detection results. Even so, in many application scenarios clustering-based methods can detect more meaningful outliers than statistics-based and distance-based methods, owing to their ability to capture local characteristics of the data.
(5) Deviation-based method
The deviation-based approach considers objects that deviate from the feature description of the data as outliers. Deviation-based outlier detection methods are classified into sequential exception methods and OLAP data cube methods.
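The following hedged illustration contrasts the distance-based (Knorr) and density-based (LOF) criteria on synthetic data; scikit-learn and the toy parameters (r, β, n_neighbors) are assumptions for demonstration only.

```python
# Illustrative comparison of the distance-based and LOF outlier criteria.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),   # dense normal cluster
               [[6.0, 6.0]]])                          # one planted outlier

# Distance-based rule: outlier if more than a fraction beta of the data
# lies farther than r away (Knorr's definition).
r, beta = 3.0, 0.9
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
dist_outliers = np.where((D > r).mean(axis=1) > beta)[0]

# Density-based rule: LOF compares each point's local density with that of
# its MinPts nearest neighbors; fit_predict labels outliers with -1.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
lof_outliers = np.where(labels == -1)[0]
print(dist_outliers, lof_outliers)   # both should flag the planted point
```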

3.3.1.2 Applicability of Early Fault Outlier Detection Methods

For statistics-based methods, normal data approximately follow a Gaussian distribution, but the distribution of fault data is unknown, and semi-supervised learning methods cannot fit the distribution of fault data the way supervised learning methods can. Moreover, the feature


space mapped by the kernel function is high-dimensional, which makes this class of methods unsuitable. For the distance-based method, the choice of the parameters r and β strongly affects the detection results; a supervised method can select the best parameter values by learning from labeled samples, but an unsupervised method has difficulty obtaining appropriate values. Meanwhile, in early fault detection, the difference in data structure between normal and fault data can lead to inferior detection results. For density-based methods, a semi-supervised learning method cannot obtain a priori knowledge of the outlier threshold or of the number of outliers in the dataset, which limits their applicability. Although outlier detection is only an incidental task for clustering methods, their unsupervised learning ability makes it possible to enhance their outlier detection capability through algorithmic adjustments. Clustering-based methods are therefore among the most promising analysis methods for early fault detection problems [12].

3.3.2 Semi-supervised Fuzzy Kernel Clustering

3.3.2.1 Semi-supervised Learning

Currently, semi-supervised clustering methods are broadly classified into three categories [13]:
(1) Constraint-based method
The constraint-based method guides the clustering process using labeled data to finally obtain an appropriate partition.
(2) Distance function-based method
The distance function-based method guides the clustering process with a distance function learned from the labeled data.
(3) Integrated constraint and distance function learning method
Bilenko et al. integrated the above two ideas under one framework based on the C-means algorithm [14]. Sugato et al. proposed a unified probabilistic model for semi-supervised clustering [15].
In this section, we study the semi-supervised fuzzy kernel clustering method based on constraints and distances, which belongs to the family of fuzzy C-means algorithms.

3.3.2.2 Semi-supervised Fuzzy Kernel Clustering Method

The semi-supervised fuzzy kernel clustering method uses a small number of known labeled samples to guide the clustering process, introducing partially


supervised behavior into the unsupervised clustering method. The performance of the method is significantly better than that of the purely unsupervised fuzzy kernel clustering method [16]. The flowchart of the semi-supervised fuzzy kernel clustering method is shown in Fig. 3.17.
The advantage of the semi-supervised fuzzy kernel clustering algorithm is that it can use the samples with partially known labels as initial clustering centers, overcoming the sensitivity of the fuzzy C-means algorithm to the selection of initial clustering centers. The fuzzy c-partition of the known labeled samples does not change during the iterative process, acting as a constraint that steers the clustering toward the known classes. However, the method's precondition is that several labeled samples are available for each cluster, which is detrimental to practical applications. The semi-supervised fuzzy kernel clustering method thus implements partially supervised learning within an unsupervised method, making its performance superior to unsupervised clustering methods.

Fig. 3.17 The process of the semi-supervised fuzzy kernel clustering method



3.3.3 Semi-supervised Hypersphere-Based Fuzzy Kernel Clustering Method

The semi-supervised fuzzy kernel clustering method above requires labeled samples for each cluster to guarantee its effectiveness. In early fault detection, however, many early fault samples are difficult or impossible to obtain in advance, so the application of semi-supervised fuzzy kernel clustering is severely limited by the lack of fault samples for some classes. To find the fault samples, it is first necessary to determine which samples are normal and use them as the basis for judging abnormal samples; the strength of clustering-based outlier detection is precisely that it separates abnormal data from normal data during the clustering process. A semi-supervised hypersphere-based fuzzy kernel clustering method is therefore proposed to solve this practical engineering problem. The method achieves semi-supervised fuzzy kernel clustering with only a small number of known normal samples and no known fault samples.
The semi-supervised hypersphere-based fuzzy kernel clustering method follows essentially the same idea as the semi-supervised fuzzy kernel clustering method described above: the initial clustering centers are determined by the known labeled samples, and the iterative process updates only the fuzzy memberships of the unknown labeled samples. The difference is that, since there are no labeled fault samples, another way is needed to find the initial clustering centers for the fault clusters. The method is divided into two steps:
(1) Find the center of the normal cluster from the labeled normal samples and identify the majority of normal samples among the unknown labeled samples.
(2) Treat the samples that cannot be judged as normal as potential fault samples, and use them to calculate the initial clustering centers for the fault clusters. This solves the problem of being unable to determine the fault clustering centers due to the lack of labeled fault samples.

3.3.3.1 Outlier Detection Based on Minimum Closed Hyperspheres

(1) Minimum closed hypersphere method
The minimal closed hypersphere algorithm belongs to supervised learning and is used to identify abnormal data that do not appear to be generated from the training distribution [17]. It uses the training set to learn the distribution of normal data and then filters future test data based on the resulting pattern function.


Fig. 3.18 A hypersphere containing X with a minimum radius

Assume a training set X = (x1, ..., xn) is given. Let its mapping in the associated feature space F be Φ(x), and let the associated kernel K satisfy K(x, y) = Φ(x)ᵀΦ(y). We seek the center v of the smallest hypersphere containing X, such that the radius r of the sphere is minimized:

$$v^* = \arg\min_{v}\ \max_{1 \le i \le n} \|\Phi(x_i) - v\| \tag{3.33}$$

This is solved as the optimization problem

$$\min_{v,r}\ r^2 \tag{3.34}$$

$$\text{s.t.}\quad \|\Phi(x_i) - v\|^2 = (\Phi(x_i) - v)^{\mathrm{T}}(\Phi(x_i) - v) \le r^2, \quad i = 1, \dots, n \tag{3.35}$$

The hypersphere (v, r) is a hypersphere containing X with minimum radius r, as shown in Fig. 3.18. The hypersphere algorithm has a problem: whenever some of the training data are not good enough, the obtained radius will be too large. Ideally, the minimum hypersphere should contain all the training data except for a few extreme training points.
(2) Soft minimum closed hypersphere method
The soft minimum hypersphere method is an improved minimum hypersphere method. It weighs two types of losses equally: the loss incurred by leaving out a small portion of the data and the loss due to reducing the radius. To implement this strategy, the slack variables ξi = ξi(v, r, xi) are introduced, defined as follows:

$$\xi_i = \left( \|v - \Phi(x_i)\|^2 - r^2 \right)_+ \tag{3.36}$$


Fig. 3.19 Soft minimum closed hypersphere containing most of the data

For points falling inside the hypersphere, ξi has a value of zero; for points outside the hypersphere, it measures the squared distance from the center in excess of r². Let ξ be the vector consisting of the elements ξi, i = 1, ..., n. The parameter C is a trade-off between the two objectives of minimizing the radius and controlling the slack variables. The mathematical description of the soft minimal hypersphere method is:

$$\min_{v,r}\ r^2 + C\|\xi\|_1 \tag{3.37}$$

$$\text{s.t.}\quad \|\Phi(x_i) - v\|^2 = (\Phi(x_i) - v)^{\mathrm{T}}(\Phi(x_i) - v) \le r^2 + \xi_i,\quad \xi_i \ge 0,\ i = 1, \dots, n \tag{3.38}$$

The solution to this problem can likewise be obtained via the dual Lagrangian function; the result is shown in Fig. 3.19. The soft minimum hypersphere method requires setting the trade-off parameter, and different parameter values yield different optimization results; reasonable values are difficult to determine and often embody human prior knowledge.
(3) Data pre-processing
To seek radius minimization while avoiding a human-chosen value of the parameter C, a soft hypersphere method based on a pre-optimized training set is proposed. The method optimizes the training set by eliminating some training data that are not good enough, preventing the method from obtaining a larger radius than actually needed. Apart from a small amount of extreme training data, the hypersphere still contains the other training data, which avoids the problem of manually choosing the parameter C.
Assume the training sample set is independently and identically distributed. In the feature space, the estimated center of mass E[Φ(x)] of the training sample set


is calculated, and the distance between each known sample and the center of mass is measured by the Euclidean distance:

$$d_i = \|\Phi(x_i) - E[\Phi(x)]\|, \quad i = 1, \dots, n \tag{3.39}$$

Calculate the mean E(d) of the distances between the training samples and the center of mass, the deviation of each training sample from the mean, σi = di − E(d), and the standard deviation σx:

$$\sigma_x = \sqrt{\frac{(d_1 - E(d))^2 + (d_2 - E(d))^2 + \cdots + (d_l - E(d))^2}{l}} \tag{3.40}$$

According to the normal distribution law in probability statistics, about 68.26% of samples fall in the interval (E(d) − σx, E(d) + σx), about 95.44% in (E(d) − 2σx, E(d) + 2σx), and about 99.73% in (E(d) − 3σx, E(d) + 3σx). Weighing the practical needs of the analysis, it is sufficient to ensure that 95% of the training samples are contained within the hypersphere, so the training samples whose deviation exceeds two standard deviations (i.e., σi ≥ 2σx) are excluded. The advantages of this method are that it avoids an excessive hypersphere radius caused by training data that are not good enough, it does not require determining the parameter C of the soft minimum hypersphere algorithm, and it allows different confidence probabilities to be set according to actual needs.
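This pre-optimization step can be carried out entirely through the kernel matrix, since the squared distance to the feature-space centroid expands as d_i² = K_ii − (2/n)Σ_j K_ij + (1/n²)Σ_{j,k} K_jk. The sketch below is a generic rendering of the 2σ (≈95%) screening rule, not the book's code.

```python
# Sketch of the 2-sigma training-set screening in the kernel feature space.
import numpy as np

def centroid_distances(K: np.ndarray) -> np.ndarray:
    """d_i = ||Phi(x_i) - centroid|| computed via the kernel trick."""
    d2 = np.diag(K) - 2.0 * K.mean(axis=1) + K.mean()
    return np.sqrt(np.clip(d2, 0.0, None))

def screen_training_set(K: np.ndarray, n_sigma: float = 2.0) -> np.ndarray:
    """Indices of samples kept: deviation from the mean distance stays below
    n_sigma standard deviations (n_sigma = 2 keeps roughly 95%)."""
    d = centroid_distances(K)
    return np.where(d - d.mean() < n_sigma * d.std())[0]
```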

3.3.3.2 Semi-supervised Fuzzy Kernel Clustering Method Based on Minimum Closed Hyperspheres

After the analysis by the soft minimum hypersphere algorithm, let the number of samples determined to be normal be n_l, the number of potential fault samples not included in the hypersphere be n_u, and the total number of samples be n = n_l + n_u. Semi-supervised fuzzy kernel clustering then extracts the boundary normal samples from the set of potential faults and clusters the possible fault samples together. The hypersphere-based semi-supervised fuzzy kernel clustering method is as follows.
Set c as the number of clusters, v_i (i = 1, 2, ..., c) as the center of the i-th cluster, and u_ik (i = 1, 2, ..., c; k = 1, 2, ..., n) as the membership of the k-th sample in the i-th cluster. The following restrictions should be satisfied:

$$0 \le u_{ik} \le 1, \qquad 0 \le \sum_{k=1}^{n} u_{ik} \le n, \qquad U = \Big\{ \underbrace{U^l = \{u^l_{ik}\}}_{\text{labeled normal data}},\ \underbrace{U^u = \{u^u_{ik}\}}_{\text{unlabeled fault data}} \Big\} \tag{3.41}$$


Since n_l samples have already been determined to be normal by the hypersphere algorithm, the values u^l_ik can be fixed. The cluster centers are expressed as

$$v_i = \sum_{k=1}^{n} \beta_{ik} \Phi(x_k), \quad i = 1, 2, \dots, c \tag{3.42}$$

Then the objective function of the fuzzy kernel clustering algorithm in the feature space is:

$$J_m(U, v) = J_m(U, \beta_1, \beta_2, \dots, \beta_c) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left\| \Phi(x_k) - \sum_{l=1}^{n} \beta_{il} \Phi(x_l) \right\|^2 \tag{3.43}$$

where β_i = (β_i1, β_i2, ..., β_in)ᵀ, i = 1, 2, ..., c, and m > 1 is a constant. The squared norm is expanded as:

$$\left\| \Phi(x_k) - \sum_{l=1}^{n} \beta_{il} \Phi(x_l) \right\|^2 = \Phi(x_k)^{\mathrm{T}} \Phi(x_k) - 2 \sum_{l=1}^{n} \beta_{il} \Phi(x_k)^{\mathrm{T}} \Phi(x_l) + \sum_{l=1}^{n} \sum_{j=1}^{n} \beta_{il} \beta_{ij} \Phi(x_l)^{\mathrm{T}} \Phi(x_j) \tag{3.44}$$

Substituting K(x, y) into Eqs. (3.43) and (3.44) gives:

$$J_m(U, v) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \left( K_{kk} - 2 \beta_i^{\mathrm{T}} K_k + \beta_i^{\mathrm{T}} K \beta_i \right) \tag{3.45}$$

$$\text{s.t.}\quad \sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \dots, n \tag{3.46}$$

where K_ij = K(x_i, x_j), i, j = 1, 2, ..., n; K_k = (K_k1, K_k2, ..., K_kn)ᵀ, k = 1, 2, ..., n; and K = (K_1, K_2, ..., K_n). Optimizing Eq. (3.45) under the constraint of Eq. (3.46) yields:

$$u_{ik}^{u} = \frac{\left( 1/(1 - K(x_k^{u}, v_i)) \right)^{1/(m-1)}}{\sum_{j=1}^{c} \left( 1/(1 - K(x_k^{u}, v_j)) \right)^{1/(m-1)}}, \quad i = 1, 2, \dots, c,\ k = 1, 2, \dots, n_u \tag{3.47}$$

$$\beta_i = \frac{\sum_{k=1}^{n_l} (u_{ik}^{l})^{m} K^{-1} K_k + \sum_{k=1}^{n_u} (u_{ik}^{u})^{m} K^{-1} K_k}{\sum_{k=1}^{n_l} (u_{ik}^{l})^{m} + \sum_{k=1}^{n_u} (u_{ik}^{u})^{m}}, \quad i = 1, 2, \dots, c \tag{3.48}$$


Most of the normal samples among the unlabeled samples can be found by the soft minimum hypersphere method. To reduce the misclassification rate, the boundary normal samples are extracted from the samples not included in the hypersphere, and the initial clustering centers of the fault clusters are calculated from this sample information. The alternating iterative algorithm for hypersphere-based semi-supervised fuzzy kernel clustering is as follows (a compact sketch of the updates is given after this list):
(1) Use the modified soft hypersphere algorithm to find the mostly normal and the potentially faulty samples, based on the samples known to be normal.
(2) Calculate the number of clusters c and the initial cluster centers from the obtained normal samples and potential fault samples.
(3) Initialize each coefficient vector, and calculate the kernel matrix K and its inverse K⁻¹ for the unlabeled dataset.
(4) Repeat the following operations until the membership value of each sample is stable:
(a) Update the memberships u^u_ik of the sample data not included in the hypersphere from the current cluster centers according to Eq. (3.47);
(b) Update the fault cluster centers v_i from the memberships u^u_ik according to Eqs. (3.42) and (3.48) (Fig. 3.20).
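Below is a compact, hedged sketch of steps (3)–(4), assuming an RBF kernel (so K(x, x) = 1) and a random initialization for the unlabeled memberships; names, shapes, and the initialization are illustrative, not the book's implementation. Note that when K is invertible, K⁻¹K_k in Eq. (3.48) is the k-th unit vector, so the center coefficients simplify to β_ik = u_ik^m / Σ_k u_ik^m.

```python
# Hedged sketch of the alternating updates in Eqs. (3.47)-(3.48).
import numpy as np

def rbf_kernel(X, sigma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def semisup_fuzzy_kernel_clustering(X, U_l, c=2, m=2.0, sigma=5.0,
                                    n_iter=100, tol=1e-6, seed=0):
    """X: all samples, labeled normals first; U_l: fixed memberships (c, n_l),
    e.g. U_l = np.vstack([np.ones(n_l), np.zeros(n_l)]) pins them to cluster 0."""
    n, n_l = len(X), U_l.shape[1]
    K = rbf_kernel(X, sigma)
    rng = np.random.default_rng(seed)
    U_u = rng.dirichlet(np.ones(c), size=n - n_l).T       # (c, n_u), cols sum to 1
    for _ in range(n_iter):
        W = np.hstack([U_l, U_u]) ** m                    # labeled part never changes
        # Eq. (3.48): K^{-1} K_k is the k-th unit vector, hence this simplification.
        beta = W / W.sum(axis=1, keepdims=True)           # center coefficients (c, n)
        Kxv = beta @ K                                    # K(x_k, v_i) = beta_i^T K_k
        dist = np.clip(1.0 - Kxv[:, n_l:], 1e-12, None)   # feature-space distance term
        U_new = (1.0 / dist) ** (1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)         # Eq. (3.47)
        converged = np.abs(U_new - U_u).max() < tol
        U_u = U_new
        if converged:
            break
    return np.hstack([U_l, U_u]), beta
```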

3.3.4 Transmission Early Fault Detection

See Fig. 3.21.

3.3.4.1 Transmission Bearing Outer Ring Spalling Fault Detection

1. Data acquisition
The cylindrical rolling bearing of the output shaft of the Dongfeng SG135-2 transmission was used as the experimental object. The experiment was designed to run the transmission in two states, normal and outer ring spalling, where the outer ring spall was machined by grinding a pit; Fig. 3.22 shows the faulty bearing components. The transmission is set to 2nd gear, with the tooth ratio of the normally meshing gears at the input being 38/26 and that of the 2nd-gear meshing pair being 41/20.
(1) Transmission operating conditions
Speed: 2400 r/min (input shaft), 1642 r/min (intermediate shaft), 801 r/min (output shaft).
Output torque: 313.7 N·m. Output power: 25.52 kW.
(2) Sampling parameter setting
Acquisition signal: vibration acceleration, vibration velocity.
Test direction: horizontal radial (x), vertical radial (y), and axial (z).


Fig. 3.20 The process of semi-supervised fuzzy kernel clustering based on hyperspheres

Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-aliasing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
(3) Transmission characteristic frequencies
Rotational frequency: 40 Hz (input shaft), 27.4 Hz (intermediate shaft), 13.4 Hz (output shaft).
2nd gear meshing frequency: 547.4 Hz.
Normally meshing gear meshing frequency: 1040 Hz.


Fig. 3.21 Technical route for outlier condition detection of transmissions

Fig. 3.22 Bearing outer ring spalling

Output shaft rolling bearing parameters: model NUP311EN; see Table 3.4.
Output shaft rolling bearing characteristic frequencies: as shown in Table 3.4.

Table 3.4 Output shaft rolling bearing parameters and characteristic frequency

Pitch diameter D (mm) | Rolling element diameter d0 (mm) | Rolling element number m | Contact angle α (°) | Inner ring passing frequency fi (Hz) | Outer ring passing frequency fo (Hz) | Rolling element passing frequency fg (Hz) | Cage passing frequency fb (Hz)
85 | 18 | 13 | 0 | 179.6 | 116.8 | 51.4 | 10.9

2. Normal operating conditions and bearing outer ring spalling signal characteristics analysis
The vibration acceleration signals were collected in the X direction at two measurement points under the two conditions of normal and bearing outer ring spalling. The


Fig. 3.23 Time domain waveform of normal signal at two measurement points

Fig. 3.24 Time domain waveform of bearing outer ring spalling signal at two measurement points

sampling frequency is set to 3200 Hz and the low-pass filtering limit to 4000 Hz; the generated time domain waveforms are shown in Figs. 3.23 and 3.24. Demodulation analysis shows that, for the normal signal, the rolling elements and the bearing seat make periodic grinding contact; therefore, in the demodulation results at the first-order meshing frequency and the resonance band, the rolling element passing frequency appears as the main spectral line. In addition, due to periodic impacts between the gears, the rotational frequencies of the input and intermediate shafts can still appear in the demodulation spectrum. Demodulation analysis alone is not enough to reveal the fault characteristics of the bearing outer ring spalling, because this fault is relatively weak: the modulation energy of the bearing signal is small compared with the transmission energy, so the modulation phenomenon is not obvious. Some modulation is also present under normal operating conditions.
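Demodulation (envelope) analysis of the kind described here is commonly implemented with the Hilbert transform. The sketch below is a generic version under assumed parameters (the resonance band in particular is a placeholder), not the DASC system's routine.

```python
# Generic envelope-demodulation sketch: band-pass around a resonance,
# Hilbert envelope, then FFT of the envelope to expose passing frequencies.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def envelope_spectrum(x, fs, band=(2000.0, 8000.0)):
    """Band-pass `x` in `band` (Hz), take the Hilbert envelope, and return
    the single-sided amplitude spectrum of the envelope."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    env = np.abs(hilbert(filtfilt(b, a, x)))
    env = env - env.mean()                          # drop the DC component
    spec = np.abs(np.fft.rfft(env)) * 2.0 / len(env)
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec

# A line near 116.8 Hz (f_o) in this spectrum would point to outer ring spalling.
```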

3.3.4.2 Semi-supervised Fuzzy Kernel Clustering Analysis Based on Hyperspheres

Number of samples: 210 normal samples and 20 bearing outer ring spalling fault samples; sample data length 4 × 1024.
Sample raw feature vector: mean square value, mean value, margin index, bearing outer ring passing frequency amplitude, and the sum of the amplitudes in the modulation frequency band.
Known label samples: 26 normal samples.
Unknown label samples: the remaining 184 normal samples and 20 fault samples, 204 in total.


Fig. 3.25 Analysis results of bearing outer ring spalling

The standard deviation of each sample is calculated in the feature space based on the Euclidean distance. One sample that does not meet the 95% confidence probability requirement is eliminated, and the optimized 25 normal samples are used as the set of samples with known labels. With the parameter m = 2 and the number of cluster centers c = 2, the hypersphere-based semi-supervised fuzzy kernel clustering method is applied to the data. Figure 3.25 shows the results of the analysis as σ varies from 0.3 to 12.
As shown in Fig. 3.25, the number of detected potential fault samples and the size of the fault clusters obtained by fuzzy kernel clustering analysis decrease as σ increases. When σ changes from 4.2 to 6.0, the fuzzy kernel clustering method can separate some normal samples from the set of potential fault samples, achieving data purification and improving the fault detection accuracy. When σ changes from 0.3 to 5.2, the fault clusters contain normal samples, and their number decreases as σ increases. When σ changes from 4.2 to 11.2, the number of detected fault samples varies smoothly; in particular, in the interval from 5.4 to 11.2 there are no normal samples in the fault clusters and the fault detection accuracy is 100%. When σ is greater than 11.2, the number of fault samples detected in the fault clusters decreases with increasing σ until no fault samples are detected, and the number of fault samples misclassified as normal increases accordingly.
When σ changes from 4.2 to 11.2, the normal and fault samples in this interval are well separated in the high-dimensional space, and the number of detected fault samples does not decrease sharply as σ increases. This interval is therefore selected as the optional range of the parameter σ for the semi-supervised fuzzy kernel clustering method, and its average accuracy reaches 95.5%. With the same parameter settings, the clustering results of the fuzzy kernel clustering method without the guidance of known labeled samples are shown in Table 3.5, and the corresponding clustering results of the hypersphere-based semi-supervised fuzzy kernel clustering method are shown in Table 3.6.
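The interval-selection rule just described (prefer σ values where the detected-fault count varies smoothly) can be written as a small heuristic. The tolerance below and the made-up counts are assumptions for illustration, not values from the experiments.

```python
# Heuristic sketch: find the widest sigma interval over which the number of
# detected fault samples stays (nearly) constant, cf. Fig. 3.25.
import numpy as np

def stable_interval(sigmas, fault_counts, tol=1):
    """Longest run of consecutive sigmas whose fault counts differ by <= tol."""
    best, start = (0, 0), 0
    for i in range(1, len(fault_counts)):
        if abs(fault_counts[i] - fault_counts[i - 1]) > tol:
            start = i
        if sigmas[i] - sigmas[start] > sigmas[best[1]] - sigmas[best[0]]:
            best = (start, i)
    return sigmas[best[0]], sigmas[best[1]]

# Demo with fabricated counts: the smooth stretch of the curve is selected.
sig = np.arange(0.5, 12.0, 0.5)
cnt = [60, 45, 33, 28, 25, 23, 22, 21] + [20] * 12 + [15, 9, 3]
print(stable_interval(sig, cnt))
```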


Table 3.5 The clustering results of the fuzzy kernel clustering method

σ                 | 4   | 5    | 6    | 7    | 8    | 9    | 10  | 11
Normal data       | 118 | 117  | 116  | 112  | 115  | 117  | 111 | 110
Fault data        | 111 | 112  | 113  | 117  | 114  | 112  | 118 | 119
Accuracy rate (%) | 18  | 17.9 | 17.7 | 17.1 | 17.5 | 17.9 | 17  | 16.8

Table 3.6 Clustering results of the semi-supervised fuzzy kernel clustering method

σ                 | 4   | 5    | 6   | 7   | 8   | 9   | 10  | 11
Normal data       | 206 | 207  | 208 | 210 | 218 | 218 | 218 | 218
Fault data        | 29  | 21   | 20  | 20  | 20  | 20  | 20  | 20
Accuracy rate (%) | 69  | 95.2 | 100 | 100 | 100 | 100 | 100 | 100

Since the bearing outer ring spalling fault is relatively weak, the difference between the faulty and normal samples is small. The fuzzy kernel clustering method has difficulty clustering the fault samples effectively; it simply partitions the sample set into two roughly equal clusters.

3.3.4.3 Transmission Gear Pitting Fault Detection

The hypersphere-based semi-supervised fuzzy kernel clustering algorithm was validated using the gear pitting data of Sect. 3.2.
Number of samples: 220 normal samples and 22 gear pitting fault samples; sample data length 4 × 1024.
Sample raw feature vector: mean value, variance, skewness, peak value, and the sum of the amplitudes in the modulation frequency band.
Known label samples: 30 normal samples.
Unknown label samples: the remaining 190 normal samples and 22 fault samples, 212 in total.
The standard deviation of each sample is calculated in the feature space based on the Euclidean distance. Two samples that do not meet the 95% confidence probability requirement are eliminated, and the optimized 28 normal samples are used as the set of samples with known labels. With the parameter m = 2 and the number of cluster centers c = 2, the hypersphere-based semi-supervised fuzzy kernel clustering method is applied to the data. Figure 3.26 shows the results of the analysis as σ varies from 0.3 to 10.
As shown in Fig. 3.26, the number of detected potential fault samples and the size of the fault clusters obtained by fuzzy kernel clustering analysis decrease as σ increases. When σ changes from 0.7 to 6.0, the fault clusters contain all the faulty samples. When


Fig. 3.26 Analysis results of gear pitting and spalling

σ changes from 1.5 to 5.5, the fuzzy kernel clustering method can separate some normal samples from the set of potential fault samples, achieving data purification and improving the fault detection accuracy. When σ changes from 0.3 to 3.7, the fault clusters contain normal samples, and their number decreases as σ increases. In particular, in the interval from 3.7 to 6.0 there are no normal samples in the fault clusters and the fault detection accuracy is 100%. When σ is greater than 6, the number of fault samples detected in the fault clusters decreases with increasing σ until no fault samples remain, and the number of fault samples misclassified as normal increases accordingly.
When σ changes from 2 to 6.7, the normal and fault samples in this interval are well separated in the high-dimensional space, and the number of detected fault samples does not decrease sharply as σ increases. This interval is therefore selected as the optional range of the parameter σ for the semi-supervised fuzzy kernel clustering method, and its average accuracy reaches 88.4%. With the same parameter settings, the clustering results of the fuzzy kernel clustering method without the guidance of known labeled samples are shown in Table 3.7, and the corresponding results of the semi-supervised fuzzy kernel clustering method are shown in Table 3.8. Since slight pitting is a relatively weak fault, the difference between the faulty and normal samples is small; the fuzzy kernel clustering method has difficulty clustering the fault samples effectively and simply partitions the sample set into two roughly equal clusters.

Table 3.7 The clustering results of the fuzzy kernel clustering method

σ                 | 2    | 2.5  | 3   | 3.5 | 4    | 4.5  | 5   | 5.5
Normal data       | 122  | 122  | 124 | 124 | 125  | 125  | 124 | 128
Fault data        | 118  | 118  | 116 | 116 | 115  | 115  | 116 | 112
Accuracy rate (%) | 18.6 | 18.6 | 19  | 19  | 19.1 | 19.1 | 19  | 20


Table 3.8 Clustering results of the semi-supervised fuzzy kernel clustering method

σ                 | 2    | 2.5  | 3    | 3.5  | 4   | 4.5 | 5   | 5.5
Normal data       | 206  | 207  | 208  | 210  | 218 | 218 | 218 | 218
Fault data        | 34   | 33   | 32   | 30   | 22  | 22  | 22  | 22
Accuracy rate (%) | 64.7 | 66.7 | 68.8 | 73.3 | 100 | 100 | 100 | 100


3.3.4.4 Transmission Gear Spalling Fault Detection

1. Data acquisition
The gears of the Dongfeng SG135-2 transmission were used as the experimental object. The experiment was designed to run the transmission in three states: normal, minor spalling of gears, and severe spalling + tooth surface deformation, where the spalled areas were machined by grinding pits. The transmission is set to 5th gear, with the tooth ratio of the normally meshing gears at the input being 38/26 and that of the 5th-gear meshing pair being 22/42. Figure 3.27 shows the faulty gear.
(1) Transmission operating conditions
Speed: 1200 r/min (input shaft), 821 r/min (intermediate shaft), 1567 r/min (output shaft).
Output torque: 6.2 N·m. Output power: 1.01 kW.
(2) Sampling parameter setting
Acquisition signal: vibration acceleration, vibration velocity.

Fig. 3.27 Fault gear


Test direction: horizontal radial (x), vertical radial (y), and axial (z).
Sampling frequency: 40,000 Hz (acceleration), 5000 Hz (velocity).
Anti-aliasing filter: 20,000 Hz (acceleration), 3000 Hz (velocity).
(3) Transmission characteristic frequencies
Rotational frequency: 20 Hz (input shaft), 13.7 Hz (intermediate shaft), 26 Hz (output shaft).
5th gear meshing frequency: 575.4 Hz; constant meshing gear meshing frequency: 520 Hz.
2. Normal signal and slight gear spalling signal analysis
The vibration acceleration signals were collected at measurement point 1. The sampling frequency is set to 1000 Hz and the low-pass filtering upper limit to 1250 Hz; the generated time domain waveforms are shown in Figs. 3.28, 3.29 and 3.30. The normal and minor spalling waveforms are roughly the same, and the difference between them is not obvious. Compared with the normal signal, the severe spalling + tooth surface deformation signal contains more chaotic impact components. The signals all show clear waveform characteristics typical of amplitude-modulated signals, and their amplitudes are large, indicating a significant increase in vibration energy. In this section, only the minor spalling fault detection process is analyzed.

Fig. 3.28 Time domain waveform of normal signal at measurement point 1

Fig. 3.29 Time domain waveform of minor spalling of gears signal at measurement point 1


Fig. 3.30 Time domain waveform of severe spalling and tooth surface deformation signal at measurement point 1

Let the analysis frequency be 667 Hz and the low-pass filtering upper limit be 833 Hz. Twenty segments of data (1024 points per segment) are taken, a Hanning window is applied with averaging, and the panoramic spectra are computed as shown in Figs. 3.31 and 3.32. In both conditions the highest spectral line is the rotational frequency of the output shaft, and its amplitude under slight spalling is higher than under the normal condition. The normal signal shows the third- and fourth-order harmonics of the rotational frequency, while the minor spalling signal shows the second-, third-, and fourth-order harmonics, with amplitudes greater than those of the normal signal. Half the 5th-gear meshing frequency and half the constant-mesh gear meshing frequency, at 288 and 260 Hz, are observed in both operating conditions.

Fig. 3.31 Panoramic spectrum of normal signals

Fig. 3.32 Panoramic spectrum of minor gear spalling signal


In addition, the minor spalling signal has a spectral line at the 5th-gear meshing frequency of 575.4 Hz with a small number of modulation sidebands, while the normal signal shows no obvious line there. The spectral characteristics of minor spalling faults are not as obvious as those of serious faults; it should therefore be suspected that an early fault has started to appear, and further analysis is needed.
The refinement spectra are shown in Figs. 3.33 and 3.34, with 574 Hz as the frequency center and a refinement factor of 10. The refinement spectrum analysis makes it clear that the normal signal also has modulation sidebands of small amplitude at the 5th-gear meshing frequency; their amplitude is simply too small to be reflected in the panoramic spectrum. A component at 1/2 the rotational frequency appears, under normal operating conditions as well, because of a slight misalignment of the shaft introduced during installation. The amplitude at the meshing frequency under slight spalling is larger than that of the normal signal: the 1st-order sideband of the normal signal is comparable in size to the 2nd- and 3rd-order sidebands of the slightly spalled signal, although the absolute values are still small. Beyond that, both the normal and minor fault signals demodulate the first- and second-order spectral lines of the rotational frequency, and even the third-order lines; the amplitudes of the demodulated lines under minor spalling are essentially twice those under normal operating conditions, but their absolute values remain small. The spectrum analysis shows, to a certain extent, that the slight spalling fault has begun to deviate from the normal signal, though the degree of fault is small.
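The refinement (zoom) spectrum used for Figs. 3.33 and 3.34 can be approximated by complex demodulation. The sketch below is a generic, hedged implementation; the function name, the SciPy toolchain, and the defaults are assumptions, not the DASC system's algorithm.

```python
# Hedged zoom-FFT sketch (complex demodulation) for a refinement spectrum
# centered at 574 Hz with refinement factor 10.
import numpy as np
from scipy.signal import decimate

def zoom_spectrum(x, fs, f_center=574.0, factor=10):
    """Shift f_center to 0 Hz, anti-alias filter and decimate by `factor`,
    then FFT; frequency resolution improves by `factor` for the same length."""
    t = np.arange(len(x)) / fs
    baseband = x * np.exp(-2j * np.pi * f_center * t)   # complex demodulation
    z = decimate(baseband, factor, zero_phase=False)    # causal filter, complex-safe
    spec = np.abs(np.fft.fftshift(np.fft.fft(z))) / len(z)
    freqs = np.fft.fftshift(np.fft.fftfreq(len(z), d=factor / fs)) + f_center
    return freqs, spec
```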

Fig. 3.33 Refinement spectrum of the normal signal at 574 Hz


Fig. 3.34 Refinement spectrum of slight gear spalling signal at 574 Hz

the spectral characteristics of the severe spalling + tooth surface deformation signal, the fault characteristics here are not obvious, and it is difficult to determine from frequency analysis alone whether there is a failure.

3. Semi-supervised fuzzy kernel clustering analysis based on hyperspheres

Number of samples: 210 normal samples and 20 samples of bearing outer ring spalling failure, with a sample data length of 4 × 1024. Raw sample vector: mean value, variance, skewness, peak value, and the sum of the amplitudes in the corresponding modulation frequency band. Known-label samples: 26 normal samples. Unknown-label samples: the remaining 184 normal samples and the 20 faulty samples, 204 in total. The standard deviation of each sample is calculated in the feature space based on the Euclidean distance; one sample that does not meet the requirement at a 95% confidence probability is eliminated, and the remaining 25 normal samples are used as the set of samples with known labels. With the parameters m = 2 and the number of cluster centers c = 2, the hypersphere-based semi-supervised fuzzy kernel clustering method is applied. Figure 3.35 shows the results of the analysis as σ varies from 0.3 to 18. The trend of the curve is basically consistent with the behavior observed on the simulation data. In the interval of σ from 1.8 to 16.5, the number of detected faults changes relatively smoothly, indicating that this interval is a suitable range for selecting σ; within this stable interval the accuracy of detecting fault clusters reaches 100%. This also confirms that when the clusters are well separated, the stable interval is long. With the same parameters but without the guidance of the known-label samples, the clustering results of multiple runs of the plain fuzzy kernel clustering method are listed in


Fig. 3.35 Analysis results of bearing slight ring spalling

Table 3.9, and the corresponding clustering results of the hypersphere-based semi-supervised fuzzy kernel clustering method are listed in Table 3.10. The results show that the plain fuzzy kernel clustering method produces a correct fault clustering only at some σ values, and these values show no regularity; it is precisely at such σ values that the fault samples happen to be well separated from the normal samples. More importantly, when the fuzzy kernel clustering algorithm randomly selects its initial cluster centers it sometimes obtains good ones, so that fault clusters can be distinguished from normal clusters; when the initial cluster centers are chosen badly, it is still difficult for fuzzy kernel clustering to cluster the fault samples effectively, and it simply divides the sample set into two clusters of roughly equal size. This indicates that the performance of the fuzzy kernel clustering method is strongly affected by the initialization of the cluster centers. Although the spectral characteristics of slight spalling are not very obvious, the spalling fault was, through lack of experimental experience, machined somewhat too deep, so the fault degree is somewhat large; as a result, the ability of the semi-supervised fuzzy kernel clustering method to further purify potential fault samples is not reflected here.

Table 3.9 Analysis results of the fuzzy kernel clustering method

σ                    2      4      6      8      10     12     14     16
Normal data          113    209    114    116    119    123    140    146
Fault data           116    20     115    113    110    106    89     83
Accuracy rate (%)    17.2   100    17.4   17.7   18.2   18.9   22.5   24.1


Table 3.10 Clustering results of the semi-supervised fuzzy kernel clustering method

σ                    2      4      6      8      10     12     14     16
Normal data          209    209    209    209    209    209    209    209
Fault data           20     20     20     20     20     20     20     20
Accuracy rate (%)    100    100    100    100    100    100    100    100

3.4 Outlier Detection Based on Semi-supervised Fuzzy Kernel Clustering

The Self-Organizing Map (SOM) neural network is an unsupervised network that, after training, can classify input patterns automatically [18]. The traditional SOM approach visualizes the training results with the U matrix. However, the traditional SOM fault diagnosis method does not visualize well, because the mesh on the output layer is often distorted. Although the GNSOM and DPSOM visualization methods can solve this problem, large datasets often lead to long training times. If the dimensionality of the data is reduced before classification, the training time can be greatly shortened, the accuracy improved, and storage space saved. Therefore, before applying neural networks to diagnose faults, the dimensionality of the training data should be reduced as much as possible to simplify the network structure. Linear discriminant analysis (LDA) is an effective dimensionality reduction method that transforms the original data space into a low-dimensional feature space to produce an efficient pattern representation; LDA seeks the directions that best separate the projected data in the least-mean-square sense. The semi-supervised SOM studied in this section modifies the unsupervised SOM algorithm and combines it with feature selection to improve the learning rate and the learning effect.
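As a concrete illustration of this preprocessing step, the sketch below reduces an 11-dimensional feature set to two discriminant components with scikit-learn's LDA before any SOM training; the feature matrix, labels, and dimensions are placeholders, not values from the experiments in this chapter.

```python
# A minimal sketch of LDA-based dimensionality reduction prior to SOM training.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(162, 11))   # e.g., 11 time-domain features per sample (placeholder data)
y = np.repeat([0, 1, 2], 54)     # three machine conditions

# With c classes, LDA gives at most c - 1 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)  # labeled samples define the projection
print(X_low.shape)               # (162, 2): the reduced features fed to the SOM
```

In the semi-supervised setting, the projection can be fitted on the labeled subset only and then applied to the unlabeled samples with `lda.transform`.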

3.4.1 Semi-supervised SOM Fault Diagnosis

3.4.1.1 Introduction to SOM

The SOM neural network [18] is a relatively complete self-organizing feature mapping scheme with good classification performance, proposed by Kohonen; it is therefore sometimes called the Kohonen feature mapping network. The basic idea is that the neurons of the competition layer compete for the opportunity to respond to the input pattern. Finally only one neuron wins the competition, and the connections of the neurons around the winning neuron are adjusted in a direction more favorable to its winning. The winning neuron represents the classification of the input pattern.


Fig. 3.36 SOM network structure

Fig. 3.37 Interaction patterns of neurons

The SOM network consists of an input layer and a competition layer (output layer), as shown in Fig. 3.36. The number of neurons in the input layer is n and the number in the competition layer is m. The SOM network is fully connected (i.e., each input node is connected to all output nodes). SOM networks can map input patterns of arbitrary dimension onto a one- or two-dimensional grid in the output layer while keeping their topology unchanged. SOM networks also have a probability-preserving property: by repeated learning of the input patterns, the distribution of the weight vectors can be made to converge to the probability distribution of the input patterns. The overall idea of the method is that, with the winning neuron as the center, excitatory lateral feedback is applied to nearby neurons and inhibitory lateral feedback to distant ones, so near neighbors stimulate each other and far neighbors inhibit each other. Here near neighbors are neurons within a radius of about 50 ~ 500 μm of the neuron emitting the signal, and far neighbors are those within a radius of about 200 μm ~ 2 mm; neurons farther than the distant neighbors show weak excitation, as shown in Fig. 3.37.

3.4.1.2 U Matrix Visualization Method

Many methods are available to visualize high-dimensional data through dimensionality-reducing mappings. Linear methods such as principal component analysis (PCA) are computationally cheap and efficient and work well for linear structures. Nonlinear mapping methods, such as Sammon's method [19] and curvilinear component analysis (CCA), can handle nonlinear structures in the data but


are computationally intensive. For the training results of SOM networks, the most widely applied visualization method is the U-matrix method, whose core is the calculation of the U matrix. The neurons of the competition layer in the SOM network are arranged as a matrix in a two-dimensional plane. For each neuron, the Euclidean distances between its weight vector and the weight vectors of all neighboring neurons are calculated, and the average or maximum of these distances (or some function of them) is taken as the "U-value" of the neuron. The U matrix is then the matrix whose elements are the "U-values" of all competition-layer neurons. When the U matrix is used to visualize the training results of an SOM network, the U-matrix values are used as a third coordinate of the neurons in conjunction with the topology of the competition layer: the competition layer is plotted in three dimensions, and the clustering results are read from the distribution of peaks and troughs, or the third coordinate is rendered in grayscale. The U matrix can thus also be described as a grayscale-map visualization.
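A minimal sketch of this computation, assuming the competition-layer weights are stored as a (rows, cols, dim) array; averaging over the 4-connected grid neighbours is one of the variants described above.

```python
import numpy as np

def u_matrix(weights: np.ndarray) -> np.ndarray:
    """Average Euclidean distance from each neuron's weight vector to its grid neighbours."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            u[r, c] = np.mean(dists)   # the neuron's "U-value"
    return u

# Example: the U matrix of a random 10 x 10 map with 4-dimensional weights,
# which could then be rendered as a grayscale image or a 3-D surface.
u = u_matrix(np.random.rand(10, 10, 4))
```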

3.4.1.3 Semi-supervised SOM

Traditional SOM neural networks can also be trained by supervised learning: the algorithm becomes a supervised SOM if all input patterns are labeled. Semi-supervised learning uses labeled and unlabeled samples together. In the semi-supervised SOM, therefore, the labels of some samples of the input patterns $A^k$ are removed, and labeled and unlabeled samples are input into the network together for training. The steps of the algorithm are as follows. Let the input patterns of the network be $A^k = (a_1^k, a_2^k, \ldots, a_N^k)$, $k = 1, 2, \ldots, p$, and the competition-layer neuron vector be $B_j = (b_1, b_2, \ldots, b_M)$, where $A_l^k$ denotes the labeled data, $A_o^k$ the unlabeled data, $l + o = N$, $M$ is the number of competition-layer neurons, and the network connection weights are $\{W_{ij}\}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M$.

(1) Set the connection weights of the network to random values over the interval [0, 1]. Determine the initial value $\eta(0)$ of the learning rate $\eta(t)$ ($0 < \eta(0) < 1$), the initial value $N_g(0)$ of the neighborhood function $N_g(t)$, and the total number of learning iterations $T$, at which training terminates.

(2) Select an input pattern from $A^k$ and normalize it; then normalize the network weight vectors:

$$A^k = A^k / \|A^k\|, \quad k = 1, 2, \ldots, p \tag{3.49}$$

$$W_j = W_j / \|W_j\|, \quad j = 1, 2, \ldots, M \tag{3.50}$$


(3) Input the selected pattern $A^k$ into the network and calculate the distance between each connection weight vector $W_j = (w_{j1}, w_{j2}, \ldots, w_{jN})$ and the input pattern $A^k$:

$$d_j = \left[ \sum_{i=1}^{N} \left( a_i^k - w_{ij} \right)^2 \right]^{1/2}, \quad j = 1, 2, \ldots, M, \; k = 1, 2, \ldots, p \tag{3.51}$$

(4) Compare the distances to determine the best matching unit $g$:

$$d_g = \min_j \left( d_j \right), \quad j = 1, 2, \ldots, M \tag{3.52}$$

(5) Adjust the connection weights of all competition-layer neurons within the neighborhood of the winning neuron according to:

$$w_{ij}(t+1) = \begin{cases} w_{ij}(t) + \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_o^k \\ w_{ij}(t) + \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is correctly classified} \\ w_{ij}(t) - \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is incorrectly classified} \end{cases} \tag{3.53}$$

where $j \in N_g(t)$, $i = 1, 2, \ldots, N$, $0 < \eta(t) < 1$, and $\eta(t)$ is the learning rate at time $t$.

(6) Input the next learning pattern into the network and return to step 3 until all patterns have been learned.

(7) Update the learning rate $\eta(t)$ and the neighborhood function $N_{gj}(t)$ according to $\eta(t) = \eta_0 (1 - t/T)$, where $\eta_0$ is the initial learning rate, $t$ the current iteration, and $T$ the total number of iterations.

(8) Let $t = t + 1$ and return to step 2; end the loop when $t = T$.

Figure 3.38 shows the classification results of the Iris dataset using the semi-supervised SOM and the semi-supervised LDA-SOM. Thirty of the 50 samples of each class are selected as labeled samples and the remaining 20 as unlabeled samples, and these samples are input into the network together for training. In Fig. 3.38, (a) is the U matrix of the semi-supervised SOM, (b) shows the labels of the semi-supervised SOM classification results, (c) is the U matrix of the semi-supervised LDA-SOM, and (d) shows the labels of the semi-supervised LDA-SOM classification results. In the figure, "Ο", "Δ" and "⛛" represent labeled samples of Setosa, Versicolor, and Virginica, while "•", "*" and "×" represent the corresponding unlabeled samples. As can be seen from the classification results, the semi-supervised SOM still separates the three classes of data relatively well, despite the fact


Fig. 3.38 Classification results of semi-supervised SOM on the Iris dataset

that 40% of the input samples carry no labels. It can also be seen that the semi-supervised SOM with LDA feature selection gives a more intuitive classification than the plain semi-supervised SOM.
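The sketch below condenses steps (1)–(8) into code. It is a minimal reading of the algorithm, not the authors' implementation: the grid size, the Gaussian neighbourhood, and the rule that a winner's class is taken from the last labeled sample it won are all assumptions; unlabeled samples are marked with the label −1.

```python
import numpy as np

def train_ss_som(X, y, grid=(8, 8), T=100, eta0=0.5, sigma0=3.0, seed=0):
    """Semi-supervised SOM per Eq. (3.53): y[k] = -1 marks an unlabeled sample."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows * cols, X.shape[1]))               # step (1): random weights in [0, 1]
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    neuron_label = np.full(rows * cols, -1)                 # assumed neuron-labelling rule
    for t in range(T):
        eta = eta0 * (1 - t / T)                            # step (7): learning-rate decay
        sigma = sigma0 * (1 - t / T) + 0.5
        for k in rng.permutation(len(X)):
            x = X[k] / (np.linalg.norm(X[k]) + 1e-12)       # Eq. (3.49): normalize the pattern
            Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)  # Eq. (3.50)
            g = int(np.argmin(np.linalg.norm(Wn - x, axis=1)))           # Eqs. (3.51)-(3.52)
            h = np.exp(-np.sum((pos - pos[g]) ** 2, axis=1) / (2 * sigma ** 2))
            sign = 1.0                                      # unlabeled: plain attraction
            if y[k] >= 0:                                   # labeled sample
                if neuron_label[g] not in (-1, y[k]):
                    sign = -1.0                             # winner misclassifies: repel, Eq. (3.53)
                neuron_label[g] = y[k]
            W += sign * eta * h[:, None] * (x - W)          # Eq. (3.53) update
    return W, neuron_label
```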


3.4.2 Semi-supervised GNSOM Fault Diagnosis

3.4.2.1 Introduction to GNSOM

To present the structure of data more naturally, Rubio and Gimenez et al. proposed a new SOM-based data visualization algorithm, GNSOM (Grouping Neuron SOM) [20]. Its algorithm is similar to that of the original SOM. In the original SOM network, the competition-layer neurons are represented by weight vectors, which are updated during training, while their positions in the output layer are fixed on a regular grid. In GNSOM, by contrast, the weight vectors are fixed and it is the positions of the neurons in the output layer that are updated. It should be added that the results obtained by this method only suggest the spatial topology of the input data; the resulting map therefore represents only one possible spatial distribution of the input vectors, and its coordinates have no specific physical meaning. The detailed GNSOM algorithm can be found in Ref. [20].

3.4.2.2 Semi-supervised GNSOM

Since GNSOM adds one position-adjustment step to the original SOM, the semi-supervised GNSOM algorithm is easily obtained by analogy with the semi-supervised SOM. Let the position of competition-layer neuron $B_j$ on the output layer be $p_{ij}(x, y)$, the spacing of neighboring neurons on the output layer (competition layer) be $M$, the number of neurons in the x-direction be $M_x$, and the number in the y-direction be $M_y$. The implementation steps are as follows.

(1) Initialize the neuron positions; the rest is as in step 1 of the semi-supervised SOM.
(2) As in step 2 of the semi-supervised SOM.
(3) As in step 3 of the semi-supervised SOM.
(4) As in step 4 of the semi-supervised SOM.
(5) Correct the connection weights according to Eq. (3.53), and adjust the positions of the competition-layer neurons according to the following Eq. (3.54) (see the position-update sketch at the end of this subsection):

$$p_{ij}(t+1) = \begin{cases} p_{ij}(t) + \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_o^k \\ p_{ij}(t) + \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is correctly classified} \\ p_{ij}(t) - \eta(t) N_{gj}(t) \left( a_i^k - w_{ij}(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is incorrectly classified} \end{cases} \tag{3.54}$$

(6) As in step 6 of the semi-supervised SOM.


Fig. 3.39 Classification results of semi-supervised GNSOM on the Iris dataset

(7) As in step 7 of the semi-supervised SOM.
(8) As in step 8 of the semi-supervised SOM.

Figure 3.39 shows the classification results of the Iris dataset using the semi-supervised GNSOM and the semi-supervised LDA-GNSOM. Thirty of the 50 samples of each class are selected as labeled samples and the remaining 20 as unlabeled samples, and these samples are input into the network together for training. Figure 3.39a shows the labels of the semi-supervised GNSOM classification results, and Fig. 3.39b shows the semi-supervised LDA-GNSOM classification results. In the figure, "Ο", "Δ" and "⛛" represent labeled samples of Setosa, Versicolor, and Virginica, while "•", "*" and "×" represent the corresponding unlabeled samples. As can be seen, the semi-supervised GNSOM still separates the three classes relatively well, despite the fact that 40% of the input samples carry no labels, and the semi-supervised GNSOM with LDA feature selection gives a more intuitive classification than the plain semi-supervised GNSOM. However, the x-axis range in Fig. 3.39a is [3.1, 3.45] while that in Fig. 3.39b is [4.1, 4.24]; this is because GNSOM is based on the Himberg contraction model and therefore suffers from over-contraction.
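Eq. (3.54) applies the same attract/repel sign rule to neuron positions instead of weights, with the correction driven by the input-space residual. The sketch below takes one common reading of that rule, moving each neighbour's output-layer position toward (or away from) the winner by an amount proportional to its input-space error; the function signature and the direction convention are assumptions, not the authors' implementation.

```python
import numpy as np

def gnsom_position_step(pos, W, x, g, eta, h, sign=1.0):
    """One GNSOM position update in the spirit of Eq. (3.54).

    pos: (M, 2) output-layer positions (updated); W: (M, d) fixed weights;
    x: input pattern; g: winner index; h: per-neuron neighbourhood values;
    sign: +1 for unlabeled/correctly classified, -1 otherwise.
    """
    # step size driven by the input-space residual ||x - w_j||
    step = eta * h[:, None] * np.linalg.norm(x - W, axis=1, keepdims=True)
    direction = pos[g] - pos                                 # contract toward the winner
    norm = np.linalg.norm(direction, axis=1, keepdims=True) + 1e-12
    return pos + sign * step * direction / norm
```

Repeated contraction toward winners with no counteracting force is exactly what produces the over-contraction noted above.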

3.4.3 Semi-supervised DPSOM Fault Diagnosis

3.4.3.1 Introduction to DPSOM

Shao and Huang proposed a new SOM-based data visualization algorithm, DPSOM (Distance Preserving SOM) [21], which can adaptively adjust the positions of


neurons according to the corresponding distance information, thus presenting the distance relations between data intuitively. In addition, the algorithm automatically avoids the problem of excessive neuron shrinkage, which greatly improves its controllability and the quality of the visualization. In the original SOM algorithm, the neurons acting as data representatives are fixed on a low-dimensional regular grid, and neighborhood learning eventually produces a topological ordering of the neurons on this grid. In the DPSOM algorithm, however, the neuron positions on the low-dimensional grid are no longer fixed but are adjusted according to the corresponding distances in the feature space and in the low-dimensional space. Compared with the original SOM, DPSOM adds only one position-adjustment step, and since this step does not change the learning process of the original SOM, DPSOM retains the robustness of the original algorithm, something the dynamic SOM algorithms [22] cannot match. Moreover, this additional step involves little extra computation and no longer uses the Himberg contraction model, so excessive neuron contraction is avoided without additional control parameters. As with GNSOM, the results obtained by this method only suggest the spatial topology of the input data, so the resulting map represents only one possible spatial distribution of the input vectors and its coordinates have no specific physical meaning. The detailed DPSOM algorithm can be found in Ref. [21].

3.4.3.2 Semi-supervised DPSOM

The semi-supervised DPSOM differs from the semi-supervised GNSOM only in the position-adjustment rule for the competition-layer neurons. By analogy with the semi-supervised GNSOM, the semi-supervised DPSOM is easily obtained; its position-adjustment rule is:

$$p_k(t+1) = \begin{cases} p_k(t) + \eta(t) \left( 1 - \dfrac{\delta_{vk}}{d_{vk}} \right) \left( p_v(t) - p_k(t) \right), & a_i^k \in A_o^k \\ p_k(t) + \eta(t) \left( 1 - \dfrac{\delta_{vk}}{d_{vk}} \right) \left( p_v(t) - p_k(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is correctly classified} \\ p_k(t) - \eta(t) \left( 1 - \dfrac{\delta_{vk}}{d_{vk}} \right) \left( p_v(t) - p_k(t) \right), & a_i^k \in A_l^k \text{ and } g \text{ is incorrectly classified} \end{cases} \tag{3.55}$$

The semi-supervised DPSOM algorithm is obtained simply by replacing Eq. (3.54) in the semi-supervised GNSOM algorithm with Eq. (3.55). Figure 3.40 shows the classification results of the Iris dataset using the semi-supervised DPSOM and the semi-supervised LDA-DPSOM. Thirty of the 50 samples of each class are selected as labeled samples and the remaining 20 as unlabeled


Fig. 3.40 Classification results of semi-supervised DPSOM on the Iris dataset

samples, and these samples are input into the network together for training. Figure 3.40a shows the semi-supervised DPSOM classification results and Fig. 3.40b the semi-supervised LDA-DPSOM classification results. In the figure, "Ο", "Δ" and "⛛" represent labeled samples of Setosa, Versicolor, and Virginica, while "•", "*" and "×" represent the corresponding unlabeled samples. As can be seen, the semi-supervised DPSOM still separates the three classes relatively well, despite the fact that 40% of the input samples carry no labels, and the semi-supervised DPSOM with LDA feature selection gives a more intuitive classification than the plain semi-supervised DPSOM. It can also be seen that the x-axis range in Fig. 3.40a is [0.5, 5.5] while that in Fig. 3.40b is [1.5, 5.5]: because DPSOM is no longer based on the Himberg shrinkage model, the excessive shrinkage problem of GNSOM is well avoided.
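A minimal sketch of the distance-preserving rule of Eq. (3.55): neighbour k moves along (p_v − p_k) scaled by (1 − δ_vk/d_vk), so the map distance d_vk is pulled toward the feature-space distance δ_vk, contracting pairs that are too far apart and expanding pairs that are too close. Variable names follow Eq. (3.55); the sign argument encodes the labeled/unlabeled cases, and the function form is an assumption.

```python
import numpy as np

def dpsom_position_step(pos, W, v, eta, sign=1.0):
    """pos: (M, 2) output-layer positions; W: (M, d) weight vectors; v: winner index."""
    delta = np.linalg.norm(W - W[v], axis=1)             # feature-space distances delta_vk
    d = np.linalg.norm(pos - pos[v], axis=1) + 1e-12     # map distances d_vk
    factor = 1.0 - delta / d                             # > 0 contracts, < 0 expands
    return pos + sign * eta * factor[:, None] * (pos[v] - pos)
```

Because the factor changes sign when d_vk falls below δ_vk, the update is self-limiting, which is why DPSOM needs no Himberg contraction model to prevent over-shrinkage.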

3.4.4 Example Analysis

3.4.4.1 Gear Fault

The test data were obtained from the gear failure test in the Laborelec laboratory. They were collected from a 41-tooth helical gear in three modes of operation, namely normal, slight spalling, and slight tooth-surface wear; the experimental setups are shown in Fig. 3.41. The gear parameters and test conditions are: modulus 5 mm, helix angle, center distance 200 mm, transmission ratio 37/41, input speed 670 r/min, and sampling frequency 10,240 Hz. The time domain and


frequency domain waveforms are shown in Fig. 3.42, where Fig. 3.42a–c are the time domain diagrams of the normal, slight spalling, and slight wear states, respectively, and Fig. 3.42d–f are the corresponding frequency domain diagrams. From Fig. 3.42b it can be seen that the amplitude of the spalling acceleration signal increases compared with the normal state in Fig. 3.42a, but there is no obvious difference between the two, and it is difficult to distinguish these two states in the time domain. Fig. 3.42c shows that when the gear wears uniformly, the signal exhibits a clear band of impacts and an increased amplitude, indicating a significant increase in vibration energy. All this indicates that the three states are difficult to distinguish with conventional signal processing methods, so intelligent diagnosis methods are needed. Eleven commonly used statistical feature parameters are selected to form the original gear-state feature set describing the gear failure modes: mean square value, kurtosis, mean value, variance, skewness, peak value, root mean square amplitude, peak index, waveform index, pulse index, and margin index. In this experiment, each fault group contains 54 samples (34 labeled and 20 unlabeled), for a total of 162 samples. The schematic block diagram of the semi-supervised SOM algorithm is shown in Fig. 3.43. In the figures, "Ο", "Δ" and "⛛" represent normal, spalling, and abrasion labeled data, while "•", "*" and "×" represent the corresponding unlabeled data. The GNSOM classification results are shown in Fig. 3.44; it is obvious from them that the semi-supervised LDA-GNSOM method based on feature selection is significantly better than the semi-supervised GNSOM, although, as mentioned before, GNSOM suffers from an excessive shrinkage problem. Figure 3.45 shows the classification results using the semi-supervised LDA-DPSOM proposed in this section, where Fig. 3.45a, b show the classification results

Fig. 3.41 Experimental pictures


Fig. 3.42 Time domain and frequency domain signals of faulty bearing

Fig. 3.43 Semi-supervised SOM schematic

of semi-supervised DPSOM and semi-supervised LDA-DPSOM, respectively. From this figure, it is obvious that the semi-supervised LDA-DPSOM method based on feature selection is significantly better than the semi-supervised DPSOM. Table 3.11 shows the average correctness, average elapsed time, average quantization error, and average topology error obtained from the simulations using semi-supervised GNSOM and semi-supervised DPSOM. From Table 3.11, it can be seen that the method with the highest correct rate is the semi-supervised LDA-DPSOM with 86.3%; the least time-consuming method is the semi-supervised LDA-GNSOM with 0.6430 s; the smallest average quantization error is the semi-supervised LDA-GNSOM and semi-supervised LDA-DPSOM


Fig. 3.44 Classification results of semi-supervised GNSOM on the Laborelec dataset

Fig. 3.45 Classification results of semi-supervised DPSOM on the Laborelec dataset

Table 3.11 Experimental results of semi-supervised SOM on the Laborelec dataset

                              Semi-supervised   Semi-supervised   Semi-supervised   Semi-supervised
                              GNSOM             LDA-GNSOM         DPSOM             LDA-DPSOM
Average correctness (%)       72.1              86.0              73.9              86.3
Average elapsed time (s)      1.3806            0.6430            1.0345            0.6439
Average quantization error    0.263             0.041             0.263             0.041
Average topology error        0.046             0.048             0.043             0.038


methods, both with 0.041; and the smallest average topology error is that of the semi-supervised LDA-DPSOM, 0.038. The table also shows that the self-organizing map with LDA feature selection not only speeds up the computation (by 53.42% for the semi-supervised LDA-GNSOM and 37.76% for the semi-supervised LDA-DPSOM) but also improves the correct rate (by 13.9% for the semi-supervised LDA-GNSOM and 12.4% for the semi-supervised LDA-DPSOM) and reduces the average quantization error (by 84.41% for both). In addition, the average topology error varies with the learning method but basically changes little.

3.4.4.2 Bearing Fault

The test takes a NU205M bearing from a bearing plant as the research object; the bearing failure simulation experiments are performed on a rotating machinery fault simulation test bench, with the test bearing operated in three states: normal, inner ring pitting, and inner ring cracking. The parameters and characteristic frequencies of the rolling bearing are listed in Table 3.12. Three measurement points are arranged, with the sensors mounted on the bearing seat. The measurement point arrangement, coordinate system, fault simulation test bench, and test system are shown in Fig. 3.46. Figure 3.47 shows the time and frequency domain diagrams of the bearing signals: Fig. 3.47a–c are the time domain diagrams of the normal, inner ring wire-cut 0.2 mm, and inner ring 4 mm pitting states, respectively, and Fig. 3.47d–f are the corresponding frequency domain diagrams. Impact components are hardly visible in Fig. 3.47a, c, whose time domain signals are extremely similar, while impact components are obvious in Fig. 3.47b. In Fig. 3.47d–f the spectral peaks all appear near 530.5 Hz, the natural frequency of the outer ring, and the modulation frequency is 7.4 Hz (530.5 Hz − 523.1 Hz = 7.4 Hz), extremely close to the calculated passing frequency of the cage (7.6 Hz); that is, a modulation phenomenon appears whose carrier frequency is the natural frequency of the outer ring and whose modulation frequency is the passing frequency of the cage, although the modulation is not obvious.

Table 3.12 Rolling bearing parameters and characteristic frequencies

Pitch diameter D (mm)                        38
Rolling element diameter d0 (mm)             6.5
Number of rolling elements m                 12
Contact angle α (°)                          18.33
Inner ring passing frequency fi (Hz)         128.82
Outer ring passing frequency fo (Hz)         98.78
Rolling element passing frequency fg (Hz)    52.02
Cage passing frequency fb (Hz)               7.60


Fig. 3.46 Bearing fault simulation test

Fig. 3.47 Time domain and frequency domain figure

Figure 3.48 shows the classification results using the semi-supervised LDA-GNSOM proposed in this section, where Fig. 3.48a, b show the classification results of the semi-supervised GNSOM and the semi-supervised LDA-GNSOM, respectively. In the figure, "Ο", "Δ" and "⛛" represent normal, spalling, and pitting labeled data, while "•", "*" and "×" represent the corresponding unlabeled data. As


Fig. 3.48 Results of the semi-supervised GNSOM classification of bearings

can be seen in Fig. 3.48a, although the semi-supervised GNSOM improves the visualization results slightly, the normal samples lie very close to the inner ring pitting samples and are almost mixed together. From Fig. 3.48b it is obvious that the semi-supervised LDA-GNSOM method based on LDA feature selection is significantly better than the semi-supervised GNSOM, with the output-layer neurons divided into three clusters. Figure 3.49 shows the classification results of the bearing data using the semi-supervised LDA-DPSOM proposed in this section. As Fig. 3.49a shows, although the output-layer neurons have been adjusted according to the corresponding distances, the classification effect is not obvious, whereas the semi-supervised LDA-DPSOM algorithm with LDA feature selection separates the three classes well: the intra-class distances are greatly reduced and the distinction between classes is much clearer.

Fig. 3.49 Results of the semi-supervised DPSOM classification of bearings


Table 3.13 Experimental results of semi-supervised SOM on the bearing dataset

                              Semi-supervised   Semi-supervised   Semi-supervised   Semi-supervised
                              GNSOM             LDA-GNSOM         DPSOM             LDA-DPSOM
Average correctness (%)       97.2              99.6              97.5              97.8
Average elapsed time (s)      5.2646            2.1268            4.9069            1.9136
Average quantization error    0.274             0.038             0.274             0.041
Average topology error        0.042             0.185             0.051             0.161

Table 3.13 shows the average correct rate, average elapsed time, average quantization error, and average topology error obtained using the semi-supervised GNSOM and semi-supervised DPSOM simulations. From Table 3.13 it is clear that the method with the highest correct rate is the semi-supervised LDA-GNSOM with 99.6%; the least time-consuming method is the semi-supervised LDA-DPSOM with 1.9136 s; the method with the lowest average quantization error is the semi-supervised LDA-GNSOM with 0.038; and the method with the lowest average topology error is the semi-supervised GNSOM with 0.042. The table also shows that the self-organizing map with LDA feature selection not only speeds up the computation (by 59.59% for the semi-supervised LDA-GNSOM and 61.01% for the semi-supervised LDA-DPSOM) but also improves the correct rate (by 2.4% for the semi-supervised LDA-GNSOM and 0.3% for the semi-supervised LDA-DPSOM) and reduces the average quantization error (by 86.13% for the semi-supervised LDA-GNSOM and 85.04% for the semi-supervised LDA-DPSOM). However, the average topology error increases.

3.5 Relevance Vector Machine Diagnosis Method

3.5.1 Introduction to RVM

The relevance vector machine (RVM) is built on Bayesian estimation theory and the support vector machine. To understand it, the basics of Bayesian estimation and the classification principle of the support vector machine are introduced first, followed by the relevance vector machine itself.


3.5.1.1 Bayesian Estimates

Bayes' theorem originates from "An Essay Towards Solving a Problem in the Doctrine of Chances" (1763), published by the British scholar Thomas Bayes in the proceedings of the Royal Society, which presents a method of probabilistic inference for the parameter of a binomial distribution. Bayes assumed the parameter to be uniformly distributed over the unit interval; his method of inferring the parameters of binomial distributions came to be known as Bayes' theorem, and it has since been extended beyond the binomial distribution to arbitrary statistical distributions.
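For reference, the modern statement of the theorem (a standard formulation, not quoted from the essay) is:

```latex
% posterior = likelihood x prior / evidence
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
\qquad
P(D) \;=\; \int P(D \mid \theta)\, P(\theta)\, \mathrm{d}\theta .
```

RVM training, described next, uses exactly this structure: a prior over the model weights is combined with the data likelihood to yield a posterior, from which sparse solutions emerge.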

3.5.1.2 Introduction to Relevance Vector Machines

The relevance vector machine (RVM) [23] is a sparse model with the same functional form as the SVM, proposed by Tipping in 2000 and trained within a Bayesian framework. It uses Bayesian inference, has good generalization ability, yields an even sparser solution, needs no tuning of the regularization hyperparameter, and is easy to implement. Applied to classification problems, it can also give a probability measure for class membership. It is therefore a meaningful attempt to apply RVM to fault detection and classification. The specific RVM algorithms are detailed in Ref. [23].

3.5.2 RVM Classifier Construction Method

3.5.2.1 Feature Selection Method

Given a set of measurements, dimensionality can be reduced in essentially two different ways. The first is to identify the variables that contribute little to classification: in discrimination problems, variables that have little effect on class separability can be ignored. All that needs to be done is to select a subset of features from the measured dimensions (the number of features must also be determined); this is called feature selection in the measurement space, or simply feature selection (see Fig. 3.50a). The second is to find a transformation from the measurement space to a feature space of lower dimension; this is called feature selection in the transformed space, or feature extraction (see Fig. 3.50b). The transformation can be a linear or nonlinear combination of the original variables, and it can be supervised or unsupervised. In the supervised case, the task of feature extraction is to find the transformation that maximizes the separability of the given classes. This section first discusses feature selection methods.

(1) Separability evaluation index


Fig. 3.50 Dimension compression: a feature selection; b feature extraction

The feature samples correspond to their fault modes and are located in different regions of the feature space; in principle, samples of different classes can be separated. If the dispersion between classes is large and the dispersion within classes is small, the separability of the samples is good. The "distance" between samples thus reflects their separability. The denominator of Formula (3.21) is the within-class separability index $S_{cw}$ of Sect. 3.2.2.1, representing the mean intra-class distance; the numerator is the between-class index, representing the mean distance between classes. To compare the within-class dispersion with the between-class dispersion, subtract $S_{cw}$ from the numerator of Eq. (3.21) and write:

$$J_{bw}(x)^{*} = \frac{S_{cb} - S_{cw}}{S_{cw}} \tag{3.56}$$

To achieve a normalization effect, the separability evaluation index is, on the basis of Formula (3.56), designed as:

$$J_b(x) = \frac{S_{cb} - S_{cw}}{\sqrt{S_{cw}^2 + \left( S_{cb} - S_{cw} \right)^2}} \tag{3.57}$$

On the one hand, this index can be used for feature selection. First, the value of $J_b$ is calculated for each feature, comparing its between-class and within-class dispersions; features with $J_b \le 0$ are eliminated first, and the remaining features are used for feature-combination calculation and feature selection. From the expression of Eq. (3.57), the closer $J_b$ is to 1, the better the selected feature. On the other hand, the index can quickly measure the validity of feature extraction on small samples and thus guide feature selection in pattern recognition and classification, which is of great significance for feature extraction.
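A minimal sketch of this per-feature screening; the exact forms of Scb and Scw follow Sect. 3.2.2.1, which is not reproduced here, so the simple variance-based dispersions below are assumptions.

```python
import numpy as np

def jb_index(x, y):
    """Separability index of Eq. (3.57) for one feature x (shape (n,)) with labels y."""
    classes = np.unique(y)
    scw = np.mean([x[y == c].var() for c in classes])            # within-class dispersion (assumed form)
    scb = np.array([x[y == c].mean() for c in classes]).var()    # between-class dispersion (assumed form)
    return (scb - scw) / np.sqrt(scw**2 + (scb - scw)**2)        # Eq. (3.57)

# Screening rule from the text: discard any feature with Jb <= 0,
# then search combinations of the survivors, preferring Jb close to 1.
```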


3.5.2.2 Feature Extraction

1. Kernel direct discriminant analysis

Kernel direct discriminant analysis (KDDA) was proposed by Lu in 2003; the idea is to apply the kernel method to direct linear discriminant analysis (D-LDA) [24], performing linear discriminant analysis in the feature space so that the data become linearly separable. The KDDA method has been used successfully in face recognition [25]; here it is applied to gear fault feature extraction. The basic idea of KDDA is to map the original space into a higher-dimensional feature space F by a nonlinear function and then apply the improved Fisher linear discriminant analysis in that space. At the same time, the kernel function is introduced so that the inner product of any two elements of F can be replaced by the value of the kernel function at the corresponding elements, and the explicit mapping into F need never be computed. The "small sample size problem" is effectively solved by using the improved Fisher discriminant. The detailed algorithm and derivation of KDDA can be found in Refs. [24, 25].

2. Feature extraction based on KDDA

Feature extraction for arbitrary test sample data z is realized by computing its projection onto the normalized discriminant feature vectors. Based on the above analysis, the steps of feature extraction using KDDA can be summarized as follows:

(1) For the given training sample data $\{z_i\}_{i=1}^{L}$, calculate the kernel function matrix.
(2) Standardize the kernel matrix and calculate $\Phi_b^T \Phi_b$.
(3) Solve the characteristic equation $(\Phi_b \Phi_b^T)(\Phi_b e_i) = \lambda_i (\Phi_b e_i)$ for its eigenvalues and eigenvectors.
(4) Calculate $U^T S_{WTH} U$ and diagonalize it.
(5) Calculate $\Theta$, completing the training of the KDDA feature extraction algorithm.
(6) For an input sample z, calculate the kernel vector $\gamma(\phi(z))$.
(7) Obtain the optimal discriminant feature vector of z from $y = \Theta \cdot \gamma(\phi(z))$.
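As a flavour of steps (1)–(2), the sketch below builds an RBF kernel matrix over training samples and centres it in feature space, the standard preprocessing before any kernel discriminant computation; σ and the data are placeholders, and the later KDDA-specific steps are not reproduced.

```python
import numpy as np

def rbf_kernel_matrix(Z: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """K[i, j] = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def centre_kernel(K: np.ndarray) -> np.ndarray:
    """Centre the kernel matrix in feature space: K <- (I - 1/n) K (I - 1/n)."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

K = centre_kernel(rbf_kernel_matrix(np.random.rand(50, 11)))
```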

3. Advantages of kernel direct discriminant analysis

KDDA is a feature extraction method that combines direct linear discriminant analysis (D-LDA) with kernel functions. Its advantages are summarized as follows:


(1) The kernel function is introduced so that nonlinear or complex problems in the input space become linear once mapped to the high-dimensional space, allowing linear discriminant analysis in the feature space. It is not hard to see that D-LDA is then a special case of KDDA.

(2) KDDA effectively solves the small-sample-size problem in the high-dimensional feature space through direct discriminant analysis, and makes full use of all the information in the optimal discriminant vectors, including the information inside the null space of $S_{WTH}$ and the information outside it.

(3) If only kernel discriminant analysis were used, without the direct method, the pseudo-inverse $K'$ of the kernel matrix K would have to be calculated while discarding $S_{WTH}$; this matrix is often ill-conditioned, depending on the choice of kernel and kernel parameters, and then no solution exists. KDDA avoids this problem by introducing the direct method.

4. Comparison between kernel direct discriminant analysis and kernel principal component analysis

The basic idea of KDDA is the same as that of kernel principal component analysis (KPCA): both use the kernel trick to map the sample points of the input space into a high-dimensional (even infinite-dimensional) feature space through a nonlinear mapping, and then solve for the eigenvector coefficients in that space. The difference is that the projection directions of the KPCA transform maximize the total scatter of all samples, retaining the components with the largest sample variance without considering the between-class scatter; that is, KPCA does not exploit the differences between classes. When KDDA is applied to feature extraction, the between-class information is fully considered, which maximizes the distance between different classes and minimizes the distance within each class in the reduced feature space. In other words, after KDDA feature extraction, samples of the same class are clustered together while samples of different classes are separated as much as possible. KDDA is therefore better suited than KPCA to feature extraction for classification.

3.5.2.3 Single-Value Classification Problem

Single-valued classification differs from traditional binary classification, which assigns a given sample to Category I or Category II; the goal of single-valued classification is to decide whether a sample belongs to a given category or not. There is only one category in the single-valued classification problem, so only samples of that one category are needed to construct the single-valued classifier.

(1) Relevance vector machine single-value classification


Fig. 3.51 Single-valued classification schematic

The RVM single-valued classification model is built on the binary classification algorithm. Unlike binary classification, there is only one class of training samples, so the class is marked as tn = 0 or tn = 1. The single-valued RVM classifier is trained according to the algorithm above, and the predicted value of each training sample point is calculated. Taking the predicted values as inputs to the logistic function, the probability distribution interval of the predictions is obtained. The predicted value and probability of a test point are then estimated with the classifier: if the probability falls inside the training probability interval the point is a normal sample; otherwise it is an abnormal sample. Data samples from normal operation are easy to obtain, while fault samples are difficult to collect; the single-valued classification method therefore uses only the normal-state data samples to establish a single-valued fault classifier and identify the running state of the machine. Figure 3.51 is a sketch of single-valued classification.

(2) Support vector data description

The support vector data description (SVDD) algorithm is a single-valued classification method based on the support vector algorithm. It was proposed by Tax and Duin and developed from Vapnik's support vector machine. Unlike the optimal hyperplane of the support vector machine, the goal of SVDD is to find a minimum-volume sphere containing the target sample data [26], such that all (or most) of the target samples lie within the sphere. The basic principles and algorithms of SVDD are not detailed here; they can be found in Ref. [27].
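Neither the single-valued RVM nor SVDD ships with standard libraries, but the workflow described above, train on normal data only and then flag everything outside the learned boundary, can be sketched with scikit-learn's OneClassSVM, which with an RBF kernel is closely related to SVDD; the data and parameters are placeholders.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(140, 2))              # training: normal samples only
X_test = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),    # unseen normal data
                    rng.normal(4.0, 1.0, size=(40, 2))])    # synthetic "fault" data

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_normal)
pred = clf.predict(X_test)                                  # +1 = normal, -1 = abnormal
print("flagged as faulty:", int((pred == -1).sum()))
```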

3.5.2.4 Multi-value Classification Problem

Mechanical fault pattern recognition is a diagnosis method that takes raw input data and acts according to its category. The three basic operations of a pattern recognition system are preprocessing, feature extraction, and classification. The function of the classifier is to assign a class label to the object under test according to the feature vector produced by the feature extractor. The ease of classification depends on two factors. The first is the fluctuation of feature values between individuals of the same category. The


second is the difference between the feature values of samples belonging to different categories. A pattern recognition system involves the following steps: data acquisition, feature selection, model selection, training, and evaluation, as shown in Fig. 3.52. At present, what pattern recognition technology can do is essentially classification; true understanding is still a considerable distance away. As a pattern classification method, the issues involved are: (1) feature extraction; (2) learning decision rules from training samples; (3) using the decision rules to classify samples. The key to a pattern recognition system is the design of the feature selection and feature discrimination modules. Selecting features with high descriptive accuracy is of great significance to the system: less needs to be stored while more physical meaning is expressed. A reasonable

Fig. 3.52 Block diagram of the pattern recognition system


design of the feature discriminator can give the system high stability and accuracy. The relevance vector machine (RVM) performs well on two-class problems and can be extended to multi-class problems. RVM was originally proposed for two-class classification, but for pattern recognition applications a binary classifier alone obviously cannot meet the need; in fault diagnosis in particular, once fault symptoms and fault causes are quantified, the resulting problems are certainly not limited to two classes. Whether the binary RVM classifier can be extended effectively to a multi-valued RVM classifier is therefore an important factor restricting the successful application of RVM in engineering. Because RVM is developed from the SVM, research on its multi-value classification algorithms can take inspiration from the SVM. According to existing SVM theory, there are two main ways to construct SVM multi-value classifiers, called the complete multi-class support vector machine and the combined multi-class support vector machine; the combined multi-class algorithms divide into "One-to-many" and "One-to-one" algorithms. RVM multi-class classification can adopt the same construction methods.

(1) Complete multi-class relevance vector machine

In Ref. [28], a sparse Bayesian classification algorithm based on automatic relevance determination is proposed to solve multi-classification problems. The traditional "One-to-many" and "One-to-one" methods are not only complicated in the training phase but also cause overlap and large classification loss. The sparse Bayesian classification method is a direct multi-class predictor, which overcomes these shortcomings by introducing a regularized correlation function. Sparse Bayesian learning includes the Occam pruning rule, which trims complex models to keep them smooth and simple. Ref. [28] thus proposes a sparse-learning multi-class classifier within the Bayesian framework: the algorithm sets up a multinomial distribution for each class, and the multi-class output of the model is the softmax of the kernel basis functions (the regularized correlation functions). Parameter estimation adopts automatic relevance determination, which ensures the sparsity and smoothness of the model.

(2) Combined multi-class relevance vector machine

In current machine learning practice, the usual approach is to break a large-scale learning problem into a series of small problems, each solving a binary classification problem. This is done not only because it reduces the computational complexity of the whole problem and improves the generalization performance of the global classifier, but also because some machine learning algorithms degrade rapidly as the problem size increases, while others cannot directly solve multi-class classification problems. In designing machine learning algorithms, a binary classification algorithm is often designed first and then extended toward multi-class classification or regression estimation. Some algorithms extend the existing problem directly; in this case the original


problem is usually converted into multiple two-class classification problems, and a reconstruction algorithm is then designed to combine all the independent results. At present, applications of pattern classification to multi-class problems follow this decompose-and-reconstruct idea and can be divided into "One-to-many" and "One-to-one" classification algorithms. The "One-to-one" algorithm is better than the "One-to-many" algorithm in training time and also has higher classification accuracy. However, one disadvantage of the "One-to-one" algorithm is that each classifier is trained on the samples of two specific classes, yet every test sample is fed to all the classifiers; results may therefore be assigned to classes by votes from irrelevant classifiers, which has some negative impact on the results, and how to penalize such irrelevant votes becomes a new problem.
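The two combined schemes can be sketched with scikit-learn's generic wrappers around any binary classifier; logistic regression stands in below for a binary RVM, which has no standard-library implementation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)
base = LogisticRegression(max_iter=1000)      # placeholder for a binary probabilistic classifier

ovo = OneVsOneClassifier(base).fit(X, y)      # "One-to-one": k(k-1)/2 pairwise classifiers
ovr = OneVsRestClassifier(base).fit(X, y)     # "One-to-many": k one-against-all classifiers
print(ovo.score(X, y), ovr.score(X, y))
```

The one-vs-one wrapper resolves contradictory pairwise decisions by majority voting, which is precisely where the penalty question raised above arises.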

3.5.3 Application of RVM in Fault Detection and Classification In this section, the vibration signals of transmission in a normal state, gear fault, and bearing fault are collected by experiment, and the effective information is extracted by feature selection, the method is applied to the detection and classification of typical faults of gear and bearing in RVM gearbox. The performance of RVM and SVM methods is analyzed and compared.

3.5.3.1 Transmission Typical Fault Experimental Device and Test Method

The experimental apparatus and test method are the same as in Sect. 3.2; the structure of the experimental system, the transmission test bed and control board, the transmission diagram, and the arrangement of measuring points are all consistent with Sect. 3.2.4.1.

3.5.3.2 Application of Gearbox Fault Detection Based on RVM

A good detection model combines accuracy with stability, so the detection rate can be used to measure detection effectiveness. On the one hand, the detection model is expected to have a high detection rate, that is, the detector should accurately detect minor faults that differ little from the normal state. Because the normal mode and minor fault modes are similar, they are hard to distinguish in an ordinary linear space, so minor fault detection is especially difficult. On the other hand, prior information on the fault modes is insufficient in most detection tasks, and the known samples are almost all normal data. This


means that the detector cannot be trained ideally, which in turn affects detection accuracy. Gear tooth-surface pitting is a slight fault: the vibration signal extracted in the slight pitting state is difficult to distinguish from the normal signal by signal processing methods alone. In this section, the RVM method is used to detect slight gear pitting faults and is compared with the SVDD method.

(1) Experiment and signal acquisition

To identify early gear faults, the test is divided into two parts. First, the vibration signal is collected from the normal transmission; then a faulty gear with slight pitting replaces the 5th-gear meshing gear on the output shaft of the transmission, and the vibration signal is collected again. The vibration acceleration signals were collected in the horizontal direction at measurement point 1, with the input speed set to 1000 r/min and torque 145 N·m, and the output speed set to 1300 r/min and torque 100 N·m. The sampling frequency was 40,000 Hz, the upper limit of the low-pass filter 2000 Hz, and the number of sampling points 1024 × 90; six groups of data were recorded. The resulting time domain waveforms are shown in Fig. 3.53: the amplitude difference is not obvious, the waveforms are very similar and almost indistinguishable, and it is difficult to judge the gear fault from them.

(2) Feature selection index

Many feature values reflect gear faults, and their sensitivity to faults differs. Features are selected according to how closely they relate to the faults and how strongly they are disturbed by noise in testing and analysis. Eleven time-domain feature values are selected for calculation: mean square value (1), kurtosis (2), mean value (3), variance (4), skewness (5), peak value (6), root mean square amplitude (7), waveform index (8), peak value index (9), pulse index (10), and margin index (11). When selecting training samples, a cluster-analysis preprocessing step can be applied to remove outliers from the sample set and ensure a good subsequent classification. To verify the quality of the selected features and sample points, the separability evaluation index is used. The experimental data in each file are divided into 240 segments, and the 11-dimensional feature index values above are calculated for each segment to generate

Fig. 3.53 Time domain waveform


240 samples. The separability evaluation index of each feature is calculated, as shown in Table 3.14. From the table, the index values of kurtosis (2), skewness (5), waveform index (8), peak index (9), pulse index (10), and margin index (11) are all less than zero, i.e., their between-class dispersion is smaller than their within-class dispersion; samples composed of these features have poor separability, so these features are eliminated, and the remaining features, mean square value (1), mean value (3), variance (4), peak value (6), and root mean square amplitude (7), are used in combination. When a pattern recognition method is used for state classification, a feature count of 2–3 is appropriate: with fewer features the error rate is high, while too many features make the discriminant function complex, increase computation, and degrade real-time performance. Moreover, the error rate does not decrease monotonically as the number of features increases; when the number of features grows to 3, the computation becomes complicated and real-time performance poor, but the error rate does not improve obviously. Therefore, the RVM uses only the separability evaluation index to select features, and only two-dimensional feature combinations are selected to evaluate classification validity.

(3) Analysis of RVM detection results

During classification, 140 normal samples were selected as training samples, and 100 normal samples and 240 pitting samples were tested. To verify the validity of the evaluation index Jb, the Gaussian kernel function was chosen as the mapping function and the classification effects at different index values were compared. Figure 3.54 shows the classification effect at Jb = 0.968, and Fig. 3.55 the effect at Jb = 0.838. The classification plots show that when Jb is large, RVM and SVDD separate the fault samples from the normal samples well, but when Jb is small, the fault and normal samples overlap considerably and the classification effect is poor.

Table 3.14 Comparison of separability evaluation indexes of each eigenvalue

Eigenvalue code    Interclass dispersion Scb    Within-class dispersion Scw    Evaluation indicator Jb
1                  1.081                        0.323                          0.920
2                  0.001                        0.924                          − 0.707
3                  0.778                        0.147                          0.974
4                  0.975                        0.355                          0.868
5                  0.000                        1.120                          − 0.707
6                  1.062                        0.448                          0.808
7                  1.176                        0.257                          0.963
8                  0.005                        0.865                          − 0.705
9                  0.000                        0.947                          − 0.707
10                 0.000                        0.926                          − 0.707
11                 0.000                        0.914                          − 0.707


Fig. 3.54 Jb = 0.968 classification renderings

Fig. 3.55 Jb = 0.838 classification renderings

Table 3.15 compares the classification accuracy of the two methods for several combinations of the selected eigenvalues. It shows that when Jb is greater than 0.9, the classification accuracy of both methods exceeds 90%, which verifies the effectiveness of the feature selection. Because RVM uses Bayes' theorem to predict class membership probabilities, it can quantitatively evaluate whether a detection result belongs to the class. Most of its training points lie within the classification boundary, and the training error is essentially zero. SVDD, by contrast, only considers the structure of the data space and simply accepts or rejects each point, which leaves many support vectors on the boundary and yields a larger training error. In addition, the number of RVM relevance vectors is far smaller than the number of SVDD support vectors, so the solution is sparser and the model structure simpler.


Table 3.15 Comparison of classification results and separability evaluation indexes after feature combination

Combination of features           1 and 4  3 and 6  3 and 7  6 and 7  1 and 6  4 and 6
Gaussian kernel width             0.2      0.3      0.4      0.4      0.3      0.2
RVM classification accuracy (%)   99.41    97.94    99.41    92.64    91.47    88.24
SVDD classification accuracy (%)  99.12    90.29    97.47    90.65    88.82    84.71
Number of relevance vectors       5        5        2        3        3        4
Number of support vectors         18       37       27       32       35       50
Evaluation indicator Jb           0.947    0.902    0.968    0.909    0.872    0.838

3.5.3.3 Application of Gearbox Fault Classification Based on RVM

1. Experimental analysis on the classification of typical bearing faults

Because the vibration energy of a bearing is smaller than that of gears and shafting and its fault characteristics are less obvious, bearings are difficult to identify and diagnose, and classifying different types of bearing faults is more difficult still. In this section, the relevance vector machine (RVM) multi-classification method is used to classify transmission sample data under normal, bearing inner-ring spalling, and bearing outer-ring spalling conditions.

(1) Experiment and signal acquisition

The cylindrical rolling bearing of the output shaft of a Dongfeng SG135-2 transmission is taken as the experimental object. The measuring point is located on the bearing block of the output shaft, i.e., the position of measuring point 1 in Fig. 3.9. The experiment runs the transmission in three states: normal, bearing inner-ring spalling, and bearing outer-ring spalling. In all three states, the transmission must operate under the same conditions so that the collected experimental data are comparable. The transmission is set to 3rd gear; the input constant-mesh gear ratio is 38/26 and the 3rd-gear ratio is 35/30. The speeds of the input and output shafts are 2400 r/min and 1370 r/min respectively, the output torque is 105.5 N·m, and the power is 15 kW. The sampling frequency is 40,000 Hz, the anti-aliasing filter cutoff is 20,000 Hz, and the sampling length is 1024 × 90 points; vibration acceleration signals in the horizontal, vertical, and axial directions are collected simultaneously. The parameters of the output shaft rolling bearing are shown in Table 3.16.

(2) Feature selection and extraction

The vibration acceleration signals of the experimental transmission under normal, bearing inner-ring spalling, and bearing outer-ring spalling conditions were selected. When the inner or outer ring of the rolling bearing has fatigue spalling, an obvious modulated peak group appears near the natural frequency of the outer ring in the middle- and high-frequency region of the spectrum: a natural-frequency modulation phenomenon [29] in which the natural frequency of the outer ring is the carrier frequency and the bearing pass frequency is the modulation frequency.


Table 3.16 Output shaft rolling bearing parameters and characteristic frequencies

Pitch diameter D (mm)                 85
Roller diameter d0 (mm)               18
Number of rolling elements m          13
Contact angle α                       0
Inner ring pass frequency f_i (Hz)    179.6
Outer ring pass frequency f_o (Hz)    116.8
Rolling body pass frequency f_b (Hz)  51.4
Cage pass frequency f_g (Hz)          10.9
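The characteristic frequencies in Table 3.16 follow from standard rolling-bearing kinematics. The short sketch below is an illustration (not code from the book) that computes them from the geometry and the 1370 r/min output-shaft speed; the inner-ring, outer-ring, and roller values land close to the tabulated ones, while the cage value suggests the table uses a slightly different convention for f_g.

```python
import math

def bearing_frequencies(f_r, D, d, m, alpha_deg=0.0):
    """Classical bearing characteristic frequencies.
    f_r: shaft rotation frequency (Hz); D, d: pitch and roller diameters;
    m: number of rolling elements; alpha_deg: contact angle in degrees."""
    ratio = (d / D) * math.cos(math.radians(alpha_deg))
    f_i = 0.5 * m * f_r * (1 + ratio)           # inner-ring pass frequency
    f_o = 0.5 * m * f_r * (1 - ratio)           # outer-ring pass frequency
    f_b = 0.5 * (D / d) * f_r * (1 - ratio**2)  # roller spin frequency
    f_c = 0.5 * f_r * (1 - ratio)               # cage frequency
    return f_i, f_o, f_b, f_c

# Output shaft of the SG135-2 transmission: 1370 r/min, D = 85 mm,
# d0 = 18 mm, m = 13, alpha = 0 (Table 3.16).
f_i, f_o, f_b, f_c = bearing_frequencies(1370 / 60, 85, 18, 13, 0)
print(f"f_i = {f_i:.1f} Hz, f_o = {f_o:.1f} Hz, f_b = {f_b:.1f} Hz, f_c = {f_c:.1f} Hz")
# f_i ~ 179.9 Hz and f_o ~ 117.0 Hz, close to 179.6 / 116.8 Hz in Table 3.16;
# f_b ~ 51.5 Hz matches 51.4 Hz. The cage value here (~9.0 Hz vs. 10.9 Hz)
# indicates the table's f_g may be defined with a different convention.
```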

Therefore, when choosing the frequency-domain indexes, the frequency value corresponding to the concentration of spectral energy is considered. At the same time, the amplitudes corresponding to the pass frequencies of the bearing inner and outer rings are also sensitive characteristics of the fault. The following characteristic indicator values are therefore calculated for each segment of data:

(a) time-domain statistical characteristic indexes: mean value (1), mean square value (2), kurtosis (3), variance (4), skewness (5), peak value (6), root-mean-square amplitude (7);
(b) dimensionless characteristic indexes: waveform index (8), pulse index (9), peak index (10), margin index (11);
(c) frequency-domain characteristic indexes: the frequency value corresponding to the highest peak of the spectrum (12), and the amplitudes corresponding to the pass frequencies of the bearing inner and outer rings in the refined (zoomed) spectrum (13 and 14).

The above features constitute the feature set of the experimental analysis, which is used as "original information" for further feature selection, extraction, and pattern classification. For the time-domain sampling sequence collected in each direction, one segment was truncated every 2048 points, and a total of 45 segments were truncated for experimental analysis. By cross-sampling, 90 samples were obtained in the x direction in each state, i.e., a total of 270 samples over the three states. Fourteen feature indexes are extracted from each sample; the separability evaluation index of each feature is calculated, features with smaller indexes are eliminated, and the remaining features are processed by kernel direct discriminant analysis (KDDA). Table 3.17 lists the separability evaluation index of each characteristic value. As can be seen from the table, the separability evaluation indexes of the mean square value (1), variance (4), peak value (6), root mean square amplitude (7), the frequency corresponding to the highest spectral peak (12), and the amplitudes at the bearing pass frequencies (13 and 14) are all greater than zero and close to 1, so the samples differ greatly among the three states of normal, bearing inner-ring spalling, and bearing outer-ring spalling.


Table 3.17 Separability evaluation index values of the bearing-class eigenvalues

Eigenvalue code  Divergence between classes Scb  Divergence within classes Scw  Separability evaluation index Jb
1                2.168                           0.247                          0.990
2                0.015                           0.991                          −0.702
3                0.246                           0.914                          −0.590
4                2.168                           0.274                          0.990
5                0.585                           0.801                          −0.261
6                2.103                           0.295                          0.987
7                2.465                           0.174                          0.997
8                0.104                           0.962                          −0.666
9                0.038                           0.984                          −0.693
10               0.028                           0.987                          −0.697
11               0.025                           0.988                          −0.698
12               2.989                           0                              1
13               2.989                           0                              1
14               2.989                           0                              1

In the process of feature classification, these feature indicators should be the preferred candidate indicators. For the other characteristic indexes, the inter-class divergence is smaller than the intra-class divergence and the separability evaluation index is less than zero, which indicates that the feature points of the three kinds of samples are mixed together and difficult to distinguish; these indexes are not suitable for feature extraction. A 7-dimensional eigenvalue matrix is composed of the mean square value (1), variance (4), peak value (6), root mean square amplitude (7), the frequency corresponding to the highest spectral peak (12), and the amplitudes at the bearing outer- and inner-ring pass frequencies (13 and 14), and the KDDA method is used for feature extraction. The effect of feature extraction with a Gaussian kernel (kernel parameter σ = 5) is shown in Fig. 3.56: Fig. 3.56a shows the KDDA feature extraction and Fig. 3.56b the KPCA feature extraction. As the figure shows, after feature selection with the separability evaluation index, both methods can extract three separable classes. The separability evaluation index can both select features before extraction and evaluate the extracted feature vectors afterwards. After feature extraction, the separability evaluation index is calculated again, and the Jb of the feature vectors extracted by both methods equals 1. The ratio of inter-class to intra-class divergence for the feature vectors extracted by KDDA is Scb/Scw = 301.882/1.899 = 158.97, while for KPCA it
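The chapter does not reproduce the exact formulas behind Scb, Scw, and Jb, so the following is only a minimal Fisher-style sketch of a between-class/within-class scatter ratio of the kind compared here; the function name and the trace-based definitions are assumptions for illustration.

```python
import numpy as np

def scatter_ratio(X, y):
    """Fisher-style class-separability measure: ratio of between-class
    to within-class scatter (trace form). X: (n, d) features, y: labels."""
    mu = X.mean(axis=0)
    s_cb = 0.0  # between-class divergence
    s_cw = 0.0  # within-class divergence
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        s_cb += len(Xc) * np.sum((mu_c - mu) ** 2)
        s_cw += np.sum((Xc - mu_c) ** 2)
    return s_cb / s_cw

# Two well-separated synthetic classes give a large ratio, mirroring the
# Scb/Scw = 158.97 (KDDA) vs. 62.14 (KPCA) comparison in the text.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(scatter_ratio(X, y))
```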


Fig. 3.56 The effect of feature extraction when σ = 5

is Scb/Scw = 0.870/0.014 = 62.14, so the effect of KDDA feature extraction is slightly better than that of KPCA.

(3) Classification of the bearing experimental data with the RVM multi-classification method

The feature vectors extracted by KDDA are used for RVM multi-classification. From each of the three types of samples, 50 were randomly selected as training samples (150 training samples in total), and the remaining 120 served as test samples. The normal samples, bearing inner-ring spalling samples, and bearing outer-ring spalling samples are labeled 1, 2, and 3 respectively. The Gaussian radial basis function is selected as the kernel, with kernel parameters chosen empirically. The three-class classification effect is shown in Fig. 3.57, where '.' represents a normal sample, '*' a bearing inner-ring spalling sample, 'Δ' a bearing outer-ring spalling sample, and 'O' a relevance vector or support vector point. From Table 3.18 and Fig. 3.57, after feature selection and extraction both RVM and SVM achieve good results, with 100% classification accuracy. The training time of RVM is longer than that of SVM, but its test time is shorter, and the number of relevance vectors is much smaller than the number of support vectors, so the RVM solution is sparser.

2. Experimental analysis of typical gear fault classification

Typical gear faults include pitting, abrasion, spalling, tooth breakage, and so on. In this section, four fault types (moderate gear spalling, severe gear spalling with tooth-surface deformation, tooth breakage, and severe pitting) are selected for classification experiments, and a multi-fault classifier is established.

(1) Experiment and signal acquisition

The 5th-speed gear of the Dongfeng SG135-2 transmission is taken as the experimental object. The vibration acceleration signal is collected at measuring point 3 at the input of the transmission, as shown in Fig. 3.12. The experiment runs the transmission in five states: normal, gear pitting, moderate gear spalling, severe gear spalling with tooth-surface deformation, and broken gear teeth.


Fig. 3.57 Classification effect of the three-class bearing task

Table 3.18 Comparison of the three-class bearing classification results

Classification algorithm                  One-on-one RVM  SVM
Gaussian kernel parameter                 1.5             5
Training accuracy (%)                     100             100
Test accuracy (%)                         100             100
Training time (s)                         4.2014          1.0540
Test time (s)                             0.0029          0.0126
Number of relevance (or support) vectors  5               14
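The "one on one" entries in Table 3.18 refer to the pairwise multi-class scheme: one binary classifier per class pair, with the final label decided by voting. The book's RVM implementation is not publicly reproduced here, so the hedged sketch below uses scikit-learn's OneVsOneClassifier with an RBF SVC and synthetic stand-in data purely to illustrate the scheme and the train/test timing comparison.

```python
import time
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Three synthetic "states" stand in for normal / inner-ring / outer-ring data.
X, y = make_blobs(n_samples=270, centers=3, n_features=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=150, random_state=0)

# One-on-one (pairwise) scheme: one binary classifier per class pair,
# final label by voting -- structurally analogous to Table 3.18.
clf = OneVsOneClassifier(SVC(kernel="rbf", gamma=1.0 / (2 * 5.0**2)))  # sigma = 5
t0 = time.perf_counter(); clf.fit(X_tr, y_tr); t_train = time.perf_counter() - t0
t0 = time.perf_counter(); acc = clf.score(X_te, y_te); t_test = time.perf_counter() - t0
print(f"test accuracy = {acc:.3f}, train {t_train:.4f} s, test {t_test:.4f} s")
```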

During the experiment, the transmission is operated under the same working condition in every state to ensure the validity of the classification. The transmission is set to 5th gear, with an input shaft speed of 600 r/min, an output shaft speed of 784 r/min, an output torque of 75.5 N·m, and an output power of 6.5 kW. The sampling frequency is set to 40,000 Hz, the sampling length is 1024 × 90 points, and the horizontal radial (X-direction) signal is collected.

(2) Feature selection and extraction

The vibration acceleration signals of the experimental transmission were extracted under the five operating conditions of normal, severe pitting, moderate spalling, severe spalling with tooth-surface deformation, and broken gear tooth. When concentrated tooth-profile errors such as tooth-surface pitting or fatigue spalling


occur during gear operation, a modulation phenomenon arises in which the gear meshing frequency and its harmonics act as the carrier frequency and the rotation frequency of the gear shaft and its harmonics act as the modulation frequency. The severity of the modulation is determined by the degree of gear damage. In the time domain, this appears as changes in the effective (RMS) value and the kurtosis index, which are statistical indexes of the vibration energy. When a fault occurs, the above modulation phenomenon becomes more obvious; it also excites natural-frequency modulation of the gear, and the rotation-frequency energy of the gear shaft increases markedly. Based on the above analysis, the following characteristic index values were calculated for each data segment:

(a) time-domain statistical characteristic indexes: mean value (1), mean square value (2), kurtosis (3), variance (4), skewness (5), peak value (6), root-mean-square amplitude (7);
(b) dimensionless characteristic indexes: waveform index (8), pulse index (9), peak index (10), margin index (11);
(c) frequency-domain characteristic indexes: the sum of the main-frequency and all side-frequency amplitudes in the modulation frequency band (12), the corrected amplitude at the meshing frequency of the experimental gear (13), and the corrected amplitude at the rotation frequency of the shaft carrying the experimental gear (14).

The time-domain parameters effectively represent the time-domain characteristics and the envelope energy of the vibration, while the frequency-domain parameters reflect the modulation characteristics and the distribution of vibration energy to some extent.

The bearing fault classification experiment was a three-class experiment; to further verify the classification performance of RVM, four-class and five-class experiments are carried out here. Four of the five transmission operating states are selected as a combination, and the four state datasets in each combination serve as the samples of Classes 1-4 for the four-class experiments. Two such combinations are established: (a) normal, moderate gear spalling, severe gear spalling with tooth-surface deformation, and broken gear teeth; (b) moderate spalling, severe spalling with tooth-surface deformation, tooth breakage, and severe pitting. A third combination containing all five sample types, (c) normal, moderate spalling, severe spalling with surface deformation, tooth breakage, and severe pitting, is used for the five-class experiment, with the five state datasets set as five sample classes. In each combination, the different failure states are mixed with each other, which fully simulates the coexistence of multiple modes during transmission operation. For the time-domain sampling sequence collected in each direction, one segment was truncated every 2048 points, giving 45 segments for experimental analysis. After cross-sampling, 90 samples were obtained in the x direction for each state, i.e., 450 samples over the five states, and 14 feature indexes were extracted from each sample.


(3) Classifying the gear experimental data with the RVM multi-classification method

Using the same feature selection and extraction procedure as for the bearings, the separability evaluation indexes of the 14 feature values are calculated first, KDDA feature extraction is carried out on the features with larger Jb, and the extracted feature vectors are used for RVM classification and compared with SVM classification.

Combination A: normal, moderate spalling, severe spalling with surface deformation, broken teeth. The values of the 14 separability indexes are shown in Table 3.19, the classification results are compared in Table 3.20, and the KDDA feature extraction is shown in Fig. 3.58. The four-class classification effects of RVM and SVM after KDDA feature extraction are shown in Fig. 3.59, where '.' indicates a normal sample, '+' moderate gear spalling, '*' severe gear spalling with tooth-surface deformation, '◇' a broken gear tooth, and 'O' a relevance vector or support vector point.

Combination B: moderate spalling, severe spalling with surface deformation, tooth breakage, severe pitting; see Figs. 3.60 and 3.61 and Tables 3.21 and 3.22.

Combination C: normal, moderate spalling, severe spalling with tooth-surface deformation, tooth breakage, severe pitting; see Figs. 3.62 and 3.63 and Tables 3.23 and 3.24.

3. Experimental results analysis of gear multi-classification

According to the classification results of the three combinations, both RVM and SVM achieve good results after feature selection and extraction.

Table 3.19 Separability evaluation index values for gear class combination A

Eigenvalue code  Divergence between classes Scb  Divergence within classes Scw  Separability evaluation index Jb
1                2.697                           0.323                          0.991
2                1.626                           0.591                          0.869
3                3.967                           0.005                          1
4                2.677                           0.328                          0.990
5                0.490                           0.875                          −0.402
6                3.034                           0.239                          0.996
7                3.280                           0.177                          0.998
8                2.781                           0.302                          0.993
9                2.626                           0.341                          0.989
10               2.518                           0.368                          0.986
11               2.531                           0.364                          0.986
12               2.694                           0.324                          0.991
13               2.292                           0.424                          0.975
14               1.277                           0.678                          0.662


Table 3.20 Comparison of the four-class classification results for gear class combination A

Classification algorithm                  One-on-one RVM  SVM
Gaussian kernel parameter                 2               10
Training accuracy (%)                     100             100
Test accuracy (%)                         100             100
Training time (s)                         8.0572          2.3703
Test time (s)                             0.0050          0.0072
Number of relevance (or support) vectors  7               17

Fig. 3.58 KDDA feature extraction of gear class combination A

Fig. 3.59 Four-class classification effect of gear class combination A


Fig. 3.60 KDDA feature extraction of gear class combination B

Fig. 3.61 Four-class classification effect of gear class combination B

RVM classification requires fewer relevance vectors than SVM requires support vectors, and its test time is shorter; however, the RVM training stage is more complex and its training time longer. As shown in Figs. 3.58, 3.60, and 3.62, after KDDA feature extraction the normal samples in combination A and the moderate spalling and severe pitting samples in combination B lie close to each other, almost as if they belonged to the same category. The normal, moderate spalling, and severe pitting samples in combination C are also relatively close, with some overlapping samples. After RVM classification, the nearby and partially overlapping samples were separated, and the classification accuracy of combinations A and B reached 100%. Combination C cannot be completely separated because some sample points overlap, but it still


Table 3.21 Separability evaluation index values for gear class combination B

Eigenvalue code  Divergence between classes Scb  Divergence within classes Scw  Separability evaluation index Jb
1                2.518                           0.368                          0.986
2                1.585                           0.601                          0.853
3                3.979                           0.003                          1
4                2.496                           0.373                          0.985
5                0.530                           0.865                          −0.361
6                3.126                           0.216                          0.996
7                3.280                           0.177                          0.997
8                2.736                           0.313                          0.992
9                2.398                           0.398                          0.981
10               2.391                           0.399                          0.980
11               2.437                           0.388                          0.983
12               2.844                           0.286                          0.994
13               2.863                           0.281                          0.994
14               1.524                           0.616                          0.827

Table 3.22 Comparison of the four-class classification results for gear class combination B

Classification algorithm                  One-on-one RVM  SVM
Gaussian kernel parameter                 1.4             20
Training accuracy (%)                     100             100
Test accuracy (%)                         100             100
Training time (s)                         9.0026          2.2073
Test time (s)                             0.0048          0.0070
Number of relevance (or support) vectors  12              17

can achieve high classification accuracy. From Table 3.24, both the training accuracy and the test accuracy of RVM are higher than those of the support vector machine (Fig. 3.63).

3.5.3.4 Analysis of Factors Affecting Classifier Performance

(1) The influence of the kernel function on classification performance

Three RVM-based fault detection classifiers were constructed using polynomial, Gaussian radial basis function, and multilayer perceptron kernel functions, respectively, and the classification accuracy of each feature combination was


Table 3.23 Separability evaluation index values for gear class combination C

Eigenvalue code  Divergence between classes Scb  Divergence within classes Scw  Separability evaluation index Jb
1                3.579                           0.282                          0.996
2                2.099                           0.578                          0.935
3                4.967                           0.004                          1
4                3.555                           0.287                          0.996
5                0.654                           0.867                          −0.238
6                3.955                           0.207                          0.998
7                4.273                           0.143                          0.999
8                3.526                           0.293                          0.996
9                3.253                           0.347                          0.993
10               3.172                           0.363                          0.992
11               3.202                           0.357                          0.992
12               3.588                           0.280                          0.996
13               3.347                           0.328                          0.994
14               1.874                           0.623                          0.895

Fig. 3.62 KDDA feature extraction of gear class combination C

Table 3.24 Comparison of the five-class classification results for gear class combination C

Classification algorithm  One-on-one RVM  SVM
Kernel parameter          0.7             20
Training accuracy (%)     96              94.4
Test accuracy (%)         97              95
Training time (s)         15.6247         6.5049


Fig. 3.63 Five-class classification effect of gear class combination C

compared. The classification accuracies of RVM and SVDD detection are shown in Table 3.25, which compares the classification results of the different kernel functions for the feature combinations with large Jb values. From Table 3.25, the performance of the polynomial kernel and the multilayer perceptron kernel is similar, with essentially comparable accuracy. The Gaussian kernel classifies best, which may be related to the choice of kernel parameters and is one reason Gaussian kernels are used more often in applications. In multi-classification, because of the complexity of the data, classifiers trained with the polynomial and multilayer perceptron kernels are unstable, whereas the Gaussian kernel is relatively stable and achieves better classification results. Once the training set is given, the kernel function and kernel parameters of RVM must be selected in order to search for a decision function. In most cases, the kernel function is chosen from experience and existing selection practice; the kernels in the simulations and experiments above were chosen this way.

Table 3.25 Comparison of classification results of different kernel functions

                       RVM classification accuracy (%)   SVDD classification accuracy (%)
Kernel function        1 and 4   3 and 6   3 and 7       1 and 4   3 and 6   3 and 7
Gaussian kernel        99.41     97.94     99.41         99.12     90.29     97.47
Polynomial kernel      99.12     94.71     98.53         92.35     82.94     91.18
Multilayer perceptron  99.12     96.18     98.53         94.41     87.65     95.88
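For reference, the three kernel families compared in Table 3.25 have standard closed forms. The minimal sketch below is only an illustration of those forms (parameter values are arbitrary, not the ones used in the experiments); the "MLP" kernel is the sigmoid/tanh kernel, which is only conditionally positive definite, one plausible source of the instability noted above.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian radial basis function kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def polynomial_kernel(x, z, degree=3, c=1.0):
    """Polynomial kernel."""
    return (np.dot(x, z) + c) ** degree

def mlp_kernel(x, z, kappa=1.0, theta=-1.0):
    """Multilayer-perceptron (sigmoid) kernel."""
    return np.tanh(kappa * np.dot(x, z) + theta)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for k in (gaussian_kernel, polynomial_kernel, mlp_kernel):
    print(k.__name__, k(x, z))
```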


(2) The influence of kernel parameters on classifier performance

The following study examines the relationship between classification accuracy and the kernel parameter of the Gaussian radial basis function RVM fault classifier, using the bearing dataset from Sect. 3.5.3.3. Figure 3.64 shows how the classification accuracy varies with the kernel width σ: graphs (a) and (b) show the training accuracy and the test accuracy, respectively, as functions of the Gaussian kernel width. Figure 3.64 shows that the kernel parameter has a significant impact on classification performance, so an appropriate value must be selected in practice.

(3) The effect of relevance vectors on classification performance

Figure 3.65 shows the number of relevance vectors as a function of the Gaussian kernel width σ. Combining Figs. 3.64 and 3.65, the smaller σ is, the more relevance vectors are used for classification, the more complex the computation, and the longer the training time. The total number of training samples is 150; when σ approaches zero, the number of relevance vectors approaches 60, accounting

Fig. 3.64 Relationship between classification accuracy and Gaussian kernel width

Fig. 3.65 Relationship between σ and the number of relevance vectors


for about one-third of the total samples. Although the classification accuracy is then 100%, the prediction cost is too high and the classifier overfits. The larger σ is, the smaller the number of relevance vectors and the lower the classification accuracy, and the classifier underfits. As the graph shows, when σ > 3 there is only one relevance vector and the classifier is severely underfitted. It is therefore necessary to select appropriate kernel parameters to ensure high classification accuracy with a small number of relevance vectors.

References

1. Zhu, X., Goldberg, A.B.: Introduction to Semi-supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers (2009)
2. Bian, Z.Q., Zhang, X., et al.: Pattern Recognition (in Chinese). Tsinghua University Press, Beijing (2000)
3. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
4. Wang, H.Z., Yu, J.S.: Study on the kernel-based methods and its model selection (in Chinese). J. Jiangnan Univ. 5(4), 500–504 (2006)
5. Smola, A.J.: Learning with Kernels. Technical University of Berlin, Berlin (1998)
6. Liao, G.L., Shi, T.L., Li, W.H.: Design and analysis of multiple joint robotic arm powered by ultrasonic motors (in Chinese). Vibr. Meas. Diag. 03, 182–185 (2005)
7. Zhong, Q.L., Cai, Z.X.: Semi-supervised learning algorithm based on SVM and by gradual approach (in Chinese). Comput. Eng. Appl. 25, 19–22 (2006)
8. Knorr, E.M., et al.: Algorithms for mining distance-based outliers in large datasets. In: Proceedings of Very Large Data Bases Conference, pp. 392–403 (1998)
9. Knorr, E.M., Ng, R.T.: Distance-based outliers: algorithms and applications. In: Proceedings of Very Large Data Bases Conference, vol. 8, pp. 237–253 (2000)
10. Breunig, M.M., Kriegel, H.P., et al.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, 15–18 May 2000, pp. 93–104
11. Ester, M., Kriegel, H.P., Sander, J., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, pp. 226–231. AAAI Press (1996)
12. He, Z., Xu, X., Deng, S.: Discovering cluster-based local outliers. Pattern Recogn. Lett. 24(9–10), 1642–1650 (2003)
13. Zhang, T., Oles, F.J.: A probability analysis on the value of unlabeled data for classification problems. In: Proceedings of the 17th International Conference on Machine Learning (ICML'00), San Francisco, 29 June–2 July 2000, pp. 1191–1198
14. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the 21st International Conference on Machine Learning, Banff, 4–8 July 2004, pp. 81–88
15. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 22–25 Aug 2004, pp. 59–68
16. Zhang, D.Q., Tan, K.R., Chen, S.C.: Semi-supervised kernel-based fuzzy c-means. In: Proceedings of the International Conferences on Neural Information Processing, Calcutta, 22–25 Nov 2004, pp. 1229–1234
17. David, M.J.T., Piotr, J., Elzbieta, P., et al.: Outlier detection using ball descriptions with adjustable metric. In: Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Hong Kong, 17–19 Aug 2006, pp. 587–595
18. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43(1), 59–69 (1982)
19. Backer, S.D., Naud, A., Scheunders, P.: Non-linear dimensionality reduction techniques for unsupervised feature extraction. Pattern Recogn. Lett. 19(1), 711–720 (1998)
20. Rubio, M., Gimnez, V.: New methods for self-organising map visual analysis. Neural Comput. Appl. 12(3–4), 142–152 (2003)
21. Shao, C., Huang, H.K.: A new data visualization algorithm based on SOM (in Chinese). J. Comput. Res. Dev. 43(3), 429–435 (2006)
22. Alhoniemi, E., Himberg, J., Parviainen, J., et al.: SOM Toolbox (Version 2.0) (1999). Available at http://www.cis.hut.fi/projects/somtoolbox/download/. Accessed 20 Oct 2008
23. Tipping, M.E.: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001)
24. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recogn. 34, 2067–2070 (2001)
25. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans. Neural Netw. 14(1), 117–126 (2003)
26. David, M.J.T., Robert, P.W.D.: Support vector data description. Mach. Learn. 54, 45–66 (2004)
27. Kressel, U.: Pairwise Classification and Support Vector Machines, pp. 255–268. MIT Press, Cambridge, MA (1999)
28. Kanaujia, A., Metaxas, D.: Learning Multi-category Classification in Bayesian Framework, pp. 255–264. CBIM, Rutgers University (2006)
29. Li, X.: Semi-supervised Fault Classification Method Based on Kernel Function Principal Component Analysis (in Chinese). South China University of Technology, Guangzhou (2007)

Chapter 4

Manifold Learning Based Intelligent Fault Diagnosis and Prognosis

4.1 Manifold Learning

A manifold is a generalization of Euclidean space: every point has a neighborhood homeomorphic to an open set of Euclidean space, so that it can be described by a local coordinate system. Intuitively, a manifold can be viewed as the result of gluing together pieces of Euclidean space; Euclidean space itself is a trivial special case of a manifold [1]. Continuously differentiable manifolds are usually studied in differential geometry, while in practical problems their properties are obtained by discrete approximation of the continuous case. For a given high-dimensional dataset, the data variables can often be represented by a small number of variables; geometrically, the data points are scattered on or near a low-dimensional smooth manifold. The core of manifold learning is to learn and discover low-dimensional smooth manifolds embedded in a high-dimensional space from a limited number of discretely observed samples, effectively revealing the intrinsic geometric structure of the data.

Manifold learning has received much attention in machine learning, pattern recognition, and data mining since 2000. In particular, three articles published in the same issue of Science in December 2000 investigated manifold learning from the perspectives of neuroscience and computer science and explored the relationship between neural systems and low-dimensional cognitive concepts embedded in high-dimensional data space [1–3], making manifold learning a research hot spot in machine learning and data mining.

The application of manifold learning in mechanical state identification focuses on three aspects: noise removal and weak impulse signal extraction, state identification, and state trend analysis.

(1) Noise removal and weak impulse signal extraction. In practical engineering, the collected vibration signals are inevitably disturbed by various noises because of the complexity of the mechanical system and the variability of the working environment. Effective noise reduction techniques help to improve diagnosis accuracy and reduce failures by detecting incipient faults of the machinery in time.


As one of the hot spots of machine learning and pattern recognition, manifold learning has been successfully applied in noise reduction and weak impulse signal extraction. Both traditional noise reduction methods and current manifold learning methods work on the time-domain vibration signal. The advantage of such noise reduction is that the fault generation mechanism can be studied thoroughly; however, the signal length often leads to low noise-reduction efficiency and large storage requirements, especially in mechanical fault diagnosis. To ensure adequate frequency-domain resolution, a large amount of time-domain data is needed, generally tens of thousands of points. In such cases, time-domain noise reduction is severely limited in computational efficiency and is not conducive to online monitoring.

(2) State identification. The various feature indicators used to describe the health state of mechanical systems are redundant, while traditional single indicators cannot completely describe the operating state of complex equipment. Therefore, fusing multidimensional feature indicators, eliminating redundant components between indicators, and extracting effective features for describing equipment health status have become key to manifold learning-based diagnostic methods.

(3) State trend analysis. Besides describing the state trend of the equipment with feature indicators, a state prediction model built on manifold learning can better estimate the time of failure and predict the remaining life of the equipment.

4.2 Spectral Clustering Manifold Based Fault Feature Selection

4.2.1 Spectral Clustering

4.2.1.1 Spectral Graph Theory

The mathematical essence of graph theory is the combination of combinatorial theory and set theory. A graph G consists of two sets: a non-empty set of nodes V and a finite set of edges E. Each edge e in G is assigned a weight W(e); G together with the weights on its edges is called a weighted graph [4]. The main approach of spectral graph theory is to establish and represent the topology of a graph through graph matrices (the adjacency matrix, the Laplacian matrix, the signless Laplacian matrix, etc.), in particular the connections between the invariants of the graph and the corresponding invariants of the graph matrices. Many methods for studying the Laplacian eigenvalues of graphs are borrowed from the study of graph eigenvalues themselves, or of eigenvalue ratios relating the Laplacian and the adjacency matrix. Because the definition of the Laplacian matrix incorporates the vertex degrees, the Laplacian eigenvalues better


reflect the graph-theoretic properties of the graph, so the study of Laplacian eigenvalues has received increasingly broad attention; the spectral clustering algorithm is likewise based on Laplacian eigenvalue decomposition.

4.2.1.2 Spectral Clustering Feature Extraction

The idea of the spectral clustering algorithm is derived from spectral graph partition theory [5]. Each data sample is regarded as a vertex V in the graph, and the edge E between vertices is assigned a weight W according to the similarity between samples, so an undirected weighted graph G = (V, E) based on sample similarity is obtained. In graph G, the clustering problem can be transformed into a graph partitioning problem, whose optimal division criterion is to maximize the similarity within the two subgraphs and minimize the similarity between them [6]. Following the spectral clustering algorithm, the feature extraction algorithm based on spectral clustering is as follows:

Algorithm 1 Feature extraction based on spectral clustering
Input: the graph similarity matrix W constructed from the original data.
Output: the low-dimensional dataset Y = {y1, y2, …, yN} after dimensionality reduction.
Steps:
(1) Calculate the Euclidean distance matrix S of the original data, where the squared Euclidean distance defines the dissimilarity of any two points:

s(i, j) = ‖xi − xj‖²   (4.1)

(2) Build a complete (undirected) graph G = (V, E) from the data matrix, where the nodes V correspond to the data and an edge E connects any two nodes. The weight matrix W of the edges represents the similarity between the data (σ is a control parameter):

w(i, j) = exp(−s(i, j)/(2σ²))   (4.2)

(3) Calculate the degree matrix, whose diagonal elements are the sums of the corresponding columns of the graph weights:

D(i, i) = Σj w(i, j)   (4.3)

(4) Construct the Laplacian-type matrix (the printed equation is garbled; the normalized form, standard when the largest eigenvalues are retained as in steps (6)-(7), is assumed here):

L = D^(−1/2) W D^(−1/2)   (4.4)


(5) Diagonalize the matrix L and calculate its eigenvalues and eigenvectors:

L = U Λ Uᵀ   (4.5)

where the column vectors of U are the eigenvectors of L, and Λ is the diagonal matrix of eigenvalues λ1, λ2, …, λn.

(6) Select the eigenvectors corresponding to the first r largest non-negative eigenvalues to form the transformation matrix, where the proportion of the first r largest non-negative eigenvalues in the sum of all p non-negative eigenvalues,

(Σ_{i=1}^{r} λi) / (Σ_{j=1}^{p} λj)   (4.6)

lies in the range 85~100%.

(7) The coordinates in the low-dimensional space are represented as

Y = U_r Λ_r^(1/2)   (4.7)

where U_r is the n × r matrix composed of the first r column eigenvectors, and Λ_r is the diagonal matrix of order r × r.
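A minimal Python sketch of Algorithm 1 follows (the book's own experiments use Matlab). It is an illustration under stated assumptions: the garbled Eq. (4.4) is taken to be the normalized matrix L = D^(−1/2) W D^(−1/2), and the function name and the 85% energy threshold default are choices for this sketch.

```python
import numpy as np

def spectral_feature_extraction(X, sigma=1.0, energy=0.85):
    """Sketch of Algorithm 1: spectral clustering feature extraction."""
    # (1)-(2) squared Euclidean distances and Gaussian similarities
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # s(i, j), Eq. (4.1)
    W = np.exp(-sq / (2 * sigma**2))                            # Eq. (4.2)
    # (3) degree matrix
    d = W.sum(axis=1)                                           # Eq. (4.3)
    # (4) normalized Laplacian-type matrix (assumed form of Eq. 4.4)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = Dinv @ W @ Dinv
    # (5) eigendecomposition, sorted by decreasing eigenvalue
    lam, U = np.linalg.eigh(L)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # (6) keep the first r non-negative eigenvalues covering `energy` (Eq. 4.6)
    lam = np.clip(lam, 0.0, None)
    r = int(np.searchsorted(np.cumsum(lam) / lam.sum(), energy)) + 1
    # (7) low-dimensional coordinates Y = U_r Lambda_r^(1/2)  (Eq. 4.7)
    return U[:, :r] * np.sqrt(lam[:r])

Y = spectral_feature_extraction(np.random.default_rng(0).normal(size=(40, 11)), sigma=0.5)
print(Y.shape)
```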

4.2.2 Spectral Clustering Based Feature Selection

4.2.2.1 Incipient Fault Feature Selection

In mechanical fault diagnosis, it is crucial to select features that effectively reflect the fault information. The common approach is to use time- and frequency-domain feature indicators as the original input to pattern recognition methods. Common transmission fault features show a certain regularity, and grasping this regularity is very important for analyzing and extracting the corresponding fault features. When the bearings and gears in the transmission run normally, the vibration signal is generally smooth, and its frequency components include the rotation frequency of each bearing, the meshing frequency of the gears, and so on. When a fault occurs, the frequency components or amplitudes of the vibration signal change, so constructing indicators that reflect these changes is an effective way to perform fault classification. In addition, the various existing feature indicators, especially in the time domain, usually contain redundant and useless information; selecting effective indicators is important for reducing the input dimension and improving the correct classification rate.

1. Feature indicators in the time domain

The 11 commonly used time-domain waveform features can be divided into two groups: dimensional and dimensionless.


Dimensional indicators: mean square value xa (1), kurtosis xq (2), mean x̄ (3), variance σ² (4), skewness xs (5), peak xp (6), root mean square amplitude xr (7).

Dimensionless indicators: shape indicator K (8), crest indicator C (9), impulse indicator I (10), clearance indicator L (11).

When dimensional statistical feature values are used for amplitude analysis, the results are related not only to the state of the electromechanical equipment but also to the operating parameters of the machine (such as speed and load). Dimensional feature values measured on transmissions of different types and sizes are not comparable, and sometimes even measurements on transmissions of the same type and size cannot be compared directly. Therefore, when conducting equipment fault diagnosis, the consistency of operating parameters and measurements must be ensured. The dimensionless parameters, by contrast, are related mainly to the state of the machine and are largely independent of its operating conditions. A dimensionless indicator is not affected by the absolute level of the vibration signal and is independent of the sensitivity of the vibration detector, the amplifier, and the amplification of the whole test system, so the system does not need to be calibrated, and no measurement error arises even if the sensitivity of the sensor or amplifier changes. However, the sensitivity of these indicators to fault signals differs. When the fault signal appears as surge-type vibration, the crest indicator, impulse indicator, and clearance indicator are more sensitive to the surge-type fault than the root mean square value, and they decrease as the degree of failure grows significantly, which indicates that these three indicators are most sensitive to incipient faults. A code sketch of these indicators is given after the list below.

2. Frequency energy factor

The vibration of a gearbox is generally composed of the following frequency components [7]:

(1) The rotation frequency of each bearing and its higher harmonics.
(2) The gear mesh frequency (GMF) and its higher harmonics.
(3) Side-band frequencies generated by modulation of the GMF, with the GMF and its higher harmonics as the carrier frequency and the rotation frequency of the bearing carrying the gear and its higher harmonics as the modulation frequency.
(4) Side-band frequencies generated by resonant modulation of the gear, with the natural frequency of the gear as the carrier frequency and the rotation frequency of the bearing carrying the gear and its higher harmonics as the modulation frequency.
(5) Side-band frequencies generated by housing resonance modulation, with the natural frequency of the gearbox housing as the carrier frequency and the rotation frequency of the bearing carrying the gear and its higher harmonics as the modulation frequency.
(6) Modulation sidebands with a natural frequency as the carrier frequency and the rolling bearing pass frequency as the modulation frequency.
(7) Implicit components.
(8) Cross-modulation components.
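The promised sketch of the 11 time-domain indicators follows. Textbook definitions of the dimensionless indicators vary slightly, so the formulas below are common conventions, not necessarily the exact ones used in this book.

```python
import numpy as np
from scipy import stats

def time_domain_indicators(x):
    """The 11 time-domain indicators listed above (common definitions)."""
    x = np.asarray(x, float)
    rms = np.sqrt(np.mean(x**2))
    xr = np.mean(np.sqrt(np.abs(x))) ** 2   # root mean square amplitude xr
    peak = np.max(np.abs(x))
    mean_abs = np.mean(np.abs(x))
    return {
        "mean_square": np.mean(x**2),       # (1) xa
        "kurtosis": stats.kurtosis(x),      # (2) xq
        "mean": np.mean(x),                 # (3)
        "variance": np.var(x),              # (4)
        "skewness": stats.skew(x),          # (5) xs
        "peak": peak,                       # (6) xp
        "rms_amplitude": xr,                # (7) xr
        "shape": rms / mean_abs,            # (8) K
        "crest": peak / rms,                # (9) C
        "impulse": peak / mean_abs,         # (10) I
        "clearance": peak / xr,             # (11) L
    }

print(time_domain_indicators(np.random.default_rng(0).normal(size=2048)))
```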


As shown in Fig. 4.1, the modulation form exhibited by a gear fault depends mainly on the excitation energy, and different fault degrees show different modulation forms. For a slight fault, such as slight shaft bending or a small area of tooth-surface pitting, meshing-frequency modulation usually occurs, with the meshing frequency acting as the carrier. When the fault is more serious and the excitation energy larger, the natural frequency of the gear itself is excited, producing resonance modulation with the gear's natural frequency as the carrier frequency. When the excitation energy is very large and the fault very serious, the natural frequency of the gearbox housing is excited, producing housing natural-frequency modulation. For incipient gear faults, an indicator called the energy factor is constructed, which reflects the difference in modulation energy at the incipient fault stage. The specific approach is as follows:

(1) Perform a fast Fourier transform of the original signal to obtain the FFT spectrum.
(2) Calculate the gear GMF and its multiples n·fz, n = 1, 2, 3, ….
(3) Find the spectral line number km of the frequency fm (m = 1, 2, 3, …) nearest to the mesh frequency and each of its multiples.
(4) Define

Δm = (Σ_{i = k(m−1)+1}^{k(m+1)−1} Ai) / (Σj Aj), m = 2, 3, 4, …; when m = 1: Δ1 = (Σ_{i = 1}^{k(m+1)−1} Ai) / (Σj Aj)   (4.8)
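A minimal sketch of Eq. (4.8) in Python follows. The function name, the toy signal, and the choice of four harmonics are assumptions for illustration; the band boundaries follow the spectral line numbers km defined in step (3).

```python
import numpy as np

def energy_factors(x, fs, f_mesh, n_harmonics=4):
    """Energy factors per Eq. (4.8): for each mesh-frequency harmonic,
    sum spectral amplitudes between the neighbouring harmonics and
    normalize by the total spectral amplitude."""
    A = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    # spectral line numbers k_m nearest to each harmonic of the mesh frequency
    k = [int(np.argmin(np.abs(freqs - m * f_mesh))) for m in range(1, n_harmonics + 2)]
    total = A.sum()
    deltas = []
    for m in range(1, n_harmonics + 1):
        lo = 1 if m == 1 else k[m - 2] + 1   # from k_{m-1}+1 (or line 1 for m = 1)
        hi = k[m] - 1                        # ... up to k_{m+1}-1
        deltas.append(A[lo:hi + 1].sum() / total)
    return deltas

# Toy example: a 304 Hz mesh frequency (the Laborelec gear case below)
# amplitude-modulated at 10 Hz, sampled at 40 kHz.
fs, t = 40000, np.arange(2048) / 40000
x = np.sin(2 * np.pi * 304 * t) * (1 + 0.5 * np.sin(2 * np.pi * 10 * t))
print(energy_factors(x, fs, 304.0))
```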

3. Case study

In this section, the gear failure experimental data from the Laborelec laboratory are used, covering three fault types: gear tooth-face fatigue pitting, slight tooth-face spalling, and severe tooth-face spalling.

Fig. 4.1 Fault modulation of gear


37 sets of each fault type were selected; the gear parameters and experimental conditions are described in Sect. 3.4.4.

(1) The gear-state original feature set S is composed of the 11 time-domain statistical feature parameters above and is used to describe the gear fault type.
(2) The calculated gear meshing frequency is 304 Hz, and Δ1, Δ2, Δ3, and Δ4 are calculated according to Eq. (4.8). The original feature set S of the gear state is composed of these four feature parameters and is used to describe the gear fault type.

In each of the two cases (1) and (2), 91 labeled samples and 20 unlabeled samples are randomly selected from the 111 samples. A support vector machine (SVM) with C = 100 and σ = 0.85 is used to train the classifier and predict the type of the unlabeled samples; the results are shown in Fig. 4.2.

Fig. 4.2 Classification results of different feature indicators


From Fig. 4.2a, samples 7 and 88 are misclassified and the accuracy is 90%. In Fig. 4.2b, the accuracy rises to 100%, which demonstrates the effectiveness of the proposed energy factor indicator.
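The train-on-labeled / predict-unlabeled protocol above maps directly onto a standard SVM workflow. The sketch below uses scikit-learn with the stated C = 100 and σ = 0.85 (assuming the common convention gamma = 1/(2σ²)); the synthetic three-class data are a hypothetical stand-in for the 111 Laborelec samples.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the 111 gear samples (3 classes, 4 energy factors).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (37, 4)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 37)
idx = rng.permutation(111)
train, test = idx[:91], idx[91:]  # 91 labeled, 20 treated as unlabeled

clf = SVC(C=100, kernel="rbf", gamma=1.0 / (2 * 0.85**2)).fit(X[train], y[train])
print("accuracy on the 20 held-out samples:", clf.score(X[test], y[test]))
```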

4.2.2.2 PCA-Based Feature Selection

In pattern recognition, some input feature indicators cannot reflect the machine's operating state; they not only add redundant information for detection and classification but also lengthen computation, so feature selection is necessary. The normal and slightly spalled signals from the Laborelec laboratory's gear fault experimental data were selected to validate the method. PCA was adopted to select the features with a higher impact on the prediction results. The programming environment was Matlab 7.6.0 on a Pentium dual-core CPU (2.50 GHz) with 2 GB of memory.

Table 4.1 lists the contribution rates and cumulative contribution rates of the 11 features. The accumulated contribution rate of the first three principal components reaches 90.52%. Only the first three principal components were analyzed, because an 85% accumulated contribution rate already represents most of the information contained in the original variables. Figure 4.3 shows the contribution of each feature indicator to the first three principal components, and Fig. 4.4 gives the extraction rates of the first three principal components acting on each feature indicator. The feature indicators with extraction rates greater than 80% (mean square value xa (1), kurtosis xq (2), mean x̄ (3), variance σ² (4), peak xp (5), root mean square xr (6), shape indicator K (7), crest indicator C (8), impulse indicator I (9), clearance indicator L (10)) were selected for classification learning, and the sample set became 74 × 9-D.

Table 4.1 Feature values and contribution rates of the principal components

Principal component  Feature value λ  Contribution rate C (%)  Accumulated contribution AC (%)
1                    6.0382           54.89                    54.89
2                    2.9032           26.39                    81.28
3                    1.0161           9.24                     90.52
4                    0.5679           5.16                     95.68
5                    0.4251           3.87                     99.55
6                    0.0418           0.38                     99.93
7                    0.0056           0.05                     99.98
8                    0.0009           0.02                     100
9                    0.0004           0                        100
10                   0.0003           0                        100
11                   4.2285e−005      0                        100
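The contribution-rate and accumulated-contribution columns of Table 4.1 correspond to PCA's explained variance ratios. A hedged sketch (synthetic stand-in data, sklearn instead of the book's Matlab environment) of how such a table and an 85% cutoff would be produced:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 74 x 11 feature matrix standing in for the gear data.
X = np.random.default_rng(0).normal(size=(74, 11))

pca = PCA().fit(StandardScaler().fit_transform(X))
contrib = pca.explained_variance_ratio_                # contribution rate C
cumulative = np.cumsum(contrib)                        # accumulated contribution AC
n_keep = int(np.searchsorted(cumulative, 0.85)) + 1    # smallest k with AC >= 85%
print(np.round(100 * contrib, 2), n_keep)
```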


44 labeled samples and 30 unlabeled samples were randomly selected from the 74 × 9-D sample set. Similarly, SVM was used to train the classifier and predict the type of the unlabeled samples, with C = 100 and σ = 0.5; the classification results are shown in Fig. 4.5. From Fig. 4.5a, samples 27, 39, 66, and 73 are misclassified with 11-dimensional input and the accuracy is 86.66%. After running the PCA algorithm, with only 9-dimensional input, the accuracy rises to 90% and only samples 40, 50, and 57 are misclassified. In terms of training and prediction time, the 11-dimensional input takes 0.029924 s, while the 9-dimensional input takes only 0.025856 s.

Fig. 4.3 Contribution of the first three principal components on features

Fig. 4.4 Extraction rates of the first three principal components on the feature indicators


Fig. 4.5 Prediction results of different dimensional features

4.2.2.3 Feature Extraction Based on Density-Adjustable Spectral Clustering

1. Similarity metric

As Fig. 4.6 shows, distance-based methods commonly place samples a and b in one category and samples a and c in different categories, because the distance between a and b is smaller than that between a and c. To obtain the correct classification, a similarity metric is needed under which the distance between a and c becomes smaller than that between a and b. The density-based clustering assumption [8] is therefore defined as enlarging the length of paths that cross


Fig. 4.6 Space distribution based on density clustering assumptions

the low-density region while shortening the length of paths that do not cross it. By calculating the dissimilarity between each pair of nodes based on a density-sensitive distance, the original data are transformed into the pairwise dissimilarity space, and the corresponding dissimilarity matrix is obtained. Dimensionality reduction is then achieved by computing the eigenvalues of the data in the lower-dimensional space. Such a density-adjustable distance is defined in Ref. [9].

Definition 1

l(x, y) = ρ^dist(x, y) − 1   (4.9)

where dist(x, y) denotes the Euclidean distance between x and y, and ρ > 1 is the density-adjustable factor. Such a distance satisfies the clustering hypothesis and can describe the global consistency of the clusters, as demonstrated in Ref. [9]. The factor ρ can be adjusted to enlarge or shorten the distance between two points. As shown in Fig. 4.7, the straight-line distance between points a and c is L, the paths from a to c along the data distribution are l1, l2, …, lm, and the edges traversed by the ith path li are li1, li2, …, lin. Obviously, li1 + li2 + … + lin ≥ L. However, after introducing the scaling factor ρ, there exists a ρ such that ρ^li1 + ρ^li2 + … + ρ^lin − n < ρ^L − 1. In Fig. 4.7, the distance between a and c is 8 and ab + bc = 9 > ac; after setting ρ = 2, the corresponding distances become the values shown in the boxes of the figure, i.e., ab + bc = 70 < ac = 255, so the edge ac is assigned a weight of 70 when the graph is built. The similarity metric is

s0(i, j) = 1 / (dsp(l(i, j)) + 1)   (4.10)


Fig. 4.7 Shortest path when ρ=2

where dsp(l(i, j)) is the shortest-path distance between i and j after density-factor adjustment. The density-based clustering hypothesis is realized by finding the shortest density-adjusted path distance between any two points in the graph and assigning weights accordingly. The shortest paths can be calculated by Dijkstra's algorithm, the Floyd–Warshall algorithm, and so on. All results in this case are obtained with Johnson's algorithm in Matlab 7.6.0, which combines the Bellman–Ford algorithm, reweighting, and Dijkstra's algorithm.

2. Feature extraction based on density-adjustable spectral clustering

After introducing the density-adjustable factor, a feature extraction algorithm based on density-adjustable spectral clustering is proposed, which shortens the distance within the same category and increases the distance between different categories after feature extraction.

Algorithm 2 Feature extraction algorithm based on density-adjustable spectral clustering
Input: the graph similarity matrix W of the original data and the density-adjustable factor ρ.
Output: the low-dimensional dataset Y = {y1, y2, …, yN} after dimensionality reduction.
Steps:
(1) Repeat step (1) of Algorithm 1.
(2) Calculate the density-adjusted similarity matrix S0 using ρ (Eqs. 4.9 and 4.10).
(3) Construct a complete (undirected) graph G = (V, E) from the data matrix, where V is the set of nodes and E the edges connecting any two nodes. The edge weights represent the similarity between the data (σ is a control parameter):

w(i, j) = exp(−s0(i, j)/(2σ²))   (4.11)

(4) Repeat steps (3)–(7) of Algorithm 1.

205

3. Case study The two circles dataset in Ref. [10], three spirals and toy data in Ref. [6], and the recognized Fisher iris data in UCI are selected as the artificial dataset. The principal component feature extraction, spectral clustering feature extraction, and densityadjustable spectral clustering feature extraction methods are used to extract features for them respectively. From Fig. 4.8, it can be seen that the feature extraction based on spectral clustering is better than the principal component method, which can effectively distinguish the categories for two circles, three spirals, and toy data graphs with manifold structures, while the principal component method is almost ineffective for such structures. After adding the density-adjustable factor, the feature extraction method based on densityadjustable spectral clustering increases the distance between different categories and shortens the distance between the same categories. For the Fisher iris dataset, it also shows that the feature extraction effect based on spectral clustering and densityadjustable spectral clustering is superior to the principal component method from the first three dimensions. However, the feature distribution of the three feature extraction methods on the iris dataset are similar, because the first three dimensions of the dataset after feature extraction do not completely reflect the feature discrepancy, therefore, the dimension of the graph after dimensionality reduction is still greater than 3. The spectral clustering feature extraction method introduces a Gaussian kernel, so it has a good effect on the manifold structure. Besides the parameter ρ, parameter σ also be introduced to the density-adjustable spectral clustering algorithm. Taking three spirals dataset as an example, Fig. 4.9a provides the spectral clustering feature when σ = 0.1, σ = 0.2, and σ = 0.3, respectively. It can be seen that σ is more sensitive to the feature extraction results, and the feature discrepancy of σ = 0.1 is significantly better than σ = 0.2 and σ = 0.3. It is verified that the method can extract better features when σ is the interval of [0.04, 0.12]. Figure 4.9b shows that the feature distribution is based on the density-adjustable spectral clustering method with ρ = 20. When σ = 0.2 and σ = 0.3, the feature distribution discrepancies are inferior to σ = 0.1, but they can also correctly distinguish the three categories. It is verified that the method can extract better features when σ is the interval of [0.09, 0.3]. Compared to the spectral clustering feature extraction method, the parameter range is large in this method. The feature discrepancy also changes obviously when ρ is changed, as shown in Fig. 4.9c, and the results for ρ = 5 and ρ = 15 are better than when ρ = 10.


Fig. 4.8 Feature distribution by different feature extraction methods


Fig. 4.9 Feature distribution on different parameters: a spectral clustering feature distribution for different σ (0.1, 0.2, 0.3); b density-adjustable spectral clustering feature distribution for different σ (ρ = 20); c density-adjustable spectral clustering feature distribution for different ρ (5, 10, 15; σ = 0.5)

4.2.3 DSTSVM Based Feature Extraction

4.2.3.1 Density-Adjustable Spectral Clustering-Based DSTSVM

1. DSTSVM
The spectral clustering algorithm, based on spectral graph theory, can obtain the global optimal solution. By introducing the density-adjustable factor and a similarity measure based on the minimum path, the density-adjustable spectral clustering method


shortens the distance within the same category and expands the distance between different categories, which adequately reflects the data structure. Semi-supervised SVM [11] incorporates information from unlabeled samples and has few-shot and nonlinear capabilities. Based on these theories, the density-adjustable spectral clustering and semi-supervised SVM (DSTSVM) method was proposed. In this method, density-adjustable spectral clustering is used to extract features that serve as the input of the semi-supervised SVM. The kernel function is a Gaussian kernel, and the classification results are obtained after training by the gradient descent method.
Algorithm 3 Density-adjustable spectral clustering and semi-supervised SVM
Input:
(1) Data: m × n dimensional raw data, including both labeled and unlabeled data.
(2) Parameters: density-adjustable factor ρ, penalty parameter C, and Gaussian kernel width σ.
Steps:
(1) Extract features using Algorithm 2, and derive the kernel function of the semi-supervised SVM.
(2) Train the semi-supervised SVM on y1, y2, ..., ym using the gradient descent method proposed by Olivier Chapelle [12].
2. Case study
To verify the effectiveness of DSTSVM, a simulation experiment was conducted on the Fisher iris dataset. Fifty labeled samples were randomly selected from the 150 samples, and the sampling was repeated 10 times. The two experimental settings were as follows.
(1) The 50 labeled samples selected each time are fed into the SVM to train the model and predict the labels of the remaining 100 unlabeled samples. The output is the average accuracy of the 10 predictions.
(2) The 50 labeled samples and the unlabeled samples selected each time are fed together into the transductive semi-supervised SVM (TSVM), the cluster kernel semi-supervised support vector machine based on spectral clustering (CKSVM), and DSTSVM for co-training, predicting the labels of the 100 unlabeled samples. The output is the average accuracy of the 10 predictions.
In the above experiments, C = 100, and the Gaussian kernel width and ρ in DSTSVM were set to their best values. The specific results are listed in Table 4.2, which shows that TSVM has the lowest accuracy, CKSVM performs better than TSVM, and DSTSVM achieves the best classification results among the semi-supervised methods, similar to the supervised SVM method.


Table 4.2 Accuracy of different methods on the iris dataset

Method    Parameter         Accuracy CA (%)
SVM       σ = 1.05          94.4
TSVM      σ = 1.05          81.5
CKSVM     σ = 1.2           93.3
DSTSVM    ρ = 2, σ = 1.3    94.6
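As a concrete illustration of experimental setting (1) above, the supervised baseline can be reproduced with scikit-learn as sketched below; TSVM, CKSVM, and DSTSVM are not library models, so only the resampling protocol and the SVM baseline are shown. Note that scikit-learn parameterizes the Gaussian kernel as gamma = 1/(2σ²) for the kernel width σ reported in Table 4.2.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
sigma = 1.05                                  # kernel width from Table 4.2
scores = []
for _ in range(10):                           # 10 random resamplings
    idx = rng.permutation(len(y))
    labeled, unlabeled = idx[:50], idx[50:]   # 50 labeled, 100 "unlabeled"
    clf = SVC(kernel='rbf', C=100, gamma=1.0 / (2 * sigma ** 2))
    clf.fit(X[labeled], y[labeled])
    scores.append(clf.score(X[unlabeled], y[unlabeled]))
print(f"average accuracy over 10 splits: {np.mean(scores):.3f}")
```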

3. Parameter optimization
(1) Impact of parameters
The role of C is to adjust the range of the confidence interval. The kernel parameter σ implicitly changes the mapping function, which in turn changes the subspace distribution complexity of the sample data, i.e., the maximum VC dimension of the linear classification interface. The parameter ρ changes the distribution of the data after feature extraction. To some extent, these parameters all affect the classification results.
For the iris dataset with σ = 0.7 and ρ = 2, the error rate of the corresponding 10 predictions for different C is shown in Fig. 4.10. From Fig. 4.10, we can see that the error rate is high when C is very small and decreases sharply as C increases. The error rate converges after C increases to a certain value, such as C = 10 in Fig. 4.10; beyond this point the performance of DSTSVM is not affected by C. However, when C is greater than about 3000, the error rate increases again.
Setting C = 100 and ρ = 2, the error rate of the corresponding 10 predictions for different σ is shown in Fig. 4.11. It can be seen that the error rate first decreases and then increases as σ increases, and good values are achieved in the interval [0.4, 1.9].
Setting C = 100 and σ = 0.7, the error rate of the corresponding 10 predictions for different ρ is shown in Fig. 4.12. It can be seen that good results are achieved for ρ within 35, and the error rate is lowest at ρ = 5.6. As ρ increases further, the error rate also increases.
(2) Parameter optimization

Fig. 4.10 Error rate of classification on different C


Fig. 4.11 Error rate of classification on different σ

Fig. 4.12 Error rate of classification on different ρ

Based on the above analysis, the performance of the classifier remains stable as long as C exceeds a certain threshold. The classification results are relatively sensitive to σ and ρ, especially σ, while ρ can be seen as a fine-tuning of the classification results, improving the accuracy at certain values. Therefore, the optimal performance (lowest error rate) of DSTSVM is achieved by selecting the best combination of parameters. The classification results of different combinations are shown in Fig. 4.13, where σ ∈ [0.4, 1.9] with a step length of 0.05 and ρ ∈ [2, 35] with a step length of 1. The maximum average classification accuracy over 10 runs is 94.6%, obtained when (ρ, σ) is (2, 1.3). From Fig. 4.13, the average accuracy does not fluctuate much in the middle region, i.e., σ ∈ [0.4, 1.9] and ρ ∈ [2, 33]. It can be concluded that the performance of DSTSVM remains stable as long as the parameters lie in a reasonable range, which provides a guideline for parameter selection in later work.
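The grid sweep behind Fig. 4.13 can be organized as below; this is only a skeleton, and evaluate(rho, sigma) is a hypothetical callable that would run the 10-repetition DSTSVM experiment and return the average accuracy.

```python
import numpy as np

def grid_search(evaluate):
    """Sweep (rho, sigma) over the grid used in Fig. 4.13;
    `evaluate` is a hypothetical experiment runner."""
    sigmas = np.arange(0.4, 1.9 + 1e-9, 0.05)   # step length 0.05
    rhos = np.arange(2, 36)                      # rho in [2, 35], step length 1
    acc = np.array([[evaluate(r, s) for s in sigmas] for r in rhos])
    i, j = np.unravel_index(acc.argmax(), acc.shape)
    return rhos[i], sigmas[j], acc[i, j]         # best combination and accuracy
```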

4.2.3.2 Fault Diagnosis Model and Case Study

The essence of fault diagnosis is pattern recognition. A basic pattern recognition system consists of four main parts: data acquisition, pre-processing, feature extraction and selection, and classification decision.


Fig. 4.13 The effect of parameter combination (ρ, σ) on classification accuracy

The fault features of mechanical systems, especially incipient gear faults, are often submerged in noise [7]. It is difficult to extract useful features using traditional signal processing methods, and it is challenging to collect extensive labeled samples in practical application scenarios. Therefore, it is necessary to construct a DSTSVM-based fault diagnosis model.
As shown in Fig. 4.14, the fault diagnosis model first simulates various typical abnormal or fault types of the equipment and obtains vibration and speed signals through various sensors. After calculating the feature indexes and normalizing the data, a fault knowledge library is created. When an actual fault occurs on mechanical equipment, the extracted fault features and the established fault types are used to co-train the model, and the output is the fault type of the mechanical equipment. The final fault type of the device can then be obtained by a comprehensive decision, and the related interventions can be implemented. In addition, new data or fault types obtained from the diagnosis can be added to the fault knowledge library, so that the fault types available for training the DSTSVM model increase and the model becomes more consistent with the actual situation. The prediction or classification results obtained by this model become more and more accurate, and such a circular process of training, identification, and retraining can also be used for online condition monitoring of mechanical equipment.
To verify the effectiveness of the DSTSVM-based fault diagnosis model, a gear fault experiment was conducted in the Laborelec laboratory, as detailed in Sect. 4.2.2; features were selected by the PCA method. Nine feature indicators (mean square value, kurtosis, mean, variance, peak, root mean square, crest indicator, impulse indicator, and clearance indicator) were used for classification learning. From the 74 × 9-D samples, 40 labeled samples and 34 unlabeled samples were randomly selected, and the sampling was repeated 10 times.


Fig. 4.14 DSTSVM-based fault diagnosis model

After extracting features from all samples using the density-adjustable spectral clustering method, the labeled and unlabeled samples were input together into the transductive SVM for co-training and predicting the labels of the 34 unlabeled samples. The penalty factor was set to 100.
To evaluate and validate the performance of the proposed method, fivefold cross-validation (5-CV) was used. The original data were divided into 5 groups; each subset was used in turn as the validation set with the remaining 4 groups as training sets, giving 5 learning models. With 10 repetitions of 5-CV, the average classification accuracy of the 5 learning models was used as the classification result.
Table 4.3 reports the comparison between the proposed method and other methods. It can be seen that CKSVM and the supervised SVM method achieve similar results; the TSVM method has the lowest accuracy among the three semi-supervised methods, comparable to inputting features directly into the model without feature extraction, which demonstrates the importance of feature extraction in incipient fault detection. Among these methods, DSTSVM has the highest accuracy, which verifies the effectiveness of DSTSVM for incipient fault detection.

Table 4.3 Accuracy of different methods

Method    Parameter         Accuracy DA (%)
SVM       σ = 0.50          88.82
TSVM      σ = 0.5           88.23
CKSVM     σ = 1.15          90.47
DSTSVM    ρ = 2, σ = 1.6    92.94
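The "10 times of 5-CV" protocol maps directly onto scikit-learn's repeated stratified K-fold splitter. A hedged sketch follows, with an RBF SVC standing in for the DSTSVM classifier (which is not a library model) and the iris data standing in for the 74 × 9-D gear samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)             # stand-in for the gear feature set

# 10 repetitions of 5-fold cross-validation, as described above.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
model = SVC(kernel='rbf', C=100)              # stand-in for DSTSVM
scores = cross_val_score(model, X, y, cv=cv)
print(f"average accuracy over 10 x 5 folds: {scores.mean():.4f}")
```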


4.2.4 Machinery Incipient Fault Diagnosis

4.2.4.1 Incipient Fault Detection and Classification for Transmission Gear

1. Data acquisition
The experimental system structure and the transmission testing platform are detailed in Sect. 3.2. The normal gear and the various fault types used in this section were derived from the fifth gear, with a transmission ratio of 0.77 and a sampling frequency of 40,000 Hz. The vibration acceleration signal in the X-direction was taken at position 3 of the transmission input, and the transmission was operated under three conditions: normal, gear incipient pitting, and gear incipient spalling. Table 4.4 lists the 27 modes covering the different faults, torques, and speeds. Table 4.5 lists the characteristic frequencies of the gearbox at different speeds.
2. Incipient fault detection based on density-adjustable spectral clustering and semi-supervised SVM
(1) Incipient fault detection under the same working condition
The three fault types cannot be completely distinguished in the time and frequency domains under the operating condition of 800 r/min at the drive end and 75 N m at the output end. To verify the effectiveness of the DSTSVM method, gear incipient pitting and incipient spalling faults under this operating condition were detected separately. The vibration acceleration signals in the normal state and the two fault states (i.e., modes 9, 18, and 27 in Table 4.4) were collected, and 1024 × 90 sampling points were taken for each state. The gear original feature set S consisted of 11 statistical features: mean square value, kurtosis, mean, variance, skewness, peak, root mean square, shape indicator, crest indicator, impulse indicator, and clearance indicator. In total, there were 270 11-D samples over the three states, with 90 samples per state, where each sample consisted of 1024 points. A sample dataset was thus constructed for gear fault diagnosis. The data were grouped into two subsets: one containing samples of the normal state and the incipient pitting fault state, and the other containing samples of the normal state and the incipient spalling fault state. Each subset contained 180 11-D samples.


Table 4.4 Different modes of gears

Modes   Fault type                Speed at drive end (r/min)   Torque at output end (N m)
1       Normal                    600                          50
2       Normal                    600                          75
3       Normal                    600                          100
4       Normal                    800                          50
5       Normal                    800                          75
6       Normal                    800                          100
7       Normal                    1000                         50
8       Normal                    1000                         75
9       Normal                    1000                         100
10      Gear incipient pitting    600                          50
11      Gear incipient pitting    600                          75
12      Gear incipient pitting    600                          100
13      Gear incipient pitting    800                          50
14      Gear incipient pitting    800                          75
15      Gear incipient pitting    800                          100
16      Gear incipient pitting    1000                         50
17      Gear incipient pitting    1000                         75
18      Gear incipient pitting    1000                         100
19      Gear incipient spalling   600                          50
20      Gear incipient spalling   600                          75
21      Gear incipient spalling   600                          100
22      Gear incipient spalling   800                          50
23      Gear incipient spalling   800                          75
24      Gear incipient spalling   800                          100
25      Gear incipient spalling   1000                         50
26      Gear incipient spalling   1000                         75
27      Gear incipient spalling   1000                         100

Table 4.5 The characteristic frequency of gear

Rotational speed at the drive end (rpm)       600       800       1000
Rotation frequency of input shaft (Hz)        10.00     13.33     16.67
Rotation frequency of middle shaft (Hz)       6.84      9.12      11.40
Rotation frequency of output shaft (Hz)       13.06     17.42     21.77
Mesh frequency of fifth-shifting gear (Hz)    287.37    383.16    478.95


Fig. 4.15 PCA results for gear incipient pitting

For gear incipient pitting fault detection, the PCA algorithm was used to select important features from the 11-D original features. Figure 4.15a provides the contribution rate and the accumulated contribution of the first three principal components; the accumulated contribution exceeds 90%, so the first three principal components can represent the 11-D original features. The extraction rates of the 11-D original features by the first three principal components are shown in Fig. 4.15b; 9 feature indicators (mean square value, kurtosis, variance, skewness, peak, root mean square, crest indicator, impulse indicator, and clearance indicator) have extraction rates above 80%, so they are selected as important features, and the dataset becomes 180 × 9-D. From the 180 × 9-D samples, 30 labeled samples and 150 unlabeled samples were randomly selected, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training to predict the labels of the 150 unlabeled samples after 5-CV.
The same experimental procedure was repeated for gear incipient spalling detection. Figure 4.16a provides the corresponding contribution rate and accumulated contribution. It can be seen from Fig. 4.16b that the extraction rates of all features except the shape indicator exceed 80%, so the dataset becomes 180 × 10-D. From the 180 × 10-D samples, 30 labeled samples and 150 unlabeled samples were randomly selected, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training to predict the labels of the 150 unlabeled samples after 5-CV.
Table 4.6 provides the detection results of the different methods, with C = 100 for all classifiers. From Table 4.6, the detection accuracy of all methods for the incipient spalling fault is higher than for the incipient pitting fault, which is consistent with the above analysis. Thanks to the SVM, both groups of experiments obtained good detection results with a small number of labeled samples. By introducing unlabeled samples for co-training, the semi-supervised methods (TSVM, CKSVM, and DSTSVM) achieve higher accuracy than the supervised SVM, which demonstrates that effectively using the information in unlabeled samples can improve the detection accuracy. Furthermore, the DSTSVM method shows the highest accuracy in both experiments, which indicates that the sample distribution structure is more discriminative after calculating the sample similarity with the density-adjustable factor, further improving the detection accuracy.
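The PCA-based selection described above can be sketched as follows. The contribution rate is the explained variance ratio of each component; the per-feature "extraction rate" is assumed here to be the share of a standardized feature's variance captured by the first three components (the book's exact formula is not restated in this excerpt).

```python
import numpy as np

def pca_feature_selection(X, n_pc=3, thresh=0.80):
    """Contribution rates of the PCs and assumed per-feature extraction rates."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize features
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(vals)[::-1]                     # descending eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    contrib = vals / vals.sum()                        # contribution rate per PC
    # Assumed extraction rate of feature f: sum_k lambda_k * loading_{fk}^2,
    # i.e. the variance of the (unit-variance) feature captured by n_pc PCs.
    extraction = (vecs[:, :n_pc] ** 2 * vals[:n_pc]).sum(axis=1)
    selected = np.where(extraction > thresh)[0]        # features above 80%
    return contrib[:n_pc].cumsum(), extraction, selected
```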


Fig. 4.16 PCA results for gear incipient spalling

Table 4.6 Detection results for 11 statistical features in time domain

          Gear incipient pitting detection      Gear incipient spalling detection
Method    Parameter          Accuracy (%)       Optimal parameter    Accuracy (%)
SVM       σ = 0.50           84.20              σ = 0.50             97.13
TSVM      σ = 0.50           86.66              σ = 0.50             100
CKSVM     σ = 1.55           87.77              σ = 1.5              100
DSTSVM    ρ = 2, σ = 1.10    88.27              ρ = 2, σ = 1.85      100

To improve the detection accuracy of the incipient pitting fault, the frequency-domain energy factor proposed in Sect. 4.2 was used as the model input. The energy factor was calculated from the fifth-gear mesh frequency and its frequency multipliers (f1 = 383 Hz, f2 = 383 × 2 Hz, f3 = 383 × 3 Hz, f4 = 383 × 4 Hz). Table 4.7 shows the detection results based on the energy factor features; the parameters of all methods are the same as in Table 4.6. As can be seen from Table 4.7, the detection accuracy of the incipient pitting fault is substantially improved when the input features are the frequency-domain energy factors, and the accuracy of all methods reaches 100%. For the incipient spalling fault, the accuracy is similar to that obtained with time-domain features. The main reason is that the modulation in the FFT spectrum of the incipient pitting fault is mainly based on the mesh frequency and its multipliers, while the mesh frequency of normally engaged gears appears in the FFT spectrum of the incipient spalling fault, which demonstrates the effectiveness of the proposed feature indicator in incipient pitting fault diagnosis.

Table 4.7 Detection results based on energy factor features

Method    Accuracy of incipient pitting fault (%)    Accuracy of incipient spalling fault (%)
SVM       100                                        98.96
TSVM      100                                        99.72
CKSVM     100                                        99.60
DSTSVM    100                                        100
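The energy factor itself is defined earlier in Sect. 4.2 and is not restated in this excerpt; the sketch below assumes it is the spectral energy in a narrow band around each mesh-frequency harmonic, normalized by the total spectral energy, which reproduces the spirit of the 4-D indicator (f1 = 383 Hz and its multipliers).

```python
import numpy as np

def energy_factors(x, fs=40000.0, f_mesh=383.0, n_harm=4, half_band=5.0):
    """Assumed energy factor: normalized FFT energy around k * f_mesh."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    total = spec.sum()
    factors = []
    for k in range(1, n_harm + 1):            # f1 = 383 Hz, f2 = 2*383 Hz, ...
        mask = np.abs(freqs - k * f_mesh) <= half_band
        factors.append(spec[mask].sum() / total)
    return np.array(factors)                  # 4-D feature vector
```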


(2) Incipient fault detection under different working conditions
For fault detection under different working conditions, vibration acceleration signals (corresponding to all modes in Table 4.4) of the normal, incipient pitting, and incipient spalling states were collected, with 3072 × 30 sampling points per working condition. The gear original feature set S consists of the 11 statistical features; each fault type yields 270 11-D samples (30 samples per working condition), where each sample consisted of 3072 points. The data were grouped into two subsets: one containing samples of the normal state and the incipient pitting fault state, and the other containing samples of the normal state and the incipient spalling fault state. Each subset contained 540 11-D samples.
For gear incipient pitting fault detection, the PCA algorithm was used to select important features from the 11-D original features. Features 1, 2, 4, 5, 6, 7, 9, 10, and 11 were selected, and the dataset becomes 540 × 9-D. The same procedure was used for gear incipient spalling fault detection, where the selected features were 1, 2, 3, 5, 6, 7, 9, 10, and 11. From the 540 × 9-D samples, 50 labeled samples and 450 unlabeled samples were randomly selected, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training to predict the labels of the 450 unlabeled samples after 5-CV. Table 4.8 provides the detection results of the different methods, with C = 100 for all classifiers.
From Table 4.8, the gear incipient spalling detection accuracy of all methods exceeds 90% even with very few labeled samples. However, the detection accuracy of all methods is lower for gear incipient pitting faults because of the complex working conditions. To improve the detection accuracy of incipient pitting faults, multi-sensor data fusion is adopted for model training. In this section, four sensors were arranged, and the x-direction signal of each sensor was taken as the original signal. The 11 statistical features of each sensor were extracted separately, and each feature was composed into a 4-dimensional vector across the sensors. 270 × 4-D samples of the normal state and 240 × 4-D samples of the incipient pitting fault were selected, since the data of the 4th sensor at 1000 rpm and 75 N m torque were missing for

Table 4.8 Gear pitting and spalling detection under different working conditions

          Gear incipient pitting detection      Gear incipient spalling detection
Method    Parameter          Accuracy (%)       Optimal parameter    Accuracy (%)
SVM       σ = 0.55           72.18              σ = 0.50             91.06
TSVM      σ = 0.55           74.02              σ = 0.50             92.46
CKSVM     σ = 1.55           76.03              σ = 1.55             93.06
DSTSVM    ρ = 2, σ = 1.22    75.24              ρ = 2, σ = 1.40      94.12


the incipient pitting fault. 50 labeled samples and 460 unlabeled samples were randomly selected from the 510 samples, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training, and the detection accuracy is listed in Table 4.9.
Figure 4.17 visually shows the detection accuracy of the different methods for each feature indicator. It can be seen that the skewness feature has the lowest accuracy, while better results are obtained with the mean square value, mean, variance, and root mean square features. Compared to the detection results using a single sensor, the DSTSVM and CKSVM methods improve by about 20% and the SVM method by about 15%, which proves the effectiveness and reliability of the multi-sensor data fusion method. In addition, the DSTSVM and CKSVM methods are superior to the other two methods when using the mean square value, mean, variance, and root mean square features, which demonstrates the effectiveness of spectral clustering methods with fewer feature dimensions and provides a reference for gear incipient feature selection.
To further explore the influence of the feature indicators on the detection results, the mean square value, mean, variance, and root mean square features were used as input indicators for incipient pitting fault detection under different working conditions. The samples from sensor 3 were input into the DSTSVM model, and with 50 labeled samples available, the detection accuracy on the 490 unlabeled samples was 91.8%. This is about 14% higher than the methods using the 9 feature indicators selected by PCA, which indicates that feature selection by PCA only eliminates redundancy and does not necessarily select effective and discriminative feature indicators; feature selection therefore needs further study.
3. Gear incipient fault detection based on density-adjustable spectral clustering and semi-supervised SVM
Three types of signals were collected for gear incipient fault detection under the working condition of 800 r/min speed at the drive end and 75 N m torque at the output end. 30 labeled samples were selected to determine whether 150 samples belonged to the incipient pitting fault or the incipient spalling fault. Similarly, 30 labeled samples were selected to determine whether 240 samples belonged to the normal state, the incipient pitting fault, or the incipient spalling fault. Each set of experiments was randomly sampled 10 times and the results averaged. The penalty factor was C = 100 for all methods, and the detection results for the different feature sets are listed in Table 4.10. The 8-D time-domain feature indicators were features 1, 2, 4, 6, 7, 9, 10, and 11 selected by PCA, and the 4-D frequency-domain energy factor indicators were calculated from the mesh frequency and its 2nd-4th multipliers.
From Table 4.10, when using time-domain feature indicators, the accuracy of distinguishing only the incipient pitting fault from the incipient spalling fault is higher than that of distinguishing the normal state, incipient pitting fault, and incipient spalling fault together, mainly because the distinction between the normal state and the incipient pitting fault is insufficient. In both sets of experiments, the semi-supervised methods outperformed the supervised method, and the best method was DSTSVM.

Table 4.9 Multi-sensor data fusion for gear pitting detection under different working conditions

                      SVM                    TSVM                   CKSVM                  DSTSVM
Feature               σ       Accuracy (%)   σ       Accuracy (%)   σ       Accuracy (%)   (ρ, σ)       Accuracy (%)
Mean square value     0.50    84.86          1.0     64.78          0.50    88.84          (3, 0.70)    91.80
Kurtosis              0.50    62.36          0.50    61.78          0.50    61.43          (2, 0.70)    63.04
Mean                  0.50    80.78          1.50    81.06          0.50    89.58          (2, 0.75)    92.08
Variance              0.55    85.13          1.45    67.58          0.50    89.28          (2, 0.75)    91.30
Skewness              1.45    56.13          0.80    53.10          0.55    56.04          (2, 0.75)    56.73
Peak                  1.50    74.08          1.50    72.41          0.50    72.60          (3, 0.75)    74.36
Root mean square      0.80    90.84          1.45    65.76          0.50    89.86          (3, 0.55)    92.80
Shape indicator       1.45    64.02          0.55    63.58          0.55    61.60          (2, 0.75)    64.63
Crest indicator       1.50    64.34          0.50    67.10          0.50    63.0           (2, 0.50)    64.28
Impulse indicator     0.50    65.47          0.50    67.19          0.50    61.10          (2, 0.50)    66.20
Clearance indicator   1.0     65.06          0.50    66.76          0.50    62.90          (2, 0.50)    65.21


Fig. 4.17 Accuracy of different features

Table 4.10 Detection accuracy on different feature indicators

                                 Incipient pitting fault versus       Normal versus incipient pitting fault
                                 incipient spalling fault             versus incipient spalling fault
Feature indicator     Method     Parameter          Accuracy (%)      Parameter          Accuracy (%)
Time domain (8-D)     SVM        σ = 0.50           99.4              σ = 0.55           86.66
                      TSVM       σ = 0.50           100               σ = 0.55           87.29
                      CKSVM      σ = 0.95           100               σ = 0.75           88.16
                      DSTSVM     ρ = 2, σ = 0.60    100               ρ = 4, σ = 0.75    88.87
Energy factor (4-D)   SVM        σ = 0.50           100               σ = 0.55           99.48
                      TSVM       σ = 0.50           100               σ = 0.55           98.54
                      CKSVM      σ = 0.80           100               σ = 0.75           98.87
                      DSTSVM     ρ = 2, σ = 0.60    100               ρ = 5, σ = 0.75    99.5

When the frequency-domain energy factor is used to detect incipient faults, the results of both sets of experiments are substantially improved, which also proves the effectiveness of the frequency-domain energy factor feature indicator.
To analyze the effect of sample size on the detection accuracy, the different methods based on the 8-D features were run for incipient fault detection, and the results are shown in Fig. 4.18. It can be seen that the accuracy of the SVM method increases with the sample size. When the number of labeled samples is 30-60, the SVM has the lowest accuracy, while the other methods achieve better results due to the participation of unlabeled samples. When the sample size is larger than 60, the SVM method has higher accuracy than TSVM and CKSVM, and the DSTSVM method always achieves the best detection results thanks to the introduction of the density-adjustable factor.


Fig. 4.18 Accuracy of different sample sizes

In addition, the DSTSVM method achieves superior accuracy in most cases, and its performance tends to stabilize once the number of labeled samples exceeds 80.

4.2.4.2 Bearing Incipient Fault Detection

(1) Data description
The CWRU dataset was collected from the test platform at Case Western Reserve University [13], as detailed in Sect. 2.5.4. Inner race single point faults, outer race single point faults, and rolling element point faults were introduced into SKF 6205-2RS deep groove ball bearings using electro-discharge machining. The fault diameter was 0.1778 mm and the depth was 0.2794 mm. The specific parameters of the bearing are listed in Table 4.11. The normal state, inner race fault, outer race fault, and rolling element fault signals were collected at 1797 r/min speed and zero motor load, with a sampling frequency of 12,000 Hz. The fault eigenfrequencies of the rolling bearing are calculated and shown in Table 4.12.
(2) Bearing incipient fault detection based on density-adjustable spectral clustering and semi-supervised SVM

Table 4.11 Geometry parameters of the 6205-2RS bearing

Inside diameter (mm)    Outside diameter (mm)    Thickness (mm)    Number of rollers    Ball diameter (mm)    Pitch diameter (mm)
25                      52                       15                9                    7.94                  39.11


Table 4.12 Fault eigenfrequencies of the rolling bearing

Rotation frequency (Hz)    Inner race (Hz)    Outer race (Hz)    Rolling element (Hz)    Cage train (Hz)
29.95                      162.1852           107.3648           141.1693                11.9285
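The values in Table 4.12 can be checked against the standard kinematic formulas for rolling bearings, using the geometry of Table 4.11 and a zero contact angle for the deep groove ball bearing; note that the "rolling element" entry corresponds to twice the ball spin frequency, since a ball defect strikes both races once per spin.

```python
import numpy as np

def bearing_frequencies(fr=29.95, n=9, d=7.94, D=39.11, phi=0.0):
    """Standard bearing defect frequencies (fr: shaft speed in Hz,
    n: rollers, d: ball diameter, D: pitch diameter, phi: contact angle)."""
    r = (d / D) * np.cos(phi)
    ftf = 0.5 * fr * (1 - r)                   # cage (fundamental train)
    bpfo = 0.5 * n * fr * (1 - r)              # outer race pass frequency
    bpfi = 0.5 * n * fr * (1 + r)              # inner race pass frequency
    bsf = (D / (2 * d)) * fr * (1 - r ** 2)    # ball spin frequency
    return {'inner': bpfi, 'outer': bpfo,
            'rolling element': 2 * bsf,        # defect hits both races per spin
            'cage': ftf}

print(bearing_frequencies())  # close to Table 4.12: 162.1, 107.4, 141.4, 11.9
```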

The vibration acceleration signals of the normal state and the rolling element point fault were collected under the working condition of 1797 r/min speed and zero motor load. The number of sampling points was 1024 × 100; one sample was taken every 1024 points, giving 100 samples in total. The 11 statistical features plus the amplitude feature at the rolling element fault frequency constitute 200 12-D samples. The shape indicator and the amplitude feature were removed by running the PCA algorithm, and the sample dimension became 10-D.
During sampling, 15, 25, 35, 45, and 55 labeled samples were randomly selected from the 200 × 10-D samples, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training to predict the labels of the unlabeled samples after 5-CV. Table 4.13 provides the incipient fault detection results, with penalty factor C = 100. It can be seen from Table 4.13 that the accuracy of the SVM method increases with the sample size, while the accuracies of the other three methods reach 100% even with only a small number of labeled samples.
(3) Bearing incipient fault detection based on density-adjustable spectral clustering and semi-supervised SVM
The vibration acceleration signals of the normal state, inner race point fault, outer race point fault, and rolling element point fault were collected. The number of sampling points was 1024 × 100; one sample was taken every 1024 points, giving 100 samples for each type of signal. The 11 statistical features, 3 frequency-domain features, and the amplitude features at the outer race, inner race, and rolling element fault frequencies constitute 400 samples. After the mean feature and the peak feature were removed by running the PCA algorithm, the dataset became 400 × 12-D.

Table 4.13 Rolling element incipient fault detection results of the rolling bearing

Labeled      SVM                    TSVM                   CKSVM                  DSTSVM
samples n    σ       Accuracy (%)   σ       Accuracy (%)   σ       Accuracy (%)   (ρ, σ)       Accuracy (%)
15           0.50    88.10          0.5     99.78          1.4     100            (2, 1.25)    100
25           0.50    95.42          0.5     100            1.5     100            (2, 1.5)     100
35           0.50    96.12          0.5     100            1.5     100            (2, 1.35)    100
45           0.50    97.22          0.5     100            1.5     100            (2, 1.35)    100
55           0.50    97.65          0.5     100            1.5     100            (2, 1.35)    100


Table 4.14 Rolling element incipient fault detection results of different methods

Labeled      SVM                    TSVM                   CKSVM                  DSTSVM
samples n    σ       Accuracy (%)   σ       Accuracy (%)   σ       Accuracy (%)   (ρ, σ)       Accuracy (%)
20           0.50    94.05          0.50    94.63          1.45    100            (2, 1.75)    100
40           0.55    98.11          0.50    96.80          1.3     100            (2, 1.75)    100
60           0.50    98.64          0.50    97.90          1.4     100            (2, 1.70)    100
80           0.50    98.70          0.50    98.67          1.35    100            (2, 1.45)    100
100          0.50    98.96          0.50    98.73          1.4     100            (2, 0.5)     100

During sampling, 5, 10, 15, 20, and 25 labeled samples were randomly selected from each type of sample, and the sampling was repeated 10 times. All samples were fed into the DSTSVM classifier for co-training to predict the labels of the unlabeled samples after 5-CV. Table 4.14 provides the incipient fault detection results, with penalty factor C = 100. It can be seen from Table 4.14 that the accuracy of the SVM method increases with the sample size and is higher than that of TSVM. The accuracy of the CKSVM and DSTSVM methods reaches 100% when the number of labeled samples is 20 and remains stable, which demonstrates the effectiveness of the spectral clustering-based classifiers.

4.3 LLE Based Fault Recognition

4.3.1 Local Linear Embedding

Local linear embedding (LLE), a nonlinear dimensionality reduction algorithm that represents the original topology through local linear reconstruction based on Euclidean distance, was proposed by Roweis and Saul and published in Science. The core idea of the LLE method is to reconstruct a weight vector between a sample and its neighborhood samples and to keep the weights in each neighborhood consistent between the original space and the low-dimensional space, i.e., to minimize the reconstruction error under a locally linear embedding mapping [14]. The weights reconstructed by the LLE algorithm capture the intrinsic geometric properties of the local space, such as invariance to translation, rotation, and scaling.
In LLE, the nearest neighborhood of each sample is first determined, and the reconstruction weights are then obtained by solving a constrained least squares problem. In this process, LLE transforms the constrained least squares problem into a possibly singular linear system of equations and guarantees the non-singularity of the coefficient matrix by introducing a small regularization factor γ. Furthermore, a sparse matrix can be constructed using these reconstruction


weights. As a result, LLE can obtain the global low-dimensional embedding by calculating the smallest eigenvectors of this sparse matrix.
The main advantages of the LLE algorithm are:
(1) Only the number of nearest neighbors and the embedding dimension need to be determined, so parameter selection is simple;
(2) The search for the optimal low-dimensional feature mapping is transformed into an eigenvalue problem for a sparse matrix, which avoids the local optimum problem;
(3) The low-dimensional feature space retains the local geometric properties of the high-dimensional space;
(4) The low-dimensional feature space has a globally orthogonal coordinate system;
(5) The LLE algorithm can learn low-dimensional manifolds of arbitrary dimension and has an analytic, globally optimal solution that requires no iteration;
(6) The LLE algorithm reduces to an eigenvalue computation on sparse matrices, so its computational complexity is relatively small and it is easy to implement.
The LLE algorithm preserves the intrinsic connections between samples through the weight coefficients of samples in each neighborhood, and can be effectively applied to nonlinear dimensionality reduction [15].
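The standard Roweis-Saul procedure summarized above fits in a short numpy sketch, shown below with the regularization factor γ applied to the local Gram matrix; this is a minimal illustration rather than a production implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, k=10, d=2, gamma=1e-3):
    """Minimal locally linear embedding."""
    N = X.shape[0]
    dist = cdist(X, X)
    np.fill_diagonal(dist, np.inf)
    nbrs = np.argsort(dist, axis=1)[:, :k]         # k nearest neighbors
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                      # shifted neighborhood
        G = Z @ Z.T                                # local Gram matrix
        G += gamma * np.trace(G) * np.eye(k)       # regularization factor gamma
        w = np.linalg.solve(G, np.ones(k))         # constrained least squares
        W[i, nbrs[i]] = w / w.sum()                # weights sum to one
    M = (np.eye(N) - W).T @ (np.eye(N) - W)        # sparse in practice
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                        # drop the constant eigenvector
```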

4.3.2 Classification Based on LLE

In pattern recognition, the accuracy of status classification depends largely on feature selection. The running of a mechanical system is a complex stochastic process that is difficult to describe by a deterministic time function. The purpose of feature analysis is to extract useful features from the original signal that reflect the running status of the mechanical equipment. Although there are many candidate features, they differ in how well they reflect the regularity, sensitivity, clustering, and separability of the states, and the state information contained in different features is not consistent. Therefore, it is necessary to select effective features with good regularity and sensitivity as the initial vector and, after eliminating redundant information, construct lower-dimensional features for status classification [16].
The core idea of the LLE algorithm [17] is to determine the interrelationships between samples in a local coordinate system in order to learn the intrinsic manifold structure of the samples. As an unsupervised learning algorithm, LLE is an effective method for compressing high-dimensional data. In binary classification, the positive and negative samples lie on their own manifolds; therefore, the difference of their distances can be used to determine the class of a test sample.


4.3.3 Dimension Reduction Performance Comparison Between LLE and Other Manifold Methods

To examine the merits of the LLE algorithm, comparative experiments with the Isomap method and the Laplacian Eigenmap (LE) method were conducted on the classical Twin Peaks dataset, with N = 800 and D = 3. Figure 4.19 provides the dimensionality reduction results of the three manifold learning methods. It can be seen that the global Isomap method maintains the data topology best, but it takes much longer, almost 50-90 times longer than the local algorithms. The LE method takes the shortest time, but the topology of the data is severely damaged. The LLE method can effectively cluster the data after dimensionality reduction and takes little time; although most of the internal topology is destroyed, it is more effective for dimensionality reduction overall than the other two methods. The running times of the three methods for different k are listed in Table 4.15. When k is within a certain range, it has little influence on the running time. The local methods have a clear advantage when factors such as the representation ability of the selected features, the computation time, and the simplicity of implementation are all considered.

Fig. 4.19 Dimensionality reduction results of three manifold learning methods

Table 4.15 Running time (s) of the three methods for different k

Method    k = 6      k = 8      k = 10     k = 12     k = 14
LLE       2.2561     0.29539    0.3938     0.51363    0.76083
LE        0.51615    0.19607    0.21568    0.23435    0.26801
Isomap    17.6196    17.3278    17.3004    17.2798    17.3269
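A comparison in the same spirit can be run with the scikit-learn implementations of the three methods; the snippet below uses a generated two-peak surface as a rough stand-in for the Twin Peaks data (N = 800, D = 3), so the absolute timings will differ from Table 4.15.

```python
import time
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(800, 2))
z = np.sin(np.pi * xy[:, 0]) * np.tanh(3.0 * xy[:, 1])   # assumed surface
X = np.column_stack([xy, z])                             # N = 800, D = 3

for name, est in [('LLE', LocallyLinearEmbedding(n_neighbors=10, n_components=2)),
                  ('LE', SpectralEmbedding(n_neighbors=10, n_components=2)),
                  ('Isomap', Isomap(n_neighbors=10, n_components=2))]:
    t0 = time.perf_counter()
    est.fit_transform(X)
    print(f"{name}: {time.perf_counter() - t0:.3f} s")
```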


4.3.4 LLE Based Fault Diagnosis

When applying local linear embedding methods to fault diagnosis, two problems need to be addressed: extracting representative fault features, and diagnosing the fault type of new data based on a known fault knowledge library [18]. LLE is a classical local manifold learning method. Although many experiments have proven that LLE is an effective visualization method, it still has shortcomings when applied to pattern recognition [19]: (1) the generalization of the model to unknown samples needs to be improved; (2) label information is not used. Therefore, combining supervised linear discriminant analysis with the LLE algorithm is a good solution for fault diagnosis.

4.3.4.1 Diagnosis Algorithm

Local linear discriminant embedding (LLDE), a supervised LLE method, was proposed by Li et al. of the Chinese Academy of Sciences and successfully applied to face recognition [20]. In this section, an improved local linear discriminant embedding classification (LLDEC) method is proposed based on the principle of LLDE and the introduction of an evaluation criterion for classification. The steps of the LLDEC method are as follows.
(1) LLDEC algorithm
Step 1: Determine the neighborhood of Xi by using the KNN or ε-ball method.
Step 2: Minimize the linear reconstruction error between Xi and its k nearest neighbors by calculating the reconstruction weights of Xi in εi(W) = arg min ‖Xi − Σ_{j=1}^{k} Wij Xj‖².
Step 3: Obtain the weight matrix W = [Wij]_{N×N} by repeating Step 2 for all samples.
Step 4: Construct the matrix M = (I − W)ᵀ(I − W).
Step 5: Construct the matrix X M Xᵀ.
Step 6: Calculate the between-class scatter matrix Sb, the within-class scatter matrix Sw, and the weighted scatter difference Sb − μSw.
Step 7: Obtain the d-dimensional embedding Y = Vᵀ X by solving the generalized eigenvalue problem of (X M Xᵀ − (Sb − μSw), X Xᵀ) for its d smallest eigenvalues and the corresponding eigenvector matrix V.
Step 8: Identify the class of a test sample by comparing the reconstruction error εY(W)+ between the sample and the positive manifold with the error εY(W)− between the sample and the negative manifold, and select a suitable classifier for the embedded results.
(2) The purpose of LLDEC
In terms of visualization, the goal of dimensionality reduction is to map the high-dimensional sample space to a two-dimensional or three-dimensional space while preserving the intrinsic structure of the samples as much as possible. However, the goal


of classification is to map the samples into a feature space where the classes can be clearly separated. LLE is an effective visualization method for mapping high-dimensional data into a two-dimensional space, but its classification ability is inferior. The goal of LLDEC is to improve the classification ability of the original LLE by making full use of the class information. The reconstruction weights are invariant to translation, rotation, and scaling, which is expressed as:

φ(Y) = Σi ‖Yi − Σj Wij Yj‖² = Σi ‖(Yi − Ti) − Σj Wij (Yj − Ti)‖²    (4.12)

where Ti is the translation vector of class i. To improve the effectiveness of the LLDEC algorithm, the K-nearest neighbor algorithm is combined with it to perform classification.
(3) The algorithm based on the LLDEC and KNN classifier
The K-nearest neighbor classification algorithm (KNN) identifies the class of a new sample from the classes of its k nearest neighbors. The choice of the neighbor number k depends on the number of samples and the degree of dispersion in each class, and different k values are selected for different applications. If there are few samples around a test sample si, the region covered by the k neighbors becomes large, and vice versa. The nearest neighbor algorithm is therefore vulnerable to noisy data, especially isolated points in the sample space; the intrinsic reason is that in the basic KNN method all k nearest neighbors have the same influence on the test sample. In general, different neighbors should have different influence: the closer the neighbor, the greater its effect. LLDEC is a supervised algorithm with label information, and its effectiveness can be verified by combining the dimensionality reduction with the KNN algorithm.
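The equal-influence problem of basic KNN noted above is commonly mitigated by inverse-distance weighting of the neighbor votes. The following is a hedged sketch of such a classifier operating on an embedded dataset (for example, the output of LLDEC); it is an illustration, not the book's exact classifier.

```python
import numpy as np
from scipy.spatial.distance import cdist

def weighted_knn_predict(Y_train, labels, Y_test, k=5, eps=1e-12):
    """Distance-weighted KNN: closer neighbors receive larger votes."""
    dist = cdist(Y_test, Y_train)
    nbrs = np.argsort(dist, axis=1)[:, :k]
    preds = []
    for i, idx in enumerate(nbrs):
        w = 1.0 / (dist[i, idx] + eps)             # inverse-distance weights
        classes = np.unique(labels[idx])
        votes = [w[labels[idx] == c].sum() for c in classes]
        preds.append(classes[int(np.argmax(votes))])
    return np.array(preds)
```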

4.3.4.2 Case Study

To verify the effectiveness of the proposed LLDEC method, a set of comparative experiments was run on the gear dataset collected by the Laborelec lab. Each sample was composed of skewness, crest indicator, peak, root mean square, variance, and shape indicator. Three types of signals (normal, pitting fault, and spalling fault) were collected, so the dataset contained 162 6-D samples, with 54 samples per type. Figure 4.20 provides the classification results of the two methods; it shows that the classification ability of the proposed LLDEC is greatly improved compared to the original method. In each set of experiments, 34


samples were randomly selected to train a KNN model and predict the fault classes of 20 test samples. Figure 4.21 provides the classification results of both LLE and LLDEC combined with KNN. Table 4.16 shows the classification results of the integrated LLE and LDA models on the lab dataset. When k = 8, the classification accuracy of LLE is only 40%, while the accuracy of the proposed LLDEC method rises to 96.7%. Unlike the Iris dataset, the dimensionality of the lab dataset increases from 4-D to 6-D, and the accuracy of LLDEC increases accordingly, which illustrates the effectiveness of the proposed LLDEC method in dimensionality reduction for high-dimensional nonlinear data.

Fig. 4.20 Classification result of LLE method and LLDEC method on lab dataset

Fig. 4.21 Classification result of LLE + KNN method and LLDEC + KNN method on lab dataset


Table 4.16 Accuracy of four algorithms on the lab dataset

Algorithm    Training samples    Test samples    Parameter    Accuracy (%)
LLE          102                 60              8            40
ULLELDA      102                 60              (8, 2)       82
LLDA         102                 60              2            81.76
LLDEC        102                 60              12           96.7

4.3.4.3 Fault Diagnosis for Automotive Transmission

1. Transmission gearing fault diagnosis
The gear fault data were collected from a Dongfeng SG135-2 automobile transmission produced by a gear factory. The forward fifth gear was used, and the sensor was mounted on the output shaft bearing. The transmission was operated in 3 modes: normal, incipient pitting fault, and incipient spalling fault. Please refer to Sect. 3.3.4 for the transmission and the fault gears. To ensure the fairness of the experiment, the transmissions must operate under the same working conditions. Table 4.17 lists the transmission operating conditions and the signal sampling parameters.
(1) LLE-based gear fault diagnosis
A 15-D dataset of 270 samples consisted of 6 time-domain features, 4 frequency-domain features, and 5 time-frequency-domain features, and the original LLE algorithm was adopted for dimensionality reduction and fault diagnosis. Compared to the above case study, it can be seen from Fig. 4.22 that the selected features are effective and the fault modes can be clearly distinguished even with the basic LLE method. Combining LLE with KNN, the final diagnosis results are shown in Fig. 4.23, in which 60 training samples and 30 test samples were randomly selected for each fault mode. Visually, there is no significant differentiation among the fault modes, so Table 4.18 provides the accuracy in terms of quantitative metrics.

Table 4.17 Working condition parameters of the transmission

Parameter                             Value          Parameter                                 Value
Input speed                           1000 rpm       Output speed                              1300 rpm
Input torque                          69.736 N m     Output torque                             50.703 N m
Rotation frequency of input shaft     16.67 Hz       Acceleration sampling frequency           40,000 Hz
Rotation frequency of middle shaft    11.40 Hz       Mesh frequency of fourth-shifting gear    433 Hz
Rotation frequency of output shaft    21.67 Hz       Mesh frequency of fifth-shifting gear     478 Hz


Fig. 4.22 Two-dimensional visualization of raw LLE gear data

Fig. 4.23 Visualization of classification result by using LLE + KNN

Even for the best-distinguished pair, spalling faults versus pitting faults, the accuracy was still only 73%.
(2) LLDEC-based gear fault diagnosis
Similarly, a 15-D dataset of 270 samples consisted of 6 time-domain features, 4 frequency-domain features, and 5 time-frequency-domain features,

Table 4.18 Classification of gear data using LLE + KNN

Algorithm    Fault modes                       Training and test samples    Time (s)    Accuracy (%)
LLE          Normal versus pitting fault       (120, 60)                    0.6145      56
             Normal versus spalling faults     (120, 60)                    0.614       60
             Pitting versus spalling faults    (120, 60)                    0.6147      73
             Three faults                      (180, 90)                    0.6237      62.2


and the LLDEC algorithm was adopted for dimensionality reduction and fault diagnosis. Figure 4.24 shows the visualization of the gear data after dimension reduction by LLDEC. Combining LLDEC with KNN, the final diagnosis results are shown in Fig. 4.25, in which 60 training samples and 30 test samples were randomly selected for each fault mode.

Fig. 4.24 Visualization of gear data dimension reduction by using LLDEC

Fig. 4.25 Visualization of classification result by using LLDEC + KNN


From Fig. 4.25, the three fault modes can be completely distinguished using the LLDEC method, which proves the effectiveness of the method. Table 4.19 provides the accuracy in terms of quantitative metrics. Compared with the original LLE algorithm, the accuracy of the LLDEC method is greatly improved.
2. Transmission bearing fault diagnosis
The dataset was collected from the bearing fault platform of CWRU. The experiment simulated three fault modes: inner race single point faults, outer race single point faults, and rolling element point faults, as detailed in Sect. 2.5.4.
(1) LLE-based bearing fault diagnosis
A 15-D dataset of 270 samples consisted of 6 time-domain features, 4 frequency-domain features, and 5 time-frequency-domain features, and the original LLE algorithm was adopted for dimensionality reduction and fault diagnosis. Figures 4.26 and 4.27 provide the fault diagnosis results. From Figs. 4.26 and 4.27, it is difficult to observe significant differences among the fault modes; Table 4.19 provides the accuracy in terms of quantitative metrics, and the normal state and the rolling element fault are also very difficult to separate.

Table 4.19 Classification of bearing data using LLE + KNN

Algorithm    Fault modes                             Training and test samples    Time (s)    Accuracy (%)
LLE          Normal versus inner race fault          (120, 60)                    2.8783      80
             Normal versus outer race faults         (120, 60)                    2.8792      100
             Normal versus rolling element faults    (120, 60)                    2.8791      41.67
             Four faults                             (240, 120)                   3.7517      85

Fig. 4.26 Two-dimensional visualization of result using LLE


Fig. 4.27 Visualization of classification result by using LLE + KNN

(2) LLDEC-based bearing fault diagnosis
A 15-D dataset of 270 samples consisted of 6 time-domain features, 4 frequency-domain features, and 5 time-frequency-domain features, and the LLDEC algorithm was adopted for dimensionality reduction and fault diagnosis. Figures 4.28 and 4.29 provide the fault diagnosis results. Compared with the original LLE algorithm, the LLDEC method can clearly identify the different fault modes. Table 4.20 provides the accuracy in terms of quantitative metrics; the accuracy of the LLDEC method is greatly improved. From Fig. 4.29 and Table 4.20, however, the accuracy of the LLDEC method is still not good enough, because it is difficult to distinguish between the rolling element fault and the normal mode of the bearing. Although the improved LLDEC method is better than the basic LLE method, its effectiveness still needs further improvement.

Fig. 4.28 Visualization of dimension reduction result using LLDEC


Fig. 4.29 Visualization of classification result using LLDEC + KNN

Table 4.20 Classification of bearing data using LLDEC + KNN

Algorithm    Fault modes                             Training and test samples    Time (s)    Accuracy (%)
LLDEC        Normal versus inner race fault          (120, 60)                    0.3930      99
             Normal versus outer race faults         (120, 60)                    0.393       97
             Normal versus rolling element faults    (120, 60)                    0.3930      72.67
             Four faults                             (240, 120)                   0.4185      91.3

4.3.5 VKLLE Based Bearing Health State Recognition

The LLE algorithm uses the Euclidean distance to measure the similarity of samples, which does not fully represent their intrinsic structure. It is also very sensitive to the number of nearest neighbors: even similar neighbor numbers can cause obvious discrepancies in the dimensionality reduction results. The choice of the number of nearest neighbors is therefore important; too large a number harms the local properties, while too small a number does not guarantee the global properties.

4.3.5.1 Variable K-Nearest Neighbor Locally Linear Embedding

Although research progress has been made on the nearest neighbor parameter setting of the LLE algorithm, further improvement is still needed, because the number of nearest neighbors is fixed for all samples and the existing approaches have low computational efficiency and high complexity. Due to the uneven distribution and the different locations of samples, each sample has its own most suitable number of nearest neighbors. Following this principle, a variable K-nearest neighbor locally linear embedding (VKLLE)


method is proposed to determine the optimal number of nearest neighbors for each sample, based on the residual, which evaluates the degree to which the data retain distance information. The smaller the residual, the more information about the original data structure is retained after dimensionality reduction; conversely, the larger the residual, the more information is lost and the worse the dimensionality reduction. The experimental results show that using the VKLLE method to analyze bearing state signals effectively improves the algorithm's stability while maintaining the dimensionality reduction quality.
1. VKLLE algorithm
(1) Calculate the nearest neighbors of each sample: Given a maximum number of nearest neighbors kmax (since the essence of the LLE algorithm is to maintain local linearity, kmax cannot be too large, otherwise the local linearity is destroyed and the dimensionality reduction deteriorates), calculate the Euclidean distance between each point and the remaining samples, and construct a nearest neighbor graph by finding its k (k < kmax) nearest samples.
(2) Obtain the weights: For each k, the reconstruction error can be described as

ε(W) = Σi ‖xi − Σ_{j∈Ji} wij xj‖²    (4.13)

By minimizing ε(W), the weights wi = {wi1, wi2, ..., wik} can be obtained, with Σ_{j=1}^{k} wij = 1.
(3) Obtain the low-dimensional embedding: the reconstruction error can be described as

ε(yi) = min_{yi} ‖yi − Σ_{j=1}^{k} wij yj‖²    (4.14)

By minimizing ε(yi), ε(Yk) can be expressed as:

ε(Yk) = min tr(Yk wiᵀ wi Ykᵀ)    (4.15)

where Yk = {yi − yi1, yi − yi2, ..., yi − yik}, yij (j = 1, 2, 3, ..., k) denotes the jth nearest neighbor, tr(·) is the trace of a matrix, and tr(Yk Ykᵀ) = c. According to the Lagrangian function

L = Yk wiᵀ wi Ykᵀ − λ(Yk Ykᵀ − c)    (4.16)

the partial derivative can be calculated as:


∂L/∂Yk = wiᵀ wi Ykᵀ − λYkᵀ = 0    (4.17)

Yk is the eigenvector corresponding to the non-zero minimum eigenvalue of wiᵀ wi, and |Yk| is the vector of distances from yi to its first k nearest points. Therefore, Yk can be determined from the weights wi computed in the high-dimensional space for xi. Calculate the residual 1 − ρ²(Xk, Yk) for each k, where Xk is the vector of distances between xi and its first k nearest neighbors in the high-dimensional space, and ρ denotes the linear correlation coefficient. The smaller the residual, the more information about the original data structure is retained after dimensionality reduction; large residual values indicate that the information of the original samples is not adequately preserved.
(4) Determine the optimal number of nearest neighbors ki_opt for each sample and the optimal weight matrix:

ki_opt = arg min_k (1 − ρ²(Xk, Yk))    (4.18)
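A heavily simplified sketch of the per-sample neighbor selection in Eq. (4.18) follows. It assumes LLE-style reconstruction weights with a small regularization, takes Yk as the (rank-one) eigenvector of wiᵀwi as in step (3), and selects the k that minimizes the residual 1 − ρ² between the high-dimensional neighbor distances Xk and |Yk|; the authors' exact implementation may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def optimal_k_per_sample(X, k_max=15, gamma=1e-3):
    """Sketch of VKLLE neighbor selection, Eq. (4.18)."""
    dist = cdist(X, X)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)
    k_opt = np.full(len(X), 3, dtype=int)
    for i in range(len(X)):
        best = np.inf
        for k in range(3, k_max + 1):
            idx = order[i, :k]
            Z = X[idx] - X[i]
            G = Z @ Z.T + gamma * np.eye(k)        # regularized local Gram matrix
            w = np.linalg.solve(G, np.ones(k))
            w /= w.sum()
            yk = np.abs(w) / np.linalg.norm(w)     # |Y_k|: rank-one eigvec of w^T w
            resid = 1.0 - np.corrcoef(dist[i, idx], yk)[0, 1] ** 2
            if resid < best:
                best, k_opt[i] = resid, k
    return k_opt
```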

2. Assessment metrics
The classification performance decrease rate index R [21] and the separability evaluation index Jb from Sect. 3.2.2 are used to assess the classification results after dimensionality reduction. A smaller R means better classification in the low-dimensional space and more effective retention of label information, and vice versa. A larger Jb means shorter intra-class distances and longer inter-class distances, which represents higher separability of the sample clusters and better dimensionality reduction.
The classification performance decrease rate index is

R = (Nx − Ny)/Nx    (4.19)

where Nx and Ny are the numbers of samples classified correctly in the high-dimensional and low-dimensional spaces, respectively. The K-nearest neighbor classifier is used to classify the samples, i.e., the class of a given sample is taken to be the majority class label among its nearest neighbor samples.
3. Case study
The Mixed National Institute of Standards and Technology (MNIST) handwritten digits dataset was used to verify the effectiveness of the VKLLE method. Figure 4.30 provides an example of the MNIST dataset; 500 samples (digits 0-4) with 784 dimensions were selected from the MNIST training set as simulation data.


Fig. 4.30 MNIST dataset

Table 4.22 provides the separability of the samples after dimensionality reduction, where d is the dimension of the low-dimensional space. Each column in Tables 4.21 and 4.22 gives R and J_b of the two algorithms for different numbers of nearest neighbors k = 10, 15, 20, 25, 30 or k_max = 10, 15, 20, 25, 30. A smaller R indicates that more information is retained in the low-dimensional space; a larger J_b indicates higher separability of the samples after dimensionality reduction. When the dimension of the low-dimensional space is larger, the R-value is smaller, which indicates that more label information is retained in that space, making it more suitable for classification. For the same dimension, the R of the LLE method differs considerably for different k, with a maximum difference of 20.17% (between k = 10 and k = 30 when d = 2 and K = 9). In contrast, similar R values are obtained for different k_max with the VKLLE method, with a maximum difference of only 3.54% (between k_max = 10 and k_max = 30 when d = 2 and K = 5), which verifies the stability of the proposed VKLLE method. Compared with the LLE method, the VKLLE method achieves a smaller R-value. Only when k/k_max = 15, d = 5, and K = 2 or K = 5 is the R-value of the VKLLE method higher than that of the LLE algorithm, and the difference between the two values is small: 0.41% at K = 2 and 0.21% at K = 5.


Table 4.21 R of LLE and VKLLE methods on MNIST dataset

| d | K | LLE, k = 10 | k = 15 | k = 20 | k = 25 | k = 30 | VKLLE, k_max = 10 | k_max = 15 | k_max = 20 | k_max = 25 | k_max = 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 3.73 | 5.18 | 8.28 | 16.36 | 19.25 | 2.28 | 1.66 | 2.48 | 2.48 | 2.28 |
| 2 | 5 | 6.46 | 8.33 | 12.50 | 21.25 | 24.79 | 3.13 | 3.75 | 3.75 | 5.21 | 4.58 |
| 2 | 9 | 5.94 | 8.07 | 9.77 | 22.51 | 26.11 | 1.70 | 2.12 | 2.76 | 4.67 | 4.03 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2 | 1.66 | 1.04 | 1.86 | 2.69 | 6.00 | 1.04 | 1.45 | 1.04 | 2.07 | 1.86 |
| 5 | 5 | 1.04 | 2.08 | 4.58 | 5.00 | 6.04 | 0 | 2.29 | 3.33 | 2.71 | 3.54 |
| 5 | 9 | 0 | 0.64 | 2.55 | 4.67 | 5.94 | −1.27 | −0.21 | 1.49 | 1.49 | 1.27 |

All values of R are in %.

Table 4.22 J_b of LLE and VKLLE methods on MNIST dataset

| d | LLE, k = 10 | k = 15 | k = 20 | k = 25 | k = 30 | VKLLE, k_max = 10 | k_max = 15 | k_max = 20 | k_max = 25 | k_max = 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.550 | 0.595 | 0.551 | 0.463 | 0.512 | 0.645 | 0.547 | 0.567 | 0.573 | 0.573 |
| 5 | 0.580 | 0.572 | 0.531 | 0.473 | 0.494 | 0.618 | 0.587 | 0.581 | 0.580 | 0.569 |

For all other data, the VKLLE method obtains a lower R, with a maximum difference of −22.08% (k/k_max = 30, d = 2, K = 9). It can be seen from Table 4.22 that the J_b values of the VKLLE method are all above 0.54 and are larger than those of the LLE method (the maximum difference reaches 0.11) except for the case of d = 2 and k/k_max = 15, which demonstrates that samples reduced with the VKLLE method have better clustering performance. The LLE algorithm is a local method, and too large a number of nearest neighbors fails to preserve the local structure of the samples. However, Tables 4.21 and 4.22 show that the VKLLE algorithm still has a small R even when a large k_max is used, which illustrates that the VKLLE method achieves better dimensionality reduction than the LLE method.

4. Complexity analysis

Assume that the number of samples is N, the dimension of the original space is D, the number of nearest neighbors of each sample in the LLE algorithm is k, the average number of nearest neighbors of each sample in the VKLLE algorithm is k_m, the maximum number of nearest neighbors is k_max, and the dimension of the low-dimensional space after dimensionality reduction is d. The complexities of the two algorithms are calculated as follows.

(1) The LLE algorithm complexity


To obtain the optimal number of nearest neighbors and better classification results, a common procedure is to repeat the LLE algorithm to derive the optimal k; the complexity is then:

Step 1: Calculate the nearest neighbors, O(k_max × D × N²).
Step 2: Calculate the weights, O(k_max × D × N × k³).
Step 3: Calculate the low-dimensional embedding, O(k_max × d × N²).

(2) The VKLLE algorithm complexity

Step 1: Calculate the nearest neighbors, O(k_max × D × N²).
Step 2: Calculate the weights and the optimal nearest neighbors, O(N_t × D × N × k_m³), where N_t = k_max + k_m.
Step 3: Calculate the low-dimensional embedding, O(d × N²).

To sum up, the algorithm complexity depends on the dimension D of the original space and the sample number N. When D > N, the complexity is dominated by Steps 1 and 2, and the computational time of the VKLLE algorithm is longer than that of the LLE algorithm; when D < N, the complexity is dominated by Steps 1 and 3, and the computational time of the VKLLE algorithm is much lower than that of the LLE algorithm. The MNIST dataset (59,370 training samples of dimension 784) is used to compare the running time of the two algorithms, with d = 2, D = 784, and k_max = 30. It is obvious from Table 4.23 that the computational time of the VKLLE algorithm is significantly lower than that of the LLE algorithm starting from about the 1000th sample, and the difference between the two algorithms becomes more pronounced as the number of samples increases.

5. VKLLE-based bearing health state recognition

The rotating machinery fault simulation platform introduced in Sect. 3.4.4 is used to generate three states: normal, inner race fault with a 1 mm slice, and outer race fault with a 1 mm slice. Figure 4.31 shows the bearing inner race fault and outer race fault with the 1 mm slice. With a PCB acceleration sensor installed on the bearing housing, the vibration signals under the different statuses are obtained using the BBM data acquisition front-end device.

Table 4.23 The complexity of LLE and VKLLE algorithms

| Sample (N) | LLE (s) | VKLLE (s) |
|---|---|---|
| 500 | 27.909703 | 75.689061 |
| 1000 | 159.886005 | 154.756701 |
| 1500 | 422.532133 | 238.290806 |
| 2000 | 829.198174 | 292.623492 |
| 2500 | 1454.128425 | 390.467419 |
| 3000 | 2363.035180 | 481.147113 |

Note Processor: Intel(R) Core(TM) i5 CPU M450 @ 2.4 GHz, Memory: 2 GB


The specific bearing parameters and feature frequencies are listed in Table 3.12. All vibration signals are collected at a speed of 1100 r/min, a sampling frequency of 12,000 Hz, and a duration of 1.5 s. Forty sets of vibration signals are collected under each fault status, and 20-D features based on Table 4.24 are extracted to constitute the original fault feature set. In total, 120 samples with 20 dimensions are obtained, and the dimension after reduction is 3-D. Figure 4.32 provides the time-domain waveform and demodulation spectrum of the bearing under each status; the waveform discrepancy in the time domain is obvious between the different statuses.

Fig. 4.31 Simulated fault of rolling bearing

Table 4.24 Original feature indicators

| Feature indicator in the time domain | Feature indicator in the frequency domain |
|---|---|
| Mean p1 | Amplitude at the rotational frequency p11 |
| Square root amplitude p2 | Amplitude at the inner circle passing frequency p12 |
| Standard deviation p3 | Amplitude at the outer circle passing frequency p13 |
| Peak p4 | Amplitude at the rolling element passing frequency p14 |
| Skewness p5 | Amplitude at the cage train passing frequency p15 |
| Kurtosis p6 | Amplitude at 2 times the rotational frequency p16 |
| Shape indicator p7 | Amplitude at 2 times the inner circle passing frequency p17 |
| Crest indicator p8 | Amplitude at 2 times the rolling element passing frequency p18 |
| Impulse indicator p9 | Amplitude at 2 times the outer circle passing frequency p19 |
| Clearance indicator p10 | Amplitude at 2 times the cage train passing frequency p20 |
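To make the time-domain indicators of Table 4.24 concrete, the following is a hedged sketch; the definitions follow common usage, and the book may use slightly different forms.

```python
# Compute several of the p1-p10 time-domain indicators for one signal segment.
import numpy as np

def time_domain_features(x):
    x = np.asarray(x, dtype=float)
    mean = x.mean()                                  # p1
    sra = np.sqrt(np.abs(x)).mean() ** 2             # p2, square root amplitude
    std = x.std()                                    # p3
    peak = np.abs(x).max()                           # p4
    skewness = ((x - mean) ** 3).mean() / std ** 3   # p5
    kurtosis = ((x - mean) ** 4).mean() / std ** 4   # p6
    rms = np.sqrt((x ** 2).mean())
    shape = rms / np.abs(x).mean()                   # p7, shape indicator
    crest = peak / rms                               # p8, crest indicator
    impulse = peak / np.abs(x).mean()                # p9, impulse indicator
    clearance = peak / sra                           # p10, clearance indicator
    return np.array([mean, sra, std, peak, skewness,
                     kurtosis, shape, crest, impulse, clearance])
```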


In addition to the different vibration amplitudes, the bearing under different statuses also exhibits different shock signals, such as those of the inner race fault with a 1 mm slice and the outer race fault with a 1 mm slice. In the corresponding demodulation spectra, the frequencies demodulated from the inner race fault and the outer race fault contain not only the same rotation frequency component (18.25 Hz) but also the respective passing frequencies (inner race: 127.5 and 382.9 Hz; outer race: 98.51 and 198.8 Hz), while the modulation of the bearing in the normal state is not obvious. Although the fault categories can be distinguished in the time and frequency domains, feature extraction is still necessary for accurate and intelligent diagnosis of the fault types. Figure 4.33 shows the distribution of the samples in different feature indicator spaces: "+" denotes the normal state, "*" the outer race fault with a 1 mm slice, and "◯" the inner race fault with a 1 mm slice. All indicators are normalized (zero mean and unit variance). From Fig. 4.33, the feature indicators differ in their value for fault diagnosis; for example, the standard deviation indicator is superior to the kurtosis, which in turn is superior to the mean. The classification performance decrease rates of the LLE and VKLLE methods (d = 3, K = 1, 2, 5, 9, k/k_max = 6, 8, 10, 12, 14, 16) are listed in Table 4.25.

Fig. 4.32 Waveform and demodulation spectrum of bearing under each status


It can be seen that the classification performance decrease rate of the VKLLE method equals that of the LLE method, except for K = 9 and k/k_max = 6, where the rate of the VKLLE method (R = 0) is lower than that of the LLE method (R = 1.667). Figure 4.34 provides the cluster results of the LLE and VKLLE methods for different k values, and the corresponding J_b values are listed in Table 4.25. At each k/k_max, the separability evaluation index J_b obtained with the VKLLE method is larger than with the LLE method, and the maximum difference is 0.184 (k/k_max = 8). Furthermore, J_b (VKLLE) is always above 0.8, with a maximum of 0.952 (k/k_max = 6), which indicates that after dimensionality reduction with the VKLLE method the inter-class distance between different classes is larger and the intra-class distance of similar samples is smaller. Compared with the LLE method, Fig. 4.34 shows that the VKLLE method obtains clear clustering results, i.e., samples of the same class are clustered together while samples of different classes are well separated, with good stability.

Fig. 4.33 Sample distribution under different feature indicator spaces (“+” denotes the normal state, “*” denotes the outer race fault with 1 mm slice, and “◯” denotes the inner race fault with 1 mm slice)


The LLE method is sensitive to the value of k, and the classification results vary greatly for different k values. For example, although the difference in k is small (k = 6 vs. k = 8), the clustering effect for k = 6 is clearly better than for k = 8. The above experiments show that the proposed VKLLE method achieves better clustering results than the LLE method.

Table 4.25 J_b values of LLE and VKLLE methods

| k/k_max | LLE (J_b) | VKLLE (J_b) |
|---|---|---|
| 6 | 0.883 | 0.952 |
| 8 | 0.716 | 0.900 |
| 10 | 0.801 | 0.912 |
| 12 | 0.729 | 0.868 |
| 14 | 0.765 | 0.856 |
| 16 | 0.730 | 0.825 |

Although the clustering of the different bearing fault statuses is improved by the VKLLE method, J_b is lower than 0.9 when k_max > 12, because the vibration signals in the test dataset are inevitably disturbed by noise, speed fluctuations, etc. Figure 4.35 provides the cluster results of the VKLLE method after noise reduction in the feature space, and the corresponding J_b values are listed in Table 4.26. From Fig. 4.35 and Table 4.26, the cluster results of the inner race fault with a 1 mm slice ("◯") are significantly improved after noise reduction in the feature space. The J_b values for the different k_max are improved correspondingly, with a maximum difference of nearly 0.1 (k_max = 16), which demonstrates the effectiveness of the noise reduction. Moreover, the fluctuation of J_b over the range of k_max values is reduced from 0.127 to 0.033, showing that the stability is also improved. The J_b values are all greater than 0.9 and higher than before noise reduction, which proves that combining noise reduction in the feature space with the VKLLE method can further improve the clustering results.

4.3.5.2 NPE-Based Tendency Analysis of Bearing

The LLE method is a nonlinear algorithm that can effectively handle nonlinear data matrices, with the advantages of fast computation and few parameters. However, LLE is a non-incremental learning algorithm and cannot effectively accommodate new samples: the low-dimensional representation of new samples has to be computed by retraining the LLE algorithm, which requires large storage space and is unsuitable for real-time monitoring. To address this problem, the neighborhood preserving embedding (NPE) algorithm with generalization capability was proposed by Cai et al. [22] based on the theory of the LLE method. The idea of NPE is to assume that a mapping transformation matrix from the high-dimensional space to the low-dimensional space exists and can be obtained from the training samples. Test data can then be projected to the low-dimensional space by multiplying with this matrix, making NPE a linear approximation of the LLE method.

The self-organizing map (SOM) neural network [23], an unsupervised clustering algorithm, was proposed by Kohonen from Finland and has been widely used in condition monitoring of mechanical components.


The principle is that, when a certain condition type is input, one neuron in the output layer receives the maximum stimulation and becomes the "winning" neuron, and the neighboring neurons also receive larger stimulation due to the lateral effect. The SOM network is trained by modifying the connection weight vectors of the "winning" neuron and its neighbors toward the direction of the input condition type.

Fig. 4.34 Cluster results of LLE method and VKLLE method on different k values ("+" denotes the normal state, "*" denotes the outer race fault with 1 mm slice, and "◯" denotes the inner race fault with 1 mm slice); panels a–l show the LLE embeddings for k = 6, 8, 10, 12, 14, 16 and the VKLLE embeddings for k_max = 6, 8, 10, 12, 14, 16, with the corresponding J_b values listed in Table 4.25

When the condition type changes, the original winning neuron in the two-dimensional output space is transferred to other neurons, so the connection weights of the SOM network can be adjusted with a large number of training samples via self-organization. As a result, the feature map of the SOM output layer reflects the distribution of the samples. A performance assessment of bearings is proposed based on the NPE and SOM methods, in which the NPE method maps the samples from the high-dimensional space to the low-dimensional space and a SOM model is trained with samples under the normal condition. Test data are fed into the trained SOM model, and the condition is determined from the deviation between the test sample and the normal samples, quantified by the minimum quantization error (MQE).
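The following is a minimal numpy sketch, not the authors' code: `som_weights` is assumed to be the codebook of a SOM already trained on normal-condition samples, and the normalization into a confidence value is one plausible choice, since the exact form is not specified here.

```python
# MQE of a test sample against a trained SOM codebook, plus one possible
# normalization of the MQE into a confidence value.
import numpy as np

def mqe(sample, som_weights):
    """som_weights: (n_units, n_features) trained SOM codebook vectors."""
    return np.linalg.norm(som_weights - sample, axis=1).min()

def confidence(mqe_value, mqe_normal_max, c=1.0):
    """Map MQE to (0, 1]; 1 means indistinguishable from the normal state."""
    return float(np.exp(-c * max(mqe_value - mqe_normal_max, 0.0)))
```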


Fig. 4.35 Cluster results of VKLLE methods after noise reduction in feature space on different k values ("+" denotes the normal state, "*" denotes the outer race fault with 1 mm slice, and "◯" denotes the inner race fault with 1 mm slice); panels a–f show k_max = 6, 8, 10, 12, 14, 16, with the corresponding J_b values listed in Table 4.26

Finally, the MQE is normalized to obtain the confidence value. The CWRU dataset was used to validate the proposed method based on NPE and SOM. The vibration signals are collected by a vibration acceleration sensor mounted on the upper housing of the induction motor output shaft at a sampling frequency of 48 kHz, and include:

Table 4.26 J_b values of VKLLE methods before and after noise reduction in feature space

| k_max | Before noise reduction (J_b) | After noise reduction (J_b) |
|---|---|---|
| 6 | 0.952 | 0.956 |
| 8 | 0.900 | 0.953 |
| 10 | 0.912 | 0.955 |
| 12 | 0.868 | 0.949 |
| 14 | 0.856 | 0.952 |
| 16 | 0.825 | 0.923 |

(1) Normal state;
(2) Rolling element fault (fault widths of 0.007 in., 0.014 in., and 0.021 in., each with a depth of 0.011 in., represented by B007, B014, and B021, respectively).

The vibration signals of the four types are collected at three different loads and speeds: 1 hp–1772 r/min, 2 hp–1750 r/min, and 3 hp–1730 r/min. Twenty sets of time-domain signals are collected at each load and speed, and the number of samples for each type is 60, so a 240 × 20 sample matrix is constructed in the high-dimensional space. The intrinsic dimension of the samples in the high-dimensional space is determined by the residual error curve proposed in [24]: the dimension at the "inflection point" of the residual curve is taken as the intrinsic dimension. Figure 4.36 provides the residual error curve over the sample dimensions; the "inflection point" appears at 3-D, so the intrinsic dimension of the samples is taken to be 3-D. Half of the samples are selected to calculate the mapping transformation matrix with the NPE method, and the remaining samples are mapped to the 3-D space by the constructed transformation matrix. Then, the SOM method is used to calculate the corresponding MQE values, from which the fault types can be determined.

Figure 4.37a provides the sample curves in the low-dimensional space obtained from the original space without noise reduction using the NPE method, where NPE1, NPE2, and NPE3 denote the first, second, and third dimensions of the low-dimensional features. In Fig. 4.37a, the feature curves fluctuate markedly and cannot effectively identify the fault types; in the NPE1 feature space, the samples of the normal state and the B007 fault type are similar and hard to distinguish, which shows that the raw sample curves are of limited use in this case. The sample curves after noise reduction are shown in Fig. 4.37b. The fluctuations of the sample curves within the same fault type become significantly smaller after noise reduction, which illustrates that samples of the same fault type are similar while samples of different fault types differ more strongly.

The MQE curves calculated with different methods are shown in Fig. 4.38. Method 1: the MQE values are calculated by inputting samples from the original space into the SOM model, as shown in Fig. 4.38a;


Fig. 4.36 Residual error curve of sample at different dimensions

Fig. 4.37 Sample curves in low-dimensional space before and after noise reduction in the original feature space

Method 2: 50% of the samples from the original space are selected to calculate the mapping transformation matrix using the NPE method, the remaining samples are mapped to the low-dimensional subspace, and the corresponding MQE values are calculated by the SOM method, as shown in Fig. 4.38b; Method 3: the samples are noise-reduced in the original feature space using the NPE method and then input to the SOM model to calculate the MQE values, as shown in Fig. 4.38c. As Fig. 4.38 shows, the different calculation methods yield different results. Method 1 (Fig. 4.38a) cannot effectively diagnose the B014 and B021 faults; the MQE value of B021 under 1 hp lies between the values of B007 and B014, which is inconsistent with the changing trend of the fault, and the MQE values of the same fault have obvious fluctuations.


Fig. 4.38 MQE curves of three calculation methods


In Method 2 (Fig. 4.38b), the discrepancy in MQE values between the normal state and B007 is obvious. Under 2 hp and 3 hp loads, most samples of B021 have higher MQE values than the other fault types, indicating that the fault is deteriorating further. However, some samples still overlap with the B014 fault, and the MQE value of B021 under 1 hp again lies between the values of B007 and B014, so the states cannot be distinguished effectively. In Method 3 (Fig. 4.38c), the samples fluctuate very little within the same fault type, the discrepancy in MQE values between the normal state and B007 further increases, and the distinction between the different states is more obvious. Under 2 hp and 3 hp loads, most samples of B021 have higher MQE values than the other fault types, indicating the deepest fault degree. Moreover, the MQE value of B021 under the 1 hp load no longer lies between the values of B007 and B014 but is similar to, and even partially higher than, B014, which indicates that the fault has started to deteriorate and is consistent with the actual degradation trend of bearings.

4.4 Fault Classification Based on Distance Preserving Projection

The idea of distance preserving projection is to first calculate the Euclidean distance between any two samples of the original database to obtain a distance matrix, from which a minimum spanning tree is derived. The minimum spanning tree is then projected from small to large and from left to right, while the distances from each sample to its nearest neighbor and to some of its other near neighbors are precisely retained in the low-dimensional space, thereby achieving dimensionality reduction.

4.4.1 Locality Preserving Projections

(1) Calculate the near neighbors: Calculate the Euclidean distance between each x_i and the rest of the data, find its k nearest points, and build a near-neighbor graph. The distance is

d(x_i, x_j) = ‖ x_i − x_j ‖   (4.20)

(2) Select the weight values:

W_ij = exp(−‖x_i − x_j‖² / t), if i and j are near neighbors (t is the heat kernel parameter); W_ij = 0 otherwise   (4.21)

(3) Calculate the eigenvectors of the generalized eigenproblem

X L X^T a = λ X D X^T a   (4.22)

where D is the diagonal matrix with D_ii = Σ_j W_ji and L = D − W.

(4) Solve Eq. (4.22) for its eigenvalues and eigenvectors. Let a_0, a_1, …, a_d be the eigenvectors corresponding to the eigenvalues λ_0 < λ_1 < ⋯ < λ_d. The low-dimensional sample is then y_i = A^T x_i, where A = (a_0, a_1, …, a_d) is the transformation matrix.
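A minimal sketch of steps (1)–(4) follows, assuming SciPy for the generalized eigenproblem of Eq. (4.22); X holds one sample per column, and a small regularizer is added because X D X^T can be singular when m > N.

```python
# Locality Preserving Projections: neighbor graph, heat-kernel weights,
# and the generalized eigenproblem of Eq. (4.22).
import numpy as np
from scipy.linalg import eigh

def lpp(X, k=5, t=1.0, d=2):
    m, N = X.shape
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(sq[i])[1:k + 1]          # k nearest neighbors, Eq. (4.20)
        W[i, nbrs] = np.exp(-sq[i, nbrs] / t)      # heat-kernel weights, Eq. (4.21)
    W = np.maximum(W, W.T)                         # symmetrize the neighbor graph
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalized eigenproblem X L X^T a = lambda X D X^T a, Eq. (4.22)
    vals, vecs = eigh(X @ L @ X.T, X @ D @ X.T + 1e-9 * np.eye(m))
    A = vecs[:, :d]                                # eigenvectors of the smallest eigenvalues
    return A.T @ X, A                              # low-dimensional samples and transform
```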

Because the transformation matrix from the high-dimensional space to the low-dimensional space is obtained from the training samples, test samples can be mapped directly to the corresponding low-dimensional space. This effectively improves the computing speed, solves the problem that LE fails to handle test samples effectively, and improves the generalization of the algorithm.

Using manifold learning to recognize mechanical status is a useful means of diagnosis. At present, in the field of mechanical fault diagnosis, the sample space processed by manifold learning is usually the feature space obtained after feature extraction from time-domain signals. The number of samples in the feature space is far smaller than the amount of data in the time-domain signal. If the noise reduction performance can be ensured, direct noise reduction of the time-domain signals can be replaced by noise reduction of the feature space, which effectively reduces the computational complexity, accelerates the calculation, and reduces the storage space. The following analyzes how time-domain signal noise transforms into the feature sample space.

(1) Transfer of noise when time-domain features are extracted from time-domain vibration signals.

Assume X(i), (i = 1, 2, …, N) is the measured time-domain vibration signal:

X(i) = Y(i) + ΔY(i)   (4.23)

where Y(i) is the ideal time-domain vibration signal without noise and ΔY(i) is the corresponding noise. A time-domain transformation is performed on X(i) to obtain time-domain feature indicators such as the peak-to-peak value, mean, variance, mean square amplitude, kurtosis, impulse indicator, etc. These feature indicators can all be expressed in the following form, on which the study of the noise transformation is based:

M = {a Σ [X(i)]^n₁}^m₁ / {b Σ [X(i)]^n₂}^m₂   (4.24)

Substituting Eq. (4.23) into Eq. (4.24),

M = {a Σ [Y(i) + ΔY(i)]^n₁}^m₁ / {b Σ [Y(i) + ΔY(i)]^n₂}^m₂   (4.25)

where a, b, n₁, n₂, m₁, m₂ are coefficients, and different coefficient combinations represent different time-domain features; for the mean square amplitude, for example, a = 1/N, b = 1/N, n₁ = 2, m₁ = 1/2, m₂ = 0. Equation (4.25) can be written as

M = ({a Σ [Y(i)]^n₁}^m₁ + {a Σ [ΔY(i)]^n₁}^m₁ + c₁) / ({b Σ [Y(i)]^n₂}^m₂ + {b Σ [ΔY(i)]^n₂}^m₂ + c₂)
  = {a Σ [Y(i)]^n₁}^m₁ / {b Σ [Y(i)]^n₂}^m₂ + c   (4.26)

where c₁, c₂, and c are remainder terms containing different levels of noise. Equation (4.26) shows that time-domain features extracted from time-domain vibration signals are the combination of an ideal feature part and a noise part. For example, the transformation of noise from the time domain to the kurtosis is:

x_q = ((1/N) Σ_{i=1}^{N} [X(i)]⁴) / (x_a)²   (4.27)

where x_a = (1/N) Σ_{i=1}^{N} [X(i)]² is the mean square value. Substituting Eq. (4.23) into Eq. (4.27),

x_q = ((1/N) Σ_{i=1}^{N} [Y(i) + ΔY(i)]⁴) / ((1/N) Σ_{i=1}^{N} [Y(i) + ΔY(i)]²)²
  = ((1/N) Σ_{i=1}^{N} [Y(i)]⁴) / ((1/N) Σ_{i=1}^{N} [Y(i)]²)² + Δ
  = x̄_q + Δ   (4.28)

where x̄_q is the kurtosis under the ideal noise-free condition and Δ is the noise term. As the above derivation shows, when a measured time-domain vibration signal containing additive noise is transformed to obtain time-domain feature indicators, the influence of the additive noise is carried into the feature indicators.

(2) Transformation of noise when the time-domain vibration signal undergoes frequency-domain analysis.

Frequency-domain features are extracted by performing the Fourier transform on the measured time-domain vibration signals, converting the time-domain signals to frequency-domain signals, and then analyzing and processing the frequency-domain signals. Therefore, studying how noise is transformed when signals pass from the time-domain space to the frequency-domain space tells us how noise is superimposed on frequency-domain feature indicators.


The transformation between the time domain and the frequency domain is

x(k) = Σ_{i=1}^{N} X(i) e^(−j2πki/N), k = 1, 2, …, N   (4.29)

Substituting Eq. (4.23) into Eq. (4.29),

x(k) = Σ_{i=1}^{N} [Y(i) + ΔY(i)] e^(−j2πki/N)
     = Σ_{i=1}^{N} Y(i) e^(−j2πki/N) + Σ_{i=1}^{N} ΔY(i) e^(−j2πki/N), k = 1, 2, …, N   (4.30)

Let y(k) = Σ_{i=1}^{N} Y(i) e^(−j2πki/N) be the Fourier transform of the ideal noise-free time-domain signal and Δy(k) = Σ_{i=1}^{N} ΔY(i) e^(−j2πki/N) be the Fourier transform of the noise part. Then Eq. (4.30) simplifies to

x(k) = y(k) + Δy(k), k = 1, 2, …, N   (4.31)

As can be seen from the above derivation process, in the process of Fourier transform, additive noise is transformed from time domain space to frequency domain space. Therefore, it is inevitable that extracting feature indicators from frequency domain space is affected by noise. As can be seen from the transformation process of additive noise, in the process of feature extraction in the time domain and frequency domain, additive noise exists still in the form of additive noise. Therefore, the effect of noise reduction on the extracted features is equivalent to the effect of direct denoising the original time domain signals. Using manifold learning such as LPP to identify mechanical status is an effective means of diagnosis, but the presence of noise seriously affects the accuracy of identification. At present, in the field of mechanical fault diagnosis, sample space processed by manifold learning is usually feature space after feature extraction of time-domain signals. The number of samples in the feature space is far less than the amount of data in the time domain signal. If noise reduction performance is ensured, direct noise reduction to time-domain signals can be replaced by noise reduction to the feature space, which can effectively reduce computational complexity, accelerate calculation speed and reduce storage space.


4.4.2 NFDPP

Figure 4.39 is a simplified illustration of how LPP and NFDPP choose sample structure information to build the weight map. NFDPP is an improved version of LPP that considers the structure information of both the nearest neighbors and the farthest neighbors of a sample, avoiding the shortcoming of LPP that only near-neighbor structure information is used while non-neighbor structure information is ignored. The essence of NFDPP is to maintain the farthest structural properties of the samples while retaining the local structure characteristics, so that the sample space preserves more of the original spatial structure information after dimensionality reduction. The specific calculation process is as follows: given the m-dimensional sample matrix X = [x₁, x₂, x₃, …, x_N] ⊆ R^m, where N is the number of samples, find a transformation matrix A that maps the N samples to the low-dimensional subspace Y = [y₁, y₂, y₃, …, y_N] ⊆ R^d (d < m).

[…] the performance of the time–frequency transform methods is in the order Continuous wavelet transform > S-transform > Short-time Fourier transform. Therefore, the wavelet transform with the Morlet wavelet basis is selected to combine with the CNN, and the resulting fault diagnosis is better and relatively stable.

5.3.3 Transmission Fault Diagnosis Under Variable Speed

In most cases during the operation of a vehicle, the speed that the engine transmits to the gearbox is not constant: the speed of the gearbox input shaft changes with time, and the vibration is more complicated than when the input shaft keeps a steady speed. Due to its unique structure, the CNN is invariant to a certain degree of translation, scaling, and distortion. Therefore, time–frequency analysis combined with the convolutional neural network is used in this section for fault diagnosis of the gearbox under variable speed.

5.3.3.1 Experimental Setup

The experiment is again carried out on the three-axis five-speed transmission, and the experimental equipment is consistent with Sect. 5.2.3. The gear fault is set on the driven wheel of the fifth gear, which is cut to different degrees to simulate three fault states: mild broken tooth, moderate broken tooth, and single broken tooth. The bearing fault is located on the inner ring of the rolling bearing of the output shaft, with a fault width of 0.2 mm.


Table 5.27 Fault conditions

| Group | Fault condition | Group | Fault condition |
|---|---|---|---|
| 1 | Fifth gear normal condition | 5 | Inner ring fault (0.2 mm width) and fifth gear normal condition |
| 2 | Fifth gear mild broken tooth | 6 | Inner ring fault (0.2 mm width) and fifth gear mild broken tooth |
| 3 | Fifth gear moderate broken tooth | 7 | Inner ring fault (0.2 mm width) and fifth gear moderate broken tooth |
| 4 | Fifth gear single broken tooth | 8 | Inner ring fault (0.2 mm width) and fifth gear single broken tooth |

Combining the gear and bearing fault states with the normal state gives a total of eight signal states to be identified, as shown in Table 5.27.

5.3.3.2 Fault Diagnosis Under the Speed-Up Condition

Variable speed means that the input shaft speed changes with time, covering three cases: speed-up, speed-down, and combined speed-up and speed-down. Since the speed-up behavior is similar to the speed-down behavior, only the speed-up case is analyzed here. Speed-up means the speed of the gearbox input shaft increases with time.

(1) Speed-up signal analysis

1. Simple analysis in the time domain and frequency domain

The sampling frequency is set to 12 kHz, and the vibration signals of each fault state and the normal state are collected. To show the overall variation of the signal with time, a 60 s time-domain signal is selected for each fault state, as shown in Fig. 5.43. Under the speed-up condition, the vibration amplitude of the signal increases with the increasing speed of the input shaft. The degree of the fault also affects the vibration amplitude: the greater the fault degree, the more intense the vibration and the greater the amplitude. Because the input shaft speed set in the experiment only follows an overall increasing trend, different fault states are not guaranteed to be at the same speed at the same instant, so at a fixed time in the time-domain diagrams of the eight states, the vibration amplitude of a severe fault may be lower than that of a mild one. The time-domain signal only reflects the vibration amplitude and the presence of impacts, and the fault state cannot be determined from the time-domain signal alone. The frequency components of the vibration signals and the corresponding vibration amplitudes can be analyzed via the frequency spectrum. The vibration signal of the fifth gear single broken tooth was selected and Fourier transformed; the spectrum is shown in Fig. 5.44.


Fig. 5.43 The time domain vibration signal of each health condition


Fig. 5.44 The frequency spectrum of the fifth gear single broken tooth condition

The frequency components in Fig. 5.44 are so numerous that the fault status cannot be determined by inspecting the spectrum. When the rotating speed changes, the rotating frequency of each shaft changes and the meshing frequencies of the gears change with it, so each frequency component occupies an interval. The meshing frequencies of the constantly meshing gear and the fifth output gear, f_m1 and f_m2, are

f_m1 = z₁ × f_n1   (5.32)

f_m2 = z₄ × f_n3   (5.33)

where z₁ = 26 is the number of teeth of the constantly meshing drive gear, f_n1 is the input shaft rotating frequency, z₄ = 22 is the number of teeth of the fifth-gear driven wheel, and f_n3 is the output shaft rotating frequency. When the input shaft speed slowly rises from 0 to 1500 rpm, f_n1 gradually increases from 0 to 25 Hz. The maximum meshing frequencies of the constantly meshing gear and the fifth output gear, f_m1max and f_m2max, are

f_m1max = z₁ × f_n1max = 26 × 25 = 650 Hz   (5.34)

f_m2max = z₄ × f_n3max = 22 × f_n1max × (26/38) × (42/22) = 718.42 Hz   (5.35)

The meshing frequency of the constantly meshing gear thus ranges over 0–650 Hz, and that of the fifth gear over 0–718.42 Hz. If the two gear pairs also have a second-order vibration response, the meshing frequency ranges become 0–1300 Hz and 0–1436.84 Hz, respectively. Because the meshing frequency changes with the rotating speed and spreads over a range, it easily masks the fault frequency components. Therefore, fault diagnosis from the frequency spectrum is difficult under variable-speed conditions.


Fig. 5.45 The time–frequency graph of the signal from the 20 to 40 s in the condition of gear single broken tooth

2. Time–frequency analysis

When combined with the convolutional neural network algorithm, the time–frequency transform methods rank as Continuous wavelet transform > S-transform > Short-time Fourier transform. Therefore, the continuous wavelet transform is used as the time–frequency transform, again with the Morlet wavelet as the basis. To observe the time–frequency trend of the whole signal under variable speed, the time-domain signal of the gear single broken tooth condition from 20 to 40 s is intercepted and continuous wavelet transformed; the resulting time–frequency diagram is shown in Fig. 5.45. There is a large-amplitude band near 400 Hz, which is the gear meshing frequency component. As the rotating speed increases, the meshing frequency slowly increases, appearing in the time–frequency diagram as a band resembling a line with a small positive slope. The energy-concentrated bands around 1500 and 2500 Hz are the natural frequency components of the gearbox system excited by the impact components; their frequency values are not affected by the rotating speed, but their amplitudes increase with it. In addition, the number of impacts per unit time increases with the rotating speed and the number of meshing events, so the impact components become denser in the time–frequency diagram. To illustrate the difference between samples of the same kind, two signals of the same length at different times are time–frequency transformed: two 0.5 s segments at the 10th and 60th s of the fifth gear single broken tooth signal undergo the continuous wavelet transform, and the time–frequency diagrams are shown in Fig. 5.46.

Fig. 5.46 The time–frequency graphs of 0.5 s segments of the fifth gear single broken tooth condition: a the signal at the 10th s; b the signal at the 60th s

As can be seen in Fig. 5.46, the amplitude of the time–frequency diagram at 60 s is significantly larger than at 10 s because of the increasing input shaft speed. The impacts also become more frequent with increasing speed, since more teeth enter the mesh per unit time. When the time–frequency images of the different fault signals are recognized, each sample is the amplitude matrix of the time–frequency image for a short, fixed-length period. Under constant speed, samples of the same class are essentially identical apart from noise and random factors. Under variable speed, the gear meshing frequency, the bearing fault characteristic frequencies, the impact frequency, and the amplitudes of the frequency components all change with time, yet remain broadly similar. Given the CNN's invariance to image translation, scaling, and distortion, the convolutional neural network is used for time–frequency image recognition under variable speed.

(2) Time–frequency image analysis

From time zero to time t₁, the rotational speed of the input shaft increases from 0 to n₁. Taking a moment t₀ within this period as the boundary, the time–frequency graphs of the signals in 0–t₀ form the training sample set (500 training samples), and those of the t₀–t₁ period form the test sample set (500 test samples). The time–frequency graphs are resized to 32 × 32 so that they can be used as inputs of the convolutional neural network. For image recognition there are eight classes of time–frequency graphs, each with 1000 samples, split equally into training and test samples.
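A minimal sketch follows, assuming PyWavelets and SciPy are available: one 0.5 s vibration segment is converted into a 32 × 32 Morlet-CWT amplitude image of the kind used as CNN input; the scale count and normalization are assumptions.

```python
# Build a 32 x 32 time-frequency image from a vibration segment via the CWT.
import numpy as np
import pywt
from scipy.ndimage import zoom

FS = 12_000                              # sampling frequency from the text (Hz)

def segment_to_tf_image(segment, n_scales=64, out_size=(32, 32)):
    scales = np.arange(1, n_scales + 1)
    coef, _freqs = pywt.cwt(segment, scales, "morl", sampling_period=1.0 / FS)
    amp = np.abs(coef)                   # amplitude matrix (scales x time)
    img = zoom(amp, (out_size[0] / amp.shape[0], out_size[1] / amp.shape[1]))
    return (img - img.min()) / (img.max() - img.min() + 1e-12)

image = segment_to_tf_image(np.random.randn(FS // 2))   # placeholder 0.5 s segment
print(image.shape)                       # (32, 32), ready as CNN input
```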


Three deep learning algorithms, represented by the convolutional neural network, the deep belief network, and the stacked autoencoder, were used to identify the time–frequency images under speed-up conditions. Based on the earlier analysis of the CNN parameters, the network was configured with 6 convolution kernels of size 5 × 5 in the first convolutional layer and 12 kernels of size 3 × 3 in the second; the pooling area of the down-sampling layers is 2 × 2 with average pooling, the batch size is 5, and training runs for 10 iterations. The parameters of the deep belief network are: 1024 input nodes, 8 output nodes, and hidden layers of 1000, 800, and 500 nodes. In the pre-training phase, each Restricted Boltzmann Machine is trained with unlabeled data for 100 iterations; in the fine-tuning phase, the pre-trained network and a softmax classifier are combined into a classification model, which is fine-tuned by backpropagation with labeled data for 50 iterations. The parameters of the stacked autoencoder are: 1024 input nodes, 8 output nodes, and three hidden layers of 400, 200, and 50 nodes, with 100 pre-training iterations. Because the computation time depends on the algorithm structure, the classification accuracy is evaluated every 20 fine-tuning iterations up to 200 iterations. Each experiment was repeated 5 times, and the average of the 5 classification results was taken as the final classification accuracy; the relationship between the classification accuracy of the three deep learning algorithms and the number of iterations is shown in Fig. 5.47.

Figure 5.47a shows the training and test classification accuracy of the convolutional neural network for time–frequency image recognition under speed-up conditions. When the number of iterations increased from 1 to 2, the accuracy on both training and test samples increased significantly. After four iterations the training accuracy exceeded 99%, and after six iterations it reached 99.9%, indicating that the CNN fits the training samples very well and that additional iterations add little. For the test samples, the accuracy stays above 90% after four iterations but fluctuates slightly rather than increasing with the number of iterations, reaching a maximum of 98% after 7 iterations.

The deep belief network underwent both pre-training and fine-tuning, and Fig. 5.47b shows the training and test accuracy versus the number of fine-tuning iterations. The training accuracy increased from 15.01 to 97.7% and the test accuracy from 13.98 to 86.78% as the number of fine-tuning iterations went from 1 to 2, and both continued to rise slightly thereafter. At 12 iterations, the training accuracy is nearly 100% and the test accuracy 93.9%. After 27 iterations the test accuracy remained above 96%, reaching its maximum of 96.71% after 40 iterations.
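A hedged PyTorch sketch of the CNN configured above follows (6 kernels of 5 × 5, 12 kernels of 3 × 3, 2 × 2 average pooling, 32 × 32 input, 8 classes); the activation function and other layer details beyond those stated in the text are assumptions.

```python
# CNN matching the stated configuration; spatial sizes annotated per layer.
import torch
import torch.nn as nn

class GearboxCNN(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.Sigmoid(),                      # activation is an assumption
            nn.AvgPool2d(2),                   # 28x28 -> 14x14
            nn.Conv2d(6, 12, kernel_size=3),   # 14x14 -> 12x12
            nn.Sigmoid(),
            nn.AvgPool2d(2),                   # 12x12 -> 6x6
        )
        self.classifier = nn.Linear(12 * 6 * 6, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = GearboxCNN()(torch.randn(5, 1, 32, 32))   # batch size 5, as in the text
print(logits.shape)                                # torch.Size([5, 8])
```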

Fig. 5.47 The accuracy of different algorithms under the speed-up working condition: a Convolutional Neural Network; b Deep Belief Network; c Stacked Autoencoder

The stacked autoencoder is also pre-trained and fine-tuned. Figure 5.47c shows the training and test accuracy versus the number of fine-tuning iterations. Because all training samples are input at once and the network fits the whole training set, the training accuracy is 100% after pre-training. A value of 0 on the horizontal axis represents the classification accuracy without fine-tuning, which is 84.72% after 100 pre-training iterations. After 60 fine-tuning iterations, the test accuracy stays above 98%, reaching its maximum of 99.43% at 120 iterations. On the whole, once the number of iterations or fine-tuning iterations is sufficient, the training accuracy of all three deep learning algorithms essentially reaches 100%, showing that the three algorithms can fully fit the training samples. The highest test accuracy of each of the three algorithms is above 96%, showing that all three deep learning algorithms can classify and identify the gearbox fault time–frequency diagrams under speed-up conditions. Comparing the three algorithms, the stacked autoencoder has the highest test accuracy at 99.43%, followed by the convolutional neural network at 98%, while the maximum test accuracy of the deep belief network is 96.71%.


Table 5.28 The highest training and test accuracy of the four algorithms

| Algorithm | Highest training accuracy (%) | Highest test accuracy (%) |
|---|---|---|
| CNN | 100 | 98 |
| DBN | 100 | 96.71 |
| SAE | 100 | 99.43 |
| SVM | 99.975 | 46.025 |

As far as the stability of the algorithms is concerned, both the deep belief network and the stacked autoencoder are stable, while the convolutional neural network fluctuates slightly. Considering the computation time needed to reach the maximum test accuracy, a single run of the convolutional neural network takes 88.9 s, about one-fifth of the deep belief network's time and one-eleventh of the stacked autoencoder's. It follows that the convolutional neural network can effectively identify time–frequency images of gearbox faults under speed-up conditions; although its test accuracy fluctuates and it is not as stable as the deep belief network or the stacked autoencoder, its time cost is far lower than the other two algorithms. The support vector machine, a shallow machine learning algorithm, is also applied to time–frequency image recognition. The SVM is set up as follows: the input data are first formatted with the tools in LIBSVM [23], the radial basis function is used as the kernel function, the penalty coefficient C and the kernel parameter g are selected by cross-validation and grid search, and the whole training sample set is then trained with the best parameters to obtain the SVM model. Table 5.28 shows the highest training and test classification accuracy of the support vector machine and the three deep learning algorithms. The optimized SVM achieves a maximum training accuracy of 99.975% on the time–frequency images, so the training process succeeds; however, its highest test accuracy is only 46.025%, far below that of the three deep learning algorithms. This shows not only that the deep learning algorithms outperform the shallow support vector machine on the whole, but also that the SVM is unsuitable for time–frequency image recognition under speed-up conditions.
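A hedged sketch of the SVM baseline follows: scikit-learn's SVC wraps LIBSVM, so a cross-validated grid search over C and g (gamma) with an RBF kernel approximates the procedure described above; the grid values themselves are assumptions.

```python
# Grid-search an RBF-kernel SVM over C and gamma with 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(400, 1024)            # flattened 32 x 32 images (placeholder)
y = np.random.randint(0, 8, size=400)    # the eight fault labels (placeholder)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": 2.0 ** np.arange(-5, 16, 2),
                "gamma": 2.0 ** np.arange(-15, 4, 2)},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```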

5.3.3.3 Fault Diagnosis Under the Condition of Speed-Up and Speed-Down

The vibration under the speed-up and speed-down condition is similar to that under the speed-up condition: the higher the rotating speed, the larger the vibration amplitude, the more frequent the impacts, and the higher the gear meshing frequency.


Fig. 5.48 The rotational speed curve of the input shaft

For the time–frequency samples, each sample is the time–frequency map of a short period during which the rotating speed can be regarded as constant, and the fault settings are identical to the speed-up condition, so the time-domain signals and time–frequency diagrams under the speed-up and speed-down condition are not analyzed again. The specific speed profile of the input shaft is: a high speed value n₁ is set at time t₁, the speed is then gradually reduced to n₂ at time t₂, increased again to n₃ at time t₃, and finally gradually reduced to n₄ at time t₄. Figure 5.48 shows how the input shaft speed varies with time. The moment t₀ in the figure is the dividing line between the training and test samples: the time–frequency graphs of the signals in the period t₁–t₀ form the training sample set (500 training samples), and those in the period t₀–t₄ form the test sample set (500 test samples). After collecting the vibration signals of the fault and normal states, a large number of time–frequency maps are obtained via the time–frequency transform and resized to 32 × 32 for use as CNN input samples. The time–frequency transform is again the continuous wavelet transform with the Morlet wavelet basis. For image recognition there are eight classes of time–frequency graphs, each with 1000 samples, half for training and half for testing. The convolutional neural network, deep belief network, and stacked autoencoder are again used to identify the time–frequency images under speed-up and speed-down conditions, with network parameters identical to the speed-up condition. Each experiment was repeated 5 times, and the average of the 5 classification results was taken as the final classification accuracy; the training and test accuracy of the different algorithms versus the number of iterations is shown in Fig. 5.49.

Fig. 5.49 The accuracy of different algorithms under the speed-up and speed-down working condition: a Convolutional Neural Network; b Deep Belief Network; c Stacked Autoencoder

Figure 5.49a shows the training and test classification accuracy of the CNN for time–frequency image recognition under speed-up and speed-down conditions. For up to 5 iterations, the accuracy of both the training and the test samples increases with the number of iterations. After 8 iterations, the training accuracy stays above 99.9%. For the test samples, beyond 5 iterations the accuracy no longer increases with the number of iterations but fluctuates noticeably around 90%, reaching a maximum of 96% at eight iterations. Figure 5.49b shows the training and test accuracy of the DBN versus the number of fine-tuning iterations. The training accuracy increased from 12.84 to 97.61% and the test accuracy from 10.79 to 62.8% as the iterations went from 1 to 2, and both continued to increase with further fine-tuning. At 17 iterations, the training accuracy is nearly 100% and the test accuracy 89.93%. After 27 iterations the test accuracy stabilized, reaching its maximum of 93.05% after 49 iterations. Figure 5.49c shows the training and test accuracy of the stacked autoencoder versus the number of fine-tuning iterations. A value of 0 on the horizontal axis represents the classification accuracy without fine-tuning, which is 93.48% after 100 pre-training iterations.


Table 5.29 The highest training and test accuracy of the four algorithms

| Algorithm | Highest training accuracy (%) | Highest test accuracy (%) |
|---|---|---|
| CNN | 100 | 96 |
| DBN | 100 | 93.05 |
| SAE | 100 | 97.55 |
| SVM | 97.75 | 71.65 |

After 120 fine-tuning iterations, the test accuracy stays above 96%, reaching its maximum of 97.55% at 160 iterations. The highest classification accuracy of each of the three deep learning algorithms is above 93%, which shows that all three can recognize the gearbox fault time–frequency graphs under speed-up and speed-down conditions. Considering the maximum test accuracy of each algorithm, the stacked autoencoder is highest at 97.55%, followed by the convolutional neural network at 96%, while the maximum test accuracy of the deep belief network is 93.05%. As far as stability is concerned, both the DBN and the stacked autoencoder are stable, while the CNN fluctuates slightly. The time–frequency images were also identified with the support vector machine; Table 5.29 shows the highest training and test accuracy of the SVM and the three deep learning algorithms. With the optimized SVM, the highest test accuracy is 71.65%, lower than that of the three deep learning algorithms, again showing that the deep learning algorithms outperform the shallow support vector machine in identifying time–frequency images under speed-up and speed-down conditions.

5.4 Deep Learning Based Equipment Degradation State Assessment

Fatigue damage produced by mechanical and thermal stress often occurs in large machines and prevents them from working normally. Condition monitoring-based PHM for power equipment has received wide attention in recent years; it consists of multiple parts: condition monitoring and data acquisition, feature extraction and selection, fault diagnosis and health assessment, system maintenance policy, etc. Based on multi-sensor monitoring, fault-correlated features are extracted from multiple types of monitoring signals (vibration, temperature, oil pressure, acoustic emission, electrical signals, etc.) for fault detection and residual life prediction, which reduces the cost of system maintenance and effectively avoids shutdown accidents.


Fig. 5.50 The architecture of an AE

When facing the uncertainty of equipment degradation, shallow networks struggle to extract deep degradation features in a highly abstract way and only obtain general shallow features. Assessing equipment performance degradation involves adjusting and adapting the parameters of the evaluation model for different equipment, while also requiring the model to perform deep feature extraction and selection on the collected signals. The DL-based method can accomplish this assessment. To address two challenging problems in degradation state assessment, extracting multi-dimensional features and modeling the correlation of degraded time-series signals, a novel DAE-LSTM based equipment degradation state assessment method is introduced in this section. A case study of milling tool degradation state assessment is presented at the end.

5.4.1 Stacked Autoencoder

5.4.1.1 AutoEncoder

An AutoEncoder (AE) [24] is a particular type of neural network whose architecture is shown in Fig. 5.50. The network learns the mapping relationship g(σ(X)) ≈ X to minimize the reconstruction error, where σ(X) denotes the encoder and g(Y) the decoder. Specifically, the encoder completes the nonlinear transformation from the input X to the code Y, and the decoder reconstructs an output Z from Y that approximates X.
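As an illustration only, a minimal AE sketch in Keras (the library referenced in [27]) follows; the 64-dimensional input, 16-dimensional code, activations and training settings are illustrative assumptions, not values from the text:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Encoder sigma(X) compresses the input; decoder g(Y) reconstructs it.
inputs = keras.Input(shape=(64,))                                       # X
code = layers.Dense(16, activation="sigmoid", name="encoder")(inputs)   # Y = sigma(X)
outputs = layers.Dense(64, activation="linear", name="decoder")(code)   # Z = g(Y)
autoencoder = keras.Model(inputs, outputs)

# Training minimizes the reconstruction error ||g(sigma(X)) - X||.
autoencoder.compile(optimizer="adam", loss="mse")
X = np.random.rand(1000, 64).astype("float32")   # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)
```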

5.4.1.2 Network Architecture

Similar to the DBN, the SAE [14] is a stack of multiple AEs, as shown in Fig. 5.51. Specifically, the output of each AE, trained in an unsupervised way, is used as the input of the next AE, and features are extracted from low level to high level by multilayer learning.


Fig. 5.51 Stacked AutoEncoder

The procedure of multilayer learning can be summarized in the following steps: first, after the first AE (AE1) is trained, feature 1, the output of its hidden layer, is used as the input of the second AE (AE2); second, feature 2 is obtained by training AE2 in an unsupervised way; third, all the AEs are trained by repeating the above steps; finally, the SAE network with multiple hidden layers is constructed. Furthermore, a classification layer is appended after the final feature layer of the SAE network to perform classification, after which a deep network with the abilities of feature extraction and data classification is obtained by training the whole network in a supervised way.
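This greedy layer-wise procedure can be sketched as follows; the layer sizes, the placeholder data and the four-class output are illustrative assumptions, not from the text:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def train_ae(data, code_dim, epochs=10):
    """Train one AE unsupervised; return its encoder and hidden-layer features."""
    inp = keras.Input(shape=(data.shape[1],))
    hidden = layers.Dense(code_dim, activation="sigmoid")(inp)
    recon = layers.Dense(data.shape[1], activation="linear")(hidden)
    ae = keras.Model(inp, recon)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(data, data, epochs=epochs, batch_size=32, verbose=0)
    encoder = keras.Model(inp, hidden)
    return encoder, encoder.predict(data, verbose=0)

X = np.random.rand(1000, 64).astype("float32")                  # placeholder data
y = keras.utils.to_categorical(np.random.randint(0, 4, 1000))   # placeholder labels

# Steps 1-3: each AE is trained on the previous AE's hidden-layer output.
enc1, feat1 = train_ae(X, 32)      # AE1 -> feature 1
enc2, feat2 = train_ae(feat1, 16)  # AE2 -> feature 2

# Final step: stack the trained encoders, append a classification layer,
# and fine-tune the whole network in a supervised way.
sae = keras.Sequential([enc1, enc2, layers.Dense(4, activation="softmax")])
sae.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
sae.fit(X, y, epochs=10, batch_size=32, verbose=0)
```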

5.4.2 Recurrent Neural Network

The Recurrent Neural Network architecture was independently proposed by Jordan [6] and Elman [7]; it is characterized by allowing the output of some nodes to affect the subsequent input to the same nodes, which can be denoted by:

$$Y_t = \mathrm{Fun}(X_t, Y_{t-1}) \tag{5.36}$$


Fig. 5.52 A simplified architecture of a three-layer Recurrent Neural Network

A simplified architecture of a three-layer Recurrent Neural Network is shown in Fig. 5.52. The forward propagation of an RNN is very similar to that of a traditional neural network, except that it introduces historical data. For back propagation, however, the traditional BP algorithm cannot be used for model training, because RNNs introduce calculation through time. Therefore, Back Propagation Through Time (BPTT) [25] was developed from the traditional BP algorithm. When BPTT is used to train an RNN, the procedure can be summarized as follows: first, the output of each neuron is calculated by forward propagation; second, the error term of each neuron is calculated by BPTT; finally, the weights are trained with the gradient descent method. The weight matrix of the RNN is shared through time, so continued multiplication appears through the chain rule when computing the derivatives. This continued multiplication may cause back-propagated errors to vanish or explode when the time step is large, in which case the RNN cannot be trained. Long Short-Term Memory (LSTM) [26] was therefore proposed to solve this problem, in which the nodes of the traditional RNN are reconstructed as shown in Fig. 5.53. An increased number of loop layers in an RNN may push the activation functions into the region of saturated gradients. To reduce this risk, the input gate, output gate, and forget gate are introduced in the LSTM to control the magnitude of the calculations in the network; these gates also benefit parameter optimization when the network contains more layers. The function and procedure of these gates are summarized as follows.

Fig. 5.53 The architecture of LSTM


First, the input gate decides what information can be added to the network; its procedure is as follows: (1) the candidate vector g_s is obtained by passing the current input x_t and the hidden state h_{t-1} through the tanh function; (2) the vector i_s is obtained by passing the same current input and hidden state into the sigmoid function, which maps values to between 0 and 1. Here, i_s decides what information of g_s is fed into the next calculation. Second, the forget gate decides what information to focus on and what to ignore: the vector f_s is obtained by passing the current input x_t and hidden state h_{t-1} into the sigmoid function, and f_s decides what information of S_{t-1} should be preserved. Finally, the output gate decides what information of S_t is passed to the next layer: the vector O_s is obtained by passing the current input x_t and hidden state h_{t-1} into the sigmoid function.
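The gate operations just described correspond to the standard LSTM formulation of Hochreiter and Schmidhuber [26]. A compact summary follows; the weight matrices W and bias vectors b are assumed notation (the text does not define them), σ denotes the sigmoid function, and ⊙ denotes element-wise multiplication:

$$\begin{aligned}
g_s &= \tanh\big(W_g[h_{t-1}, x_t] + b_g\big) && \text{candidate vector}\\
i_s &= \sigma\big(W_i[h_{t-1}, x_t] + b_i\big) && \text{input gate}\\
f_s &= \sigma\big(W_f[h_{t-1}, x_t] + b_f\big) && \text{forget gate}\\
O_s &= \sigma\big(W_o[h_{t-1}, x_t] + b_o\big) && \text{output gate}\\
S_t &= f_s \odot S_{t-1} + i_s \odot g_s && \text{cell state update}\\
h_t &= O_s \odot \tanh(S_t) && \text{hidden state passed onward}
\end{aligned}$$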

5.4.3 DAE-LSTM Based Tool Degradation State Assessment

As noted above, when facing the uncertainty of the equipment degradation state, a shallow network can hardly extract deep degradation features in a highly abstract way and obtains only general shallow features. As a class of typical deep learning problems, the assessment of equipment performance degradation involves adjusting and adapting the evaluation model's parameters for different equipment while requiring the model to perform deep feature extraction and selection on the collected signals. However, feature selection and dimension reduction become more difficult because of the significant information duplication in the monitoring data of equipment performance. The Deep Auto-Encoder (DAE) [27] helps solve this problem: it contains multiple hidden layers that can learn from the training data in an unsupervised way and achieve a good reconstruction. Meanwhile, to obtain a quantitative judgment of the equipment degradation state, the cross-correlation of the time-series data should be combined with the degradation assessment model when comprehensively judging equipment performance. To solve these two problems, the extraction and reduction of multi-dimensional features and the time-series correlation modeling of degradation signals, a DAE-LSTM based equipment degradation assessment method was proposed. Its procedure is as follows: first, the feature extractor is obtained by unsupervised self-learning dimension reduction and supervised reverse fine-tuning; second, the optimized feature sequence is used as the input of the LSTM; finally, the cross-correlation of the degradation process information is captured by the LSTM, and the complete information of the equipment degradation process data is used to quantitatively evaluate the degradation state.

The procedure of the DAE-LSTM based degradation assessment method is shown in Fig. 5.54. First, statistical features are extracted from the multi-sensor monitoring signals, and the degradation feature dataset in the training set is used as the input of the DAE.


Second, the DAE extracts, in an unsupervised self-learning way, the low-dimensional degradation signal highly related to the fault from the high-dimensional feature signal. To ensure maximum correlation between the dimension-reduction coding and the fault features, the weight parameters of the DAE are adjusted by fine-tuning. Finally, the DAE coding, after parameter fine-tuning, is arranged in chronological order and used as the input of the LSTM.

Two further problems arise when constructing the DAE and LSTM networks: (1) the number of nodes cannot be selected while the number of layers is undetermined; (2) the network parameters become excessive when there are too many layers. The method of stacking the middle hidden layers is therefore used to reduce the middle-layer configuration to two parameters: the number of middle hidden layers and the number of nodes per middle hidden layer. The network structure parameters can then be determined by Particle Swarm Optimization (PSO) [28].

To verify the effectiveness of the proposed DAE-LSTM based equipment degradation assessment method on industrial data, milling tool wear data were analyzed. The experimental data come from the NASA Ames Research Center and include 16 sets of monitoring data on tool wear degradation [29]. Each set contains signals of different modalities acquired during tool wear, such as vibration signals, acoustic emission signals and current signals; the signals were collected at a sampling frequency of 250 Hz. The working conditions and the final wear of the milling tool in each group are shown in Table 5.30.

The raw time-domain signals obtained by each monitoring sensor during the first sampling of CASE1 are shown in Fig. 5.55; they are, respectively, the current signal of the AC spindle motor, the current signal of the DC spindle motor, the vibration signal of the working table, the vibration signal of the machine spindle, the acoustic emission signal of the working table, and the acoustic emission signal of the machine spindle. From the time-domain data it can be seen that the milling cutter passes through an entry phase, a stable cutting phase and an exit phase during cutting. The signals of the stable cutting phase are selected for analysis. Four time-domain features, the effective (RMS) value, absolute mean value, variance and peak value, are extracted from the six sensor channels. The training set formed from these four time-domain features is then used as the training samples of the DAE, which performs feature extraction and dimension reduction. To determine the network structure parameters, the reconstruction error after dimension-reduction coding is used as the fitness for the particle swarm algorithm's parameter updates. The regression model output of the dimension-reduction features is shown in Fig. 5.56, where the regression label is the wear of the milling cutter. When performing feature extraction and dimension reduction on the CASE1 data, the other 15 CASEs were used as the training set; the extracted CASE1 features were passed into the regression model, and the output is shown in Fig. 5.56a. Likewise, when processing the CASE2 data, the other 15 CASEs were used as the training set; the extracted CASE2 features were passed into the regression model, and the output is shown in Fig. 5.56b.


Fig. 5.54 The multi-dimensional feature and DAE-LSTM based method for equipment performance degradation assessment


Table 5.30 The working conditions and wear status of the milling tool data

CASE               1          2          3          4          5      6      7      8
Number of samples  17         14         16         7          6      1      8      6
Wear status        0.44       0.55       0.55       0.49       0.74   0      0.46   0.62
Cutting depth      1.5        0.75       0.75       1.5        1.5    1.5    0.75   0.75
Cutting speed      0.5        0.5        0.25       0.25       0.5    0.25   0.25   0.5
Material           Cast iron  Cast iron  Cast iron  Cast iron  Steel  Steel  Steel  Steel

CASE               9          10         11         12         13     14     15     16
Number of samples  9          10         23         15         15     10     7      6
Wear status        0.81       0.7        0.76       0.65       1.53   1.14   0.7    0.62
Cutting depth      1.5        1.5        0.75       0.75       0.75   0.75   1.5    1.5
Cutting speed      0.5        0.25       0.25       0.5        0.25   0.5    0.25   0.5
Material           Cast iron  Cast iron  Cast iron  Cast iron  Steel  Steel  Steel  Steel

From the regression output, it can be seen that the information retained by the dimension-reduction coding is highly correlated with the degree of wear. After training and fine-tuning with labeled data, the dimension-reduction coding follows the same trend as the degree of wear, though with a deviation in amplitude, which indicates that the DAE is effective for feature extraction and dimension reduction of multi-dimensional sensor feature sets. Therefore, the low-level network of the regression model, i.e., the DAE, can be retained as the feature extractor for newly monitored degradation data.

Once the DAE feature extractor has been constructed, the dimension-reduction code, after the time step is set, is used as the input of the LSTM, and the degree of wear is used as the label of the degradation degree. The DAE is trained on all data except CASE1 and then used to perform feature extraction and dimension reduction on the CASE1 data; furthermore, the data except CASE1 are used to train the LSTM with labels. The model prediction output for the CASE1 data is shown in Fig. 5.57a, and, in the same way, the model prediction output for the CASE2 data is shown in Fig. 5.57b. The experimental results show that a deep network highly fitted to the training data can be obtained by training the deep network as a feature extractor. When its regression layer is removed, the low-level network still extracts features well on similar test data, which indicates that the deep network learns general features in its shallow layers and further extracts and learns from these shallow features. In the end, a deep learning model suitable for equipment degradation assessment can be obtained by passing the low-level feature coding into other deep network models for post-processing.


Fig. 5.55 The time domain signals of the monitoring sensors for the milling tool conditions

Fig. 5.56 The output of the regression model


Fig. 5.57 The diagnostic results of DAE-LSTM


References

1. Hof, R.D.: 10 Breakthrough Technologies 2013. MIT Technology Review, 23 Apr 2013
2. Hinton, G.E., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)
3. Bengio, Y.: Learning deep architectures for AI. Foundations Trends Mach. Learn. 2(1), 1–127 (2009)
4. Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: International Conference on Pattern Recognition (ICPR 2012) (2012)
5. Le, Q.V., Ranzato, M., Monga, R., Ng, A.Y., et al.: Building high-level features using large scale unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning (2012)
6. Jordan, M.I.: Serial order: a parallel distributed processing approach. Adv. Psychol. 121, 471–495 (1997)
7. Elman, J.L.: Finding structure in time. Cogn. Sci. 14(2), 179–211 (1990)
8. Yu, K., Jia, L., Chen, Y., et al.: Deep learning: yesterday, today, and tomorrow. J. Comput. Res. Dev. 50(9), 1799–1804 (2013)
9. Jones, N.: Computer science: the learning machines. Nature 505(7482), 146–148 (2014)
10. Tamilselvan, P., Wang, P.: Failure diagnosis using deep belief learning based health state classification. Reliab. Eng. Syst. Saf. 115, 124–135 (2013)
11. Tran, V.T., Thobiani, F.A., Ball, A.: An approach to fault diagnosis of reciprocating compressor valves using Teager–Kaiser energy operator and deep belief networks. Expert Syst. Appl. 41, 4113–4122 (2014)
12. Hinton, G.E., Sejnowski, T.J.: Learning and Relearning in Boltzmann Machines, vol. 1, pp. 282–317. MIT Press, Cambridge (1986)
13. Smolensky, P.: Information Processing in Dynamical Systems: Foundations of Harmony Theory, vol. 1, pp. 194–281 (1986)
14. Bengio, Y., Lamblin, P., Popovici, D., et al.: Greedy layer-wise training of deep networks. Proc. Adv. Neural Inf. Process. Syst. 19, 153–160 (2007)
15. Ma, D.: Research on Image Retrieval Based on Deep Learning. Master's thesis, Inner Mongolia University (2014)
16. Xiao, H., Cai, C.: Comparison study of normalization of feature vector. Comput. Eng. Appl. 45(22), 117–119 (2009)


17. Liu, S.: Study on data normalization in BP neural network. Mech. Eng. Autom. 3, 122–123 (2010)
18. Liu, H., Wang, H., Li, X.: A study on data normalization for target recognition based on RPROP algorithm. Modern Radar 5, 55–60 (2009)
19. Yang, J., Zhang, D., Yang, J.Y.: Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 26(1), 131–137 (2004)
20. Lin, H.: Anti-noise Performance of Discrete Spectrum Correction Theories and Their Application in Engineering. PhD thesis, South China University of Technology (2010)
21. Li, B., Liu, P., Hu, R., et al.: Fuzzy lattice classifier and its application to bearing fault diagnosis. Appl. Soft Comput. 12(6), 1708–1719 (2012)
22. LeCun, Y., Bottou, L., Bengio, Y., et al.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
23. Suk, H.I., Lee, S.W., Shen, D.: Latent feature representation with stacked auto-encoder for AD/MCI diagnosis. Brain Struct. Funct. 220(2), 841–859 (2015)
24. Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59(4–5), 291–294 (1988)
25. Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550–1560 (1990)
26. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
27. The Keras Blog: Building autoencoders in Keras. https://blog.keras.io/building-autoencoders-in-keras.html
28. Kennedy, J.: Particle swarm optimization. In: Encyclopedia of Machine Learning, pp. 760–766. Springer US (2011)
29. Agogino, A., Goebel, K.: Milling Data Set. BEST Lab, UC Berkeley; NASA Ames Prognostics Data Repository, NASA Ames Research Center, Moffett Field, CA (2007). http://ti.arc.nasa.gov/project/prognostic-data-repository

Chapter 6

Phase Space Reconstruction Based on Machinery System Degradation Tracking and Fault Prognostics

6.1 Phase Space Reconstruction

In recent years, chaotic time series analysis has been widely used in many fields, such as mathematics, physics, meteorology, information science, economics and biology, and it has become one of the frontier topics of nonlinear science. Chaos theory links determinism and randomness, two traditionally independent and contradictory concepts: it holds that deterministic laws lie behind phenomena regarded as random and irregular. Theoretically, chaotic dynamical systems theory can establish a deterministic mathematical model to describe stochastic and irregular systems, thus providing a deterministic theoretical framework to explain complex phenomena in the real world [1].

In 1980, Packard et al. [2] first proposed reconstructing the phase space of a nonlinear time series to study its nonlinear dynamics, pioneering the use of one-dimensional time series to study chaotic phenomena in complex dynamic systems. Takens [3] then proposed reconstructing the phase space of a nonlinear time series with the delayed-coordinate method and proved mathematically that the reconstructed phase space preserves the dynamic characteristics of the original system; this result is the Takens embedding theorem [3]. Takens' theorem makes it possible to relate a theoretical abstract object, such as a chaotic dynamical system, to a time series measured in practical engineering from a strictly mathematical point of view. Researchers thus do not need to model complex systems directly; by studying the measured time series alone, the properties of chaotic systems can be studied equally well, since their intrinsic dynamical properties are preserved in a mathematically meaningful way. At present, the commonly used nonlinear characteristic parameters reflecting chaotic time series, such as the correlation dimension, Lyapunov exponent and Kolmogorov entropy, are all extracted from the reconstructed phase space of the measured series. Phase space reconstruction is therefore the key to processing nonlinear time series. In this chapter, the Takens embedding theorem and the delayed-coordinate phase space reconstruction method are briefly introduced; then the selection methods for the two reconstruction parameters, the delay time and the embedding dimension, are introduced in detail, which lays the foundation for nonlinear feature extraction, degradation tracking and fault prediction of mechanical rotating parts.

6.1.1 Takens Embedding Theorem

According to chaos theory, the evolution of any component of a system is determined by its interaction with the other components, so the development of any component contains information about the other related components. In other words, a univariate time series formed by one variable of the system is the result of the interaction of many other relevant physical factors and contains the information of all the other variables involved in the motion. The univariate sequence must therefore be extended into a high-dimensional space in some way to fully display this information; this is nonlinear time series phase space reconstruction. At present, it is common to reconstruct the phase space from a chaotic time series output by a nonlinear system, so as to investigate the characteristics of the whole chaotic system. The widely used reconstruction method is the delayed-coordinate method proposed by Packard, Takens and others, whose mathematical foundation is the Takens embedding theorem.

Before introducing the Takens embedding theorem, some related mathematical concepts are introduced qualitatively: (1) manifold: an abstract space that is locally Euclidean; it generalizes the concepts of curves and surfaces in Euclidean space; (2) diffeomorphism: for two manifolds $M_1$ and $M_2$, $f$ is called a diffeomorphism between $M_1$ and $M_2$ if the map $f: M_1 \to M_2$ and its inverse $f^{-1}: M_2 \to M_1$ are both differentiable; (3) isometric isomorphism and embedding: two metric spaces $(M_1, \rho_1)$ and $(M_2, \rho_2)$ are isometrically isomorphic if there exists a distance-preserving bijection $f: M_1 \to M_2$; if $(M_1, \rho_1)$ is isometrically isomorphic to a subspace of $(M_3, \rho_3)$, then $(M_1, \rho_1)$ is said to be embeddable in $(M_3, \rho_3)$.

Based on these concepts, the Takens embedding theorem can be stated as follows [3]: let $M$ be a $d$-dimensional manifold, $\varphi: M \to M$ a smooth diffeomorphism, and $y: M \to \mathbb{R}$ a smooth function on $M$; define the map $\Phi_{(\varphi, y)}: M \to \mathbb{R}^{2d+1}$ by

$$\Phi_{(\varphi, y)}(x) = \big(y(x),\, y(\varphi(x)),\, y(\varphi^2(x)),\, \ldots,\, y(\varphi^{2d}(x))\big)$$

Then $\Phi_{(\varphi, y)}$ is an embedding from $M$ into $\mathbb{R}^{2d+1}$, where $y(x)$ is the observed value of the system state $x$; the space containing $\Phi_{(\varphi, y)}(M)$ is called the embedding space, and its dimension, $2d + 1$, is called the embedding dimension.

Takens' theorem provides the mathematical guarantee for data embedding: the original space and an embedding space satisfying the theorem are isometrically isomorphic, so the embedding space retains the basic dynamic information of the original space. The phase space reconstruction method given by Takens' theorem is the delayed-coordinate method, which reconstructs a one-dimensional time series into multidimensional phase-space vectors through time delays [4]. For an observation time series {x(1), x(2), …, x(N)} of length N, the reconstructed phase space is obtained by selecting an appropriate delay time τ and embedding dimension m according to formula 6.1:

$$\mathbf{X}(i) = \{x(i),\, x(i+\tau),\, \ldots,\, x(i+(m-1)\tau)\}, \quad i = 1, 2, \ldots, N-(m-1)\tau \tag{6.1}$$

so that X(1) = {x(1), x(1 + τ), …, x(1 + (m − 1)τ)} and X(N − (m − 1)τ) = {x(N − (m − 1)τ), x(N − (m − 2)τ), …, x(N)}. Each of X(1), X(2), …, X(N − (m − 1)τ) is a vector in the reconstructed phase space. As the formula shows, the choice of the delay time and the embedding dimension has a great influence on the structure of the reconstructed phase space, but Takens' theorem gives no specific selection method for these two parameters. The following two sections discuss the selection of the delay time and the embedding dimension.
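To make formula 6.1 concrete, a minimal delay-coordinate embedding sketch follows; the function name and the example signal and parameters are illustrative assumptions:

```python
import numpy as np

def reconstruct_phase_space(x, tau, m):
    """Delay-coordinate embedding of a 1-D series x (formula 6.1).

    Returns an (N - (m-1)*tau, m) array whose i-th row is
    X(i) = [x(i), x(i+tau), ..., x(i+(m-1)*tau)].
    """
    x = np.asarray(x)
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the chosen tau and m")
    return np.column_stack([x[k * tau : k * tau + n_vectors] for k in range(m)])

# Example: embed a noisy sine wave with illustrative parameters tau=14, m=5.
t = np.arange(1500)
x = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.randn(1500)
X = reconstruct_phase_space(x, tau=14, m=5)   # shape (1444, 5)
```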

6.1.2 Determination of Delay Time

According to Takens' theory, the selection of the delay time is arbitrary when reconstructing the phase space of an infinitely long, noise-free time series. In practice, however, the observation series cannot be infinitely long, and any time series is inevitably disturbed by noise. If the delay time τ is too small, any two components x(i) and x(i + τ) of the phase space vector X(i) = {x(i), x(i + τ), …, x(i + (m − 1)τ)} are numerically too close, so the phase space vectors differ too little; the information redundancy is too large, and the reconstructed phase space contains little information about the original measurements. Geometrically, the phase-space trajectory compresses toward the principal diagonal of the phase space. If τ is too large, the correlation between the elements of the phase space vector is easily lost, and the phase-space trajectory may exhibit folding. It is therefore essential to choose an appropriate delay time τ that preserves the dynamic characteristics of the original system to the maximum extent in the reconstructed phase space. There are several commonly used methods to calculate the delay time: the autocorrelation method, the average displacement method and the mutual information method. The autocorrelation method is a relatively mature method that determines the optimal delay time by examining the linear independence between sequences, but the delay time it yields cannot be generalized to the reconstruction of high-dimensional phase spaces.


Another method to calculate the delay time is the average displacement method. This method requires an embedding dimension to be chosen first, and the resulting delay time is optimal only under that choice. However, the optimal embedding dimension cannot be determined in advance in practice, which introduces an error into the determination of the optimal delay time. In addition, the average displacement method must calculate the distances between all vectors in the phase space, so its computational load is large. Besides these two methods, the mutual information method is commonly used to calculate the delay time. Compared with the autocorrelation method it needs more computation, but it captures the nonlinear characteristics of the time series, so its results are better than those of the autocorrelation method; moreover, it does not require the embedding dimension to be determined in advance. Therefore, the mutual information method is adopted in this chapter to obtain the delay time parameter for phase space reconstruction.
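As an illustration of the mutual information approach (not the book's implementation), a histogram-based sketch follows; the bin count and the first-local-minimum rule are conventional choices assumed here:

```python
import numpy as np

def mutual_information(x, tau, bins=32):
    """Histogram estimate of the mutual information I(x(t); x(t+tau))."""
    a, b = x[:-tau], x[tau:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x(t)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of x(t+tau)
    nz = pxy > 0                          # skip zero cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def select_delay(x, max_tau=50):
    """Return the first local minimum of mutual information over tau."""
    mi = [mutual_information(x, tau) for tau in range(1, max_tau + 1)]
    for i in range(1, len(mi) - 1):
        if mi[i] < mi[i - 1] and mi[i] < mi[i + 1]:
            return i + 1                  # mi[i] corresponds to delay i + 1
    return int(np.argmin(mi)) + 1         # fall back to the global minimum
```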

6.1.3 Determination of Embedding Dimensions

For the embedding dimension of the reconstructed phase space, Takens' theorem gives only a sufficient condition, m ≥ 2d + 1, and this applies only to the ideal case of an infinitely long, noise-free time series. In practice, the embedding dimension should exceed this minimum value of m. Theoretically, as long as m is large enough, the dynamical characteristics of the original chaotic system can be described and its internal laws of motion revealed. In practice, however, too large an embedding dimension greatly increases the computation of the geometric invariants of the chaotic system (such as the correlation dimension and Lyapunov exponents), and when the system noise is large, the impact of noise and rounding error on the reconstruction also grows greatly. Common methods for finding the embedding dimension include the trial algorithm, the singular value decomposition method, and the false nearest neighbor method. Given the delay time, the trial algorithm repeatedly computes geometric invariants of the system (such as the correlation dimension and Lyapunov exponents) while increasing the embedding dimension until these invariants stop changing; the corresponding dimension is then taken as the optimal embedding dimension for phase space reconstruction. However, the trial algorithm has to reconstruct the phase space many times and compute the geometric invariants each time, so the computation and computing time are large. The singular value decomposition method, also known as principal component analysis, was first introduced by Broomhead in 1986 to determine the embedding dimension of chaotic time series [5].


The singular value decomposition method (principal component analysis) is essentially linear, and applying a linear method to the parameter selection of phase space reconstruction for nonlinear systems is theoretically controversial. The basic idea of the false nearest neighbor method is to examine which neighboring points in the phase space are real neighbors and which are false ones as the embedding dimension changes; when no false neighbors remain, the geometric structure of the phase space is considered completely unfolded [6]. Compared with the trial algorithm, the false nearest neighbor method needs less computation and less data and has better anti-noise ability. In this chapter, the false nearest neighbor method is used to obtain the embedding dimension of the reconstructed phase space.
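A simplified single-criterion sketch of the false nearest neighbor method follows; the tolerance rtol = 15 is a conventional value from the literature, not from the text, and the brute-force O(n²) neighbor search is kept only for clarity:

```python
import numpy as np

def fnn_fractions(x, tau, m_max=10, rtol=15.0):
    """Fraction of false nearest neighbors for embedding dimensions 1..m_max.

    A neighbor is 'false' if its distance grows by more than rtol times
    when the embedding is extended from m to m+1 dimensions.
    """
    x = np.asarray(x)
    fractions = []
    for m in range(1, m_max + 1):
        n = len(x) - m * tau   # vectors that survive the (m+1)-dim extension
        emb = np.column_stack([x[k * tau : k * tau + n] for k in range(m)])
        false_count = 0
        for i in range(n):
            d = np.linalg.norm(emb - emb[i], axis=1)
            d[i] = np.inf                                 # exclude the point itself
            j = int(np.argmin(d))                         # nearest neighbor in m dims
            extra = abs(x[i + m * tau] - x[j + m * tau])  # gap in the new coordinate
            if d[j] > 0 and extra / d[j] > rtol:
                false_count += 1
        fractions.append(false_count / n)
    return fractions

# The smallest m whose FNN fraction drops near zero is taken as the
# embedding dimension (a cut-off such as 1% is an illustrative choice).
```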

6.2 Recurrence Quantification Analysis Based on Machinery Fault Recognition

Among the various methods for nonlinear time series analysis, the recurrence quantification analysis method [7] requires only a small amount of data and has strong anti-noise ability, and it has become a new research hotspot in nonlinear time series analysis. At present, recurrence quantification analysis (RQA), as a nonlinear feature extraction method, has been widely used in many fields; for example, Zbilut applied the RQA method to weak signal detection in noisy environments [8]. This section describes how to apply multiple feature parameters extracted by the RQA method to the identification of bearing fault severity.

6.2.1 Phase Space Reconstruction Based RQA

Like many nonlinear feature extraction methods, the recurrence quantification analysis method is based on phase space reconstruction. Using the mutual information method and the false nearest neighbor method introduced earlier in this chapter, the delay time τ and the embedding dimension m are selected, and the phase space of the observation time series {x(1), x(2), …, x(N)} of length N is reconstructed by formula 6.1, giving a series of phase-space vectors {X(1), X(2), …, X(N − (m − 1)τ)}, each of which is a point in the reconstructed phase space. These vectors are then used to construct the recurrence matrix:

$$R_{ij} = \Theta\big(\varepsilon - \|X(i) - X(j)\|\big) = \begin{cases} 1, & \varepsilon > \|X(i) - X(j)\| \\ 0, & \varepsilon < \|X(i) - X(j)\| \end{cases}, \quad i, j \in [1, N_m] \tag{6.2}$$

where Θ(·) is the unit step function, ε is the recurrence threshold, and N_m = N − (m − 1)τ is the number of vectors.

Suppose a recurrence threshold ε is determined and any two vectors X(i) and X(j) are substituted into formula 6.2. If R_ij equals 1, a point is drawn at coordinates (i, j); if R_ij equals 0, no point is drawn there. When all vectors in the reconstructed phase space have been processed by Eq. 6.2, a two-dimensional graph, called a recurrence graph, is obtained. The line structures and point density in the recurrence graph reflect the dynamic characteristics of the original time series. For example, the points in the recurrence graph of Gaussian white noise are evenly distributed, while the recurrence graph of a periodic signal is composed of lines parallel to the diagonal [9]. However, a recurrence graph is only a graphical, qualitative description of the dynamic characteristics of a time series, and its rich information needs to be described by quantitative features. Based on the recurrence graph, Marwan proposed the recurrence quantification analysis method, which extracts effective characteristic parameters, such as the recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy (ENTR), to quantitatively describe the dynamic characteristics of the original time series [10]. The mathematical definition and basic meaning of these four parameters are described below.

For all N_m vectors in the reconstructed phase space, after constructing the recurrence graph according to formula 6.2, the recurrence rate RR is defined as:

$$RR = \frac{1}{N_m^2}\sum_{i=1}^{N_m}\sum_{j=1}^{N_m} R_{ij} \tag{6.3}$$

Let p(l) and p(v) represent the length distributions of the lines in the 45-degree and vertical directions of the recurrence graph, respectively:

$$p(l) = N_l \Big/ \sum_{\alpha=l_{min}}^{l_{max}} \alpha N_\alpha \tag{6.4}$$

$$p(v) = N_v \Big/ \sum_{\alpha=v_{min}}^{v_{max}} \alpha N_\alpha \tag{6.5}$$

where N_l is the number of lines of length l in the 45-degree direction; N_v is the number of lines of length v in the vertical direction; N_α is the number of lines of length α in the 45-degree or vertical direction; l_min and l_max are the minimum length (usually 2) and maximum length of the 45-degree lines; and v_min and v_max are the minimum length (usually 2) and maximum length of the vertical lines.


Thus, determinism (DET), laminarity (LAM), and recurrence entropy (ENTR) can be defined as:

$$DET = \frac{\sum_{l=l_{min}}^{l_{max}} l\, p(l)}{\sum_{i=1}^{N_m}\sum_{j=1}^{N_m} R_{ij}} \tag{6.6}$$

$$LAM = \frac{\sum_{v=v_{min}}^{v_{max}} v\, p(v)}{\sum_{v=1}^{v_{max}} v\, p(v)} \tag{6.7}$$

$$ENTR = -\sum_{l=l_{min}}^{N_m} p(l) \ln p(l) \tag{6.8}$$

In a physical sense, the recurrence rate RR describes the density of recurrence points in a recurrence graph and reflects the probability of occurrence of a particular state. Determinism (DET) describes the ratio of the number of recurrence points in diagonal structures to the number of all recurrence points, reflecting the predictability of the system. Laminarity (LAM) describes the proportion of recurrence points in the vertical-line structures of the recurrence graph, reflecting the intermittency and laminar behavior of the system. Recurrence entropy (ENTR) is the Shannon entropy of the diagonal-line length distribution; it describes the complexity of the deterministic structure in the dynamical system and reflects the system's dynamic information or degree of randomness [8]. In a word, these RQA parameters reflect the system's dynamic features and can be used as feature parameters for fault diagnosis of mechanical rotating parts.

Next, a simulation experiment demonstrates the effectiveness of the recurrence quantification analysis method. A white Gaussian noise signal with a standard deviation of 1, a sine signal with a frequency of 2π, and the x component of a noisy Lorenz system were selected as simulation signals. For each signal, 1500 data points were collected; the delay time τ = 14 and embedding dimension m = 5 were selected to reconstruct the phase space and draw the recurrence graphs shown in Fig. 6.1a–c, where the horizontal and vertical axes give the index of vectors in the reconstructed phase space. As the graphs show, the white Gaussian noise signal is random: its recurrence points are evenly distributed, with no obvious 45-degree or vertical line structure. The sine signal is periodic: its recurrence graph has an obvious 45-degree linear structure, and the whole graph has a banded distribution. The Lorenz system is a more complex chaotic system, and its recurrence graph is correspondingly more complex than those of the white noise and sine signals, showing short 45-degree lines, short vertical lines, and partially banded blank regions. Then, the recurrence rate, determinism, laminarity and recurrence entropy of the three signals were calculated by formulas 6.3–6.8, respectively, as shown in Table 6.1.
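A sketch of how these measures can be computed follows the standard definitions; it returns RR, DET and ENTR (LAM follows analogously from vertical-line lengths) and keeps the main diagonal for simplicity, which production RQA code usually excludes:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rqa_measures(X, eps, l_min=2):
    """RR, DET and ENTR from embedded vectors X, an (Nm, m) array."""
    R = (squareform(pdist(X)) < eps).astype(int)       # recurrence matrix (6.2)
    rr = R.sum() / R.size                              # recurrence rate (6.3)

    # Collect diagonal (45-degree) line lengths of at least l_min points.
    lengths = []
    n = R.shape[0]
    for k in range(-(n - 1), n):
        run = 0
        for v in np.append(np.diagonal(R, offset=k), 0):  # trailing 0 flushes run
            if v:
                run += 1
            else:
                if run >= l_min:
                    lengths.append(run)
                run = 0
    lengths = np.asarray(lengths)

    det = lengths.sum() / R.sum()                      # determinism (6.6)
    counts = np.bincount(lengths)[l_min:]              # diagonal length histogram
    p = counts[counts > 0] / counts.sum()
    entr = float(-(p * np.log(p)).sum())               # recurrence entropy (6.8)
    return rr, det, entr
```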


Fig. 6.1 Recurrence graphs of three different signals

Table 6.1 Recurrence quantification analysis parameters

Signal                       RR      DET     LAM     ENTR
White Gaussian noise signal  0.0018  0.0032  0.0038  0.0083
Sine signal                  0.1180  0.9691  0.0100  0.5185
Lorenz system signal         0.0817  0.9803  0.9860  1.1111

As can be seen from the table, the values of all four parameters are very small for the simple white Gaussian noise signal, whose recurrence graph shows no obvious regularity. As the system structure becomes more complex, the four parameters of the sinusoidal and Lorenz system signals exceed those of the white noise signal, and for the most complex signal, the Lorenz system, every parameter except the recurrence rate exceeds that of the sinusoidal signal. This indicates that the Lorenz system has higher complexity and carries a larger amount of information.

6.2.2 RQA Based Multi-parameter Fault Recognition

The four characteristic parameters extracted by the RQA algorithm, namely the recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy (ENTR), can be used as quantitative features to identify and evaluate the fault severity of rolling bearings. On this basis, a multi-parameter bearing fault severity identification algorithm can be formed; its flow chart is shown in Fig. 6.2. The steps of the algorithm are as follows: first, the vibration signals of rolling bearings with different fault degrees are obtained by sensors, and the acquired time series are standardized; then the phase space is reconstructed, with the mutual information method and the false nearest neighbor method used to select the delay time and the embedding dimension; then the recurrence matrix is calculated and the recurrence plot drawn according to Eq. 6.2; finally, the four characteristic parameters are calculated according to Eqs. 6.3–6.8, and their variation across bearing vibration signals of different fault degrees is compared.


Fig. 6.2 Flow chart of RQA-based multi-parameter bearing fault severity identification algorithm

The validity of these characteristic parameters is verified using measured vibration signals of rolling bearings with different failure degrees. The fault data for this experiment were obtained from the Case Western Reserve University Electrical Engineering Laboratory in Ohio, USA [11]. The experimental setup, as described in Chap. 2, consists of a drive motor, a torque sensor, a power meter and a control device. The tested bearing is the motor's output-shaft support bearing, model 6205-2RS JEM SKF. Single-point faults of different sizes were introduced into the tested bearings using EDM technology; the fault diameters are 0.18, 0.36 and 0.53 mm. The vibration signal was sampled by an accelerometer at a frequency of 12 kHz.

First, under the condition of 1750 rpm and 2 HP load, the data of bearings with inner ring faults were processed and analyzed. Figure 6.3 shows a section of the vibration signal and its corresponding spectrum for the healthy bearing and for faulty bearings with fault diameters of 0.18 mm, 0.36 mm and 0.53 mm, respectively. Although differences can be seen in the time-domain signals and their spectra, it is difficult to distinguish the fault severity directly from the graphs. The signals are therefore analyzed with the recurrence quantification analysis method. First, the delay time and embedding dimension are adaptively selected by the mutual information method and the false nearest neighbor method, and the phase spaces of the bearing data with different failure degrees are reconstructed.


Fig. 6.3 Bearing vibration signal and its spectrum (RPM: 1750 rpm, load: 2HP)

The recurrence threshold ε = 0.4 is selected to construct the recurrence graphs, as shown in Fig. 6.4, where the horizontal and vertical axes give the index of vectors in the reconstructed phase space. As can be seen from the figure, the density of recurrence points and the structure of the 45-degree and vertical lines change as the degree of bearing failure increases. Then, to describe the recurrence graphs quantitatively, the recurrence rate (RR), determinism (DET), laminarity (LAM) and recurrence entropy (ENTR) are calculated; the results are shown in Table 6.2 and Fig. 6.5. As can be seen from the figure, the RR, LAM and ENTR values of the faulty bearings are larger than those of the healthy bearing and increase with the fault degree. However, the DET value of the healthy bearing is greater than that of the mildly faulted bearing, which shows that, for inner ring faults, the predictability of the system does not necessarily increase with the fault degree. Therefore, RR, LAM and ENTR can be used to identify signals of different fault severity in this test, while DET is not an effective indicator of fault severity.

6.2 Recurrence Quantification Analysis Based on Machinery Fault …

1400

1400

1200

1200

1000

1000

800

800

600

600

400

400

200

200

200

400

600

800

1000

1200

1400

200

(a) Health Bearings

1400

1200

1200

1000

1000

800

800

600

600

400

400

200

200

400

600

800

1000

600

800

1000

1200

1400

(b) Minor faults(0.18mm)

1400

200

400

381

1200

(c) Medium failure (0.36mm)

1400

200

400

600

800

1000

1200

1400

(d) Serious failure (0.53mm)

Fig. 6.4 Recurrence graphs of bearing vibration signals with different failure degrees (1750 rpm, load: 2 HP)

To verify the above results, the data for ball faults under 1750 rpm and 2 HP load were processed and analyzed, with results shown in Table 6.3 and Fig. 6.6; likewise, under 1797 rpm and 0 HP load, the data for bearings with outer ring faults were processed, with results presented in Table 6.4 and Fig. 6.7. As can be seen from Fig. 6.6 and Table 6.3, RR, DET, LAM and ENTR all increase with the severity of the ball fault, so all four parameters are effective for diagnosing ball fault severity. As can be seen from Fig. 6.7 and Table 6.4, in the outer ring failure test the RR value of the bearing signal at moderate failure was greater than that at severe failure.


Table 6.2 Bearing inner race fault RQA parameters (speed: 1750 rpm, load: 2 HP)

Fault size          RR      DET     ENTR    LAM
A: healthy bearing  0.0009  0.0779  0.3325  0.0217
B: 0.18 mm          0.0023  0.0557  1.5263  0.0571
C: 0.36 mm          0.0078  0.0604  1.553   0.1056
D: 0.53 mm          0.0406  0.2071  2.3661  0.3819

Fig. 6.5 Bearing inner race fault RQA parameters (RPM: 1750 rpm, load: 2 HP)

In addition, DET, LAM and ENTR can accurately diagnose the severity of the outer ring failure. From the three experiments it can be seen that RR and DET cannot accurately evaluate bearing fault severity under certain fault conditions; only LAM and ENTR increase monotonically with the severity of bearing failure.

Table 6.3 Bearing ball fault RQA parameters (speed: 1750 rpm, load: 2 HP)

Fault size          RR      DET     ENTR    LAM
A: healthy bearing  0.0008  0.0784  0.3387  0.0212
B: 0.18 mm          0.005   0.7374  1.3885  0.4482
C: 0.36 mm          0.0198  0.758   1.5168  0.6116
D: 0.53 mm          0.0409  0.9447  2.4918  0.8506


Fig. 6.6 Bearing ball fault RQA parameters (rpm: 1750 rpm, load: 2 HP)

Table 6.4 Bearing outer ring fault RQA parameters (speed: 1797 rpm, load: 0 HP)

Fault size          RR      DET     ENTR    LAM
A: healthy bearing  0.0004  0.0937  0.3206  0.0454
B: 0.18 mm          0.0196  0.2032  2.0945  0.2222
C: 0.36 mm          0.0399  0.2169  2.5204  0.3719
D: 0.53 mm          0.0338  0.3298  2.5384  0.4522

This can be explained by the fact that, as the bearing fault crack or spalling grows, the vibration shocks generated when the rolling elements pass over the fault point are enhanced; these fault-related vibration components add more frequency content to the faulty bearing's vibration signal, which increases the complexity of the system and leads to increases in entropy and laminarity. Therefore, in this experiment, LAM and ENTR are two effective quantitative characteristic indexes that can identify the vibration signals of rolling bearings under different fault types and severities.

Through the experimental verification of the RQA-based multi-parameter bearing fault severity identification algorithm, laminarity (LAM) and recurrence entropy (ENTR) have been proven effective for identifying the vibration signals of bearings with different fault severities, which lays the basis for the study of bearing life degradation. In addition, ENTR itself is defined as a Shannon entropy, which is directly related to the complexity of the system.


Fig. 6.7 Bearing outer ring fault RQA parameters (speed: 1797 rpm, load: 0 HP)

Literature [9] further shows that an increase in the complexity of a dynamical system changes the distribution of the 45-degree line lengths in the recurrence graph and thus increases the recurrence entropy. Therefore, given the clearer physical meaning of the recurrence entropy parameter, the recurrence entropy feature is further selected as the nonlinear feature parameter for degradation tracking over the whole life cycle of mechanical rotating parts; the specific method is detailed in the next section.

6.3 Kalman Filter Based Machinery Degradation Tracking

6.3.1 Standard Deviation Based RQA Threshold Selection

In the previous section, recurrence entropy was used to identify bearings at different failure stages, and the influence of the recurrence threshold ε was not considered when calculating the entropy. However, formula 6.2 shows that the recurrence threshold strongly affects the recurrence matrix and hence the recurrence graph and the recurrence entropy. If ε is too large relative to ‖X(i) − X(j)‖, almost all R_ij equal 1 and the recurrence graph is filled with recurrence points; if ε is too small, almost all R_ij equal 0 and the graph is nearly blank. Both situations adversely affect the outcome of recurrence quantification analysis, so choosing an appropriate recurrence threshold matters for the RQA method itself.


Furthermore, the previous fault severity identification only analyzed bearing signals at several discrete time points, where the choice of the recurrence threshold has little influence on the results; the study of mechanical component degradation tracking, however, requires real-time tracking of vibration signals and accurate fault identification over the whole life cycle of the component. The stability and fault sensitivity of the extracted feature parameters must therefore be higher, which requires a reasonable selection of the recurrence threshold. Studying the selection of the recurrence threshold is thus of great significance.

Some existing research discusses the selection of the recurrence threshold; for example, the maximum phase space scale [12] and the standard deviation of the noise contained in the signal [13] have been used to select thresholds. Although these methods are effective in some RQA experiments, the computation of the maximum phase space scale and the recurrence point density is too heavy to meet the real-time requirement of degradation tracking, and the noise level in a signal is hard to determine in practice, so the noise standard deviation cannot set the threshold accurately. A new recurrence threshold selection method based on the standard deviation of the observation sequence is therefore proposed to improve the traditional RQA algorithm.

Theoretically, the standard deviation of a time series reflects its degree of fluctuation: the larger the standard deviation, the greater the fluctuation. Because every vector in the reconstructed phase space is built from elements of the observation sequence, ‖X(i) − X(j)‖ in formula 6.2 is also related to the fluctuation of the original sequence, i.e., to its standard deviation. The larger the standard deviation of the time series, the larger ‖X(i) − X(j)‖, and the larger the recurrence threshold ε should be chosen. It can therefore be assumed that there is a linear relationship between the standard deviation of the time series and the recurrence threshold ε.

In a real degradation tracking experiment on rotating components, for vibration observation sequences collected at fixed time intervals, the recurrence threshold is selected as follows: for the first acquired vibration time series, the standard deviation σ₁ is calculated and the recurrence threshold ε₁ is determined as 10% of the maximum phase space scale [12], from which the scaling coefficient k = ε₁/σ₁ is determined; for the i-th subsequent sampling sequence, the recurrence threshold εᵢ is then obtained from formula 6.9:

$$\varepsilon_i = k\sigma_i \tag{6.9}$$

where εᵢ is the recurrence threshold of the i-th observation sequence and σᵢ is the standard deviation of the i-th observation sequence.
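A short sketch of this calibration follows; reading the "maximum phase space scale" as the largest per-coordinate range of the first embedding is an assumption made here only for illustration:

```python
import numpy as np

def calibrate_k(first_embedding, first_sequence):
    """Scaling coefficient k = eps1 / sigma1 (formula 6.9).

    first_embedding: phase-space vectors of the first observation sequence;
    eps1 is taken as ~10% of the maximum phase-space scale [12], here
    approximated (an assumption) by the largest per-coordinate range.
    """
    eps1 = 0.1 * np.max(np.ptp(first_embedding, axis=0))
    return eps1 / np.std(first_sequence)

def adaptive_threshold(sequence, k):
    """Recurrence threshold eps_i = k * sigma_i for a later sequence."""
    return k * np.std(sequence)
```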


After the recurrence threshold of each observation sequence is obtained, the corresponding recurrence entropy feature can be computed from formula 6.8. From the above steps, it can be seen that the method links the standard deviation of the observation sequence to the recurrence threshold and adaptively selects the threshold according to the conditions of each observation sequence, which improves the stability of the recurrence entropy feature obtained by RQA. In addition, the method needs to calculate the maximum phase space scale only once to calibrate the threshold, so the computation is small and the method is simple and convenient, meeting the real-time requirement of degradation tracking research.

6.3.2 Selection of Degradation Tracking Threshold

Another important issue is the setting of the health threshold for degradation tracking of rotating parts using the recurrence entropy feature of the observation sequence. During operation, the performance of rotating parts degrades with time, gradually passing from the healthy state to the fault degradation state. The health threshold is the threshold parameter that distinguishes the healthy state from the fault degradation state, i.e., the parameter that identifies the initial failure time of the component. Chebyshev's inequality from probability theory is used to select the degradation tracking health threshold. Chebyshev's inequality states:

$$P\{|X - \mu_h| \geq \varepsilon_h\} \leq \frac{\sigma_h^2}{\varepsilon_h^2} \quad \text{or} \quad P\{|X - \mu_h| < \varepsilon_h\} > 1 - \frac{\sigma_h^2}{\varepsilon_h^2} \tag{6.10}$$

where X is the recurrence entropy sequence of the mechanical component in a given state, μ_h is the mean of sequence X, σ_h is the standard deviation of sequence X, and ε_h is a selected real number.

Chebyshev's inequality shows that, for an arbitrary probability distribution of an observation sequence in the same state, most values in the sequence lie close to the sequence mean; that is, the probability that a value lies in the interval [μ_h − ε_h, μ_h + ε_h] is greater than 1 − σ_h²/ε_h². This can be framed as the following hypothesis test:

$$H_0: |X - \mu_h| < \varepsilon_h, \qquad H_1: |X - \mu_h| \geq \varepsilon_h \tag{6.11}$$


For the recurrence entropy value X₀ of a mechanical component in any healthy state, if |X₀ − μ_h| ≥ ε_h, the H₀ hypothesis is rejected and X₀ is mistakenly judged to belong to the fault state. Statistically, this misdiagnosis is called a type I error, and according to formula 6.10 its probability is α = σ_h²/ε_h². On the other hand, for the recurrence entropy value X₁ of a component in any fault state, if |X₁ − μ_h| < ε_h, the H₀ hypothesis is accepted and X₁ is incorrectly judged to be the recurrence entropy of a healthy component; this misdiagnosis is called a type II error, with probability β. If α is too large, the recurrence entropy of a healthy part is easily mistaken for that of a faulty part; if β is too large, the recurrence entropy of a faulty part is easily mistaken for that of a healthy part. According to hypothesis testing theory, once the sampling sequence is determined, the probabilities of the two errors constrain each other: the smaller α, the larger β, and conversely, the larger α, the smaller β. In practice, the probability of a type I error is controlled first, and the probability of a type II error is then minimized. Here, Chebyshev's inequality is used to determine the health threshold distinguishing the healthy state from the fault degradation state, which first controls the probability of the type I error of misjudging a healthy recurrence entropy value as faulty. Selecting ε_h = 5σ_h gives α = 4%, and the degradation tracking health threshold is set to μ_h + 5σ_h. This means that a healthy mechanical component has a 96% confidence probability that its recurrence entropy falls within [μ_h − 5σ_h, μ_h + 5σ_h], while the probability that a healthy component's entropy exceeds μ_h + 5σ_h is less than 4%. In other words, once the recurrence entropy of a component exceeds the health threshold μ_h + 5σ_h, it is considered to be in the initial failure state with 96% confidence. In practice, mechanical rotating parts are generally healthy at the beginning of a life-cycle experiment, so the healthy recurrence entropy sequence can be formed from the entropy values of this early health data; its mean and standard deviation are calculated, and the degradation tracking health threshold is constructed.

6.4 Improved RQA Based Degradation Tracking

Based on the improved RQA algorithm with standard-deviation-based threshold selection and the Chebyshev-inequality health threshold selection method described above, the whole life cycle of mechanical rotating parts can be tracked. The flow chart of the degradation tracking algorithm is shown in Fig. 6.8. The steps are summarized as follows. First, for the observation sequence at time t = 1, the recurrence threshold ε₁ and the recurrence entropy RP₁ are calculated using the maximum phase space scale parameter, the standard deviation σ₁ is computed, and the threshold scaling coefficient k = ε₁/σ₁ is determined.


Fig. 6.8 Flow chart of degradation tracking algorithm


sequence, the recursive threshold and recursive entropy RPt are calculated. According to Chebyshev's inequality, the degradation health threshold is obtained from the recursive entropy sequence composed of the first tN recursive entropy values. Each newly calculated recursive entropy RPt is then compared with the health threshold: if it is below the health threshold, the component is still in a healthy state and the above steps are repeated to continue tracking; once a recursive entropy value exceeds the health threshold, the initial fault is considered to have occurred.
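The tracking loop might be sketched as follows, where `recursive_entropy` is a hypothetical placeholder for the improved RQA feature extraction described above, and the length of the healthy baseline is an assumed parameter:

```python
import numpy as np

def track_degradation(segments, recursive_entropy, n_healthy=500, n_sigma=5.0):
    """Online degradation tracking loop. `segments` yields successive
    vibration records; `recursive_entropy` stands in for the improved RQA
    feature extractor (recursive threshold = k * std of the segment)."""
    history = []
    for t, segment in enumerate(segments):
        rp_t = recursive_entropy(segment)
        history.append(rp_t)
        if t + 1 <= n_healthy:
            continue                      # still collecting the healthy baseline
        baseline = np.asarray(history[:n_healthy])
        threshold = baseline.mean() + n_sigma * baseline.std()
        if rp_t > threshold:
            return t                      # index of the detected initial fault
    return None                           # no initial fault detected
```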

6.5 Kalman Filter Based Incipient Fault Prognostics

The above degradation tracking process only detects the initial fault occurrence time once the extracted recursive entropy feature exceeds the health threshold; it does not predict that time in advance. The Kalman filter [14] is an optimal recursive prediction algorithm that can predict the future state of a system from noisy measurements of a dynamic system. This section uses the Kalman filter algorithm to predict, in advance, the time of the initial failure of a rotating mechanical component during degradation tracking. A dynamical system can be described by the following dynamic equations:

Xk+1 = AXk + wk
yk = CXk + vk  (6.12)
where: Xk—the state of the system at time k; yk—the observed value of the system at time k; A—the state transition matrix; C—the measurement matrix; wk—process noise, wk ~ N(0, Q), where Q is its covariance matrix; vk—measurement noise, vk ~ N(0, R), where R is its covariance matrix.

Assuming the current time is k, the steps for using the Kalman filter algorithm to predict the state of the system at time k + 1 are given by Eqs. 6.13–6.17.

Initial state prediction:

Xk+1|k = AXk|k  (6.13)

Covariance prediction:

Pk+1|k = APk|k A' + Q  (6.14)

Kalman gain matrix calculation:

Kk+1 = Pk+1|k C'(CPk+1|k C' + R)−1  (6.15)


Optimal state prediction:

Xk+1|k+1 = Xk+1|k + Kk+1(yk+1 − CXk+1|k)  (6.16)

Covariance update:

Pk+1|k+1 = Pk+1|k − Kk+1 CPk+1|k  (6.17)

where: Xk|k—the optimal state estimate of the system at time k; Xk+1|k—the initial (predicted) state estimate at time k + 1; Xk+1|k+1—the optimal state estimate at time k + 1; Pk|k—the covariance matrix of the system at time k; Pk+1|k+1—the covariance matrix at time k + 1.

After one step of prediction, time k + 1 is taken as the current time and the steps of Eqs. 6.13–6.17 are repeated to predict the next optimal state estimate of the system. As can be seen from the above prediction steps, the Kalman filter is a fast recursive algorithm: at each iteration it predicts the state of the system at the next moment using only the current state, the measurement, and the covariance matrix. In addition, the Kalman filter makes full use of the available system information, including measurement values, measurement errors, and system noise, to predict the future optimal state estimate. However, to make predictions with the Kalman filter, a system dynamic equation in the form of Eq. 6.12 must be obtained and its parameters A, C, wk, and vk determined. This dynamic equation needs to describe the state of the system well while remaining simple to construct, so as to meet the real-time requirement of online prediction. The autoregressive (AR) model has a simple structure and is convenient to build, and it has been shown theoretically that a higher-order AR model can achieve a precision similar to that of the autoregressive moving average (ARMA) model. Therefore, the AR model can satisfy the requirement of online prediction, and it is used here to construct the dynamic equation of the system for the Kalman filter:

Xt = ∑(j=1..p) aj Xt−j + εt  (6.18)

where: aj—the model parameters; p—the order of the model; εt—the model error. The autoregressive model expresses the current state of the system Xt as a weighted accumulation of the previous p states Xt−1, …, Xt−p plus the model error.
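For illustration, one-step AR prediction under Eq. 6.18 could be written as follows (the function and argument names are illustrative, the noise term is omitted, and the coefficients are assumed already estimated):

```python
import numpy as np

def ar_predict_next(recent, a):
    """One-step AR(p) prediction following Eq. 6.18. `recent` holds the
    latest p values ordered newest first, so recent[j-1] corresponds to
    X_{t-j}; `a` holds the coefficients a_1..a_p, assumed already
    estimated (e.g., with the Burg algorithm)."""
    recent = np.asarray(recent, dtype=float)
    a = np.asarray(a, dtype=float)
    return float(np.dot(a, recent))    # X_t = sum_j a_j * X_{t-j}
```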


The model order p can be obtained using the Akaike information criterion (AIC) [15], and the model parameters aj can be obtained using the Burg algorithm [16]. In practical applications, AR models of different orders are built for a given time series and the corresponding AIC value is calculated for each model; the AR model with the minimum AIC value is the most suitable one for the time series. Based on the AR model in Eq. 6.18, the parameters of the dynamic equation Eq. 6.12 can be determined as follows:

Xk = [xk, xk−1, …, xk−p+1]'  (p×1)

A = [a1 a2 … ap; 1 0 … 0; …; 0 … 1 0]  (p×p),  C = [1 0 ··· 0]  (1×p),  wk = [εk, 0, …, 0]'  (p×1)  (6.19)
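A sketch of this companion-form construction and of one Kalman predict/update cycle (Eqs. 6.13–6.17), assuming NumPy arrays and an already-estimated AR coefficient vector (names are illustrative):

```python
import numpy as np

def ar_state_space(a):
    """Companion-form matrices A (p x p) and C (1 x p) of Eq. 6.19,
    built from the AR coefficients a = [a_1, ..., a_p]."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                     # first row carries the AR coefficients
    A[1:, :-1] = np.eye(p - 1)      # sub-diagonal shifts past states down
    C = np.zeros((1, p))
    C[0, 0] = 1.0                   # only the newest state is observed
    return A, C

def kalman_step(A, C, Q, R, x, P, y):
    """One predict/update cycle of Eqs. 6.13-6.17."""
    x_pred = A @ x                                            # Eq. 6.13
    P_pred = A @ P @ A.T + Q                                  # Eq. 6.14
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)    # Eq. 6.15
    x_new = x_pred + K @ (y - C @ x_pred)                     # Eq. 6.16
    P_new = P_pred - K @ C @ P_pred                           # Eq. 6.17
    return x_new, P_new
```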

The process noise wk can be determined from the model error εt of the AR model. As for the measurement noise vk, many studies on the Kalman filter determine it directly from the measurement accuracy of the sensors, but that approach is not appropriate here. Since the state of the system is given by the recursive entropy feature of the RQA algorithm, the object of AR modeling is not the original signal measured by the sensor but the recursive entropy feature; therefore, the measurement noise vk should be determined from the recursive entropy calculation process. The average error method is used for this purpose. The steps are as follows: an observation sequence x1(t) is divided into n equal short sequences {x1(t1), …, x1(tn)}, and the recursive entropy values RP1, …, RPn of the short sequences are calculated; the average of these recursive entropy values is taken as the recursive entropy feature of the sequence x1(t), and their standard deviation s1 is taken as the measurement error of that feature. Then, if N recursive entropy feature points are used for AR modeling, the measurement noise vk can be determined from the standard deviations s1 to sN of these recursive entropy features:

vk = (1/N) ∑(i=1..N) si  (6.20)
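The average error method can be sketched as follows (again using the hypothetical `recursive_entropy` extractor from earlier):

```python
import numpy as np

def entropy_feature_and_error(record, recursive_entropy, n_pieces=20):
    """Average error method: split one record into equal pieces, take the
    mean recursive entropy as the feature and the standard deviation of
    the piece-wise values as its measurement error."""
    pieces = np.array_split(record, n_pieces)
    values = np.array([recursive_entropy(p) for p in pieces])
    return values.mean(), values.std()

def measurement_noise(piece_stds):
    """Eq. 6.20: v_k as the average of the per-record standard deviations
    s_1 ... s_N of the features used for AR modeling."""
    return float(np.mean(piece_stds))
```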

Based on the Kalman filter algorithm proposed above, combined with the preceding degradation tracking method for mechanical rotating parts, the whole life cycle of a mechanical rotating part can be tracked while the time of its initial failure is predicted. The flow chart of the prediction algorithm is shown in Fig. 6.9. The specific algorithm steps are as follows:

(1) first, the recursive entropy values and the degradation health threshold are calculated according to the degradation tracking algorithm proposed in the previous section;


Fig. 6.9 Flow chart of an initial Kalman filter fault prediction algorithm

(2) at time t, the AR model is constructed from the n recursive entropy values in the time interval {t − n + 1, t − n + 2, …, t}, and the Kalman filter method is used to predict the m recursive entropy values in the forward time interval {t + 1, t + 2, …, t + m};
(3) if a predicted recursive entropy value at t + l (l ≤ m) is larger than the degradation health threshold, the initial failure of the component is predicted to occur at time t + l; otherwise, the recursive entropy value of the next time step is calculated, the AR model is rebuilt from the n recursive entropy values in the time interval {t − n + 2, t − n + 3, …, t, t + 1}, and step (2) is repeated until a predicted value exceeds the health threshold.

To verify the effectiveness of the degradation tracking algorithm based on the improved recursive quantitative analysis method and of the Kalman filter algorithm for initial fault prediction of mechanical rotating parts, these algorithms are applied to degradation experiment data from actual bearings. The experimental data were provided by the NSF I/UCR Center [17]. The bearing test system is shown in Fig. 6.10a and the detailed system structure in Fig. 6.10b. The system rotates at 2000 rpm, and a total radial load of 6000 lbs is applied to the bearings through a spring system. Four double-row bearings of type ZA2115, each row containing 16 rollers, with a pitch diameter of 2.815 in., a roller diameter of 0.331 in., and a contact angle of 15.17°, are installed on the test shaft of the test system. All bearings are lubricated by an oil circulation system


that regulates oil temperature and flow. Wear particles are collected from the oil by an electromagnet mounted in the oil feedback pipeline as a sign of bearing failure: when the amount of particles attached to the electromagnet exceeds a certain threshold, the test bearing is considered damaged, an electronic switch is triggered, and the experiment stops. A PCB 353B33 high-sensitivity accelerometer is mounted on each test housing, and four thermocouples are mounted on the outer rings of the bearings to measure their operating temperature. The vibration data of the bearings are collected by an NI DAQCard-6062E, and the vibration signal is processed with LabVIEW software. The sampling frequency is 20 kHz. In the experiment, the bearing vibration signal is collected every 10 min, with 20,480 data points collected each time. The experiment started at 14:39 on October 29, 2003 and ended at 23:39 on November 25, 2003, with 2000 data files collected. At the end of the experiment, an inner-race failure occurred in bearing 3, as shown in Fig. 6.11. Therefore, this study uses the 2000 data files of bearing 3 for analysis and processing.

Fig. 6.10 Bearing test-to-failure experiment system [18]

Fig. 6.11 Real defect bearing diagram [18]


First, the first data file is used to determine the scale factor k = 8 according to Eq. 6.9, and then all data files are processed by the improved RQA method. Take the second data file as an example. First, the optimal delay time τ = 2 and the optimal embedding dimension m = 5 are obtained using the mutual information method and the false nearest neighbor method, and the phase space of the second data file is reconstructed with these two parameters. The recursive threshold of the second data file is then obtained as 0.88 (ε = kσ = 8 × 0.11) using the standard-deviation-based threshold selection method. The waveform and recurrence plot of the first 1024 data points of the second data file are shown in Fig. 6.12. The recursive entropy feature is further calculated from the recurrence plot. Note that each data file is divided into 20 equal data segments, the recursive entropy value of each segment is calculated separately, the average of these values serves as the recursive entropy feature of the data file, and their standard deviations, labeled s1 … sN, are kept for the construction of the equation of state.

Since this experiment contains 2000 data files, a total of 2000 recursive entropy feature values were calculated and used to draw the degradation tracking curve of the bearing, as shown in Fig. 6.13. Generally speaking, the bearing is in a healthy state during the first quarter of the degradation test; therefore, the recursive entropy feature values of the first 500 data files were used to calculate the degradation tracking health threshold. Based on Chebyshev's inequality, the health threshold is set to 1.6046, and the health threshold line is also drawn in Fig. 6.13. As can be seen from the figure, the recursive entropy feature of the bearing exceeds the health threshold for the first time at point 1833, so point 1833 is considered the initial failure time of the bearing.

The above degradation tracking curve uses the improvement of the traditional recursive quantitative analysis method, i.e., the standard-deviation-based recursive threshold selection. For comparison, this experiment also applies the traditional recursive quantitative analysis method, in which the recursive threshold is determined from the maximum phase space scale, and calculates the recursive entropy for the degradation tracking of the bearing, as shown in Fig. 6.15. As can be seen from the graph, the degradation curve fluctuates strongly and the initial failure is detected at point 1984, 111 points later than the initial failure time obtained with the improved recursive quantitative analysis method. In addition, considering the sensitivity of the kurtosis parameter to initial failures of mechanical components and the wide application of the RMS parameter in degradation tracking [17], these two characteristic parameters were also calculated for each data file; the corresponding degradation tracking curves were drawn, and their health thresholds were determined as 4.0649 and 0.5079 with the same method, as shown in Figs. 6.14 and 6.16. As can be seen in Fig. 6.14, the kurtosis feature exceeds its health threshold at point 800, which means that in online monitoring point 800 would be mistaken for the initial failure time of the bearing; in addition, the kurtosis curve fluctuates heavily during the real degradation stage of the bearing (around point 1800), which is very unfavorable for failure prediction. Similarly, as can be seen from Fig. 6.16, although the degradation curve derived from the root-mean-square feature is much more stable than that of the kurtosis feature, the root-mean-square feature exceeds the health threshold


early at points 1271 and 1764, which also leads to a misjudgment of the initial failure time. These comparative experiments show that the degradation curve based on recursive entropy describes the bearing degradation process more clearly, yields a more accurate fault threshold, and reduces the possibility of mistiming the initial fault compared with the curves based on kurtosis and root mean square. In addition, the improved recursive quantitative analysis method with standard-deviation-based threshold selection extracts more accurate and stable recursive entropy features, which improves the effectiveness of the traditional recursive quantitative analysis method for bearing degradation tracking.

Then, based on the recursive entropy features extracted above and the degradation tracking results, the Kalman filter prediction method can be used to predict the initial failure time of the bearing in advance. The autoregressive (AR) model was constructed with 60 recursive entropy values (n = 60). Using the AR model as the equation of state, the prediction is carried out 6 time units ahead (m = 6, i.e., 60 min or 1 h) until a predicted recursive entropy value exceeds the health threshold. In particular, the measurement noise parameter vk in the dynamic equation is determined by the average error method of Eq. 6.20; the standard deviation of the measurement noise in this experiment is 0.1. The final prediction is shown in Fig. 6.17. When the 60 recursive entropy feature points from 1769 to 1828 were used to build the AR model (the model order was selected as 15 by the AIC criterion), the 5th prediction point exceeded the health threshold for the first time; that is, point 1833 is predicted as the initial failure time of the bearing. This prediction coincides with the actual degradation tracking result.

Fig. 6.12 Time-domain waveform and the recurrence plot for bearing 3
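As a rough sketch of the delay embedding and recurrence matrix computation described above (parameter defaults follow the example of the second data file; this is an illustrative reading, not the book's exact implementation):

```python
import numpy as np

def phase_space(x, m, tau):
    """Delay-coordinate embedding: each row is one reconstructed state."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, m * tau, tau)], axis=1)

def recurrence_plot(x, m=5, tau=2, k=8.0):
    """Binary recurrence matrix with the standard-deviation-based
    recursive threshold eps = k * std(x); for the second data file,
    eps = 8 * 0.11 = 0.88."""
    Y = phase_space(x, m, tau)
    eps = k * np.std(x)
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return (D < eps).astype(np.uint8)
```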



Fig. 6.13 PR entropy degradation curve based on improved recursive quantitative analysis method of bearing 3

Fig. 6.14 Kurtosis degradation curve for bearing 3

As a comparison, the AR model and the ARMA model, two commonly used time series prediction models, were also used to predict the initial failure time of the bearing, with the results shown in Figs. 6.18 and 6.19. As can be seen from Fig. 6.19, the AR model prediction cannot track the real trend of the recursive entropy well: when the AR model is constructed from the 60 recursive entropy features from points 1770 to 1829, the 6th prediction point exceeds the health threshold for the first time, so point 1835 is predicted as the initial failure time. This prediction has a 2-point (20 min) delay error. As can be seen from Fig. 6.18, when the ARMA model is constructed using


Fig. 6.15 PR entropy degradation tracking curve based on traditional recursive quantitative analysis method of bearing 3

Fig. 6.16 RMS degradation curve for bearing 3


the 60 recursive entropy features from points 1769 to 1828, the 6th prediction point exceeds the health threshold for the first time, so point 1834 is predicted as the initial failure time. Although the ARMA model improves the prediction accuracy, its result still has a 1-point (10 min) delay error. This error can be explained by the fact that neither the AR model nor the ARMA model contains any feedback process, so their predictions depend entirely on the development trend of the adjacent data. The Kalman filter algorithm, in contrast, makes full use of all kinds of information in the system, and every prediction step includes error feedback, which improves the prediction accuracy. Besides time series prediction models, the neural network is also a common prediction method, so a backpropagation (BP) neural network was also used to predict the initial fault time of the bearing. The same data segment (points 1769 to 1828) was used as the training data, and the prediction results are shown in Fig. 6.20. As can be seen from the figure, the prediction curve follows the trend of the actual recursive entropy values, but point 1832 is predicted as the initial failure time, a 1-point (10 min) prediction error. This error can be explained by the fact that a neural network needs a large amount of training data to ensure prediction accuracy, whereas in this online prediction experiment the amount of training data is relatively limited.

Fig. 6.17 Prediction results of the bearing 3 failure using Kalman filter


Fig. 6.18 Prediction results of the bearing 3 failure using ARMA model

Fig. 6.19 Prediction results of the bearing 3 failure using AR model

6.6 Particle Filter Based Machinery Fault Prognostics

The fault prediction of mechanical components can be divided into initial fault time prediction and remaining useful life prediction. Initial fault time prediction continuously tracks the working state of a mechanical component while it is healthy and predicts, from its degradation, the time at which a slight initial failure will occur; since the component is not yet seriously damaged at that point, it can continue to work for a long time. The prediction of the remaining useful life, on the other hand, is


Fig. 6.20 Prediction results of the bearing 3 failure using BP neural network

to follow the development trend of the fault after the initial failure of the component and to predict its remaining useful life, that is, the time it takes for the part to degrade from its current state to being completely damaged and unable to continue working. In the last section, the initial fault time was predicted in advance, but no method was given for predicting the remaining useful life. The characteristic parameters are stable while the machine parts are healthy and change noticeably only very close to the initial fault time. After the initial fault, however, a mechanical component generally goes through a long degradation stage before a serious failure prevents it from working. Therefore, remaining useful life prediction is a long-term prediction process, which places higher demands on the stability and accuracy of the prediction model.

Data-driven prediction models based on Bayesian estimation provide a rigorous mathematical framework for long-term prediction of dynamical systems [19]. Based on Bayesian estimation theory, the current state of the system can be estimated from the state at the previous time, and the estimate can be further updated with the current measurement data, yielding the optimal state estimate. Through such a recursive estimation process, multi-step time prediction can be accomplished. The Kalman filter algorithm used in the previous section is a linear approximation of Bayesian estimation, which solves the optimal a posteriori state estimation problem for systems in linear Gaussian spaces. For nonlinear estimation problems, the extended Kalman filter is widely used [20], but it only uses the first-order term of the Taylor series of the nonlinear function and ignores the higher-order terms;


as a result, extended Kalman filter algorithms are often effective only for weakly nonlinear systems and cannot deal with signals with strongly nonlinear characteristics. The particle filter algorithm solves the state prediction problem for nonlinear systems very well. It is based on Monte Carlo integration and recursive Bayesian estimation, and its basic idea is as follows: first, a set of random samples, called particles, is generated in the state space according to the empirical distribution of the system; the particles and their weights are then continuously updated from the measured data, and the updated particles approximate the posterior probability density distribution of the system state [21]. Current research on particle filtering for remaining life prediction includes gearbox remaining useful life prediction [22], bearing useful life prediction [23], and so on. However, particle filters still suffer from problems such as particle degradation and loss of particle diversity, which need further study. To address these problems, this section proposes an adaptive importance density function selection algorithm and a neural-network-based particle smoothing algorithm, and on this basis an enhanced particle filter algorithm for predicting the residual service life of mechanical rotating parts.

6.7 Particle Filter

For any dynamical system, the dynamic equations can be expressed as:

xk = f(xk−1) + ωk−1
zk = h(xk) + vk  (6.21)

where: xk—the state of the system at time k; zk—the observed value of the system at time k; ωk−1—the process noise of the system at time k − 1; vk—the measurement noise of the system at time k; f(•)—the process function; h(•)—the measurement function.

Given the dynamic equations in Eq. 6.21, Bayesian estimation can be used to infer the optimal a posteriori state distribution p(x0:k | z1:k). The reasoning process of Bayesian estimation can be divided into two steps, prediction and update, as shown in Eqs. 6.22 and 6.23 respectively:

p(x0:k | z1:k−1) = ∫ p(xk | xk−1) p(x0:k−1 | z1:k−1) dx0:k−1  (6.22)


p(x0:k | z1:k) = p(x0:k | z1:k−1) p(zk | xk) / p(zk | z1:k−1)
= p(zk | xk) p(xk | xk−1) p(x0:k−1 | z1:k−1) / p(zk | z1:k−1)
∝ p(zk | xk) p(xk | xk−1) p(x0:k−1 | z1:k−1)  (6.23)

where: p(x0:k | z1:k−1)—the predicted probability density distribution of the system at time k; p(x0:k | z1:k)—the posterior probability density distribution of the system at time k; p(xk | xk−1)—the state transition probability model; p(zk | xk)—the likelihood function; p(zk | z1:k−1)—the normalization factor, p(zk | z1:k−1) = ∫ p(x0:k | z1:k−1) p(zk | xk) dxk; ∝—the proportionality symbol.

The prediction step uses the system model to calculate the predicted probability density function at time k from all measurements up to time k − 1; the update step then uses the latest measurement to revise it into the posterior probability density function at time k. Equations 6.22 and 6.23 are the basis of Bayesian estimation. Generally speaking, for the nonlinear and non-Gaussian systems encountered in reality, the optimal solutions of Eqs. 6.22 and 6.23 cannot be given in closed analytical form. The particle filter method solves the integral operations of Bayesian estimation by the Monte Carlo method and thus gives an approximate optimal solution of Bayesian estimation [21]. First, for a set of random samples (particles) {xk^i, i = 1, 2, …, N} of the system at time k with corresponding weights {wk^i, i = 1, 2, …, N}, the posterior probability density function of the system state at time k can be approximated by these particles [24]:

p(xk | z1:k) ≈ ∑(i=1..N) wk^i δ(xk − xk^i)  (6.24)

where: N—the number of particles; δ(•)—the Dirac delta (impulse) function.

These weights can be determined by importance sampling theory [21], in which an importance density function q(x0:k | z1:k) is introduced to define the importance weight as follows:

wk^i ∝ p(x0:k^i | z1:k) / q(x0:k^i | z1:k)  (6.25)

Let q(x0:k | z1:k) be further decomposed into:


q(x0:k | z1:k) = q(xk | x0:k−1, z1:k) q(x0:k−1 | z1:k−1)  (6.26)

Then, combining Eqs. 6.23 and 6.26 with Eq. 6.25 gives:

wk^i ∝ p(x0:k^i | z1:k) / q(x0:k^i | z1:k)
= p(zk | xk^i) p(xk^i | xk−1^i) p(x0:k−1^i | z1:k−1) / [q(xk^i | x0:k−1^i, z1:k) q(x0:k−1^i | z1:k−1)]
= wk−1^i p(zk | xk^i) p(xk^i | xk−1^i) / q(xk^i | x0:k−1^i, z1:k)
= wk−1^i p(zk | xk^i) p(xk^i | xk−1^i) / q(xk^i | xk−1^i, zk)  (6.27)

From Eq. 6.27 it can be seen that determining the importance density function is the key step in calculating the importance weights. Gordon et al. proposed taking the prior state transition probability density function as the importance density function [21], that is, letting q(xk^i | xk−1^i, zk) = p(xk^i | xk−1^i). In this case, Eq. 6.27 simplifies to:

wk^i ∝ wk−1^i p(zk | xk^i)  (6.28)

That is, the weight at the current time can be obtained from the weight at the previous time and the likelihood function. The weights are then normalized by Eq. 6.29, and with the likelihood function obtained from the measurement equation in Eq. 6.21, the weight expression can be further given by Eq. 6.30:

wk^i ≈ wk^i / ∑(i=1..N) wk^i  (6.29)

wk^i ≈ wk−1^i p(zk | xk^i) ≈ wk−1^i pvk(zk − h(xk^i))  (6.30)

where: pvk(•)—the probability density function of the measurement noise vk.

Finally, the optimal state estimate x̂k of the system at the current time is:

x̂k ≈ ∑(i=1..N) wk^i xk^i  (6.31)
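A compact sketch of one cycle of this traditional (bootstrap) particle filter for a scalar state, under a Gaussian process-noise assumption (function and parameter names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf_step(particles, weights, f, h, q_std, pv, z):
    """One cycle of Eqs. 6.28-6.31: propagate with the prior, weight with
    the likelihood, normalize, estimate, and resample. `pv(r)` evaluates
    the measurement-noise density at residual r; f and h follow Eq. 6.21."""
    particles = f(particles) + rng.normal(0.0, q_std, size=particles.shape)
    weights = weights * pv(z - h(particles))          # Eq. 6.30
    weights = weights / weights.sum()                 # Eq. 6.29
    estimate = float(np.sum(weights * particles))     # Eq. 6.31
    idx = rng.choice(particles.size, size=particles.size, p=weights)  # multinomial resampling
    return particles[idx], np.full(particles.size, 1.0 / particles.size), estimate
```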

The biggest problem of the traditional particle filter algorithm is particle degradation. After several iterations, the weight concentrates on one or a few particles while the weights of the other particles become almost zero; as a result, much of the computational effort is wasted on updating particles that contribute little to the probability density function,


so that the resulting set of particles no longer reflects the real probability density function. Two effective means of alleviating particle degradation are to introduce a resampling algorithm into the traditional particle filter, which removes particles with small weights, and to optimize the weight distribution by selecting an appropriate importance density function. In the following, the shortcomings of existing importance density function selection and resampling algorithms are discussed in detail, and improved algorithms are proposed.

The importance density function in the particle filter algorithm is closely related to the calculation of particle weights and to the particle update process. To simplify the computation, the traditional particle filter uses the prior state transition probability density function as the importance density function, and this choice is also adopted in much of the current research on particle filtering [24]. However, this method takes the prior probability density of the system state as an approximation of the posterior probability density and does not take the current measurements into account, leading to a large deviation between the particles resampled from the importance (prior) probability density function and those that would be resampled from the true posterior probability density function. This bias is particularly pronounced when the likelihood function is located at the tail of the prior state transition probability density function. Several selection methods for the importance density function have been studied: for example, Yoon combined a Gaussian mixture model with an unscented sequential Monte Carlo probability hypothesis density filter to improve the traditional importance density selection [25], and Li proposed a new method for calculating the importance weights by combining wavelets with a grey model [26]. All of these studies introduce other algorithms to assist the selection of the importance density function and achieve good experimental results, but they also increase the complexity and the computational load of the algorithm. Therefore, different from the above methods, this section presents an adaptive importance density function selection method based on the update of the particle distribution itself.

In essence, the particle filter fits the posterior probability density distribution of the system state by continuously updating the particle swarm, while the importance density function closely influences the particle update process. Therefore, the updated particle swarm distribution can itself be taken as the importance density function of the current moment to guide the weight calculation and particle update at the next time. With this method, the importance density function is refreshed in each prediction step by the particle distribution of the previous prediction step; in this way it retains the prior information of the system state at the last moment while approaching the posterior probability density distribution of the system state through real-time updating.
Combined with the traditional particle filter algorithm, the procedure of the adaptive importance density function selection method is as follows (see Fig. 6.21; a code sketch follows the list):


Fig. 6.21 Flow chart of adaptive importance density function selection algorithm


(1) particle initialization: at k = 0, the particle swarm {x0^i, i = 1, 2, …, N} is generated from the prior distribution of the system, and the variance σ0 of the particle swarm is calculated in preparation for the iteration at the next moment;
(2) particle update: at k > 0, the process noise in Eq. 6.21 is determined by ωk−1 = ε + Δk−1, where Δk−1 ~ N(0, σk−1), ε is the error of the equation of state, and N denotes the normal probability density function. The particle swarm {xk^i, i = 1, 2, …, N} is obtained by propagating the previous particles through Eq. 6.21, and the mean and variance of the updated particle swarm are calculated. Since many distributions can be approximated by a normal distribution, the normal distribution N(x̄k, σk) is used to approximate the probability density function of the particle swarm;
(3) particle weight calculation: using the probability density function of the current particle swarm as the importance density function, q(xk^i | xk−1^i, zk) = N(x̄k, σk), the weight of each particle is calculated by Eq. 6.32:

ŵk^i = wk−1^i p(zk | xk^i) p(xk^i | xk−1^i) / q(xk^i | xk−1^i, zk) = (1/N) pvk(zk − h(xk^i)) p(xk^i | xk−1^i) / N(x̄k, σk)  (6.32)

where wk−1^i is defined as 1/N, because the particles at the previous moment have equal weights after the resampling algorithm;
(4) particle resampling: after the weights {ŵk^i} are normalized by Eq. 6.29, the current particle swarm {xk^i} and its weights are resampled using the resampling algorithm of the traditional particle filter to obtain a new particle swarm whose particles all have weight 1/N. The state of the system is then estimated from the new particle swarm by Eq. 6.31; next, k = k + 1 and the procedure returns to step (2) for the next particle update and weight calculation until the number of prediction steps reaches the threshold kend.
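These steps might be sketched as follows for a scalar state, assuming Gaussian process and measurement noise and equal prior weights (a simplified reading of the noise model in step (2); names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def adaptive_pf_step(particles, f, h, eps_std, v_std, z):
    """One iteration of the adaptive importance density selection method
    (steps 2-4 above; weights from Eq. 6.32)."""
    N = particles.size
    mean_prior = f(particles)                                      # transition means
    new_particles = mean_prior + rng.normal(0.0, eps_std, size=N)  # step 2
    q_mean, q_std = new_particles.mean(), new_particles.std()      # N(mean, sigma)
    lik = norm.pdf(z - h(new_particles), scale=v_std)              # p(z_k | x_k^i)
    trans = norm.pdf(new_particles - mean_prior, scale=eps_std)    # p(x_k^i | x_{k-1}^i)
    q = norm.pdf(new_particles, loc=q_mean, scale=q_std)           # importance density
    w = (1.0 / N) * lik * trans / q                                # Eq. 6.32, step 3
    w = w / w.sum()
    idx = rng.choice(N, size=N, p=w)                               # step 4: resampling
    return new_particles[idx]
```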

From the above steps, it can be seen that the importance density function is adjusted adaptively by the particle update in each iteration cycle, and the adjusted importance density function in turn affects the subsequent resampling operation and the next particle update. Therefore, in the particle update and resampling operations, the adaptive importance density function selection method not only considers the prior information of the system state at the previous time but also goes beyond it: the importance density function is adjusted so that the particle distribution approaches the posterior probability distribution of the system state, which reduces the possibility of particle degradation. In addition, because this method only needs the mean and variance of the particle swarm and introduces no other algorithms, its computational load is small and it meets the real-time requirement of online prediction.

The resampling algorithm is another effective means of mitigating particle degradation. Its basic idea is to duplicate particles with large


weights and discard those with small weights, so that the number of small-weight particles in the new particle swarm is reduced after resampling, which restrains particle degradation. However, after several resampling operations, particles with high weights may be duplicated many times, producing many identical particles in the new population and gradually destroying the diversity of the particle swarm, so that the iterated particle swarm can hardly represent the posterior probability density distribution of the system state [27]. This phenomenon is called particle depletion. To alleviate it, several improvements of the traditional resampling algorithm have been proposed, such as residual resampling [28] and distributed resampling [29]. These algorithms generally focus on improving the resampling step itself, but seldom address the singularities among the particles before resampling, i.e., particles with abnormally large weights, which are the root cause of particle depletion. Adjusting these singularities requires a model of the relationship between a particle and its corresponding weight. Since this relationship is generally nonlinear and non-Gaussian, the model is difficult to give in analytical form. Neural networks have good nonlinear tracking and information learning abilities; among them, the backpropagation (BP) neural network has a simple structure and requires little prior information about the modeled samples. Therefore, a BP neural network is used here as a particle smoothing algorithm to improve the traditional resampling procedure. This method does not change the steps of the original resampling algorithm; it only uses the BP network to smooth the particle weights before resampling in order to eliminate the singularities.

A BP neural network consists of an input layer, a hidden layer, and an output layer. During training, the gradient descent algorithm adjusts the network weights so as to reduce the total error until the error between the nonlinear input and output reaches a minimum. It has been proved theoretically that any continuous function can be fitted by a BP neural network with one hidden layer [30]. Therefore, a three-layer BP neural network is used to construct the relationship model between the particles and their weights and to smooth the particle weights. Figure 6.22 shows a schematic diagram of the neural-network-based particle smoothing algorithm, whose steps are as follows (a code sketch follows the list):

(1) first, the particle swarm {xk^i, i = 1, 2, …, N} and its corresponding weights {ŵk^i, i = 1, 2, …, N} are obtained after the particle update and weight calculation at time k;
(2) then, taking the particle values and their corresponding weights as the training inputs and training outputs of the neural network, respectively, a BP neural network, denoted MBP, is trained by the gradient descent algorithm;
(3) next, each particle value xk^i is fed into the trained network MBP as a test input, and the network output is computed as the smoothed particle weight, ŵs,k^i = MBP(xk^i);
(4) finally, the particles with the smoothed weights {(xk^i, ŵs,k^i), i = 1, 2, …, N} are resampled using the traditional resampling algorithm.
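A hedged sketch of this smoothing step, using scikit-learn's MLPRegressor in place of a hand-written BP network (hyperparameters are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def smooth_weights(particles, weights):
    """Fit a small three-layer network to the (particle, weight) pairs and
    replace each weight by the fitted value, damping abnormally large
    weights before resampling."""
    net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    net.fit(particles.reshape(-1, 1), weights)            # train M_BP
    smoothed = net.predict(particles.reshape(-1, 1))      # smoothed weights
    smoothed = np.clip(smoothed, 1e-12, None)             # keep weights positive
    return smoothed / smoothed.sum()                      # renormalize
```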


Fig. 6.22 Flow chart of particle smoothing algorithm based on neural network

In each iteration of the particle filter prediction, the particle smoothing algorithm is applied once before resampling; it maps the probability density distribution of the particle weights from a discrete space to a continuous space, and the new importance weight points are sampled in the generated continuous space. After this process, abnormally large particle weights are smoothed and the large differences between the weights of different particles are reduced, so that the diversity of the particles is preserved after the resampling operation.

6.8 Enhanced Particle Filter

Combining the adaptive importance density function selection method and the neural-network-based particle smoothing algorithm described above with the traditional particle filter method, an enhanced particle filter algorithm is proposed; its flow chart is shown in Fig. 6.23. The specific algorithm steps are as follows:

(1) first, the particles are initialized at k = 0 to prepare for the subsequent iterations;
(2) at the next time step, the system dynamic equation is used to update the particles, with the process noise corrected by the variance determined at the previous time;
(3) then the mean and variance of the particle swarm are calculated, the distribution of the particles at the current time is adopted as the importance density function by the adaptive importance density function selection method, and the weight of each particle is calculated and normalized;
(4) next, the particle weights are smoothed by the BP-neural-network-based particle smoothing algorithm to preserve particle diversity, and the particles are resampled according to the smoothed weights to obtain a new particle swarm with equal weights;

Fig. 6.23 Flow chart of enhanced particle filter algorithm


(5) finally, the new particle swarm is used to estimate the current state of the system through Eq. 6.31. After one prediction, the above steps are repeated to predict the next state of the system until the stopping condition k = kend is reached.

6.9 Enhanced Particle Filter Based Machinery Components Residual Useful Life Prediction

To use the enhanced particle filter algorithm for life prediction of mechanical rotating parts, the first problem to solve is the system dynamic equation. The dynamic equation must describe the state of the system well while remaining simple to construct, so as to meet the real-time requirement of online prediction. In addition, the running state of mechanical parts is related not only to the state at the previous time but also to the states at the p consecutive preceding times; therefore, unlike the first-order state equation given in Eq. 6.21, a multi-order state equation is needed to describe the state evolution of the system. The previous section verified the validity of the autoregressive (AR) model for predicting the initial failure time, and the AR model can relate the current state of the system to its states at the previous consecutive moments. Therefore, the AR model is used to construct the multi-order state equation, as shown in Eq. 6.33:

xk = f(xk−1, xk−2, …, xk−p) + ωk−1 = ∑(j=1..p) aj xk−j + εk−1  (6.33)

where the variables have the same meanings as in Eq. 6.18; the AIC criterion and the Burg algorithm are again used to determine the order and parameters of the model. The measurement equation takes the form:

zk = xk + vk  (6.34)

where the selection of vk is related to the predicted feature. With the AR state equation, the enhanced particle filter algorithm can be applied to the life prediction of mechanical rotating parts. Generally speaking, the whole life cycle of a mechanical rotating part can be divided into three stages: the healthy state, the fault degradation state, and the serious fault state. The degradation tracking health threshold introduced in Sect. 6.3 separates the healthy state of a component from the fault degradation state, and the initial failure time marks the beginning of the fault degradation state. Unlike initial failure time prediction, the prediction of the remaining useful life is carried out in the fault degradation state. Specifically, a prediction model (in this case, the enhanced particle


filter) is built from the current-time data to predict the state of the system forward until a predicted state exceeds a predetermined fault threshold Xth, which marks the component entering the serious fault state. The remaining useful life of the component at the current moment, RULt, is then obtained as:

RULt = tr − t  (6.35)

where t is the current moment and tr is the time at which the predicted state exceeds the fault threshold. The AR model and the enhanced particle filter algorithm are thus combined to establish a prediction framework for the residual service life of mechanical rotating parts, whose flow chart is shown in Fig. 6.24. The algorithm proceeds as follows (a code sketch of the per-time-step prediction loop is given below):

(1) extract the nonlinear features describing the degradation of the component, obtain the health threshold with the threshold setting method of Sect. 6.3, and determine the initial failure time, which marks the component's entry into the fault degradation stage;
(2) start predicting the remaining useful life from the beginning of the fault degradation state;
(3) at time t, select the n features in the time period {t − n + 1, t − n + 2, …, t} to construct the AR model, generate the particle swarm from a prior distribution using the feature of the current moment, and use the enhanced particle filter algorithm to predict forward over multiple steps {t + 1, t + 2, …} until a predicted feature exceeds the fault threshold Xth; if the time corresponding to that feature is t + m, then m is the predicted remaining useful life of the component at time t;
(4) at time t + 1, obtain the new observed feature, update the AR model with the n features in the time period {t − n + 2, t − n + 3, …, t, t + 1}, and obtain the predicted remaining useful life at time t + 1 with the enhanced particle filter algorithm as in step (3);
(5) repeat steps (3) and (4) until the new observed feature is larger than the fault threshold Xth;
(6) finally, draw the remaining useful life curve from the remaining useful life predicted at each time.

To verify the validity of the enhanced-particle-filter-based remaining useful life prediction algorithm for mechanical rotating parts, the same bearing degradation experiment data as in Sect. 6.3 are used, that is, the degradation test data of bearing 3 containing 2000 data files. The recursive entropy feature derived from the improved recursive quantitative analysis method is again used as the state characteristic parameter, and the remaining useful life is then predicted by the enhanced particle filter method.
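The per-time-step prediction loop of steps (3)–(4) might be sketched as follows, with `fit_ar` and `pf_predict_next` as placeholder callables for the AR fitting and for one enhanced-particle-filter prediction step:

```python
def predict_rul(features, t, fault_threshold, fit_ar, pf_predict_next,
                n=60, max_steps=1000):
    """RUL prediction at time t, following Fig. 6.24."""
    model = fit_ar(features[t - n + 1 : t + 1])   # step (3): build AR model
    state = features[t]
    for m in range(1, max_steps + 1):
        state = pf_predict_next(model, state)     # predict one step forward
        if state > fault_threshold:
            return m                              # RUL at time t, in sampling steps
    return None                                   # threshold not reached in horizon
```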


Fig. 6.24 Flow chart of residual service life prediction for mechanical rotating parts based on enhanced particle filter


The degradation tracking curve of bearing 3 based on recursive entropy is shown in Fig. 6.25. Each point in the figure corresponds to a data file, with a 10-min interval between adjacent points. According to the experimental results in Sect. 6.3, the health threshold is set to 1.6046 and the initial bearing failure occurs at point 1833. In addition, because the degradation curve shows a significant upward trend after point 2120, point 2120 is taken as the time of serious bearing failure, and the corresponding recursive entropy value 2.1242 is used as the fault threshold for life prediction. The prediction of the remaining useful life of the bearing starts from point 1833. First, the AR model is established as the dynamic equation using the 60 recursive entropy features from points 1833 to 1892, 100 particles (N = 100) are generated as the prior particle swarm, and the enhanced particle filter algorithm predicts the remaining useful life following the steps in Fig. 6.24; the 232nd prediction point exceeded the fault threshold for the first time, so the predicted remaining useful life at point 1892 is 2320 min (232 × 10 min). Using the same procedure to predict the remaining useful life of all the points in the fault degradation state, with point 2120 as the prediction stop point, 228 remaining life prediction points were obtained; the results are shown in Fig. 6.27. The horizontal axis of the figure represents the starting point of each life prediction, and the vertical axis represents the remaining useful life corresponding to that starting point. As can be seen from the figure, the predicted remaining useful life curve basically reflects the trend of the true remaining useful life curve, and the closer to the prediction end point, the smaller the fluctuation of the predicted curve; this means that the closer the bearing state is to the failure time,

Fig. 6.25 Degradation tracking curve of bearing 3


Fig. 6.26 Traditional particle filter residual service life prediction results of bearing 3

the higher the accuracy of the remaining useful life prediction is. For comparison, this experiment also uses the traditional particle filter method and the commonly used support vector regression prediction model, training the AR model and the regression network with the same data segments; the remaining useful life of bearing 3 is predicted by the same steps, and the results are shown in Figs. 6.26 and 6.28. To quantitatively evaluate the prediction error, Eqs. 6.36 and 6.37 are used to calculate the mean error e3 and the root-mean-square error e4:

e3 = (1/M) ∑(i=1..M) |RULp(i) − RULr(i)|  (6.36)

e4 = sqrt( (1/M) ∑(i=1..M) (RULp(i) − RULr(i))² )  (6.37)

where RULp is the predicted remaining useful life, RULr is the true remaining useful life, and M is the number of remaining useful life points used for prediction, which is 228 in this experiment. In addition, similar to the above simulation experiment, the effective particle number and the standard deviation of the particles were also calculated for each remaining useful life prediction point, and the averages of these two parameters over


Fig. 6.27 Enhanced particle filter residual life prediction results of bearing 3

all remaining useful life prediction points were computed; the results are shown in Table 6.5.

Table 6.5 Quantitative evaluation parameters of bearing 3 for different prediction methods

Parameter                                Enhanced particle filter   Traditional particle filter   Support vector regression
Average error e3 (in hours)              1.67                       3.89                          5.09
Root-mean-square error e4 (in hours)     2.43                       4.89                          6.16
Average number of effective particles    75                         59                            /
Mean standard deviation of particles     0.0608                     0.0291                        /

As can be seen from Fig. 6.26 and Table 6.5, although the remaining useful life prediction curve obtained by the traditional particle filter method also reflects the trend of the real remaining useful life curve, the fluctuation of the prediction curve and the prediction error are clearly higher than those of the enhanced particle filter algorithm. In addition, the average effective particle number and the average standard deviation of the particle swarm of the enhanced particle filter algorithm are larger than those of the traditional particle filter algorithm. These comparison results show that the adaptive importance density function selection method and the neural-network-based particle smoothing algorithm effectively reduce particle degradation and preserve particle diversity, thereby improving the prediction accuracy of the particle filter.


Fig. 6.28 Support vector regression residual life prediction results of bearing 3

In addition, as can be seen from Fig. 6.28 and Table 6.5, the fluctuation and prediction error of the remaining useful life curve obtained by the support vector regression algorithm are larger than those obtained by the particle filter algorithms. This can be explained by the fact that the support vector regression model is applied here to remaining useful life prediction under conditions of insufficient training samples and long prediction horizons; to meet the requirements of long-term prediction, the selection of its kernel function and regression parameters would need to be further improved and optimized.

References

1. Meng, Q.: Nonlinear Dynamical Time Series Analysis Methods and Its Application. Shandong University (in Chinese), Communication and Information Systems (2008)
2. Packard, N.H., Crutchfield, J.P., Farmer, J.D., et al.: Geometry from a time series. Phys. Rev. Lett. 45(9), 712–716 (1980)
3. Takens, F.: Detecting strange attractors in turbulence. In: Dynamical Systems and Turbulence, Warwick 1980, pp. 366–381. Springer, Heidelberg (1981)
4. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis. Cambridge University Press, UK (1997)
5. Broomhead, D.S., King, G.P.: Extracting qualitative dynamics from experimental data. Physica D 20, 217–236 (1986)
6. Kennel, M.B., Brown, R., Abarbanel, H.D.I.: Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A 45(6), 3403–3411 (1992)
7. Webber, C.L., Zbilut, J.P.: Dynamical assessment of physiological systems and states using recurrence plot strategies. J. Appl. Physiol. 76(2), 965–973 (1994)
8. Zbilut, J.P.: Detecting deterministic signals in exceptionally noisy environments using cross-recurrence quantification. Phys. Lett. A 246(1), 122–128 (1998)
9. Nichols, J.M., Trickey, S.T., Seaver, M.: Damage detection using multivariate recurrence quantification analysis. Mech. Syst. Signal Process. 20(2), 421–437 (2006)
10. Marwan, N.: Encounters with Neighbors: Current Developments of Concepts Based on Recurrence Plots and Their Applications. University of Potsdam, Potsdam (2003)
11. Bearing Data Center Seeded Fault Test Data: The Case Western Reserve University Bearing Data Center Website. http://csegroups.case.edu/bearingdatacenter/pages/welcome-case-western-reserve-university-bearing-data-center-website
12. Marwan, N., Romano, M.C., Thiel, M., et al.: Recurrence plots for the analysis of complex systems. Phys. Rep. 438(5–6), 237–329 (2007)
13. Thiel, M., Romano, M.C., Kurths, J., et al.: Influence of observational noise on the recurrence quantification analysis. Physica D 171(3), 138–152 (2002)
14. Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45 (1960)
15. Gototo, S., Nakamura, M., Uosaki, K.: Online spectral estimation of nonstationary time-series based on AR model parameter estimation and order selection with a forgetting factor. IEEE Trans. Signal Process. 43(6), 1519–1522 (1995)
16. Zhang, Y., Zhou, G., Shi, X., et al.: Application of Burg algorithm in time–frequency analysis of Doppler blood flow signal based on AR modeling. J. Biomed. Eng. 22(3), 481–485 (2005)
17. Qiu, H., Lee, J., Lin, J., et al.: Robust performance degradation assessment methods for enhanced rolling element bearing prognostics. Adv. Eng. Inform. 17(3–4), 127–140 (2003)
18. Qiu, H., Lee, J., Lin, J.: Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics. J. Sound Vib. 289(4–5), 1066–1090 (2006)
19. Zhu, Z.: Particle Filtering Algorithm and Its Application (in Chinese). Science Press, Beijing (2010)
20. Samantaray, S.R., Dash, P.K.: High impedance fault detection in distribution feeders using extended Kalman filter and support vector machine. Eur. Trans. Electrical Power 20(3), 382–393 (2010)
21. Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F Radar Signal Process. 140(2), 107–113 (1993)
22. Sun, L., Jia, Y., Cai, L., et al.: Residual useful life prediction of gearbox based on particle filtering parameter estimation method (in Chinese). Vib. Shock 32(6), 6–12 (2013)
23. Chen, C., Vachtsevanos, G., Orchard, M.: Machine remaining useful life prediction: an integrated adaptive neuro-fuzzy and high-order particle filtering approach. Mech. Syst. Signal Process. 28, 597–607 (2012)
24. Arulampalam, M.S., Maskell, S., Gordon, N., et al.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 50(2), 174–188 (2002)
25. Yoon, J.H., Kim, D.Y., Yoon, K.J.: Gaussian mixture importance sampling function for unscented SMC-PHD filter. Signal Process. 93(9), 2664–2670 (2013)
26. Li, T., Zhao, D., Huang, Z., et al.: A wavelet-based grey particle filter for self-estimating the trajectory of manoeuvring autonomous underwater vehicle. Meas. Control 36(3), 321–325 (2014)
27. Cao, B.: Research on Improved Algorithms and Applications Based on Particle Filter (in Chinese). Xi'an Institute of Optics and Precision Mechanics of CAS (2013)
28. Rigatos, G.G.: Particle filtering for state estimation in nonlinear industrial systems. IEEE Trans. Instrum. Meas. 58(11), 3885–3900 (2009)
29. Bolic, M., Djuric, P.M., Hong, S.J.: Resampling algorithms and architectures for distributed particle filters. IEEE Trans. Signal Process. 53(7), 2442–2450 (2005)
30. Li, Q., Yu, J., Mu, B., et al.: BP neural network prediction of the mechanical properties of porous NiTi shape memory alloy prepared by thermal explosion reaction. Mater. Sci. Eng. A 419(1–2), 214–217 (2006)

Chapter 7

Complex Electro-Mechanical System Operational Reliability Assessment and Health Maintenance

7.1 Complex Electro-Mechanical System Operational Reliability Assessment

The mechanical equipment manufacturing industry is an important embodiment of a nation's integrated strength and defense strength. As the key tool and resource of this industry, complex electro-mechanical systems are the foundation of the national economy and industry. However, under long-term operation, variable loads and multi-physical-field coupling, the performance of complex electro-mechanical systems gradually degrades, which impacts their reliability and remaining useful life. Typically, an aircraft engine suffers from safety hazards such as rotor cracks, bearing damage, and rubbing between static and rotating parts because of its complex structure and severe working conditions, which affects its reliability. According to statistics from the Air Force Materials Laboratory (AFML), more than 40% of flight accidents caused by mechanical factors are related to aircraft engines, and in aero-engine accidents the failure of engine rotors (including rotating parts such as shafts, bearings, discs and blades) accounts for more than 74%. At present, the overhaul period of Chinese aero engines is half that of American aero engines, and the total service life is only a quarter. For example, the refurbishment life of the J-10's engine is 300 flight hours and its total life is 900 flight hours, whereas the overhaul period of the US third-generation turbofan engines F100 and F110 is about 800–1000 flight hours, with a total service life of about 2000–4000 flight hours. Another typical example is construction machinery, which works in the field all year round, often in extreme environments such as high altitude, hypoxia and drought, and under variable working conditions; the resulting damage seriously endangers the reliability and service life of these electro-mechanical systems. According to statistics, engine failures account for about 30% of construction machinery failures, transmission system failures for 20%, hydraulic system failures for 25–35%, and braking system failures and structural weld cracking for 15–25%. Abroad, the mean time between failures (MTBF) obtained in 1000 h reliability tests and during the "three guarantees" period is 500–800 h, with a maximum of more than 2000 h, whereas the MTBF in China is 150–300 h; for wheel loaders it averages 297.1 h, with a maximum of 400 h and a minimum of 100 h. To sum up, high failure rates, low reliability and short service life are the bottlenecks that restrict the international competitiveness and influence of Chinese mechanical equipment, and overcoming them is the key to transforming China from a manufacturing giant into a manufacturing power. The difficulties are as follows:

(1) Reliability research and controllable life design in China are at an initial stage. Mechanical equipment R&D in China has long imitated foreign products through surveying and mapping for design and development. There is therefore a lack of basic data such as load spectra and reliability and life tests of key parts, and the relationship between load spectrum, service condition parameters, reliability and life has not been established, making it impossible to control the service life and reliability of products at the design stage.

(2) Traditional reliability theory and life testing mainly rely on classical probability statistics, which must meet three prerequisites: a large number of samples, as required by the law of large numbers; probabilistic repeatability of the samples; and freedom from human interference. These prerequisites are difficult to meet in the reliability and life testing of mechanical equipment: given the economic and time costs, it is unrealistic to obtain reliability and life data from a large number of samples, and differences in operation and maintenance among machines make it difficult to ensure probabilistic repeatability and the absence of human disturbance.

(3) Traditional reliability assessment is mainly based on the binary hypothesis or the finite-state hypothesis, which consider only two states (normal and failure) or finitely many states. However, the health status of an electro-mechanical system exhibits continuous, progressive degradation and random, dispersed failure, and the binary and finite-state hypotheses are insufficient to reveal the health attributes of mechanical equipment.

(4) Changes in service conditions and operating parameters (such as temperature, vibration, load, pressure and electrical load) often affect the operational reliability of electro-mechanical systems; reliability is affected whenever any single parameter exceeds its limit or fails. However, machine service conditions and parameters rarely follow traditional mathematical distributions, so their changes are often difficult to handle.

In order to reveal the failure laws of parts, structures and equipment, scholars at home and abroad have conducted in-depth research, explored the physical mechanisms of performance degradation and failure evolution of mechanical structures, and put forward reliability prediction models that reflect the failure correlation among components. These models mainly estimate and predict the failure characteristics of the population (such as the mean time to failure and the probability of reliable operation) from large amounts of historical failure data. The literature [1–3] summarizes the reliability prediction methods and theories. Common reliability prediction methods based on fault events include linear models, polynomial models, exponential models, time series models and regression models. Owing to the rich information contained in operating state data, some scholars have in recent years fused reliability methods with fault prediction techniques, making fault prediction results more scientific and complete.

7.1.1 Definitions of Reliability

Reliability is the ability of components, products and systems to perform their specified functions without failure within a given period of time and under given conditions. Product reliability is usually evaluated by indicators such as the degree of reliability, the failure rate and the MTBF. Mathematically:

Degree of reliability: the probability that the product completes its intended function within the specified time and under the specified conditions. Assuming that the specified time is t and the product life is T, the degree of reliability is usually expressed as the probability that T > t:

$R(t) = P(T > t)$  (7.1)

Failure probability: the probability that a product loses its intended function under the specified conditions and within the specified time, also known as the unreliability; it is denoted F(t) and is usually expressed as:

$F(t) = P(T \le t)$  (7.2)

Obviously, the relationship between the failure probability and the degree of reliability is:

$F(t) = 1 - R(t)$  (7.3)

Operational reliability: a normalized health measure of the ability to complete the intended function, determined from the operating state information under the specified conditions and within the service time.

7.1.2 Operational Reliability Assessment

7.1.2.1 Condition Monitoring and Signal Acquisition for Machinery

As is well known, machinery passes through a series of degradation states from normal operation to failure, and the operating process can be monitored through measurable variables. It is therefore very important to establish an internal relationship between the mechanical operational reliability and condition monitoring information such as vibration signals, temperature and pressure. The vibration signal, the most commonly used type of data, is acquired by sensors. Operational reliability evaluation starts from the monitoring data obtained by the sensors and the data acquisition system, which collects the mechanical state information in given operating states.

7.1.2.2 Lifting Wavelet Packet Transform

In order to extract the mechanical dynamic signal characteristics from the collected excitation response signal, the wavelet transform, with its multi-resolution capability, can observe the signal at different scales (resolutions). It has at least the following two outstanding advantages:

(1) The multi-resolution ability of the wavelet transform allows the signal to be observed at different scales and decomposed into different frequency bands, so that both the full picture and the details of the signal can be seen.

(2) The orthogonality of the wavelet transform decomposes any signal into mutually independent frequency bands, so that the decomposed signals in these bands carry different mechanical state information.

The lifting wavelet packet inherits the good multi-resolution and time–frequency localization characteristics of the first-generation wavelet transform, and offers more efficient and faster execution, a simple structure and low computational complexity [4]. Daubechies proved that any wavelet transform can be executed by a lifting scheme [5]. The lifting wavelet packet transform is a lifting scheme based on the wavelet packet transform, which includes a forward transform (decomposition) and an inverse transform (reconstruction); the inverse transform is realized by running the forward transform in the reverse direction. The specific process is as follows:

(1) Decomposition

The forward transform of the lifting wavelet packet for signal decomposition includes three steps: split, prediction and update.

Split: suppose there is an original signal S = {x(k), k ∈ Z}. The original signal is divided into two sub-sequences: an even sequence $s_e = \{s_e(k), k \in Z\}$ and an odd sequence $s_o = \{s_o(k), k \in Z\}$, where x(k) is the kth sample of S and Z is the set of positive integers:

$s_e(k) = x(2k), \quad k \in Z$  (7.4)

$s_o(k) = x(2k + 1), \quad k \in Z$  (7.5)

where k is the sample index in the sub-sequences $s_e$ and $s_o$.


The reason for splitting the original signal into two series is that adjacent samples are much more correlated than samples far apart; the odd and even series are therefore highly correlated.

Prediction and update: samples of the even series are used to predict samples of the odd series, and the prediction difference is called the detail signal. The even series is then updated with the obtained detail signal, and the updated even signal is called the approximation signal:

$s_{l,1} = s_{(l-1),1,o} - P(s_{(l-1),1,e})$  (7.6)

$s_{l,2} = s_{(l-1),1,e} + U(s_{l,1})$  (7.7)

$s_{l,2^l-1} = s_{(l-1),2^{l-1},o} - P(s_{(l-1),2^{l-1},e})$  (7.8)

$s_{l,2^l} = s_{(l-1),2^{l-1},e} + U(s_{l,2^l-1})$  (7.9)

After the lth decomposition, $s_{l,1}, s_{l,2}, \ldots, s_{l,2^l}$ are the decomposed signals in each frequency band; $s_{(l-1),1,o}, \ldots, s_{(l-1),2^{l-1},o}$ are the odd sequences and $s_{(l-1),1,e}, \ldots, s_{(l-1),2^{l-1},e}$ the even sequences after the (l − 1)th decomposition; P is the predictor with coefficients $p_1, p_2, \ldots, p_N$, where N is the number of predictor coefficients, and U is the updater with coefficients $u_1, u_2, \ldots, u_{\tilde{N}}$, where Ñ is the number of updater coefficients. The forward transform of the lifting wavelet packet is illustrated in Fig. 7.1.

Fig. 7.1 Lifting wavelet packet transform

(2) Reconstruction

The inverse transform for signal reconstruction is derived from the forward transform by running the lifting scheme illustrated in Fig. 7.1 backwards. The signal of the appointed frequency band is retained for reconstruction, and the others are set to zero. The reconstruction of the lifting wavelet packet transform for an appointed frequency band is carried out as follows:

$s_{(l-1),2^{l-1},e} = s_{l,2^l} - U(s_{l,2^l-1})$  (7.10)

$s_{(l-1),2^{l-1},o} = s_{l,2^l-1} + P(s_{(l-1),2^{l-1},e})$  (7.11)

$s_{(l-1),2^{l-1}}(2k) = s_{(l-1),2^{l-1},e}(k), \quad k \in Z$  (7.12)

$s_{(l-1),2^{l-1}}(2k+1) = s_{(l-1),2^{l-1},o}(k), \quad k \in Z$  (7.13)

$s_{(l-1),1,e} = s_{l,2} - U(s_{l,1})$  (7.14)

$s_{(l-1),1,o} = s_{l,1} + P(s_{(l-1),1,e})$  (7.15)

$s_{(l-1),1}(2k) = s_{(l-1),1,e}(k), \quad k \in Z$  (7.16)

$s_{(l-1),1}(2k+1) = s_{(l-1),1,o}(k), \quad k \in Z$  (7.17)

Therefore, this chapter uses the lifting wavelet packet method, which divides the signal into multi-level frequency bands over the whole frequency range, to analyze the mechanical vibration signal more precisely.
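To make the lifting procedure concrete, the following minimal sketch implements one split–predict–update step and stacks it into a wavelet packet tree. It assumes the simplest Haar-type predictor P(s_e) = s_e and updater U(d) = d/2 rather than the predictor and updater coefficients used in this chapter, and all function names are illustrative:

```python
import numpy as np

def lifting_step(x):
    """One forward lifting step: split (Eqs. 7.4-7.5), predict, update.
    Haar-type choices P(se) = se and U(d) = d / 2 are assumed here;
    an even-length input is also assumed."""
    se, so = x[0::2], x[1::2]       # split into even and odd sub-sequences
    d = so - se                     # predict: detail signal, cf. Eq. (7.6)
    a = se + d / 2                  # update: approximation signal, cf. Eq. (7.7)
    return a, d

def inverse_lifting_step(a, d):
    """Inverse transform: run the forward step backwards, cf. Eqs. (7.10)-(7.17)."""
    se = a - d / 2                  # undo the update
    so = d + se                     # undo the prediction
    x = np.empty(se.size + so.size)
    x[0::2], x[1::2] = se, so       # merge even and odd samples back
    return x

def lwp_decompose(x, level):
    """Lifting wavelet packet tree: every band (approximation and detail)
    is split again at each level, giving 2**level bands of equal width."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(level):
        bands = [half for b in bands for half in lifting_step(b)]
    return bands

# perfect-reconstruction check for one step on a random signal
x = np.random.randn(64)
assert np.allclose(inverse_lifting_step(*lifting_step(x)), x)
```

Running the forward transform in reverse, as in `inverse_lifting_step`, is exactly the reconstruction described above.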

7.1.2.3 Energy Distribution of the Lifting Wavelet Packet Transform

Since the orthogonal basis of the lifting wavelet packet obeys the law of conservation of energy [6], the $2^l$ bands obtained after the lth decomposition and reconstruction have the same bandwidth and are connected end to end. Let $s_{l,i}(k)$ be the reconstructed signal of the ith band after the lth decomposition; its energy $E_{l,i}$ and relative energy $\tilde{E}_{l,i}$ are defined as follows:

$E_{l,i} = \frac{1}{n-1}\sum_{k=1}^{n}\left(s_{l,i}(k)\right)^2, \quad i = 1, 2, \ldots, 2^l, \quad k = 1, 2, \ldots, n, \; n \in Z$  (7.18)

$\tilde{E}_{l,i} = E_{l,i}\left(\sum_{i=1}^{2^l} E_{l,i}\right)^{-1}$  (7.19)
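Given the $2^l$ band signals, Eqs. (7.18) and (7.19) reduce to a few lines of code. The sketch below works under the same assumptions as the earlier example; note that the chapter computes the energies on the fully reconstructed band signals, while this sketch simply accepts whatever band signals it is given:

```python
import numpy as np

def relative_energies(bands):
    """Band energies E_{l,i} of Eq. (7.18), normalized to the relative
    energies of Eq. (7.19); the returned values sum to one."""
    E = np.array([np.sum(b ** 2) / (b.size - 1) for b in bands])
    return E / E.sum()
```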

7.1.2.4 Entropy and Measurement of Operational Reliability

Entropy is a commonly used measure of uncertainty and an important concept in modern dynamical systems and ergodic theory. Einstein once called the law of entropy "the first law of the whole of science". Information entropy was proposed by the American scholar C. E. Shannon in 1948, introducing thermodynamic entropy into information theory to measure the uncertainty in a system [7–9]. In information theory, information entropy represents the average amount of information provided by each symbol and the average uncertainty of the information source. Given an uncertain system X = {x_n}, its information entropy can be expressed as [10]:

$S_v(X) = -\sum_{i=1}^{n} p_i \log(p_i)$  (7.20)

where $\{p_i\}$ is the probability distribution of $\{x_n\}$ and $\sum_{i=1}^{n} p_i = 1$. Information entropy describes the uncertainty of a system and evaluates the complexity of random signals. According to this theory, the more uncertain the probability distribution (with the equal-probability distribution as the extreme case), the larger the entropy value, and vice versa; the magnitude of the information entropy therefore reflects the uniformity of the probability distribution. Information entropy can thus serve as a dimensionless index to measure the irregularity and complexity of mechanical signals in real time and to evaluate the reliability of the mechanical state. The block diagram of operational reliability assessment and health maintenance for complex electro-mechanical systems is shown in Fig. 7.2.
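As a quick numerical illustration of Eq. (7.20) and of the uniformity property just described, the snippet below compares the entropy of an equal-probability distribution with that of a concentrated one; it is a toy example with made-up probability vectors:

```python
import numpy as np

def information_entropy(p):
    """Information entropy of Eq. (7.20) for a probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

uniform = np.full(8, 1 / 8)             # most uncertain distribution
peaked = np.array([0.93] + [0.01] * 7)  # nearly deterministic distribution
print(information_entropy(uniform))     # 3.0 bits, the maximum for 8 symbols
print(information_entropy(peaked))      # ~0.56 bits, much smaller
```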

Fig. 7.2 Illustration of operational reliability and health maintenance of complex electro-mechanical systems

7.2 Reliability Assessment and Health Maintenance of Turbo Generator Set in Power Plant

Turbo generators produce a large amount of electrical energy; they are an important part of the electric power system and are widely used in the power industry all over the world. With a detailed, long-term maintenance plan in place, utilities can ensure that their facilities safely deliver as much reliable power to the grid as possible. The criteria for a turbo generator are high reliability and high performance, with many starts and flexible operation throughout the service life. In addition, modern turbo generators are built to last between 30 and 40 years. With aging generator units and mechanical components, reliability and safety evaluation are imperative for a plant to prevent failures. Related research has been conducted worldwide by many researchers and engineers. Matteson proposed a dynamic multi-criteria optimization framework for the sustainability and reliability evaluation of power systems [11]. Lo Prete proposed a model to assess and quantify the sustainability and reliability of different power production scenarios [12]. Moharil et al. analyzed generator system reliability with wind

energy penetration in the conventional grid [13]. Since turbo generator faults have a significant impact on safety, Whyatt et al. identified failure modes experienced by turbo generators and described their reliability [14]. Tsvetkov et al. presented a mathematical model for the analysis of generator reliability, including the development of defects [15]. Generally speaking, traditional approaches entail collecting sufficient failure samples to estimate the general probability of system or component failures and the distribution of the time to failure. It is usually difficult to use probability and statistics for turbo generator safety analysis because of the lack of failure samples and time-to-failure data. The failure rate of a generator includes all the failures that cause the generator to shut down and also depends on the maintenance and operating policy of the utility. In practice, turbo generators operate under different parameters and conditions such as temperature, vibration, load and stress. Variations of the operating parameters can affect operational safety whenever a single parameter or condition exceeds its limit, and failures can also be caused by the interaction of operating parameters. Real-time operation has shown that a component experiences more failures under heavy loading than under light loading, which means that the failure rate of a component in real-time operation is not constant but varies with the operating parameters [16].


Depending on the operating parameters and conditions, the constitutive components of a turbo generator go through a series of degradation states evolving from functioning to failure. There is therefore a great demand for methods of assessing the operational safety of turbo generators under time-varying operating parameters and conditions over their whole life span, which benefits the implementation of optimal condition-based maintenance schedules with low failure risk. When condition monitoring is performed during plant operational transients, the intrinsically dynamic behavior of the monitored time-varying signals should be taken into account. Monitoring the condition of a component is typically based on several sensors that estimate the values of some measurable parameters (signals) and trigger a fault alarm when a measured signal is out of limit. To this end, Baraldi et al. proposed approaches based on the development of several reconstruction models, with the signals preprocessed by Haar wavelet transforms, for a gas turbine during start-up transients [17]. Lu et al. proposed a simplified on-board model with sensor fault diagnostic logic for turbo-shaft engines [18]. Li et al. established a hybrid model for hydraulic turbine-generator units based on nonlinear vibration [19]. The above operational safety diagnosis and evaluation methods mainly utilize dynamically monitored information, so it is essential to determine how to process the monitored information and associate it with operational safety. Information entropy is an effective indicator of a system's degree of uncertainty. According to information entropy theory, the most uncertain probability distribution (such as the equal-probability distribution) has the largest entropy, and the most certain probability distribution has the smallest. On this basis, information entropy is widespread in engineering applications, and many variants have been defined for particular purposes, such as the topological entropy of a given interval map, the spatial entropy of pixels, the weighted multiscale permutation entropy of nonlinear time series, the Shannon differential entropy of distributions, min- and max-entropies, collision entropy, permutation entropy [20], time entropy [21], multiscale entropy and wavelet entropy [22]. Entropy is also widely used in machinery fault diagnosis. Sawalhi et al. used minimum entropy deconvolution and spectral kurtosis for fault detection in rolling element bearings [23]. Tafreshi et al. proposed a machinery fault diagnosis method utilizing an entropy measure and an energy map [24]. He et al. used approximate entropy as a nonlinear feature parameter for fault diagnosis of rotating machinery [25]. Wu et al. proposed a bearing fault diagnosis method based on multiscale permutation entropy and support vector machines [26]. Within the family of information entropies, Rényi entropy was introduced by Alfréd Rényi in 1960 [27] and is known as a parameterized family of uncertainty measures. Notably, the classical Shannon entropy is the special case of Rényi entropy obtained in the limit as the order α tends to one. Similarly, other entropy measures that have appeared in the literature are also special cases of Rényi entropy [28].
Besides being of theoretical interest as a unification of several distinct entropy measures, Rényi entropy has found various applications in statistics and probability [29], pattern recognition [30], quantum chemistry [31], biomedicine [32], etc.


Therefore, a new method of operational reliability evaluation and health maintenance is proposed. The method extracts the relative energy of each frequency band of the reconstructed signals through lifting wavelet packet decomposition and reconstruction and maps it to the [0, 1] interval by Rényi entropy. First, sensor-dependent vibration signals reflecting the time-varying characteristics of an individual turbo generator are acquired by professional sensors and analyzed by the lifting wavelet packet, since the wavelet transform excels in analyzing non-stationary signals in both the time and frequency domains. The relative energy of the decomposed and reconstructed signals in each frequency band describes the energy distribution of the signals across the bands, and the signal characteristics are mapped to the [0, 1] reliability interval by defining the wavelet Rényi entropy. The method is applied to a 50 MW turbo generator.

7.2.1 Condition Monitoring and Vibration Signal Acquisition

After a 50 MW turbo generator unit (shown in Fig. 7.3) is repaired, in order to ensure its normal start-up and operation, an MDS-2 portable vibration monitoring system and professional sensors are used to monitor the #1 and #2 bearing bushings in the high-pressure cylinder, the #3 and #4 bearing bushings in the low-pressure cylinder, and the #5 and #6 bearing bushings in the electric generator. The structure of the unit, shown in Fig. 7.4, mainly consists of a high-pressure cylinder, a low-pressure cylinder, an electric generator and the #1–#6 bearing bushings.

Fig. 7.3 The illustration of the 50 MW turbo generator unit

Fig. 7.4 The structure diagram of the 50 MW turbo generator unit

With increasing speed and load during start-up, all the bearing bushings are in normal states, with peak-to-peak vertical vibration below 50 μm, except for the #4 bearing bushing in the low-pressure cylinder, whose vibration is out of limit. Condition monitoring is therefore focused on the vertical vibration of the #4 bearing bushing. During the no-load start-up, the peak-to-peak vertical vibration of the #4 bearing bushing is 24.7 μm at 740 r/min; it increases to 63.2 μm at 3000 r/min and even to 86.0 μm at 3360 r/min. Afterward, vibration monitoring is conducted at a stable speed of 3000 r/min with several given loads. The peak-to-peak vibration is about 74 μm with a load of 6 MW, 104 μm with 16 MW, and 132 μm with 20 MW. The vibration is too severe to increase the load further, so the load is decreased to 6 MW, where the peak-to-peak vibration is about 75–82 μm. The acquired vibration waveform, sampled at 2 kHz, is shown in Fig. 7.5; it is disordered, and the top and bottom of the signal are asymmetric.

Fig. 7.5 The waveform of #4 bearing bushing vibration signal in time domain

The FFT spectrum of the vibration signal is shown in Fig. 7.6. The amplitude of the 50 Hz running frequency is the largest over the whole frequency range. In the 100–500 Hz band there are many harmonic components, from 2 to 10 times the running frequency, whose amplitudes are also large. Generally speaking, the 50 Hz running-frequency component represents rotor imbalance, and the 100 Hz twice-running-frequency component mainly reflects shafting misalignment. For the remaining high-frequency harmonics, it is difficult to judge the current health status of the unit from the FFT spectrum alone.

7.2.2 Vibration Signal Analysis

In order to further analyze the sensor-dependent vibration signal, the lifting wavelet packet is adopted to decompose the original signal to level 2, level 3 and level 4, respectively.

Fig. 7.6 The FFT spectrum of #4 bearing bushing in the turbo generator unit

Figure 7.7 shows the four signals obtained by the lifting wavelet packet at level 2, corresponding to the frequency bands 0–250 Hz, 250–500 Hz, 500–750 Hz and 750–1000 Hz, respectively. Figure 7.8 shows the relative energy distribution of the reconstructed signals at level 2: the first band has the largest relative energy, and the second band carries much more relative energy than the remaining two bands.

Fig. 7.7 Lifting wavelet packet reconstructed signal in level 2

Fig. 7.8 The relative energy distribution of lifting wavelet packet reconstructed signal in level 2

On the basis of the level 2 wavelet packet analysis, the original signal is further decomposed and reconstructed at level 3, and eight signals are obtained, as shown in Fig. 7.9. They correspond to the frequency bands 0–125 Hz, 125–250 Hz, 250–375 Hz, 375–500 Hz, 500–625 Hz, 625–750 Hz, 750–875 Hz and 875–1000 Hz, respectively. The relative energy distribution of the eight signals is shown in Fig. 7.10; the energy of the first four frequency bands is much larger than that of the last four.

Fig. 7.9 Lifting wavelet packet reconstructed signal in level 3

Fig. 7.10 The relative energy distribution of lifting wavelet packet reconstructed signal in level 3

On the basis of the level 3 analysis, the original signal is further decomposed to level 4, as shown in Fig. 7.11. The sixteen obtained signals correspond to the frequency bands 0–62.5 Hz, 62.5–125 Hz, 125–187.5 Hz, 187.5–250 Hz, 250–312.5 Hz, 312.5–375 Hz, 375–437.5 Hz, 437.5–500 Hz, 500–562.5 Hz, 562.5–625 Hz, 625–687.5 Hz, 687.5–750 Hz, 750–812.5 Hz, 812.5–875 Hz, 875–937.5 Hz and 937.5–1000 Hz. The relative energy of each frequency band is shown in Fig. 7.12: the low-frequency bands account for a large share of the signal energy, with the first reconstructed signal carrying the largest proportion, followed by the fourth to eighth bands, while the second and third bands carry only a small amount.

Fig. 7.11 Lifting wavelet packet reconstructed signal in level 4

Fig. 7.12 The relative energy distribution of lifting wavelet packet reconstructed signal in level 4

7.2.3 Operational Reliability Assessment and Health Maintenance

After the lifting wavelet packet is used to decompose and reconstruct the vibration signal of the turbo generator unit, the original signal is decomposed into several independent frequency band signals, each with its corresponding relative energy distribution characteristics. As a dimensionless index, information entropy can describe the average amount of relative energy information provided by each reconstructed signal and the average uncertainty of the information source, measure the irregularity and complexity of mechanical signals in real time, and evaluate the reliability of the mechanical condition.

7.2.3.1 Probability Space and Random Variable

As usual, a finite probability space is given by a non-empty finite set Ω and a probability function P: Ω → [0, 1] with $\sum_{\omega \in \Omega} P(\omega) = 1$, taking it as understood that the σ-algebra is the power set of Ω. For a random variable X: Ω → χ, the range is assumed to be finite. The distribution of X is denoted $P_X$: χ → [0, 1], i.e., $P_X(x) = P(X = x)$, where X = x is shorthand for the event {ω ∈ Ω | X(ω) = x}. Standard notation is used for intervals, e.g., [0, 1] = {r ∈ R | 0 ≤ r ≤ 1} and (1, ∞) = {r ∈ R | 1 < r}.

7.2.3.2 Rényi Entropy

Rényi entropy unifies all the distinct entropy measures. For a parameter α ∈ [0, 1) ∪ (1, ∞) and a random variable X, the Rényi entropy of X is defined as:

$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_{x} P_X(x)^\alpha$  (7.21)

where the sum is over all x ∈ supp($P_X$). It is well known and not hard to verify that this definition of $H_\alpha$ is consistent with the respective definitions of $H_0$ and $H_2$, and that $\lim_{\alpha\to 1} H_\alpha(X) = H(X)$ and $\lim_{\alpha\to\infty} H_\alpha(X) = H_\infty(X)$. Furthermore, it is known that the Rényi entropy is decreasing in α, i.e., $H_\beta(X) \le H_\alpha(X)$ for 0 ≤ α ≤ β ≤ ∞. For α ∈ [0, 1) ∪ (1, ∞), it is convenient to rewrite $H_\alpha(X)$ as $H_\alpha(X) = -\log \mathrm{Ren}_\alpha(X)$ with:

$\mathrm{Ren}_\alpha(X) = \left(\sum_{x} P_X(x)^\alpha\right)^{\frac{1}{\alpha-1}} = \|P_X\|_\alpha^{\frac{\alpha}{\alpha-1}}$  (7.22)

where $\|P_X\|_\alpha$ is the α-norm of $P_X$: χ → [0, 1] ⊂ R; $\mathrm{Ren}_\alpha(X)$ is called the Rényi probability (of order α) of X. For completeness, $\mathrm{Ren}_0(X) = |\mathrm{supp}(P_X)|^{-1}$ and $\mathrm{Ren}_1(X) = 2^{-H(X)}$ are also defined, which is consistent with taking the limits.
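As a minimal numerical counterpart of Eq. (7.21), the sketch below evaluates $H_\alpha$ over supp($P_X$) and handles the Shannon case α → 1 explicitly; it is an illustration, not a library routine:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha (in bits) of a probability vector p, Eq. (7.21).
    The sum runs over the support of p only; alpha close to 1 falls back
    to the Shannon entropy H(X), consistent with taking the limit."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                  # restrict to supp(P_X)
    if np.isclose(alpha, 1.0):                    # Shannon limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
```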

7.2.3.3 Operational Reliability Degree

The relative energy distribution $\tilde{E}_{l,i}$ of the lifting wavelet packet analyzed signal is calculated according to Eq. (7.19). Within the framework of the Rényi entropy calculation, the energy characteristics of the mechanical operating state signals are mapped into a dimensionless indicator in the interval [0, 1], and the operational reliability is defined as follows:

$R = 1 - \frac{1}{1-\alpha}\log_{2^l}\sum_{i=1}^{2^l}\left(\tilde{E}_{l,i}\right)^{\alpha}$  (7.23)
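Plugging the relative energies of Eq. (7.19) into Eq. (7.23) gives the operational reliability degree directly. In the sketch below the logarithm is taken to base $2^l$, so a uniform energy distribution yields R = 0 and full concentration in one band yields R = 1; the Rényi order α = 2 is an assumption, since the text leaves the order unspecified:

```python
import numpy as np

def reliability_renyi(E_rel, alpha=2.0):
    """Operational reliability degree of Eq. (7.23) from the relative band
    energies E_rel (length 2**l, summing to one). The base-2**l logarithm
    normalizes the Renyi entropy to [0, 1], so R = 1 - H_alpha."""
    E = np.asarray(E_rel, dtype=float)
    n = E.size                                    # number of bands, 2**l
    H = np.log(np.sum(E[E > 0] ** alpha)) / np.log(n) / (1.0 - alpha)
    return 1.0 - H

# boundary checks: uniform energies -> 0, one dominant band -> 1
assert np.isclose(reliability_renyi(np.full(8, 1 / 8)), 0.0)
assert np.isclose(reliability_renyi(np.eye(8)[0]), 1.0)
```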

The operational reliability degree is calculated according to this equation from level l = 2 to level l = 4, respectively. From Table 7.1, it is seen that the current operational reliability degree of the turbo generator is low: all the calculated reliability degrees from level l = 2 to level l = 4 are under 0.4. It is inferred that the current health condition of the turbo generator unit is poor and that there are potential faults and dangerous parameters, which make the mechanical operating condition unstable, lead to uncertainty in the probability distribution of the monitored vibration signals, and therefore yield a low operational reliability degree.

Table 7.1 Operational reliability degree

Decomposition level    l = 2     l = 3     l = 4
Reliability            0.3363    0.2467    0.2812


As shown by the above analysis, the amplitude of the running frequency is the largest over the whole frequency range, and some harmonic components from two to ten times the running frequency are also large in Fig. 7.6. The signals analyzed by the lifting wavelet packet and their energy distributions at levels 2, 3 and 4 exhibit non-stationary, nonlinear and colored-noise characteristics. Considering the no-load start-up and loaded operating conditions, the vertical vibrations of the #3 and #5 bearings, which are adjacent to the #4 bearing, are not high (under 20 μm). In contrast, the vibration of the #4 bearing increases with speed and load. It is concluded that the vibration is not caused by imbalance or misalignment, because an imbalance or misalignment fault would drive the vibration out of limits at multiple bearing positions. The problem is therefore localized to the #4 bearing itself. It is inferred that the monitored non-stationary and nonlinear components in the vibration signal of the #4 bearing may be caused by mechanical looseness and local friction, so the bearing force and the support status of the sizing blocks and the bearing housing must be checked.

Based on this analysis, the turbo generator unit is stopped and overhauled. The preload of the #4 bearing bushing is found to be about 0.11 mm, far from the required 0.25 mm. The gaps of the left and right sizing blocks are checked with a feeler gauge: the 0.05 mm feeler gauge can be inserted 30 mm into the left sizing block and 25 mm into the right one. The gap at the bottom of the #4 bearing bushing is also far from the specified gap of 0.05 mm. Therefore, the gaps of the #4 bearing bushing are corrected, the preload is restored to the required 0.25 mm, and the unit is operated again. After maintenance, the vibration of the #4 bearing bushing decreases markedly during run-up with no load. At 3000 r/min, the load is gradually increased to 45 MW, and the peak-to-peak vertical vibration of the #4 bearing bushing remains stable in the range of 46–57 μm, which is much better than before.

In order to assess the operational reliability of the turbo generator unit after maintenance, vibration monitoring via sensors is conducted at a stable speed of 3000 r/min with a load of 6 MW, the same case as before maintenance. The waveform of the acquired vibration signal, shown in Fig. 7.13, differs from the signal before maintenance in Fig. 7.5: the symmetry between the top and bottom of the signal is much better, and the peak-to-peak vibration is about 45 μm, which falls within the permissible range. The FFT spectrum of the vibration signal is shown in Fig. 7.14 and differs from the spectrum before maintenance in Fig. 7.6: the amplitudes of the harmonic components from two to ten times the running frequency are decreased.

Fig. 7.13 The waveform of vibration signal in time domain after maintenance

Fig. 7.14 The FFT spectrum after maintenance

The lifting wavelet packet is adopted to analyze the acquired vibration signal at level 2, level 3 and level 4, respectively, and the relative energy of each frequency band is computed. The four signals obtained at level 2 are illustrated in Fig. 7.15. The relative energy after maintenance is concentrated in the first frequency band, as shown in Fig. 7.16, and the relative energy of the last three bands is very small, which is quite different from Fig. 7.8. This indicates that the relative energy in the second frequency band of Fig. 7.8 before maintenance was generated by the fault information of the gaps in the #4 bearing bushing, which decentralized the energy across the frequency bands. It can be seen that faults affect the frequency band energy distribution of the signal.

Figure 7.17 shows the lifting wavelet packet decomposition and reconstruction at level 3, and Fig. 7.18 the corresponding relative energy distribution. Different from Fig. 7.10 before maintenance, the first frequency band now has the largest energy, while the relative energy of the remaining seven bands is small. Comparison with the pre-maintenance condition in Fig. 7.10 indicates that the larger energy distribution from the second to the fourth frequency band in Fig. 7.10 was caused by the fault information of the #4 bearing bushing. The lifting wavelet packet decomposition and reconstruction at level 4 is carried out further, as shown in Fig. 7.19; Fig. 7.20 shows the relative energy distribution of the reconstructed signal after maintenance. Different from the pre-maintenance energy distribution in Fig. 7.12, the relative energy of all bands other than the first is very small in Fig. 7.20.

Fig. 7.15 Lifting wavelet packet reconstructed signal in level 2 after maintenance

Fig. 7.16 The relative energy distribution of lifting wavelet packet reconstructed signal in level 2 after maintenance

It is inferred that the relative energy from the fourth to the ninth frequency band in Fig. 7.12 before maintenance was caused by the fault information of the gaps in the #4 bearing bushing. It is concluded that machinery fault information spoils the energy convergence of the lifting wavelet packet transform and thereby disperses the wavelet energy distribution. This verifies that the lifting (second-generation) wavelet packet transform can process the vibration signals in different frequency bands to effectively reveal the machinery operating conditions.

To summarize, the excessive vibration is diagnosed to be caused by looseness of the #4 bearing, poor support and insufficient tension force. The vibration grew with increasing speed and load and exhibits non-stationary and nonlinear characteristics with colored noise, owing to the friction caused by the looseness fault.

Fig. 7.17 Lifting wavelet packet reconstructed signal in level 3 after maintenance

Fig. 7.18 The relative energy distribution of lifting wavelet packet reconstructed signal in level 3 after maintenance

According to Eq. (7.23), after the maintenance of the turbo generator unit, the relative energies of the vibration monitoring signals reconstructed by the lifting wavelet packet are used to evaluate its current operational reliability. The calculation results are shown in Table 7.2. It can be seen that the operational reliability has been improved after maintenance, being basically above 0.8.

Fig. 7.19 Lifting wavelet packet reconstructed signal in level 4 after maintenance

Fig. 7.20 The relative energy distribution of lifting wavelet packet reconstructed signal in level 4 after maintenance

Table 7.2 The operational reliability degree after maintenance

Decomposition level              l = 2     l = 3     l = 4
Operational reliability degree   0.8627    0.8278    0.8060

7.2.4 Analysis and Discussion

7.2.4.1 Comparison and Analysis of the Operational Reliability Degree Before and After Maintenance of the Turbo Generator Unit

When the turbo generator is degrading and in dangerous conditions, the stability of the system is reduced, and the safe operating conditions become more and more uncertain. Entropy is a measure of uncertainty, and the most uncertain probability distribution (such as the equal-probability distribution) has the largest entropy value; the magnitude of the information entropy therefore reflects the uniformity of the probability distribution, which can measure the irregularity of mechanical monitoring signals in real time and reflect the reliability of the mechanical operating condition. Before maintenance, owing to the looseness fault of the #4 bearing bushing, the monitored peak-to-peak vibration exceeded the limit, the relative energy distribution based on lifting wavelet packet decomposition and reconstruction was dispersed, the Rényi entropy was large, and the calculated operational reliability was small. Table 7.1 shows that all the operational reliability degrees from level l = 2 to level l = 4 are under 0.4, the lowest being 0.2467 at level l = 3. This indicates that the reliability of the turbo generator unit was very poor and maintenance was needed. When the machine was stopped for maintenance, the preloading force of the #4 bearing bushing was found to be lower than the standard value, causing mechanical looseness and local friction, which induced disorder in the vibration signals acquired by the sensors. After maintenance, all the operational reliability degrees from level l = 2 to level l = 4 are above 0.8. This shows that through vibration monitoring, evaluation and diagnosis, followed by shutdown and maintenance, the unit's health condition and operational reliability indicators are improved; timely maintenance can improve operational safety and avoid accidents. In general, the operational reliability evaluation of complex electro-mechanical systems based on vibration condition monitoring can be realized by monitoring the vibration of the turbo generator unit, obtaining the relative energy distribution of each frequency band through lifting wavelet packet analysis of the signals, calculating the Rényi entropy and mapping it to the [0, 1] interval. The example analysis of a steam turbo generator unit in a thermal power plant shows that the proposed wavelet Rényi entropy evaluation method can guide the health monitoring and maintenance of the unit and lay a foundation for diagnosis.

7.2.4.2 Influence of the Number of Lifting Wavelet Packet Decomposition Layers on the Operational Reliability Evaluation

It can be seen from Tables 7.1 and 7.2 that the operational reliability degree is related to the number of decomposition levels in the lifting wavelet packet analysis. Each additional level doubles the number of frequency bands, and the relatively concentrated energy of each band is redistributed over the additional bands of the next level, so the relative energy distribution of the signal becomes more uncertain, the entropy increases, and the reliability degree decreases. Accordingly, as shown in Table 7.2, the operational reliability degree of the steam turbo generator unit after maintenance decreases monotonically as the number of decomposition levels increases from l = 2 to l = 4. Before maintenance (Table 7.1), however, the vibration monitoring signal was irregular and highly uncertain owing to mechanical instability and fault information, so the relative energy distribution after lifting wavelet packet decomposition and reconstruction was scattered, and the operational reliability degree did not follow a simple monotonically decreasing relationship with the number of decomposition levels. Therefore, an appropriate number of levels l can be selected to measure the operational reliability of complex electro-mechanical systems in a specific analysis. Since l = 3 is intermediate between l = 2 and l = 4, choosing l = 3 is more appropriate in practical engineering applications.

7.3 Reliability Assessment and Health Maintenance of Compressor Gearbox in Steel Mill

Nowadays, reliability is a multi-disciplinary scientific field that aims at system safety. The fundamental issues of reliability engineering lie in the representation and load modeling, quantitative analysis, evolution and uncertainty assessment of system models. To assess the operational reliability of the compressor gearbox in a steel mill, the machinery vibration signals with time-varying operational characteristics are first decomposed and reconstructed by means of the lifting wavelet packet transform. The relative energy of every reconstructed signal is computed as the percentage of the whole signal energy carried by that reconstructed signal. Moreover, a normalized lifting wavelet entropy is defined from the relative energies to reveal the machinery's operational uncertainty. Finally, the operational reliability degree is defined as the quantitative value obtained from the normalized lifting wavelet entropy, which lies in the range [0, 1].


7.3.1 Condition Monitoring and Vibration Signal Acquisition

An oxy-generator compressor is shown in Fig. 7.21. The rotating speed of the output shaft of the overdrive gearbox is 14,885 r/min. In service, the vibration of the gearbox increases and high-frequency noise appears. As shown in Fig. 7.21, the gearbox contains four sliding bearings, which are monitored by an acceleration transducer at a 20 kHz sampling frequency. The vibration of the #3 bearing is found to be the highest among the four sliding bearings; meanwhile, the temperature of the #3 bearing bush, at more than 50 °C, is also the highest. The vibration signal acquired at the #3 bearing is shown in Fig. 7.22 and contains considerable noise. The frequency spectrum of the vibration signal is shown in Fig. 7.23; the frequency components are spread throughout the spectrum, containing rich vibration information in the higher frequency bands.

Fig. 7.21 Schematic drawing of compressor gearbox

Fig. 7.22 Time domain waveform of vibration signal

Fig. 7.23 Frequency spectrum of vibration signal

7.3.2 Vibration Signal Analysis

In order to further analyze the vibration signal and obtain more condition information, the lifting wavelet packet transform is adopted to decompose the original signal to level 2, level 3 and level 4, respectively. The four signals obtained at level 2 are illustrated in Fig. 7.24 and correspond to the frequency bands 0–2500 Hz, 2500–5000 Hz, 5000–7500 Hz and 7500–10,000 Hz, respectively. The relative energy of each signal is then calculated according to Eq. (7.19); the share of the whole signal energy carried by each band is shown in Fig. 7.25, from which it is evident that the first band has the largest energy, while the energies of the second and third bands are comparable.

Fig. 7.24 Reconstructed signals in level 2

Fig. 7.25 Relative energy distribution of reconstructed signals in level 2

Furthermore, the original signal is processed to level 3, as shown in Fig. 7.26, corresponding to the frequency bands 0–1250 Hz, 1250–2500 Hz, 2500–3750 Hz, 3750–5000 Hz, 5000–6250 Hz, 6250–7500 Hz, 7500–8750 Hz and 8750–10,000 Hz, respectively. The relative energy of each frequency band is shown in Fig. 7.27; the energy of the first four bands is much larger than that of the last four. It is worth noting that after three levels of decomposition and reconstruction, the signal is decomposed into eight bands with an uneven relative energy distribution: the second band is the most energetic, followed by the first. The signal energy is mainly concentrated in the low frequencies, while the energy of the sixth band is generated by machinery degradation and fault information (Fig. 7.27).

Fig. 7.26 Reconstructed signals in level 3

Fig. 7.27 Relative energy distribution of reconstructed signals in level 3

The original signal is further processed to level 4 to obtain 16 signals, as shown in Fig. 7.28. The relative energy of each corresponding frequency band is shown in Fig. 7.29. The largest energy lies in the third band, followed by the second, fourth, sixth, eleventh, twelfth and first bands, while the remaining bands have a smaller energy share. Through the above analysis, it can be seen that the relative energy distribution of the current signals across the bands is relatively disordered, and mechanical anomalies are suspected.

Fig. 7.28 Reconstructed signals in level 4

Fig. 7.29 Relative energy distribution of reconstructed signals in level 4

7.3.3 Operational Reliability Assessment and Health Maintenance

After decomposing and reconstructing the vibration signal of the compressor gearbox bearing using the lifting wavelet packet transform, the original signal is decomposed into several band-independent signals, each band having its corresponding relative energy distribution characteristics. Shannon entropy is a measure of the uncertainty associated with a random variable; specifically, it quantifies the expected value of the information contained. The Shannon entropy of a random variable X is defined in Eq. (7.24), where $P_i$, defined in Eq. (7.25), denotes the probability of X = $x_i$, with $x_i$ indicating the i-th of the n possible values of X:

$H(X) = H(P_1, \ldots, P_n) = -\sum_{i=1}^{n} P_i \log_2 P_i$  (7.24)

$P_i = \Pr(X = x_i)$  (7.25)

Shannon entropy attains, but is not limited to, the following properties:

(1) Boundedness: $0 \le H(X) \le \log_2 n$
(2) Symmetry: $H(P_1, P_2, \ldots, P_n) = H(P_2, P_1, \ldots, P_n) = \cdots$


(3) Grouping: $H(P_1, P_2, \ldots, P_n) = H(P_1 + P_2, P_3, \ldots, P_n) + (P_1 + P_2)\,H\!\left(\frac{P_1}{P_1 + P_2}, \frac{P_2}{P_1 + P_2}\right)$

Within the framework of Shannon's entropy definition, the relative energy distribution of the signal calculated through the lifting wavelet packet transform (Eq. 7.19) is mapped into a dimensionless reliability index in the interval [0, 1]:

$R = 1 - \left(-\sum_{i=1}^{2^l} \tilde{E}_{l,i} \log_{2^l} \tilde{E}_{l,i}\right)^2$  (7.26)

Since mechanical performance degradation and faults can make machinery conditions uncertain, the probability distribution of the monitored condition information becomes uncertain too. If the energy of the $2^l$ transformed frequency bands follows a uniform distribution, then $\tilde{E}_{l,i} = 1/2^l$, the Shannon entropy in the brackets equals 1, and the normalized lifting wavelet entropy yields R = 0. On the contrary, if a single band concentrates the whole energy of the $2^l$ bands, its relative energy equals 1 (the most certain probability distribution), and the normalized lifting wavelet entropy yields R = 1. On the basis of information entropy theory, the most uncertain probability distribution (such as the equal distribution) has the largest entropy, and vice versa; information entropy is thus a measure of uncertainty and provides a practical criterion for analyzing the similarity or dissimilarity between probability distributions as mechanical equipment moves through operating states from functioning to failure. Since wavelets meet the demands of transient signal analysis and entropy is associated with the measurement of information uncertainty, it is useful to evaluate the operational reliability of mechanical equipment with wavelet entropy computed from condition monitoring information.

The degree of reliability is calculated according to Eq. (7.26) from level l = 2 to level l = 4, respectively. From Table 7.3, it is seen that the machinery performance has degraded during long-time operation and needs repair, since all the degrees from level l = 2 to level l = 4 are near 0.5. It can be inferred that the current operating conditions of the machinery are less certain, resulting in uncertainty in the probability distribution of the vibration signals obtained from monitoring and consequently in low values of the reliability measure.

The oxy-generator compressor in the steel mill is stopped and overhauled, and many cracks and fragments are found on the #3 bearing bush of the gearbox. After the #3 bearing is replaced and the machine is maintained, the oxy-generator is started again; the vibration and the high-frequency noise are reduced. The waveform and frequency spectrum of the acquired vibration signal are shown in Figs. 7.30 and 7.31.

Table 7.3 Degree of reliability

Lifting wavelet packet level   l = 2     l = 3     l = 4
Degree of reliability          0.6120    0.4924    0.4854
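For comparison with the Rényi-based index of Sect. 7.2, here is a sketch of the Shannon-entropy form of Eq. (7.26) under the same assumptions as the earlier snippets (relative band energies as input, logarithm to base $2^l$); it reproduces the boundary behavior discussed above, R = 0 for a uniform distribution and R = 1 for a single energetic band:

```python
import numpy as np

def reliability_shannon(E_rel):
    """Degree of reliability of Eq. (7.26): one minus the square of the
    normalized (base-2**l) Shannon entropy of the relative band energies."""
    E = np.asarray(E_rel, dtype=float)
    n = E.size                                    # number of bands, 2**l
    E = E[E > 0]                                  # 0 * log 0 is taken as 0
    H = -np.sum(E * np.log(E)) / np.log(n)        # normalized Shannon entropy
    return 1.0 - H ** 2
```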

Fig. 7.30 Time domain waveform after maintenance

Fig. 7.31 Frequency spectrum after maintenance

Figures 7.32 and 7.33 show the reconstructed signals and their relative energy in level 2 using the lifting wavelet packet transform, respectively. Different from the relative energy acquired before maintenance in Fig. 7.25, those after maintenance are concentrated in the 1st frequency band, the 2nd band still occupies a certain amount of energy, and the relative energy of the 3rd band is very small, which is different from the relative energy distribution shown in Fig. 7.25. The 3rd band also occupies a large amount of energy before maintenance. Therefore, it is speculated that the large relative energy of the third band shown in Fig. 7.25 before maintenance may be the fault information generated by many cracks and fragments of the #3 bearing bush, which makes the energy distribution of the four bands relatively dispersed. Figures 7.34 and 7.35 show the reconstructed signals and their relative energy in level 3 using the lifting wavelet packet transform, respectively. The 2nd frequency band has the largest energy and proportion. The 1st, 3rd, 4th bands have small energy and the relative energy of the following four bands is very small. Compared with Fig. 7.27 before maintenance, it can be seen that the 6th band also accounts for a certain amount of energy, and the relative energy proportion of the 2nd and 4th bands is also large. Moreover, the energy proportion of the 2nd band is not as large as those after maintenance. The comparison shows that the 1st, 4th and 6th band in Fig. 7.27 before maintenance occupy large relative energy. Therefore, it is speculated that the fault is caused by many cracks and fragments in the #3 bearing bush, which makes the abnormal vibration cover a wide range of frequency band. After maintenance, the

Fig. 7.32 Reconstructed signals in level 2 (after maintenance) (sub-signals a21–a24, A/µm versus t/s)

Fig. 7.33 Relative energy distribution of reconstructed signals in level 2 (after maintenance)

After maintenance, the machine runs stably, and the relative energy distribution of the monitored signal concentrates in the dominant frequency bands. Figures 7.36 and 7.37 show the reconstructed signals and their relative energy at level 4 of the lifting wavelet packet transform, respectively. The main energy is concentrated in the 3rd frequency band, and the 2nd, 4th, 5th, and 6th bands also hold small amounts. The energy distribution is quite different from that before maintenance shown in Fig. 7.29, where the 2nd band also accounts for a large proportion, close to that of the 3rd band, and the 1st, 4th, 5th, 6th, 9th, 11th, and 12th bands all occupy some relative energy, leaving the distribution decentralized.

Fig. 7.34 Reconstructed signals in level 3 (after maintenance) (sub-signals a31–a38, A/µm versus t/s)

Fig. 7.35 Relative energy distribution of reconstructed signals in level 3 (after maintenance)

According to the difference between the relative energy distributions before and after maintenance, mechanical fault information is reflected in the energy distribution of the lifting wavelet packet reconstructed signals and makes that distribution disperse. This section calculates the relative energy distribution after maintenance from the acquired signals to measure the operational reliability, as shown in Table 7.4; the reliability after maintenance is clearly improved. Therefore, the lifting wavelet packet transform, by decomposing signals and reconstructing them in different frequency bands, can effectively reveal the health status of complex electro-mechanical systems.
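To make the energy bookkeeping concrete, the sketch below extracts relative band energies with the general wavelet packet transform from PyWavelets as a stand-in for the lifting wavelet packet used here (the lifting scheme changes how the wavelet is constructed, not how band energies are tallied); the function name and the 'db4' wavelet are assumptions, not the book's choices.

```python
# A sketch of the relative-energy extraction (the Eq. (7.19) idea), using the
# standard wavelet packet in PyWavelets as a stand-in for the lifting version.
import numpy as np
import pywt

def relative_band_energies(signal, level=2, wavelet='db4'):
    """Relative energy of the 2**level frequency bands of `signal`."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    # order='freq' sorts the terminal nodes from the lowest band upward
    nodes = wp.get_level(level, order='freq')
    energies = np.array([np.sum(node.data ** 2) for node in nodes])
    return energies / energies.sum()

# A 2000 Hz tone sampled at 6400 Hz concentrates its energy in the third of
# the four 800 Hz-wide bands at level 2 (1600-2400 Hz).
t = np.arange(0, 0.05, 1 / 6400)
print(relative_band_energies(np.sin(2 * np.pi * 2000 * t), level=2))
```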

Fig. 7.36 Reconstructed signals in level 4 (after maintenance) (sub-signals a41–a416, A/µm versus t/s)

7.3.4 Analysis and Discussion

It can be seen from the above analysis that when mechanical equipment experiences performance degradation or faults occur, its working stability is reduced and its operating state becomes increasingly uncertain. The monitored signals then contain fault information, which scatters the relative energy distribution of the signals reconstructed via lifting wavelet packet decomposition across the frequency bands and thus reduces the degree of operational reliability.


Fig. 7.37 Relative energy distribution of reconstructed signals in level 4 (after maintenance)

Table 7.4 Operational reliability after maintenance

Lifting wavelet packet level | l = 2  | l = 3  | l = 4
Degree of reliability        | 0.9444 | 0.8698 | 0.7968

By comparing the operational reliability before and after maintenance, it is found that the latter is better than the former at every decomposition level l. Specifically, Table 7.3 shows that the degree of operational reliability from level 2 to level 4 is about 0.5, which indicates that the current health state is uncertain and maintenance is needed. After maintenance, the operational reliability at all levels in Table 7.4 exceeds 0.7, with a maximum of 0.9444 at level 2. Timely repair and maintenance can thus improve the operational reliability of electro-mechanical equipment and prevent accidents, and operational reliability assessment based on condition monitoring can provide a basis for condition-based maintenance and ensure the operational safety of the electro-mechanical system. From Tables 7.3 and 7.4, it is also seen that the degree of reliability decreases monotonically from level l = 2 to level l = 4. When the decomposition level l increases, the number of frequency bands increases and the initially concentrated energy is scattered over more bands; each band then occupies a certain energy, and the distribution becomes more uncertain. Since a more uncertain probability distribution has higher entropy, the entropy of the band-energy distribution grows and the operational reliability decreases as the number of frequency bands increases. Therefore, the normalized lifting wavelet entropy should be computed at an appropriate level l in order to obtain a reasonable operational reliability; in this case, level l = 3 is considered the most suitable.
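This level effect can be reproduced numerically: decompose the same monitored signal at l = 2, 3, and 4 and recompute the reliability, which typically shrinks as the bands multiply. The sketch below combines the band-energy and entropy steps from the earlier sketches; the names and the noisy test signal are invented for illustration.

```python
# Sketch of the level sweep, assuming the same R = 1 - H / H_max mapping.
import numpy as np
import pywt

def reliability_at_level(signal, level, wavelet='db4'):
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    e = np.array([np.sum(n.data ** 2) for n in wp.get_level(level, order='freq')])
    p = e / e.sum()
    p = p[p > 0]
    return 1.0 + np.sum(p * np.log(p)) / np.log(2 ** level)

rng = np.random.default_rng(0)
t = np.arange(0, 0.05, 1 / 6400)
x = np.sin(2 * np.pi * 2000 * t) + 0.3 * rng.standard_normal(t.size)
for l in (2, 3, 4):   # more bands: energy spreads, entropy rises, R falls
    print(l, round(reliability_at_level(x, l), 4))
```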


7.4 Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance

The aero-engine, the heart of an aircraft, is thermal machinery in long-term service under the harsh environment of high temperature, high pressure, and high speed, and it plays an extremely important role in aircraft performance. Liu Daxiang, a member of the Chinese Academy of Engineering, pointed out at the National Aircraft Manufacturing Technology Forum in April 2008 that aero-engines are characterized by three highs (high reliability, high performance, and high safety), four lows (low fuel consumption, low pollution, low noise, and low cost), and one long (long life). Aero-engine technology has always been a field that the world's military powers prioritize, monopolize, and tightly guard, and it is one of the important symbols of a country's military equipment level, industrial strength, and comprehensive national strength. From 1963 to 1975, there were 3824 flight accidents involving USAF fighter jets, of which 1664 were caused by engines, accounting for 43.5% of the total; from 1989 to 1993, there were 279 major flight accidents in the world's air transport, of which more than 20% were caused by engine failures. The aero-engine is therefore the focus of aviation flight safety and maintenance support, and at present many countries and airlines attach great importance to the development of safety technologies related to aero-engines. Table 7.5 [33] summarizes typical aero-engine faults together with their common diagnostic methods and monitoring parameters. Boeing B747, B767, Airbus A310, and other aircraft are equipped with complete condition monitoring and fault diagnosis systems with more than 15 engine monitoring parameters. The diagnosis system of the F100 engine records and monitors 38 engine and flight parameters, covering abnormal conditions such as over-rotation, over-temperature, abnormal oil return pressure, engine stall, surge, main fuel pump failure, afterburner failure, and flameout. The effectiveness of the F100-PW-220 engine condition monitoring system is 99.3%, and the condition monitoring error rate is less than 1% per million flight hours. The JT9D engine used in the Boeing B747 uses a status monitoring system to establish a trend analysis model and determine the source of engine performance deterioration. The PW4000 engines on the Boeing B767 and B777 and the McDonnell Douglas MD-11, and the V2500 on the A320 and MD90, use integrated control and monitoring systems with self-inspection, fault isolation, and precise thrust adjustment capabilities, improving flight reliability and maintainability [33]. At present, intelligent testing technology is the inevitable trend of aero-engine development. The USAF's engine-integrated management system applies the Expert Maintenance System (XMAN), the Jet Engine Troubleshooting Expert System (JET-X), and the Turbo Engine Expert Maintenance System (TEXMAX). Artificial intelligence technology can help reduce the burden on personnel working with complex systems in built-in test, automatic test equipment, and engine state monitoring. A diagnosis system based on an expert system can reduce maintenance man-hours by 30%, the replacement rate of failed parts by 50%, and maintenance testing by 50%.


The artificial intelligence fault monitoring system of the U.S. Army AH-64 helicopter uses artificial intelligence methods to detect, identify, and diagnose faults through an intelligent fault diagnosis and positioning device, locate the fault, and guide maintenance; the entire system is controlled by an airborne computer. Although aircraft and aero-engines are equipped with complete condition monitoring and fault diagnosis systems, aero-engine failures and breakdowns have kept emerging for many years and have occurred frequently across many types of aircraft. For example, in 1989 a B-1B bomber crashed due to the fracture of the rear-shaft sealing labyrinth of the high-pressure turbine of the F101 engine.

Table 7.5 Typical fault diagnosis of aero-engine

Fault | Diagnosis | Monitored parameters
Blade damage | Pneumatic parameter measurement; vibration and noise analysis; borescope | Rotation rate; exhaust temperature; vibration; noise spectrum
Fatigue crack | Vibration; borescope; noise; ultrasonic | Amplitude; frequency; noise spectrum
Damping table damage | Vibration; noise | Blade spacing; vibration
Mechanical erosion | Pneumatic parameter measurement | Thrust; fuel flow; temperature before turbine; rotation rate of high/low pressure
Surge | Pneumatic parameter measurement; vibration | Fuel flow; air flow; temperature of turbine inlet and outlet; rotation speed
Control failure of vent valve or deflector | Pneumatic parameter measurement; noise | Temperature before turbine; air flow; rotation rate
Compressor sealing worn | Pneumatic parameter measurement | Air flow; rotation rate; compressor outlet temperature; turbine outlet temperature; thrust
Blade icing | Pneumatic parameter measurement; rotor flexibility | Rotation rate; air flow; compressor outlet temperature; pressure


In less than two months starting in July 1994, the main US fighter, the F-16, lost four aircraft in succession (two from the Egyptian Air Force and two from the Israeli Air Force). It is very rare in world aviation history for four consecutive aviation accidents to be caused by the same fault within such a short time. The joint accident investigation by the US Air Force and GE, the aero-engine manufacturer, showed that the cause of the four F-16 crashes was the fracture of the sealing labyrinth of the rear shaft of the high-pressure turbine of the aircraft's F110 engine; the broken fragments damaged the low-pressure turbine, which eventually destroyed the engine. Among past crashes of aircraft equipped with F101 and F110 engines, eight B-1B, F-14, and F-16 aircraft of different models were all lost to fracture of the sealing labyrinth [33]. In view of these faults, the US Air Force and GE took various measures, such as adjusting the clearance and changing the original snap-ring damping ring into a damping bushing. To deal with the sealing labyrinth crack accidents, since the end of 1994, 150 F-16s of the U.S. Air Force, 200 F-16s of other countries' air forces, two of the five B-2 bombers, and some F-14Ds were grounded. A certain type of aero-engine in China also has the safety problem of sealing labyrinth faults. Because the sealing labyrinth mainly bears the torque transmitted during rotation, it resists the torque through the friction torque generated between the joint surfaces after the bolted connection is pre-tightened. However, the tightening torque controlled by a torque-measuring wrench or constant-torque wrench is strongly affected by fluctuations in the friction coefficient, so the bolt preload distributed along the circumference of the wheel disc may vary. If the pre-tightening forces differ, the sealing labyrinth disc is preloaded in a certain direction, which generates initial stress and an additional bending moment on the rotor shaft system. At the same time, it makes the coupled shaft system eccentric or misaligned, leaving the rotor unbalanced, causing repeated bending and internal stress in the rotor, generating excessive vibration, accelerating the wear of parts, and finally causing cracks. In addition, after repeated removal and installation of the bolts, or after a period of service, the fit clearance between the assembly hole and the bolt increases, reducing the fit accuracy; this also makes the coupled shafting eccentric or misaligned and causes the rotor components to loosen. In short, the rupture of the sealing ring gear at home and abroad is a difficult problem that academia and engineering pay close attention to. Therefore, it is necessary to develop an effective evaluation method for aero-engine rotor assembly reliability to ensure the reliability and safety of aero-engine rotors. Assembly is the last link of product manufacturing, and how to ensure the reliability and safety of mechanical equipment from the moment of assembly is one of the main issues for industry and academia. According to statistics, in the automobile assembly industry, failures introduced during the assembly of a new product account for 40–100% of the failures of new products [33].
At present, due to the lack of automatic methods and advanced technologies to effectively detect the assembly reliability of aero-engine rotors, assembly performance can only be evaluated indirectly by a whole-machine test run, and sometimes the process even falls into repeated cycles of test-run failure, disassembly, and reassembly.


The development of aero-engine rotor assembly reliability evaluation technology can alleviate this problem, shorten assembly and maintenance time, reduce costs, and ultimately ensure flight safety. The research on aero-engine rotor assembly reliability assessment is therefore an important topic in the field of manufacturing reliability and is of great significance for the safety, economy, and maintainability of aircraft. Accordingly, based on the structural characteristics and assembly process of an aero-engine rotor and the causes of loose bolt assembly, this section studies the assessment of assembly reliability. Through an excitation test of the rotor, the lifting wavelet packet is used to analyze the excitation response signal, extract the relative energy distribution of the reconstructed signals, and map it to the reliability interval [0, 1].

7.4.1 The Structure Characteristics of Aero-Engine Rotor

The aero-engine considered here is a turbofan engine, mainly composed of a low-pressure compressor, a high-pressure compressor, a high-pressure turbine, and a low-pressure turbine. The air flow enters the engine from the inlet port and is pressurized by the low-pressure compressor, after which the flow is divided into two streams: one is discharged through the bypass duct, and the other is further pressurized by the high-pressure compressor and then heated and combusted in the combustion chamber. The heat generated by gas combustion makes the high-pressure turbine expand and do work, driving the front-end compressor; the gas then expands further through the tail nozzle to generate flight thrust. A set of stud bolts tightens the drum and the discs at all levels, from the 2nd stage blade disc through the 3rd stage to the high-pressure rotor shaft, and the torque is transmitted by friction at the end surfaces, as shown in Fig. 7.38.

Fig. 7.38 Schematic diagram of aero-engine rotor


Fig. 7.39 Assembly reliability assessment of aero-engine rotor

In order to simulate the influence of small fluctuations of tightening torque on the assembly reliability of the stay bolts during manual assembly, three assembly experiments are set up for the assembly process of the pull rod bolts: assembly state 1 (torque M1), state 2 (torque M2), and state 3 (torque M3), where M1 < M2 < M3 and only the tightening torque of state 3 meets the requirements. When the bolts are loosely assembled, the stiffness of the rotor structure is smaller and the system vibrates easily; owing to damping during vibration transmission, the dynamic response signal of the engine rotor attenuates quickly. As the bolt assembly state changes from loose to tight, the rotor structural stiffness gradually increases, the internal damping of the structure decreases, and the high-frequency component of the dynamic response signal increases. Therefore, targeting the rotor assembly looseness fault, exploiting the fact that different bolt preloads lead to different rotor dynamic characteristics, and combining the structural characteristics and assembly process of the aero-engine rotor, the research on assembly reliability assessment and health maintenance consists of three steps, as shown in Fig. 7.39 (a minimal numerical sketch of these steps follows this paragraph): (1) through an advanced data acquisition system, a dynamic excitation test is carried out on the rotor under different assembly states; (2) the lifting wavelet packet transform is used to analyze the excitation response signals of the high-pressure compressor rotor under different assembly states, and the relative energy features of the sub-signals reconstructed from the wavelet packet decomposition are extracted; (3) the relative energy distribution of the signal is mapped to the assembly reliability interval within the framework of information entropy.
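As a minimal end-to-end sketch of these three steps, the code below runs the same band-energy-to-entropy mapping on two synthetic ring-down responses; the damping, frequency, and noise values are invented stand-ins for the measured loose and tight states.

```python
# Steps (1)-(3) in miniature, with synthetic excitation responses standing in
# for the measured ones; all signal parameters below are invented.
import numpy as np
import pywt

def assembly_reliability(response, level=3, wavelet='db4'):
    """Band energies -> Shannon entropy -> reliability in [0, 1]."""
    wp = pywt.WaveletPacket(data=response, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    e = np.array([np.sum(n.data ** 2) for n in wp.get_level(level, order='freq')])
    p = e / e.sum()
    p = p[p > 0]
    return 1.0 + np.sum(p * np.log(p)) / np.log(2 ** level)

# Step (1): excitation test. Here, decaying ring-downs whose spectral purity
# grows from the loose state (heavy damping, strong noise) to the tight one.
fs = 6400
t = np.arange(0, 0.05, 1 / fs)
rng = np.random.default_rng(1)
loose = np.exp(-60 * t) * np.sin(2 * np.pi * 1800 * t) + 0.5 * rng.standard_normal(t.size)
tight = np.exp(-30 * t) * np.sin(2 * np.pi * 1800 * t) + 0.05 * rng.standard_normal(t.size)

# Steps (2)-(3): the tight state should score closer to 1 than the loose one.
print('loose:', round(assembly_reliability(loose), 4))
print('tight:', round(assembly_reliability(tight), 4))
```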


7.4.2 Aero-Engine Rotor Assembly Reliability Assessment Test System

In order to study the assembly performance of an aero-engine rotor, an assembly performance testing system is built. It uses an advanced data acquisition system to measure the dynamic excitation of the aero-engine rotor in different assembly states. The test system consists of five parts: an aero-engine high-pressure compressor rotor, a vibration exciter, a signal generator, sensors, and a data acquisition system. The JZK-5 vibration exciter produced by Jiangsu Lianneng Electronics Co., Ltd. of China is used to test the dynamic response of the high-pressure compressor rotor under dynamic excitation in different assembly states; its performance parameters are shown in Table 7.6. The exciter is installed at the lower end of the high-pressure shaft of the high-pressure compressor rotor, as shown in Fig. 7.40. The DF1631 power function signal generator is selected as the signal generator. Through many experiments, it is found that a square-wave excitation generates a better excitation response signal; therefore, the square wave generated by the DF1631 is

Table 7.6 Performance parameters of vibration exciter JZK-5

Maximum excitation force (N): ≥ 50
Maximum amplitude (mm): ± 7.5
Maximum acceleration (g): 20
Maximum input current (Arms): ≤ 7
Frequency range (Hz): DC–5k
Force constant (N/A): 7.2
Overall dimension (mm): Φ138 × 160
Mass (kg): 8.1
Mass of movable parts (kg): 0.25
Output: rod
First-order resonance frequency (Hz): 50
DC resistance of moving coil (Ω): 0.7

Fig. 7.40 Assembly reliability assessment test of an aero-engine rotor


Table 7.7 Performance parameters of accelerometer 333B32-ICP

Range (pk): ± 50 g
Sensitivity (mV/g): 100
Resolution (rms): 0.00015 g
Frequency range (Hz): 0.5–3000
Operating temperature range (°C): −18 to +66
Mass (g): 4

used as the excitation signal of the exciter. The working frequency is 1 Hz, and the amplitude of the output signal is the full-scale value. The 333B32 ICP acceleration sensor made by the American company PCB is used to measure the excitation response signal of the aero-engine rotor under the excitation of the vibration exciter; its main performance parameters are shown in Table 7.7, and its accuracy class ensures that the test results contain four significant digits. Since the purpose of the excitation test is to study the dynamic response law relating bolt tightness to rotor assembly reliability, the sensors are installed on plane A-A close to the bolt installation, as shown in Fig. 7.40, with the four sensors arranged evenly along the circumference. A Sony EX system is used to collect and store the excitation response information of the aero-engine high-pressure compressor rotor under the excitation of the vibration exciter.

7.4.3 Experiment and Analysis

The vibration exciter is used to test the vibration response signal of the aero-engine rotor in the three assembly states. Taking sensor I in Fig. 7.40 as an example, the measured time domain waveforms and spectra of the dynamic signals under the three states are shown in Fig. 7.41; the sampling frequency is 6400 Hz. The time domain waveforms of the excitation response signals in the three states are all oscillation attenuation signals. The amplitude of the excitation response signal in assembly state 1 is the smallest; with the increase of preload, the amplitudes in assembly states 2 and 3 are slightly larger than in state 1. In the signal spectra, there is a maximum spectral peak near 2000 Hz. With the increase of bolt pre-tightening force, the stiffness of the rotor structure increases and the high-frequency components in the spectrum grow; therefore, a second spectral peak appears near 2600 Hz in the spectrum of state 3 (i.e., the qualified state). Next, the lifting wavelet packet transform is applied to the excitation response signals measured under the three assembly states, and the excitation response sub-signals a31, a32, …, a38 of eight frequency bands are obtained, as shown in Fig. 7.42. The excitation response signals under each state are decomposed into different frequency bands by the lifting wavelet packet. Since the sampling frequency of the excitation response signal is 6400 Hz, the frequency bands of sub-signals a31–a38 correspond to 0–400 Hz, 400–800 Hz, 800–1200 Hz, 1200–1600 Hz, 1600–2000 Hz, 2000–2400 Hz, 2400–2800 Hz, and 2800–3200 Hz, respectively.
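This band bookkeeping follows directly from the sampling theorem; a short check, assuming ideal non-overlapping wavelet packet bands:

```python
# Each of the 2**l bands at level l spans fs / 2**(l+1) Hz; with fs = 6400 Hz
# and l = 3 this reproduces the eight 400 Hz bands listed above.
fs, level = 6400, 3
width = fs / 2 / 2 ** level   # Nyquist range split into 2**level bands
for k in range(2 ** level):
    print(f'a3{k + 1}: {k * width:.0f}-{(k + 1) * width:.0f} Hz')
```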


Fig. 7.41 Time domain waveform and frequency spectrum of aero-engine rotor under 3 assembly states (from Sensor I)


Fig. 7.42 Reconstructed signals from lifting wavelet packet transform of an aero-engine rotor under 3 assembly states (from Sensor I)


Because the lifting wavelet packet decomposes the excitation response signal into independent and orthogonal frequency bands, the excitation response sub-signals corresponding to each frequency band contain different assembly information, and the relative energy of the sub-signals in each band reflects the dynamic response information as the bolt preload changes. According to Eq. (7.19) in Sect. 7.1.2, the relative energies of the sub-signals a31–a38 reconstructed from the lifting wavelet packet transform are calculated under the three assembly states. From the distribution shown in Fig. 7.43, the excitation response sub-signal a35 (corresponding to the 1600–2000 Hz band) occupies the largest energy among the eight sub-signals a31–a38 for all three assembly states; it contains the main signal components of the excitation response information from the pull rod bolts. From assembly state 1 to state 3, the energy amplitude of sub-signal a35 is positively correlated with the bolt preload. In addition, the sub-signal a36 (corresponding to 2000–2400 Hz) also holds a large amount of energy among the eight reconstructed sub-signals, and this energy gradually declines as the bolt preload increases. Note that as the bolt preload increases and the structural stiffness evolves from assembly state 1 to 3, the energy of sub-signal a35, which contains the main signal components of the rotor excitation response (such as the natural frequency of the rotor), becomes more and more concentrated, while the energy of bands containing non-main signal components (such as sub-signal a36) is reduced. Therefore, in assembly state 3 (i.e., the qualified assembly state), the energy of signal a35 of the aero-engine high-pressure compressor rotor structure is significantly greater than in the other two assembly states, and the energy is highly concentrated. For the three assembly states, the rotor assembly reliability can be obtained by Eq. (7.26) from the above relative energy distributions, and the results are shown in Table 7.8.

Fig. 7.43 Relative energy distribution of reconstructed signals via lifting wavelet packet transform under 3 assembly states (from Sensor I)

Table 7.8 Assembly reliability of aero-engine rotor

Assembly state                 | 1      | 2      | 3
Degree of assembly reliability | 0.5197 | 0.8693 | 0.9486

It can be seen from Table 7.8 that, as the bolts go from loose to tight across the three states, the assembly reliability of the aero-engine rotor increases monotonically, which is consistent with the physical law that a tighter bolt preload gradually increases the stiffness of the aero-engine rotor. In assembly state 1, the rotor stiffness is the smallest; in addition to the main signal components such as the rotor's natural frequency, the excitation response signal contains much other dynamic response information, such as components caused by assembly looseness. The closer the various frequency components are to an equiprobable distribution, the greater the information entropy and the smaller the assembly reliability. In contrast, in assembly state 3 (i.e., the qualified assembly state), the rotor stiffness is the largest, the excitation response signal is mainly composed of components such as the rotor natural frequency, and the energy is concentrated; the energy of other dynamic response information is small, so the probability distribution of the signal is relatively determined and the assembly reliability in state 3 is the largest. For the intermediate case, assembly state 2, whose bolt tightness lies between states 1 and 3, the assembly reliability also lies between those of states 1 and 3.

7.4.4 In-Service Aero-Engine Rotor Assembly Reliability Assessment and Health Maintenance

In China, the service life of in-service aero-engines is mostly controlled according to working hours and calendar life; when either reaches its predefined value, the engine is returned to the factory for repair. If the vibration of an aero-engine exceeds the standard level, it is necessary to trace the source of the vibration and conduct health maintenance. First, an excitation test was carried out on the high-pressure compressor rotor of the aero-engine with a sampling frequency of 6400 Hz. The time domain waveform of the excitation response signal measured by sensor I (see Fig. 7.40) is shown in Fig. 7.44a; the signal decays rapidly. The frequency spectrum of the excitation response signal is shown in Fig. 7.44b; there are many frequency components, of which the largest is at about 2200 Hz, with three other large components at about 1000, 1400, and 2600 Hz. After three levels of decomposition and reconstruction using the lifting wavelet packet transform, the excitation response sub-signals a31–a38 of eight frequency bands are obtained, as shown in Fig. 7.44c. The signal is decomposed into different frequency bands by the lifting wavelet packet, and the sub-signals of each band contain different information. The relative energies of the eight sub-signals a31–a38 are obtained according to Eq. (7.19), and their distribution is shown in Fig. 7.44d. It can be seen from the figure that the 5th frequency band still has the largest energy, but the 3rd band also carries considerable energy. Comparison with Fig. 7.43 shows that the relative energy distribution of the current signal is scattered, with the scattered energy appearing mainly in the 3rd and 8th bands.


Fig. 7.44 Vibration response signals of the aero-engine rotor in service (from Sensor I)


Fig. 7.45 Dye penetration of balanced holes of labyrinth #32

These scattered energies may be response information generated by looseness of the rotor, and the preliminary diagnosis is that the current rotor may have a looseness fault. The operational reliability index computed by Eq. (7.26) is R = 0.7914, which lies between assembly states 1 and 2 in Table 7.8 and much closer to state 2. It can be seen that the assembly tightness of the pull rod bolts has deteriorated in service and the bolts have loosened, no longer meeting the requirements for optimal use of the engine. Loose pull rod bolts cause more vibration, which may exceed the standard level and lead to fatigue cracks and fracture accidents of the wheel disc. During factory maintenance of the engine rotor, it was found that 7 balance holes in the 9th labyrinth disc of the rotor had cracks of varying degrees; Fig. 7.45 shows the cracks revealed by dye penetration of the balance holes. The assembly performance prediction for the high-pressure compressor rotor and the actual factory maintenance results jointly show that an in-service aero-engine, affected by maneuver tasks, flight loads, body vibration, and other factors, suffers gradual degradation of rotor performance, especially loosening of the pull rod bolts, which increases vibration, leads to cracks in the balance holes of the 9th labyrinth disc, and thus reduces the service life of the engine. The dynamic assembly information of the aero-engine rotor is obtained through excitation testing; the lifting wavelet packet method is used to analyze the rotor vibration response signal and extract the relative energy features of the reconstructed signals, and the relative energy distribution is mapped to [0, 1] from the perspective of information entropy. This approach can effectively evaluate the reliability degree and health status of the aero-engine rotor and provides a new tool for life prediction and health management.
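The state comparison made here amounts to a nearest-reference lookup against Table 7.8; a trivial illustrative helper (the labels are descriptive, not the book's):

```python
# Match the in-service reliability against the reference values of Table 7.8.
reference = {
    'state 1 (loose)': 0.5197,
    'state 2 (intermediate)': 0.8693,
    'state 3 (qualified)': 0.9486,
}
r_in_service = 0.7914
nearest = min(reference, key=lambda s: abs(reference[s] - r_in_service))
print(f'R = {r_in_service} lies closest to {nearest}')   # -> state 2
```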

References

1. Bazovsky, I.: Reliability Theory and Practice. Prentice-Hall, Englewood Cliffs (1961)
2. Dupow, H., Blount, G.: A review of reliability prediction. Aircr. Eng. Aerosp. Technol. 69(4), 356–362 (1997)
3. Denson, W.: The history of reliability prediction. IEEE Trans. Reliab. 47(3), 321–328 (1998)


4. Sweldens, W.: The lifting scheme: a construction of second generation wavelets. SIAM J. Math. Anal. 29(2), 511–546 (1998)
5. Daubechies, I., Sweldens, W.: Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl. 4(3), 247–269 (1998)
6. He, Z.J., Zi, Y.Y., Zhang, X.N.: Modern Signal Processing Technology and Its Application (in Chinese). Xi'an Jiaotong University Press, Xi'an (2006)
7. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(4), 623–656 (1948)
8. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. 28(4), 656–715 (1949)
9. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
10. Chen, P.C., Chen, C.W., Chiang, W.L., et al.: GA-based decoupled adaptive FSMC for nonlinear systems by a singular perturbation scheme. Neural Comput. Appl. 20(4), 517–526 (2011)
11. Matteson, S.: Methods for multi-criteria sustainability and reliability assessments of power systems. Energy 71, 130–136 (2014)
12. Lo Prete, C., Hobbs, B.F., Norman, C.S., et al.: Sustainability and reliability assessment of microgrids in a regional electricity market. Energy 41(1), 192–202 (2012)
13. Moharil, R.M., Kulkani, P.S.: Generator system reliability analysis including wind generators using hourly mean wind speed. Electric Power Components Syst. 36(1), 1–16 (2008)
14. Whyatt, P., Horrocks, P., Mills, L.: Steam generator reliability - implications for APWR codes and standards. Nuclear Energy-J. Br. Nuclear Energy Soc. 34(4), 217–228 (1995)
15. Tsvetkov, V.A.: A mathematical model for analysis of generator reliability, including development of defects. Electrical Technol. 4, 107–112 (1992)
16. Sun, Y., Wang, P., Cheng, L., et al.: Operational reliability assessment of power systems considering condition-dependent failure rate. IET Gener. Transm. Distrib. 4(1), 60–72 (2010)
17. Baraldi, P., Di Maio, F., Pappaglione, L., et al.: Condition monitoring of electrical power plant components during operational transients. Proc. Inst. Mech. Eng. Part O—J. Risk Reliab. 226(O6), 568–583 (2012)
18. Lu, F., Huang, J.Q., Xing, Y.D.: Fault diagnostics for turbo-shaft engine sensors based on a simplified on-board model. Sensors 12(8), 11061–11076 (2012)
19. Li, Z.J., Liu, Y., Liu, F.X., et al.: Hybrid reliability model of hydraulic turbine-generator unit based on nonlinear vibration. Proc. Inst. Mech. Eng. Part C—J. Mech. Eng. Sci. 228(11), 1880–1887 (2014)
20. Qu, J.X., Zhang, Z.S., Wen, J.P., et al.: State recognition of the viscoelastic sandwich structure based on the adaptive redundant lifting wavelet packet transform, permutation entropy and the wavelet support vector machine. Smart Mater. Struct. 23(8) (2014)
21. Si, Y., Zhang, Z.S., Liu, Q., et al.: Detecting the bonding state of explosive welding structures based on EEMD and sensitive IMF time entropy. Smart Mater. Struct. 23(7) (2014)
22. Yu, B., Liu, D.D., Zhang, T.H.: Fault diagnosis for micro-gas turbine engine sensors via wavelet entropy. Sensors 11(10), 9928–9941 (2011)
23. Sawalhi, N., Randall, R.B., Endo, H.: The enhancement of fault detection and diagnosis in rolling element bearings using minimum entropy deconvolution combined with spectral kurtosis. Mech. Syst. Signal Process. 21(6), 2616–2633 (2007)
24. Tafreshi, R., Sassani, F., Ahmadi, H., et al.: An approach for the construction of entropy measure and energy map in machine fault diagnosis. J. Vib. Acoustics—Trans. ASME 131(2) (2009)
25. He, Y.Y., Huang, J., Zhang, B.: Approximate entropy as a nonlinear feature parameter for fault diagnosis in rotating machinery. Measurement Sci. Technol. 23(4) (2012)
26. Wu, S.D., Wu, P.H., Wu, C.W., et al.: Bearing fault diagnosis based on multiscale permutation entropy and support vector machine. Entropy 14(8), 1343–1356 (2012)
27. Rényi, A.: On measures of entropy and information. Proc. Fourth Berkeley Symp. Math. Statist. Prob. 1, 547–561 (1961)
28. Fehr, S., Berens, S.: On the conditional Rényi entropy. IEEE Trans. Inf. Theor. 60(11), 6801–6810 (2014)


29. Nanda, A.K., Sankaran, P.G., Sunoj, S.M.: Rényi's residual entropy: a quantile approach. Statist. Probab. Lett. 85, 114–121 (2014)
30. Endo, T., Omura, K., Kudo, M.: Analysis of relationship between Rényi entropy and marginal Bayes error and its application to weighted Naive Bayes classifiers. Int. J. Pattern Recogn. Artif. Intell. 28(7) (2014)
31. Nagy, A., Romera, E.: Relative Rényi entropy for atoms. Int. J. Quantum Chem. 109(11), 2490–2494 (2009)
32. Lake, D.E.: Rényi entropy measures of heart rate Gaussianity. IEEE Trans. Biomed. Eng. 53(1), 21–27 (2006)
33. Zhang, B.C.: Test and Measurement Technology for Aero Engines (in Chinese). Beihang University Press, Beijing (2005)