Night Vision Processing and Understanding (ISBN 9789811316685, 9789811316692)



English · 274 pages · 2019


Table of contents:
Foreword by Xiangqun Cui......Page 5
Preface......Page 9
Acknowledgements......Page 10
Contents......Page 11
1.1 Research Topics of Multidimensional Night-Vision Information Understanding......Page 15
1.1.1 Data Analysis and Feature Representation Learning......Page 16
1.1.2 Dimension Reduction Classification......Page 19
1.1.3 Information Mining......Page 22
1.2 Challenges to Multidimensional Night-Vision Data Mining......Page 24
References......Page 26
2.1 Multiplexing Measurement in Hyperspectral Imaging......Page 30
2.2.1 Traditional Denoising Theory of HTS......Page 32
2.2.2 Denoising Bound Analysis of HTS with S Matrix......Page 35
2.2.3 Denoising Bound Analysis of HTS with H Matrix......Page 38
2.3 Spatial Pixel-Multiplexing Coded Spectrometre......Page 40
2.3.1 Typical HTS System......Page 41
2.3.2 Spatial Pixel-Multiplexing Coded Spectrometre......Page 42
2.4 Deconvolution-Resolved Computational Spectrometre......Page 48
2.5 Summary......Page 54
References......Page 55
3.1 Infrared Image Super-Resolution via Transformed Self-similarity......Page 57
3.1.1 The Introduced Framework of Super-Resolution......Page 59
3.1.2 Experimental Results......Page 62
3.2 Hierarchical Superpixel Segmentation Model Based on Vision Data Structure Feature......Page 69
3.2.1 Hierarchical Superpixel Segmentation Model Based on the Histogram Differential Distance......Page 70
3.2.2 Experimental Results......Page 74
3.3 Structure-Based Saliency in Infrared Images......Page 82
3.3.1 The Framework of the Introduced Method......Page 83
3.3.2 Experimental Results......Page 89
3.4 Summary......Page 93
References......Page 94
4.1.1 New Adaptive Supervised Manifold Learning Algorithms......Page 98
4.1.2 Kernel Maximum Likelihood-Scaled LLE for Night-Vision Images......Page 100
4.2.1 Review of LDA and CMVM......Page 101
4.2.2 Introduction of the Algorithm......Page 103
4.2.3 Experiments......Page 105
4.3.1 Review of LPP......Page 109
4.3.2 Adaptive and Parameterless LPP (APLPP)......Page 110
4.3.3 Connections with LDA, LPP, CMVM and MMDA......Page 114
4.3.4 Experiments......Page 115
4.4.1 KML Similarity Metric......Page 120
4.4.2 KML Outlier-Probability-Scaled LLE (KLLE)......Page 123
4.4.3 Experiments......Page 124
4.4.4 Discussion......Page 131
4.5 Summary......Page 134
References......Page 135
5.1 Classification Methods......Page 137
5.1.1 Research on Classification via Semi-supervised Random Subspace Sparse Representation......Page 138
5.1.2 Research on Classification via Semi-supervised Multi-manifold Structure Regularisation (MMSR)......Page 139
5.2.1 Motivation......Page 140
5.2.2 SSM–RSSR......Page 142
5.2.3 Experiment......Page 146
5.3.1 Probability Semi-supervised Random Subspace Sparse Representation (P-RSSR)......Page 156
5.3.2 Experiment......Page 161
5.4.2 Multi-manifold Structure Regularisation (MMSR)......Page 169
5.4.3 Experiment......Page 174
5.5 Summary......Page 179
References......Page 181
6.1 Machine Learning in IM......Page 184
6.1.2 Feature Extraction and Classifier......Page 185
6.2.1 Denoising and Sparse Autoencoders......Page 186
6.2.2 LDAE......Page 188
6.2.3 Experimental Comparison......Page 191
6.3.1 Algorithm and Implementation of Detection System......Page 197
6.3.2 Experiments and Evaluation......Page 203
6.4 Summary......Page 206
References......Page 207
7.1.1 Investigation of Infrared Small-Target Detection......Page 209
7.1.2 Moving Object Detection Based on Non-learning......Page 210
7.1.3 Researches on Target Tracking Technology......Page 211
7.2.1 Framework of Object Detection......Page 212
7.2.2 Experimental Results......Page 214
7.3.1 Tracking Model Based on Global LARK Feature Matching and CAMSHIFT......Page 216
7.3.2 Target Tracking Algorithm Based on Local LARK Feature Statistical Matching......Page 219
7.3.3 Experiment and Analysis......Page 220
7.4 An SMSM Model for Human Action Detection......Page 225
7.4.1 Technical Details of the SMSM Model......Page 227
7.4.2 Experiments Analysis......Page 232
References......Page 240
8.1 Research on Colorization of Low-Light-Level Images......Page 243
8.2.1 Summary of the Principle of the Algorithm......Page 244
8.2.2 Mining of Multi-attribute Association Rules in Grayscale Images......Page 246
8.2.3 Colorization of Grayscale Images Based on Rule Mapping......Page 247
8.2.4 Analysis and Comparison of Experimental Results......Page 248
8.3 Multi-sparse Dictionary Colorization Algorithm Based on Feature Classification and Detail Enhancement......Page 254
8.3.1 Colorization Based on a Single Dictionary......Page 255
8.3.2 Multi-sparse Dictionary Colorization Algorithm for Night-Vision Images, Based on Feature Classification and Detail Enhancement......Page 256
8.3.3 Experiment and Analysis......Page 262
8.4 Summary......Page 271
References......Page 273

Lianfa Bai · Jing Han · Jiang Yue

Night Vision Processing and Understanding

Lianfa Bai
School of Electronic and Optical Engineering
Nanjing University of Science and Technology
Nanjing, Jiangsu, China

Jing Han
School of Electronic and Optical Engineering
Nanjing University of Science and Technology
Nanjing, Jiangsu, China

Jiang Yue
National Key Laboratory of Transient Physics
Nanjing University of Science and Technology
Nanjing, Jiangsu, China

ISBN 978-981-13-1668-5    ISBN 978-981-13-1669-2 (eBook)
https://doi.org/10.1007/978-981-13-1669-2

Library of Congress Control Number: 2018965447 © Springer Nature Singapore Pte Ltd. 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword by Huilin Jiang

With the continuous development of information mining, cognitive computing and other disciplines, research that combines intelligent information understanding with night-vision imaging technology can effectively simulate human perception mechanisms and processes, and it has wide application prospects in night-vision information perception. Because the work is cross-disciplinary, however, new developments and achievements are scattered, and few books so far give a systematic introduction to this area, especially in the field of night-vision technology.

To change this situation and better promote the development of night-vision information technology, this book focuses on new theories and technologies currently being developed in multispectral imaging, dimensionality reduction, data mining, feature classification learning, target recognition, object detection, colorization algorithms and related fields, exploring optimization models and new algorithms. It addresses the application of perception computing, mining learning and information understanding technologies to night-vision data, and strives to demonstrate the major breakthroughs that modern information technology brings to the night-vision field. This book is a landmark and timely contribution in this direction, offering for the first time detailed descriptions and analysis of the frontier theory and methods of night-vision information processing. Reflecting differences in imaging environment, target characteristics and imaging methods, it concentrates on multispectral data, video data and related sources, and studies a variety of information mining and perceptual understanding algorithms, aiming to analyse new processing methods for multiple types of scenes and targets.

The selection of content fully reflects the main technical themes and dynamics of this new field of night vision. Its eight chapters cover spectral imaging and coding noise reduction, multi-vision tasks based on data structure and feature analysis, feature classification based on manifold dimension reduction, data classification based on sparse representation and random subspace, target detection and recognition based on learning, motion detection and tracking based on non-learning, and colorization of night-vision images based on rule mining, spanning the research areas of artificial intelligence in night vision. The algorithm models and hardware systems provided can serve as references for the overall, algorithmic and hardware design of photoelectric systems.

In this monograph, Lianfa Bai, Jing Han and Jiang Yue have brought together their work in night-vision processing and understanding over the past decade to produce a book that will become a standard for the area. Well done.

Changchun, China

Huilin Jiang

Foreword by Xiangqun Cui

Night-vision technology extends human activity beyond the limits of natural visual ability; it is widely used for observation, monitoring and low-light detection. Night-vision research includes low-level-light (LLL) vision, infrared thermal imaging, ultraviolet imaging and active near-infrared systems. Multi-source night-vision technology uses the complementarity of multi-sensor information to overcome the incomplete or inaccurate information of a single imaging sensor. However, extracting useful information from multiple sensors presents new problems. It is therefore necessary to synthesise the information provided by different sensors, so that the possible redundancy and contradiction of multi-source information can be eliminated and users can describe complete and consistent target information in complex scenes.

Along with the advancement of information mining, cognitive computing and other disciplines, it is widely believed that combining night-vision technology and intelligent information understanding can effectively simulate human perception. Data structure analysis, feature representation learning, dimension reduction classification and information mining theories have all been studied extensively, and these models and algorithms have clear advantages over conventional methods in terms of information understanding. However, there are few studies on feature mining of night-vision images; for complex night-vision information processing, the manifold learning, classification and data mining methods still require investigation. For practical applications (e.g. security, defence and industry), this research is needed to solve problems of multispectral target detection and large-scale image classification and recognition.

This book compiles an intelligent understanding of night-vision data under high dimensionality and complexity. Several night-vision data processing methods, based on feature learning, dimension reduction classification and information mining, are explored and studied, providing new technical approaches for information detection and understanding. The book aims to present a systematic and comprehensive introduction to the latest theories and technologies for various aspects of night-vision information technology. Specifically, it covers multispectral imaging, dimension reduction, data mining, feature classification learning, object recognition, object detection, colorization algorithms and more. Additionally, it applies up-to-date optimization models and algorithms (including perception computing, mining learning and information understanding technologies) to night-vision data and demonstrates major breakthroughs in the field.

The reader will get a fairly comprehensive overview of the field of night-vision processing and understanding as it has developed over the last two decades. I am very proud to have had the opportunity to follow this development for almost 20 years. Enjoy reading this book as I did.

Nanjing, China

Xiangqun Cui

Preface

Along with the advancement of information mining, cognitive computing and other disciplines, it is widely believed that combining night-vision technology and intelligent information understanding can effectively simulate human perception; however, this topic has not yet been studied systematically. This book aims to present a systematic and comprehensive introduction to the latest theories and technologies for various aspects of night-vision information technology. Specifically, it covers multispectral imaging, dimension reduction, data mining, feature classification learning, object recognition, object detection, colorization algorithms and more. Additionally, we explore the application of up-to-date optimization models and algorithms (including perception computing, mining learning and information understanding technologies) to night-vision data and demonstrate major breakthroughs in the field.

The book is structured as follows: we start from a practical and advanced standpoint, comprehensively discussing the frontier theory and methods of night-vision information processing; we then systematically analyse new theories and methods for night-vision image processing and perceptual understanding; in the last part, we provide algorithm models and a hardware system that can support the overall, algorithmic and hardware design of new systems.

Nanjing, China
May 2018

Lianfa Bai
Jing Han
Jiang Yue


Acknowledgements

We thank Professor Yi Zhang, Professor Qian Chen, Professor Weiqi Jin, Professor Jie Kong and Associate Professor Chuang Zhang for their support and constructive criticism of the material in this book. Professor Yi Zhang was instrumental in the process of preparing this document. Without his continuing support, some of the research presented here would have never been accomplished. We thank former and current students and collaborators—Zhuang Zhao, Xiaoyu Chen, Enlai Guo, Linli Xu, Haotian Yu, Jingsong Zhang, Dan Yan, Jiani Gao, Junwei Zhu, Wei Zhang, Qin Wang, Mingzu Li and Dongdong Chen— for letting us draw upon their work, thus making this monograph possible. Research efforts summarised in this book were supported by the National Natural Science Foundation of China (61231014).


Contents

1 Introduction
   1.1 Research Topics of Multidimensional Night-Vision Information Understanding
      1.1.1 Data Analysis and Feature Representation Learning
      1.1.2 Dimension Reduction Classification
      1.1.3 Information Mining
   1.2 Challenges to Multidimensional Night-Vision Data Mining
   1.3 Summary
   References

2 High-SNR Hyperspectral Night-Vision Image Acquisition with Multiplexing
   2.1 Multiplexing Measurement in Hyperspectral Imaging
   2.2 Denoising Theory and HTS
      2.2.1 Traditional Denoising Theory of HTS
      2.2.2 Denoising Bound Analysis of HTS with S Matrix
      2.2.3 Denoising Bound Analysis of HTS with H Matrix
   2.3 Spatial Pixel-Multiplexing Coded Spectrometre
      2.3.1 Typical HTS System
      2.3.2 Spatial Pixel-Multiplexing Coded Spectrometre
   2.4 Deconvolution-Resolved Computational Spectrometre
   2.5 Summary
   References

3 Multi-visual Tasks Based on Night-Vision Data Structure and Feature Analysis
   3.1 Infrared Image Super-Resolution via Transformed Self-similarity
      3.1.1 The Introduced Framework of Super-Resolution
      3.1.2 Experimental Results
   3.2 Hierarchical Superpixel Segmentation Model Based on Vision Data Structure Feature
      3.2.1 Hierarchical Superpixel Segmentation Model Based on the Histogram Differential Distance
      3.2.2 Experimental Results
   3.3 Structure-Based Saliency in Infrared Images
      3.3.1 The Framework of the Introduced Method
      3.3.2 Experimental Results
   3.4 Summary
   References

4 Feature Classification Based on Manifold Dimension Reduction for Night-Vision Images
   4.1 Methods of Data Reduction and Classification
      4.1.1 New Adaptive Supervised Manifold Learning Algorithms
      4.1.2 Kernel Maximum Likelihood-Scaled LLE for Night-Vision Images
   4.2 A New Supervised Manifold Learning Algorithm for Night-Vision Images
      4.2.1 Review of LDA and CMVM
      4.2.2 Introduction of the Algorithm
      4.2.3 Experiments
   4.3 Adaptive and Parameterless LPP for Night-Vision Image Classification
      4.3.1 Review of LPP
      4.3.2 Adaptive and Parameterless LPP (APLPP)
      4.3.3 Connections with LDA, LPP, CMVM and MMDA
      4.3.4 Experiments
   4.4 Kernel Maximum Likelihood-Scaled Locally Linear Embedding for Night-Vision Images
      4.4.1 KML Similarity Metric
      4.4.2 KML Outlier-Probability-Scaled LLE (KLLE)
      4.4.3 Experiments
      4.4.4 Discussion
   4.5 Summary
   References

5 Night-Vision Data Classification Based on Sparse Representation and Random Subspace
   5.1 Classification Methods
      5.1.1 Research on Classification via Semi-supervised Random Subspace Sparse Representation
      5.1.2 Research on Classification via Semi-supervised Multi-manifold Structure Regularisation (MMSR)
   5.2 Night-Vision Image Classification via SSM–RSSR
      5.2.1 Motivation
      5.2.2 SSM–RSSR
      5.2.3 Experiment
   5.3 Night-Vision Image Classification via P-RSSR
      5.3.1 Probability Semi-supervised Random Subspace Sparse Representation (P-RSSR)
      5.3.2 Experiment
   5.4 Night-Vision Image Classification via MMSR
      5.4.1 MR
      5.4.2 Multi-manifold Structure Regularisation (MMSR)
      5.4.3 Experiment
   5.5 Summary
   References

6 Learning-Based Night-Vision Image Recognition and Object Detection
   6.1 Machine Learning in IM
      6.1.1 Autoencoders
      6.1.2 Feature Extraction and Classifier
   6.2 Lossless-Constraint Denoising Autoencoder Based Night-Vision Image Recognition
      6.2.1 Denoising and Sparse Autoencoders
      6.2.2 LDAE
      6.2.3 Experimental Comparison
   6.3 Integrative Embedded Night-Vision Target Detection System with DPM
      6.3.1 Algorithm and Implementation of Detection System
      6.3.2 Experiments and Evaluation
   6.4 Summary
   References

7 Non-learning-Based Motion Cognitive Detection and Self-adaptable Tracking for Night-Vision Videos
   7.1 Target Detection and Tracking Methods
      7.1.1 Investigation of Infrared Small-Target Detection
      7.1.2 Moving Object Detection Based on Non-learning
      7.1.3 Researches on Target Tracking Technology
   7.2 Infrared Small Object Detection Using Sparse Error and Structure Difference
      7.2.1 Framework of Object Detection
      7.2.2 Experimental Results
   7.3 Adaptive Mean Shift Algorithm Based on LARK Feature for Infrared Image
      7.3.1 Tracking Model Based on Global LARK Feature Matching and CAMSHIFT
      7.3.2 Target Tracking Algorithm Based on Local LARK Feature Statistical Matching
      7.3.3 Experiment and Analysis
   7.4 An SMSM Model for Human Action Detection
      7.4.1 Technical Details of the SMSM Model
      7.4.2 Experiments Analysis
   7.5 Summary
   References

8 Colourization of Low-Light-Level Images Based on Rule Mining
   8.1 Research on Colorization of Low-Light-Level Images
   8.2 CARM
      8.2.1 Summary of the Principle of the Algorithm
      8.2.2 Mining of Multi-attribute Association Rules in Grayscale Images
      8.2.3 Colorization of Grayscale Images Based on Rule Mapping
      8.2.4 Analysis and Comparison of Experimental Results
   8.3 Multi-sparse Dictionary Colorization Algorithm Based on Feature Classification and Detail Enhancement
      8.3.1 Colorization Based on a Single Dictionary
      8.3.2 Multi-sparse Dictionary Colorization Algorithm for Night-Vision Images, Based on Feature Classification and Detail Enhancement
      8.3.3 Experiment and Analysis
   8.4 Summary
   References

Chapter 1

Introduction

Night-vision technology is used to extend human activities beyond the limits of natural visual ability. For example, it is widely used in military and civilian fields for observation, monitoring and low-light detection. Night-vision research includes low-level-light (LLL) vision, infrared thermal imaging, ultraviolet imaging and active near-infrared systems. Multi-source night-vision technology uses the complementarity of multi-sensor information to overcome the incomplete or inaccurate information of a single imaging sensor. However, the extraction of useful information from multi-sensor outputs presents new problems. Thus, it is necessary to synthesise the information provided by different sensors (Fig. 1.1). The possible redundancy and contradiction of multi-source information can then be eliminated, allowing users to describe complete and consistent target information in complex scenes. In this way, we arrive at a new field of intelligent night-vision and target-perception research. For practical applications (e.g. security, defence and industry), this research is needed to solve problems of multispectral target detection and large-scale image classification and recognition. This book compiles an intelligent understanding of night-vision data under high dimensionality and complexity. Several night-vision data processing methods, based on feature learning, dimension reduction classification and information mining, are explored and studied, providing various new technical approaches for information detection and understanding.

Fig. 1.1 LLL and infrared night vision

1.1 Research Topics of Multidimensional Night-Vision Information Understanding

The core of multidimensional information understanding is the feature analysis and learning, information mining and dimension reduction classification of structural data. After decades of development, data mining has become the most intelligent feature of artificial intelligence and has achieved great success in pattern recognition,


data analysis, information retrieval and other fields. It is one of the most actively researched fields in computer science. The inherent characteristics and simplified data structure of multispectral and mass complex information make datasets more divisible. Technical challenges include effective extraction, effective visualisation and accurate classification recognition based on learning and non-learning. These problems urgently need to be solved, beginning with further research on theories and methods of complex night-vision information understanding. Hidden rules, data relations, patterns, etc. are extracted from high-dimensional and massive night-vision image datasets and are used to construct various algorithms, including multispectral and massive data analysis, target detection and classification. Their algorithms have high efficiency and high performance. The following subsections describe the development of several key technologies, including data structure analysis, feature representation learning, dimension reduction classification and information mining.

1.1.1 Data Analysis and Feature Representation Learning

In the fields of data mining, machine learning and pattern recognition, the analysis of massive high-dimensional data structures has long been an important research topic. The features of high-dimensional data include clustering and manifold structures, involving several mathematical branches (e.g. multivariate analysis, nonlinear functional analysis and statistics). People have been searching for simpler and more effective methods to analyse the relationships and associations among data points (Fig. 1.2). Data analysis methods are expressed in terms of distance (e.g. far or near), sparsity/density, category/label (e.g. attribute values), clustering, similarity/dissimilarity (e.g. more detailed than class labels), features, variance (e.g. principal curves), distribution, statistical characteristics, data models (e.g. neural networks, fuzzy systems), etc.
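As a purely illustrative example of this distance/similarity view of data analysis, the Python sketch below builds a simple global feature of the kind discussed below (a normalised grey-level histogram) for two synthetic images and compares them with Euclidean distance and cosine similarity. The images, bin count and function names are assumptions made for the illustration, not material from the book.

```python
import numpy as np

def grey_histogram(image, bins=32):
    """Global feature: normalised grey-level histogram of an 8-bit image."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def euclidean(a, b):
    """Distance-based comparison of two feature vectors."""
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Similarity-based comparison of two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dark_scene = rng.integers(0, 80, size=(64, 64))      # stand-in for a low-light frame
    bright_scene = rng.integers(100, 256, size=(64, 64))  # stand-in for a well-lit frame
    f1, f2 = grey_histogram(dark_scene), grey_histogram(bright_scene)
    print("Euclidean distance:", euclidean(f1, f2))
    print("Cosine similarity:", cosine_similarity(f1, f2))
```

The same pattern (extract a feature vector per image, then compare vectors by a chosen metric) underlies the clustering, neighbourhood selection and classification steps discussed in the rest of this chapter.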


Fig. 1.2 Data analysis applications

Image feature representation is key to computer vision, image analysis and artificial intelligence. A good image representation method provides a strong expression mechanism, not only saving storage space but also reducing image-processing complexity, and feature representation strongly influences image classification and recognition. Unfortunately, current feature representation techniques do not meet the needs of actual image analysis applications, because many key problems in feature representation have not been well solved. For example, original image features are high dimensional, whereas image content is complex and varied, and images are often affected by illumination, scale, rotation, translation, occlusion and other disturbances.

Presently, research on image feature representation focuses on low-level feature extraction by artificial design and on learning-based feature representation. Hand-designed features are used to describe image content and similarity, and the methods include global and local feature representation. For global features, all information in the region of interest and the background area is extracted from the whole image; such global information includes colour, texture, shape, etc. Corresponding feature vectors are formed by processing the basic information of each feature point, and the image is then mapped to a set of feature vectors. The most representative global features are the histogram of oriented gradients (HOG) of Dalal and Triggs (2005), the local binary pattern (LBP) of Ojala et al. (2002) and the colour histogram of Rubner et al. (2010). For image classification, global features make good use of image context information and feature extraction is very fast, which makes them widely used in classification systems such as scene classification. However, global features focus on whole-image information and ignore the target object. Moreover, image representation based on global features is usually statistical; it is redundant and at the same time sensitive to geometric transformations (e.g. rotation and zooming), image brightness changes, target occlusion, target displacement and deformation, which severely limits its application to image classification and recognition. Local features were therefore introduced to describe the internal information and details of images more flexibly; they can find invariant parts across the ever-changing images of the same class, and image representations based on local features have become increasingly popular. Many researchers have proposed such image representation methods, including the bag-of-words model of Li and Perona (2005), the soft quantisation algorithms of Philbin et al. (2008) and Liu et al. (2011), the locality-constrained linear coding algorithm of Wang et al. (2010) and the spatial pyramid algorithm of Lazebnik et al. (2006).

Visual data are represented by video and spectral images, characterised by huge data volumes, high-resolution information, unstructured content, varied interpretation, ambiguity and uncertainty. Image data likewise often have very high dimensions, so hand-designed image representation wastes time and energy: the design and extraction of manual features is very difficult and cannot be applied directly to current high-dimensional image analysis and processing. Thus, to find a reasonable method of representing image data, approaches that do not rely on artificial design have


been studied to represent an image as accurately as possible. This promises to save both time and effort. Learning-based image feature representation is obtained algorithmically; the process requires no human participation, and good features can be learned and extracted automatically. Compared with hand-designed representation, learning-based feature representation learns from massive data and can describe the rich internal information of image data. In practice, however, there are various problems: different image classification and recognition tasks place different requirements on processing time, so the various feature-learning methods are


also quite different. Currently, the most representative feature-learning methods are based on deep learning and matrix decomposition. Because of their robustness and accuracy, deep-learning methods have become the focus of current research (see Rumelhart et al. 1988). A typical deep-learning scheme trains several restricted Boltzmann machines (RBMs) in an unsupervised way; the stacked RBMs then form a deep network whose initial weights are the trained RBM weights, and the traditional supervised backpropagation algorithm is finally used to fine-tune the whole network for subsequent classification (see Baldi and Hornik 1989; Schölkopf et al. 2006; Seung et al. 2000). The result is a machine-learning process driven by sample data that builds a deep network structure with multiple levels. Artificial neural networks with many hidden layers show good feature-learning ability: the acquired features provide a better essential characterisation of the data, which benefits visualisation and classification. This also overcomes shortcomings of traditional neural networks with many layers, such as gradient dispersion (vanishing gradients) and easily falling into local optima. Additionally, a deep neural network can use layer-by-layer initialisation, implemented by unsupervised learning within the deep-learning model, to effectively overcome training difficulties; feature learning is the goal.
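As a rough illustration of one unsupervised feature-learning stage of the kind described above, the sketch below trains a single tied-weight autoencoder layer with plain NumPy. It stands in for one RBM or autoencoder stage of a greedy layer-wise scheme and is not the book's own implementation (Chapter 6 uses denoising and sparse autoencoders); the data, layer size and learning rate are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=200, seed=0):
    """One unsupervised pretraining stage: learn W, b so that the code sigmoid(X W + b)
    reconstructs X through the tied decoder sigmoid(H W^T + c)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_features, n_hidden))  # encoder weights (decoder reuses W.T)
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_features)
    for _ in range(epochs):
        H = sigmoid(X @ W + b_h)          # encode
        R = sigmoid(H @ W.T + b_v)        # decode
        dR = (R - X) * R * (1 - R)        # error signal at the reconstruction
        dH = (dR @ W) * H * (1 - H)       # error signal back-propagated to the code
        grad_W = X.T @ dH + dR.T @ H      # tied-weight gradient (encoder + decoder parts)
        W -= lr * grad_W / len(X)
        b_h -= lr * dH.sum(axis=0) / len(X)
        b_v -= lr * dR.sum(axis=0) / len(X)
    return W, b_h

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((500, 64))                     # stand-in for flattened image patches
    W1, b1 = train_autoencoder_layer(X, n_hidden=32)
    features = sigmoid(X @ W1 + b1)               # learned representation for the next layer
    print(features.shape)                          # (500, 32)
```

Stacking several such layers, each trained on the previous layer's codes, and then fine-tuning the whole stack with supervised backpropagation reproduces, in miniature, the layer-by-layer initialisation scheme described above.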

1.1.2 Dimension Reduction Classification

With the growth of information technology and related fields, dataset sizes have increased dramatically and data structures have become very complex. In mass image processing and the information sciences, data dimensions are ever larger, and high-dimensional data require a great deal of computing and storage resources. High-dimensional data usually contain unknown structures that describe the intrinsic development or variation trends of the data. Generally, such a trend is difficult to display intuitively and is unlikely to be reflected directly in the attribute values; some data have no property descriptions at all. For example, facial-expression images can be considered a low-dimensional manifold embedded in image space (see Edelman 1999): facial expressions change all the time in a continuous fashion, yet through the human visual and perceptual system these changes can still be recognised as coming from the same object. Dimensionality reduction is the low-dimensional representation of such an intrinsic structure (see Donoho 2000; Tenenbaum et al. 2000). The studies of Yang (2005a, b, 2006a, b) and Cox and Cox (2008) found that many high-dimensional data have low-dimensional structures: in the high-dimensional space the data have a lower intrinsic dimension, and all sample points are distributed on a low-dimensional manifold. Therefore, the low-dimensional representation corresponding to these high-dimensional image data matrices must be found, so that the intrinsic structure of an image can be well described. In the night-vision field, with the development of night-vision equipment and digital storage technology, the data collected are often high dimensional, huge, complex


and disordered. Valuable information is swamped in massive datasets, so there is an urgent need to discover the inherent laws of the data, to simplify data structures and to reduce the data. Manifold learning assumes that the observations are located approximately on a low-dimensional manifold embedded in a high-dimensional Euclidean space; its main purpose is to discover the intrinsic low-dimensional manifold structure of the high-dimensional observation dataset and its embedding mapping (Fig. 1.3). For high-dimensional, massive and unstructured datasets, dimensionality reduction by manifold learning works well for simplification and analysis, and it has been widely used in machine learning, pattern recognition, data mining and related fields. Manifold learning methods include spectral analysis, principal manifolds, variational methods, mutual information, etc.; dimensionality reduction based on spectral analysis is the focus of ongoing research.

Manifold learning includes global and local structure preservation. Typical global-preserving algorithms are principal component analysis (PCA), multidimensional scaling (MDS) and kernel PCA (KPCA) (Shawe-Taylor and Cristianini 2004). KPCA is a nonlinear generalisation of PCA that directly calculates the kernel matrix (i.e. the metric matrix) between the data points by introducing a kernel function, without knowing the nonlinear mapping; it is equivalent to using a kernel matrix calculated by a kernel function in place of the MDS Euclidean distance matrix. The global-preserving family also includes isometric mapping (ISOMAP) (see Tenenbaum et al. 2000; Choi and Choi 2007; Wen 2009) and maximum variance unfolding (MVU) (Weinberger and Saul 2006). ISOMAP uses a geodesic distance matrix instead of the MDS Euclidean distance matrix; MVU uses semi-definite programming to learn the kernel matrix, which resembles KPCA but requires a centring constraint. These methods focus on preserving the global structure of the data statistics; they neglect the information in the local structure of the data, and their computation is complex. For massive datasets the data distribution is complex and nonlinear, making it difficult to keep the real intrinsic law of the data with a global structure constraint alone.

Local-preserving manifold methods analyse local neighbourhood relations in the high-dimensional observation space and maintain the intrinsic local structure during the embedding process, so that the structure of the entire dataset is formed in the low-dimensional space. Classic local algorithms include locally linear embedding (LLE) (Roweis and Saul 2000; Chang and Yeung 2006; Chen and Ma 2011), Laplacian eigenmaps (LE) (Belkin and Niyogi 2003), local tangent space alignment (LTSA) (Zhang and Zha 2005; Yang 2006a, b), local MDS, diffusion maps (DM) (Coifman and Lafon 2006) and Hessian eigenmaps (Donoho and Grimes 2003). To improve the generalisation ability of local manifold methods, an explicit projection matrix is obtained while preserving the local data structure, linearising the nonlinear manifold learning method. Proposed methods of this kind include locality preserving projection (LPP) (He and Niyogi 2003; Li et al. 2011; Li 2012), neighbourhood-preserving embedding (He et al. 2005; Zhang et al. 2007) and linear LTSA (Xiao 2011).
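To make the contrast between the global and local methods named above concrete, the hedged scikit-learn sketch below embeds a synthetic "Swiss roll" (3-D points lying on a 2-D manifold) with PCA, ISOMAP and LLE. It is an illustration only, not code from the book, and the sample size and neighbourhood settings are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding, Isomap

# 3-D points sampled from a 2-D manifold (the classic "Swiss roll").
X, t = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Global, linear method: PCA keeps the directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Global, geodesic method: ISOMAP replaces Euclidean with graph shortest-path distances.
X_iso = Isomap(n_neighbors=12, n_components=2).fit_transform(X)

# Local method: LLE preserves each point's linear reconstruction from its neighbours.
X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                               random_state=0).fit_transform(X)

for name, Y in [("PCA", X_pca), ("ISOMAP", X_iso), ("LLE", X_lle)]:
    print(name, Y.shape)   # each embedding is (1500, 2)
```

A global linear method such as PCA tends to flatten the roll and mix distant parts of the manifold, whereas ISOMAP and LLE recover an unrolled 2-D parameterisation; which behaviour is preferable depends on the data and the task.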


Fig. 1.3 Dimensionality reduction of manifold learning. Reprinted from Lawrence and Sam (2003)

Compared with traditional linear subspace methods, manifold learning can better reveal the nonlinear geometric structure of data. However, it is unsupervised and does not consider the class labels of the data, so the obtained features are not optimal for classification; manifold learning by itself cannot extract discriminant features and cannot be applied directly to data classification (see de Ridder et al. 2003). Extended supervised manifold learning methods have therefore been proposed, such as supervised LLE (de Ridder et al. 2003; Zhang et al. 2008), supervised LPP and local discriminant embedding (Chen et al. 2005). Yan combined the local analysis of manifolds with class information to propose marginal Fisher analysis (MFA), defining class-compactness and class-separation criteria and giving a unified framework based on graph embedding (Shuicheng et al. 2007). Manifold learning was thereby integrated into a nonlinear pattern discrimination analysis framework (Zaiane and Han 2000).
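The class-compactness and class-separation idea behind these supervised methods can be illustrated with the classical Fisher (LDA) criterion, which Chapter 4 reviews. The minimal NumPy sketch below computes the within-class and between-class scatter matrices and the leading discriminant direction; it is a generic illustration on invented toy data, not the book's MFA or CMVM implementation.

```python
import numpy as np

def lda_projection(X, y, n_components=1):
    """Fisher criterion: maximise between-class scatter S_b against within-class scatter S_w."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    n_features = X.shape[1]
    S_w = np.zeros((n_features, n_features))
    S_b = np.zeros((n_features, n_features))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)          # class compactness
        diff = (mean_c - mean_all).reshape(-1, 1)
        S_b += len(Xc) * diff @ diff.T                   # class separation
    # Generalised eigenproblem S_b w = lambda S_w w (pseudo-inverse for stability).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X0 = rng.normal([0, 0, 0], 1.0, size=(100, 3))       # toy class 0
    X1 = rng.normal([3, 3, 0], 1.0, size=(100, 3))       # toy class 1
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)
    W = lda_projection(X, y, n_components=1)
    print("Discriminant direction:", W.ravel())
```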


1.1.3 Information Mining

Data mining is the process of extracting implicit, previously unknown but potentially useful information and knowledge from massive, incomplete, noisy, obscure and random data. The process goes by many similar names, such as knowledge discovery in databases, data analysis, data fusion and decision support. The initial data are a source of knowledge, like a mineral deposit to be mined. They can be structured, such as data in a relational database, semi-structured, such as text, graphics and image data, or even heterogeneous and distributed across a network. The methods of discovering knowledge can be mathematical or non-mathematical, deductive or inductive. Discovered knowledge can be used for information management, query optimisation, decision support, process control and so on, and it can also be used to maintain the data itself. Data mining is therefore a generalised interdisciplinary technique that unites researchers engaged in databases, artificial intelligence, mathematical statistics, visualisation, parallel computing and other fields.

Data mining is essentially different from traditional data analysis (i.e. queries, reports and online analytical processing). With data mining, one 'excavates' information and discovers knowledge without clear prior assumptions. Information obtained from data mining should have three characteristics: it should be previously unknown, effective and practical. 'Previously unknown' is self-explanatory: data mining discovers information or knowledge that could not otherwise be found by intuition, and the more unexpected the mined information, the higher its value. In business applications, a classic example is one in which a chain store uses data mining to reveal a surprising link between children's diapers and beer.

Image mining (IM) is based on data mining theory. Information is extracted from massive image collections, including hidden but useful regularities, relationships, patterns, rules and trends of interest. IM extracts information, knowledge and patterns using semantics for intelligent image processing. Commonly used IM methods include association rule analysis, image clustering and image classification; related technologies include artificial neural networks, genetic algorithms, decision trees and rule induction (Fig. 1.4).

Fig. 1.4 Cross-industry process for data mining model: a new blueprint for data mining. Reprinted from Rüdiger and Jochen (2000)

An original image cannot be used directly for mining. It must be preprocessed to generate an image feature database for the high-level mining modules. An IM system should include image storage, preprocessing, indexing, retrieval, mining, pattern evaluation and display capabilities (Fig. 1.5).

Fig. 1.5 IM system architecture: a data acquisition, preprocessing and archiving system (multi-sensor image sequences, data ingestion, image archive and inventory, image feature extraction, classification) coupled with browsing and query engines, knowledge acquisition, interactive learning, and information fusion and interactive interpretation for the user

Currently, according to the driving mechanism, IM system frameworks can be divided into three categories: function-driven, information-driven and concept-driven. Function-driven frameworks emphasise organising the IM system functions with different models. Information-driven frameworks emphasise a layered structure, using different levels to distinguish degrees of information abstraction. Concept-driven frameworks emphasise the IM process itself, extracting concepts at different levels and the relationships between them; the extraction occurs at different levels, from different aspects and at different granularities. Thus, association relationships


are mined while concept cluster analysis occurs. The main goal of IM is to extract concepts at different levels; the generalisation and instantiation of concepts are realised across these conceptual levels, with the lower-level concepts supporting the abstraction of the upper-level concepts and the upper-level concepts guiding the extraction of the lower-level ones. This implements the progression from data to information, from information to knowledge, and from low-level concepts to high-level concepts.

Many IM technologies have been studied for different application environments. A development team led by Fayyad in Canada systematically studied the related theories and techniques of multimedia data mining, and another team engaged in knowledge discovery from image sets (see Foschi et al. 2002). An institute in Georgia, USA, studied IM association rules for target objects based on content-based image retrieval. North Dakota State University carried out remote-sensing IM based on association rules over multi-band grey values and agricultural output data, and the research was applied to precision agriculture (Zhang et al. 2001). Zhang et al. of the National University of Singapore studied IM theory, proposing a framework for information-driven IM (see Burl et al. 1999). The Diamond Eye system, which has been used successfully, was developed and studied at the National Aeronautics and Space Administration's (NASA) Jet Propulsion Laboratory. A prototype IM software system was developed by Ordonez and Omiecinski (1999). The Blobworld mining system, based on a content-based image retrieval system, was developed by Datcu et al. (1998) at the University of California, Berkeley; it proposed a more efficient method for mining association rules in two-dimensional colour images. A research team led by Mihai Datcu at the German Aerospace Centre proposed a conceptual framework for the retrieval of remote-sensing image data, based on content-based image retrieval, and developed a set of knowledge-driven IM systems for remote


sensing images. The system employed an architecture that enabled human–computer interaction over the Internet and could be combined with specific applications (see Zaiane et al. 1998). The data mining research centre at Alabama State University, Huntsville, developed a geospatial data mining software prototype, called ‘Adam’. It aims at mining meteorological satellite images. Working with NASA, the research team is preparing to apply IM technology to global change research. There are also reports on medical IM and geological spatial data mining software from Foschi et al. (2003), Liu et al. (2003) and Scholkopf and Smola (2002).
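The association-rule analysis mentioned in this subsection reduces to computing support and confidence over a transaction database. The toy Python sketch below illustrates the idea on the classic diapers-and-beer example; in image mining, each "transaction" would instead be the set of discretised attributes of an image region. The data and thresholds are invented for illustration, and this is not the CARM algorithm of Chapter 8.

```python
from itertools import combinations

# Toy transactions; in image mining these would be discretised region attributes.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
    {"diapers", "beer", "bread"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    s, c = support({a, b}), confidence({a}, {b})
    if s >= 0.4 and c >= 0.6:          # minimum support / confidence thresholds
        print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

Rules that clear both thresholds ("if diapers then beer") are exactly the kind of unexpected, previously unknown regularities that data mining is meant to surface.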

1.2 Challenges to Multidimensional Night-Vision Data Mining Data structure analyses, feature representation learning, dimension reduction classifications and information mining theory have all been studied extensively. These models and algorithms have obvious advantages over conventional methods in terms of information understanding. However, there are few studies on feature mining of night-vision images. For complex night-vision information processing, the manifold learning, classification and data mining methods still require investigation. The following are primary problems faced by multidimensional night-vision information understanding. • According to the characteristics of night-vision data, extant data analytic methods should be optimised. Aiming to excavate rules and patterns on complex, disordered, nonlinear and unstructured night-vision data, researchers studied data structure analysis, feature-learning and rule mining algorithms, which produce

1.2 Challenges to Multidimensional Night-Vision Data Mining

11

generalisation abilities, to improve comprehension and analysis of complex information. Thus, they can adapt to different vision tasks and output requirements. For example, to fully exploit nonlinear structural relationships in data, kernel learning is widely used. Via kernel processing, many learning classification methods based on linear hypotheses can be applied to nonlinear complex data. Shawe-Taylor and Cristianini (2004), Pan et al. (2009) improved the performance of the learning machine and classifier. • For night-vision dimension reduction classification, low signal-to-noise ratios (SNR) of low-light images and low-contrast infrared images, it is easy to generate outliers. Too many outliers lead to the inaccurate embedding of high dimensions to low dimensions. To reduce outlier interference, current manifold learning methods typically provide two solutions: detect outliers or filter noisy data. See Zhang and Zha (2005), Shuzhi et al. (2012), who improved local distance metrics of Zhao and Zhang (2009) and Zhang (2009), during preprocessing to adapt classification attributes. However, the intrinsic manifold structure of night-vision data usually has high complexity and randomness. When class distribution is discrete or the selected sample is not representative, it may lead to greater deviations between the obtained results and the actual classification (see Li and Zhang 2011; Hastie et al. 2009). Therefore, similarity evaluation methods that match the classification properties of night-vision data, which have more characteristics, can provide an effective strategy for outlier suppression and neighbourhood selection of night-vision data. This becomes a difficult problem for massive night-vision data classification. • With night-vision data classification and discrimination having massive information, classification learning algorithms face many difficulties, such as tagging burdens, overwhelming categories, high dimensions and large scales. Owing to the dimensional disaster described by Jegou et al. (2010), many machine-learning algorithms apply to low-dimensional space but do not apply to high-dimensional data (see Duda et al. 2000). The current massive image processing systems typically use bag-of-features methods (see Zhang 2010). In this case, the feature dimension of the image is very high, greatly reducing the performance of the traditional machine-learning algorithm. However, it cannot achieve the ideal discriminant effect, because its computational complexity is high. However, in many practical night-vision applications, it is easy to obtain large amounts of unlabelled data. Yet, it is difficult to get labelled samples. The cost of tagging has been minimised using information not provided by unlabelled data (see Chapelle et al. 2006). For classification and discrimination of night-vision data in practical applications, non-training or semi-supervised learning methods should be studied. Implicit rules of input data are discovered by mining, avoiding the difficulty of large datasets and improving the small-sample learning ability of night-vision data.

12

1 Introduction

1.3 Summary For complex and multidimensional night-vision information processing, extant feature mining learning methods still need adaptive research. This book focuses on related theories and methods of multi-source and high-dimensional night-vision image information understanding. This involves spectral imaging and noise reduction, multi-vision task optimisation, multidimensional data feature classification, learning and non-learning target detection and tracking, rule mining and colorization. The remaining chapters of this book are arranged as follows: • Chapter 2 presents spectral imaging and coding noise reduction, describing the principle and system of multispectral imaging, based on coding technology. It describes related encoding noise reduction and enhancement of SNR in spectral imaging. • Chapter 3 presents multi-vision tasks based on data structure and feature analysis. It describes the related contents of super-resolution (SR) reconstruction, superpixel segmentation and saliency analysis in visible and infrared images. • Chapter 4 presents feature classification based on manifold dimension reduction. It describes the related contents of feature reduction and classification of night-vision images based on manifold learning. • Chapter 5 presents data classification based on sparse representation and random subspace. It describes the related contents of semi-supervised classification of high-dimensional data, based on sparse representation and random subspaces. • Chapter 6 presents target detection and learning-based recognition. It describes the classification and target detection of multi-band image data based on depth learning and deformable parts model (DPM). • Chapter 7 presents motion detection and tracking based on non-learning. It describes related contents of visible-light and infrared target tracking, motion recognition, detection and localisation, based on robust structure features. • Chapter 8 presents colorization of night-vision images based on rule mining. It describes related content of night-vision image colorization, based on association rule mining and multi-rule learning.

References Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396. Burl, M. C., Fowlkes, C., Roden, J., Stechert, A., Muukhtar, S. (1999). Diamond eye: A distributed architecture for image data mining. In SPIE Conference on Data Mining and Knowledge Discovery (Vol. 3695, pp. 197–206).


Chang, H., & Yeung, D. (2006). Robust locally linear embedding. Pattern Recognition, 39(6), 1053–1065. Chapelle, O., Scholkopf, B., Zien, A. (2006). Semi-supervised learning. The MIT Press. Chen, H., Chang, H., Liu, T. (2005). Local discriminant embedding and its variants. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 846–853). Chen, J., Ma, S. (2011). Locally linear embedding: A review. International Journal of Pattern Recognition and Artificial Intelligence, 25(07). Choi, H., & Choi, S. (2007). Robust kernel Isomap. Pattern Recognition, 40, 853–862. Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1), 5–30. Cox, T., & Cox, M. (2008). Multidimensional scaling (pp. 315–347). Berlin Heidelberg: Springer. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 886–893), San Diego, CA, USA. Datcu, M., Seidel, K., Walessa, M. (1998). Spatial information retrieval from remote sensing images. Part I: Information theoretical perspective. IEEE Transactions on Geoscience & Remote Sensing, 36(5), 1431–1445. de Ridder, D., Kouropteva, O., Okun, O., Pietikäinen, M., & Duin, R. P. W. (2003). Supervised locally linear embedding. ICANN/ICONIP, 2(714), 333–341. Donoho, D. (2000). High-dimensional data analysis: The curses and blessings of dimensionality. In Mathematical Challenges of the 21st Century. Los Angeles, CA: American Mathematical Society. Donoho, D. L., Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of National Academy of Sciences of the United States of America, 100(10), 5591–5596. Duda, R. O., Hart, P. E., Stork, D. G. (2000). Pattern classification. Wiley-Interscience Publication. Edelman, S. (1999). Representation and recognition in vision. Cambridge, MA: MIT Press. Foschi, P. G., Kolippakkam, D., Liu, H., Mandvikar, A. (2002). Feature extraction for image mining. In International Workshop on Multimedia Information Systems (pp. 103–109). Foschi, P. G., Kolippakkam, D., Liu, H. (2003). Feature selection for image data via learning. IMMCN, 1299–1302. Ge, S. S., He, H., & Shen, C. (2012). Geometrically local embedding in manifolds for dimension reduction. Pattern Recognition, 45, 1455–1470. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York: Springer. He, X., Cai, D., Yan, S., Zhang, H. J. (2005). Neighbourhood preserving embedding. In Proceedings 10th IEEE International Conference on Computer Vision (pp. 1208–1213). He, X., Niyogi, P. (2003). Locality-preserving projections (LPP). Advances in Neural Information Processing Systems, 155–160. Jegou, H., Douze, M., & Schmid, C. (2010). Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87(3), 316–336. Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 119–155. Lazebnik, S., Schmid, C., Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2169–2178), New York. Li, J. (2012). Gabor filter based optical image recognition using Fractional Power Polynomial model based common discriminant locality-preserving projection with kernels. Optics and Lasers in Engineering, 50, 1281–1286. Li, J., Pan, J.-S., & Chen, S.-M. (2011). Kernel self-optimised locality-preserving discriminant analysis for feature extraction and recognition. Neurocomputing, 74, 3019–3027.


Li, F. F., Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 524–531), San Diego, CA, USA. Li, B., & Zhang, Y. (2011). Supervised locally linear embedding projection (SLLEP) for machinery fault diagnosis. Mechanical Systems and Signal Processing, 25, 3125–3134. Liu, H., Mandvikar, A., Foschi, P. (2003). An active learning approach to Egeria densa detection in digital imagery. In Next generation of data-mining applications (pp. 189–210). Wiley. Liu, L., Wang, L., Liu, X. (2011). In defence of soft-assignment coding. In IEEE International Conference on Computer Vision (ICCV) (pp. 2486–2493), Barcelona, Spain. Ojala, T., Pietikäinen, M., & Mäenpää, T. (2002). Multiresolution grayscale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987. Ordonez, C., Omiecinski, E. (1999). Discovering association rules based on image content. In Proceedings IEEE Forum on Research and Technology Advances in Digital Libraries (Vol. 1999, pp. 38–49). Pan, Y., Ge, S. S., & Mamun, A. A. (2009). Weighted locally linear embedding for dimension reduction. Pattern Recognition, 42, 798–811. Philbin, J., Chum, O., Isard, M., et al. (2008). Lost in quantisation: Improving object retrieval in large scale image databases. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8), Anchorage, AK, USA. Roweis, S. T., Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326. Rubner, Y., Tomasi, C., & Guibas, L. J. (2010). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121. Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining (pp. 29–39). Rumelhart, D. E., Hinton, G. E., Williams, R. J. (1988). Learning representations by backpropagating errors. In Neurocomputing: Foundations of research (pp. 533–536). MIT Press. Schölkopf, B., Platt, J., Hofmann, T. (2006). Greedy layer-wise training of deep networks. In International Conference on Neural Information Processing Systems (pp. 153–160). MIT Press. Scholkopf, B., Smola, A. J. (2002). Learning with kernels: Support vector machines, regularisation, optimisation, and beyond. The MIT Press. Seung, H. S., Lee, D. D. (2000). The manifold ways of perception. Science, 2268–2269. Shawe-Taylor, J., Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge University Press. Tenenbaum, J., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimension reduction. Science, 290(5500), 2319–2323. Wang, J., Yang, J., Yu, K., et al. (2010). Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3360–3367), San Francisco, CA, USA. Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by semi-definite programming. International Journal of Computer Vision, 70(1), 77–90. Wen, G. (2009). Relative transformation-based neighbourhood optimisation for isometric embedding. Neurocomputing, 72, 1205–1213. Xiao, R. (2011). Discriminant analysis and manifold learning of high dimensional space patterns. Shanghai Jiao Tong University.
Yan, S., Xu, D., Zhang, B., Zhang, J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51. Yang, L. (2005a). Building k-edge-connected neighbourhood graphs for distance-based data projection. Pattern Recognition Letters, 26(13), 2015–2021.


Yang, L. (2005b). Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1680–1683. Yang, L. (2006a). Building k-connected neighbourhood graphs for isometric data embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 827–831. Yang, L. (2006b). Locally multidimensional scaling for nonlinear dimensionality reduction. In 18th International Conference Pattern Recognition (ICPR'06) (Vol. 4, pp. 202–205). Zaiane, O. R., Han, J. (2000). Discovering spatial associations in images. In Proceedings of SPIE (Vol. 4057, pp. 138–147). Zaiane, O. R., Han, J., Li, S., Chee, S. H., Chiang, J. Y. (1998). MultiMediaMiner: A system prototype for multimedia data mining. In ACM SIGMOD International Conference on Management of Data (Vol. 27, No. 2, pp. 581–583). Zhang, S. (2009). Enhanced supervised locally linear embedding. Pattern Recognition Letters, 30, 1208–1218. Zhang, L. (2010). Research and application of large-scale machine learning theory. Zhejiang University. Zhang, J., Hsu, W., Lee, M. (2001). An information-driven framework for image mining. In The 12th International Conference on Database and Expert Systems Applications (Vol. 2113, pp. 232–242). Zhang, Z., Yang, F., Xia, K., & Yang, R. (2008). A supervised LPP algorithm and its application to face recognition. Journal of Electronics and Information Technology, 30(3), 539–541. Zhang, T., Yang, J., & Zhao, D. (2007). Linear local tangent space alignment and application to face recognition. Neurocomputing, 70(7–9), 1547–1553. Zhang, Z., & Zha, H. (2005). Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal on Scientific Computing, 26(1), 313–338. Zhao, L., & Zhang, Z. (2009). Supervised locally linear embedding with probability based distance for classification. Computers & Mathematics with Applications, 57, 919–926.

Chapter 2

High-SNR Hyperspectral Night-Vision Image Acquisition with Multiplexing

High throughput and high spectral resolution are essential requirements for spectrometers. Conventional slit-based spectrometers require the input slit to be narrow to achieve a reasonable resolution; however, too small a slit cannot gather enough radiation. Many designs have been presented to address these demands, aiming to maximise throughput without sacrificing spectral resolution (the Jacquinot advantage). Over several decades, there have been two important, strongly investigated approaches to improving spectrometre performance. One resulted in the coded-aperture spectrometre (CAS); the other resulted in the Fourier transform spectrometre (FTS). CAS replaced the slit with a two-dimensional coded matrix aperture (i.e. a mask), introduced to increase light throughput without loss of spectral resolution. After more than half a century of development, the leading CAS is the Hadamard transform spectrometre (HTS), whose encoded-aperture theories are based on Hadamard matrices. More recently, however, new static multiplex CASs have been proposed, based on new mathematical models. In this chapter, we introduce multiplexing measurements applied to spectrometers for high-SNR data acquisition.

2.1 Multiplexing Measurement in Hyperspectral Imaging

First, we define the SNR of the acquired signal as

SNR = signal / noise.    (2.1)

With spectrometers, the source is typically of low intensity, and the dispersed signal is very weak and noisy. However, the SNR of the data can be improved via two methods: improving the intensity of the signal and reducing the noise. For the second method, the sensitivity of the detector can be markedly improved. For the first method, it turns out to be impossible to improve the SNR by directly enhancing the signal, because non-laboratory detection cannot change


the light source condition. If multiple signals are combined and superposed, every measurement value will be much higher than an individual signal, and the SNR will increase. To restore all the individual signals, a well-designed combination (i.e. coding) is required. Measuring several combinations of signals at once is called 'multiplex measurement', which is an effective way to solve the low-SNR problem of single-signal detection. In spectral measurement, multiplexing technology can be broadly divided into two major categories: Hadamard transform and Fourier transform. Multiplexing theory is a form of weighing theory, which studies how to accurately measure the weights of several objects; it carries over directly to optical systems. Suppose the weights of seven objects, f1, f2, f3, ..., f7, need to be measured. Conventionally, we would measure the weight of each object individually. Obviously, if the instrument's lowest measurable value is near the weight of the measured object, the measurement error will be large or the measurement may fail completely. Therefore, to improve measurement quality, we combine objects and measure the weight of several objects at once. This is the multiplexing technique used in spectral measurement systems. Equation (2.2) shows a coding measurement scheme for the seven unknown quantities: seven measurements are made, and four different quantities are combined in each measurement.

f1 + f2 + f3 + f5 + e1 = η1
f1 + f2 + f4 + f7 + e2 = η2
f1 + f3 + f6 + f7 + e3 = η3
f2 + f5 + f6 + f7 + e4 = η4
f1 + f4 + f5 + f6 + e5 = η5
f3 + f4 + f5 + f7 + e6 = η6
f2 + f3 + f4 + f6 + e7 = η7    (2.2)

where f1, f2, ..., f7 are the quantities to be measured, e1, e2, ..., e7 are the noises of each measurement and η1, η2, ..., η7 are the measurement results. In matrix form, this can be represented as

Hf + e = η.    (2.3)

In Eq. (2.3), H is the coding matrix, f is the vector of signals to be measured, e is the vector of measurement noises and η is the vector of measurement results. The original signal can be obtained by solving Eq. (2.3), as shown in Eq. (2.4):

f̂ = H⁻¹η = H⁻¹Hf + H⁻¹e.    (2.4)
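To make the coding measurement of Eqs. (2.2)–(2.4) concrete, the short NumPy sketch below (our own illustration, not code from this book; the signal values, noise level and random seed are arbitrary assumptions) builds the 7 × 7 coding matrix of Eq. (2.2), simulates noisy multiplexed measurements and recovers the signals by inverting the coding matrix, alongside a direct one-at-a-time measurement for comparison.

```python
import numpy as np

# Coding matrix of Eq. (2.2): each measurement combines four of the seven signals.
H = np.array([
    [1, 1, 1, 0, 1, 0, 0],   # eta1 = f1 + f2 + f3 + f5 + e1
    [1, 1, 0, 1, 0, 0, 1],   # eta2 = f1 + f2 + f4 + f7 + e2
    [1, 0, 1, 0, 0, 1, 1],   # eta3 = f1 + f3 + f6 + f7 + e3
    [0, 1, 0, 0, 1, 1, 1],   # eta4 = f2 + f5 + f6 + f7 + e4
    [1, 0, 0, 1, 1, 1, 0],   # eta5 = f1 + f4 + f5 + f6 + e5
    [0, 0, 1, 1, 1, 0, 1],   # eta6 = f3 + f4 + f5 + f7 + e6
    [0, 1, 1, 1, 0, 1, 0],   # eta7 = f2 + f3 + f4 + f6 + e7
], dtype=float)

rng = np.random.default_rng(0)
f = rng.uniform(1.0, 5.0, size=7)              # true quantities f1..f7 (assumed values)
sigma = 0.5                                    # per-measurement noise level (assumed)

eta = H @ f + rng.normal(0.0, sigma, size=7)   # multiplexed measurement, Eq. (2.3)
f_hat = np.linalg.solve(H, eta)                # recovery, Eq. (2.4)
f_direct = f + rng.normal(0.0, sigma, size=7)  # measuring each quantity individually

print("true f      :", np.round(f, 3))
print("multiplexed :", np.round(f_hat, 3))
print("direct      :", np.round(f_direct, 3))
```

Averaged over many runs, the multiplexed estimate has a lower mean square error than the direct one; this is exactly the effect quantified in Sect. 2.2.1.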

From Eq. (2.4), the luminous flux of the combined measurements is significantly higher than that of individual measurements of each signal. In Golay's (1949) paper, he proposed replacing the incident and outgoing slits of conventional spectrometers with multiple slits that can be switched, aiming to increase the amount of light transmitted by the system by introducing multiplexing. In Golay's (1951) study, he


proposed replacing the multi-slit with a static coded slot, analysed aberrations from a systematic perspective and provided experimental results. Over the next two decades, several experimental systems and theoretical analyses based on multi-slit coding were proposed (see Girard 1963; Ibbett et al. 1968). Later, Harwit and Sloane (1979) made a detailed summary with improvements of previous works, forming the Hadamard transform spectral detection theory and giving us the HTS.

2.2 Denoising Theory and HTS

Harwit and Sloane (1979) established a general framework for denoising using optimal multiplexing, based on two main assumptions: (a) noise is independent of the signal level and (b) noise is independent of the total weight on the balance. Where photon noise is concerned, they did not combine sensor and photon noise in their main theory. Damaschini (1993) demonstrated that Poisson noise (i.e. photon noise) reduces the SNR boost of Hadamard multiplexing, and Wuttig (2005) showed that, if Poisson noise is the only noise contribution, Hadamard multiplexing will reduce the SNR. System performance depends not only on the noise type but also on the source structure (i.e. sparsity). Based on a sparsity prior, state-of-the-art numerical compressed sensing (CS) reconstruction methods have been proposed to address the Poisson noise problem (see Moses et al. 2012; Bian et al. 2013). Unlike sensing matrices (N × M) in CS, weighting matrices (N × N) in HTS are normally nonnegative square matrices. The CS reconstruction is an optimisation problem, whereas the HTS reconstruction has an exact solution. Nevertheless, CS reconstruction methods (e.g. total variation and regularised estimation strategies) were employed in HTS by Mrozack et al. (2012).

2.2.1 Traditional Denoising Theory of HTS

Harwit and Sloane did the most detailed and complete work on Hadamard coding denoising theory (see Harwit and Sloane 1979), obtaining the most widely recognised theoretical results. The theory is based on comparison with conventional slit spectrometers. First, we assume the expected measuring error is zero:

E{e} = 0,    (2.5)

where e is the measuring error and E is the expected value. We assume the errors e and e′ of two different measurements are completely uncorrelated and independent, satisfying

E{ee′} = 0.    (2.6)

The squared error e² is non-negative, and σ² is the average value of e²:

E{e²} = σ²,    (2.7)

where σ² is the mean square deviation of the error and σ is the standard deviation. Suppose four targets are measured separately. Their true values are recorded as f1, f2, f3, f4; the actual measurements as η1, η2, η3, η4; and the measuring errors as e1, e2, e3, e4. The four measurements can be expressed, respectively, as

η1 = f1 + e1
η2 = f2 + e2
η3 = f3 + e3
η4 = f4 + e4    (2.8)

x̂ denotes the estimate of x. For this measurement, we use ηi to estimate the true value fi and ei to estimate the error. Based on the first assumption, the average value of the error is zero:

E{ei} = 0,  E{f̂i} = fi.    (2.9)

An estimate that satisfies Eq. (2.9) is called an 'unbiased estimate'. Per Eq. (2.7), the mean square error of every measurement is σ². To analyse the results of coded measurements, we use '1' and '0' as encoding units to measure the original data. The coded measurements are

η1 = f1 + f3 + e1,
η2 = f2 + f3 + e2,
η3 = f1 + f2 + e3.    (2.10)

The estimates of the measured quantities are

f̂1 = (η1 − η2 + η3)/2,
f̂2 = (−η1 + η2 + η3)/2,
f̂3 = (η1 + η2 − η3)/2.    (2.11)

The mean square error of each estimate is

E{(f̂i − fi)²} = (3/4)σ².    (2.12)
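A quick Monte-Carlo check of Eq. (2.12) can be written as follows (our own sketch; the true values, noise level and trial count are arbitrary assumptions). It draws many noisy realisations of the coded measurements of Eq. (2.10), applies the estimates of Eq. (2.11) and confirms that the per-estimate mean square error is close to 3σ²/4.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.array([2.0, 3.0, 5.0])        # true values f1, f2, f3 (assumed)
sigma, trials = 1.0, 200_000

e = rng.normal(0.0, sigma, size=(trials, 3))
eta1 = f[0] + f[2] + e[:, 0]         # Eq. (2.10)
eta2 = f[1] + f[2] + e[:, 1]
eta3 = f[0] + f[1] + e[:, 2]

f_hat = 0.5 * np.stack([ eta1 - eta2 + eta3,    # Eq. (2.11)
                        -eta1 + eta2 + eta3,
                         eta1 + eta2 - eta3], axis=1)

mse = np.mean((f_hat - f) ** 2, axis=0)
print("empirical MSE per estimate:", np.round(mse, 4))
print("predicted by Eq. (2.12)   :", 0.75 * sigma**2)
```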


From Eq. (2.12), we see that the mean square error of the estimate is reduced to 3/4 of the original. We now express the above process in matrix form. First, the n true values are collected in an n-dimensional vector f, and an n × n matrix S expresses the multi-signal superposition mode. Similarly, vectors express the n measurement values and their corresponding errors, resulting in

η = Sf + e,    (2.13)

where f = (f1, f2, ..., fn)ᵀ, η = (η1, η2, ..., ηn)ᵀ and e = (e1, e2, ..., en)ᵀ. From the multiplexed measurement of Eq. (2.13), the value to be measured is estimated as

f̂ = S⁻¹η = S⁻¹Sf + S⁻¹e.    (2.14)

According to Eq. (2.9), the expected value of the estimate can be expressed as

E{f̂} = S⁻¹Sf.    (2.15)

To further analyse the gap between the estimated and true values, we use f̂i to denote the ith estimate. The error and mean square error can then be written, respectively, as

f̂i − fi = S⁻¹_{i,1}e1 + S⁻¹_{i,2}e2 + ··· + S⁻¹_{i,n}en,
εi = E{(f̂i − fi)²} = σ²[(S⁻¹_{i,1})² + (S⁻¹_{i,2})² + ··· + (S⁻¹_{i,n})²].    (2.16)

The average of the n mean square errors (MSE) is

ε = (1/n)(ε1 + ε2 + ··· + εn)
  = (σ²/n)[(S⁻¹_{1,1})² + (S⁻¹_{1,2})² + ··· + (S⁻¹_{n,n})²]
  = (σ²/n) Trace[S⁻¹(S⁻¹)ᵀ]
  = (σ²/n) Trace[(SᵀS)⁻¹],    (2.17)


where Trace denotes the sum of the diagonal elements of a matrix. If the coding matrix is a Hadamard cyclic-S matrix, Eq. (2.17) can be rewritten as

ε = (σ²/n) Trace[(SᵀS)⁻¹] ≈ 4nσ²/(n + 1)².    (2.18)

If the size of the code is large, Eq. (2.18) can be approximated as

ε ≈ 4σ²/n.    (2.19)

Thus, comparing measurements with the Hadamard cyclic-S matrix against individual measurements of each variable, the mean square error of the former is reduced to 4/n of the latter. Therefore, the root-mean-square SNR is improved by a factor of about √n/2.
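The average-MSE expressions of Eqs. (2.17)–(2.19) can be checked numerically. The sketch below (our own illustration) constructs S-matrices of order n = 2^k − 1 from Sylvester Hadamard matrices — delete the all-ones first row and column and map +1 → 0, −1 → 1 — and compares the exact trace formula with the 4nσ²/(n + 1)² expression and the large-n approximation 4σ²/n.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of two), Sylvester construction."""
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def s_matrix(n):
    """S-matrix of order n = 2**k - 1: drop the all-ones row/column of a Hadamard
    matrix of order n + 1 and map +1 -> 0, -1 -> 1."""
    core = sylvester_hadamard(n + 1)[1:, 1:]
    return ((1 - core) // 2).astype(float)

sigma = 1.0
for n in (7, 15, 31, 63):
    S = s_matrix(n)
    eps_trace = sigma**2 / n * np.trace(np.linalg.inv(S.T @ S))   # Eq. (2.17)
    eps_s = 4 * n * sigma**2 / (n + 1)**2                         # Eq. (2.18)
    print(f"n={n:3d}  trace formula={eps_trace:.4f}  "
          f"4n*sigma^2/(n+1)^2={eps_s:.4f}  4*sigma^2/n={4 * sigma**2 / n:.4f}")
```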

2.2.2 Denoising Bound Analysis of HTS with S Matrix

The HTS mathematical problem can be defined using a linear measurement model:

g = Sf + n_HTS,    (2.20)

where g is the measured data, S is a weighting matrix, f is the original signal and n is noise. H and S are the two main weighting matrices (see Streeter et al. 2009). The S-matrix is more commonly used because its entries are either one or zero, which can easily be implemented as open or closed states. For S-matrix-based HTS, the increase in the root-mean-square (RMS) SNR, compared to the RMS SNR of traditional slit-based spectrometry, was described by Tilotta et al. (1987):

(SNR)_RMS = (σ²/ε)^(1/2) = (N + 1)/(2√N),    (2.21)

where ε is the mean square error in the estimate of a resolution element's actual value in HTS and σ is the standard deviation of a measurement in a traditional slit-based spectrometre. However, Tilotta et al. (1987) explained that this holds only if several conditions are met. One states that the error in each measurement, e, must be independent of the total weight on the balance, and the errors of different measurements, e and e′, must be independent. Normally, the low number of photons detected during hyperspectral imaging results in a poor SNR. This can be resolved by employing a more sensitive sensor that reduces the detector noise or by increasing the signal throughput. The total measured noise is a mix of photon, dark, readout and digitisation noise. The latter three are


called 'detector noise', because they are typically object independent (see Snyder et al. 1995; Shimano 2006). Because photon noise has a Poisson distribution, it can be calculated as the square root of the total number of photons incident on the sensor. To investigate poor SNR, we consider the hyperspectral sensor HICO of Lucke et al. (2011). The quadratic sum of the detector noise for this sensor is about 10⁴ electrons, whereas the signal exceeds 10⁴ electrons. Thus, the system is essentially photon-noise limited (see Bian et al. 2013). The SNR of the measured data is estimated as

SNR = signal(≈ noise²_photon) / √(noise²_photon + noise²_detector) ≈ 70,    (2.22)

because the SNR is a monotonically increasing function of the photon noise when the detector noise is object independent.
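For orientation, the following back-of-the-envelope calculation (our own sketch, using the approximate HICO-like figures quoted above as assumptions) evaluates Eq. (2.22).

```python
import math

signal = 1.0e4              # detected photoelectrons; by Poisson statistics the
photon_noise_sq = signal    # photon-noise variance equals the signal level
detector_noise_sq = 1.0e4   # quadratic sum of dark, readout and digitisation noise

snr = signal / math.sqrt(photon_noise_sq + detector_noise_sq)
print(f"SNR = {snr:.1f}")   # roughly 70 for these round numbers
```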

The system will exhibit a good SNR (≥70) while it is photon-noise limited; the sensitivity of the sensor is then the limiting factor in determining the SNR. Classification is commonly employed for hyperspectral datasets, and a range of techniques are found in the literature, such as K-nearest neighbour (KNN), Gaussian process maximum likelihood and the support vector machine (SVM) (see Bian et al. 2013). The Euclidean distance is often used to estimate the distinction between two spectra, fa and fb:

D_E = [(fa − fb)ᵀ(fa − fb)]^(1/2) = |ΔfᵀΔf|^(1/2).    (2.23)

Noise is not included in this equation. If two spectra are acquired by a traditional spectrometre, a value SNR^DE is defined to evaluate the disturbance in the Euclidean distance caused by noise. For a traditional slit spectrometre, this value is given by

SNR^DE_slit = |ΔfᵀΔf / Δn_slitᵀΔn_slit|^(1/2),    (2.24)

where Δf is the difference between the two spectra and Δn is the difference between the noises. The two terms ΔfᵀΔf and ΔnᵀΔn are treated as the Euclidean distances of Δf and Δn from the origin, respectively. In HTS, the spectrum is obtained indirectly by demodulating the measured data, f̂ = S⁻¹g = f + S⁻¹n_HTS, giving the SNR^DE of HTS as

SNR^DE_HTS = |ΔfᵀΔf / (S⁻¹Δn_HTS)ᵀ(S⁻¹Δn_HTS)|^(1/2),    (2.25)

where V = S⁻¹ = 2(2S − J_N)/(N + 1), and J_N denotes an N × N matrix of ones. The term in the denominator can be expressed as

(S⁻¹Δn_HTS)ᵀ(S⁻¹Δn_HTS) = [4/(N + 1)] Δn_HTSᵀ I Δn_HTS − [4/(N + 1)²] Δn_HTSᵀ J_N Δn_HTS,    (2.26)

and from linear algebra

Δn_HTSᵀ J_N Δn_HTS = [Σ_{i=1}^{N} Δn_HTS(i)]².    (2.27)

Thus, on the right-hand side of Eq. (2.27), the sum runs over the elements of the noise vector Δn_HTS. Substituting Eq. (2.27) into Eq. (2.26) gives

(S⁻¹Δn_HTS)ᵀ(S⁻¹Δn_HTS) = [4/(N + 1)²] { Δn_HTSᵀ I Δn_HTS + (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} [Δn_HTS(i) − Δn_HTS(j)]² },  i ≠ j.    (2.28)

Equations (2.26) and (2.28) give the upper and lower bounds, respectively, of (S⁻¹Δn_HTS)ᵀ(S⁻¹Δn_HTS). If the expectation value of Δn_HTS(i) tends to zero, then the lower bound will be achieved. However, if Δn_HTS(i) ≈ Δn_HTS(j), then the upper bound will be achieved. The SNR^DE of HTS can then be expressed as

SNR^DE_HTS = k |ΔfᵀΔf / Δn_HTSᵀΔn_HTS|^(1/2)    (2.29)

for √(N + 1)/2 ≤ k ≤ (N + 1)/2. If the source is weak, detector noise will dominate in both slit-based spectrometry and HTS: n_slit(i) ≈ n_HTS(i). Based on Eqs. (2.24) and (2.29), HTS will then have a much higher SNR than slit-based spectrometry; thus, HTS can improve the sensor's effective sensitivity. In the case of a nearly flat spectrum signal, ⟨f(i)⟩ = ⟨f⟩, the photon noise in slit-based spectrometry and HTS, studied by Lucke et al. (2011), is related approximately by

n_HTS^photon(i) = √[(N + 1)/2] · n_slit^photon(i).    (2.30)

If the noise is dominated by photon noise, then substituting Eq. (2.30) into Eq. (2.29) gives SNR^DE_HTS ≈ (√2/2) SNR^DE_slit, i.e. slit-based spectrometry has a better SNR than HTS. However, based on Eqs. (2.26) and (2.28), the value of SNR_HTS also depends on the structures of the signal and the noise. Moreover, if the noise Δn_HTS tends to be flat, the upper bound of SNR_HTS can be achieved; a flatter noise Δn_HTS will therefore improve HTS performance. Whereas there is no perfectly flat noise in reality, in CS,


components of f with small magnitudes are set to zero to obtain sparse solutions; this method can also be adopted in HTS. Moreover, according to Eqs. (2.26) and (2.28), the noise (S⁻¹Δn_HTS)ᵀ(S⁻¹Δn_HTS) in HTS is independent of the original signal f. The noise n_HTS^restore(k) related to the reconstructed signal f̂(k) can be expressed as

n_HTS^restore(k) = [2/(N + 1)] ( Σ_{i∈A} n_HTS(i) − Σ_{j∈B} n_HTS(j) ),
A = {x | S⁻¹(k, x) > 0},  B = {x | S⁻¹(k, x) < 0},    (2.31)

where n_HTS^restore(k) is dominated by the mean of n_HTS(i) and is independent of the true signal f(k).

2.2.3 Denoising Bound Analysis of HTS with H Matrix

The Hadamard matrix H is seldom used in multiplexed coding measurement because the entry −1 cannot be realised in an optical path. However, according to Jiang et al. (2014), encoding the sample illumination with the H-matrix improves the SNR of spectral images. Here, a spatial pixel-coded Hadamard transform spectrometre is used to achieve Hadamard H-matrix multiplex coding. Moreover, the denoising capability of HTS with the H-matrix is analysed, and both simulation and experiment are carried out. To realise the −1 coding, the Hadamard H-matrix is divided into two matrices, H+ and H−, with H = H+ − H−. Specifically, the two matrices are written as

H+ = (J + H)/2,  H− = (J − H)/2,    (2.32)

where every element of J equals 1 and both H+ and H− are composed only of 1 and 0. Aside from high throughput and efficiency, the H-matrix gains a significant SNR improvement, even though it requires the doubled measurements of the H− and H+ matrices. The multiplex coding measurement can be defined with a linear model,

g = Sf + n_HTS,    (2.33)

where g is the measured coded data, f is the original signal to be recovered, S is the coding matrix and n is noise. Because the coding H-matrix is divided into two matrices, the coding problem can be written as two equations:

g+ = H+f + n+,  g− = H−f + n−.    (2.34)
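The splitting of Eq. (2.32) and the two-shot measurement of Eq. (2.34) can be sketched as below (our own illustration; the test spectrum, noise level and 8 × 8 matrix size are assumptions, and the sign convention follows H = H+ − H− with both masks containing only 0 and 1). Because HHᵀ = nI for a Hadamard matrix, the demodulation f̂ = H⁻¹(g+ − g−) reduces to a transpose and a division by n.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of two), Sylvester construction."""
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = sylvester_hadamard(n)
H_plus = (H + 1) // 2                  # 1 where H = +1, Eq. (2.32)
H_minus = (1 - H) // 2                 # 1 where H = -1
assert np.array_equal(H, H_plus - H_minus)

rng = np.random.default_rng(2)
f = rng.uniform(0.5, 2.0, size=n)      # spectrum to recover (assumed test values)
sigma = 0.05                           # detector-noise level (assumed)

g_plus = H_plus @ f + rng.normal(0, sigma, n)    # coded measurements, Eq. (2.34)
g_minus = H_minus @ f + rng.normal(0, sigma, n)

f_hat = H.T @ (g_plus - g_minus) / n   # demodulation, using H^{-1} = H^T / n
print("true f:", np.round(f, 3))
print("f_hat :", np.round(f_hat, 3))
```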


Here, g+ and g− are the measured coded data related to H+ and H−, respectively, and n+ and n− are the uncorrelated noises of those two measurements. To evaluate the denoising performance of H-matrix weighting, the Euclidean distance is employed to quantify the distance between two measured spectra, fa and fb, written as

D_E = [(fa − fb)ᵀ(fa − fb)]^(1/2) = |ΔfᵀΔf|^(1/2).    (2.35)

Furthermore, the SNR of the distance D_E is defined as

SNR^DE = 10 × log₁₀ |ΔfᵀΔf / ΔnᵀΔn|^(1/2),    (2.36)

(2.36)

where ffa −fb and nna −nb . For multiplex measurements, the original spectrum is not obtained directly, but is demodulated from measured coded data, fˆ  H−1 (g+ − g− ). Thus, the SNR of the distance can be written as  1/2  T   f f DE  SNRH  10 × log10  −1 . (2.37) T −1 (H nH ) (H nH )  Per the mathematical characteristic of the H-matrix, the right-hand side of Eq. (2.37) can be simplified as T (H−1 )T H−1 nH (H−1 nH )T (H−1 nH )  nH 1 T  nH In nH . n

(2.38)

Substituting Eq. (2.38) in Eq. (2.37), we obtain the SNR of distance in HTS using H-matrix coding, stated as ⎛  ⎞  f T f 1/2  ⎠ ⎝ SNRDE  H  10 × log10  1 T  n nH In nH      f T f 1/2   10 × log10  T (2.39) + 5 × log10 n. nH nH  Furthermore, Lucke et al. (2011) showed that we obtain the SNR in HTS with cyclic-S matrix coding as ⎛  ⎞  f T f 1/2  ⎠ ⎝ SNRDE  S  10 × log10  4  n nST In nS     n   f T f 1/2   10 × log10  T . (2.40) + 5 × log 10 4 nS nS 


If the system is dominated by detector noise, then whatever the coding matrix, the noise will have the same level of intensity, with n_H(i) and n_S(i) following the same distribution. However, differing from the cyclic-S matrix, the H-matrix is divided into two separate matrices, and the noise Δn_H in Eq. (2.39) should be written as

Δn_H = Δn_H+ + Δn_H−,    (2.41)

where n_H+(i), n_H−(i) and n_S(i) follow the same distribution. Therefore, we make a reasonable assumption: Δn_H+ᵀΔn_H+, Δn_H−ᵀΔn_H− and Δn_SᵀΔn_S tend to have the same level of intensity. Based on basic algebraic properties, we reach the following conclusion about Δn_HᵀΔn_H:

Δn_HᵀΔn_H = (Δn_H+ + Δn_H−)ᵀ(Δn_H+ + Δn_H−)
          = Δn_H+ᵀΔn_H+ + Δn_H−ᵀΔn_H− + 2Δn_H+ᵀΔn_H−.    (2.42)

Moreover, the noises in different measurements are independent, which can be stated as

Σ_{i=1}^{n} Δn_H−(i)Δn_H+(i) ≈ 0.    (2.43)

Per Eq. (2.43), Eq. (2.42) can be rewritten as

Δn_HᵀΔn_H = (Δn_H+ + Δn_H−)ᵀ(Δn_H+ + Δn_H−)
          ≈ Δn_H+ᵀΔn_H+ + Δn_H−ᵀΔn_H−
          ≈ 2Δn_SᵀΔn_S.    (2.44)

Substituting Eq. (2.44) into Eq. (2.39) and subtracting Eq. (2.40), we obtain the relationship between SNR^DE_H and SNR^DE_S:

SNR^DE_H ≈ SNR^DE_S + 5 × log₁₀ 2.    (2.45)

28

2 High-SNR Hyperspectral Night-Vision Image Acquisition with …

2.3.1 Typical HTS System The first true hyperspectral imaging system was born in the 1980s. It was an undeniable leap in the field of remote-sensing technology. However, before the advent of hyperspectral imaging systems, spectrometers had been developed for technical reserves. The traditional spectral detection system focused on capturing fingerprint spectrum data of matter. For example, in the field of chemistry, spectroscopy was used to define various chemicals at the molecular level. In the field of biology, material components were analysed at the cellular level. For geological exploration, mineral bands were delineated by spectral features. Hyperspectral technology is the combination of spectral and imaging technology obtaining two-dimensional spatial information of non-targets. It also contains spectral information of each target in a scene, forming a ‘data cube’, as shown in Fig. 2.1. Imaging spectrometers can be divided into three types per their number of spectra: multispectral imaging, hyperspectral imaging and superspectral imaging (see Mrozack et al. 2012; Cull et al. 2007). The first type is multispectral imaging, which consists of 10–20 spectral bands with spectral resolution, λ/λ, of 0.1 mm. Second, hyperspectral imaging contains 100–400 spectral bands with spectral resolution, λ/λ, of 0.01 mm. Third, superspectral imaging contains about 1000 spectral segments with spectral resolution, λ/λ, of 0.001 mm. Among three types, the most widely used is the slit-based imaging spectrometre, which uses a slit to achieve spectral images, as shown in Fig. 2.2. The slit-based imaging spectrometre has a critical drawback: the SNR of data is bad when the source is too low. The HTS is a well-studied method for overcoming the shortages of slit-based imaging spectrometers. After decades of development, the most useful HTS equipment encodes the dispersed spectrum. Figure 2.3 shows the schematic diagram of traditional HTS (see Wang et al. 2017; Sun et al. 2012) (Fig. 2.3). The spectral imaging data comprise a three-dimensional data cube. Traditional HTS is too slow to capture hyperspectral images. For example, when the size of data cube is X × Y × λn , the slit-based spectrometre will require X measurements to

Fig. 2.1 Data cube diagram of hyperspectral imaging

2.3 Spatial Pixel-Multiplexing Coded Spectrometre

29

Fig. 2.2 Slit-based imaging spectrometre development: the most used HTS equipment

Fig. 2.3 Schematic diagram of traditional HTS. Reprinted from Xu et al. (2012), with permission of Elsevier

recover the data cube. Unfortunately, traditional HTS only takes n×Y measurements. n is the scale of the Hadamard matrix.

2.3.2 Spatial Pixel-Multiplexing Coded Spectrometre Apart from the traditional HTS in Xiang and Arnold (2011), high system throughput is realised by multiplexing spectra of different pixels, making the measurement faster and more efficient. From Fig. 2.5, we see that overlapping occurs between pixels in the same row at the entrance; the spectrum of pixels from different rows are irrelevant. For spectral imaging, the system is comparable to slit-based pushing-broom scanning, in that it is time consuming. A system schematic diagram is shown in Fig. 2.4. A digital micromirror device (DMD) is utilised to encode the pixels of images at the entrance. Each DMD mirror pixel can be electrically driven. When turned on, mirrors are tilted about their diagonal hinge at an angle of +12°. When turned off, they are tilted at −12°. Per the shift invariant introduced by Yue et al. (2014), the distance of pixels at the entrance is linear with the distance of the related spectrum. Thus, the system realises high throughput

30


Fig. 2.4 Schematic of spatial pixel-coded spectrometre. Reprinted from Yue et al. (2018), with permission of Elsevier

of overlapping spectra from different pixels of the entrance image, rather than different channels of the same spectrum. It is much more efficient than traditional HTS. Because the H-matrix is divided into two matrices consisting only of '1' and '0', H-matrix coding can be realised by the DMD in the same way as cyclic-S matrix coding. Taking the 8 × 8 H+ matrix as an example, the procedure of coding the spatial pixels is described in Fig. 2.5. Beyond its high throughput and efficiency, the H-matrix gains a significant SNR improvement, even though it is realised via the doubled measurements of the H− and H+ matrices. To demonstrate the denoising capability of the H-matrix, we benchmarked the two coding matrices in simulation and experiment. Xiang and Arnold (2011) and Mrozack et al. (2012) demonstrated the advantage of HTS using a cyclic-S matrix, compared to traditional slit-based spectrometers; therefore, we focus on comparing the H-matrix against the cyclic-S matrix. Specifically, the 127 × 127 cyclic-S matrix and the 128 × 128 H-matrix are employed. In the simulation, a normalised solar spectrum is employed as the original reflectance data. Detector noise is simulated using white Gaussian noise, and Poisson noise is used for photon noise. We analysed the denoising results of both coding matrices under photon-noise-dominated and detector-noise-dominated situations. For robustness, all instances were performed 100 times and the data presented are the average values of the results.
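A compact version of this simulation can be sketched as follows (our own illustration, not the authors' code): a 64 × 64 H-matrix and the corresponding 63 × 63 S-matrix are used instead of 128/127 to keep the example small, a random test spectrum stands in for the normalised solar spectrum, and only the detector-noise-dominated case with white Gaussian noise is simulated; the reconstruction SNR is averaged over repeated trials.

```python
import numpy as np

def sylvester_hadamard(order):
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def snr_db(truth, estimate):
    """Reconstruction SNR in decibels."""
    return 10 * np.log10(np.sum(truth**2) / np.sum((estimate - truth)**2))

rng = np.random.default_rng(3)
N = 64                                       # assumed size (the text uses 128/127)
H = sylvester_hadamard(N)
S = ((1 - H[1:, 1:]) // 2).astype(float)     # (N-1) x (N-1) S-matrix from the Hadamard core
Hp, Hm = (H + 1) // 2, (1 - H) // 2          # H-matrix split into 0/1 masks

f_s = rng.uniform(0.5, 2.0, size=N - 1)      # test spectra (stand-ins for solar data)
f_h = rng.uniform(0.5, 2.0, size=N)
sigma, trials = 0.1, 100

snr_s, snr_h = [], []
for _ in range(trials):
    g_s = S @ f_s + rng.normal(0, sigma, N - 1)
    snr_s.append(snr_db(f_s, np.linalg.solve(S, g_s)))

    g_p = Hp @ f_h + rng.normal(0, sigma, N)          # doubled measurements
    g_m = Hm @ f_h + rng.normal(0, sigma, N)
    snr_h.append(snr_db(f_h, H.T @ (g_p - g_m) / N))

print(f"cyclic-S coding : {np.mean(snr_s):.1f} dB")
print(f"H-matrix coding : {np.mean(snr_h):.1f} dB")   # a few dB higher in this setting
```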


Fig. 2.5 Encoding the pixels at the entrance using the H+ matrix. Reprinted from Yue et al. (2018), with permission of Elsevier

Figure 2.6 shows that the H-matrix gains about a 3-dB improvement over cyclic-S matrix coding. Furthermore, the cyclic-S matrix causes degradation as photon noise dominates; fortunately, H-matrix coding shows SNR performance comparable to traditional slit-based spectrometry. Photon noise notwithstanding, we also simulated the denoising capability of the two coding matrices, comparing them to slit-based spectrometry with dominating detector noise. According to the results shown in Fig. 2.7, both H-matrix and cyclic-S matrix coding significantly outperformed the traditional spectrometre, and the H-matrix gained about a 3-dB improvement over the cyclic-S matrix. Furthermore, we implemented the two matrices in an experiment, set up according to the optical structure of the traditional HTS, as shown in Fig. 2.8 (see Yue et al. 2018). However, there was a slight difference between the experiment and the schematic: we added a beam splitter after the telephoto lens to more easily focus the image on the DMD. The slit-based spectrometre, cyclic S-matrix coding and H-matrix coding can all be realised with this system. The focal length of the telephoto lens is 100 mm and the f-number is 1.4, the same as the collimating lens. The grating used here has 300 lines/mm. An Hg lamp is employed as the light source, attenuated to simulate a weak source. Considering that the width of a DMD pixel is about 13 µm, we select one column of pixels to simulate the slit.


Fig. 2.6 Simulation results of traditional spectrometre and coding measurements with dominating photon noise: a result of measured spectra in a slit-based spectrometre; b measured results of HTS with H-matrix- and cyclic S-matrix-coding and c SNRs of measured spectra of all 127 pixels. Reprinted from Yue et al. (2018), with permission of Elsevier

According to Fig. 2.9, the multiplexed coding measurement significantly outperforms the traditional slit-based spectrometre when the source is weak. Furthermore, H-matrix coding gains about a 2.5-dB SNR improvement over the cyclic S-matrix at both high and low light intensities. Thus, whatever the noise type, H-matrix coding achieves better results. In the previous theoretical analysis, the H-matrix advantage depended only on the noises from different measurements being independent. From the experiments shown in Fig. 2.9, we see that different source intensities do not affect the advantage of H-matrix coding, demonstrating excellent agreement with the simulated results.


Fig. 2.7 Results of simulation with dominating detector noise: a result of measured spectra in a slit-based spectrometre; b measured results of HTS with H-matrix- and cyclic S-matrix-coding and c SNRs of measured spectra of all 127 pixels. Reprinted from Yue et al. (2018), with permission of Elsevier

Beyond the spectrum measurements, we also compared the reconstructed spectral image cube data of the two coding matrices. In Fig. 2.10, two channels of the spectral images are shown; the advantage of H-matrix coding is significant. In conclusion, H-matrix coding was implemented in a multiplexed environment. Both theoretical analysis and experiment show that the H-matrix exhibits a significant advantage over the cyclic S-matrix. More importantly, the improvement of the H-matrix is not affected by source intensity. Thus, the H-matrix is a good alternative for spectral measurement when the source is weak.


Fig. 2.8 Implementation of spatial pixel-coding Hadamard spectrometre. Reprinted from Yue et al. (2018), with permission of Elsevier

Fig. 2.9 Performance comparison of the two-matrix coding and slit spectrometre: a signals reconstructed from the two-matrix codings and b SNR of spectra of all pixels involved in the coding measurements. Comparing (a) and (b), the source intensity is improved in (c) and (d). Reprinted from Yue et al. (2018), with permission of Elsevier


Fig. 2.10 SNR of spectral images acquired by cyclic S- and H-matrices. Reprinted from Yue et al. (2018), with permission of Elsevier

2.4 Deconvolution-Resolved Computational Spectrometre

For several decades, only two important approaches have been proposed to improve spectrometre performance: one is CAS and the other is FTS. CAS replaces the slit with a two-dimensional coded matrix aperture, called a mask; it was introduced as a multi-slit spectrometre to increase light throughput without loss of spectral resolution by Cull et al. (2007). Currently, the major CAS is the Hadamard-transform spectrometre (HTS); thus, most encoded-aperture theories are based on Hadamard matrices. However, new static multiplex CASs have recently been proposed based on new mathematical models. For CAS, better spectral resolution requires smaller mask features; unfortunately, diffraction and optical blur negate its advantages. Apart from CAS, FTS is another method exhibiting the Jacquinot advantage. It records interference patterns to estimate the spectrum of the incident light. However, a classic FTS usually contains mechanical scanning elements, preventing easy assembly; hence, attempts have been made to eliminate moving parts to create a smaller, more reliable and inexpensive system. The key feature of CAS is the measurement of weighted multiple spectral channels instead of just one, improving throughput and SNR. However, there are several critical drawbacks caused by the coded aperture. First, high-precision and high-resolution (HR) coded apertures are scarce. Additionally, the system requires a small mask feature to achieve high spectral resolution, which brings unexpected diffraction and optical blur. A novel high-throughput, spectral-channel-multiplexed spectrometre without a spatial filter is thus presented, wherein the spectrum is reconstructed by solving an inverse deconvolution problem. The new design achieves higher resolution and a longer spectral range without aberrations by eliminating the physical spatial modulator and coded aperture. Furthermore, the proposed


Fig. 2.11 Schematic of the DSCS system. Reprinted from Yue et al. (2014)

system is stationary, without moving parts; thus, it is insensitive to shaking. Figure 2.11 shows a schematic of the DSCS. In Fig. 2.11, there are two optical paths, in which both non-dispersed and dispersed light intensities can be measured. One path is based on a typical slit-based spectrometre (the dispersed branch); the other path is the non-dispersed branch. Based on this structure, the light intensity measured by the second charge-coupled device (CCD) is determined by the light intensity measured by the first CCD; this relationship is described in the following paragraphs. The details of the system are as follows. Source light is focused on a wide slit (200 µm) by an objective lens (FL 50 mm, f-number 1.4). The image plane of the collimating lens is also fixed on the slit to ensure the light passing through is collimated. Then, the light beam is split by the beam splitter. Considering that the light is projected onto many more pixels of the CCD in the second, dispersed branch, unequal splitting (30% reflection, 70% transmission) is used to approximately equalise the SNR of the two branches. Next, the reflected light is refocused onto the first CCD by the first reimaging lens. According to the reversibility principle of the optical path, the image on the slit is the same as the image on the first CCD. In the dispersed branch, the transmitted light irradiates vertically onto the grating and is dispersed following the grating equation:

d(sin θi ± sin θ) = jλ,    (2.46)

where θi is the incident angle and θ is the diffraction angle. If the slit is wider than one pixel along the x-axis, the direction along which the spectrum is dispersed, it causes overlapping of different wavelengths from adjacent pixels. This phenomenon is briefly described in Fig. 2.12, showing the ideal physical mechanism. The images on the two CCDs obey one rule: if a pixel on the slit shifts L pixels along the x-axis, the related non-dispersed pixel on the first CCD and the dispersed pixels on the second CCD shift the same distance in the same direction.
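As a small worked example of Eq. (2.46) (our own illustration), the first-order diffraction angle for the 300 line/mm grating used later in this chapter can be computed at normal incidence for the 546 nm green mercury line; the choice of line and of normal incidence are assumptions made only for the example.

```python
import math

d = 1e-3 / 300            # grating period in metres (300 lines/mm)
wavelength = 546.1e-9     # green Hg emission line, metres (example value)
j = 1                     # diffraction order
theta_i = 0.0             # normal incidence (assumed), radians

# Grating equation (2.46): d * (sin(theta_i) + sin(theta_d)) = j * lambda
sin_theta_d = j * wavelength / d - math.sin(theta_i)
theta_d = math.degrees(math.asin(sin_theta_d))
print(f"first-order diffraction angle: {theta_d:.2f} degrees")   # about 9.4 degrees
```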


Fig. 2.12 Demonstration of double shifting caused by different pixels and wavelengths. Reprinted from Yue et al. (2014)

This is called 'shift invariability'. A brief derivation is presented as follows. Two pixels on the slit along the x-axis are taken as an example. Light from the two pixels is split into two collimated light beams. The beams have different incident angles, θi1 and θi2, before travelling into the grating. According to Eq. (2.46), the diffraction angles related to the incident angles θi1 and θi2 at wavelength λ can be expressed as d(sin θi1 ± sin θd1) = jλ = d(sin θi2 ± sin θd2). Then, a difference equation relating the diffraction and incident angles is obtained: sin θi1 − sin θi2 = ±(sin θd2 − sin θd1). Based on the precondition that the first and second re-imaging lenses have the same focal length f, both sides of this difference equation are multiplied by f, converting it to f(sin θi1 − sin θi2) = ±f(sin θd2 − sin θd1). Here, f sin θd1 is the dispersing distance of wavelength λ on the image sensor; the right-hand side of the equation is the distance between the two pixels at wavelength λ on the dispersed image, and the left-hand side is the distance between the two pixels on the non-dispersed image. Thus, the shift invariability shown in Fig. 2.12 is demonstrated. Based on this shift invariability, Eq. (2.47) describes the relationship between the light intensities on the first and second CCDs. Specifically, the dispersed image on the second CCD, f(x), equals the convolution between the non-dispersed image observed on the first CCD, φ(x), and the source spectrum:

f(x) = φ(x) ∗ S(x; λ),    (2.47)

where f(x) is the dispersed image observed on the second CCD, φ(x) is the non-dispersed image observed on the first CCD, ∗ denotes convolution and S(x; λ) is the spectral function, which can be acquired by solving the inverse deconvolution problem for f(x) and φ(x). In this experiment, a 200-µm-wide slit is used to acquire the overlapped data and a 5-µm-wide slit is used to obtain a standard spectrum for evaluating the accuracy of the reconstructed spectrum. The 5-µm-wide slit is chosen because the pixel of


Fig. 2.13 Experiment prototype
Fig. 2.14 Flow of data preprocessing. Reprinted from Yue et al. (2014)

the CCD is also ~5 µm; thus, smaller slits would not improve the spectral resolution. All lenses except the objective lens are the same CCTV lens (FL 25 mm, f-number 1.4) with better aberration correction, making the system easy to assemble. A plane beam splitter (30% reflection, 70% transmission, ThorLab BSS16) is employed to construct the two functionally different optical paths. A blazed grating with 300 grooves/mm (blaze angle 4°18′, ThorLab GR50-0305) is used as the dispersing component. The two image sensors are the same industrial camera with 1280 × 1024 resolution (pixel width 5.2 µm, Daheng DH-HV1351UM). The experiment is set up as shown in Fig. 2.13. The first light source used is a deuterium lamp placed 50 cm in front of the objective lens. To quantify the proposed method's high-throughput advantage, a filter is used to weaken the deuterium lamp. The direction of dispersion is only along the x-axis; pixels along the y-axis have the same spectral distribution and can be treated as one pixel. Thus, pixels in the same column are summed into one to turn an image into a curve. Data preprocessing is described in Fig. 2.14.
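The forward model of Eq. (2.47) and its inversion can be sketched as follows (our own illustration with synthetic data: two Gaussian lines stand in for the source spectrum, a 40-pixel rectangle stands in for the non-dispersed 200-µm slit image, and a simple regularised Fourier-domain, Wiener-style deconvolution is used instead of the Richardson–Lucy method discussed below).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 256
x = np.arange(n)

# Synthetic source spectrum S(x): two emission lines (assumed stand-in data)
spectrum = np.exp(-0.5 * ((x - 90) / 3.0) ** 2) + 0.6 * np.exp(-0.5 * ((x - 160) / 4.0) ** 2)

# Non-dispersed slit image phi(x): a wide slit covering ~40 pixels (assumed)
phi = np.zeros(n)
phi[40:80] = 1.0

# Forward model, Eq. (2.47): dispersed image f(x) = phi(x) * S(x), plus noise
f = np.convolve(phi, spectrum, mode="full")[:n] + rng.normal(0, 0.01, n)

# Regularised Fourier-domain deconvolution to recover the spectrum
PHI = np.fft.rfft(phi, n)
F = np.fft.rfft(f, n)
eps = 1e-2                                        # damps noise amplification
S_hat = np.fft.irfft(F * np.conj(PHI) / (np.abs(PHI) ** 2 + eps), n)

corr = np.corrcoef(spectrum, S_hat)[0, 1]
print(f"correlation with ground truth: {corr:.3f}")   # expected to be close to 1
```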


Fig. 2.15 Measured light intensity of non-dispersed and dispersed lights of the deuterium lamp. Reprinted from Yue et al. (2014)

Fig. 2.16 Dispersed light intensity of 200-µm slit measured and fit for the deuterium lamp. Reprinted from Yue et al. (2014)

The measured non-dispersed and dispersed light intensities are shown in Fig. 2.15 with the background noise removed. The blue curve is the non-dispersed light intensity, and the red curve is the spectral overlap from different pixels along the x-axis. Deconvolution belongs to the class of so-called 'ill-posed' problems: neither the observation nor the convolution kernel is precisely acquired at every measurement, so noise is inevitable, and even a small amount of noise may cause a large perturbation in the deconvolution. Moreover, the CCD response is not quite linear, causing distinct differences between the measured and fitted light intensities from the 200-µm slit. This unexpected CCD error is shown in Fig. 2.18: pixels are slightly oversaturated when measuring the peak of the red curve in Fig. 2.16, so the measured value is smaller than the real light intensity. Mathematically, deconvolution is well developed, and several effective solutions have been proposed to deal with this problem, such as the Richardson–Lucy deconvolution (see Moses et al. 2012); this was employed to reconstruct a spectrum in Bai et al. (2013). In Fig. 2.16, two similar curves are visible. The blue curve is the convolution result of the measured spectrum of a 5-µm slit and the non-dispersed light intensity of a


Fig. 2.17 Results of reconstructed spectrum and measured spectrum with 5-µm slit for the deuterium lamp. Reprinted from Yue et al. (2014)

Fig. 2.18 Demonstration of the CCDs nonlinear response. Reprinted from Yue et al. (2014)

200-µm slit. The red curve is the measured dispersed data with a 200-µm slit. Moreover, the red curve does not perfectly fit the ideal data, owing to the CCD's nonlinear response. Figure 2.17 shows both the reconstructed spectrum data (blue) and the measured spectrum with a 5-µm slit (red). The two characteristic spectral lines of the deuterium lamp are recovered, and the reconstructed spectrum fits the standard spectrum well. However, there are a few periodic perturbations in the restored data, caused by image aberrations, nonlinear CCD responses and noise introduced by the ill-posed deconvolution. Owing to the high throughput, the proposed method's peak SNR (PSNR) of 15.2 dB is much higher than that of the traditional spectrometre (PSNR = 9.8 dB), and the measured spectrum with the 200-µm slit exhibits about 17.2 dB when the blue fitted curve is taken as the standard spectrum of the 200-µm slit (Fig. 2.16). Another light source is a combined light-emitting diode (LED). The measured and fitted light intensities of the 200-µm slit are shown in Fig. 2.19, where the PSNR of the measured light intensity is 24.1 dB; this spectrum is much smoother and lacks the many weak peaks of the first light source. Owing to the smooth LED spectrum, the reconstructed data matches the standard data very well. Moreover, there is a 6-dB improvement contributed by the proposed method (PSNR = 24.1 dB), compared to the traditional spectrometre (PSNR = 18.4 dB).


Fig. 2.19 Dispersed light intensity of 200-µm slit both measured and fit for two combined LEDs. Reprinted from Yue et al. (2014)

Fig. 2.20 Results of reconstructed spectrum and measured spectrum with 5-µm slit for two combined LEDs. Reprinted from Yue et al. (2014)

In conclusion, by using a beam splitter to construct two functionally different light paths, both the non-dispersed and the dispersed intensities can be measured. Based on shift invariability, the spectrum of a light source is reconstructed by solving the deconvolution problem. This spectrometre design achieves high resolution with high throughput and without spatial modulation. The method is an effective improvement over traditional spectrometre performance, especially when the light source is weak. We are currently seeking more reliable and accurate spectra by decreasing system errors and improving the reconstruction algorithm, i.e. by better linearising the CCD response and analysing the errors associated with the deconvolution procedure (Fig. 2.20).

2.5 Summary

In this chapter, we introduced high-SNR hyperspectral image acquisition with multiplexing. Specifically, we introduced HTS multiplexing, spatial-pixel multiplexing and deconvolution-resolved computational spectrometry. High SNR and high spectral


resolutions are always required, but the light source and detector are sometimes limited; in this case, HTS is a good choice. Multiplexing has long been studied, and repeated, averaged measurements are a good starting point. To evaluate our multiplexing improvements, we introduced a method to evaluate performance and then discussed the mathematical model of multiplexing. HTS is the most-studied method of spectrum measurement that uses multiplexing as a tool. However, the previous theory of HTS is based on several assumptions, making it hard to understand; we therefore introduced another theory to analyse the denoising performance of HTS and discussed the upper and lower bounds of its denoising capability. Based on a spatial modulation device (i.e. a DMD), a new spatial-pixel multiplexing coded spectrometre was proposed. It is more efficient than traditional HTS; specifically, it requires only as many measurements as the scale of the coding matrix, so the implementation is easy to realise. HTS is a remarkably smart method that improves SNR. However, because it takes multiple measurements, it cannot measure a transient spectrum. Thus, we introduced another, snapshot measurement based on deconvolution, which leverages multiplexing to improve throughput. Both experiments and simulations show that it significantly improves SNR and throughput. These tools and techniques will be useful in chemistry, biology and other fields.

References Bian, X., Zhang, T., Yan, L., Zhang, X., Fang, H., & Liu, H. (2013). Spatial–spectral method for classification of hyperspectral images. Optics Letters, 38, 815–817. Cull, E. C., Gehm, M. E., Brady, D. J., Hsieh, C. R., Momtahan, O., & Adibi, A. (2007). Dispersion multiplexing with broadband filtering for miniature spectrometers. Applied Optics, 46, 365. Damaschini, R. (1993). Limitation of the multiplex gain in Hadamard transform spectroscopy. Pure and Applied Optics, 2, 173–178. Girard, A. (1963). Spectrometre a grilles. Applied optics, 2(1), 79–87. Golay, M. J. E. (1949). Multi-slit spectrometry. Journal of the Optical Society of America, 39(6), 437–444. Golay, M. J. E. (1951). Static multislit spectrometry and its application to the panoramic display of infrared spectra. Journal of the Optical Society of America, 41, 468–472. Harwit, M., & Sloane, N. J. A. (1979). Hadamard transform optics (p. 1444). New York: Academic Press. Ibbett, R. N., Aspinall, D., & Grainger, J. F. (1968). Real-time multiplexing of dispersed spectra in any wavelength region. Applied Optics, 7(6), 1089–1094. Lucke, R. L., Corson, M., McGlothlin, N. R., Butcher, S. D., Wood, D. L., Korwan, D. R., et al. (2011). Hyperspectral Imager for the Coastal Ocean: Instrument description and first images. Applied Optics, 50, 1501–1516. Moses, W. J., Bowles, J. H., Lucke, R. L., & Corson, M. R. (2012). Impact of signal-to-noise ratio in a hyperspectral sensor on the accuracy of biophysical parameter estimation in case II waters. Optics Express, 20(4), 4309. Mrozack, A., Marks, D. L., & Brady, D. J. (2012). Coded aperture spectroscopy with denoising through sparsity. Optics Express, 20, 2297–2309. Shimano, N. (2006). Recovery of spectral reflectances of objects being imaged without prior knowledge. IEEE Transactions on Image Processing, 15, 1848–1856.


Snyder, D. L., Helstrom, C. W., Lanterman, A. D., Faisal, M., & White, R. L. (1995). Compensation for readout noise in CCD images. Journal of the Optical Society of America A. Optics and Image Science, 12, 272–283. Streeter, L., Burling-Claridge, G. R., Cree, M. J., & Künnemeyer, R. (2009). Optical full Hadamard matrix multiplexing and noise effects. Applied Optics, 48, 2078–2085. Sun, X., Hu, B., Li, L., & Wang, Z. (2012). An engineering prototype of Hadamard transform spectral imager based on digital micro-mirror Device. Optics & Laser Technology, 44, 210–217. Tilotta, D. C., Hammaker, R. M., & Fateley, W. G. (1987). Multiplex advantage in Hadamard transform spectrometry utilizing solid-state encoding masks with uniform, bistable optical transmission defects. Applied Optics, 26, 4285–4292. Wang, Z., Yue, J., & Han, J. (2017). High-SNR spectrum measurement based on Hadamard encoding and sparse reconstruction. Applied Physics, 123, 277. Wuttig, A. (2005). Optimal transformations for optical multiplex measurements in the presence of photon noise. Applied Optics, 44, 2710–2719. Xiang, D., & Arnold, M. A. (2011). Solid-state digital micro-mirror array spectrometre for Hadamard transform measurements of glucose and lactate in aqueous solutions. Applied Spectroscopy, 65, 1170–1180. Xu, J., Hu, B., Feng, D., Fan, X., & Qian, Q. (2012). Analysis and study of the interlaced encoding pixels in Hadamard transform spectral imager based on DMD. Optics and Lasers in Engineering, 50, 458–464. Yue, J., Han, J., & Li, L. (2018). Denoising analysis of spatial pixel multiplex coded spectrometre with Hadamard H-matrix. Optics Communications, 407, 355–360. Yue, J., Han, J., Zhang, Y., & Bai, L. (2014). High-throughput deconvolution-resolved computational spectrometre. Chinese Optics Letters, 12, 043001.

Chapter 3

Multi-visual Tasks Based on Night-Vision Data Structure and Feature Analysis

Data structure and feature analysis is the first step in many computer vision algorithms, providing their inputs. Thus, it is vitally important to ensure its speed, accuracy and robustness. Currently, there is a wide gap between the performance of slow feature analysis and the demand for fast, real-time solutions. This chapter introduces the active research field of visual feature analysis, reviewing the literature, detailing the techniques and demonstrating how to apply these features in several vision tasks. Such methods use low-level processing to support valuable vision tasks based on feature attributes, such as super-resolution (SR), superpixel segmentation and saliency.

3.1 Infrared Image Super-Resolution via Transformed Self-similarity

Image SR reconstruction is an active topic in computer vision, because it offers the promise of overcoming the limitations of low-cost imaging sensors. Infrared image super-resolution technology plays an important role in military, medical and other vision areas, aiming to upscale LR images into the HR category. Moreover, higher resolution infrared imaging provides more detailed identification, helping to locate military targets. Accurate and reliable SR methods also benefit numerous other tasks, such as object recognition (see Chuang et al. 2014). With the prevalence of imaging sensors, high-resolution imaging is more achievable than ever. Owing to emerging applications, an attractive SR method should output a high-quality map while being highly computationally efficient. Ur and Gross (1992) introduced a non-uniform interpolation method by applying a generalised multichannel sampling theorem from Zhu (1992). Maximum likelihood and maximum a posteriori estimation were successfully applied to SR by Tom and Katsaggelos (1995) and Hardie et al. (1997). Bahy et al. (2014) introduced an adaptive regularisation method for generating high-SR images. Whereas these methods generate


high-resolution images, they have drawbacks when handling complex low-resolution infrared images, because they cannot capture rich infrared details; consequently, they do not perform well on infrared images. Recently, more sophisticated methods have used a variety of learning approaches to learn the LR–HR mapping and generate more accurate results, achieving good performance. Freeman et al. (2000) modelled the LR–HR relationship with a Markov network to restore the HR image. Chang et al. (2004) applied locally linear embedding (LLE), based on manifold learning, wherein similar image patches form manifolds of similar local geometries in two distinct feature spaces. Yang et al. (2008, 2010) introduced a super-resolution method based on sparse representation. Zeyde et al. (2012) introduced kSVD and orthogonal matching pursuit (OMP), which use a stronger image representation. Chakrabarti et al. (2007) introduced a kernel PCA for SR face images. Wang et al. (2012a) introduced a semi-coupled dictionary-learning model for SR imaging. Whereas learning-based methods showed good performance compared to earlier methods, learning the LR–HR mapping from external databases has drawbacks: the number and type of training examples are not clear, and when these methods adopt sophisticated strategies and train models on external datasets, the computational load increases.
Many SR methods have been introduced over the past decades, and most are designed for natural images rather than infrared images. These methods suffer from limitations in SR accuracy and robustness when applied to infrared images. The underlying reasons are that infrared images are much more challenging, the features extracted from infrared images are not well used for super-resolution and various image properties become less effective when backgrounds are cluttered and complex. Recent papers by Zhao et al. (2015) and Sui et al. (2014) studied infrared image super-resolution and perform well on some simple infrared images: Sui et al. exploit compressed sensing (CS) for SR in infrared images, and Zhao et al. introduced a novel infrared image SR method based on sparse representation. Recent works by Chen et al. (2016) and Liu et al. (2016) also develop new methods for infrared image super-resolution. Whereas these methods were designed for infrared SR imaging, they have drawbacks when the infrared images are complex and of low contrast.
In this section, a structured self-similarity-based SR method for infrared images is introduced. We carefully exploit features related to the infrared images and use them without learned priors. First, we exploit internal structure information, such as geometric appearance, dense error and region covariance, and we use these cues to estimate the perspective patches. Then, we add scale information to incorporate potential patches caused by local scale variations, and we introduce a compositional framework to simultaneously integrate the cues for infrared SR imaging. We validate the method via qualitative and quantitative comparisons on five infrared scenes; it significantly outperforms other methods, producing more accurate results with less noise. Figure 3.1 shows the method pipeline.


Fig. 3.1 Pipeline of the SR method. Given an input image, LR, we first blur and subsample the image to obtain its downsampled image. Then, we learn the transformation matrix when searching for its nearest neighbour. Finally, the HR image is reconstructed by the inverse of the transformation matrix. Reprinted from Qi et al. (2017), with permission from Elsevier

3.1.1 The Introduced Framework of Super-Resolution

Super-resolution framework
Given an LR infrared image I, its downsampled image I_D is obtained using blurring and subsampling. Based on I and I_D, the HR infrared image of this method is estimated (see Fig. 3.1). For a target patch T in the LR infrared image I, we compute a transformation matrix M that warps T to its best matching patch S (source patch) in the downsampled image I_D. The method:
• computes a nearest-neighbour field for the transformation matrix between I and I_D, using a regulated PatchMatch method described by Barnes et al. (2009);
• extracts S_H from image I, the HR patch of S;
• obtains T_H by applying the inverse of M to warp S_H, and pastes T_H into the HR image I_H at the location corresponding to T.
The transformation matrix M for infrared image SR is thus learnt based on four cues related to the infrared images. For infrared images, we exploit region covariance and dense error to better capture structural information. Furthermore, geometrical appearance is used to help estimate transformed patches, and the scale cue is applied to HR infrared images at different scales. This section carefully designs a method for generating more accurate predictions of HR images from LR images.

HR infrared image estimation
For each target patch T(p_i), centred at position p_i = (p_i^x, p_i^y) in the LR image I, our aim is to learn the transformation matrix M_i that maps the target patch T_i to its nearest neighbour in the downsampled image I_D. We utilise the nearest-neighbour field to solve this problem. The objective function is

$$\min \sum_{i \in O} E_a(p_i) + E_d(p_i) + E_{st}(p_i) + E_{sc}(p_i), \tag{3.1}$$


where O denotes the set of pixel indices of I. The objective function contains four costs: the appearance cost E_a(p_i), the dense error cost E_d(p_i), the region covariance cost E_st(p_i) and the scale cost E_sc(p_i).

Appearance cost
This cost measures the similarity between sampled target and source patches of the infrared image. We use the Gaussian-weighted distance in the grey space:

$$E_a = \left\| W\big(T(p_i) - S(p_i)\big) \right\|_2^2, \tag{3.2}$$

where S(p_i) denotes the sampled patch in I_D under the transformation M_i, and W(·) is the Gaussian weight with σ² = 6. This cue effectively encodes the similarity of these patches. For the transformation M_i, a search method is exploited to find similar patches, as in Barnes et al. (2010), and a regulated method is applied to compute the transformation parameters (s_i^x, s_i^y, s_i^u, s_i^v), the affine parameters of the source patch, defined as

$$M_i = U\!\left(s_i^x, s_i^y\right) K\!\left(s_i^u, s_i^v\right), \tag{3.3}$$

where U captures the similarity transformation via the scaling parameters s_i^x and s_i^y, defined as

$$U\!\left(s_i^x, s_i^y\right) = \begin{bmatrix} s_i^x R\!\left(s_i^y\right) & 0 \\ 0 & 1 \end{bmatrix}, \tag{3.4}$$

where R(s_i^y) is a 2 × 2 rotation matrix and s_i^x is a scaling factor, as in Huang et al. (2014). The matrix K(s_i^u, s_i^v) captures the shearing mapping in the affine transformation,

$$K\!\left(s_i^u, s_i^v\right) = \begin{bmatrix} 1 & s_i^u & 0 \\ s_i^v & 1 & s_i^v \\ 0 & s_i^u & 1 \end{bmatrix}. \tag{3.5}$$

This section regulates the similarity transformation using the matrix K in source patches, further measuring the similarity between patches.

Region covariance cost
There exists a structural similarity between LR and HR images, which can be exploited to effectively localise the reconstructed patches in the HR image via region covariance (see Erdem and Erdem 2013). We define the structure cost as follows:

$$E_{st} = \left\| \sin\!\big(R_c(T(p_i)),\, R_c(S(p_i))\big) \right\|_2^2, \tag{3.6}$$

where Rc (·) denotes the region covariance function from the source and target patches.


A region R inside F can be described by a d × d covariance matrix R_c, defined as

$$R_c = \frac{1}{n-1} \sum_{i=1}^{n} (f_i - \beta)(f_i - \beta)^T, \tag{3.7}$$

where f_{i=1,...,n} ⊂ F denotes the d-dimensional feature vectors inside patch R and β denotes the mean of these feature vectors. In this section, a seven-dimensional feature represents each image pixel:

$$F = \left[\, x,\; y,\; I(x, y),\; \left|\frac{\partial I}{\partial x}\right|,\; \left|\frac{\partial I}{\partial y}\right|,\; \left|\frac{\partial^2 I}{\partial x^2}\right|,\; \left|\frac{\partial^2 I}{\partial y^2}\right| \,\right]^T, \tag{3.8}$$

where I denotes the intensity of a pixel, |∂I/∂x|, |∂I/∂y|, |∂²I/∂x²| and |∂²I/∂y²| are the first and second derivatives of I and (x, y) denotes the pixel location. Hence, the covariance descriptor of a region is computed as a 7 × 7 matrix. From experiments, the covariance matrix provides strong discriminative power by distinguishing local image structures, helping us better reconstruct local HR regions of infrared images.

Dense error cost
In the SR process, this cost implies that a target patch T(p_i) with lower reconstruction error is more likely to be a neighbouring patch of S(p_i). Therefore, the reconstruction error of each patch is computed based on the dense error appearance model generated from the neighbouring templates D of T(p_i), using PCA. We obtain the eigenvectors E_{V_D} of the normalised covariance matrix of D, corresponding to the largest eigenvalues, which form the PCA bases of the target patches. Based on E_{V_D}, we compute the reconstruction coefficient of patch p_i as

$$\alpha_i = E_{V_D}^T \big( T(p_i) - \bar{T}(p_i) \big), \tag{3.9}$$

and the dense reconstruction error of each patch is

$$E_d = \left\| T(p_i) - \big( E_{V_D} \alpha_i + \bar{T}(p_i) \big) \right\|, \tag{3.10}$$

where \bar{T}(p_i) is the mean feature of T(p_i). Dense error representations can better model data points with a multivariate Gaussian distribution and can capture salient details of LR patches in infrared images.

Scale cost
Because of the different transformations, the cost can match target patches in the downsampled image I_D or in the HR image. This matching process has a small cost (e.g. region covariance cost, appearance cost). Traditional bicubic interpolation loses more details via spatial interpolation. The scale cost E_sc, which better restores the HR image, is defined by

$$E_{sc} = \max\big(0,\; SR - \mathrm{Scale}(M_i)\big), \tag{3.11}$$


where SR denotes the final SR factor (i.e. 3×) and Scale(·) denotes the scale estimation. Here, we estimate the scale of the sampled source patch using M_i:

$$\mathrm{Scale}(M_i) = \sqrt{ \det \begin{bmatrix} M_{1,1} - M_{1,3} M_{3,1} & M_{1,2} - M_{1,3} M_{3,2} \\ M_{2,1} - M_{2,3} M_{3,1} & M_{2,2} - M_{2,3} M_{3,2} \end{bmatrix} }, \tag{3.12}$$

where M_i is the 3 × 3 transformation matrix; this cost encourages searching for source patches of similar scale to the target patches. Inspired by Yang et al. (2013), good self-exemplars are often found in localised neighbourhoods; by initialising the nearest-neighbour field with zero displacement, we provide a good start for faster convergence. A randomised search is then performed to refine the current solution. Using the transformation matrix M_i and the four costs, we restore visually feasible HR outputs.
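To make the cost terms above concrete, the following is a minimal NumPy sketch of three of the four cues: the Gaussian-weighted appearance cost of Eq. (3.2), the 7 × 7 region covariance descriptor of Eqs. (3.7)–(3.8) and the scale cost of Eqs. (3.11)–(3.12). It is an illustrative sketch written for this section, not the authors' implementation; the function names and the use of finite differences for the derivatives are assumptions made here.

```python
import numpy as np

def appearance_cost(T, S, sigma2=6.0):
    """Gaussian-weighted grey-space distance between target and source patches, Eq. (3.2)."""
    h, w = T.shape
    y, x = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    W = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma2))  # Gaussian weight
    return np.sum((W * (T - S)) ** 2)

def region_covariance(patch):
    """7 x 7 covariance descriptor of a grey patch, Eqs. (3.7)-(3.8).

    Features per pixel: x, y, intensity, |dI/dx|, |dI/dy|, |d2I/dx2|, |d2I/dy2|
    (derivatives approximated by finite differences)."""
    I = patch.astype(np.float64)
    h, w = I.shape
    y, x = np.mgrid[0:h, 0:w]
    dIdy, dIdx = np.gradient(I)
    d2Idy2, _ = np.gradient(dIdy)
    _, d2Idx2 = np.gradient(dIdx)
    F = np.stack([x, y, I, np.abs(dIdx), np.abs(dIdy),
                  np.abs(d2Idx2), np.abs(d2Idy2)], axis=-1).reshape(-1, 7)
    return np.cov(F, rowvar=False)          # (7, 7), uses the 1/(n-1) normalisation

def scale_cost(M, sr_factor=3.0):
    """Scale estimate of a 3 x 3 transformation and the cost of Eqs. (3.11)-(3.12)."""
    A = np.array([[M[0, 0] - M[0, 2] * M[2, 0], M[0, 1] - M[0, 2] * M[2, 1]],
                  [M[1, 0] - M[1, 2] * M[2, 0], M[1, 1] - M[1, 2] * M[2, 1]]])
    scale = np.sqrt(np.abs(np.linalg.det(A)))   # abs() guards tiny negative determinants
    return max(0.0, sr_factor - scale)

# Toy usage on random 7 x 7 patches
rng = np.random.default_rng(0)
T, S = rng.random((7, 7)), rng.random((7, 7))
print(appearance_cost(T, S))
print(region_covariance(T).shape)            # (7, 7)
print(scale_cost(np.diag([0.5, 0.5, 1.0])))  # 3 - 0.5 = 2.5
```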

3.1.2 Experimental Results

Considering an upscaling factor of 3 and a 7 × 7 patch for SR in the experiments to evaluate the performance of the proposed method, this section evaluates five different infrared scenes: electric bicycle, containing 20 infrared images of different scales; human body, containing 50 images of different views; device, containing 10 images; building, containing 20 images; and low-contrast building with complex background, containing 20 images.

Qualitative results
These scenes are quite common and easy to capture. The images are captured with an infrared camera, giving us five infrared scenes. This section extracts regions from the restored HR images and places them in the corner of the respective HR infrared image; the red rectangle marks the extracted region. In Figs. 3.2, 3.3, 3.4, 3.5 and 3.6, (a) is the LR infrared image, (b) is the HR infrared image reconstructed by the BI method, (c) is the HR infrared image restored by the SC method, (d) shows the HR image restored by the ZE method and (e) shows the SR infrared image generated by the method of this section. From these figures, we see that the method generates high-quality HR infrared images with less texture noise; the restored SR infrared images are clear and the edges are sharp. These results further verify the effectiveness of our method compared to the other competing methods.
Figures 3.2, 3.3, 3.4, 3.5 and 3.6b denote the HR infrared images generated by the BI method, which simply upscales the LR image using bicubic interpolation; the edges are shifted and the texture is smoothed. The method magnifies the spatial resolution by a factor of three, which is less accurate when the infrared image is complex or low contrast, as in Figs. 3.3, 3.5 and 3.6b. The simple BI method cannot generate good HR infrared images, and the details of the restored infrared images are lost because of the interpolation process.
Yang et al. produced HR infrared images via sparse coding (SC) in Figs. 3.2, 3.3, 3.4, 3.5 and 3.6c. From these, we see that this method improves on the BI method and


Fig. 3.2 LR infrared images of an electric bicycle magnified by a factor of 3: a LR image; b bicubic interpolation (BI) method; c sparse coding (SC) method; d image scale-up (ZE) method and e the method of this section. We produce high-quality HR images compared to other methods. Reprinted from Qi et al. (2017), with permission from Elsevier

Fig. 3.3 LR infrared image of human body magnified by a factor of 3: a LR image; b BI method; c SC method; d ZE method and e the method of this section. Our method produces clearer details with less noise in the output image. Reprinted from Qi et al. (2017), with permission from Elsevier


Fig. 3.4 LR image of a device magnified by a factor of three: a LR image; b BI method; c SC method; d ZE method and e the method of this section. Our proposed method can restore high-quality infrared images with clear edge regions. Reprinted from Qi et al. (2017), with permission from Elsevier

uses sparse coding rather than simple interpolation to magnify the spatial resolution. The restored HR infrared image contains fewer unwanted details (see Fig. 3.2c). However, this method does not perform well when the scene is complex, as in Figs. 3.2, 3.5 and 3.6c, and it is unstable owing to the sparse decomposition. Some unwanted artefacts also appear in the HR infrared images: high-frequency details are unclear and object edges are blurred, as in Figs. 3.3 and 3.4c (red rectangle). These results degrade the performance of the restored SR infrared images.
Zeyde et al. introduced kSVD and OMP for SR reconstruction (ZE) and produced the restored HR infrared images shown in Figs. 3.2, 3.3, 3.4, 3.5 and 3.6d. This method effectively addresses drawbacks of the SC method and performs better than the SC and BI methods. However, the ZE method cannot produce pleasing results when the scene is complex, as in Fig. 3.6d; this is a common problem of recent studies. In Fig. 3.5, the ZE method reconstructs good HR infrared images compared to the BI method and also improves on the SC method. However, it generates blurred HR results, as shown in Fig. 3.6d, because it applies kSVD and OMP for sparse coding, as discussed by Aharon et al. (2006); thus, the texture is still smoothed and the edges are shifted.
Our method produces high-quality reconstructed infrared images with sharp edges and few artefacts, as verified in Figs. 3.2, 3.3, 3.4, 3.5 and 3.6e. Our method can recover details of foggy scenes, as shown in Fig. 3.3. It also makes lines clear when the


Fig. 3.5 LR image of buildings magnified by a factor of three: a LR image; b BI method; c SC method; d ZE method and e the method of this section. The approach produces high-quality HR images compared to other methods and the overall scene is clear. Reprinted from Qi et al. (2017), with permission from Elsevier

device is at low contrast compared to neighbouring objects, as seen in Fig. 3.4e. The introduced method produces HR infrared images with more texture (see Figs. 3.3 and 3.6e). Additionally, the proposed method reduces reconstruction artefacts, as seen in Figs. 3.2, 3.3 and 3.5e. The restored infrared images are closer to the original HR infrared images.

Quantitative results
We perform a quantitative evaluation of our method using PSNR, the SSIM index (see Wang et al. 2004) and the information fidelity criterion (IFC) score (see Sheikh et al. 2005). The PSNR and IFC scores have higher correlation with human perception than other metrics (see Yang et al. 2014). Because such quantitative metrics may not correlate perfectly with visual perception, we also examine the visual quality of this section's results. Numbers in red indicate the best performance and those in blue indicate the second-best performance.
Table 3.1 reports the average PSNR scores of the different methods for the five infrared scenes. The method performs 2–3 dB better than the second-best method (ZE). In the device scene, our method obtains the highest PSNR score of 29.08 dB. In the electric bicycle, human body and building scenes, our method obtains PSNRs of 28.63, 26.34 and 27.33 dB, respectively. In the dim-building scene, our method obtains a PSNR of 24.01 dB, 1.03 dB better than the ZE method; all methods achieve low PSNR scores here because of the complexity of the scene. High PSNR scores indicate that the reconstructed infrared image appears real and that the details are clear.


Fig. 3.6 LR image of low contrast building magnified by a factor of three: a LR image; b BI method; c SC method; d ZE method and e the method of this section. The approach produces high-quality HR images compared to other methods. Reprinted from Qi et al. (2017), with permission from Elsevier

Table 3.1 Quantitative evaluation of PSNR (dB) in five infrared scenes: red indicates the best performance and blue indicates the second-best performance

Infrared images     BI      SC      ZE      The method of this section
Electric bicycle    21.56   24.37   25.59   28.63
Human body          22.21   23.64   24.51   26.34
Device              26.32   27.14   27.29   29.08
Building            24.45   25.14   26.12   27.33
Dim building        21.11   22.74   22.98   24.01

Reprinted from Qi et al. (2017), with permission from Elsevier

Table 3.2 reports the average SSIM scores of the competing methods. Our method obtains the best quantitative results for the five infrared scenes, consistently achieving the highest SSIM scores, about 0.02 better than the second-best performance (the ZE method), except for the building images. In the human body scene, our method obtains the highest SSIM score of 0.8901, with SSIM scores of 0.8746, 0.7948 and 0.8398 for the electric bicycle, device and building scenes, respectively. Our method obtains a lower SSIM score of 0.6901 in the dim building scene, however, showing that it is less effective there than in the other four infrared scenes. Our method utilises four


Table 3.2 Quantitative evaluation of SSIM in five infrared scenes: red indicates the best performance and blue indicates the second-best performance

Infrared images     BI       SC       ZE       The method of this section
Electric bicycle    0.8482   0.8521   0.8588   0.8746
Human body          0.8613   0.8699   0.8789   0.8901
Device              0.7257   0.7421   0.7548   0.7948
Building            0.8219   0.8344   0.8341   0.8398
Dim building        0.6641   0.6755   0.6784   0.6901

Reprinted from Qi et al. (2017), with permission from Elsevier

Table 3.3 Quantitative evaluation of IFC in five infrared scenes: red indicates the best performance and blue indicates the second-best performance

Infrared images     BI      SC      ZE      The method of this section
Electric bicycle    5.45    5.67    5.79    6.14
Human body          3.11    3.26    3.41    3.49
Device              6.13    6.16    6.25    6.38
Building            5.49    5.51    5.64    5.70
Dim building        3.13    3.17    3.28    3.37

Reprinted from Qi et al. (2017), with permission from Elsevier

cues to build a new framework and restores high-quality infrared images where other methods fail. The restored high-quality infrared images are directly reflected in our SSIM scores.
Table 3.3 reports the IFC scores. Our method achieves better results than the others, obtaining an IFC of 6.38 in the device scene, 0.13 better than the second-best performance (ZE). In the dim-building and human body scenes, the overall performance is low, with IFC scores of 3.37 and 3.49, respectively; owing to scene complexity, the BI and SC methods cannot restore pleasing results there, and although our method also yields low IFC values, they are still 0.11 and 0.08 better than the second-best performance. In the building scene, our method obtains an IFC score of 5.70, although the other methods also perform well. We also perform well in the electric bicycle scene, producing an IFC score of 6.14, whereas the other methods score lower. These quantitative results verify the effectiveness of the introduced method. The experimental results demonstrate its superiority, showing that it can restore clear infrared images with less noise.

Ablation study
Every component of the introduced method contributes to the final SR result. Patch size plays an important role in the introduced method: patches that are too large may have difficulty mapping the LR image, which contains complex structural information, whereas patches that are too small may not contain enough detail of the HR images. This section plots PSNR scores for patch sizes ranging from 3 × 3 to 11 × 11. Figure 3.7 shows the five infrared scenes, with a patch size of 7 × 7 giving the best results.


Fig. 3.7 Quantitative evaluation of patch size of five infrared scenes. Reprinted from Qi et al. (2017), with permission from Elsevier

Table 3.4 PSNR scores of different components of five infrared images

Infrared images     Appearance   Dense error   Region covariance
Electric bicycle    16.21        13.24         16.42
Human body          17.34        15.00         17.44
Device              23.22        23.12         25.64
Building            17.47        16.59         20.41
Dim building        16.02        15.82         18.85

Table 3.4 shows the PSNR performance of each of the three main cues (i.e. appearance, dense error and region covariance). The dense error appears to contribute least, but it is still indispensable to the overall performance; this cue may simply be less effective for these five infrared scenes. The appearance cue is useful in SR infrared imaging because it performs better than the dense error cue. The region covariance plays an important role in this method as well, consistently obtaining the highest PSNR scores across the five scenes. This cue exploits data details better than the other two cues; additionally, region covariance represents structural details such as high-frequency information. These higher PSNR scores further show the effectiveness of region covariance in the five infrared scenes.


3.2 Hierarchical Superpixel Segmentation Model Based on Vision Data Structure Feature

Superpixel segmentation is an important preprocessing technique, for which high accuracy, fast running speed, high superpixel tightness and noise robustness are expected. In this section, a new histogram similarity criterion based on a one-dimensional differential distance is introduced. Our algorithm reflects weak differences between histograms through their structural information and shows the advantages of simplicity and high accuracy. Moreover, a hierarchical superpixel segmentation model based on pyramid theory is elaborated, for which the accuracy of regional block segmentation is guaranteed; the weakened details are then recovered when the image is restored to its original resolution. This idea improves the anti-noise capability of the algorithm. A tightness constraint term is also introduced, encouraging superpixels of similar size. Experimental results on the Berkeley BSD300 dataset show that a strong combination of accuracy, efficiency and tightness can be achieved with this algorithm, which also has better anti-noise performance.
With the progress of science and technology, HR and large-scale images have gradually become mainstream. Real-time performance is an important index for evaluating saliency detection and target tracking, and these computer vision algorithms benefit from the concept of superpixels. Cheng et al. (2011) and Yang et al. (2013) used superpixel segmentation to remove redundant information and reduce running time, thus achieving high accuracy and efficiency. The superpixel concept was elaborated by Ren and Malik (2003), using pixels of similar colour, texture and brightness; superpixels comprise pixels belonging to the same object. The watershed algorithm introduced by Vincent et al. (1991) was the earliest superpixel segmentation algorithm. Many segmentation algorithms have since been elaborated for different purposes, including normalised cuts by Shi and Malik (2000), superpixel lattices by Moore et al. (2008), entropy-rate superpixel segmentation by Liu et al. (2011), meanshift by Comaniciu and Meer (2002), turbopixel by Levinshtein et al. (2009) and others.
To obtain higher efficiency and accuracy, Achanta et al. introduced simple linear iterative clustering (SLIC) in 2012, which uses local K-means clustering. Its running time is greatly reduced by the local clustering method, and the user controls the number and tightness of the superpixels. However, SLIC requires multiple iterations to update superpixels; its running time is directly proportional to the total number of pixels, making it difficult to meet real-time requirements. Moreover, the algorithm is less effective at weak-edge detection, and its anti-noise ability is poor. To address its time consumption, Ren et al. introduced the gSLICr algorithm based on SLIC (2011), which greatly reduces time consumption using graphics processing unit (GPU) operations. Van den Bergh et al. then introduced superpixels extracted via energy-driven sampling (SEEDS) in 2012 to further reduce the running time and improve detection accuracy. SEEDS and SLIC are similar in that their superpixel segmentation begins with a regular grid shape. However, SLIC requires multiple iterations and operates pixel by pixel, whereas SEEDS uses the principle of


hill climbing to avoid calculating the distance from the superpixel block centre pixel to the surrounding pixels: only the boundary pixels of the regular grid must be computed, significantly reducing running time. However, the size of the superpixels obtained by SEEDS is not uniform, and these superpixels have poor tightness. Although the anti-noise performance of the algorithm is greatly improved compared to SLIC, it is still not ideal. In 2016, Wang et al. introduced a superpixel segmentation method based on region merging.
To improve the anti-noise performance of these algorithms, a hierarchical superpixel segmentation model based on a pyramid is constructed. Moreover, a new histogram similarity criterion based on one-dimensional differential distances is elaborated to improve segmentation accuracy. Finally, to achieve artificial control of superpixel tightness, a tightness constraint term is introduced into the superpixel segmentation criterion.

3.2.1 Hierarchical Superpixel Segmentation Model Based on the Histogram Differential Distance

In this section, the core SEEDS theory, the hill-climbing algorithm, is used for reference, and a hierarchical superpixel segmentation model based on pyramids is established. A new criterion for measuring the similarity between histograms is also introduced, providing a method that preserves histogram structure information. Moreover, to enable the results to be used more effectively in fields like saliency detection, a superpixel tightness constraint term is introduced.

A new histogram similarity criterion based on one-dimensional differential distance
Common histogram similarity criteria include the Euclidean distance, Mahalanobis distance, correlation coefficients, chi-square distances, etc. The chi-square distance has the advantages of good real-time performance and high accuracy; therefore, it is analysed as representative of the traditional criteria. However, it retains some shortcomings, and a special case is given in Fig. 3.8: the chi-square distance of histograms a and b is the same as the chi-square distance of histograms a and c, although histograms b and c are different. The structural information of each histogram should be considered to address this problem.
As shown in Fig. 3.9, a random histogram is created. Observe the distance between points a and b in Fig. 3.9. Obviously, when comparing with the Euclidean distance L1, L2 is the actual distance between the two points. The arc length between the points can thus be expressed by Eq. (3.13):

$$S = \{x, p(x)\}, \tag{3.13}$$

where x is the abscissa of the quantised colour histogram and p(x) is its ordinate. Then, the differential distance of the histogram can be expressed as dS² = dx² + d(p(x))². Per the differential theorem, dp = p_x dx can be obtained, giving


Fig. 3.8 Special case showing the shortcomings of chi-square distance

Fig. 3.9 A random histogram



$$dS^2 = dx^2 + d(p(x))^2 = dx^2 + p_x^2\, dx^2 = \left(1 + p_x^2\right) dx^2. \tag{3.14}$$

In other words,

$$ds^2 = C_i\, dx^2, \qquad C_i = 1 + p_x^2. \tag{3.15}$$
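As a quick numerical illustration of Eqs. (3.14)–(3.15), the differential coefficient C_i can be approximated from a discrete histogram with finite differences. The helper below is a sketch written for this section, not part of the original algorithm.

```python
import numpy as np

def differential_coefficient(hist):
    """C_i = 1 + p_x^2 per histogram bin, Eq. (3.15), with p_x from finite differences."""
    p = np.asarray(hist, dtype=np.float64)
    p = p / p.sum()                      # normalised ordinate p(x)
    p_x = np.gradient(p)                 # dp/dx with unit bin spacing
    return 1.0 + p_x ** 2

hist = np.array([1, 4, 9, 4, 1, 0, 2])
print(differential_coefficient(hist))    # larger where the histogram changes quickly
```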

To better apply the differential distance, local kernel regression, introduced by Takeda et al. (2007) and Seo and Milanfar (2011), is used. The local kernel is shown in Eq. (3.16):

$$K(x_i - x) = \frac{\sqrt{|C_i|}}{h^2} \exp\!\left( \frac{(x_i - x)\, C_i\, (x_i - x)}{-2h^2} \right), \tag{3.16}$$

where C_i is the variance estimated from a collection of spatial gradients within the local analysis window around position x, and h is a global smoothing parameter. Equation (3.15) is applied to Eq. (3.16), and the local kernels {K_i(l)}, l = 1, …, w, corresponding to each point of the histogram, are calculated, where w is the size of the local window. Thus, the quantised colour histogram is transformed into a two-dimensional matrix of size bins × w, where bins is the histogram quantisation level. To compute histogram similarity, the cosine similarity of the w-dimensional vectors at each point is calculated as follows:

$$\cos_i = \frac{\sum_{l=1}^{w} \left(K_{1il} \times K_{2il}\right)}{\sqrt{\sum_{l=1}^{w} K_{1il}^2} \times \sqrt{\sum_{l=1}^{w} K_{2il}^2}}. \tag{3.17}$$

When the similarity of the two histograms grows, the chi-square distance shrinks whereas the cosine similarity gets larger. Thus, we change Eq. (3.17) into a distance in Eq. (3.18):

$$\cos_i' = 1 - \cos_i. \tag{3.18}$$

The chi-square distance is

$$F_i = \frac{(y_{1i} - y_{2i})(y_{1i} - y_{2i})}{y_{1i} + y_{2i}}, \quad i \in \{1 \ldots \text{bins}\}, \tag{3.19}$$

where y_{1i}, y_{2i} are the normalised probability values of the two histograms when the abscissa is i. After the two distances are normalised, a new similarity criterion is established as follows:

$$\text{similarity}_i = F_i + \gamma \cos_i', \tag{3.20}$$

where γ represents the proportional relationship between the two similarity criteria. M is the number of superpixels we wish to obtain and size is the number of pixels in the image. s is the mapping from pixels to superpixels, given by

$$s : \{1 \ldots \text{size}\} \rightarrow \{1 \ldots M\}. \tag{3.21}$$

Then, the norms of the per-bin similarities are accumulated, becoming the final similarity of the two histograms:

$$\text{similarity}(s) = \sum_{i=0}^{\text{bins}} \left\| \text{similarity}_i \right\|. \tag{3.22}$$
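For readers who want to experiment with this similarity criterion, the sketch below strings Eqs. (3.15)–(3.20) and (3.22) together for two quantised histograms. It is an illustrative re-implementation written for this section; the window size, smoothing parameter and γ are assumed values, not the authors' settings.

```python
import numpy as np

def local_kernels(hist, w=5, h=1.0):
    """Local kernels K_i(l) over a window of size w for each histogram bin, Eqs. (3.15)-(3.16)."""
    p = np.asarray(hist, dtype=np.float64)
    p = p / p.sum()
    p_x = np.gradient(p)
    C = 1.0 + p_x ** 2                               # Eq. (3.15)
    bins, half = len(p), w // 2
    K = np.zeros((bins, w))
    for i in range(bins):
        for j, l in enumerate(range(-half, half + 1)):
            K[i, j] = np.sqrt(C[i]) / h**2 * np.exp(-(l * C[i] * l) / (2.0 * h**2))
    return K                                         # shape (bins, w)

def histogram_similarity(h1, h2, w=5, gamma=1.0):
    """Combined chi-square + differential-structure similarity, Eqs. (3.17)-(3.20), (3.22)."""
    K1, K2 = local_kernels(h1, w), local_kernels(h2, w)
    num = np.sum(K1 * K2, axis=1)
    den = np.sqrt(np.sum(K1**2, axis=1)) * np.sqrt(np.sum(K2**2, axis=1)) + 1e-12
    cos_dist = 1.0 - num / den                       # Eq. (3.18)
    y1 = np.asarray(h1, float) / np.sum(h1)
    y2 = np.asarray(h2, float) / np.sum(h2)
    chi2 = (y1 - y2) ** 2 / (y1 + y2 + 1e-12)        # Eq. (3.19)
    per_bin = chi2 + gamma * cos_dist                # Eq. (3.20)
    return np.sum(np.abs(per_bin))                   # Eq. (3.22): lower means more similar

a = [2, 8, 20, 8, 2, 0, 0, 0]
b = [2, 8, 20, 8, 2, 0, 0, 0]
c = [0, 0, 2, 8, 20, 8, 2, 0]
print(histogram_similarity(a, b))   # ~0: identical histograms
print(histogram_similarity(a, c))   # larger: shifted structure
```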

Moreover, when the number of superpixels is large, the histogram distribution within each block is concentrated and many adjacent bin values are repeated. To reduce computational complexity, interval sampling is used, which preserves the accuracy of the segmentation results without increasing time and complexity.

Superpixel segmentation based on data and tightness constraint terms
In the field of superpixel segmentation, the size of each superpixel is crucial and high consistency of the superpixel areas is expected. Non-uniformity of the superpixel area causes significant errors in fields such as saliency detection and target tracking. Thus, in addition to the histogram similarity criterion, we add a tightness constraint to the algorithm, represented by G(s). The segmentation criterion can then be expressed as

$$E(s) = \text{similarity}(s) + \lambda G(s). \tag{3.23}$$

Several symbols are introduced to explain G(s): current_area represents the area of the regular block a in the current layer; parent_area represents the area of the parent-layer superpixel A around a; and average_area represents the average size of the superpixels, average_area = size/M. Thus, the probability of a not belonging to A is

$$G(a \notin A) = \frac{\text{current\_area} + \text{parent\_area} - \text{average\_area}}{\text{average\_area}}. \tag{3.24}$$

When this value is larger, the probability of a belonging to A is smaller. The tightness of the superpixels can be controlled through the size of λ: tightness increases as λ increases.

A hierarchical superpixel pyramid segmentation model
A hierarchical superpixel segmentation model based on a pyramid is elaborated, consisting of three parts. The superpixel segmentation level is set to N + 1. Considering that downsampling can ignore image details, we first concentrate on the segmentation of large areas: the first N layers are segmented on the downsampled image, comprising the first part. The image resolution is then restored to the original state for further image division; the details of segmentation are the focus here, and the N + 1 layer


Fig. 3.10 Hierarchical superpixel segmentation model based on pyramid

segmentation is performed; this process comprises the second part. Finally, the division of the remaining pixels is the third part. The sample graph corresponding to each algorithm step is shown in Fig. 3.10.
In the first part, criterion E(s) is used for segmentation. It consists of two parts: the histogram similarity term and the tightness constraint term. In the second part, the number of regular grids is so large that much running time would be spent; however, because the colour histogram distribution is concentrated, its differential structure does not require much consideration. Therefore, only the histogram similarity via the chi-square distance needs to be calculated in the second part of the segmentation, greatly reducing computational complexity. In the third part, to remove the weak sawtooth oscillation caused by the above segmentation, SLIC is used at the boundary pixels, shown as

$$E(s) = \sqrt{(l_i - l_s)^2 + (a_i - a_s)^2 + (b_i - b_s)^2}, \tag{3.25}$$

where l_i, a_i, b_i are the three-channel pixel values of each superpixel boundary pixel and l_s, a_s, b_s are the means of all pixels in superpixel M.
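The third part of the pipeline can be pictured with the short sketch below: each boundary pixel is reassigned to whichever neighbouring superpixel has the closest mean Lab colour, following Eq. (3.25). This is a simplified illustration written for this section (the neighbourhood handling and data layout are assumptions), not the authors' implementation.

```python
import numpy as np

def reassign_boundary_pixel(pixel_lab, candidate_means):
    """Assign a boundary pixel to the superpixel with the smallest Lab distance, Eq. (3.25).

    pixel_lab: (l, a, b) values of the boundary pixel.
    candidate_means: dict mapping superpixel label -> mean (l, a, b) of that superpixel.
    """
    best_label, best_dist = None, np.inf
    for label, mean_lab in candidate_means.items():
        d = np.sqrt(np.sum((np.asarray(pixel_lab, float) - np.asarray(mean_lab, float)) ** 2))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# A boundary pixel lying between two superpixels with different mean colours
pixel = (52.0, 10.0, -8.0)
means = {3: (50.0, 12.0, -10.0), 7: (80.0, -5.0, 30.0)}
print(reassign_boundary_pixel(pixel, means))   # -> 3
```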

3.2.2 Experimental Results

To test the effects of the algorithm, the Berkeley dataset is used. It consists of 300 images with several ground-truth manual segmentations for each image. We now compute the standard metrics used in the most recent superpixel


Fig. 3.11 Influence of segmentation layers: a BR; b CUE and c time

studies, namely boundary recall (BR) and corrected under-segmentation error (CUE), following Neubert and Protzel (2012) and Buyssens et al. (2014). For BR, higher is better; for CUE, lower is better.

Influence of the input parametres
Because a hierarchical strategy is adopted, different numbers of layers can be used for the same number of superpixels, so it is essential to select an appropriate number of layers. Test results are shown in Fig. 3.11. From the evaluation curves, too many layers (five in this experiment) cause the runtime to increase and the accuracy to decrease. When the number of superpixels is small, the segmentation effect and time consumption of four layers are better than those of three layers; as the number of superpixels increases, this advantage weakens. However, fewer segmentation layers give a poor segmentation effect and


Fig. 3.12 Influence of the histogram quantisation level: a BR; b CUE and c time

a high time cost, because the segmentation with fewer superpixels occurs pixel by pixel. Therefore, the number of layers should be selected as needed; in this experiment, the number of layers is four.
Runtime can also be saved by quantifying the colour histogram. Then, 3 × 3 × 3, 5 × 5 × 5 and 8 × 8 × 8 quantisation levels are analysed to research the influence of the histogram quantisation level. The test results are shown in Fig. 3.12. In terms of accuracy, the effect of the three-level quantification is the worst, and the effects of the other two quantisation levels are similar; however, as the number of superpixels increases, the advantages of the five-level quantification gradually appear. From the point of view of time consumption, time increases rapidly with the quantification level. Thus, with the increase of superpixels,


Fig. 3.13 Performance test of histogram structure similarity: a BR and b CUE

this phenomenon becomes more obvious. In summary, when the colour histogram is computed with five bins per channel, time and accuracy reach the most satisfactory level.

Comparison of this algorithm and the classical algorithms
With all other conditions the same, experiments are carried out under the conditions γ = 0 and γ = 10,000; the test results (red lines) are shown in Fig. 3.13. Because the one-dimensional differential distance reflects the histogram structure information, it captures the nuances between the two histograms. The introduction of the one-dimensional differential distance improves the boundary adherence performance, and the effect is more obvious when the number of superpixels is larger.
Moreover, the segmentation results achieved using the algorithm in this section are compared with SLIC in Fig. 3.14. Observing the black line in Fig. 3.14, the effect of SLIC is poor on the weak edge. The algorithm in this section can solve the segmentation problem of small targets, weak edges, etc. SLIC is a pixel-by-pixel segmentation, and weak differences between pixel values are difficult to distinguish. However, the algorithm in this section is based on the segmentation of regions, so differences among the histograms are more easily distinguished. The similarity measure based on histogram structure information can accurately reflect the weak differences between histograms; thus, the accurate segmentation of the weak edge is realised. The experiments on the dataset are shown in Fig. 3.15.


Fig. 3.14 Partial effect images of two algorithms with 400 superpixels: a SLIC; b SLIC partial magnification; c algorithm in this section and d algorithm in this section with partial magnification

Fig. 3.15 Segmentation performance of different algorithms: a BR; b CUE and c TIME


Fig. 3.16 Partial effect images of the two algorithms having 300 superpixels: a SEEDS and b the algorithm of this section

The algorithm in this section performs well in all respects, and the gap grows with the increasing number of superpixels. In terms of time complexity, this algorithm is lower than SLIC. Whereas the edge adherence of SEEDS is very good, its tightness is not good enough, and the variation in superpixel size directly affects the accuracy of subsequent algorithms. Partial effect images of the two algorithms are shown in Fig. 3.16. Because a tightness constraint term is introduced, the superpixels obtained from the algorithm in this section are more regularly shaped and uniform than those of SEEDS; this gap is more obvious in regions of uniform colour. For the segmentation of small parts, such as the white circle in Fig. 3.16, the algorithm in this section is more accurate.


Fig. 3.17 Partial effect images of the two algorithms having 300 superpixels: a SEEDS; b SEEDS with partial magnification; c the algorithm in this section and d the algorithm in this section with partial magnification

To test the anti-noise performance, Gaussian noise with zero mean and a standard deviation of two is added to the experimental dataset. Because SLIC uses pixel-by-pixel clustering, its results are very poor in the case of high noise, so only SEEDS and the algorithm in this section are compared. Some experimental results for high-noise cases are shown in Fig. 3.17. The detection accuracy of SEEDS on weak edges and small targets is worse than that of this section's algorithm under high noise, as shown in Fig. 3.17. Objective experimental results for this algorithm and SEEDS are shown in Fig. 3.18. Because the one-dimensional differential distance has anti-noise ability, and because the segmentation model based on the pyramid is elaborated, this algorithm has better anti-noise performance than SEEDS.
Moreover, in the experiments above, the objects of the first N layers of segmentation are downsampled images; results obtained when the first N layers operate on the original-resolution images instead are shown in Fig. 3.19. Given that the image is downsampled, the segmentation of large areas is the focus and the segmentation details are weakened; when the image is restored to its original resolution, the details of the segmentation become the focus, enhancing the anti-noise performance of the algorithm. Thus, the experimental results are closer to the ideal.


Fig. 3.18 Experiments of two algorithms in high-noise cases

Fig. 3.19 Experiments of high-noise cases. The objects of the first N layers of segmentation are the original-resolution images


3.3 Structure-Based Saliency in Infrared Images

Saliency detection is a modern issue in computer vision. However, most extant saliency algorithms have been introduced to detect saliency in natural images, and much less work has been designed for infrared images. In this section, a novel structure-based saliency method is introduced that explores both local structure and global constraints via structured local adaptive regression kernels (SLARK) and a Gaussian mixture model (GMM). First, we estimate a local saliency map based on region covariances and LARK features, and an affinity matrix is applied to further refine the local saliency map. Second, we introduce a global constraint prior to estimate saliency based on the GMM; a structure-filtering prior is then used to boost this saliency map. To better absorb local and global cues, this section integrates the two saliency maps as the final map. Quantitative and qualitative experiments on the infrared image dataset demonstrate that the method significantly outperforms previous saliency methods, producing satisfying saliency results. High evaluation scores further show the effectiveness and robustness of this method.
The goal of saliency detection is to compute a saliency map that highlights the regions of interest and suppresses the background noise in a scene. Serving as a preprocessing step, it can help focus on the object regions relevant to the current vision tasks, giving wide applications such as target detection (see Wang et al. 2012b; Guo and Zhang 1988; Mahadevan and Vasconcelos 2013). Saliency estimation has roots in psychological theories, such as feature integration theory (FIT) by Treisman and Gelade (1980). The earliest computational models were built on FIT, which fused several computed feature maps into a final map. Itti et al. (1998) used colour, intensity and orientation features to estimate saliency via centre-surround differences across multiscale images to produce a saliency map. Ma and Zhang (2003) introduced an alternative local contrast method to compute centre-surround colour differences in a fixed neighbourhood for each pixel. Harel et al. (2007) introduced a saliency method based on graph algorithms, estimating visual saliency with graph-based visual saliency (GBVS). In 2007, Hou and Zhang exploited spectral image components to extract saliency regions. Then, Hou and Zhang (2009) introduced a dynamic visual model, based on the rarity of features, applying incremental coding lengths to estimate the entropy gain of each feature. Achanta et al. (2009) introduced a frequency-tuned (FT) method to subtract the average colour from the low-pass filtered input. Seo et al. (2009) introduced a novel local regression kernel to estimate the likeness of a pixel to its surroundings, assigning high saliency values to salient regions. Liu et al. (2010) introduced a binary saliency method by training a conditional random field to combine a set of novel features, such as multiscale contrast, centre-surround histogram and colour spatial distribution.
Whereas numerous saliency methods have been introduced, a few commonly noticeable issues remain. Extant methods are introduced for natural images, whereas only a few methods are designed for infrared images. Previous methods produce poor saliency results and achieve weak performance on infrared images, because the traditional cues are not successfully applied. Foremost, existing methods estimate


Fig. 3.20 Diagram of the introduced method. A local saliency map is constructed based on SLARK descriptors; then, the local saliency map is refined using an affinity matrix. The global saliency map, based on GMM constraint, is learnt to estimate saliency; and structure filtering is applied to boost the saliency result. Detected results are combined to produce the final saliency map

visual saliency using image properties such as colour, intensity and texture (see Zhou et al. 2014). These priors may not be well utilised in infrared scenes, where image quality remains poor owing to the low quality of infrared cameras; they degrade the overall detection performance. These problems exist widely across the infrared-image spectrum. Recent work by Wang et al. (2015) exploited luminance contrast and contour characteristics to estimate visual saliency; however, this method may produce false detections, and some background noise is expected in the salient regions.
To address these problems, we introduce a structure-based saliency method that exploits both local and global cues. First, we introduce a local saliency method that exploits structured region covariances: it combines LARK into a new feature descriptor, called structured LARK (SLARK), and an improved saliency map is generated by applying an affinity matrix. Then, a global saliency map is estimated based on a GMM prior as a global constraint model, and the global map is refined using a structure-filtering method. Next, a simple integration method is applied to generate the final saliency map based on the local and global models. Figure 3.20 shows the main steps of the salient object detection method.

3.3.1 The Framework of the Introduced Method

The LARK method has been used in saliency detection, but the self-similarity gradient alone is not well suited to estimating visual saliency in infrared images. Considering the importance of structure information in infrared images, this section introduces the region covariances described by Tuzel et al. (2006) and Karacan et al.

72

3 Multi-visual Tasks Based on Night-Vision Data Structure …

(2013) into the LARK descriptor, introducing a locality-structure saliency method that considers the local structure. To the best of our knowledge, this work is the first to use structured LARK information for saliency detection in infrared images.
Let I be an infrared image and F the feature image extracted from the input image I, defined by

$$F(x, y) = \Phi(I, x, y), \tag{3.26}$$

where Φ extracts a d-dimensional vector of features from I. Gradient, location, luminance, LBP and HOG features are used for the infrared images. An image pixel is then represented by a d = 7 dimensional feature vector:

$$F(x, y) = \left[\, x,\; y,\; \left|\frac{\partial I}{\partial x}\right|,\; \left|\frac{\partial I}{\partial y}\right|,\; \mathrm{HOG}(x, y),\; \mathrm{Lu}(x, y),\; \mathrm{LBP}(x, y) \,\right]^T, \tag{3.27}$$

where (x, y) is the pixel location, |∂I/∂x| and |∂I/∂y| are the edge information, HOG denotes the HOG feature, Lu denotes the luminance feature and LBP represents the LBP feature. A region R inside F is then represented by a d × d covariance matrix C_R of the pixel features i ∈ I, defined by

$$C_R = \frac{1}{k-1} \sum_{i=1}^{k} (z_i - \mu)(z_i - \mu)^T, \tag{3.28}$$

where z_i denotes the d-dimensional feature points inside R and μ is the mean value of the region. Then, we compute the SLARK feature as follows:

$$\mathrm{SL}(C_l, X_l) = \exp\!\left(-ds^2\right) = \exp\!\left(-X_l^T C_R X_l\right), \tag{3.29}$$

where l ∈ {1, …, P²} and P² denotes the total number of pixels in a local window, X denotes the coordinate relationship between the centre and its surroundings, s = {x1, x2, z(x1, x2)} and z(x1, x2) is the grey value of pixel (x1, x2). In the classical (unweighted) LARK scheme, without spatial information, saliency is estimated using only a simple self-similarity gradient. Thus, we introduce an affinity matrix between SLARK features in a local window. SLARKs are generated to represent the self-similarity of local regions, and the similarity of two regions is estimated by a defined distance between their SLARKs, so that neighbouring regions sharing similar SLARKs (appearances) can be found. The affinity entry w_{mn} of SLARK region m to a certain region n is

$$w_{mn} = \exp\!\left( \frac{\mathrm{MCS}(sl_m, sl_n)}{2\sigma_1^2} \right), \quad m \in \Omega(n), \tag{3.30}$$

where sl_m, sl_n denote the mean SLARKs of regions m and n, respectively, Ω(n) denotes the set of neighbouring regions of region n and σ1 is a parameter that controls similarity


strength. MCS(·) denotes the matrix cosine similarity introduced by Seo and Milanfar (2010). The affinity matrix W = [w_{mn}]_{N×N} is thus used to indicate the similarity between any pair of nodes. The degree matrix is D = diag{d1, d2, …, dN}, where d_m = Σ_n w_{mn} sums the affinities of each region (node) to the other regions (nodes). A row-normalised affinity matrix is then defined by

$$A = D^{-1} W. \tag{3.31}$$
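The affinity construction in Eqs. (3.30)–(3.31) can be sketched in a few lines of NumPy. In the sketch below, each region is summarised by a mean descriptor vector and the matrix cosine similarity is stood in for by an ordinary cosine similarity between those vectors; the neighbourhood definition and the σ1 value are assumptions made for illustration, not the authors' settings.

```python
import numpy as np

def affinity_matrix(descriptors, neighbours, sigma1_sq=0.8):
    """Row-normalised affinity A = D^{-1} W from region descriptors, Eqs. (3.30)-(3.31).

    descriptors: (N, d) array, one mean SLARK-like descriptor per region.
    neighbours: list of index lists; neighbours[n] gives the regions adjacent to n.
    """
    N = descriptors.shape[0]
    X = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + 1e-12)
    cos_sim = X @ X.T                       # stand-in for the matrix cosine similarity (MCS)
    W = np.zeros((N, N))
    for n in range(N):
        for m in neighbours[n]:
            W[m, n] = np.exp(cos_sim[m, n] / (2.0 * sigma1_sq))   # Eq. (3.30)
    D_inv = np.diag(1.0 / (W.sum(axis=1) + 1e-12))
    return D_inv @ W                        # Eq. (3.31)

# Four toy regions on a chain: 0-1-2-3
desc = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
nbrs = [[1], [0, 2], [1, 3], [2]]
A = affinity_matrix(desc, nbrs)
print(A.sum(axis=1))   # each row sums to 1 after normalisation
```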

In addition to the common connections between two regions described by Cheng et al. (2011), a geodesic constraint prior is applied to enhance the relationships among boundary regions (nodes), so that boundary connectivity helps distinguish a salient object from the background region based on the given affinity matrix. We propagate the SLARK descriptor to define the saliency of a local region. The new saliency score is defined as

$$S(x_m) = \sum_{n=1}^{N} a_{mn} S_{\mathrm{SLARK}}(x_n), \tag{3.32}$$

n1

where amn is the affinity entry computed in Eq. (3.31). SSLARK is the coarse saliency map, based on SLARK features. N denotes the number of regions of SLARK features. Figure 3.21 shows the effects of affinity matrix construction. We see that the affinity matrix having a geodesic constraint plays a vital role in saliency estimation. With a geodesic constraint prior, the saliency maps of Fig. 3.21c are more accurate and background noise is well suppressed. Based on extensive experiments of the local saliency model in the infrared images, σ12  0.8 can produce the best saliency results with least noise. Other visual examples are generated by different values of σ1 as shown in Fig. 3.22. Global model via GMM The local method can well estimate the saliency of an infrared image; however, global cues offer valuable saliency information (see Cheng et al. 2011; Perazzi et al. 2012). Without global information, local methods can falsely assign high saliency values to high-contrast edges and attenuate smooth object interior regions. Therefore, we introduce a global constraint- prior based on GMM. The novelty is exploiting GMM global cues of the SLARK descriptors. We measure the dissimilarity between foregrounds and backgrounds based because GMM details the distribution of different object components. Then, a global term is defined and the following cost is minimised. SG (n) 



(− log(Pn (b1 ) − log Pn (b2 )),

(3.33)

n

where b1 and b2 denote the foreground and background GMM of an infrared image, respectively. Each SLARK descriptor is represented as a weighted combination of GMM components. The probability of belonging to neighbouring region is defined by

74

3 Multi-visual Tasks Based on Night-Vision Data Structure …

Fig. 3.21 Saliency maps from the local model. Brighter pixels indicate higher saliency scores: a input images; b local saliency maps and c affinity matrix imposed on local saliency maps

Each SLARK descriptor is represented as a weighted combination of GMM components:

$$P_n = \sum_{m=1}^{N} w_{mn}\, \mathcal{N}\!\left(P_n;\, 0,\, \Sigma_m\right), \tag{3.34}$$



where P_n denotes a linear operator that extracts the nth SLARK region in the image. We use the method of Zoran and Weiss (2011), where 300 GMM mixture components are used for training and a SLARK size of 7 × 7 is used. w_{mn} is computed as in Eq. (3.30), the covariance matrix of the mth component is Σ_m and N(·) denotes a 49-dimensional Gaussian

distribution. The global model may contain small values (e.g. noise). Thus, to further filter the background noise, an effective filtering method based on SLARK features is introduced:

$$S_G(x_1) = \frac{1}{Z} \sum_{x_1 \in R(x_2)} u_{x_1, x_2}\, S_G(x_2), \tag{3.35}$$


Fig. 3.22 Different local saliency maps for varying σ1: a σ1² = 0.2; b σ1² = 0.4; c σ1² = 0.6 and d σ1² = 0.8

where S_G(x_2) denotes the saliency value at pixel x_2 in S_G, Z = Σ_{x_1∈R(x_2)} u_{x_1,x_2} is the normalisation factor, R(x_2) denotes the set of neighbouring pixels of x_2 and u_{x_1,x_2} = exp(−MCS(x_1^SLARK, x_2^SLARK)/σr²). This section obtains the strong saliency map S_G by applying this structure-filtering method. Figure 3.23 shows three examples of structure-filtering diffusion, uniformly highlighting the objects and effectively suppressing background noise. Based on extensive experiments with global models on infrared images, σr² = 0.6 produces the best results with the least noise. Visual examples generated using different values of σr are shown in Fig. 3.24.

Integration via local and global models
The introduced local and global saliency maps have complementary properties. The local map is generated to capture local structural information through the SLARK feature descriptors. The global model exploits global details by applying the GMM prior as a global constraint; in this case, the global map may miss local details of the image. By integrating the two saliency maps, both global and local information are considered and the final saliency map is computed as

76

3 Multi-visual Tasks Based on Night-Vision Data Structure …

Fig. 3.23 Visual examples from global saliency maps: a input images; b global saliency maps and c structure filtering imposed on global model

Fig. 3.24 Different global saliency maps for varying σr: a σr² = 0.2; b σr² = 0.4; c σr² = 0.6 and d σr² = 0.8


Fig. 3.25 Visual examples of the final framework. The first row denotes the input infrared images; the second row denotes the integrated saliency maps

$$S^{*} = \alpha S_1 + (1 - \alpha) S_2, \quad (3.36)$$

where α is a balance factor for the final combination, S₁ denotes the global saliency map and S₂ denotes the local saliency map. The final integration results are illustrated in Fig. 3.25. This simple integration produces more accurate saliency results, uniformly highlighting the salient objects while remaining robust to the background. For Eq. (3.36), we set α = 0.6, which gives the global saliency map more weight than the local saliency map.
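As an illustration of Eq. (3.36), the following minimal Python sketch (not the original implementation; the rescaling step is our assumption, added so the result can later be binarised with thresholds from 0 to 255) combines the two maps:

```python
import numpy as np

def integrate_saliency(global_map, local_map, alpha=0.6):
    """Combine global and local saliency maps as in Eq. (3.36).

    Both inputs are 2-D arrays of the same shape; alpha weights the
    global map (the text uses alpha = 0.6).
    """
    s = alpha * global_map + (1.0 - alpha) * local_map
    # Rescale to [0, 1]; this normalisation is an assumption for convenience.
    s = (s - s.min()) / (s.max() - s.min() + 1e-12)
    return s
```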

3.3.2 Experimental Results

This section presents experimental results for eight saliency detection methods, including the introduced method, on an infrared image dataset composed of 500 infrared camera images (aperture size f/5.6, exposure time 1/50 s). The introduced method is evaluated against seven state-of-the-art methods: IT, GBVS, SR, LC [Zhai and Shah (2006)], DVA, LARK and FT. The baseline most closely relates to the LARK method, which we extend to infrared images. Colour-based methods are not considered. For a fair comparison, the original source code from the authors' homepages is used.

Evaluation metrics
This section evaluates all methods in terms of precision-recall (PR) curves, obtained by binarising the saliency map with a threshold ranging from 0 to 255 and comparing the binary maps with the ground truth. Precision indicates the percentage of output salient pixels that are correct, whereas recall denotes the percentage of ground-truth pixels that are detected.


Fig. 3.26 Precision–recall: a each component in the introduced method on the infrared image dataset; b number of GMM components

In many cases, high precision and high recall are both desired. Therefore, the F-measure is defined as the overall performance measurement,

$$T_{\lambda} = \frac{(1 + \lambda^2) \cdot \text{precision} \cdot \text{recall}}{\lambda^2 \cdot \text{precision} + \text{recall}}, \quad (3.37)$$

where λ² = 0.3 is set to emphasise precision. Additionally, we introduce the mean absolute error (MAE), which reflects the average pixel-level difference between the ground truth GT and the saliency map S*:

$$\text{MAE} = \frac{1}{O} \sum_{o=1}^{O} \left| S^{*}(o) - \text{GT}(o) \right|, \quad (3.38)$$

where O is the number of image pixels. The MAE value shows how similar a saliency map is to the ground truth; a low MAE value indicates better saliency map performance (Figs. 3.26, 3.27 and 3.28).
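These metrics are straightforward to reproduce. The sketch below is our own illustration, not the benchmark code; it assumes the saliency map and ground truth are arrays scaled to [0, 1]:

```python
import numpy as np

def precision_recall(sal, gt, threshold):
    """Binarise the saliency map at a threshold and compare with ground truth."""
    pred = sal >= threshold
    gt = gt > 0.5
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt.sum() + 1e-12)
    return precision, recall

def f_measure(precision, recall, lam2=0.3):
    """Eq. (3.37): weighted mean emphasising precision (lambda^2 = 0.3)."""
    return (1 + lam2) * precision * recall / (lam2 * precision + recall + 1e-12)

def mae(sal, gt):
    """Eq. (3.38): mean absolute difference between saliency map and ground truth."""
    return np.abs(sal - gt).mean()
```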

Qualitative results
Figure 3.29 compares the saliency maps generated by the seven competing methods and the introduced method. Our method successfully highlights the salient objects, preserves the boundaries of the salient regions and produces saliency results much closer to the ground truth. Although the salient objects appear at different image locations, our method still generates accurate, well-localised results with high saliency values, owing to the effectiveness of the infrared-oriented design. Using self-similarity gradient information, the LARK method generates saliency results with low saliency scores, and the salient object is not wholly detected. The other methods also fail to produce good saliency results. Our saliency maps detect the salient objects with less noise. The method handles more complicated scenarios, such as two objects (rows 4 and 5 in Fig. 3.29), cluttered backgrounds (rows 1, 2 and 6 in Fig. 3.29) and low contrast between foreground and background regions


Fig. 3.27 Partial evaluation of different saliency methods: a precision–recall curves; b F-measure curves

Fig. 3.28 Left: MAE of different saliency methods to ground truth; right: area under the curve (AUC)

(row 3 in Fig. 3.29). With the help of the global model (i.e. the GMM constraint with structure filtering), the method detects salient regions more accurately and with less noise in cluttered backgrounds. By considering both local and global models, our method effectively separates the salient region from the background in low-contrast images, whereas other competing methods often fail in these cases. Thus, the method handles complex infrared image scenes, successfully highlights the salient regions and effectively suppresses the background noise.

Quantitative results
This section uses the PR curves and F-measure curves of Fig. 3.27 to evaluate the methods. Figure 3.27a shows the quantitative performance of the introduced method against the seven competing methods in terms of PR curves; the introduced method achieves the highest precision rate in most cases for a fixed recall rate. From Fig. 3.27b, the introduced method achieves the best F-measure performance compared to the other methods.


Fig. 3.29 Comparison of different methods of infrared image datasets. Our method consistently produces better saliency results, which are much closer to the ground truth: a inputs; b ground truth; c IT; d GB; e SR; f DVA; g LC; h LARK; i FT; j our method

The quality of the saliency maps is further measured using the MAE in Fig. 3.28a. Additionally, we report the AUC, the overlapping ratio (OR) described by Li et al. (2013) and the weighted F-measure (wFmeasure) described by Ran et al. (2014) in Fig. 3.28b. Our method obtains the lowest MAE value (0.33), whereas the other methods obtain higher values; low MAE scores indicate the effectiveness of the method in estimating salient regions. The method also obtains high AUC, OR and wFmeasure scores on these infrared images. The superior performance in terms of these metrics further demonstrates the method's effectiveness and robustness, and validates its potential for detecting saliency in infrared images with complex backgrounds.

Validation of the components of the method
We evaluate the saliency detection performance of the introduced method with respect to its components: the local and global models. Each component contributes to the results. The performance of each step of our method is reported in Fig. 3.26a. The structure-filtering/weight prior contributes less, but still improves the overall performance. The integration of the two models in Fig. 3.26a is very simple and produces accurate saliency results.

Feature analysis in the local model
This section computes a SLARK of size 7 × 7 in a 13 × 13 local window. To clarify the contribution of region covariances in the SLARK feature descriptor, we devise two variants by removing one of the features and compare their performances in Fig. 3.26a. The results show that the structure information helps improve detection accuracy. σ₁ controls the affinity matrix.


Table 3.5 Different values of α in Eq. (3.36) for integrating two cues

α            0.3     0.5     0.7     0.9
MAE          0.37    0.34    0.33    0.34
wFmeasure    0.38    0.38    0.40    0.39
OR           0.32    0.34    0.36    0.36

Table 3.6 Comparison of average execution time (seconds per image). M: Matlab, C: C++

Method     IT     GB     SR     DVA    LC     LARK   FT     Ours
Time (s)   0.22   0.26   0.17   0.96   0.25   0.19   0.21   0.21
Code       M      M      M      M      C      M      M      M

Figure 3.22 reports visual examples with different values of σ₁; σ₁² = 0.6 produces the most accurate results.

Analysis of the global model
To train the GMM, we test different numbers of mixture components in the experiments. Figure 3.26b displays the results for different numbers of components (i.e. 200, 300 and 400); 300 components are used in the GMM to generate the best results. To better filter the noise in the global model, σr² = 0.6 is set, which removes most irrelevant information in Fig. 3.24.

Integration with different values of α
α in Eq. (3.36) is a balance factor for integrating the two saliency maps. We select different values of α and assess them with three evaluation metrics (MAE, wFmeasure and OR) in Table 3.5. α = 0.7 produces the most accurate saliency results; this factor helps integrate both saliency maps, improving the accuracy of the detected results.

Running time
Table 3.6 compares the average running time on the infrared image dataset with the other methods. All methods process one image from the infrared image dataset on a PC equipped with an i7-3630 2.9 GHz CPU and 8 GB of RAM. Our method spends about 0.284 s processing one image. Although our running time is not the fastest, it is faster than most competing methods while achieving more accurate results with less noise.

3.4 Summary

In this chapter, three applications of data structure and feature analysis were discussed, and several important issues in structural feature analysis were covered. The models presented offer data-structure-analysis perspectives that help us better understand the importance of features; therefore, features, in terms of SR, superpixels and saliency, can be relied upon.


A new infrared SR method based on transformed self-similarity was introduced. This chapter exploited appearance, dense error and region covariance to analyse the similarity between LR and HR images. We added a scale cue to upscale the image and introduced a composite framework to accommodate these cues and produce high-quality SR images. To address superpixel segmentation's poor anti-noise ability and poor segmentation of details, a hierarchical superpixel segmentation model based on the histogram one-dimensional differential distance was elaborated, and a new histogram similarity criterion was introduced. A new saliency detection method for infrared images was also introduced, in which both local and global cues are exploited. We introduced a local saliency method that exploits SLARK features and applied an affinity matrix to refine the local saliency map. Additionally, a global saliency model based on a GMM prior was introduced, and a structure-filtering prior was utilised to boost the saliency map. Finally, the final saliency map was obtained by effectively integrating the two maps. Experimental results demonstrate that the features are robust, effective and simple to understand. The introduced solutions connect feature analysis with vision tasks, where a separable structure is applied as a reliable support for feature analysis. The algorithms show superior performance for these applications, and visualisation examples have been given throughout this chapter to aid future implementation.

References Achanta, R., Hemami, S., & Estrada, F., et al. (2009). Frequency-tuned salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1597–1604), IEEE. Achanta, R., Shaji, A., Smith, K., et al. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis & Machine Intelligence, 34(11), 2274. Aharon, M., Elad, M., & Bruckstein, A. (2006). rmK-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. In IEEE Transactions on Signal Processing (pp. 4311–4322). Bahy, R. M., Salama, G. I., & Mahmoud, T. A. (2014). Adaptive regularisation-based super resolution reconstruction technique for multi-focus low-resolution images. Signal Processing, 103(1), 155–167. Barnes, C., Shechtman, E., & Finkelstein, A., et al. (2009). Patch Match: A randomised correspondence algorithm for structural image editing. ACM, 1–11. Barnes, C., Shechtman, E., Dan, B. G., et al. (2010). The generalised PatchMatch correspondence algorithm. In European Conference on Computer Vision Conference on Computer Vision (pp. 29–43). Springer–Verlag. Buyssens, P., Gardin, I., Su, R., et al. (2014). Eikonal-based region growing for efficient clustering. Image & Vision Computing, 32(12), 1045–1054. Chakrabarti, A., Rajagopalan, A. N., & Chellappa, R. (2007). Super-resolution of face images using kernel PCA-based prior. IEEE Transactions on Multimedia, 9(4), 888–892.


Chang, G.,Yeung, D. Y., & Xiong, Y. (2004). Super-resolution through neighbour embedding. In Computer Vision and Pattern Recognition (CVPR). Proceedings of the 2004 IEEE Computer Society Conference, IEEE Xplore (Vol. 1, pp. 275–282). Chen, L., Deng, L., Shen, W., et al. (2016). Reproducing kernel hilbert space based single infrared image super resolution. Infrared Physics & Technology, 77, 104–113. Cheng, M. M., Zhang, G. X., & Mitra, N. J., et al. (2011) Global contrast based salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 409–416). Chuang, C. H., Tsai, L. W., & Deng, M. S., et al. (2014). Vehicle licence plate recognition using super-resolution technique. In IEEE International Conference on Advanced Video and Signal Based Surveillance (pp. 411–416). Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence, 24(5), 603–619. Erdem, E., & Erde, A. (2013). Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision, 13(4), 11. Freeman, W. T., & Pasztor, E. C. (2000). Learning low-level vision. International Journal of Computer Vision, 40(1), 25–47. Guo, C., & Zhang, L. (1988). A Novel Multiresolution Spatiotemporal saliency detection model and its applications in image and video compression. Oncogene, 3(5), 523. Hardie, R. C., Barnard, K. J., & Armstrong, E. E. (1997). Joint MAP registration and high-resolution image estimation using a sequence of undersampled images. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 6(12), 1621–1633. Hou, C., & Zhang, L. (2007). Saliency detection: A spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition(CVPR) ’07 (pp. 1–8), IEEE. Hou, X., & Zhang, L. (2009). Dynamic visual attention: Searching for coding length increments. In Conference on Neural Information Processing Systems (pp. 681–688). Huang, J. B., Kang, S. B., Ahuja, N., et al. (2014). Image completion using planar structure guidance. Acm Transactions on Graphics, 33(4), 129. Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence, 20(11), 1254–1259. Karacan, L., Erdem, E., & Erdem, A. (2013). Structure-preserving image smoothing via region covariances. Acm Transactions on Graphics, 32(6), 1–11. Levinshtein, A., Stere, A., Kutulakos, K. N., et al. (2009). TurboPixels: Fast superpixels using geometric flows. IEEE Transactions on Pattern Analysis & Machine Intelligence, 31(12), 2290. Li, X., Li, Y., & Shen, C., et al. (2013). Contextual hypergraph modeling for salient object detection. In IEEE International Conference on Computer Vision (pp. 3328–3335), IEEE. Liu, T., Yuan, Z., Sun, J., et al. (2010). Learning to detect a salient object. IEEE Transactions on Pattern Analysis & Machine Intelligence, 33(2), 353–367. Liu, M. Y., Tuzel, O., & Ramalingam, S., et al. (2011). Entropy rate superpixel segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2097–2104), IEEE. Liu, J., Dai, S., Guo, Z., et al. (2016). An improved POCS super-resolution infrared image reconstruction algorithm based on visual mechanism. Infrared Physics & Technology, 78, 92–98. Ma, Y. F., & Zhang, H. J. (2003). Contrast-based image attention analysis by using fuzzy growing. 
In Eleventh ACM International Conference on Multimedia, Berkeley, CA, USA (pp. 374–381), November, DBLP. Mahadevan, V., & Vasconcelos, N. (2013). Biologically inspired object tracking using centresurround saliency mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3), 541–554. Michael, V. D. B., Boix, X., & Roig, G., et al. (2012). SEEDS: superpixels extracted via energydriven sampling. In European Conference on Computer Vision (pp. 13–26), Springer, Berlin, Heidelberg. Moore, A. P., Prince, S. J. D., & Warrell, J., et al. (2008). Superpixel lattices. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8), IEEE. Neubert, P., & Protzel, P. (2012). Superpixel benchmark and comparison. Forum Bildverarbeitung.


Perazzi, F., Krähenbühl, P., & Pritch, Y., et al. (2012). Saliency filters: Contrast based filtering for salient region detection. IEEE Conference on Computer Vision and Pattern Recognition (pp. 733–740). IEEE Computer Society. Qi, W., Han, J., & Zhang, Y., et al. (2017). Infrared image super-resolution via transformed selfsimilarity. Infrared Physics & Technology, 89–96, Elsevier. Ran, M., Zelnik–Manor, L., & Tal, A. (2014). How to evaluate foreground maps. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255), IEEE. Ren, M. (2003). Learning a classification model for segmentation. IEEE International Conference on Computer Vision (p. 10), IEEE Computer Society. Ren, C. Y., & Reid, I. (2011). gSLIC: A real-time implementation of SLIC superpixel segmentation. University of Oxford. Schölkopf, B., Platt, J., & Hofmann, T. (2007). Graph-based visual saliency (pp. 545–552).MIT Press. Seo, H. J., & Milanfar, P. (2009). Nonparametric bottom-up saliency detection by self-resemblance. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. CVPR Workshops 2009 (pp. 45–52), IEEE. Seo, H. J., & Milanfar, P. (2010). Training-free, generic object detection using locally adaptive regression kernels. IEEE Transactions on Pattern Analysis & Machine Intelligence, 32(9), 1688. Seo, H. J., & Milanfar, P. (2011). Action recognition from one example. IEEE Transactions on Pattern Analysis & Machine Intelligence, 33(5), 867–882. Sheikh, H. R., Bovik, A. C., & Veciana, G. D. (2005). An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing, 14(12), 2117–2128. Shi, J., & Malik, J. (2000). Normalised cuts and image segmentation. IEEE Computer Society. Sui, X., Chen, Q., Gu, G., et al. (2014). Infrared super-resolution imaging based on compressed sensing. Infrared Physics & Technology, 63(11), 119–124. Takeda, H., Farsiu, S., & Milanfar, P. (2007). Kernel regression for image processing and reconstruction. IEEE Transactions on Image Processing, 16(2), 349–366. Tom, B. C., & Katsaggelos, A. K. (1995). Reconstruction of a high-resolution image by simultaneous registration, restoration, and interpolation of low-resolution images. In IEEE International Conference on Image Processing (Vol. 2, pp. 539–542). Treiaman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12(1), 97–136. Tuzel, O., Porikli, F., & Meer, P. (2006) Region covariance: A fast descriptor for detection and classification. Computer Vision—ECCV 2006, European Conference on Computer Vision, Graz, Austria (pp. 589–600), May 7–13, 2006, DBLP. Ur, H., & Gross, D. (1992). Improved resolution from subpixel shifted pictures. Cvgip Graphical Models & Image Processing, 54(2), 181–186. Vincent, L., & Soille, P. (1991). Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis & Machine Intelligence, 13(6), 583–598. Wang, Z., Bovik, A. C., Sheikh, H. R., et al. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 13(4), 600–612. Wang, S., Zhang, L., & Liang, Y., et al. (2012a). Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2216–2223). Wang, X., Lv, G., & Xu, L. (2012b). 
Infrared dim target detection based on visual attention. Infrared Physics & Technology, 55(6), 513–521. Wang, X., Ning, C., & Xu, L. (2015). Saliency detection using mutual consistency-guided spatial cues combination. Infrared Physics & Technology, 72, 106–116.


Yang, J., Wright, J., & Huang, T., et al. (2008) Image super-resolution as sparse representation of raw image patches. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8). Yang, J., Wright, J., Huang, T. S., et al. (2010). Image super-resolution via sparse representation. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 19(11), 2861–2873. Yang, J., Lin, Z., & Cohen, S. (2013). Fast image super-resolution based on in-place example regression. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1059–1066). Yang, C., Zhang, L., & Lu, H., et al. (2013). Saliency detection via graph-based manifold ranking. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 3166–3173). Yang, C. Y., Ma, C., & Yang, M. H. (2014). Single-image super-resolution: A benchmark. In European Conference on Computer Vision (pp. 372–386). Springer, Cham. Zeyde, R., Elad, M., & Protter, M. (2012). On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces (pp. 711–730), Springer Berlin Heidelberg. Zhai, Y., & Shah, M. (2006). Visual attention detection in video sequences using spatiotemporal cues. In ACM International Conference on Multimedia ACM (815–824). Zhao, Y., Chen, Q., Sui, X., et al. (2015). A novel infrared image super-resolution method based on sparse representation. Infrared Physics & Technology, 71, 506–513. Zhou, F., Kang, S. B., & Cohen, M. F. (2014). Time-mapping using space-time saliency. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 3358–3365). IEEE Computer Society. Zhu, Y. M. (1992). Generalised sampling theorem. IEEE Transactions on Circuits & Systems II Analog & Digital Signal Processing, 39(8), 587–588. Zhu, S., Cao, D., Wu, Y., et al. (2016). Improved accuracy of superpixel segmentation by region merging method. Frontiers of Optoelectronics, 4, 633–639. Zoran, D., Weiss, Y. (2011). From learning models of natural image patches to whole image restoration. In International Conference on Computer Vision (pp. 479–486). IEEE Computer Society, 2011.

Chapter 4

Feature Classification Based on Manifold Dimension Reduction for Night-Vision Images

To improve the speed of data processing and the accuracy of data classification, it is necessary to apply dimensionality reduction methods. This chapter introduces dimension reduction methods based on manifold learning. First, to improve classification accuracy using sample class information, corresponding samples are chosen via data similarity to construct the intra-class and inter-class scatter matrices. Second, to solve the parameter-selection problem of LPP, the correlation coefficient is used to adaptively construct a label graph that characterises the discriminative information of different manifolds and a local relation graph that characterises the distribution of each manifold. Finally, a kernel maximum likelihood (KML) similarity measure is defined to calculate the outlier probability of high-dimensional data and detect outliers. The kernel LLE (KLLE) is weighted based on the KML measure, which chooses the best neighbours to generate a precise mapping from high-dimensional night-vision data to a low-dimensional space.

4.1 Methods of Data Reduction and Classification

4.1.1 New Adaptive Supervised Manifold Learning Algorithms

Feature extraction is an important method for extracting useful information from original high-dimensional data. In many scientific fields, such as visualisation, computer vision and pattern recognition, feature extraction is a fundamental component (see Chung 1997). For example, feature extraction is key to completing face recognition tasks (see Jin et al. 2001). Many feature extraction methods have been proposed, including linear and nonlinear algorithms, and supervised and unsupervised algorithms. Among these, principal component analysis (PCA) (Wold et al. 1987; Martinez and Kak 2001) and linear discriminant analysis (LDA) (Fukunaga 1990; Yu and Yang 2001) are two representative algorithms for linear feature extraction.


PCA projects the original samples into a low-dimensional subspace spanned by the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of all input samples; it is the optimal representation of the input samples in the sense of minimising the mean-squared error. Because PCA ignores class label information, it is a completely unsupervised algorithm. Unlike PCA, LDA is a supervised learning algorithm. The essential idea of LDA is to choose, as the optimal projection, a direction that maximises the Fisher criterion, i.e. the ratio of the trace of the between-class scatter matrix to the trace of the within-class scatter matrix. However, both PCA and LDA consider the global Euclidean structure instead of the manifold structure of the original samples. Recent studies have shown that face images possibly reside on nonlinear sub-manifolds (see Tenenbaum et al. 2000; Roweis and Saul 2000; Belkin and Niyogi 2003; Niyogi 2004), wherein faces of different people reside on different manifolds. Many manifold learning algorithms have been proposed for discovering the intrinsic low-dimensional sub-manifold embedded in the original high-dimensional space; ISOMAP (Tenenbaum et al. 2000), local linear embedding (LLE) (Roweis and Saul 2000), LE (Belkin and Niyogi 2003) and locality preserving projections (LPP) (Niyogi 2004) are among the most famous. Experiments showed that these algorithms could meaningfully embed artificial and real-world datasets, such as face images; however, evaluating their effectiveness remains difficult. ISOMAP, LLE, LE and LPP are unsupervised manifold learning algorithms, which cannot use class label information. Thus, many supervised manifold learning algorithms have been proposed. Li et al. proposed constrained maximum variance mapping (CMVM) in Li et al. (2008). They used the within-class scatter matrix to characterise the local structure of each manifold and the between-class scatter matrix to characterise the discriminant information between different manifolds. Using the Fisher criterion, the local structure of each manifold is maintained and the discriminant information between different manifolds is explored. However, CMVM ignores the class label information when using the within-class scatter matrix to characterise the local structure of each manifold, and it ignores the influence of different class distances when using the between-class scatter matrix to characterise the discriminant information between different manifolds. Yan et al. proposed MFA in Yan et al. (2007). They designed an intrinsic graph to characterise within-class compactness and a penalty graph to characterise between-class discreteness. The intrinsic graph describes the within-class sample relationship, connecting each sample to its k-nearest neighbours of the same class; the penalty graph describes the between-class relationship through its marginal point pairs. However, in MFA it is difficult to select the number of nearest neighbours for each sample, because the influence of different class distances is ignored. W. Yang et al. proposed multi-manifold discriminant analysis (MMDA) in Yang et al. (2011).
MMDA uses the sum of the within-class weight matrix to weigh the mean vectors of classes and to regard the weighted mean vectors as new samples. They seek an optimal projection direction to maximise the ratio


of the trace of the between-class scatter matrix to the trace of the within-class scatter matrix. However, using the weighted mean vectors of the classes to characterise the samples produces a deviation when the distribution of the samples is irregular, and this deviation becomes more pronounced the more irregular the distribution is. In this chapter, a new supervised manifold algorithm based on CMVM is introduced. First, the within-class scatter matrix is used to characterise the local structure of each manifold, so the original local structure information is kept. Second, the between-class scatter matrix is used to characterise the discriminant information between different manifolds. Finally, the within-class and between-class scatter matrices are used to establish the objective function. The introduced algorithm inherits the merits of CMVM and overcomes its shortcomings. Compared to CMVM, it has two advantages. First, when establishing the within-class scatter matrix, class label information is considered and only samples belonging to the same class are used; thus, the interference caused by different classes is reduced, conforming better to the local linearity hypothesis. Second, when establishing the between-class scatter matrix, samples that more accurately reflect the distribution of the different classes are selected; the matrix therefore characterises the discreteness of the different classes more faithfully and avoids the original algorithm's neglect of different class distances. Moreover, most algorithms require artificial parameters that influence their performance, and all samples share the same parameter values. Thus, a new adaptive and parameterless LPP (APLPP) is also proposed, which requires no manually set parameters. We use the correlation coefficient to adaptively construct a local relation graph that characterises the distribution of each manifold, and to adaptively construct a label graph that characterises the discriminative information of different manifolds. Extensive experiments on UCI and face image datasets demonstrate the performance of the proposed APLPP.

4.1.2 Kernel Maximum Likelihood-Scaled LLE for Night-Vision Images

With the development of night-vision technology (i.e. LLL and infrared imaging), we need to represent high-dimensional night-vision data in low-dimensional spaces while preserving its intrinsic properties, to facilitate subsequent recognition and visualisation. Many manifold dimension reduction algorithms with appealing performance (e.g. LLE, ISOMAP and LPP) have been introduced (Choi and Choi 2007; Wen 2009; Chen and Ma 2011; Chang 2006; Li 2012; Li et al. 2011) to discover the intrinsic structures in night-vision data. However, heavy noise in LLL images and low contrast in infrared images can produce outliers, leading to inaccurate embeddings. To reduce outlier interference, two types of solutions have been developed. The first is preprocessing procedures that detect outliers and filter noisy data (see Chang 2006; Pan et al. 2009). The second is improving the local distance metric


to fit the classification property (see Shuzhi et al. 2012; Zhao and Zhang 2009). Nevertheless, the intrinsic manifold structures of night-vision imaging usually have a high degree of complexity and randomness, causing the obtained classification results to deviate from the actual situation when the class distribution is discrete or the selection of samples is not representative (see Zhang 2009; Li and Zhang 2011). Thus, automatic unsupervised manifold methods should be studied to produce better similarity estimation that corresponds to the classification properties. Similarity estimates discovered from the statistical regularities of large datasets provide effective measures for the selection of neighbours and the inhibition of outliers. This chapter proposes a robust manifold dimension reduction and classification algorithm to combat deformed distributed data, using the KML-scaled LLE (KLLE). A KML similarity metric is presented to detect outliers and select neighbourhoods, producing an accurate mapping of high-dimensional night-vision data. Compared to Chang (2006), Pan et al. (2009) and Ge et al. (2012), the KML similarity emphasises the local manifold statistical distribution to evaluate the outlier probability, instead of computing the distance among data. Moreover, an outlier-probability-scaled KLLE method is proposed to suppress outliers in LLL and infrared images.

4.2 A New Supervised Manifold Learning Algorithm for Night-Vision Images

4.2.1 Review of LDA and CMVM

Suppose the samples are $X = [x_1, x_2, \ldots, x_N] \in R^m$ and the number of classes is $C$. The number of samples belonging to class $C_i$ is $n_i$, where $1 \le i \le C$. The corresponding samples after feature extraction are $Y = [y_1, y_2, \ldots, y_N] \in R^d$, where $d \ll m$. The optimal projection matrix is $A$ and $Y = A^T X$. Because the dimension of the original data is very high, the purpose of dimensionality reduction is to project the original data into a meaningful low-dimensional space of dimension $d$.

Introduction of LDA
LDA is a widely used algorithm for feature extraction and dimensionality reduction. The goal of LDA is to find a projection direction that maximises the Fisher criterion (i.e. the ratio of the trace of the between-class scatter matrix to the trace of the within-class scatter matrix). The between-class scatter matrix, $S_b$, and the within-class scatter matrix, $S_w$, of LDA are shown in Eqs. (4.1) and (4.2), respectively:

$$S_b = \frac{1}{N} \sum_{i=1}^{C} n_i\, (m_i - m_0)^T (m_i - m_0), \quad (4.1)$$

$$S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{x_i \in c_i} (x_i - m_i)^T (x_i - m_i), \quad (4.2)$$


where the mean vector of class $c_i$ is denoted by $m_i$ and the mean vector of all samples is denoted by $m_0$. The objective function of LDA is shown in Eq. (4.3):

$$J(W) = \arg\max_{W} \frac{W^T S_b W}{W^T S_w W}. \quad (4.3)$$

The objective function can be solved via the generalised eigenvalue problem shown in Eq. (4.4):

$$S_b W = \lambda S_w W. \quad (4.4)$$

We sort the eigenvalues $\lambda$ in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{m-1} \ge \lambda_m$. The optimal projection matrix, $W$, comprises the eigenvectors corresponding to the first $d$ largest eigenvalues.

Introduction of CMVM
CMVM is a supervised manifold learning algorithm. During its calculation, the within-class and between-class scatter matrices are used to characterise the local structure of each manifold and the discreteness of different manifolds, respectively; a projection direction is then sought that maximises the ratio of the trace of the between-class scatter matrix to the trace of the within-class scatter matrix. The calculation of CMVM comprises two steps. First, the within-class weight matrix, $Wc$, is established to characterise the local structure; this step is like LPP. Then, the between-class weight matrix, $Hc$, is established to characterise the discriminant information based on class label information. $Wc$ and $Hc$ are shown in Eqs. (4.5) and (4.6), respectively:

$$Wc_{ij} = \begin{cases} 1 & x_i \text{ is among the } k \text{ nearest neighbours of } x_j \\ 0 & \text{otherwise} \end{cases}$$

or

$$Wc_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{\beta}\right) & x_i \text{ is among the } k \text{ nearest neighbours of } x_j \\ 0 & \text{otherwise} \end{cases}, \quad (4.5)$$

$$Hc_{ij} = \begin{cases} 0 & \text{if samples } x_i \text{ and } x_j \text{ have the same class label} \\ 1 & \text{else} \end{cases}. \quad (4.6)$$

In Eq. (4.5), $k$ and $\beta$ are constants. The objective function of CMVM is to minimise $J_{Lc}$ and to maximise $J_{Dc}$; their expressions are shown in Eqs. (4.7) and (4.8), respectively:

$$J_{Lc} = 2A^T X (Dc - Wc) X^T A, \quad Dc_{ij} = \sum_{j}^{N} Wc_{ij}, \quad (4.7)$$


$$J_{Dc} = 2A^T X (Qc - Hc) X^T A, \quad Qc_{ij} = \sum_{j}^{N} Hc_{ij}. \quad (4.8)$$

Like LPP, the objective function of CMVM can be solved via the generalised eigenvalue problem shown in Eq. (4.9):

$$X (Qc - Hc) X^T A = \lambda X (Dc - Wc) X^T A. \quad (4.9)$$

Then, we sort the eigenvalues $\lambda$ in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{m-1} \ge \lambda_m$. The optimal projection matrix, $A$, comprises the eigenvectors corresponding to the first $d$ largest eigenvalues.
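Both Eq. (4.4) and Eq. (4.9) are generalised eigenvalue problems of the form S₁a = λS₂a. The sketch below is our own illustration of how such a problem can be solved once the two scatter matrices have been formed; the small ridge term is an assumption added to keep the right-hand matrix non-singular:

```python
import numpy as np
from scipy.linalg import eigh

def solve_generalised_eig(s1, s2, d, ridge=1e-6):
    """Return the d eigenvectors of s1 a = lambda s2 a with the largest eigenvalues.

    s1, s2 : symmetric (m, m) matrices (e.g. between- and within-class scatter).
    ridge  : small regulariser added to s2 to avoid singularity (an assumption).
    """
    s2 = s2 + ridge * np.eye(s2.shape[0])
    eigvals, eigvecs = eigh(s1, s2)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]    # keep the d largest
    return eigvecs[:, order]                 # columns form the projection matrix
```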

4.2.2 Introduction of the Algorithm

Motivation
From the above introduction, we conclude that CMVM has two obvious shortcomings. First, when establishing the within-class weight matrix, CMVM ignores the class label information; the nearby samples of one sample may therefore contain samples belonging to other classes, so the local linearity hypothesis is violated and classification is impacted. Second, when establishing the between-class weight matrix, CMVM ignores the distances between different classes and regards all samples belonging to other classes as nearby samples. In many manifold learning algorithms, Euclidean distances are used to measure the discreteness, and different distances have different influences on the results. We therefore adopt the hypothesis that the influence of a class depends on its distance: if a class is far from another class, its influence is weaker; if it is closer, its influence is stronger. Experimental results support this hypothesis, which CMVM does not exploit. To overcome these shortcomings, a new algorithm based on CMVM is introduced. First, when establishing the within-class matrix, only samples belonging to the same class are selected, so the local linearity hypothesis is preserved. Second, when establishing the between-class matrix, KNN is used to select the K1 nearest samples belonging to other classes; suppose these samples belong to several classes whose sample counts are n1, n2, ..., nC. We sort n1, n2, ..., nC in ascending order as n1', n2', ..., nC' and select the first Kn smallest ni' until their sum reaches T% of the total. The classes contributing these ni' are the nearby classes of the sample, and all samples belonging to these classes are taken as its nearby samples. The distances of these samples then have a rational scale and reflect the distribution of the nearby manifolds well. The between-class scatter matrix established from these samples characterises the discreteness of the different classes more accurately, which benefits classification.


Implementation Process
A method similar to that of CMVM is used to establish the within-class weight matrix, but the class label information is considered. The within-class weight matrix, $W$, is shown in Eq. (4.10):

$$W_{ij} = \begin{cases} 1 & x_i \text{ is among the } k \text{ nearest neighbours of } x_j \text{ and } x_i, x_j \text{ have the same label} \\ 0 & \text{otherwise} \end{cases}. \quad (4.10)$$

The within-class scatter matrix, $J_1$, is shown in Eq. (4.11):

$$J_1 = \sum_{i,j}^{N} (Y_i - Y_j)(Y_i - Y_j)^T W_{ij} = 2A^T X (D - W) X^T A, \quad D_{ij} = \sum_{j}^{N} W_{ij}. \quad (4.11)$$

For every sample $x_i$, the following procedure is used to select its nearby samples. First, the K1 nearest samples of $x_i$ from other classes are selected. Second, the classes they belong to are counted and the numbers of samples belonging to these classes are recorded as n1, n2, ..., nC. We then sort n1, n2, ..., nC in ascending order, recording the new order as n1', n2', ..., nC'. Third, the percentage of the sum of the first Kn values ni' relative to the total sum is calculated until the ratio reaches the given threshold T, which is set to 60%. When the sum reaches T, Cinc records these first Kn classes, and all samples belonging to them are selected as nearby samples of $x_i$. These samples are then used to establish the between-class weight matrix, $H$:

$$H_{ij} = \begin{cases} 1 & \text{if the label of } x_j \text{ is included in } Cinc \\ 0 & \text{otherwise} \end{cases}. \quad (4.12)$$

The between-class scatter matrix, $J_2$, is shown in Eq. (4.13):

$$J_2 = \sum_{i,j}^{N} (Y_i - Y_j)(Y_i - Y_j)^T H_{ij} = 2A^T X (Q - H) X^T A, \quad Q_{ij} = \sum_{j}^{N} H_{ij}. \quad (4.13)$$

The objective function of the introduced algorithm is shown in Eq. (4.14):

$$J(A) = \arg\max \frac{A^T X (Q - H) X^T A}{A^T X (D - W) X^T A}. \quad (4.14)$$

Like LPP, this objective function can be solved via the generalised eigenvalue problem shown in Eq. (4.15):

$$X (Q - H) X^T A = \lambda X (D - W) X^T A. \quad (4.15)$$

We sort the eigenvalues $\lambda$ in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{m-1} \ge \lambda_m$. The optimal projection matrix, $A$, is composed of the eigenvectors corresponding to the first $d$ largest eigenvalues.
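The following minimal sketch is our own illustration of how the two weight matrices of Eqs. (4.10) and (4.12) can be built, using K, K1 = 3K and T = 60% as in the text; tie handling and other implementation details are assumptions:

```python
import numpy as np

def build_graphs(X, labels, K=10, T=0.6):
    """Build the within-class graph W (Eq. 4.10) and between-class graph H (Eq. 4.12).

    X      : (N, m) sample matrix, one sample per row.
    labels : (N,) integer class labels.
    """
    N = X.shape[0]
    K1 = 3 * K
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((N, N))
    H = np.zeros((N, N))
    for i in range(N):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        # Eq. (4.10): k nearest neighbours within the same class.
        for j in same[np.argsort(dist[i, same])][:K]:
            W[i, j] = 1
        other = np.where(labels != labels[i])[0]
        near = other[np.argsort(dist[i, other])][:K1]     # K1 nearest samples from other classes
        cls, counts = np.unique(labels[near], return_counts=True)
        order = np.argsort(counts)                        # ascending n1' <= n2' <= ...
        cum = np.cumsum(counts[order]) / counts.sum()
        kept = cls[order][:np.searchsorted(cum, T) + 1]   # smallest classes until the sum reaches T
        # Eq. (4.12): all samples of the selected nearby classes.
        H[i, np.isin(labels, kept)] = 1
    return W, H
```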


4.2.3 Experiments

Because the method involves the inverse of the within-class scatter matrix, this matrix becomes singular when the number of samples is smaller than the sample dimension. To avoid this small-sample-size (SSS) problem, PCA is conducted prior to feature extraction, and principal components preserving nearly 95% of the energy are selected. In the algorithm, when the nearby samples are selected, K is set to 10 if the number of training samples, n_train, is larger than 10; otherwise, K is set to n_train - 1. Additionally, K1 is set to 3K when selecting the samples used to establish the between-class weight matrix. In MFA, when establishing the between-class weight matrix, K is set to 10 if the number of training samples, n_train, is greater than 10; otherwise, K is set to n_train - 1. The AR dataset, an infrared dataset and the pose, illumination and expression (PIE) dataset are used to evaluate the introduced algorithm against PCA, LDA, LPP, CMVM, MFA and MMDA. All algorithms were run 10 times, and the mean training time was used as the training time of each algorithm. The experiments were implemented on a desktop PC with an i5 3470 CPU and 8 GB of RAM using MATLAB (R2014a).

Experiment on the AR dataset
The AR dataset of Martinez (1998) contains over 4000 images of 126 people with different facial expressions and lighting conditions. A subset of the AR dataset is selected, including 1400 images of 100 people (14 images each). All samples are resized to 60 × 43. In the experiment, the first L images per person are selected for training and the others are used for testing. Finally, a nearest-neighbour classifier with Euclidean distance is used to classify the extracted features. The recognition rates and training times are shown in Tables 4.1 and 4.2, respectively. The number in brackets is the sample dimension at which the recognition rate reached its highest value (Figs. 4.1 and 4.2).

Experiment on the infrared dataset
The infrared dataset contains 1400 images of five people and two cups, each with different rotational angles. All images were captured using an infrared camera. Each person or cup has 200 images, and all samples are resized to 40 × 40. In the experiment, the first L images are selected for training and the others for testing. Finally, a nearest-neighbour classifier with Euclidean distance is used to classify the extracted features. The recognition rates and training times are shown in Tables 4.3 and 4.4, respectively. The number in brackets is the sample dimension at which the recognition rate reached its highest value (Figs. 4.3 and 4.4).
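The evaluation protocol described above (PCA preserving roughly 95% of the energy, projection with the learned matrix, then a nearest-neighbour classifier with Euclidean distance) can be sketched as follows; this is our own Python illustration of the protocol, not the original MATLAB code:

```python
import numpy as np

def pca_95(X_train, energy=0.95):
    """Return the mean and a PCA projection keeping about `energy` of the variance."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(ratio, energy)) + 1
    return mean, Vt[:k].T                     # (m,) mean and (m, k) projection

def nn_classify(train_feat, train_labels, test_feat):
    """1-NN classification with Euclidean distance on the extracted features."""
    d = np.linalg.norm(test_feat[:, None, :] - train_feat[None, :, :], axis=2)
    return train_labels[np.argmin(d, axis=1)]

# Typical use (A is the projection learned by the introduced algorithm):
# mean, P = pca_95(X_train)
# Z_train = (X_train - mean) @ P @ A
# Z_test  = (X_test  - mean) @ P @ A
# pred = nn_classify(Z_train, y_train, Z_test)
```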

Table 4.1 Recognition rates of the AR dataset by different methods

         PCA          LDA          LPP          MFA          CMVM         Ours
L = 7    0.6700(61)   0.6543(26)   0.8014(73)   0.8257(72)   0.7800(69)   0.8071(72)
L = 8    0.7900(64)   0.8950(18)   0.9200(58)   0.9650(74)   0.9133(71)   0.9783(73)
L = 9    0.7760(67)   0.9440(19)   0.9220(60)   0.9620(74)   0.9040(60)   0.9760(73)
L = 10   0.7575(75)   0.9375(73)   0.9375(73)   0.9600(73)   0.9350(71)   0.9825(69)
L = 11   0.7467(75)   0.9700(19)   0.9833(73)   0.9900(73)   0.9833(59)   1.00(71)

Table 4.2 Average training time on the AR dataset (s)

         PCA     LDA     LPP      MFA     CMVM     Ours
L = 7    2.12    1.51    0.042    1.00    0.066    0.16
L = 8    2.57    1.87    0.053    1.21    0.084    0.19
L = 9    3.24    2.34    0.068    1.46    0.10     0.22
L = 10   3.82    2.81    0.083    1.72    0.13     0.26
L = 11   4.74    3.48    0.10     2.01    0.15     0.29

Reprinted from Zhao et al. (2015), with permission of Springer

Fig. 4.1 Images of two people in the AR dataset. Reprinted from Zhao et al. (2015), with permission of Springer

Fig. 4.2 Recognition rates of all algorithms on the AR dataset. Reprinted from Zhao et al. (2015), with permission of Springer

Table 4.3 Recognition rates on the infrared dataset by different methods

          PCA          LDA          LPP          MFA          CMVM         Ours
L = 80    0.6452(8)    0.6548(4)    0.7464(20)   0.7036(14)   0.6667(20)   0.7500(3)
L = 100   0.6071(3)    0.6143(3)    0.6414(9)    0.6743(13)   0.6643(20)   0.7500(4)
L = 120   0.6268(14)   0.6250(6)    0.6804(18)   0.7357(20)   0.7000(13)   0.7446(20)
L = 140   0.6190(12)   0.6452(1)    0.6976(2)    0.6929(19)   0.6810(19)   0.8214(5)
L = 150   0.5657(11)   0.6400(1)    0.7114(18)   0.7914(13)   0.7514(7)    0.8371(3)

Reprinted from Zhao et al. (2015), with permission of Springer


Table 4.4 Average training time on the infrared dataset (s)

          PCA     LDA     LPP      MFA      CMVM     Ours
L = 80    0.32    0.29    0.023    0.079    0.036    0.11
L = 100   0.53    0.47    0.036    0.12     0.056    0.15
L = 120   0.81    0.78    0.053    0.17     0.081    0.19
L = 140   1.17    1.07    0.08     0.20     0.14     0.23
L = 150   1.43    1.30    0.08     0.23     0.13     0.28

Reprinted from Zhao et al. (2015), with permission of Springer

Fig. 4.3 Sample images of two people and one cup in the infrared dataset. Reprinted from Zhao et al. (2015), with permission of Springer

Fig. 4.4 Recognition rates of all algorithms on the infrared dataset. Reprinted from Zhao et al. (2015), with permission of Springer


Table 4.5 Recognition rates on the PIE dataset using different methods

         PCA          LDA          LPP          MFA          CMVM         Ours
L = 30   0.3876(64)   0.4365(21)   0.5132(59)   0.4997(35)   0.4991(50)   0.6068(59)
L = 40   0.4978(60)   0.4897(19)   0.5746(57)   0.5879(5)    0.5566(50)   0.7022(59)
L = 50   0.5853(61)   0.5441(23)   0.5907(57)   0.6696(59)   0.6211(58)   0.7216(59)
L = 60   0.5772(61)   0.6140(18)   0.6890(59)   0.7404(59)   0.6816(54)   0.8162(59)
L = 70   0.8191(61)   0.8559(20)   0.9147(53)   0.9235(59)   0.9250(47)   0.9632(59)

Reprinted from Zhao et al. (2015), with permission of Springer

Table 4.6 Average training time on the PIE dataset (s)

         PCA      LDA      LPP     MFA      CMVM    Ours
L = 30   5.17     4.43     0.33    4.17     0.52    0.75
L = 40   9.77     9.02     0.58    6.88     0.90    1.19
L = 50   16.59    15.36    0.91    10.20    1.42    1.74
L = 60   27.70    25.21    1.33    14.29    2.12    2.45
L = 70   39.68    38.09    1.81    18.96    2.86    3.19

Reprinted from Zhao et al. (2015), with permission of Springer

Fig. 4.5 Sample images of two people from the PIE dataset. Reprinted from Zhao et al. (2015), with permission of Springer

Experiment on the PIE dataset
The PIE dataset in Sim et al. (2003) and Sim et al. (2001) contains 41,368 images of 68 people with different facial expressions and lighting conditions. These images were captured by 13 synchronised cameras and 21 flashes under varying poses, illuminations and expressions. 80 images per person were selected and all samples were resized to 32 × 32. In the experiment, the first L images per person were selected for training, whereas the others were used for testing. Finally, a nearest-neighbour classifier with Euclidean distance was used to classify the extracted features. The recognition rates and training times are shown in Tables 4.5 and 4.6, respectively. The number in brackets is the sample dimension at which the recognition rate reached its highest value (Figs. 4.5 and 4.6). From the above results, we can see that the introduced algorithm achieves clearly higher recognition rates than the other algorithms; compared to CMVM, it uses slightly more training time.


Fig. 4.6 Recognition rates of all the algorithms on PIE dataset. Reprinted from Zhao et al. (2015), with permission of Springer

4.3 Adaptive and Parameterless LPP for Night-Vision Image Classification

4.3.1 Review of LPP

Suppose that the samples are $X = [x_1, x_2, \ldots, x_N] \in R^m$ and the number of classes is $C$. The number of samples belonging to each class is $n_i$, $i \le C$. The corresponding samples after feature extraction are $Y = [y_1, y_2, \ldots, y_N] \in R^d$, where $d < m$. The optimal projection matrix is $A$ and $Y = A^T X$.

Introduction of LPP
LPP is a widely used algorithm for feature extraction and dimensionality reduction. The goal of LPP is to keep the local structure of the original manifold while minimising the objective function. The calculation of LPP comprises three steps. First, we calculate the Euclidean distances between any two samples. Then, we construct the local relation graph, $W$, based on these distances and establish the objective function. Finally, we solve the objective function to obtain the optimal projection matrix, $A$. The objective function of LPP is

$$\min J_{LPP} = \frac{1}{2} \sum_{i,j}^{N} \|A^T x_i - A^T x_j\|^2 W_{ij} = A^T X (D - W) X^T A, \quad D_{ij} = \sum_{j}^{N} W_{ij}, \quad (4.16)$$

$$\text{s.t. } A^T X D X^T A = 1. \quad (4.17)$$

Wij is the local relation graph (see Roweis and Saul 2000). The objective function can be solved by the generalised eigenvalue problem, as shown in Eq. (4.18).

$$X (D - W) X^T A = \lambda X D X^T A. \quad (4.18)$$

We sort the eigenvalues $\lambda$ in ascending order: $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_{m-1} \le \lambda_m$. The optimal projection matrix, $A$, is composed of the eigenvectors corresponding to the first $d$ smallest eigenvalues.

4.3.2 Adaptive and Parameterless LPP (APLPP)

The original LPP has two shortcomings. First, it cannot use class label information when constructing the local relation graph, because it is an unsupervised algorithm. Second, parameters such as the neighbourhood size K must be set manually. To overcome these shortcomings, APLPP is introduced.

Motivation
APLPP uses as few artificial parameters as possible while separating different classes as much as possible and keeping the local structure as constant as possible. LDA uses the mean vector of each class and the total mean vector to characterise the discriminative information of different classes. However, because of the uncertain distribution of the data, the mean vector of each class cannot always reflect the data distribution of that class, so using it to characterise the discriminative information of different classes is sometimes incorrect. In CMVM, all samples of the other classes are taken as nearby samples indiscriminately; the real neighbourhoods are thus included, but unneeded samples are also taken and the result is influenced. In MMDA, the weighted mean vector of each class is used to characterise the corresponding class, so it cannot reflect the distribution of all samples accurately. In the real world, only a part of the samples from different classes are nearby, while the other samples are far from one another. Thus, only a part of the samples needs to be used to extract features for classification, and the experimental results bear this out.

Implementation of APLPP
Because real-world images contain noise, especially low-light-level and infrared images, directly using the Euclidean distance to characterise the similarity of samples may cause errors. To weaken this influence and achieve adaptive and parameterless operation, a correlation coefficient is used to characterise the similarity of samples. The correlation coefficient between samples $x_i$ and $x_j$ is defined as

$$P_{ij} = \frac{\text{cov}(x_i, x_j)}{\sqrt{D(x_i)\, D(x_j)}}. \quad (4.19)$$

In Eq. (4.19), $P_{ij}$ is the correlation coefficient between samples $x_i$ and $x_j$, $\text{cov}(x_i, x_j)$ is the covariance between $x_i$ and $x_j$, and $D(x_i)$ and $D(x_j)$ are the variances of $x_i$ and $x_j$, respectively.
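A small numerical sketch (our own illustration, not part of the original method) of Eq. (4.19), which also previews the robustness to additive noise analysed in the following equations:

```python
import numpy as np

def corr(xi, xj):
    """Correlation coefficient of Eq. (4.19): cov(xi, xj) / sqrt(D(xi) D(xj))."""
    xi = xi - xi.mean()
    xj = xj - xj.mean()
    return (xi @ xj) / (np.sqrt((xi @ xi) * (xj @ xj)) + 1e-12)

rng = np.random.default_rng(0)
x1 = rng.normal(size=1600)                    # e.g. a flattened 40 x 40 image (assumed data)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=1600)
noise_level = 0.05                            # noise variance far below the signal variance
print(corr(x1, x2))                                              # clean correlation
print(corr(x1 + noise_level * rng.normal(size=1600),
           x2 + noise_level * rng.normal(size=1600)))            # nearly unchanged
```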


If we add noise to the original samples $x_i$ and $x_j$, the correlation coefficient between them becomes

$$P'_{ij} = \frac{\text{cov}(x_i + \Delta x_i,\; x_j + \Delta x_j)}{\sqrt{D(x_i + \Delta x_i)\, D(x_j + \Delta x_j)}}, \quad (4.20)$$

where $\Delta x_i$ and $\Delta x_j$ denote the noise added to samples $x_i$ and $x_j$. Generally, the noise is uncorrelated with the original signal and with the other noise. Thus, the correlation coefficient between $x_i + \Delta x_i$ and $x_j + \Delta x_j$ can be expanded as in Eq. (4.21):

$$P'_{ij} = \frac{\text{cov}(x_i, x_j) + \text{cov}(x_i, \Delta x_j) + \text{cov}(x_j, \Delta x_i) + \text{cov}(\Delta x_i, \Delta x_j)}{\sqrt{\left[D(x_i) + D(\Delta x_i) + 2\,\text{cov}(x_i, \Delta x_i)\right]\left[D(x_j) + D(\Delta x_j) + 2\,\text{cov}(x_j, \Delta x_j)\right]}}$$
$$= \frac{\text{cov}(x_i, x_j)}{\sqrt{\left[D(x_i) + D(\Delta x_i)\right]\left[D(x_j) + D(\Delta x_j)\right]}} = \frac{\text{cov}(x_i, x_j)}{\sqrt{D(x_i) D(x_j)}\;\sqrt{\left(1 + \frac{D(\Delta x_i)}{D(x_i)}\right)\left(1 + \frac{D(\Delta x_j)}{D(x_j)}\right)}}. \quad (4.21)$$

 

D xj D(xi ) ∗ D xj D(xi ) 1 + D(xi ) ∗ 1 + D x D(xi ) ∗ D xj ∗ ( j) (4.22) From Eq. (4.22), we arrive at the conclusion that the correlation coefficient is robust to noise. The method used to construct the label graph contains two main steps: finding the nearby classes and finding the nearby samples between them.


Input: all training samples, X
Output: the nearby classes

Label = zeros(C, C)  // Record the relationship between any two classes; if class i and class j are nearby classes, Label(i, j) = 1
Calculate the correlation coefficient P between any two samples
For i = 1 to C do
    Calculate the mean correlation coefficient M between class i and all other classes
    For j = 1 to C do
        Calculate the mean correlation coefficient M1 between class i and class j
        If M1 >= M
            Label(i, j) = 1  // Class j is a nearby class of class i
        EndIf
    EndFor
EndFor

After finding all nearby classes, for any two nearby classes, such as C1 and C2, we use the following procedure to find the nearby samples.

Input: nearby classes C1 and C2
Output: the nearby samples

G = zeros(n1, n2)  // n1 is the number of samples belonging to C1 and n2 is the number of samples belonging to C2
freq_row = [], freq_col = []  // Record the occurrences of every sample
Calculate the mean correlation coefficient m1 between class C1 and class C2
For i = 1 to n1 do
    For j = 1 to n2 do
        If Pij >= m1
            G(i, j) = 1  // Record the relationship of all samples
            freq_row = [freq_row, i], freq_col = [freq_col, j]  // i and j are the row and column indices, respectively
        EndIf
    EndFor
EndFor
unique_freq_row = unique(freq_row), unique_freq_col = unique(freq_col)
// unique_freq_row and unique_freq_col contain the same values as freq_row and freq_col but with no repetitions
hist_freq_row = hist(freq_row), hist_freq_col = hist(freq_col)
// hist_freq_row and hist_freq_col are the occurrences of each element in unique_freq_row and unique_freq_col, respectively
T_row = sum(hist_freq_row) / numel(unique_freq_row), T_col = sum(hist_freq_col) / numel(unique_freq_col)
// T_row and T_col are the mean occurrences of all elements
For i = 1 to numel(unique_freq_row) do
    If hist_freq_row(i) >= T_row  // unique_freq_row(i) occurs at least T_row times
        unique_freq_row(i) is a nearby sample of class C2
    EndIf
EndFor
For i = 1 to numel(unique_freq_col) do
    If hist_freq_col(i) >= T_col  // unique_freq_col(i) occurs at least T_col times
        unique_freq_col(i) is a nearby sample of class C1
    EndIf
EndFor
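A compact Python rendering of the two procedures above; this is a sketch under assumed conventions (P is a precomputed correlation matrix between all samples and each class is given as a list of sample indices), and it returns within-class indices of the selected samples, which can be checked against the worked example that follows:

```python
import numpy as np

def nearby_classes(P, class_idx):
    """First procedure: mark class pairs whose mean cross-correlation exceeds
    the mean correlation to all other classes."""
    C = len(class_idx)
    label = np.zeros((C, C), dtype=int)
    for i in range(C):
        others = [j for j in range(C) if j != i]
        rest = np.concatenate([class_idx[j] for j in others])
        M = P[np.ix_(class_idx[i], rest)].mean()
        for j in others:
            if P[np.ix_(class_idx[i], class_idx[j])].mean() >= M:
                label[i, j] = 1
    return label

def nearby_samples(P, idx_c1, idx_c2):
    """Second procedure: select the nearby samples between two nearby classes."""
    sub = P[np.ix_(idx_c1, idx_c2)]            # correlations between the two classes
    rows, cols = np.where(sub >= sub.mean())   # pairs more correlated than the mean

    def frequent(indices):
        uniq, counts = np.unique(indices, return_counts=True)
        return uniq[counts >= counts.mean()]   # keep indices occurring at least the average number of times

    return frequent(rows), frequent(cols)      # within-class indices of the selected samples
```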

If the distribution of samples belongs to class C1 and class C2 , as shown in Fig. 4.7, we get S13 , S14 and S15 . S23 and S24 are nearby samples from APLPP. Calculating the correlation coefficient between two arbitrary samples in Fig. 4.7, we get matrix G:

102

4 Feature Classification Based on Manifold Dimension Reduction …

Fig. 4.7 Distribution of samples belonging to class C1 and class C2

$$G = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \quad (4.23)$$

We scan matrix G by column to get freq = [2, 3, 3, 4, 3, 4, 3, 4]. The distinct elements are [2, 3, 4] and their numbers of occurrences are [1, 4, 3]. The average is T = (1 + 4 + 3)/3 = 8/3 ≈ 2.67, and the elements occurring at least T times are 3 and 4. Thus, the nearby samples are S23 and S24. Then, we use the same method to get the nearby samples of class C2: S13, S14 and S15. After finding the nearby samples of each class, the corresponding edge in the label graph is set, Bp_ij = 1, or another constant. After traversing all the samples, the complete label graph, Bp, is obtained. The between-class scatter matrix is defined in Eq. (4.24):

$$S_b = \frac{1}{2} \sum_{i,j}^{N} \|A^T x_i - A^T x_j\|^2 Bp_{ij} = A^T X (Qp - Bp) X^T A, \quad Qp_{ij} = \sum_{j}^{N} Bp_{ij}. \quad (4.24)$$

An algorithm like the one used by Dornaika and Assoum (2013) is used to construct the local relation graph, Wp. If the correlation coefficient, p_ij, between samples x_i and x_j of the same class is greater than or equal to m_i (the mean correlation coefficient between x_i and all other samples belonging to the same class), then x_j is selected as one of the nearby samples of x_i, and the corresponding edge in the local relation graph is set, Wp_ij = 1, or another constant. After traversing all the samples, the local relation graph, Wp, is obtained. The within-class scatter matrix is defined in Eq. (4.25).

$$
Sw = \frac{1}{2}\sum_{i,j}^{N}\left\|A^{T}x_{i}-A^{T}x_{j}\right\|^{2}Wp_{ij} = A^{T}X(Dp-Wp)X^{T}A,\qquad Dp_{ii} = \sum_{j}^{N}Wp_{ij}. \tag{4.25}
$$

The objective of APLPP is to maximise the trace ratio of the between-class scatter matrix, Sb, to the within-class scatter matrix, Sw:
$$
J(A) = \arg\max\frac{A^{T}X(Qp-Bp)X^{T}A}{A^{T}X(Dp-Wp)X^{T}A}. \tag{4.26}
$$

The optimisation problem can be solved via the generalised eigenvalue problem shown in Eq. (4.27):
$$
X(Qp-Bp)X^{T}A = \lambda X(Dp-Wp)X^{T}A. \tag{4.27}
$$

We sort the eigenvalues, λ, in descending order, λ1 ≥ λ2 ≥ · · · ≥ λm−1 ≥ λm. The optimal projection matrix, A, is composed of the eigenvectors corresponding to the first d largest eigenvalues.
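As a sketch, this projection can be computed with a standard generalised eigensolver. The function below (our own naming, with scipy assumed available) takes the two matrices of Eq. (4.26) and returns the d leading eigenvectors of Eq. (4.27); the small ridge added to the denominator matrix is our own safeguard, not part of the original formulation.

    import numpy as np
    from scipy.linalg import eigh

    def aplpp_projection(Sb_mat, Sw_mat, d):
        """Solve Sb_mat a = lambda * Sw_mat a and keep the d leading eigenvectors (Eq. 4.27)."""
        # small ridge keeps Sw_mat positive definite when samples are scarce (our own addition)
        reg = 1e-6 * np.trace(Sw_mat) / Sw_mat.shape[0]
        w, V = eigh(Sb_mat, Sw_mat + reg * np.eye(Sw_mat.shape[0]))
        order = np.argsort(w)[::-1]          # eigenvalues in descending order
        return V[:, order[:d]]               # projection matrix A (columns are eigenvectors)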

4.3.3 Connections with LDA, LPP, CMVM and MMDA

In Fig. 4.7, we find that, for classes C1 and C2, only a few samples are nearby. However, if traditional KNN or the ε-ball criterion is used, it becomes difficult to find the correct nearby samples. In LDA, CMVM and MMDA, these algorithms take the entire class as the basic unit to process the samples and cannot accurately select the nearby part. However, using APLPP, the discriminative information of different manifolds is described more accurately. CMVM and the introduced APLPP have the same between-class and within-class scatter matrices:
$$
\sum_{i,j}^{N}\left\|A^{T}x_{i}-A^{T}x_{j}\right\|^{2}Bp_{ij} \propto \sum_{i,j}^{N}\left\|A^{T}x_{i}-A^{T}x_{j}\right\|^{2}Wb_{ij}, \tag{4.28}
$$
$$
\sum_{i,j}^{N}\left\|A^{T}x_{i}-A^{T}x_{j}\right\|^{2}Wp_{ij} \propto \sum_{i,j}^{N}\left\|A^{T}x_{i}-A^{T}x_{j}\right\|^{2}Ww_{ij}. \tag{4.29}
$$

The between-class and within-class scatter matrices have the same form. Their only difference lies in Bp, Wb, Wp and Ww, which have different expressions. When the correlation coefficient between any two samples is the same, Bp = Wb and Wp = Ww. Thus, CMVM is a special case of the introduced APLPP. The objective functions of MMDA, LDA and LPP have the same form, and the superscript 'method' in Eq. (4.30) can be MMDA, LDA, LPP or APLPP.


$$
J_{\mathrm{MMDA}}(A) = \arg\max\frac{A^{T}Sb^{\mathrm{method}}A}{A^{T}Sw^{\mathrm{method}}A}. \tag{4.30}
$$

When all samples belonging to a class have the same distribution, and the distance between each class is far enough, MMDA, LDA, LPP and APLPP would have the same within-class scatter matrices, but would have different between-class scatter matrices. This is the key distinction of the different algorithms. When the distribution of samples is irregular, the weighted mean vector used in MMDA and the mean vector used in LDA cannot well reflect the distribution of samples. Thus, the results of MMDA and LDA are not very good. Because LPP is an unsupervised algorithm, it cannot use the class label information. Thus, it cannot divide different classes very well. However, with APLPP, the nearby part of nearby samples is selected, whereas the other part is ignored. Thus, it can describe the discriminative information of different manifolds more accurately and obtain better results. From the above analysis, we can conclude that APLPP can select nearby samples more accurately to characterise the discriminative information of different manifolds. It can divide different classes as far as possible while keeping the local structure as constant as possible.

4.3.4 Experiments

To evaluate the performance of the introduced APLPP, we compared it with ELPP, CMVM, MMDA, LPP, PCA and LDA. Nearly 95% of the energy is used to select the number of principal components in PCA to avoid the SSS problem. Several UCI datasets and face image datasets were used to evaluate the performance of all algorithms. All algorithms were run 10 times, and the mean recognition rate was reported for each. The experiments were implemented on a desktop with an i7-4790 CPU and 8 GB of RAM, using MATLAB (R2015b). For all algorithms, a nearest-neighbour classifier with Euclidean distance was used for classification. In the experiments on CMVM, MMDA and LPP, if the number of training samples per class was greater than 10, the neighbour size K was set to 10; otherwise, K was set to the training number of each class.

Experiments on the UCI datasets
The UCI datasets named 'Balancescale', 'Breastw', 'Ionosphere Image', 'Irist', 'Musk (Version 1)', 'Sonar', 'Vowel' and 'Wine' (http://archive.ics.uci.edu/ml/datasets.html) were used to evaluate the performance of all algorithms. Properties of these datasets are shown in Table 4.7. For convenience of implementation, the number of samples taken from each class of a dataset was made equal. In each experiment, nearly 40% of each class was randomly selected to build the training set; the other samples comprised the testing set. The recognition rates (RR/%) and standard deviations (std) are shown in Table 4.8. In Tables 4.7 and 4.8, 'Ionosphere Image' is shortened to 'Ionosphere'. A sketch of this evaluation protocol is given below.
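The sketch assumes scikit-learn's 1-NN classifier; reduce_fn stands for any of the compared projection methods and is a placeholder of ours.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate(X, y, reduce_fn, train_frac=0.4, runs=10, seed=0):
        """Mean/std recognition rate of a dimension-reduction method with a 1-NN classifier."""
        rng = np.random.default_rng(seed)
        rates = []
        for _ in range(runs):
            train_idx = []
            for c in np.unique(y):
                idx = np.where(y == c)[0]
                rng.shuffle(idx)
                train_idx.extend(idx[:int(train_frac * len(idx))])
            train = np.array(train_idx)
            test = np.setdiff1d(np.arange(len(y)), train)
            A = reduce_fn(X[train], y[train])                    # e.g. the APLPP projection above
            clf = KNeighborsClassifier(n_neighbors=1).fit(X[train] @ A, y[train])
            rates.append(clf.score(X[test] @ A, y[test]))
        return 100 * np.mean(rates), 100 * np.std(rates)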

Table 4.7 Descriptions of the UCI datasets

Name           Number of classes   Number of each class   Dimension
Balancescale   3                   49                     4
Breastw        2                   241                    9
Ionosphere     2                   126                    34
Irist          3                   50                     4
Musk           2                   207                    166
Sonar          2                   120                    60
Vowel          11                  90                     10
Wine           3                   48                     13

The experimental results on the UCI datasets show that the introduced APLPP outperforms the other six algorithms in most situations. Compared to the other algorithms, the introduced APLPP constructs an adaptive graph in which discriminative information is adopted. Thus, the distribution of all samples can be explored very well, and better results are obtained.

Experiments on face image datasets
Image data usually have an intrinsic invariance. However, transformations, such as rotation and scale, will influence it. Nearby pixels undergo similar transformations in the embedded manifold, and images can be characterised by a low-dimensional intrinsic manifold (see Ghodsi et al. 2005). The AR, infrared image, UMIST and Yale_B datasets are used to evaluate algorithm performance.

Experiments on the AR dataset
The AR dataset from the CVC Technical Report (1998) contains 120 people, each having 26 images. See Fig. 4.8; all images are resized to 40 × 50. In the experiment, 4, 5, …, 9 images per person were randomly selected to build the training set, and the other samples were used for testing. The RR/% and std are shown in Table 4.9.

Experiments on the infrared dataset
The infrared dataset from Han et al. (2014) contains 1400 images of five people and two cups at different angles of rotation. All images were captured with an infrared camera. Each person or cup has 200 images. Several images of one person and one cup in the infrared dataset are shown in Fig. 4.9. All images were resized to 40 × 40. In the experiment, 10, 15, …, 35 images per person or cup were randomly selected to build the training set. The other samples were used for testing. The RR/% and std are shown in Table 4.10.

Experiments on the UMIST dataset
The UMIST dataset of Graham and Allinson (1998) contains 575 images of 20 people. For programming convenience, 19 images of each person were randomly selected to build a new dataset. Several images of one person in the UMIST dataset are shown in Fig. 4.10, and all were resized to 56 × 46. In the experiment, 4, 5, …, 9 images per person were randomly selected to build the training set, and the other samples were used for testing. The RR/% and std are shown in Table 4.11.

Breastw

96.55 ± 0.94

95 ± 0.52

96.24 ± 0.68

95.86 ± 0.80

95.97 ± 0.46

96.24 ± 0.66

96.31 ± 0.80

Balancescale

86.2 ± 4.03

61.95 ± 4.03

58.51 ± 4.68

71.26 ± 5.00

75.52 ± 6.11

56.67 ± 6.57

79.43 ± 3.89

APLPP

ELPP

CMVM

LPP

MMDA

PCA

LDA

86.08 ± 1.81

86.42 ± 2.82

86.49 ± 1.91

87.30 ± 2.18

86.89 ± 3.64

84.05 ± 1.92

89.18 ± 1.64

Ionosphere

97.11 ± 0.94

95.33 ± 1.55

92.44 ± 2.15

93.11 ± 2.50

97.00 ± 1.74

95.11 ± 2.52

98.88 ± 1.57

Irist

Table 4.8 Classification performance (mean ± std) on the UCI datasets

100.00 ± 0

97.20 ± 1.12

100.00 ± 0

91.42 ± 1.95

96.63 ± 1.78

97.68 ± 2.74

100.00 ± 0

Musk

76.23 ± 1.87

85.09 ± 3.88

78.77 ± 3.83

81.23 ± 4.26

84.65 ± 3.44

81.32 ± 3.77

89.47 ± 3.39

Sonar

90.55 ± 1.11

90.88 ± 1.04

86.16 ± 1.34

89.06 ± 0.71

91.30 ± 1.11

88.92 ± 1.31

92.08 ± 0.93

Vowel

97.74 ± 1.18

95.24 ± 1.59

96.90 ± 2.11

95.95 ± 2.39

97.62 ± 1.68

91.67 ± 3.32

98.80 ± 1.31

Wine


Fig. 4.8 Images of one person from the AR dataset

Table 4.9 Classification performance (mean ± std) on the AR dataset

Method   4              5              6              7              8              9
APLPP    90.22 ± 1.16   92.69 ± 0.88   94.50 ± 0.59   95.65 ± 0.49   96.34 ± 0.64   97.10 ± 0.47
ELPP     83.83 ± 1.14   86.74 ± 0.66   90.79 ± 0.80   92.20 ± 1.09   93.75 ± 0.93   94.65 ± 0.83
CMVM     68.00 ± 1.68   70.70 ± 1.46   76.14 ± 1.93   78.45 ± 1.21   80.15 ± 1.38   82.46 ± 0.97
LPP      72.53 ± 1.38   76.53 ± 1.92   81.95 ± 0.87   84.69 ± 0.67   86.82 ± 1.12   88.99 ± 1.00
MMDA     71.18 ± 1.43   77.65 ± 1.71   82.86 ± 1.18   86.17 ± 1.02   88.77 ± 1.07   90.55 ± 1.12
PCA      52.95 ± 1.05   56.89 ± 0.73   62.13 ± 1.01   65.19 ± 1.32   67.78 ± 0.97   70.51 ± 0.91
LDA      87.42 ± 1.11   90.39 ± 0.97   93.66 ± 0.50   94.91 ± 0.40   95.32 ± 0.58   96.66 ± 0.38

Fig. 4.9 Images of one person and one cup from the infrared dataset

Table 4.10 Classification performance (mean ± std) on the infrared dataset

Method   10             15             20             25              30              35
APLPP    96.16 ± 1.75   98.91 ± 1.49   99.04 ± 0.99   100.00 ± 0.00   100.00 ± 0.00   100.00 ± 0.00
ELPP     86.76 ± 2.95   93.22 ± 1.53   96.10 ± 1.12   98.14 ± 0.91    98.13 ± 0.97    98.56 ± 0.73
CMVM     86.23 ± 2.34   92.36 ± 2.30   95.08 ± 1.43   97.2 ± 1.16     97.55 ± 1.07    98.11 ± 0.61
LPP      84.19 ± 2.99   91.46 ± 1.77   94.57 ± 1.62   97.02 ± 1.25    96.97 ± 1.54    98.29 ± 0.59
MMDA     89.35 ± 1.51   94.78 ± 2.45   97.47 ± 0.83   98.87 ± 0.63    98.96 ± 0.76    99.45 ± 0.46
PCA      85.95 ± 3.23   92.53 ± 2.81   95.23 ± 1.12   96.87 ± 1.14    97.51 ± 1.23    98.03 ± 0.44
LDA      91.91 ± 1.90   93.22 ± 2.28   98.07 ± 0.81   99.48 ± 0.34    99.43 ± 0.56    99.8 ± 0.24


Fig. 4.10 Face images of one person in the UMIST dataset

Table 4.11 Classification performance (mean ± std) on the UMIST dataset

Method   4              5              6              7              8              9
APLPP    92.66 ± 1.96   96.78 ± 2.35   97.30 ± 2.16   97.91 ± 1.77   98.62 ± 1.05   99.00 ± 1.05
ELPP     80.70 ± 2.74   83.89 ± 2.22   88.62 ± 2.51   92.25 ± 2.57   94.00 ± 1.88   94.65 ± 2.11
CMVM     75.40 ± 3.35   78.54 ± 2.59   83.04 ± 1.79   87.13 ± 3.01   89.59 ± 1.28   91.10 ± 1.26
LPP      74.20 ± 3.61   78.00 ± 2.88   82.27 ± 2.07   86.50 ± 1.86   89.23 ± 1.58   90.45 ± 2.01
MMDA     81.97 ± 3.29   85.11 ± 2.25   89.65 ± 2.66   92.58 ± 1.90   94.68 ± 1.27   95.80 ± 1.11
PCA      78.20 ± 2.79   82.18 ± 2.70   86.23 ± 2.84   91.13 ± 2.89   92.91 ± 2.25   93.70 ± 1.80
LDA      88.66 ± 2.46   92.14 ± 2.36   93.84 ± 2.66   95.83 ± 1.55   96.81 ± 1.22   98.00 ± 1.16

Fig. 4.11 Face images of one person in the Yale_B dataset

Experiments on the Yale_B dataset
The Yale_B dataset of Lee et al. (2005) contains 2432 images of 38 people with different facial expressions and lighting conditions. Several images of one person in the Yale_B dataset are shown in Fig. 4.11. All images were resized to 40 × 35. In the experiment, 12, 16, …, 32 images per person were randomly selected to comprise the labelled set, and the other samples were used for testing. RR/% and std are shown in Table 4.12.
The experimental results on the face image datasets show that APLPP outperforms the other six algorithms in most situations. APLPP can construct an adaptive graph because discriminative information is adopted; thus, the distribution of all samples can be well explored, giving better results. APLPP uses the correlation coefficient to adaptively select nearby samples and classes. Artificial parameters are not needed, making APLPP convenient to implement and popularise. The experimental results on the UCI and face image datasets show that APLPP has obvious advantages over the other algorithms, such as higher recognition rates.


Table 4.12 Classification performance (mean ± std) on the Yale_B dataset

Method   12             16             20             24             28             32
APLPP    76.82 ± 0.88   81.57 ± 1.02   83.25 ± 0.65   85.32 ± 1.00   86.18 ± 0.83   87.66 ± 1.22
ELPP     70.68 ± 0.87   74.99 ± 1.02   77.06 ± 0.82   78.55 ± 1.12   80.01 ± 1.19   80.32 ± 1.53
CMVM     37.75 ± 1.49   41.98 ± 1.29   44.57 ± 0.72   47.02 ± 0.88   49.47 ± 1.10   50.72 ± 1.53
LPP      72.80 ± 0.85   77.61 ± 0.96   79.49 ± 0.62   81.61 ± 1.13   83.18 ± 1.05   83.93 ± 1.30
MMDA     73.97 ± 1.36   78.65 ± 1.14   80.68 ± 1.07   82.57 ± 1.01   84.91 ± 1.21   85.59 ± 1.43
PCA      34.79 ± 1.03   38.90 ± 0.97   42.24 ± 0.56   45.17 ± 0.71   47.43 ± 1.04   49.14 ± 1.38
LDA      74.64 ± 0.80   78.45 ± 1.03   81.93 ± 0.78   82.93 ± 0.83   85.01 ± 0.76   86.01 ± 1.35

4.4 Kernel Maximum Likelihood-Scaled Locally Linear Embedding for Night-Vision Images

4.4.1 KML Similarity Metric

Maximum likelihood (ML) is based on the probability discriminant function. The Bayesian criterion comprehensively analyses the statistical information of a noisy image. It is especially suitable for estimating abnormal distributions and detecting outliers in images having fine structures. Furthermore, the ML method allows simple prior-knowledge integration and has a simple algorithm structure (Deledalle et al. 2009; Hu et al. 2013; Jones and Henderson 2009). We introduce ML to evaluate the outlier probability of each sample point in the feature space of LLL and infrared images. The dataset in the high-dimensional input space is $\{x_i\}_{i=1,\ldots,M}$. Each sample point, $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^{T}$, has an ML value with its $N$ neighbours, $X = \{x_j\}_{j=1,\ldots,N} \in w$:
$$
ML(x_i) = P(x_i/w)/P(w) = \frac{P(w)^{-1}}{(2\pi)^{q/2}\lvert\Sigma_w\rvert^{1/2}}\exp\!\left(-\frac{1}{2}(x_i-\mu_w)^{T}\Sigma_w^{-1}(x_i-\mu_w)\right). \tag{4.31}
$$
Here, $\mu_w$ and $\Sigma_w$ are the mean value and covariance in neighbourhood $w$. If every neighbourhood has the same prior probability, $P(w)$, the ML similarity estimation function, $s(x_i)$, can be expressed as follows by taking the logarithm of $ML(x_i)$:
$$
s(x_i) = \ln\lvert\Sigma_w\rvert + (x_i-\mu_w)^{T}\Sigma_w^{-1}(x_i-\mu_w),\qquad
\Sigma_w = \frac{1}{N}\sum_{j=1}^{N}(x_j-\mu_w)(x_j-\mu_w)^{T},\quad
\mu_w = \frac{1}{N}\sum_{j=1}^{N}x_j. \tag{4.32}
$$
Here, a larger $s(x_i)$ indicates a higher outlier probability of $x_i$.
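A direct NumPy sketch of the input-space score of Eq. (4.32) is shown below; the small regulariser added to the covariance is our own numerical safeguard and not part of the original formulation.

    import numpy as np

    def ml_outlier_score(x_i, neighbours):
        """ML similarity s(x_i) of Eq. (4.32): larger values mean higher outlier probability."""
        mu = neighbours.mean(axis=0)
        diff = neighbours - mu
        cov = diff.T @ diff / len(neighbours)        # covariance of the neighbourhood w
        cov += 1e-6 * np.eye(cov.shape[0])           # regulariser for numerical stability (our addition)
        sign, logdet = np.linalg.slogdet(cov)
        d = x_i - mu
        return logdet + d @ np.linalg.solve(cov, d)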


ML estimation is performed under the premise of normally distributed features, whereas the characteristics of night-vision images do not follow a normal distribution. To eliminate the negative effects of this Gaussian assumption, we use KML, which nonlinearly maps the input signal to the kernel space via the kernel function, φ(·), and detects outliers using ML in the kernel space. The nonlinear information of night-vision data can thus be mined: a Gaussian distribution assumption in the kernel space corresponds to a more complicated distribution in the input space, which is much closer to reality. The only difference from the input space is that each input sample, $x_i$, is nonlinearly mapped to the kernel space by the kernel function, φ(·), so that the sample in the kernel space is expressed as $\varphi(x_i)$. The mean value and covariance of neighbourhood $w$ in the kernel space are accordingly modified as $\mu_{w\varphi} = \frac{1}{N}\sum_{i=1}^{N}\varphi(x_i)$ and $\Sigma_{w\varphi} = \frac{1}{N}\sum_{i=1}^{N}(\varphi(x_i)-\mu_{w\varphi})(\varphi(x_i)-\mu_{w\varphi})^{T}$. Thus, the KML similarity metric function is transformed as follows:
$$
s(\varphi(x_i)) = \ln\lvert\Sigma_{w\varphi}\rvert + (\varphi(x_i)-\mu_{w\varphi})^{T}\Sigma_{w\varphi}^{-1}(\varphi(x_i)-\mu_{w\varphi}),\quad
\Sigma_{w\varphi} = \frac{1}{N}\sum_{j=1}^{N}(\varphi(x_j)-\mu_{w\varphi})(\varphi(x_j)-\mu_{w\varphi})^{T},\quad
\mu_{w\varphi} = \frac{1}{N}\sum_{j=1}^{N}\varphi(x_j). \tag{4.33}
$$

The eigenvalues of $\Sigma_{w\varphi}$ are $\Lambda_{w\varphi} = \{\Lambda_{w\varphi}^{l}\}_{l=1,\ldots,N}$, and the eigenvectors are $V_{w\varphi} = \{V_{w\varphi}^{l}\}_{l=1,\ldots,N}$. Given $\varphi_w(x_j) = \varphi(x_j) - \mu_{w\varphi}$ and $\varphi_w(X) = \{\varphi_w(x_j)\}_{j=1,\ldots,N}$, we have $\Sigma_{w\varphi} = \frac{1}{N}\varphi_w(X)\varphi_w(X)^{T}$ and $V_{w\varphi}^{l} = \sum_{j=1}^{N}\alpha_{j}^{l}\varphi_w(x_j) = \varphi_w(X)\alpha^{l}$, with $\alpha^{l} = (\alpha_{1}^{l},\alpha_{2}^{l},\ldots,\alpha_{N}^{l})^{T}$. Inserting $\Sigma_{w\varphi} = \frac{1}{N}\varphi_w(X)\varphi_w(X)^{T}$ and $V_{w\varphi}^{l} = \varphi_w(X)\alpha^{l}$ into $\Lambda_{w\varphi}^{l}V_{w\varphi}^{l} = \Sigma_{w\varphi}V_{w\varphi}^{l}$ yields
$$
N\Lambda_{w\varphi}^{l}\varphi_w(X)\alpha^{l} = \varphi_w(X)\varphi_w^{T}(X)\varphi_w(X)\alpha^{l}. \tag{4.34}
$$
Multiplying Eq. (4.34) by $\varphi_w^{T}(X)$ on both sides and setting $K_w = \langle\varphi_w(X),\varphi_w(X)\rangle$ gives $N\Lambda_{w\varphi}^{l}\alpha^{l} = K_w\alpha^{l}$, where $K_w = K - IK - KI + IKI$ is the centred kernel matrix, $K = [k(x_{ki},x_{kj})]_{ki,kj=1,\ldots,N}$, $k(x_{ki},x_{kj}) = \langle\varphi(x_{ki}),\varphi(x_{kj})\rangle$ and $I$ is the unit matrix. Setting $\Lambda_w$ as the eigenvalues of $K_w$ and $V_w$ as the eigenvectors, from $N\Lambda_{w\varphi}^{l}\alpha^{l} = K_w\alpha^{l}$ we get $\Lambda_w = N\Lambda_{w\varphi}$, $V_w = \alpha = \{\alpha^{l}\}_{l=1,\ldots,N}$ and $V_{w\varphi} = \varphi_w(X)V_w$. Thus, the pseudoinverse of $\Sigma_{w\varphi}$ is
$$
\Sigma_{w\varphi}^{-1} = V_{w\varphi}\Lambda_{w\varphi}^{-1}V_{w\varphi}^{T} = \varphi_w(X)V_w\left(\frac{1}{N}\Lambda_w\right)^{-1}V_w^{T}\varphi_w^{T}(X) = N\varphi_w(X)K_w^{-1}\varphi_w^{T}(X). \tag{4.35}
$$
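In practice, Eqs. (4.34)–(4.35) only require the neighbourhood kernel matrix. The sketch below centres it and eigendecomposes it; here we read the matrix written as I in the centring formula as the N × N matrix whose entries are all 1/N (the usual centring matrix), which is our interpretation.

    import numpy as np

    def centred_kernel_eig(K):
        """Centre a neighbourhood kernel matrix (K_w = K - IK - KI + IKI) and eigendecompose it."""
        N = K.shape[0]
        I = np.full((N, N), 1.0 / N)          # centring matrix of constant entries 1/N (our reading)
        Kw = K - I @ K - K @ I + I @ K @ I
        lam, alpha = np.linalg.eigh(Kw)       # eigenvalues Lambda_w and coefficient vectors alpha (V_w)
        return Kw, lam, alpha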


The KML similarity metric function, $s(\varphi(x_i))$, is finally converted as
$$
\begin{aligned}
s(\varphi(x_i)) &= \ln\lvert\Sigma_{w\varphi}\rvert + \bigl(\varphi(x_i)-\mu_{w\varphi}\bigr)^{T}\Sigma_{w\varphi}^{-1}\bigl(\varphi(x_i)-\mu_{w\varphi}\bigr)\\
&= \ln\lvert\Sigma_{w\varphi}\rvert + N\bigl(\varphi(x_i)-\mu_{w\varphi}\bigr)^{T}\varphi_w(X)K_w^{-1}\varphi_w^{T}(X)\bigl(\varphi(x_i)-\mu_{w\varphi}\bigr).
\end{aligned} \tag{4.36}
$$
Here,
$$
\varphi(x_i)^{T}\varphi_w(X) = \varphi(x_i)^{T}\bigl[(\varphi(x_1),\ldots,\varphi(x_N))-\mu_{w\varphi}\bigr] = K_{x_i} = \bigl(k(x_i,x_1),\ldots,k(x_i,x_N)\bigr) - \sum_{j=1}^{N}k(x_i,x_j)/N,
$$
and
$$
\mu_{w\varphi}^{T}\varphi_w(X) = \mu_{w\varphi}^{T}\bigl[(\varphi(x_1),\ldots,\varphi(x_N))-\mu_{w\varphi}\bigr] = K_{\mu} = \sum_{j=1}^{N}\bigl(k(x_j,x_1),\ldots,k(x_j,x_N)\bigr)/N - \sum_{ki=1}^{N}\sum_{kj=1}^{N}k(x_{ki},x_{kj})/N^{2}.
$$
Simultaneously, based on the definition of the determinant, $\det\Sigma_{w\varphi}$ is the directed area or volume of the super-parallel polyhedron constituted by the row or column vectors of $\Sigma_{w\varphi}$. Therefore, $\lvert\Sigma_{w\varphi}\rvert$ can be approximately calculated as the summation of distances between the centralised sample vectors and the eigenvectors of $\Sigma_{w\varphi}$:
$$
\begin{aligned}
\lvert\Sigma_{w\varphi}\rvert &\approx \sum_{j=1}^{N}\bigl(\varphi(x_j)-\mu_{w\varphi}\bigr)^{T}V_{w\varphi}V_{w\varphi}^{T}\bigl(\varphi(x_j)-\mu_{w\varphi}\bigr)\\
&= \sum_{j=1}^{N}\bigl(\varphi(x_j)-\mu_{w\varphi}\bigr)^{T}\varphi_w(X)V_{w}V_{w}^{T}\varphi_w^{T}(X)\bigl(\varphi(x_j)-\mu_{w\varphi}\bigr)
= \sum_{j=1}^{N}K_{p}V_{w}V_{w}^{T}K_{p}^{T}.
\end{aligned} \tag{4.37}
$$
Here, $K_p = \varphi(x_j)^{T}\varphi_w(X) - \mu_{w\varphi}^{T}\varphi_w(X) = \bigl(k(x_j,x_1),k(x_j,x_2),\ldots,k(x_j,x_N)\bigr) - \sum_{ki=1}^{N}k(x_j,x_{ki})/N - K_{\mu}$.

As shown in Eq. (4.36), $(\varphi(x_i)-\mu_{w\varphi})^{T}\Sigma_{w\varphi}^{-1}(\varphi(x_i)-\mu_{w\varphi})$ is the kernel Mahalanobis distance (KMD), i.e. the covariance distance between a sample point and its overall neighbours in the kernel space. In Eq. (4.37), $(\varphi(x_j)-\mu_{w\varphi})^{T}V_{w\varphi}V_{w\varphi}^{T}(\varphi(x_j)-\mu_{w\varphi})$ is KPCA, which represents the summed projection distance of the overall neighbours on the principal components of the neighbourhood in the kernel space. Thus, the KML similarity metric is the combination of KPCA and KMD. Only when each sample point has the most compact neighbourhood with minimum covariance


distance can it select the best neighbours and have the highest similarity within neighbourhood to gain minimum outlier probability and highest reliability.

4.4.2 KML Outlier-Probability-Scaled LLE (KLLE)

LLE assumes that the original sample set is uniformly and continuously distributed on a manifold, so that a linear method can be used to analyse the intrinsic structure of the data; however, this is difficult in practice. Performing LLE in an appropriate kernel space implies a complex distribution of data in the input space, according to Yun et al. (2010) and Zhang et al. (2010). This section implements the KLLE method and further employs the KML outlier probability to scale KLLE and decrease outlier interference. This yields the KML–KLLE algorithm, which integrates KML similarity and KLLE.
First, $F_c$ denotes the clean dataset, identified by the outlier probability, $s_i$, of each sample point through $s_i \le \tau$, where $\tau > 0$ is the outlier threshold. Using KNN to choose the $N$ neighbours of $x_i$, all neighbours must satisfy $\{x_j\}_{j=1,\ldots,N} \in F_c$.
Second, we calculate the reconstruction weights, $\eta = \{\eta_{ij}\}_{i=1,\ldots,M,\,j=1,\ldots,N}$, of the neighbours by minimising the reconstruction error, $J(\eta)$, in the kernel space:
$$
J(\eta) = \sum_{i=1}^{M}\Bigl\|\varphi(x_i) - \sum_{j=1}^{N}\eta_{ij}\varphi(x_j)\Bigr\|^{2}. \tag{4.38}
$$
We set $J(\eta_i) = \bigl\|\varphi(x_i) - \sum_{j=1}^{N}\eta_{ij}\varphi(x_j)\bigr\|^{2}$, $G_i = [\varphi(x_i)-\varphi(x_{i1}),\ldots,\varphi(x_i)-\varphi(x_{iN})]$, $\eta_i = [\eta_{i1},\eta_{i2},\ldots,\eta_{iN}]^{T}$ and the constraint $e^{T}\eta_i = 1$, so that $J(\eta_i) = \|G_i\eta_i\|^{2} = \eta_i^{T}G_i^{T}G_i\eta_i$. Applying a Lagrange multiplier, $\lambda$, gives $L(\eta_i) = \eta_i^{T}G_i^{T}G_i\eta_i + \lambda(\eta_i^{T}e - 1)$. The optimal weights are obtained by requiring the partial derivative with respect to each weight, $\eta_{ij}$, to be zero: $\partial L(\eta_i)/\partial\eta_{ij} = G_i^{T}G_i\eta_i + \lambda e = 0$. We then solve Eq. (4.39) and rescale the weights to get the desired solution, $\eta$:
$$
G_i^{T}G_i\eta_i = e. \tag{4.39}
$$
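A sketch of the weight computation of Eq. (4.39) for one sample, working directly from a precomputed kernel matrix, is given below; the regularisation of the Gram matrix and the naming are our own additions.

    import numpy as np

    def reconstruction_weights(K, i, nbr_idx):
        """Kernel-space LLE weights for sample i from Eq. (4.39), given the full kernel matrix K."""
        N = len(nbr_idx)
        # (G_i^T G_i)_{t1,t2} = k(x_i,x_i) - k(x_i,x_t1) - k(x_i,x_t2) + k(x_t1,x_t2); k(x_i,x_i)=1 for RBF
        Gram = (K[i, i] - K[i, nbr_idx][:, None] - K[i, nbr_idx][None, :]
                + K[np.ix_(nbr_idx, nbr_idx)])
        Gram += 1e-3 * np.trace(Gram) / N * np.eye(N)   # standard LLE regularisation (our addition)
        eta = np.linalg.solve(Gram, np.ones(N))          # solve G_i^T G_i eta = e
        return eta / eta.sum()                           # rescale so that e^T eta = 1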

Here, $G_i^{T}G_i = [1 - k(x_i,x_{it_1}) - k(x_i,x_{it_2}) + k(x_{it_1},x_{it_2})]_{t_1,t_2=1,\ldots,N}$.
Third, we compute the $d$-dimensional embedding, $Y = \{y_i\}_{i=1,\ldots,M}$, based on the reconstruction weights, $\eta$, and the outlier probabilities, $S = \{s_i\}_{i=1,\ldots,M}$, of the high-dimensional inputs, $\varphi(x) = \{\varphi(x_i)\}_{i=1,\ldots,M}$, in the kernel space. We minimise the cost function, $J(Y)$, under the constraints $\sum_{i=1}^{M}y_i = 0$ and $\frac{1}{M}YY^{T} = I$:
$$
J(Y) = \sum_{i=1}^{M}s_i\Bigl\|y_i - \sum_{j=1}^{N}\eta_{ij}y_j\Bigr\|^{2} = \mathrm{tr}\bigl(YM_sY^{T}\bigr). \tag{4.40}
$$
Here, $M_s = S(I-\eta)^{T}(I-\eta)$. Minimising $J(Y)$ can be achieved by finding the $d+1$ eigenvectors of $M_s$ with the smallest $d+1$ eigenvalues.
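The final embedding step of Eq. (4.40) can be sketched as follows; eta is assumed to be the M × M weight matrix with each row holding that sample's rescaled neighbour weights and zeros elsewhere, and symmetrising Ms before the eigendecomposition is our own choice.

    import numpy as np

    def klle_embedding(eta, s, d):
        """Embedding of Eq. (4.40): eta is the M x M weight matrix, s the outlier probabilities."""
        M = eta.shape[0]
        S = np.diag(s)
        Ms = S @ (np.eye(M) - eta).T @ (np.eye(M) - eta)
        Ms = (Ms + Ms.T) / 2                      # symmetrise before eigendecomposition (our choice)
        w, V = np.linalg.eigh(Ms)
        return V[:, 1:d + 1]                      # keep the d smallest nontrivial eigenvectors as rows of Y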

4.4.3 Experiments

To evaluate the effect of dimension reduction and feature extraction by KML–KLLE on artificial data and in real LLL and infrared applications, we design several groups of experiments in which the data are normalised and the RBF kernel function, $k(x,y) = \langle\varphi(x),\varphi(y)\rangle = \exp\bigl(-\|x-y\|^{2}/2\sigma_k^{2}\bigr)$, is chosen with $\sigma_k = 1$. Parameter $d$ is set to 2 and fixed for all experiments.

Artificial data mapping
To better illustrate the embedding property, we created two standard manifolds with outliers (toroidal helix and 3D clusters) and obtained 2D embedded outcomes using the LLE, WLLE and KML–KLLE methods. As shown in Fig. 4.12, there were 1500 clean sample points and 200 uniformly distributed random outliers on both manifolds. The colour points in the 3D figures indicate the distribution of the toroidal helix and clusters. The black points are outliers. The corresponding points, mapped from 3D to 2D by the three different methods, have the same colour.

Fig. 4.12 LLE/WLLE/KML–KLLE on artificial datasets with outliers (N = 15, τ = 0.85). Reprinted from Han et al. (2014), with permission from Elsevier

Fig. 4.13 LLE/WLLE/KML–KLLE on the Twin Peaks dataset with different levels of outliers (10%, 20% and 30%) (N = 20, τ = 0.75). Reprinted from Han et al. (2014), with permission from Elsevier

The purpose of mapping to low dimensions is to extract the intrinsic structure of the high-dimensional data so that the embedding result is separable. From the distribution of colour points in the 2D figures (Fig. 4.12), LLE and WLLE are susceptible to outliers, whereas KML–KLLE performs well at dimension reduction and clustering on these artificial abnormal manifolds. Regarding dimension reduction, owing to the outliers, LLE maps the 3D toroidal helix to an anomalous 2D shape. Whereas WLLE can unfold the toroidal helix string, it only obtains an irregular circle. However, KML–KLLE achieves a regular flower-shaped circle, indicating that more features of the original dataset are preserved in the 2D embedding. Regarding clustering, one cluster of the 3D clusters shrinks to a point with WLLE and even disappears with LLE in 2D, and the connection line is twisty. However, KML–KLLE embeds the 3D clusters precisely into 2D with all clusters retained and the connection line clearly distinguished. KML–KLLE is also applied for dimension reduction on the Twin Peaks and Swiss Roll datasets with added outliers. The embedding results of LLE, WLLE and KML–KLLE at different levels of outliers are shown in Figs. 4.13 and 4.14, respectively. In Fig. 4.13, there are 1500 clean sample points and 150, 300 and 450 uniformly distributed random outliers. In Fig. 4.14, there are 1500 clean sample points and 150, 300, 450, 600, 750 and 900 uniformly distributed random outliers. The meanings of the colour and black points are the same as above.


Fig. 4.14 KML–KLLE on the Swiss Roll dataset with different levels of outliers (10–60%) (N = 20, τ = 0.6). Reprinted from Han et al. (2014), with permission from Elsevier

KML–KLLE outperforms LLE and WLLE when embedding the Twin Peaks dataset with outliers. It can be seen from Fig. 4.13 that the KML–KLLE-embedded outcomes are reliable and only negligibly influenced by the outliers, compared to LLE and WLLE. KML–KLLE can also extract the intrinsic features of the data with a smoother colour distribution. LLE fails to preserve the global structure of the manifold because of the damage caused by outliers to local neighbour relationships. WLLE can effectively unfold the manifold at low outlier densities, but its performance drops when the density increases. To investigate the extent to which KML–KLLE can estimate and filter outliers, we implement KML–KLLE on the Swiss Roll datasets, which are polluted with different levels of outliers from 10 to 60%. The dimension reduction results are given in Fig. 4.14. The embedding of the clean data is not significantly affected by the noisy data. Furthermore, whereas the local intrinsic feature is badly destroyed as the percentage of outliers increases, the mapping of KML–KLLE remains accurate and robust. The embedding results prove the effectiveness of KML–KLLE in discovering essential features of the input data by suppressing noisy data.

Noisy UMIST face data mapping
To reveal the robust performance of KML–KLLE for face recognition in noisy images, we tested a total of 158 face images of five different persons selected randomly from the UMIST database. Each face image had 92 × 112 pixels with Gaussian noise added (see Fig. 4.15a). The input sample point, $x_i$, is constructed by concatenating the image pixels column by column into a column vector. It is embedded into 2D with the LLE, GLE and KML–KLLE algorithms. The embedding results of the five faces from top to bottom in Fig. 4.15a are represented by circle points of blue, cyan, green, yellow and orange; see Fig. 4.15b, c. For the 2D embedding, cluster information can be discriminated by both the feature1 and feature2 axes.


Fig. 4.15 Comparison of dimension reduction and classification on noisy UMIST face data (N = 20, τ = 0.6): a UMIST face data with Gaussian noise added (mean 0, variance 0.15); b LLE, GLE and KML–KLLE embeddings at noise variance 0.15; c KML–KLLE embeddings at noise variances 0.25, 0.35 and 0.45. Reprinted from Han et al. (2014), with permission from Elsevier


Figure 4.15b compares the classification effect of the three algorithms on noisy face data. The variance of the Gaussian noise is 0.15. LLE generates some overlaps between different clusters of face data, making them non-separable. Influenced by the serious outliers, GLE shows a diffuse shape of classes with fewer overlaps. KML–KLLE produces a relatively parallel arrangement of the five clusters with no overlap, although the short distance between different clusters indicates that the embedding result is not linearly separable. In Fig. 4.15c, KML–KLLE is applied to different levels of noisy face data, with Gaussian noise variances of 0.25, 0.35 and 0.45. As expected, the embedding performance of KML–KLLE drops slightly for dense noise. However, all three embedded outcomes are linearly separable: they can be classified horizontally when the noise variance is 0.25, vertically when it is 0.35 and diagonally when it is 0.45.

LLL and infrared application
In this simulation, several sets of LLL and infrared images were built to further demonstrate the effectiveness of KML–KLLE in finding a coherent relationship. As shown in Fig. 4.16, the dataset contains LLL and infrared images of two still lifes and three different faces, all having poses from the right side through the front view and to the right side, under four illuminance levels (L1, L2, L3, L4) and two indoor temperatures (T1 = 31 °C, T2 = 8 °C). All images were zoomed to 120 × 120 pixels. The embedding points of each target cover the whole Jet colourmap from blue to red, representing the different poses sequentially in the datasets. The colour bar shown in Fig. 4.16a is the same as the one in Fig. 4.15. For the 2D embedding of the night-vision image sets, pose information is described by the feature2 axis, whereas the cluster information can be discriminated by both the feature1 and feature2 axes.


Fig. 4.16 LLL and infrared image sets: a LLL image set at illuminance L4 and infrared image set at temperature T1 (from top to bottom: CUP1, CUP2, H1, H2 and H3); the colour bar represents different pose directions; b LLL images of H3 at illuminances L1, L2, L3 and L4 and infrared images of H3 at temperatures T1 and T2, respectively. Reprinted from Han et al. (2014), with permission from Elsevier

Fig. 4.17 KML–KLLE on LLL image sets of the same target (CUP1 and H2) with all poses at four illuminances (N = 30; τ = 0.9 for L1, τ = 0.85 for L2, τ = 0.75 for L3 and τ = 0.6 for L4). Reprinted from Han et al. (2014), with permission from Elsevier

Fig. 4.18 KML–KLLE on LLL image sets of three targets with all poses at two illuminances (L2 and L4) (N = 30; τ = 0.85 for L2, τ = 0.6 for L4). Reprinted from Han et al. (2014), with permission from Elsevier

In Fig. 4.18, we investigate the performance of KML–KLLE on LLL image sets of three different targets with all poses at the L2 and L4 illuminances. There were 1220 images in the L2 dataset and 1380 in the L4 dataset, both including CUP2, H1 and H3. The circle, square and triangle points represent the embedding results of CUP2, H1 and H3, respectively. The 2D embedding result not only accurately indicates the pose distribution of all targets, but also effectively classifies the different targets. The pose and cluster distributions of the three targets in the 2D embedding are explicit for the higher SNR in L2, and the large distance among clusters makes the embedding result linearly separable. Serious noise in L4 interferes with the pose distribution of CUP2 and H3; however, the three targets remain distinctly separable. Thus, KML–KLLE is robust for both dimension reduction and classification of different targets in LLL image sets under different illuminances. KML–KLLE is also applied to the infrared image sets of four targets with all poses at the T1 and T2 temperatures. There were 1850 images in the T1 dataset and 1900 in the T2 dataset, both including CUP1, CUP2, H1 and H3. The circle, triangle, asterisk and square points represent the embedding results of CUP1, CUP2, H1 and H3, respectively. The 2D embedding results in Fig. 4.19 show exact and linearly separable distributions of the different targets in terms of poses and clusters. Because of the slight difference among the thermal radiation responses of faces in infrared images, there is little interference in performance at different temperatures. Thus, KML–KLLE is also robust for dimension reduction and classification of different targets in infrared image sets at different temperatures.


Fig. 4.19 KML–KLLE on infrared image sets of four targets with all poses at two temperatures (N = 30; τ = 0.85). Reprinted from Han et al. (2014), with permission from Elsevier

4.4.4 Discussion

Robust problem
WLLE uses a weighted distance measurement that gives prototype data with high density a smaller weight scaling and data with low density a larger weight scaling, presupposing a standard normal distribution to estimate the probability density function. GLE selects neighbours using a geometric distance, defined as the Euclidean distance from each point to a linear manifold spanned by the principal vectors and represented as the volume of a parallelotope determined by the principal vectors. The fundamental geometric or weighting information of the input data is recovered under the assumption that the data come from smooth manifolds, so that each datum and its neighbours lie on a locally approximated linear patch. The robust version of LLE introduced here instead performs local KML on the data points. The KML algorithm gives a measure of how likely each point is to come from the underlying data manifold, which is crucial to the subsequent KLLE learning procedure. Unlike WLLE and GLE, KML is based on robust statistics to reduce the influence of distributional deviations. From Eq. (4.36), KML–KLLE contains both density (i.e. KMD) and parallelotope volume (i.e. KPCA) in the kernel space. Thus, it synthesises the concepts of WLLE and GLE in outlier detection. Moreover, KML–KLLE utilises a kernel to match an ideal distribution in a complex dataset. In Figs. 4.13, 4.14 and 4.15, in the presence of outliers, the N nearest neighbours of a clean data point may no longer lie on a locally linear patch of the manifold, which gives a small bias to the reconstruction. For an outlier point, its neighbourhood is typically much larger and more irregular than that of clean data; the estimated reconstruction weights of its neighbours therefore cannot reflect the local structure of the manifold in the high-dimensional input space well, leading to a large bias in the embedding result. KML–KLLE can identify the outliers and reduce their influence on the embedding. Thus, it is insensitive to outliers and can well preserve the local structure of the data manifolds in the input space, even when disturbed by outliers.


Fig. 4.20 a Comparison of error rates versus the number of neighbours N for KML–KLLE (τ = 0.6) and LLE; b error rates versus the outlier probability threshold τ for KML–KLLE on the infrared T1 and LLL L1, L2, L3, L4 datasets (N = 30). Reprinted from Han et al. (2014), with permission from Elsevier

Sensitivity to parameters
The neighbour size, N, is a selectable parameter in the configuration of our algorithm. To study the stability of KML–KLLE with respect to changes in N, we compare LLE and KML–KLLE under the same configuration, performing dimension reduction and classification on the datasets of five targets under L4 and T1 in Fig. 4.16a. Figure 4.20a shows a range of embeddings discovered by the LLE and KML–KLLE algorithms, all on the same dataset, but using different neighbour sizes N. The results of KML–KLLE show a wider stable range of neighbour size, whereas the results of LLE easily break down as N becomes too small or too large. A small N cannot reconstitute the local relationship comprehensively, whereas a large N makes a data point and its nearest neighbours hard to model as locally linear. However, the embeddings of KML–KLLE under the same conditions are much better, mainly because the outlier probability evaluation in KML–KLLE eliminates abnormally distributed data even when the neighbour size is quite large. Thus, Fig. 4.20a shows a lower error rate and higher stability of KML–KLLE on both the LLL and infrared image sets. Noting that the performance of KML–KLLE is also greatly affected by changes to the outlier probability threshold, τ, we record the error rate for different τ. Figure 4.20b shows the error rates of KML–KLLE on the datasets of five targets under the four illuminance levels L1, L2, L3, L4 and the temperature T1. The shape of the error-rate curve indicates how the outlier probability threshold should be chosen in KML–KLLE. On the one hand, Fig. 4.20b expounds the interference of noise on KML–KLLE: the error rate decreases as the noise level declines, so a smaller τ should be set at a lower illuminance of the LLL image set. On the other hand, a smaller τ involves stricter exclusion of outliers, and a lower error rate is achieved at the cost of computation time. Therefore, we choose a τ at which the error rate has converged.

Stability problem
The true positive (TP) rate and false positive (FP) rate are used as performance measures. The TP rate measures the chance that an outlier is correctly detected, whereas the FP rate measures the chance that a clean data point is incorrectly detected as an outlier.

Fig. 4.21 ROC curves for the night-vision datasets. a ROC curves for the infrared T1 dataset. b ROC curves for the LLL L4 dataset. Reprinted from Han et al. (2014), with permission from Elsevier

Figure 4.21 shows the receiver operating characteristic (ROC) curves comparing the KML–KLLE outlier detection method, the WLLE outlier detection method and a simple neighbourhood-based outlier detection method on both the infrared and LLL datasets. Specifically, the simple neighbourhood-based outlier detection method computes the neighbourhood volume of each data point; the larger the neighbourhood volume, the more likely the data point is an outlier. Since a more rapid and higher convergence implies better performance, the outlier detection by KML–KLLE is significantly better on the infrared T1 dataset. KML–KLLE is also better on the LLL L4 dataset, but the difference is not so significant.

Computational complexity analysis
All algorithms used in this chapter mainly consist of three phases: the nearest-neighbour estimation phase, the optimal reconstruction weights phase and the reconstruction of the low-dimensional embedding phase. WLLE, GLE and KML–KLLE each have one more step than LLE in their nearest-neighbour phase. WLLE contains one more parameter-estimation step in its neighbour selection than the original LLE, which gives a computational complexity of about $O(DM^{2}+DM)$. The main computational burden of GLE is calculating the inverse of the manifold matrix, where the computational complexity is around $O(MN^{3})$. Despite the good results of visualisation, clustering and classification with KML–KLLE, its computational requirement is significantly higher than those of the other three methods. The bottleneck lies with the computation of the weights, $s_i$, by KML, which involves additional computations of around $O(MN^{2}(N+D))$ compared with LLE. Thus, the computational complexity of our method is mainly determined by the number of samples, M, the sample dimension, D, and the neighbour size, N. As shown in Table 4.13, the computational complexity of KML–KLLE is less than that of WLLE but more than those of LLE and GLE in handling high-dimensional data. Our future work will introduce approximations to speed up the KML–KLLE procedure.


Table 4.13 Comparison of computation time on image set L4

Algorithm    Parameter (image set of L4 in Fig. 4.18)      Computational time (s)   Error rate (%)
LLE          M = 1380, D = 14400                           106.3                    21.3
WLLE         N = 30, D = 14400                             220.4                    18.5
GLE          N = 30, M = 1380, D = 14400                   129.9                    17.2
KML–KLLE     N = 30, M = 1380, D = 14400, τ = 0.6          176.1                    15.1

Reprinted from Han et al. (2014), with permission from Elsevier

4.5 Summary

In view of the problems that affect feature reduction and classification in different application scenarios, such as high-dimensional data, differing class distances and outlier interference, this chapter introduced three methods, as follows:
• In pattern recognition, data dimensionality reduction that enhances the dissimilarity between different classes is conducive to classification. To overcome the shortcomings of CMVM, a new algorithm based on CMVM was introduced. In the introduced algorithm, samples that describe the relationship between different classes more accurately are selected and used to establish a more suitable between-class scatter matrix describing the discreteness of different classes. Thus, the effectiveness of classification can be improved. Experiments on three datasets showed that the introduced algorithm has higher recognition rates than the others.
• APLPP was introduced for classification. In the introduced APLPP, artificial parameters, such as the neighbour size, K, are not needed; thus, the introduced APLPP is convenient to implement. It uses the correlation coefficient to adaptively construct the label graph to characterise the discriminative information of different manifolds. Furthermore, the correlation coefficient is used to adaptively construct the local relation graph to characterise the distribution of each manifold. Experiments on UCI datasets and face image datasets showed that the introduced APLPP obtained the best recognition rates compared to the other algorithms.
• A new mapping method was introduced for night-vision applications. The potential of this method was demonstrated on artificial manifolds and real LLL and infrared image datasets. Because of the presented similarity metric, the accuracy of outlier detection and neighbour selection is improved. Moreover, the embedding process is scaled by the outlier probability based on the KML similarity metric, which further inhibits outliers. Although night-vision datasets have great complexity and different intrinsic features, the introduced approach shows better performance than other methods. It is robust to outliers in both dimension reduction and classification, it has high stability to parameter changes, and the ideas introduced for LLE can be extended to other NLDR methods, such as LPP.


References Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. (1996). Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE TPAMI, 19(7), 711–720. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representa-tion. Journal of Neural Computation, 15(6), 1373–1396. Chang, H. D. (2006). Yeung. Robust locally linear embedding. Pattern Recognition, 39(6), 1053–1065. Chen, J., Ma, Z. (2011). Locally linear embedding: A review. International Journal of Pattern Recognition and Artificial Intelligence, 25(07). Choi, H., & Choi, S. (2007). Robust kernel Isomap. Pattern Recognition, 40, 853–862. Chung, F. R. K. (1997). Spectral graph theory, CBMS regional conference series in mathematics (p. 92). American Mathematical Society. CVC Technical Report. (1998). The AR face database. Deledalle, C., Denis, L., & Tupin, F. (2009). Iterative weighted maximum likelihood denoising with probabilistic patch-based weights. IEEE Transactions on Image Processing, 18(12), 2661–2672. Dornaika, F., & Assoum, A. (2013). Enhanced and parametreless locality-preserving projections for face recognition’. Neurocomputing, 99, 448–457. Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd Ed.). Iteration Number Initial. Ge, S. S., He, H., & Shen, C. (2012). Geometrically local embedding in manifolds for dimension reduction. Pattern Recognition, 45, 1455–1470. Ghodsi, A., Huang, J., Souther, F. et al. (2005). Tangent-corrected embedding. IEEE CVPR (pp. 518–525). Graham, D. B., & Allinson, N. M. (1998). Face recognition: From theory to applications. NATO ASI Series F, Computer and Systems Sciences, 163, 446–456. Han, J., Yue, J., Zhang, Y., et al. (2014). Kernel maximum likelihood-scaled locally linear embedding for night vision images. Optics & Laser Technology, 56, 290–298. He, X., Yan, S., Hu, Y., et al. (2005). Face recognition using Laplacianfaces’. IEEE TPAMI, 27(3), 328–340. Hu, H., Ossikovski, R., & Goudail, F. (2013). Performance of Maximum Likelihood estimation of Mueller matrices taking into account physical realizability and Gaussian or Poisson noise statistics. Optics Express, 21(4), 5117–5129. Jin, Z., Yang, J., Hu, Z., et al. (2001). Face recognition based on the uncorrelated discriminant transformation. Pattern Recognition, 34(7), 1405–1416. Jones, M. C., & Henderson, D. A. (2009). Maximum likelihood kernel density estimation: On the potential of convolution sieves. Computational Statistics & Data Analysis, 53, 3726–3733. Lee, K. C., Ho, J., & Kriegman, D. J. (2005). Acquiring linear subspaces for face recognition under variable lighting. IEEE TPAMI, 27(5), 684–698. Li, B., Huang, D. S., Wang, C., et al. (2008). Feature extraction using constrained maximum variance mapping. Jornal of Pattern Recognition, 41(11), 3287–3294. Li, B., & Zhang, Y. (2011). Supervised locally linear embedding projection (SLLEP) for machinery fault diagnosis. Mechanical Systems and Signal Processing, 25, 3125–3134. Li, J. (2012). Gabor filter based optical image recognition using Fractional Power Polynomial model based common discriminant locality-preserving projection with kernels. Optics and Lasers in Engineering, 50, 1281–1286. Li, J., Pan, J., & Chen, S. (2011). Kernel self-optimised locality-preserving discriminant analysis for feature extraction and recognition. Neurocomputing, 74, 3019–3027. Martinez, A. M. (1998). The AR face database (CVC Technical Report), p. 24. Martinez, A. M., Kak, A. C. (2001). PCA versus LDA. C. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 228–233. Niyogi, X. (2004). Locality-preserving projections. Neural information processing systems. MIT, 16, 153–161.

References

125

Pan, Y., Ge, S. S., & Mamun, A. A. (2009). Weighted locally linear embedding for dimension reduction. Pattern Recognition, 42, 798–811. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Journal of Science, 290(5500), 2323–2326. Sim, T., Baker, S., Bsat, M. (2001). The CMU pose, illumination, and expression (PIE) database of human faces (Technica1 Report CMU-RI-TR-01-02), Carnegie Me11on University. Sim, T., Baker, S., Bsat, M. (2003). The CMU pose, illumination, and expression database. C. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618. Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. Wen, G. (2009). Relative transformation-based neighbourhood optimisation for isometric embedding. Neurocomputing, 72, 1205–1213. Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemo metrics and intelli-gent laboratory systems, 2(1), 37–52. Yan, S., Xu, D., Zhang, B., et al. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51. Yang, W., Sun, C., & Zhang, L. (2011). A multi-manifold discriminant analysis method for image feature extraction. Journal of Pattern Recognition, 44(8), 1649–1657. Yu, H., & Yang, J. (2001). A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34(10), 2067–2070. Yun, Z., Xuelian, T., Benyong, L., & Xueang, W. (2010). Radar Target Recognition based on KLLE and a KNRD Classifier. WSEAS Transactions on Signal Processing, 2(6), 47–57. Zhang, S. (2009). Enhanced supervised locally linear embedding. Pattern Recognition Letters, 30, 1208–1218. Zhang, S., Li, L., Zhao, Z. (2010). Spoken emotion recognition using kernel discriminant locally linear embedding. Electronics Letters, 46(19). Zhao, Z., Han, J., Zhang, Y., et al. (2015). A new supervised manifold learning algorithm. ICIG 2015 (pp. 240–251). Springer International Publishing. Zhao, L., & Zhang, Z. (2009). Supervised locally linear embedding with probability-based distance for classification. Computers & Mathematics with Applications, 57, 919–926.

Chapter 5

Night-Vision Data Classification Based on Sparse Representation and Random Subspace

Traditional classification methods struggle to achieve good results when training samples are few, and unsupervised classification algorithms cannot use class information to improve their performance. Thus, it is necessary to apply semi-supervised classification methods. This chapter introduces semi-supervised data classification based on sparse representation and random subspaces. First, the dictionary in sparse representation is simplified to improve the speed and accuracy of sparse representation. To meet the requirement of a complete dictionary, sparse representation is performed in random subspaces whose dimensions are far lower than those of the original samples, and the random subspaces are then combined to enhance the representation ability for the original data. Second, whereas in traditional random subspace methods all features have the same probability of selection, a method based on attribute features is introduced that raises the selection probability of key features and thereby the final accuracy of data classification.

5.1 Classification Methods

The goal of classification is to separate different classes as far apart as possible. Over the past decades, many methods for classification have been introduced, such as the SVM in Vapnik (1998) and LDA in Belhumeur et al. (1997). The classification function of these methods is obtained by minimising an empirical prediction loss function. To process high-dimensional data, many feature extraction algorithms have been proposed [e.g. PCA in Wold et al. (1987), Martinez and Kak (2001)]. PCA is the optimal representation of the input samples in the sense of minimising the MSE. PCA takes the global Euclidean structure into account instead of the manifold structure. However, recent studies have shown that face images possibly reside on a nonlinear sub-manifold. Many manifold learning algorithms have been proposed for discovering the intrinsic low-dimensional sub-manifold embedded in the original high-


dimensional manifold [e.g. ISOMAP in Tenenbaum et al. (2000), LLE in Roweis and Saul (2000), LE in Belkin and Niyogi (2003) and LPP in He et al. (2005)]. The Euclidean-structure-based PCA and the manifold-structure-based ISOMAP, LLE, LE and LPP are fully unsupervised algorithms that can only be used for dimension reduction rather than for classification. LDA, which is based on PCA, and CMVM in Li et al. (2008) and MMDA in Yang et al. (2011), which are based on LPP, are fully supervised algorithms. Thus, they perform well when there is enough label information. However, in the real world, it is difficult to obtain enough labelled samples, whereas unlabelled samples are easily obtained. Therefore, many semi-supervised manifold algorithms have been proposed, which combine the information provided by the labelled and unlabelled samples to obtain better results in many fields [e.g. image annotation in Fan et al. (2012), medical diagnosis in Zhao et al. (2014) and classification in Xue et al. (2009), Fan et al. (2011) and Zhao et al. (2015)].

5.1.1 Research on Classification via Semi-supervised Random Subspace Sparse Representation

In semi-supervised manifold learning algorithms, one constructs a graph to characterise the distribution of all samples. KNN and the ε-ball-based method in Zhu et al. (2003) are the two most used traditional methods. The main steps of KNN or ε-ball are defining the graph and establishing the edge weights. However, when processing high-dimensional data, the K in KNN or the ε in the ε-ball-based method is difficult to determine, and there is no theoretical method for choosing it. The random subspace method (RSM) was first proposed by Ho (1998). On the basis of RSM, many algorithms have been proposed, such as semi-supervised classification based on random subspace dimensionality reduction (SSC–RSDR) (see Yu et al. 2012a, b), semi-supervised ensemble classification (SSEC) (see Yu et al. 2012a, b) and semi-supervised classification based on subspace sparse representation (SSC–SSR) (see Yu et al. 2015). The authors construct graphs in subspaces of the original sample space and obtain results through a majority vote in SSC–RSDR and SSEC, or by minimising a linear regression in SSC–SSR. Traditional RSM-based algorithms (e.g. SSC–RSDR, SSC–SSR and SSEC) only process data in independent random subspaces. However, combining all the random subspaces, as with image fusion, can yield more information than each independent random subspace. In the real world, the same object can be observed from different viewpoints. Thus, multi-view learning has attracted many researchers' attention because it can fuse information from different views. Many multi-view learning algorithms have been proposed for image annotation (see Liu et al. 2014), feature selection (see Shi et al. 2015), clustering (see Yin et al. 2015), etc. In this chapter, each random subspace represents the original high-dimensional space in one aspect, and the subspaces are then fused to obtain multi-view results.


To connect different random subspaces and fuse them, Sect. 5.2 combines the multi-view random subspaces and introduces a new semi-supervised learning algorithm called 'semi-supervised multi-random subspace sparse representation' (SSM–RSSR). SSM–RSSR randomly selects P features in the original space and generates T random subspaces. Then, in each random subspace, it uses all samples as the dictionary to reconstruct each sample and builds the graph of each subspace through subspace sparse representation. Finally, it fuses all random subspaces to obtain the results. By fusing the random subspaces, the introduced SSM–RSSR obtains better and more stable recognition results over a wider range of random subspace dimensions. Experimental results on UCI datasets and several face image datasets demonstrate the effectiveness of the introduced SSM–RSSR. In SSC–RSDR, SSEC and SSC–SSR, when features are selected to compose the random subspace, all features in the original high-dimensional space have the same probability of selection. However, in the real world, and especially in images, some regions are salient and some are not; salient regions are more important than non-salient regions for classification, yet with the above algorithms features in salient and non-salient regions have the same probability of selection. To select features and classify more accurately, in Sect. 5.3 we select the features based on probability and introduce a new semi-supervised learning algorithm called 'probability semi-supervised random subspace sparse representation' (P-RSSR). P-RSSR uses the local entropy over a 3 × 3 neighbourhood to measure the importance of every pixel in the image. Then, on the basis of local entropy, it selects features for each random subspace and obtains classification results in all random subspaces. Finally, it fuses the results of all random subspaces via a majority vote to obtain the final classification results. A sketch of this subspace generation step is given below.
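The sketch below draws T random subspaces of P features; the function name and the optional weights argument (for the local-entropy-biased selection of P-RSSR) are our own choices, not taken from the original description.

    import numpy as np

    def random_subspaces(X, P, T, weights=None, seed=0):
        """Draw T random subspaces of P features; 'weights' biases the draw towards salient features.

        With weights=None every feature is equally likely (plain RSM, as in SSM-RSSR);
        passing per-pixel local-entropy weights gives a P-RSSR-style biased selection.
        """
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        p = None
        if weights is not None:
            p = np.asarray(weights, dtype=float)
            p = p / p.sum()
        subspaces = [rng.choice(D, size=P, replace=False, p=p) for _ in range(T)]
        return [(idx, X[:, idx]) for idx in subspaces]   # feature indices and projected data per subspace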

5.1.2 Research on Classification via Semi-supervised Multi-manifold Structure Regularisation (MMSR)

The most popular semi-supervised learning algorithms are transductive learning algorithms and inductive learning algorithms. Gaussian Fields and Harmonic Functions (GFHF), in Zhu et al. (2003), is a typical transductive learning algorithm. However, its properties make it difficult to extend to new samples, limiting its application. Semi-supervised discriminant analysis (SSDA) is a typical inductive learning algorithm. In SSDA, the labelled samples are used to maximise the discriminating power, whereas the unlabelled samples are used to estimate the intrinsic geometric structure of the data. SSDA can use the unlabelled samples when establishing the within-class scatter matrix, but it ignores the label information, which influences the performance of SSDA. Continuous locality preserving projections (CLPP) was then proposed in Cai et al. (2007). In CLPP, the within-class weight matrix obtained by the ordinary algorithm is improved, and constraints are used to modify the neighbourhood relations. CLPP can use the unlabelled samples, but the edges between unlabelled


samples remain unchanged. Therefore, it cannot make full use of the unlabelled samples. To overcome the drawbacks of ordinary manifold-based semi-supervised algorithms, Belkin proposed manifold regularisation (MR) (see Cevikalp et al. 2008). MR introduces the underlying sample-distribution information of manifold structures into traditional regularisation through two regularisation terms: one term controls the complexity of the classifier, and the other controls the complexity measured by the manifold geometry of the sample distribution. Based on MR, many semi-supervised manifold regularisation methods have been proposed: discriminatively regularised least-squares classification in Belkin et al. (2006); semi-supervised discriminative regularisation (SSDR) in Xue et al. (2009), which adds a discriminative term to MR; multi-manifold discriminative analysis (multiMDA) in Wu et al. (2010); sparse regularised least-squares classification (S-RLSC) in Fan et al. (2011), which uses sparse representation instead of KNN to construct the graph in MR; and local and global regression (LGR) (see Zhao et al. 2015), which combines MR with local and global regression. These algorithms can overcome some of the drawbacks of the original MR. However, when describing the distribution of the unlabelled samples, they cannot use the labelled samples, and discreteness information is ignored when establishing the weight matrix. Thus, Sect. 5.4 introduces a new semi-supervised multi-manifold structure regularisation (MMSR). MMSR considers the discreteness information and uses the distribution of labelled samples to characterise the underlying unlabelled multi-manifold sample distribution when building the weight matrix. Experimental results on UCI datasets and face datasets further demonstrate the effectiveness of MMSR.

5.2 Night-Vision Image Classification via SSM–RSSR

5.2.1 Motivation

Recently, sparse representation (SR) has been used in various machine-learning algorithms, such as classification by Mairal et al. (2008), image super-resolution by Kim and Kwon (2010) and Yang et al. (2008), image denoising by Protter and Elad (2009), feature extraction by Lai et al. (2011, 2012) and signal reconstruction by Aleksandar and Qiu (2010). In SR, a dictionary is used to reconstruct each sample. Many algorithms require a dictionary with sufficient basic samples for SR, which means that the number of atoms in the dictionary must be larger than the dimension of the samples. However, in the real world the dimension of the data is very high, so this requirement often cannot be fulfilled and the sought SR coefficients might not be sparse (see Yan and Wang 2009). In this situation, two approaches are typically used to meet the requirement. The first is to increase the number of atoms in the dictionary, which makes the sparse coefficients unstable and the computation expensive (see Ramirez et al. 2010). The second is to reduce the dimensions


of the samples. In this section, we apply the second approach and solve the SR problem in a random subspace of the original sample space, whose dimension is much lower than that of the original samples. All semi-supervised manifold learning algorithms mentioned above rely on two common assumptions: the manifold assumption and the cluster assumption. The manifold assumption states that the high-dimensional data reside on a low-dimensional sub-manifold. The cluster assumption states that, if two samples are in the same cluster, they have a high likelihood of belonging to the same class. However, traditional manifold learning algorithms overfocus on the smoothness of the manifold and ignore the discriminative information provided by the labelled samples, which is key to classification. By exploring the discriminative information in each subspace, SSM–RSSR can characterise the distribution of all samples more accurately and achieve better results.

Sparse Representation

Suppose that X = [x_1, x_2, ..., x_n] ∈ R^{m×n} is a dictionary that contains enough basic samples, where n is the number of atoms in X and m is the dimension of X. The goal of SR is to reconstruct each sample x_i using as few atoms of X as possible. The objective function of SR is

$$\min_{s_i}\|s_i\|_0\ \ \text{s.t.}\ x_i = Xs_i \quad\text{or}\quad \min_{s_i}\|s_i\|_0\ \ \text{s.t.}\ \|x_i - Xs_i\| \le \varepsilon \tag{5.1}$$

where ε is a small constant parameter and s_i = [s_{i,1}, ..., s_{i,i-1}, 0, s_{i,i+1}, ..., s_{i,n}]^T is the sparse coefficient vector with its ith element equal to zero, indicating that x_i is not used to reconstruct itself. The other elements of s_i are the contributions of the remaining atoms to reconstructing x_i. However, Eq. (5.1) is an NP-hard problem. If the l_1 norm is used instead of the l_0 norm, Eq. (5.1) can be solved by the least absolute shrinkage and selection operator (LASSO) (see Tibshirani 1996) or least angle regression (LARS) (see Drori and Donoho 2006) using the sparse modelling software SPAMS (see Mairal et al. 2010; Jenatton et al. 2010). After obtaining all reconstruction coefficients, the sparse coefficient matrix S becomes

$$S = [s_1, s_2, \ldots, s_n] \tag{5.2}$$

Then, the graph G = {X, S} can be obtained. In SSM–RSSR, to obtain a dictionary containing sufficient basic samples for SR, we use all the samples, both labelled and unlabelled, as the dictionary. Thus, the number of samples in the dictionary is larger than the dimension of the random subspace, which makes the reconstruction coefficients sparse. The graph is constructed by SR to obtain the following advantages.

• The SR-constructed graph is superior to the KNN and ε-ball methods, especially for high-dimensional data. In the KNN or ε-ball-based method, it is unreasonable that all samples share the same parameter. With SR, the graph is established automatically, and different samples obtain different parameters corresponding to their properties.


• When processing high-dimensional data, noise interferes with the graph constructed by KNN or ε-ball. SR, in contrast, has shown its robustness in Wright et al. (2009).
• Discriminative information is naturally important for classification. However, many semi-supervised manifold learning algorithms overfocus on the smoothness of the manifold and ignore this discriminative information. SR has shown its natural discriminative power in Qiao et al. (2010) and works well when processing high-dimensional data.
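To make the graph construction concrete, the following minimal sketch reconstructs each sample from the remaining samples with an l_1-penalised regression. It uses scikit-learn's Lasso as a stand-in for the SPAMS/LARS solvers cited above, and the parameter values and names are illustrative assumptions rather than the book's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_graph(X, eps=1e-3):
    """Build an SR graph weight matrix S for samples X (m x n, one column per sample)."""
    m, n = X.shape
    S = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]            # exclude x_i from its own dictionary
        lasso = Lasso(alpha=eps, fit_intercept=False, max_iter=5000)
        lasso.fit(X[:, idx], X[:, i])                    # x_i ~ X s_i with an l1 penalty
        S[idx, i] = lasso.coef_                          # i-th column holds the sparse coefficients
    return S

# toy usage: 20 samples of dimension 10
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 20))
S = sparse_graph(X)
print(S.shape)   # (20, 20); the diagonal is zero by construction
```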

5.2.2 SSM–RSSR

Given a set of samples and their labels, (X, Y) = {(x_1, y_1), ..., (x_l, y_l), ..., (x_N, y_N)}, where X ∈ R^{M×N} is the training set, Y ∈ R^{N×C} is the label matrix of all training samples, M is the dimension of the samples, N = l + u is the total number of samples and C is the number of classes. The first l samples are labelled; each label y_i is a C × 1 column vector such that, if x_i belongs to the cth class, y_i(c) = 1 and the other elements of y_i are 0. The remaining u samples are unlabelled, and all elements of their label vectors y_i are 0. The objective function of SSM–RSSR is

$$J = \min\{\ell(f, X, Y) + \gamma_A \|f\|_A^2 + \gamma_l \|f\|_{\text{SSM-RSSR}}^2\} \tag{5.3}$$

where ℓ(f, X, Y) is the loss function, and the squared loss is used in SSM–RSSR. ||f||_A^2 is a regularisation term used to avoid overfitting, and ||f||_{SSM-RSSR}^2 is the other regularisation term, used to measure the smoothness of the sample distribution. The squared loss function can be written as

$$\sum_{i=1}^{l}\ell(x_i, y_i, f) = \sum_{i=1}^{l}\|V^T x_i + b - Y_i\|^2 = \sum_{i=1}^{N}(V^T x_i + b - Y_i)^T H_{ii} (V^T x_i + b - Y_i) = \mathrm{tr}\big((V^T X + be^T - Y^T)H(V^T X + be^T - Y^T)^T\big) \tag{5.4}$$

where V ∈ R^{M×C} is the optimal projection matrix, b ∈ R^{C×1} is the bias term, tr(·) is the matrix trace, e is an N-dimensional column vector of ones, and H is an N × N diagonal matrix with H_ii = 1 if x_i is a labelled sample and H_ii = 0 otherwise. Suppose that this section generates T random subspaces. Then, the other regularisation term, ||f||²_{SSM-RSSR}, can be written as

$$\|f\|_{\text{SSM-RSSR}}^2 = \sum_{i=1}^{T}\alpha_i^r \sum_{j=1}^{N}\|V^T x_j - V^T X s_j\|^2, \qquad \sum_{i=1}^{T}\alpha_i = 1,\ \alpha_i \ge 0. \tag{5.5}$$


Here, α_i is the coefficient of each subspace, and r is a constant parameter that stabilises the results. The term Σ_{j=1}^{N} ||V^T x_j − V^T X s_j||² in Eq. (5.5) can be simplified by some algebraic manipulation:

$$\begin{aligned}\sum_{j=1}^{N}\|V^T x_j - V^T X s_j\|^2 &= V^T\Big(\sum_{j=1}^{N}(x_j - Xs_j)(x_j - Xs_j)^T\Big)V\\ &= V^T\Big(\sum_{j=1}^{N}(Xq_j - Xs_j)(Xq_j - Xs_j)^T\Big)V\\ &= V^T X\Big(\sum_{j=1}^{N}(q_j - s_j)(q_j - s_j)^T\Big)X^T V\\ &= V^T X\Big(\sum_{j=1}^{N}(q_j q_j^T - s_j q_j^T - q_j s_j^T + s_j s_j^T)\Big)X^T V\\ &= V^T X L_s X^T V\end{aligned} \tag{5.6}$$

where L_s = I − S − S^T + SS^T, q_j is an N-dimensional selection indicator vector whose jth element is 1 and whose other elements are 0, s_j is the vector of sparse coefficients, I is the N × N identity matrix and S is the sparse coefficient matrix. Combining Eqs. (5.5) and (5.6), the regularisation term ||f||²_{SSM-RSSR} can be rewritten as

$$\|f\|_{\text{SSM-RSSR}}^2 = \sum_{i=1}^{T}\alpha_i^r\, V^T X L_S^i X^T V = V^T X\Big(\sum_{i=1}^{T}\alpha_i^r L_S^i\Big)X^T V = V^T X L X^T V \tag{5.7}$$

where L_S^i is the graph of the ith random subspace and L = Σ_{i=1}^{T} α_i^r L_S^i.
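As a small illustration of Eqs. (5.6) and (5.7), the sketch below forms L_s = I − S − S^T + SS^T for each subspace graph and fuses the graphs with the weights α_i^r. The matrices are random placeholders and the choice r = 2 is only an example, not a value recommended by the book.

```python
import numpy as np

def laplacian_like(S):
    """L_s = I - S - S^T + S S^T from a sparse-coefficient matrix S (n x n)."""
    n = S.shape[0]
    return np.eye(n) - S - S.T + S @ S.T

def fuse_subspace_graphs(S_list, alpha, r=2.0):
    """L = sum_i alpha_i^r * L_s^i, as in Eq. (5.7)."""
    return sum((a ** r) * laplacian_like(S) for a, S in zip(alpha, S_list))

rng = np.random.default_rng(1)
S_list = [rng.random((30, 30)) * 0.05 for _ in range(3)]   # placeholder subspace graphs
alpha = np.full(3, 1.0 / 3)                                 # uniform initial weights
L = fuse_subspace_graphs(S_list, alpha)
print(L.shape)   # (30, 30)
```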

Based on Eqs. (5.4) and (5.7), the objective function of the introduced SSM–RSSR can be written as

$$J = \mathrm{tr}\big((V^T X + be^T - Y^T)H(V^T X + be^T - Y^T)^T\big) + \gamma_A\,\mathrm{tr}(V^T V) + \gamma_l\,\mathrm{tr}(V^T X L X^T V) \tag{5.8}$$

Optimisation of SSM–RSSR

In this section, the solution to Eq. (5.8), which has no direct analytical solution, is derived. We use the alternating method of Bezdek and Hathaway (2002) to obtain the optimal solution. This method updates α and (V, b) iteratively until convergence. The main steps are as follows.

First, α is fixed and V, b are updated. In Eq. (5.8), when α is fixed, the objective function J can be minimised by taking the partial derivatives of J with respect to V and b:

$$\frac{dJ}{dV} = 2XH(X^T V + eb^T - Y) + 2\gamma_A V + 2\gamma_l X L X^T V, \tag{5.9}$$

$$\frac{dJ}{db} = 2(V^T X + be^T - Y^T)He. \tag{5.10}$$

Setting dJ/dV = 0 and dJ/db = 0, we get

$$V = (X H_C X^T + \gamma_A I + \gamma_l X L X^T)^{-1} X H_C Y, \tag{5.11}$$

$$b = \frac{(Y^T - V^T X)He}{N}, \tag{5.12}$$

where H_C = H − (He\,e^T H)/N. In Eqs. (5.11) and (5.12), V and b depend on L, which is derived from the α_i and the sparse graphs L_S^i of the random subspaces.

Then, V, b are fixed and α is updated. Using a Lagrange multiplier λ, the Lagrange function L(α, λ) is

$$L(\alpha, \lambda) = \mathrm{tr}\big((V^T X + be^T - Y^T)H(V^T X + be^T - Y^T)^T\big) + \gamma_A\,\mathrm{tr}(V^T V) + \gamma_l\,\mathrm{tr}\Big(V^T X\Big(\sum_{i=1}^{T}\alpha_i^r L_S^i\Big)X^T V\Big) - \lambda\Big(\sum_{i=1}^{T}\alpha_i - 1\Big). \tag{5.13}$$

Taking the partial derivatives of L(α, λ) with respect to α and λ gives

$$\frac{dL(\alpha, \lambda)}{d\alpha_i} = r\alpha_i^{r-1}\,\mathrm{tr}(V^T X L_S^i X^T V) - \lambda, \quad i = 1, \ldots, T, \tag{5.14}$$

$$\frac{dL(\alpha, \lambda)}{d\lambda} = \sum_{i=1}^{T}\alpha_i - 1. \tag{5.15}$$

Setting dL(α, λ)/dα_i = 0 and dL(α, λ)/dλ = 0 yields

$$\alpha_i = \frac{\big(1/\mathrm{tr}(V^T X L_S^i X^T V)\big)^{1/(r-1)}}{\sum_{i=1}^{T}\big(1/\mathrm{tr}(V^T X L_S^i X^T V)\big)^{1/(r-1)}} \tag{5.16}$$

In Eq. (5.16), if r → ∞, all α_i take the same value; if r → 1, a single α_i = 1, which means that only one view (one subspace) is selected and the other subspaces are ignored. Eqs. (5.11), (5.12) and (5.16) are repeated until α_i converges. After obtaining the optimal α_i, Eqs. (5.11) and (5.12) give the optimal V and b. Then, for each testing sample x_i, its predicted label is obtained using Eq. (5.17):

$$l(x_i) = \max(V^T x_i + b), \tag{5.17}$$

where l(x_i) is the predicted label of x_i, i.e. the class corresponding to the largest element of V^T x_i + b.
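The alternating procedure can be summarised in a few lines of code. The sketch below follows Eqs. (5.11), (5.12) and (5.16) under the assumptions stated above (in particular the reconstructed forms of H_C and b); X, Y, H and the subspace graphs are placeholders, and a fixed iteration count stands in for the book's convergence test.

```python
import numpy as np

def ssm_rssr_train(X, Y, H, Ls_list, gamma_A=0.1, gamma_l=0.1, r=2.0, n_iter=20):
    """Alternate between (V, b) and alpha as in Eqs. (5.11), (5.12) and (5.16)."""
    M, N = X.shape
    T = len(Ls_list)
    e = np.ones((N, 1))
    Hc = H - (H @ e @ e.T @ H) / N                 # assumed form of H_C
    alpha = np.full(T, 1.0 / T)                    # uniform initialisation
    for _ in range(n_iter):
        L = sum((a ** r) * Ls for a, Ls in zip(alpha, Ls_list))
        A = X @ Hc @ X.T + gamma_A * np.eye(M) + gamma_l * X @ L @ X.T
        V = np.linalg.solve(A, X @ Hc @ Y)                      # Eq. (5.11)
        b = ((Y.T - V.T @ X) @ H @ e) / N                       # Eq. (5.12), as reconstructed above
        scores = np.array([np.trace(V.T @ X @ Ls @ X.T @ V) for Ls in Ls_list])
        w = (1.0 / np.maximum(scores, 1e-12)) ** (1.0 / (r - 1.0))
        alpha = w / w.sum()                                     # Eq. (5.16)
    return V, b, alpha

def predict(V, b, x):
    """Eq. (5.17): index of the largest element of V^T x + b."""
    return int(np.argmax(V.T @ x.reshape(-1, 1) + b))
```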


This section uses the following method to generate the T random subspaces and their corresponding dictionaries.

• Randomly generate T selection indicator vectors p_t (1 ≤ t ≤ T):

$$\sum_{i=1}^{M} p_{ti} = P, \quad \text{s.t.}\ p_{ti} \in \{0, 1\}, \tag{5.18}$$

where p_{ti} = 1 if the ith feature is selected in the tth random subspace and p_{ti} = 0 otherwise.

• Suppose that the dictionary is D = [d_1, ..., d_l, ..., d_N]^T. The subspace dictionary D_t and the subspace set X_t are generated as

$$X_t = X(p_t), \quad D_t = D(p_t), \tag{5.19}$$

where X_t = [x_{t1}, ..., x_{tl}, ..., x_{tN}] and x_{ti} ∈ R^{P×1} is the ith sample in X_t, i.e. x_{ti} is composed of the features selected by p_t. Like X_t, every atom in D_t is composed of the features selected by p_t, so D_t = [x_{t1}, ..., x_{tl}, ..., x_{tN}]. The main procedure of SSM–RSSR is listed below, and the corresponding framework diagram is depicted in Fig. 5.1.

SSM–RSSR
• Input: (X, Y) = {(x_1, y_1), ..., (x_l, y_l), ..., (x_N, y_N)}, M, P, T, C and the testing sample x_i.
• Output: l(x_i), the predicted label of x_i.
• Preparation:
  - Generate the dictionary D for all samples in the original space.
• Training:
  For t = 1 to T do
    - Generate a random subspace using Eqs. (5.18) and (5.19), and obtain the subspace sample set X_t and its corresponding dictionary D_t.
    - Use dictionary D_t to reconstruct X_t and obtain the sparse coefficients; then compute L_S^t via Eq. (5.6).
  End For
  - Initialise α_i = 1/T and repeat Eqs. (5.11), (5.12) and (5.16) until convergence to obtain the optimal α_i.
  - Use Eqs. (5.11) and (5.12) to get the optimal V and b.
• Testing:
  - Use Eq. (5.17) to obtain the predicted label l(x_i) of the testing sample x_i.
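In practice, Eqs. (5.18) and (5.19) amount to drawing P distinct feature indices per subspace and keeping the corresponding rows of X (and of the dictionary). A hedged sketch, with illustrative names only:

```python
import numpy as np

def random_subspaces(X, P, T, seed=0):
    """Return T subspace sample sets X_t, each keeping P randomly chosen rows (features) of X."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    masks, subspaces = [], []
    for _ in range(T):
        idx = rng.choice(M, size=P, replace=False)   # p_t: the P selected feature indices
        masks.append(idx)
        subspaces.append(X[idx, :])                  # X_t = X(p_t); D_t is built the same way
    return masks, subspaces
```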


Fig. 5.1 Framework of the introduced SSM–RSSR

5.2.3 Experiment

The UCI datasets and several face image datasets were used to evaluate the performance of SSM–RSSR against Laplacian regularised least-squares (LapRLS), SSDR, PCA, LDA, CLPP (see Cai et al. 2007), SSDA, GFHF (see Zhu et al. 2003), S-RLSC and SSC–SSR. All algorithms were implemented in Matlab (R2016a) and run 10 times on a desktop with an i7 4790 CPU and 8 GB of RAM, and the mean recognition rates were taken as the recognition rates. After feature extraction for PCA, LDA, CLPP and SSDA, a nearest-neighbour classifier was used for classification. For LapRLS, SSDR, S-RLSC and SSC–SSR, we used the classifiers described in the corresponding papers. In all results, the best recognition rates are shown in bold. Fourfold cross-validation was used to obtain the best configuration for the algorithms that require manually set parameters. For LapRLS, SSDR, CLPP and SSDA, the Gaussian kernel parameter σ was selected from {10^-5, 10^-4, ..., 10^1, 10^2}, and the neighbourhood size K was set as follows: if the number of training samples per class is larger than 10, K is set to 10; otherwise, K is set to the number of training samples per class. γ_A and γ_l in SSM–RSSR, LapRLS and SSDR, and γ_d in SSDR, are selected from {10^-5, 10^-4, ..., 10^1, 10^2}. In S-RLSC, all parameters were set as described in Zhao et al. (2015).


Table 5.1 Basic information of the UCI datasets

Name | Number of classes | Size | Dimension
Balancescale | 3 | 49*3 | 4
Heartstatlog | 2 | 120*2 | 13
Musk | 2 | 207*2 | 166
Vehicle | 4 | 189*4 | 18
Waveform | 3 | 1653*3 | 40
Wine | 3 | 48*3 | 83
Sonar | 2 | 120*2 | 60

In SSC–SSR, all parameters were set as described in Yu et al. (2015). The dimension of each subspace P was set to 100 for the image datasets or 0.1·M for the UCI datasets, and the number of random subspaces T was set to 30 for SSM–RSSR and SSC–SSR. To avoid the small-sample-size (SSS) problem and to reduce calculation time, PCA was applied before classification for all algorithms except SSM–RSSR, S-RLSC and SSC–SSR. Nearly 95% of the energy was preserved when selecting the number of principal components.

Experiments on UCI Datasets

This section uses several UCI datasets (i.e. Balancescale, Diabetes, Heartstatlog, Musk (Version 1), Vehicle, Waveform Database Generator (Version 2) and Wine (http://archive.ics.uci.edu/ml/datasets.html)) to test the recognition rates of all algorithms. Basic information on these datasets is shown in Table 5.1. In the main experiment, nearly 20% of the samples of each class were randomly selected to compose the unlabelled set, nearly 40% of the samples of each class were randomly selected to compose the labelled set, and the remaining samples of each class composed the testing set. In all experiments, the testing set remained the same. In GFHF, the training set contained only the labelled samples, because of its transductive property. In the other semi-supervised algorithms, the unlabelled and labelled samples together composed the training set. For the non-semi-supervised algorithms, only the labelled samples composed the training set. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.2. In Tables 5.1 and 5.2, 'Waveform Database Generator' is abbreviated to 'Waveform'. The experimental results on the UCI datasets show that the introduced SSM–RSSR outperforms the other nine algorithms in most situations.

Experiments on Face Image Datasets

(1) Recognition experiments on the AR dataset

In our experiments, we used the AR dataset, containing 3120 images of 120 people, to evaluate the recognition rates of all algorithms. We resized all images in the dataset to 40 × 50; all images of one person are shown in Fig. 5.2.


Table 5.2 Recognition rates (mean ± std) on the UCI datasets

Method | Balancescale | Heartstatlog | Musk | Vehicle | Waveform | Wine | Sonar
SSM–RSSR | 84.21 ± 5.45 | 85.42 ± 2.43 | 100.00 ± 0 | 82.06 ± 2.49 | 86.88 ± 0.73 | 98.85 ± 3.27 | 86.49 ± 5.17
LapRLS | 64.56 ± 4.36 | 81.46 ± 3.37 | 99.88 ± 0.26 | 41.13 ± 2.51 | 85.78 ± 0.63 | 92.96 ± 3.58 | 71.89 ± 5.05
SSDR | 70.53 ± 5.54 | 80.00 ± 2.86 | 96.36 ± 1.66 | 55.47 ± 2.75 | 85.85 ± 0.64 | 94.63 ± 2.95 | 79.87 ± 4.99
PCA | 56.84 ± 3.62 | 78.13 ± 3.87 | 96.98 ± 1.50 | 52.13 ± 2.75 | 82.46 ± 0.76 | 95.74 ± 2.15 | 85.41 ± 2.97
LDA | 77.72 ± 5.43 | 81.15 ± 1.87 | 100.00 ± 0 | 50.80 ± 2.58 | 82.37 ± 0.77 | 97.04 ± 1.29 | 77.43 ± 6.37
CLPP | 49.75 ± 6.71 | 78.23 ± 3.97 | 92.78 ± 2.10 | 52.67 ± 2.27 | 81.87 ± 0.96 | 96.11 ± 2.04 | 81.89 ± 2.13
SSDA | 70.70 ± 7.76 | 76.35 ± 3.54 | 90.74 ± 2.49 | 50.50 ± 2.03 | 59.10 ± 1.38 | 91.48 ± 3.17 | 76.22 ± 2.93
GFHF | 63.10 ± 5.67 | 81.88 ± 2.13 | 79.84 ± 12.84 | 39.91 ± 1.58 | 75.87 ± 0.64 | 92.98 ± 1.53 | 74.74 ± 5.96
S-RLSC | 75.44 ± 5.30 | 76.25 ± 3.02 | 99.94 ± 0.20 | 70.77 ± 1.62 | 82.23 ± 0.60 | 91.67 ± 4.02 | 78.24 ± 4.34
SSC–SSR | 58.87 ± 6.10 | 66.95 ± 2.85 | 76.10 ± 2.11 | 34.18 ± 4.69 | 61.49 ± 8.10 | 80.80 ± 5.95 | 55.21 ± 7.36

Fig. 5.2 Example images of one person in the AR dataset

Table 5.3 Recognition rates (mean ± std) on the AR dataset (columns: number of labelled images per class)

Method | 2 | 4 | 6 | 8 | 10 | 12
SSM–RSSR | 77.34 ± 1.28 | 89.43 ± 1.14 | 94.52 ± 0.67 | 95.83 ± 0.52 | 96.67 ± 0.55 | 98.02 ± 0.49
LapRLS | 73.13 ± 2.50 | 87.55 ± 0.97 | 92.63 ± 0.75 | 95.44 ± 0.93 | 96.90 ± 0.71 | 97.97 ± 0.38
SSDR | 35.22 ± 1.80 | 49.22 ± 0.11 | 57.81 ± 1.27 | 64.28 ± 1.23 | 68.24 ± 1.52 | 72.96 ± 1.62
PCA | 37.32 ± 1.51 | 52.96 ± 0.69 | 61.71 ± 1.13 | 68.34 ± 1.01 | 72.44 ± 1.27 | 77.20 ± 1.32
LDA | 70.30 ± 1.59 | 88.52 ± 0.89 | 94.13 ± 0.99 | 95.97 ± 0.47 | 96.50 ± 0.57 | 97.50 ± 0.68
CLPP | 53.51 ± 1.54 | 71.10 ± 1.48 | 80.64 ± 1.32 | 86.22 ± 0.75 | 89.39 ± 1.26 | 92.57 ± 1.54
SSDA | 43.02 ± 1.77 | 64.76 ± 1.48 | 73.58 ± 2.14 | 77.50 ± 1.41 | 81.60 ± 2.59 | 85.07 ± 1.72
GFHF | 69.63 ± 1.47 | 83.46 ± 1.74 | 86.63 ± 1.50 | 89.50 ± 1.49 | 91.30 ± 1.54 | 93.14 ± 0.85
S-RLSC | 65.75 ± 1.29 | 85.58 ± 0.68 | 92.57 ± 1.00 | 95.54 ± 0.88 | 96.67 ± 0.56 | 97.60 ± 0.47
SSC–SSR | 69.82 ± 0.93 | 81.23 ± 1.42 | 85.11 ± 1.31 | 87.26 ± 1.25 | 87.58 ± 1.52 | 89.27 ± 1.15

We randomly selected six images of each person to build the unlabelled set and randomly selected 2, 4, ..., 12 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.3.


Fig. 5.3 Example images of one person and one cup in the Infrared dataset

Table 5.4 Recognition rates (mean ± std) on the Infrared dataset (columns: number of labelled images per class)

Method | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40
SSM–RSSR | 85.47 ± 4.78 | 92.86 ± 2.30 | 97.13 ± 1.39 | 98.04 ± 1.15 | 99.63 ± 1.12 | 99.84 ± 1.10 | 100.00 ± 0 | 100.00 ± 0
LapRLS | 75.33 ± 5.06 | 89.85 ± 4.57 | 95.07 ± 1.88 | 97.71 ± 1.11 | 98.49 ± 1.47 | 99.30 ± 0.77 | 98.91 ± 1.25 | 99.69 ± 0.32
SSDR | 76.25 ± 3.77 | 76.53 ± 5.00 | 82.74 ± 3.08 | 85.60 ± 4.07 | 87.31 ± 4.97 | 90.76 ± 2.94 | 91.18 ± 2.24 | 92.66 ± 2.38
PCA | 75.07 ± 5.69 | 87.03 ± 2.91 | 92.13 ± 1.15 | 93.88 ± 3.92 | 96.41 ± 1.42 | 98.00 ± 0.99 | 98.22 ± 0.86 | 98.77 ± 0.84
LDA | 82.59 ± 3.75 | 90.81 ± 2.96 | 96.70 ± 1.29 | 97.72 ± 1.93 | 98.87 ± 1.10 | 99.63 ± 0.33 | 99.70 ± 0.57 | 99.81 ± 0.45
CLPP | 74.11 ± 6.43 | 86.14 ± 3.42 | 92.07 ± 1.50 | 93.54 ± 4.32 | 96.86 ± 1.15 | 98.03 ± 0.71 | 98.61 ± 0.73 | 98.90 ± 0.79
SSDA | 72.38 ± 3.78 | 83.94 ± 4.38 | 90.66 ± 2.02 | 92.56 ± 4.37 | 95.89 ± 1.46 | 98.08 ± 1.85 | 97.94 ± 0.72 | 98.69 ± 0.85
GFHF | 81.01 ± 3.05 | 83.20 ± 3.68 | 85.87 ± 3.34 | 87.15 ± 2.56 | 89.79 ± 2.67 | 90.08 ± 2.80 | 93.48 ± 1.39 | 95.25 ± 2.24
S-RLSC | 75.45 ± 3.55 | 89.45 ± 2.77 | 94.82 ± 2.20 | 97.98 ± 0.66 | 98.21 ± 1.66 | 98.98 ± 0.88 | 99.21 ± 0.94 | 99.37 ± 0.79
SSC–SSR | 78.34 ± 2.23 | 88.13 ± 4.11 | 93.53 ± 2.42 | 94.89 ± 1.84 | 98.08 ± 0.81 | 99.43 ± 0.95 | 99.40 ± 0.49 | 99.69 ± 0.80

(2) Recognition experiments on the Infrared dataset

In our experiments, we used the Infrared dataset, containing 1400 images of five people and two cups, to evaluate the recognition rates of all algorithms. We resized all images in the dataset to 40 × 40; several images of one person and one cup are shown in Fig. 5.3. We randomly selected 20 images of each person or cup to build the unlabelled set and randomly selected 5, 10, ..., 40 images of each person or cup to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.4.

(3) Recognition experiments on the ORL dataset

In our experiments, we used the ORL dataset, containing 400 images of 40 people at different rotational angles, to evaluate the recognition rates of all algorithms. We resized all images in the dataset to 56 × 46; all images of one person are shown in Fig. 5.4. We randomly selected three images of each person to build the unlabelled set and randomly selected 1, 2, ..., 6 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.5.


Fig. 5.4 Example images of one person in the ORL dataset

Table 5.5 Recognition rates (mean ± std) on the ORL dataset (columns: number of labelled images per class)

Method | 1 | 2 | 3 | 4 | 5 | 6
SSM–RSSR | 72.08 ± 1.76 | 85.50 ± 2.55 | 91.25 ± 1.76 | 95.83 ± 2.75 | 97.50 ± 2.22 | 100.00 ± 0
LapRLS | 68.04 ± 2.49 | 84.40 ± 4.11 | 90.63 ± 3.15 | 95.33 ± 2.58 | 95.25 ± 2.93 | 97.75 ± 1.42
SSDR | 68.04 ± 2.49 | 78.45 ± 2.98 | 87.25 ± 2.88 | 91.17 ± 2.49 | 91.38 ± 2.97 | 91.50 ± 3.94
PCA | 68.42 ± 2.40 | 82.4 ± 1.24 | 88.81 ± 2.86 | 93.25 ± 1.07 | 94.25 ± 2.30 | 96.00 ± 2.69
LDA | 44.50 ± 4.81 | 73.55 ± 5.37 | 89.75 ± 2.47 | 94.83 ± 1.66 | 96.25 ± 1.57 | 98.00 ± 1.58
CLPP | 57.71 ± 5.67 | 70.95 ± 2.42 | 82.50 ± 3.18 | 89.25 ± 2.17 | 92.00 ± 3.83 | 94.50 ± 4.05
SSDA | 54.58 ± 3.84 | 57.95 ± 7.12 | 59.44 ± 5.96 | 69.00 ± 4.26 | 76.50 ± 6.12 | 74.75 ± 7.21
GFHF | 66.14 ± 1.73 | 75.81 ± 3.46 | 80.61 ± 3.92 | 81.13 ± 2.58 | 83.10 ± 2.51 | 85.19 ± 3.03
S-RLSC | 68.33 ± 3.81 | 82.60 ± 3.20 | 90.19 ± 2.19 | 93.50 ± 1.75 | 96.00 ± 2.19 | 98.00 ± 2.85
SSC–SSR | 65.75 ± 2.45 | 81.55 ± 2.46 | 87.88 ± 2.66 | 89.33 ± 3.70 | 91.88 ± 2.38 | 91.00 ± 5.80

Fig. 5.5 Example images of one person and one cup in the CMU PIE dataset

(4) Recognition experiments on the CMU PIE dataset

In our experiments, we used the CMU PIE dataset, containing 41,368 images of 68 people with different facial expressions and lighting conditions, to evaluate the recognition rates of all algorithms. We selected 80 images of each person to build a new dataset and resized them to 32 × 32; several images of one person are shown in Fig. 5.5. We randomly selected 10 images of each person to build the unlabelled set and randomly selected 10, 12, ..., 20 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.6.

(5) Recognition experiments on the UMIST dataset

In our experiments, we used the UMIST dataset, containing 575 images of 20 people, to evaluate the recognition rates of all algorithms. We resized all images in the dataset to 56 × 46; several images of one person are shown in Fig. 5.6.


Table 5.6 Recognition rates (mean ± std) on the PIE dataset (columns: number of labelled images per class)

Method | 10 | 12 | 14 | 16 | 18 | 20
SSM–RSSR | 83.77 ± 0.74 | 85.25 ± 0.95 | 87.01 ± 0.70 | 87.88 ± 0.53 | 88.63 ± 0.65 | 89.50 ± 0.39
LapRLS | 55.30 ± 2.16 | 58.61 ± 1.68 | 61.77 ± 1.59 | 64.16 ± 1.22 | 65.28 ± 1.61 | 67.50 ± 1.16
SSDR | 35.63 ± 0.83 | 40.65 ± 1.14 | 44.11 ± 0.60 | 46.78 ± 1.26 | 49.35 ± 0.79 | 51.15 ± 1.45
PCA | 33.82 ± 0.73 | 37.52 ± 1.07 | 41.40 ± 0.52 | 44.31 ± 1.02 | 46.92 ± 0.73 | 50.43 ± 0.88
LDA | 74.02 ± 1.01 | 78.49 ± 0.90 | 80.83 ± 2.04 | 81.97 ± 1.01 | 84.49 ± 0.73 | 84.70 ± 0.83
CLPP | 45.36 ± 2.21 | 51.36 ± 1.52 | 55.20 ± 2.04 | 56.70 ± 2.26 | 61.11 ± 2.50 | 64.83 ± 1.67
SSDA | 58.51 ± 0.95 | 64.23 ± 1.19 | 67.77 ± 1.71 | 70.31 ± 1.33 | 73.73 ± 1.05 | 75.14 ± 0.83
GFHF | 60.42 ± 0.57 | 66.56 ± 0.64 | 67.10 ± 0.48 | 75.72 ± 0.75 | 77.55 ± 0.71 | 78.05 ± 1.12
S-RLSC | 71.32 ± 0.70 | 76.39 ± 1.20 | 80.26 ± 0.95 | 82.96 ± 0.76 | 85.38 ± 0.57 | 86.65 ± 0.84
SSC–SSR | 74.76 ± 1.49 | 78.15 ± 0.74 | 78.83 ± 1.67 | 79.32 ± 1.30 | 80.49 ± 0.72 | 81.23 ± 1.89

Fig. 5.6 Example images of one person in the UMIST dataset

Table 5.7 Recognition rates (mean ± std) on the UMIST dataset (columns: number of labelled images per class)

Method | 2 | 3 | 4 | 5 | 6 | 7 | 8
SSM–RSSR | 69.62 ± 4.47 | 75.42 ± 3.42 | 85.46 ± 4.96 | 90.50 ± 2.24 | 91.67 ± 3.34 | 93.75 ± 2.96 | 96.43 ± 2.16
LapRLS | 58.58 ± 7.52 | 71.96 ± 4.75 | 81.50 ± 4.58 | 85.35 ± 3.08 | 89.22 ± 2.08 | 93.13 ± 2.83 | 94.07 ± 3.66
SSDR | 60.31 ± 4.13 | 70.13 ± 3.96 | 76.68 ± 3.62 | 78.75 ± 2.20 | 79.17 ± 2.72 | 81.56 ± 2.78 | 83.29 ± 4.82
PCA | 59.23 ± 4.05 | 70.83 ± 2.91 | 77.55 ± 3.84 | 84.70 ± 4.23 | 87.22 ± 1.78 | 90.81 ± 2.47 | 91.00 ± 2.96
LDA | 55.27 ± 3.53 | 79.17 ± 4.00 | 85.00 ± 3.68 | 91.50 ± 1.95 | 92.78 ± 1.80 | 94.38 ± 1.44 | 96.43 ± 1.16
CLPP | 53.00 ± 3.11 | 66.42 ± 3.70 | 75.00 ± 3.48 | 80.95 ± 4.04 | 84.28 ± 2.33 | 88.56 ± 3.09 | 88.14 ± 3.22
SSDA | 35.04 ± 6.79 | 56.71 ± 4.98 | 67.00 ± 3.66 | 71.50 ± 3.29 | 77.50 ± 3.10 | 82.38 ± 4.50 | 83.21 ± 2.90
GFHF | 66.11 ± 3.91 | 69.31 ± 2.72 | 73.77 ± 5.80 | 84.75 ± 4.65 | 89.23 ± 4.86 | 93.42 ± 2.71 | 96.23 ± 4.53
S-RLSC | 62.19 ± 3.59 | 73.96 ± 5.53 | 83.14 ± 4.17 | 88.55 ± 2.62 | 91.72 ± 3.31 | 92.81 ± 2.57 | 96.79 ± 2.53
SSC–SSR | 59.54 ± 4.19 | 71.29 ± 3.17 | 76.96 ± 2.79 | 80.50 ± 3.44 | 86.67 ± 2.09 | 87.75 ± 2.86 | 90.43 ± 4.22

We randomly selected four images of each person to build the unlabelled set and randomly selected 2, 3, ..., 8 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.7.


Fig. 5.7 Example images of one person in the Yale_B dataset

Table 5.8 Recognition rates (mean ± std) on the Yale_B dataset (columns: number of labelled images per class)

Method | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20
SSM–RSSR | 81.65 ± 1.34 | 85.82 ± 1.43 | 89.91 ± 0.81 | 90.98 ± 1.04 | 92.57 ± 1.08 | 93.60 ± 0.77 | 94.76 ± 0.91 | 95.24 ± 0.59
LapRLS | 56.60 ± 2.65 | 63.81 ± 1.41 | 68.75 ± 1.81 | 72.78 ± 1.76 | 75.30 ± 1.24 | 78.37 ± 1.76 | 80.92 ± 1.91 | 83.33 ± 1.57
SSDR | 22.16 ± 1.42 | 26.11 ± 1.23 | 28.74 ± 1.08 | 31.38 ± 1.61 | 34.04 ± 1.34 | 36.18 ± 1.51 | 37.94 ± 0.90 | 40.01 ± 0.84
PCA | 25.10 ± 1.20 | 28.67 ± 1.11 | 31.85 ± 1.09 | 34.59 ± 1.30 | 37.25 ± 1.69 | 38.59 ± 1.43 | 40.36 ± 1.63 | 42.67 ± 0.48
LDA | 64.27 ± 0.75 | 69.87 ± 1.19 | 74.37 ± 0.93 | 76.59 ± 1.07 | 78.57 ± 0.93 | 80.23 ± 1.31 | 81.99 ± 0.58 | 83.51 ± 0.79
CLPP | 39.03 ± 3.18 | 46.05 ± 3.61 | 51.55 ± 2.29 | 53.48 ± 2.22 | 56.78 ± 3.01 | 60.35 ± 2.40 | 62.77 ± 1.79 | 65.76 ± 1.45
SSDA | 53.60 ± 0.89 | 58.84 ± 1.99 | 62.95 ± 1.19 | 65.14 ± 1.23 | 68.63 ± 1.22 | 69.93 ± 1.28 | 72.13 ± 1.43 | 74.00 ± 1.46
GFHF | 56.54 ± 1.20 | 60.59 ± 0.86 | 67.59 ± 1.74 | 69.92 ± 1.36 | 72.96 ± 1.45 | 75.99 ± 1.78 | 80.98 ± 1.23 | 85.23 ± 1.49
S-RLSC | 56.65 ± 1.42 | 65.41 ± 1.80 | 72.05 ± 1.12 | 75.97 ± 1.52 | 79.68 ± 0.98 | 82.75 ± 0.93 | 84.22 ± 0.52 | 86.10 ± 1.40
SSC–SSR | 74.60 ± 1.37 | 79.94 ± 1.60 | 82.30 ± 2.82 | 83.58 ± 1.20 | 86.11 ± 1.20 | 87.80 ± 1.10 | 88.46 ± 1.46 | 88.84 ± 1.34

(6) Recognition experiments on the Yale_B dataset

In our experiments, we used the extended Yale_B dataset, containing 2432 images of 38 people with different facial expressions and lighting conditions, to evaluate the recognition rates of all algorithms. We resized all images in the dataset to 40 × 35; several images of one person are shown in Fig. 5.7. We randomly selected 10 images of each person to build the unlabelled set and randomly selected 6, 8, ..., 20 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.8. The experimental results on the face image datasets show that the introduced SSM–RSSR outperforms the other nine algorithms in most situations. From these experimental results, we conclude that the introduced SSM–RSSR is superior to the other nine algorithms. Compared to the non-semi-supervised learning algorithms (i.e. PCA and LDA), the introduced SSM–RSSR is based on semi-supervised learning, which is superior to non-semi-supervised learning. Compared to the semi-supervised learning algorithms (i.e. LapRLS, CLPP, SSDA and GFHF), the graph in the introduced SSM–RSSR is constructed by SR, which has clear advantages over graphs constructed by KNN or ε-ball-based methods. Compared to S-RLSC, the graph in the introduced SSM–RSSR is constructed in random subspaces, which is more discriminative than a graph constructed in the original space, as demonstrated by SSC–SSR. Compared to SSC–SSR, in the


introduced SSM–RSSR, different multi-view subspaces are fused. Thus, the distribution of the samples is characterised more accurately and better results are obtained. The comparisons between the introduced SSM–RSSR and SSC–SSR in the following sections demonstrate this point.

Sensitivity on Number of Random Subspaces

The number of random subspaces is an important parameter in random subspace methods. As the number of random subspaces increases, the computational complexity increases linearly. The Yale_B dataset is used to test the sensitivity, with the following experimental configuration: 10 images of each person are randomly selected to compose the unlabelled set, 10 images of each person are randomly selected to compose the labelled set, and the remaining samples of each person compose the testing set. To balance the computational complexity and the performance of the algorithm, the dimension of each subspace P is fixed at 100, and the number of random subspaces is varied from 5 to 40 as 5, 10, ..., 40. The experimental results of the introduced SSM–RSSR and SSC–SSR are shown in Fig. 5.8. From the results in Fig. 5.8, SSC–SSR obtains stable recognition rates only when T ≥ 15, whereas the introduced SSM–RSSR is insensitive to the number of random subspaces, because it fuses the different random subspaces and can therefore characterise the sample distribution more accurately.

Sensitivity on Dimensions of Each Random Subspace

The dimension of each random subspace is another important parameter in random subspace methods. As the dimension of each random subspace increases, the computational complexity increases. We use the Yale_B dataset to test the sensitivity, and the basic experimental configuration is almost identical to that above. To balance the computational complexity and the performance of the algorithm, we fix the number of random subspaces T to 30 and change the dimension of each subspace from 20 to 200 as 20, 40, ..., 200.

Fig. 5.8 Recognition rates versus the number of random subspaces


Fig. 5.9 Recognition rates versus the dimension of each random subspace

The experimental results of the introduced SSM–RSSR and SSC–SSR are shown in Fig. 5.9. From the results in Fig. 5.9, as the dimension of each random subspace increases, the recognition rates of SSC–SSR increase until P ≥ 100, after which they become stable; the introduced SSM–RSSR is insensitive to the dimension of each random subspace, because it fuses the different random subspaces and can characterise the sample distribution more accurately.

Robustness Experiments on AR Dataset

To test the robustness of the introduced SSM–RSSR, we normalised the grey values of all images in the AR dataset to 0–1 and added different levels of Gaussian noise. The mean values of the Gaussian noise are all zero, with standard deviations of 0.01, 0.02, 0.04, 0.06 and 0.08. In these experiments, we randomly selected six images of each person to build the labelled set; the other experimental configurations are the same as in the recognition experiments on the AR dataset above. Several images of one person in the AR dataset with different levels of Gaussian noise are shown in Fig. 5.10, and the experimental results are shown in Fig. 5.11. From the experimental results, the introduced SSM–RSSR is more robust than the other algorithms. Compared to the other algorithms, except S-RLSC and SSC–SSR, the introduced SSM–RSSR is based on SR, which is naturally robust to noise. Compared to S-RLSC, the introduced SSM–RSSR performs SR in random subspaces, which are less affected by noise. Compared to SSC–SSR, the introduced SSM–RSSR fuses the graphs from different random subspaces to obtain better results.


Fig. 5.10 Several images with different levels of Gaussian noise (std = 0.01, 0.02, 0.04, 0.06 and 0.08)

Fig. 5.11 Performance of all algorithms with different levels of Gaussian noise


5.3 Night-Vision Image Classification via P-RSSR

5.3.1 Probability Semi-supervised Random Subspace Sparse Representation (P-RSSR)

Local Entropy

In information theory, Shannon uses entropy to measure the information in a system. If p_i (i = 0, 1, ..., n) is the probability of the ith possible state of the system, the entropy of the system is

$$H = -\sum_{i=0}^{n} p_i \log_2 p_i \tag{5.20}$$

If all p_i = 1/n, H achieves its maximum value and the system contains the maximum amount of information. Generally, the larger the entropy of a system, the more information it contains. On the basis of Shannon entropy, Shiozaki (see Girosi 1998) proposed local entropy to measure an image. Suppose that the grey value of pixel (i, j) in the image is I_0 and that the grey values in the neighbourhood of pixel (i, j) are I_1, I_2, ..., I_n. The local entropy of pixel (i, j) is calculated as

$$p_i = I_i \Big/ \sum_{j=0}^{n} I_j, \tag{5.21}$$

$$H = \sum_{i=0}^{n} p_i \log_2 p_i. \tag{5.22}$$

From Eqs. (5.21) and (5.22), we find that the local entropy is based on the grey-value distribution. If all pixels in the neighbourhood have the same grey value, the local entropy is minimised; if the grey values vary, the local entropy grows. If the grey-value range of the image is 0–255 and the neighbourhood size is 3 × 3, the maximum local entropy is −0.2877 and the minimum local entropy is −3.1699. Thus, the higher the local entropy, the more information the neighbourhood contains. The local entropy is then normalised to 0–1 using these maximum and minimum values. Figure 5.12 shows the local entropy and the Shannon entropy of one person in the AR dataset. In Fig. 5.12b, the Shannon entropy is calculated in a 3 × 3 neighbourhood and then normalised to 0–1. As Fig. 5.12 shows, if Shannon entropy is applied directly to the image, almost all areas receive high probability, whereas, if local entropy is applied, many areas with smooth grey values are suppressed and the salient regions are selected.
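A minimal sketch of the 3 × 3 local entropy of Eqs. (5.21) and (5.22), normalised with the extreme values quoted above. The border handling and the small offset that keeps the logarithm defined are implementation assumptions, not part of the book's method.

```python
import numpy as np

def local_entropy(img, h_max=-0.2877, h_min=-3.1699):
    """3x3 local entropy H = sum p_i log2 p_i (Eqs. 5.21-5.22), normalised to [0, 1]."""
    img = img.astype(np.float64) + 1e-6              # assumed offset so p_i log2 p_i is defined
    H = np.full(img.shape, h_min, dtype=np.float64)  # border pixels keep the minimum value
    rows, cols = img.shape
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2]
            p = patch / patch.sum()                  # Eq. (5.21)
            H[i, j] = np.sum(p * np.log2(p))         # Eq. (5.22)
    return (H - h_min) / (h_max - h_min)             # normalise with the extreme values above
```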

Fig. 5.12 Example of local entropy and Shannon entropy: (a) original image, (b) Shannon entropy, (c) local entropy

P-RSSR

Given a set of samples and their labels (X, Y) = {(x_1, y_1), ..., (x_l, y_l), ..., (x_N, y_N)}, the first l samples are labelled and the other u samples are unlabelled. In this dataset, X ∈ R^{M×N} is the training set, Y ∈ R^{N×C} is the label matrix of all training samples, M is the dimension of the original samples, N = l + u is the total number of samples and C is the number of classes. y_i, the label of x_i, is a C × 1 column vector: if x_i is a labelled sample belonging to the cth class, y_i(c) = 1 and the other elements of y_i are zero; if x_i is an unlabelled sample, all elements of y_i are zero. T random subspaces are generated, and the dimension of each random subspace is P. In P-RSSR, to obtain a dictionary containing sufficient basic samples for SR, all samples, both labelled and unlabelled, are used as the dictionary. Thus, the number of samples in the dictionary is larger than the dimension of the random subspace, which makes the reconstruction coefficients sparse. In Yu et al. (2012a, b) and Yu et al. (2015), all features have the same probability of selection. However, features from different regions should have different probabilities. If features are selected directly according to these probabilities and the number of selected features is much larger than the total number of features (e.g. 30 times larger or more), the occurrences of each feature closely follow the distribution of the local entropy. In practice, however, so many features cannot be selected; usually, 5–20% of all features are selected in 20–50 random subspaces. If 10% of all features are selected in 30 random subspaces, the occurrences of each feature are closer to a random distribution. Thus, a simple and effective method is used to select the features composing each random subspace.


In the training stage, the local entropy of all training images is calculated with a 3 × 3 neighbourhood. The maximum value (−0.2877) and minimum value (−3.1699) are then used to normalise the local entropy to 0–1. After obtaining the normalised local entropy, we add the local entropy maps of all training images to obtain the summed local entropy S_en and calculate its mean value M_en. We find the region ind_0 whose values are larger than M_en and calculate their sum S_0; the other region is ind_1 and its sum is S_1. Equation (5.23) determines how many features are selected in each region:

$$P_0 = \frac{S_0}{S_1 + S_0} \times P, \qquad P_1 = P - P_0. \tag{5.23}$$

P_0 is the number of selected features in ind_0 and P_1 is the number of selected features in ind_1. Thus, for the region whose local entropy is larger than M_en, we randomly select P_0 features; in the remaining region, we randomly select P_1 features using the same strategy. In each random subspace, the objective function of P-RSSR is

$$J = \min\{\ell(f, X, Y) + \gamma_A \|f\|_A^2 + \gamma_l \|f\|_{\text{P-RSSR}}^2\}. \tag{5.24}$$

Here, ℓ(f, X, Y) is the loss function, ||f||_A^2 is the regularisation term of P-RSSR and ||f||²_{P-RSSR} is the other regularisation term. The squared loss function ℓ(f, X, Y) is defined as

$$\ell(f, X, Y) = \sum_{i=1}^{l}\|y_i - f(x_i)\|^2 = \sum_{i=1}^{N}(y_i - f(x_i))^T H_{ii}\,(y_i - f(x_i)) = \mathrm{tr}\big((Y - F)^T H (Y - F)\big). \tag{5.25}$$

F = [f(x_1), f(x_2), ..., f(x_N)]^T, tr(·) is the matrix trace operation and H is an N × N diagonal matrix with H_ii = 1 if x_i is a labelled sample and H_ii = 0 otherwise. The other regularisation term, ||f||²_{P-RSSR}, used to measure the smoothness of the manifold, is defined as

$$\|f\|_{\text{P-RSSR}}^2 = \sum_{i=1}^{N}\Big\|f(x_i) - \sum_{j=1}^{N}s_{ij} f(x_j)\Big\|^2. \tag{5.26}$$

The term $\sum_{i=1}^{N}\|f(x_i) - \sum_{j=1}^{N}s_{ij} f(x_j)\|^2$ can be simplified as in Eq. (5.27), treating F as the matrix whose jth column is f(x_j):


$$\begin{aligned}\sum_{i=1}^{N}\Big\|f(x_i) - \sum_{j=1}^{N}s_{ij} f(x_j)\Big\|^2 &= \sum_{j=1}^{N}(Fq_j - Fs_j)(Fq_j - Fs_j)^T\\ &= F\Big(\sum_{j=1}^{N}(q_j - s_j)(q_j - s_j)^T\Big)F^T\\ &= F\Big(\sum_{j=1}^{N}(q_j q_j^T - s_j q_j^T - q_j s_j^T + s_j s_j^T)\Big)F^T\\ &= F L_s F^T. \end{aligned} \tag{5.27}$$

where L_s = I − S − S^T + SS^T, q_j is an N × 1 column vector whose jth element is 1 and whose other elements are zero, s_j is the vector of sparse coefficients and I is the N × N identity matrix. Suppose that f lies in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel K: X × X → R. Applying the representer theorem (see Girosi 1998; Scholkopf et al. 2000), the solution to Eq. (5.24) can be written as

$$f^*(x) = \sum_{i=1}^{N}\alpha_i k(x, x_i). \tag{5.28}$$

The objective function can then be simplified as

$$J = (K\alpha - Y)^T H (K\alpha - Y) + \gamma_l \alpha^T K^T L_s K\alpha + \gamma_A \alpha^T K\alpha \tag{5.29}$$

where α = [α_1, α_2, ..., α_N], K ∈ R^{N×N} is the Gram matrix and Y = [y_1, y_2, ..., y_N]^T is an N × C matrix. Setting the partial derivative of J with respect to α to zero gives

$$\frac{\partial J}{\partial \alpha} = 2KH(K\alpha - Y) + 2\gamma_l K L_s K\alpha + 2\gamma_A K\alpha = 0. \tag{5.30}$$

Thus, α* can be obtained:

$$\alpha^* = (HK + \gamma_l L_s K + \gamma_A I)^{-1} Y. \tag{5.31}$$

The Gaussian kernel is used, as shown in Eq. (5.32):

$$K(i, j) = k(x_i, x_j) = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big) \tag{5.32}$$
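A minimal sketch of the closed-form solution in Eq. (5.31) with the Gaussian Gram matrix of Eq. (5.32). H, L_s and the training matrix are placeholders, and the regularisation weights are arbitrary example values rather than the ones tuned in the book.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """K(i, j) = exp(-||x_i - x_j||^2 / (2 sigma^2)) (Eq. 5.32); X has one sample per column."""
    sq = np.sum(X ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma ** 2))

def solve_alpha(K, Y, H, Ls, gamma_l=0.1, gamma_A=0.1):
    """alpha* = (H K + gamma_l L_s K + gamma_A I)^(-1) Y (Eq. 5.31)."""
    N = K.shape[0]
    return np.linalg.solve(H @ K + gamma_l * Ls @ K + gamma_A * np.eye(N), Y)
```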


When predicting the label of each testing sample x_i, we calculate the Euclidean distance between x_i and each training sample x_j and use Eqs. (5.33) and (5.34) to predict the label of x_i:

$$K_{\text{test}}(i, j) = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\Big), \tag{5.33}$$

$$Y_{\text{label}} = \max(K_{\text{test}}^T \alpha^*), \tag{5.34}$$

where Y_label is the predicted label of the testing sample x_i.
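Prediction then reduces to evaluating the Gaussian kernel between each testing sample and the training samples and taking the class with the largest score, as in Eqs. (5.33) and (5.34). A small sketch under the same assumptions as above:

```python
import numpy as np

def predict_labels(X_train, X_test, alpha, sigma=1.0):
    """Eqs. (5.33)-(5.34): Gaussian kernel between test and training samples, then argmax."""
    d2 = (np.sum(X_test ** 2, axis=0)[:, None]
          + np.sum(X_train ** 2, axis=0)[None, :]
          - 2 * X_test.T @ X_train)
    K_test = np.exp(-np.maximum(d2, 0.0) / (2 * sigma ** 2))   # one row per testing sample
    scores = K_test @ alpha                                    # K_test^T * alpha in the text's notation
    return np.argmax(scores, axis=1)                           # predicted class index per testing sample
```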

We use the following steps to generate the random subspace sets and random subspace dictionaries in the T random subspaces.

• Randomly generate T selection indicator vectors p_t (1 ≤ t ≤ T):

$$\sum_{i=1}^{M} p_{ti} = P, \quad \text{s.t.}\ p_{ti} \in \{0, 1\}, \tag{5.35}$$

where p_{ti} = 1 if the ith feature is selected in the tth random subspace and p_{ti} = 0 otherwise. As described for Eq. (5.23), we randomly select P_0 features from region ind_0 and P_1 features from region ind_1. The same strategy is used to select the features of every random subspace on the testing data.

• Suppose that the dictionary for all samples is D = [d_1, ..., d_l, ..., d_N]. The subspace dictionary D_t and the subspace set X_t are generated as

$$X_t = X(p_t), \quad D_t = D(p_t), \tag{5.36}$$

where X_t = [x_{t1}, ..., x_{tl}, ..., x_{tN}] and x_{ti} ∈ R^{P×1} is the ith sample in X_t, i.e. x_{ti} is composed of the features selected by p_t. Like X_t, every atom in D_t is composed of the features selected by p_t in the original high-dimensional space, so D_t = [x_{t1}, ..., x_{tl}, ..., x_{tN}]^T.

• In each random subspace, we use Eqs. (5.33) and (5.34) to predict the labels of the testing samples, and we do so in all T random subspaces. We then use a majority vote to obtain the final classification results. The framework of the introduced P-RSSR is shown below.


P-RSSR
• Input: (X, Y) = {(x_1, y_1), ..., (x_l, y_l), ..., (x_N, y_N)}, M, P, T, C and the testing sample x_i.
• Output: l(x_i), the predicted label of x_i.
• Preparation:
  - Calculate the local entropy of all training samples and use Eq. (5.23) to generate the features of every random subspace.
  - Generate the dictionary D = [d_1, ..., d_l, ..., d_N]^T for all samples in the original space.
• Training:
  For t = 1 to T do
    - Generate a random subspace using Eqs. (5.35) and (5.36), and obtain the subspace sample set X_t and its corresponding dictionary D_t.
    - Use dictionary D_t to reconstruct X_t and obtain the sparse coefficients; then obtain L_s using Eq. (5.27).
    - Obtain the optimal α* using Eq. (5.31).
  End For
• Testing:
  For t = 1 to T do
    - Project the testing sample x_i into the tth random subspace using the selection indicator vector p_t.
    - Use Eqs. (5.33) and (5.34) to predict the label of x_i.
  End For
  - Use a majority vote to obtain the predicted label of x_i.
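The final fusion step is an ordinary majority vote over the T per-subspace predictions. A small self-contained sketch follows; the integer-label encoding is an assumption.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (T, n_test) array of per-subspace labels -> fused label per testing sample."""
    predictions = np.asarray(predictions)
    n_test = predictions.shape[1]
    fused = np.empty(n_test, dtype=int)
    for j in range(n_test):
        votes = np.bincount(predictions[:, j])
        fused[j] = int(np.argmax(votes))     # most frequent label wins
    return fused

# example: three subspaces, four testing samples
print(majority_vote([[0, 1, 2, 2], [0, 1, 1, 2], [1, 1, 2, 2]]))   # -> [0 1 2 2]
```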

5.3.2 Experiment

We evaluated the performance of the introduced P-RSSR on several image datasets and compared it with LapRLS, SSDR, PCA, LDA, CLPP (Cai et al. 2007), SSDA, GFHF, S-RLSC and SSC–SSR. All algorithms were run 10 times in Matlab (R2016a) on a desktop with an i7 4790 CPU and 8 GB of RAM, and the mean recognition rates were taken as the recognition rates. In the experiments, for PCA, LDA, CLPP and SSDA, a nearest-neighbour classifier with the Euclidean distance was used for classification during testing. For LapRLS, SSDR, S-RLSC and SSC–SSR, the classifiers described in the corresponding papers were used after training. In all experimental results, the highest recognition rates are shown in bold. Fourfold cross-validation was used to obtain the best configuration for the algorithms that require manually set parameters. In the introduced P-RSSR, the Gaussian kernel parameter σ is selected from {10^-5, 10^-4, ..., 10^1, 10^2}. The weighting parameters of MR, γ_A and γ_l, are selected from {10^-5, 10^-4, ..., 10^1, 10^2}. The dimension of each subspace P is set to 100 for all image datasets, and the number of random subspaces T is set to 30. In LapRLS, SSDR, CLPP and SSDA, the Gaussian kernel parameter σ is selected from {10^-5, 10^-4, ..., 10^1, 10^2}, and the neighbourhood size K in KNN is


set as follows: if the number of training samples per class is larger than 10, K is set to 10; otherwise, K is set to the number of training samples per class. γ_A and γ_l in LapRLS and SSDR, and γ_d in SSDR, are selected from {10^-5, 10^-4, ..., 10^1, 10^2}. PCA is used as a preprocessing stage for all algorithms except S-RLSC, SSC–SSR and the introduced P-RSSR. Nearly 95% of the energy is preserved when selecting the number of principal components, to balance effectiveness and calculation time.

Experiments on Face Image Datasets

Face image datasets, i.e. the AR dataset (Han et al. 2014), Infrared dataset (Samaria et al. 1994), ORL dataset (Sim et al. 2003a, b), CMU PIE dataset (Wechsler et al. 1998), UMIST dataset (Lee et al. 2005) and extended Yale-B (Yale_B) dataset (Chen and Haykin 2002), were used to evaluate the performance of all algorithms. Basic properties of these datasets are shown in Table 5.9.

(1) Experimental results on the AR dataset

The basic properties of the AR dataset are shown in Table 5.9, and all images of one person from the AR dataset are shown in Fig. 5.13. We randomly selected six images of each person to build the unlabelled set and randomly selected 2, 4, ..., 12 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.10.

Table 5.9 Basic properties of the used datasets. Reprinted from Zhao et al. (2018), with permission of Springer

Dataset name | Number of people | Total size | Dimension
AR | 120 | 3120 | 40*50
Infrared | 5 | 1400 | 32*32
ORL | 40 | 400 | 56*46
CMU PIE | 68 | 5440 | 32*32
UMIST | 20 | 575 | 56*46
Yale_B | 38 | 2432 | 40*35

Fig. 5.13 Example images of one person in the AR dataset. Reprinted from Zhao et al. (2018), with permission of Springer


Table 5.10 Recognition rates (mean ± std) on the AR dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 2 | 4 | 6 | 8 | 10 | 12
P-RSSR | 78.82 ± 1.31 | 90.02 ± 0.79 | 94.29 ± 0.68 | 96.02 ± 0.52 | 97.06 ± 0.25 | 98.06 ± 0.44
LapRLS | 73.13 ± 2.50 | 87.55 ± 0.97 | 92.63 ± 0.75 | 95.44 ± 0.93 | 96.90 ± 0.71 | 97.97 ± 0.38
SSDR | 35.22 ± 1.80 | 49.22 ± 0.11 | 57.81 ± 1.27 | 64.28 ± 1.23 | 68.24 ± 1.52 | 72.96 ± 1.62
PCA | 37.32 ± 1.51 | 52.96 ± 0.69 | 61.71 ± 1.13 | 68.34 ± 1.01 | 72.44 ± 1.27 | 77.20 ± 1.32
LDA | 70.30 ± 1.59 | 88.52 ± 0.89 | 94.13 ± 0.99 | 95.97 ± 0.47 | 96.50 ± 0.57 | 97.50 ± 0.68
CLPP | 53.51 ± 1.54 | 71.10 ± 1.48 | 80.64 ± 1.32 | 86.22 ± 0.75 | 89.39 ± 1.26 | 92.57 ± 1.54
SSDA | 43.02 ± 1.77 | 64.76 ± 1.48 | 73.58 ± 2.14 | 77.50 ± 1.41 | 81.60 ± 2.59 | 85.07 ± 1.72
GFHF | 69.63 ± 1.47 | 83.46 ± 1.74 | 86.63 ± 1.50 | 89.50 ± 1.49 | 91.30 ± 1.54 | 93.14 ± 0.85
S-RLSC | 65.75 ± 1.29 | 85.58 ± 0.68 | 92.57 ± 1.00 | 95.54 ± 0.88 | 96.67 ± 0.56 | 97.60 ± 0.47
SSC–SSR | 69.82 ± 0.93 | 81.23 ± 1.42 | 85.11 ± 1.31 | 87.26 ± 1.25 | 87.58 ± 1.52 | 89.27 ± 1.15

Table 5.11 Recognition rates (mean ± std) on the Infrared dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40
P-RSSR | 86.25 ± 3.15 | 94.65 ± 2.51 | 96.41 ± 2.22 | 98.05 ± 0.99 | 99.28 ± 0.71 | 99.73 ± 0.40 | 100 ± 0 | 100 ± 0
LapRLS | 75.33 ± 5.06 | 89.85 ± 4.57 | 95.07 ± 1.88 | 97.71 ± 1.11 | 98.49 ± 1.47 | 99.30 ± 0.77 | 98.91 ± 1.25 | 99.69 ± 0.32
SSDR | 76.25 ± 3.77 | 76.53 ± 5.00 | 82.74 ± 3.08 | 85.60 ± 4.07 | 87.31 ± 4.97 | 90.76 ± 2.94 | 91.18 ± 2.24 | 92.66 ± 2.38
PCA | 75.07 ± 5.69 | 87.03 ± 2.91 | 92.13 ± 1.15 | 93.88 ± 3.92 | 96.41 ± 1.42 | 98.00 ± 0.99 | 98.22 ± 0.86 | 98.77 ± 0.84
LDA | 82.59 ± 3.75 | 90.81 ± 2.96 | 96.70 ± 1.29 | 97.72 ± 1.93 | 98.87 ± 1.10 | 99.63 ± 0.33 | 99.70 ± 0.57 | 99.81 ± 0.45
CLPP | 74.11 ± 6.43 | 86.14 ± 3.42 | 92.07 ± 1.50 | 93.54 ± 4.32 | 96.86 ± 1.15 | 98.03 ± 0.71 | 98.61 ± 0.73 | 98.90 ± 0.79
SSDA | 72.38 ± 3.78 | 83.94 ± 4.38 | 90.66 ± 2.02 | 92.56 ± 4.37 | 95.89 ± 1.46 | 98.08 ± 1.85 | 97.94 ± 0.72 | 98.69 ± 0.85
GFHF | 81.01 ± 3.05 | 83.20 ± 3.68 | 85.87 ± 3.34 | 87.15 ± 2.56 | 89.79 ± 2.67 | 90.08 ± 2.80 | 93.48 ± 1.39 | 95.25 ± 2.24
S-RLSC | 75.45 ± 3.55 | 89.45 ± 2.77 | 94.82 ± 2.20 | 97.98 ± 0.66 | 98.21 ± 1.66 | 98.98 ± 0.88 | 99.21 ± 0.94 | 99.37 ± 0.79
SSC–SSR | 78.34 ± 2.23 | 88.13 ± 4.11 | 93.53 ± 2.42 | 94.89 ± 1.84 | 98.08 ± 0.81 | 99.43 ± 0.95 | 99.40 ± 0.49 | 99.69 ± 0.80

(2) Experimental results on the Infrared dataset

The basic properties of the Infrared dataset are shown in Table 5.9, and several images of one person and one cup are shown in Fig. 5.14. We randomly selected 20 images of each person or cup to compose the unlabelled set and randomly selected 5, 10, ..., 40 images of each person or cup to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.11.

(3) Experimental results on the ORL dataset

The basic properties of the ORL dataset are shown in Table 5.9, and all images of one person are shown in Fig. 5.15.


Fig. 5.14 Example images of one person and one cup in the Infrared dataset. Reprinted from Zhao et al. (2018), with permission of Springer

Fig. 5.15 Example images of one person in the ORL dataset. Reprinted from Zhao et al. (2018), with permission of Springer

Table 5.12 Recognition rates (mean ± std) on the ORL dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 1 | 2 | 3 | 4 | 5 | 6
P-RSSR | 84.58 ± 2.89 | 85.05 ± 2.24 | 92.88 ± 1.91 | 95.83 ± 1.68 | 97.50 ± 1.77 | 100 ± 0
LapRLS | 68.04 ± 2.49 | 84.40 ± 4.11 | 90.63 ± 3.15 | 95.33 ± 2.58 | 95.25 ± 2.93 | 97.75 ± 1.42
SSDR | 68.04 ± 2.49 | 78.45 ± 2.98 | 87.25 ± 2.88 | 91.17 ± 2.49 | 91.38 ± 2.97 | 91.50 ± 3.94
PCA | 68.42 ± 2.40 | 82.4 ± 1.24 | 88.81 ± 2.86 | 93.25 ± 1.07 | 94.25 ± 2.30 | 96.00 ± 2.69
LDA | 44.50 ± 4.81 | 73.55 ± 5.37 | 89.75 ± 2.47 | 94.83 ± 1.66 | 96.25 ± 1.57 | 98.00 ± 1.58
CLPP | 57.71 ± 5.67 | 70.95 ± 2.42 | 82.50 ± 3.18 | 89.25 ± 2.17 | 92.00 ± 3.83 | 94.50 ± 4.05
SSDA | 54.58 ± 3.84 | 57.95 ± 7.12 | 59.44 ± 5.96 | 69.00 ± 4.26 | 76.50 ± 6.12 | 74.75 ± 7.21
GFHF | 66.14 ± 1.73 | 75.81 ± 3.46 | 80.61 ± 3.92 | 81.13 ± 2.58 | 83.10 ± 2.51 | 85.19 ± 3.03
S-RLSC | 68.33 ± 3.81 | 82.60 ± 3.20 | 90.19 ± 2.19 | 93.50 ± 1.75 | 96.00 ± 2.19 | 98.00 ± 2.85
SSC–SSR | 65.75 ± 2.45 | 81.55 ± 2.46 | 87.88 ± 2.66 | 89.33 ± 3.70 | 91.88 ± 2.38 | 91.00 ± 5.80

We randomly selected three images of each person to build the unlabelled set and randomly selected 1, 2, ..., 6 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.12.

(4) Experimental results on the CMU PIE dataset

The basic properties of the CMU PIE dataset are shown in Table 5.9, and several images of one person are shown in Fig. 5.16. We randomly selected 10 images of each person to build the unlabelled set and randomly selected 10, 12, ..., 20 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.13.

(5) Experimental results on the UMIST dataset

The basic properties of the UMIST dataset are shown in Table 5.9, and several images of one person are shown in Fig. 5.17.


Fig. 5.16 Example images of one person and one cup in the CMU PIE dataset. Reprinted from Zhao et al. (2018), with permission of Springer

Table 5.13 Recognition rates (mean ± std) on the CMU PIE dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 10 | 12 | 14 | 16 | 18 | 20
P-RSSR | 85.00 ± 0.42 | 86.39 ± 0.76 | 88.65 ± 0.84 | 89.53 ± 0.48 | 90.59 ± 0.56 | 92.35 ± 0.69
LapRLS | 55.30 ± 2.16 | 58.61 ± 1.68 | 61.77 ± 1.59 | 64.16 ± 1.22 | 65.28 ± 1.61 | 67.50 ± 1.16
SSDR | 35.63 ± 0.83 | 40.65 ± 1.14 | 44.11 ± 0.60 | 46.78 ± 1.26 | 49.35 ± 0.79 | 51.15 ± 1.45
PCA | 33.82 ± 0.73 | 37.52 ± 1.07 | 41.40 ± 0.52 | 44.31 ± 1.02 | 46.92 ± 0.73 | 50.43 ± 0.88
LDA | 74.02 ± 1.01 | 78.49 ± 0.90 | 80.83 ± 2.04 | 81.97 ± 1.01 | 84.49 ± 0.73 | 84.70 ± 0.83
CLPP | 45.36 ± 2.21 | 51.36 ± 1.52 | 55.20 ± 2.04 | 56.70 ± 2.26 | 61.11 ± 2.50 | 64.83 ± 1.67
SSDA | 58.51 ± 0.95 | 64.23 ± 1.19 | 67.77 ± 1.71 | 70.31 ± 1.33 | 73.73 ± 1.05 | 75.14 ± 0.83
GFHF | 60.42 ± 0.57 | 66.56 ± 0.64 | 67.10 ± 0.48 | 75.72 ± 0.75 | 77.55 ± 0.71 | 78.05 ± 1.12
S-RLSC | 71.32 ± 0.70 | 76.39 ± 1.20 | 80.26 ± 0.95 | 82.96 ± 0.76 | 85.38 ± 0.57 | 86.65 ± 0.84
SSC–SSR | 74.76 ± 1.49 | 78.15 ± 0.74 | 78.83 ± 1.67 | 79.32 ± 1.30 | 80.49 ± 0.72 | 81.23 ± 1.89

Fig. 5.17 Example images of one person in the UMIST dataset. Reprinted from Zhao et al. (2018), with permission of Springer

We randomly selected four images of each person to build the unlabelled set and randomly selected 2, 3, ..., 8 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.14.

(6) Experimental results on the Yale_B dataset

The basic properties of the Yale_B dataset are shown in Table 5.9, and several images of one person are shown in Fig. 5.18. We randomly selected 10 images of each person to build the unlabelled set and randomly selected 6, 8, ..., 20 images of each person to build the labelled set. The remaining samples were used for testing. The recognition rates (%) and standard deviations of all algorithms are shown in Table 5.15.


Table 5.14 Recognition rates (mean ± std) on the UMIST dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 2 | 3 | 4 | 5 | 6 | 7 | 8
P-RSSR | 68.08 ± 3.41 | 81.25 ± 3.34 | 86.36 ± 2.42 | 89.00 ± 3.24 | 92.22 ± 2.52 | 94.38 ± 1.40 | 96.43 ± 1.75
LapRLS | 58.58 ± 7.52 | 71.96 ± 4.75 | 81.50 ± 4.58 | 85.35 ± 3.08 | 89.22 ± 2.08 | 93.13 ± 2.83 | 94.07 ± 3.66
SSDR | 60.31 ± 4.13 | 70.13 ± 3.96 | 76.68 ± 3.62 | 78.75 ± 2.20 | 79.17 ± 2.72 | 81.56 ± 2.78 | 83.29 ± 4.82
PCA | 59.23 ± 4.05 | 70.83 ± 2.91 | 77.55 ± 3.84 | 84.70 ± 4.23 | 87.22 ± 1.78 | 90.81 ± 2.47 | 91.00 ± 2.96
LDA | 55.27 ± 3.53 | 79.17 ± 4.00 | 85.00 ± 3.68 | 91.50 ± 1.95 | 92.78 ± 1.80 | 94.38 ± 1.44 | 96.43 ± 1.16
CLPP | 53.00 ± 3.11 | 66.42 ± 3.70 | 75.00 ± 3.48 | 80.95 ± 4.04 | 84.28 ± 2.33 | 88.56 ± 3.09 | 88.14 ± 3.22
SSDA | 35.04 ± 6.79 | 56.71 ± 4.98 | 67.00 ± 3.66 | 71.50 ± 3.29 | 77.50 ± 3.10 | 82.38 ± 4.50 | 83.21 ± 2.90
GFHF | 66.11 ± 3.91 | 69.31 ± 2.72 | 73.77 ± 5.80 | 84.75 ± 4.65 | 89.23 ± 4.86 | 93.42 ± 2.71 | 96.23 ± 4.53
S-RLSC | 62.19 ± 3.59 | 73.96 ± 5.53 | 83.14 ± 4.17 | 88.55 ± 2.62 | 91.72 ± 3.31 | 92.81 ± 2.57 | 96.79 ± 2.53
SSC–SSR | 59.54 ± 4.19 | 71.29 ± 3.17 | 76.96 ± 2.79 | 80.50 ± 3.44 | 86.67 ± 2.09 | 87.75 ± 2.86 | 90.43 ± 4.22

Fig. 5.18 Example images of one person in the Yale_B dataset. Reprinted from Zhao et al. (2018), with permission of Springer

Table 5.15 Recognition rates (mean ± std) on the Yale_B dataset (columns: number of labelled images per class). Reprinted from Zhao et al. (2018), with permission of Springer

Method | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20
P-RSSR | 81.65 ± 1.15 | 85.98 ± 1.02 | 88.57 ± 1.10 | 90.98 ± 0.68 | 93.22 ± 0.99 | 94.40 ± 0.82 | 95.32 ± 0.55 | 96.70 ± 0.75
LapRLS | 56.60 ± 2.65 | 63.81 ± 1.41 | 68.75 ± 1.81 | 72.78 ± 1.76 | 75.30 ± 1.24 | 78.37 ± 1.76 | 80.92 ± 1.91 | 83.33 ± 1.57
SSDR | 22.16 ± 1.42 | 26.11 ± 1.23 | 28.74 ± 1.08 | 31.38 ± 1.61 | 34.04 ± 1.34 | 36.18 ± 1.51 | 37.94 ± 0.90 | 40.01 ± 0.84
PCA | 25.10 ± 1.20 | 28.67 ± 1.11 | 31.85 ± 1.09 | 34.59 ± 1.30 | 37.25 ± 1.69 | 38.59 ± 1.43 | 40.36 ± 1.63 | 42.67 ± 0.48
LDA | 64.27 ± 0.75 | 69.87 ± 1.19 | 74.37 ± 0.93 | 76.59 ± 1.07 | 78.57 ± 0.93 | 80.23 ± 1.31 | 81.99 ± 0.58 | 83.51 ± 0.79
CLPP | 39.03 ± 3.18 | 46.05 ± 3.61 | 51.55 ± 2.29 | 53.48 ± 2.22 | 56.78 ± 3.01 | 60.35 ± 2.40 | 62.77 ± 1.79 | 65.76 ± 1.45
SSDA | 53.60 ± 0.89 | 58.84 ± 1.99 | 62.95 ± 1.19 | 65.14 ± 1.23 | 68.63 ± 1.22 | 69.93 ± 1.28 | 72.13 ± 1.43 | 74.00 ± 1.46
GFHF | 56.54 ± 1.20 | 60.59 ± 0.86 | 67.59 ± 1.74 | 69.92 ± 1.36 | 72.96 ± 1.45 | 75.99 ± 1.78 | 80.98 ± 1.23 | 85.23 ± 1.49
S-RLSC | 56.65 ± 1.42 | 65.41 ± 1.80 | 72.05 ± 1.12 | 75.97 ± 1.52 | 79.68 ± 0.98 | 82.75 ± 0.93 | 84.22 ± 0.52 | 86.10 ± 1.40
SSC–SSR | 74.60 ± 1.37 | 79.94 ± 1.60 | 82.30 ± 2.82 | 83.58 ± 1.20 | 86.11 ± 1.20 | 87.80 ± 1.10 | 88.46 ± 1.46 | 88.84 ± 1.34


The experimental results on the face image datasets show that the introduced P-RSSR outperformed the other nine algorithms in most situations, especially with few labelled samples. From these experimental results, we conclude that the introduced P-RSSR is superior to the other nine algorithms. Compared to the non-semi-supervised learning algorithms (i.e. PCA and LDA), the introduced P-RSSR is based on semi-supervised learning, which is naturally superior to non-semi-supervised learning. Compared to the semi-supervised learning algorithms (i.e. LapRLS, CLPP, SSDA and GFHF), the graph in the introduced P-RSSR is constructed using SR, which has obvious advantages over graphs constructed by KNN or ε-ball methods. Compared to S-RLSC, the graph is constructed in random subspaces, which is more discriminative than graphs constructed in the original space, as in SSC–SSR. Compared to SSC–SSR, the introduced P-RSSR selects features according to probability: features in salient regions have higher probabilities of being selected than features in non-salient regions. Thus, it characterises the distribution of samples more accurately and obtains better results. The comparison of the introduced P-RSSR and SSC–SSR in the next two sections demonstrates this point.

Sensitivity on Number of Random Subspaces

The number of random subspaces is an important parameter in random subspace methods: when the number of random subspaces increases, the computational complexity increases linearly. In this section, Yale_B is used to evaluate the sensitivity to the number of random subspaces. In these experiments, we randomly selected 10 images of each person to build the unlabelled set and randomly selected 10 images of each person to build the labelled set. The remaining images were used for testing. To balance the computational complexity and the performance of the algorithm, we fixed the dimension of each subspace, P, to 100 and changed the number of random subspaces, T, from 5 to 40 in steps of 5. The experimental results of the introduced P-RSSR and SSC–SSR are shown in Fig. 5.19. From these results, SSC–SSR obtains stable recognition rates only when T ≥ 15, whereas the introduced P-RSSR obtains stable and higher recognition rates already when T ≥ 5. We can conclude that the introduced P-RSSR obtains good recognition rates over a wider range of T than SSC–SSR.

Sensitivity of Dimensionality of Each Random Subspace

The dimension of each random subspace is another important parameter in random subspace methods. When the dimension of each random subspace increases, the computational complexity increases; when it equals the dimension of the original space, the random subspace method reduces to the original method. The Yale_B dataset is used to test this sensitivity, and the basic experimental configuration is almost identical to the last section. To balance the computational complexity and the performance of the algorithm, we fixed the number of random subspaces, T, to 30 and changed the dimension of each subspace from 20 to 200 in steps of 20.


Fig. 5.19 Recognition rates versus the number of random subspace. Reprinted from Zhao et al. (2018), with permission of Springer

Fig. 5.20 Recognition rates versus the dimension of each random subspace. Reprinted from Zhao et al. (2018), with permission of Springer

The experimental results of the introduced P-RSSR and SSC–SSR are shown in Fig. 5.20. When the dimension of each random subspace increases, the recognition rates of SSC–SSR and the introduced P-RSSR increase until P ≥ 100; then, the recognition rates become stable. This means that SSC–SSR and P-RSSR have almost the same useful range of P, but the introduced P-RSSR achieves better recognition results. From the sensitivity experiments, we conclude that the introduced P-RSSR obtains higher recognition rates over a wider range of T than SSC–SSR, and also obtains higher recognition rates than SSC–SSR over the same range of P.
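To make the role of the probability-based feature selection concrete, the following sketch shows one way to sample random subspaces with saliency-weighted probabilities. It is only an illustration under assumed inputs (a precomputed per-pixel saliency score and the parameters P and T are assumptions for exposition); it is not the authors' implementation of P-RSSR.

import numpy as np

def sample_random_subspaces(X, saliency, P=100, T=30, seed=0):
    # X: (n_samples, n_features) vectorised images
    # saliency: (n_features,) non-negative score; salient pixels are sampled more often
    rng = np.random.default_rng(seed)
    prob = saliency / saliency.sum()
    subspaces = []
    for _ in range(T):
        idx = rng.choice(X.shape[1], size=P, replace=False, p=prob)
        subspaces.append(idx)          # keep the indices so test samples can be projected later
    return subspaces

Each index set defines one random subspace in which a sparse-representation graph and classifier can be built; the T results are then fused, e.g. by majority vote.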


5.4 Night-Vision Image Classification via MMSR

5.4.1 MR

Regularisation is used to solve the problem of instability when the input samples are insufficient. In the 1960s, Tikhonov and Arsenin pointed out that regularisation could be used to solve such ill-posed problems (see Tikhonov and Arsenin 1977). By adding moderate prior information into the formulation, the regularisation technique makes the solution stable. More recently, regularisation theory has been introduced into the field of machine learning, garnering great achievements (see Zhou et al. 2004). Belkin et al. introduced the underlying sample-distribution information into traditional regularisation to obtain MR, including LapRLS and LapSVM: the geometry of the probability distribution is exploited and treated as a new regularisation term. Spectral graph theory and the geometric viewpoint embodied in manifold learning algorithms were thereby introduced into the regularisation terms of MR. Some semi-supervised learning algorithms only work with both training and test sets (see Zhu et al. 2003); these are called transductive learning algorithms. However, more natural semi-supervised learning algorithms require that the training set only contains labelled and unlabelled samples; the test set is not available during training. MR provides a successful framework for this situation. MR seeks an optimal classification function by minimising the loss function

\min_{f \in H_k} \left\{ \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_A^2 + \gamma_l \|f\|_l^2 \right\}.    (5.37)

V(x_i, y_i, f) is the loss function of MR: the squared loss function (y_i − f(x_i))^2 is used for regularised least squares and the hinge loss function max[0, 1 − y_i f(x_i)] for SVM. \|f\|_A^2 is the regularisation term of MR, which controls the complexity of the classifier to avoid overfitting. \|f\|_l^2 is the other MR regularisation term, controlling the smoothness of the classifier along the sample distribution. In most semi-supervised learning algorithms, \|f\|_l^2 is expressed as

\|f\|_l^2 = \frac{1}{2}\sum_{i,j=1}^{N} W_{ij}(f_i - f_j)^2 = f^{T} L f,

where W_{ij} is used to measure the similarity between samples, f_i represents f(x_i) and f = [f_1, f_2, \ldots, f_N]^T. L = D − W is the Laplacian matrix and D is a diagonal matrix with elements D_{ii} = \sum_{j=1}^{N} W_{ij}.
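As a quick illustration of the smoothness term, the following sketch evaluates f^T L f from a similarity matrix W. The function and variable names are assumptions made for exposition only.

import numpy as np

def smoothness_term(W, f):
    # W: (N, N) symmetric similarity matrix, f: (N,) values f(x_i)
    D = np.diag(W.sum(axis=1))   # degree matrix, D_ii = sum_j W_ij
    L = D - W                    # unnormalised graph Laplacian
    return f @ L @ f             # equals 0.5 * sum_ij W_ij * (f_i - f_j)**2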

5.4.2 Multi-manifold Structure Regularisation (MMSR)

Motivation of MMSR

Because regularisation was originally designed for fitting multivariate functions, not for classification (see Poggio and Girosi 1990a, b), the


regularisation term focused on the smoothness of the function. However, when focusing on classification, samples along the geodesics in the intrinsic geometry of P_x may belong to different classes; thus, the smoothness constraint alone is insufficient for classification. Regularisation algorithms assume that the conditional probability distribution P(y|x) varies smoothly along the geodesics in the intrinsic geometry of P_x, which means that if two samples x_1, x_2 ∈ X are close in the intrinsic geometry of P_x, then P(y|x_1) and P(y|x_2) are similar. Obviously, this assumption is violated when two nearby samples lie near the class boundary but belong to different classes; in this situation, MR regards the two samples as belonging to the same class. Therefore, smoothness is not the key for classification, and the regularisation term must be modified to get better classification results. In this chapter, a new algorithm for classification, called 'semi-supervised multi-manifold structure regularisation' (MMSR), is introduced. MMSR considers the discreteness information and uses the distribution of the labelled samples to characterise the underlying unlabelled sample distribution in view of the multi-manifold structure when building the weighted matrix. Thus, better classification results are obtained.

MMSR

We assume that there are l + u samples belonging to C classes. The first l samples are labelled, {(x_i, y_i)}_{i=1}^{l}, x_i ∈ R^n. The other u samples are unlabelled, {x_i}_{i=l+1}^{l+u}, x_i ∈ R^n, where n is the dimension of the samples. y_i is a 1 × C vector, the label of x_i: if x_i belongs to the c-th class, y_i(c) = 1 and the other elements of y_i are 0. The total number of labelled and unlabelled samples is N = l + u. Many semi-supervised learning algorithms share a common assumption that, within a local structure, similar samples have a high probability of belonging to the same class. In fact, if two samples belong to different classes but are close along the margin of the intrinsic geometry, they are treated as belonging to the same class, and the classification result is adversely influenced. Therefore, when training the classifier, the assumption should be modified to make the classification result more reasonable. The motivation of MMSR is that the class information of labelled samples should be used to characterise the deviation of unlabelled samples. According to this deviation, the regularisation term is modified to make it more suitable for characterising the distribution of samples in the intrinsic geometry. The objective function of the introduced algorithm is shown in Eq. (5.38):

J = \min_{f \in H_k} \left\{ \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_A^2 + \gamma_l \|f\|_{MMSR}^2 \right\},    (5.38)

where V(x_i, y_i, f) is the loss function; the squared loss function is used in MMSR:

V(x_i, y_i, f) = \|y_i - f(x_i)\|^2.    (5.39)

f(x_i) is the predicted label of x_i and y_i is the real label of x_i. \|f\|_A^2 is the other regularisation term, used to control the smoothness of the classifier. \|f\|_{MMSR}^2 is the regularisation term of MMSR; the method used to build \|f\|_{MMSR}^2 is shown as follows.


MMSR pseudocode

Input: training set X, including l labelled samples {x_i}_{i=1}^{l} and u unlabelled samples {x_i}_{i=l+1}^{l+u}; neighbour size K.
Output: weighted matrix W.

1:  Input: training set X and neighbour size K
2:  Output: weighted matrix W
3:  for i = 1 to N do
4:      identify the nearest neighbours {x_i1, ..., x_iK} of x_i by K-nearest neighbours
5:  end for
6:  for i = 1 to N do
7:      if x_i is an unlabelled sample then
8:          num_mix(i) = the number of classes among {x_i1, ..., x_iK}
9:          label_nei(i) = the class with the maximum count in {x_i1, ..., x_iK}
10:     end if
11: end for
12: for i = 1 to N do
13:     if x_i is a labelled sample then
14:         for j = 1 to N do
15:             if x_j is a labelled sample and x_j is in the neighbours of x_i then
16:                 if x_i and x_j belong to the same class then
17:                     W(i, j) = w1
18:                 else if x_i and x_j belong to different classes then
19:                     W(i, j) = -w1
20:                 end if
21:             else if x_j is an unlabelled sample and x_j is in the neighbours of x_i then
22:                 if num_mix(j) > 1 then
23:                     W(i, j) = -w2
24:                 else if num_mix(j) = 1 then
25:                     W(i, j) = w2
26:                 end if
27:             end if
28:         end for
29:     end if
30: end for
31: for i = 1 to N do
32:     if x_i is an unlabelled sample then
33:         for j = 1 to N do
34:             if x_j is a labelled sample and x_j is in the neighbours of x_i then
35:                 if the label of x_j has the maximum count in {x_i1, ..., x_iK} then
36:                     W(i, j) = w1
37:                 else
38:                     W(i, j) = -w1
39:                 end if
40:             else if x_j is an unlabelled sample and x_j is in the neighbours of x_i then
41:                 if label_nei(j) has the maximum count among the label_nei of {x_i1, ..., x_iK} then
42:                     W(i, j) = w2
43:                 else
44:                     W(i, j) = -w2
45:                 end if
46:             end if
47:         end for
48:     end if
49: end for


The above pseudocode shows the process of MMSR. The steps are explained as follows.

• (Lines 3–11) The KNN graph is constructed, and the classes occurring among the K nearest neighbours of each unlabelled sample are counted. During this step, traditional KNN is used to find the nearby samples, and the class composition of the neighbourhood of each unlabelled sample is counted. After obtaining the distribution of nearby samples, the characteristics of the samples are analysed, which makes it more convenient to establish the weighted matrix and to characterise the distribution of samples in the intrinsic geometry.

• (Lines 12–30) The neighbourhoods of labelled samples are handled. During this step, the neighbours of labelled samples are processed, because the label information of these samples is known. The purpose of this step is to find out which neighbours are unlabelled and which belong to other classes. Labelled neighbours belonging to other classes violate the hypothesis that nearby samples along the geodesics in the intrinsic geometry have a high probability of belonging to the same class; to penalise these samples, their weights are assigned negative values. For neighbours belonging to the same class, the weights are assigned larger positive values to strengthen them. For unlabelled neighbours, because the class label information is unknown, only their own nearby samples can be used to evaluate their properties. If all nearby samples of an unlabelled neighbour belong to the same class, this unlabelled sample is considered stable and can be used to characterise the relationship of nearby samples in the intrinsic geometry; its weight is set to a normal positive value. If the nearby samples of an unlabelled neighbour belong to different classes, the unlabelled sample is considered unstable; to penalise it, its weight is assigned a negative value. The degree of penalty is small, because the sample is unlabelled.

• (Lines 31–49) The neighbourhoods of unlabelled samples are handled. During this step, the neighbours of unlabelled samples are processed. Among the nearby labelled samples, those whose class label occurs most frequently are selected as the main nearby samples, and the corresponding weights are strengthened; the other nearby labelled samples are penalised by assigning negative weights. For the unlabelled samples included in the K nearest neighbours of an unlabelled sample, we count their label_nei values and select the class label with the maximum frequency as the main class label. The nearby unlabelled samples corresponding to this class label are the main nearby samples, and their weights remain positive. The other nearby unlabelled samples, which are unstable in the intrinsic geometry, are assigned small negative weights to penalise them.

In this chapter, w1 = 5 and w2 = 1 are set empirically. To characterise the distribution of samples in the intrinsic geometry, the weighted matrix, W, is modified as shown below. The original weighted matrix, W, is shown in Eq. (5.40).

W_{ij} = \begin{cases} 1 & \text{if } x_i \in \mathrm{KNN}(j) \text{ or } x_j \in \mathrm{KNN}(i) \\ 0 & \text{otherwise} \end{cases}    (5.40)

If x_i is a labelled sample, W is modified as shown in Eq. (5.41).

W_{ij} = \begin{cases} w_1 & x_j \in \mathrm{KNN}(i) \text{ and } \mathrm{label}(i) = \mathrm{label}(j) \\ -w_1 & x_j \in \mathrm{KNN}(i) \text{ and } \mathrm{label}(i) \neq \mathrm{label}(j) \\ w_2 & x_j \text{ is unlabelled}, x_j \in \mathrm{KNN}(i) \text{ and } \mathrm{num\_mix}(j) = 1 \\ -w_2 & x_j \text{ is unlabelled}, x_j \in \mathrm{KNN}(i) \text{ and } \mathrm{num\_mix}(j) > 1 \end{cases}    (5.41)

If x_i is an unlabelled sample, W is modified as shown in Eq. (5.42).

W_{ij} = \begin{cases} w_1 & x_j \in \mathrm{KNN}(i) \text{ and the count of } \mathrm{label}(j) \text{ is maximum} \\ -w_1 & x_j \in \mathrm{KNN}(i) \text{ and the count of } \mathrm{label}(j) \text{ is not maximum} \\ w_2 & x_j \text{ is unlabelled}, x_j \in \mathrm{KNN}(i) \text{ and the count of } \mathrm{label\_nei}(j) \text{ is maximum} \\ -w_2 & x_j \text{ is unlabelled}, x_j \in \mathrm{KNN}(i) \text{ and the count of } \mathrm{label\_nei}(j) \text{ is not maximum} \end{cases}    (5.42)
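The construction of W in Eqs. (5.40)–(5.42) can be sketched as follows. This is a simplified reading of the pseudocode (Euclidean KNN, integer class ids with −1 marking unlabelled samples, and the "maximum count" conditions interpreted as majority votes over the neighbourhood); it is not the authors' code.

import numpy as np

def mmsr_weights(X, labels, K=10, w1=5.0, w2=1.0):
    # X: (N, d) samples; labels: (N,) integer class ids, -1 for unlabelled samples
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :K]                 # indices of the K nearest neighbours

    num_mix = np.zeros(N, dtype=int)                    # number of classes among labelled neighbours
    label_nei = -np.ones(N, dtype=int)                  # dominant class among labelled neighbours
    for i in range(N):
        lab = labels[knn[i]][labels[knn[i]] != -1]
        if lab.size:
            num_mix[i] = np.unique(lab).size
            label_nei[i] = np.bincount(lab).argmax()

    W = np.zeros((N, N))
    for i in range(N):
        maj = label_nei[i]                              # majority class around x_i (may be -1)
        for j in knn[i]:
            if labels[i] != -1 and labels[j] != -1:     # labelled x_i, labelled neighbour (Eq. 5.41)
                W[i, j] = w1 if labels[i] == labels[j] else -w1
            elif labels[i] != -1:                       # labelled x_i, unlabelled neighbour (Eq. 5.41)
                W[i, j] = w2 if num_mix[j] <= 1 else -w2
            elif labels[j] != -1:                       # unlabelled x_i, labelled neighbour (Eq. 5.42)
                W[i, j] = w1 if labels[j] == maj else -w1
            else:                                       # unlabelled x_i, unlabelled neighbour (Eq. 5.42)
                W[i, j] = w2 if label_nei[j] == maj else -w2
    return W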

The new regularisation term, \|f\|_{MMSR}^2, is shown in Eq. (5.43):

\|f\|_{MMSR}^2 = \frac{1}{2}\sum_{i,j=1}^{N} W_{ij}(f_i - f_j)^2 = f^{T} L f,    (5.43)

where L = D − W, D is a diagonal matrix with elements D_{ii} = \sum_{j=1}^{N} W_{ij}, f_i represents f(x_i) and f = [f_1, f_2, \ldots, f_N]^T. Suppose that f lies in a reproducing kernel Hilbert space (RKHS), H_k, associated with a Mercer kernel k: X × X → R. According to the classical representer theorem of Girosi (1998) and Scholkopf et al. (2000), the solution to Eq. (5.38) can be written as

f^{*}(x) = \sum_{i=1}^{N} \alpha_i k(x, x_i).    (5.44)

Equation (5.44) is used to replace f(x_i) in Eqs. (5.38) and (5.43), so the objective can be rewritten as

J = (K\alpha - Y)^{T} U (K\alpha - Y) + \gamma_l \alpha^{T} K^{T} L K \alpha + \gamma_A \alpha^{T} K \alpha.    (5.45)

Here α = [α_1, α_2, …, α_N]^T, K ∈ R^{N×N} is the Gram matrix with elements K(i, j) = k(x_i, x_j), U ∈ R^{N×N} is a diagonal matrix with U_{ii} = 1 when x_i is a labelled sample and U_{ii} = 0 otherwise, and Y = [y_1, y_2, …, y_N]^T is an N × C matrix in which every row represents the label of one sample. The derivative of J is shown in Eq. (5.46):

\frac{\partial J}{\partial \alpha} = 2KU(K\alpha - Y) + 2\gamma_l K L K \alpha + 2\gamma_A K \alpha = 0.    (5.46)


Then, Eq. (5.47) can be obtained:

\alpha^{*} = (UK + \gamma_l LK + \gamma_A I_N)^{-1} U Y,    (5.47)

where I_N is the N × N identity matrix. We use the Gaussian kernel

K(i, j) = k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right).    (5.48)

After obtaining α*, for every test sample x_i, the Euclidean distance dis(i, j) to every training sample x_j is calculated, and the method uses Eqs. (5.49) and (5.50) to obtain the predicted label:

K_test(i, j) = \exp\!\left(-\frac{dis(i, j)^2}{2\sigma^2}\right),    (5.49)

Y_label = \max_{1 \le i \le N_train}\left(K_test^{T} \alpha^{*}\right),    (5.50)

where N_train is the number of training samples.
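Putting Eqs. (5.45)–(5.50) together, a compact training/prediction sketch looks as follows. The function and variable names are chosen for illustration only, and this is not the authors' MATLAB implementation.

import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))             # Eq. (5.48)

def mmsr_fit(X, Y, labelled_mask, W, gamma_l=1.0, gamma_A=1.0, sigma=1.0):
    # X: (N, d) training samples; Y: (N, C) one-hot labels (zero rows for unlabelled samples)
    # labelled_mask: (N,) boolean; W: (N, N) MMSR weight matrix from Eqs. (5.40)-(5.42)
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    L = np.diag(W.sum(axis=1)) - W                       # Laplacian of the modified graph
    U = np.diag(labelled_mask.astype(float))
    alpha = np.linalg.solve(U @ K + gamma_l * L @ K + gamma_A * np.eye(N), U @ Y)   # Eq. (5.47)
    return alpha

def mmsr_predict(X_train, X_test, alpha, sigma=1.0):
    K_test = gaussian_kernel(X_train, X_test, sigma)     # Eq. (5.49), shape (N_train, N_test)
    scores = K_test.T @ alpha                            # Eq. (5.50): class scores per test sample
    return scores.argmax(axis=1)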

5.4.3 Experiment

To evaluate the performance of MMSR, it is compared with PCA, CLPP, SSDA, SVM, SSDR and LapRLS. Several UCI datasets and face datasets (Bezdek and Hathaway 2002) were used to evaluate the performance. All algorithms were run 10 times, and the mean training time and mean recognition rate were treated as the training time and the recognition rate. The experiments were implemented on a desktop PC with an i5 3470 CPU and 16-GB RAM using MATLAB (R2014a). For PCA, CLPP and SSDA, after feature extraction, the nearest-neighbour classifier with Euclidean distance was used for classification. For SSDR and LapRLS, the same classifier as MMSR was used. The best configuration was used in all experiments for performance evaluation. All parameters in all algorithms except PCA were set by fourfold cross-validation. The Gaussian parameter, σ, in all algorithms except PCA was selected from {2^{-5}, 2^{-4}, …, 2^{4}, 2^{5}}. The neighbour size, K, in all algorithms except PCA was selected as follows: if the number of training samples per class was greater than 10, K was set to 10; otherwise, K was set to the number of training samples per class. γ_A and γ_l in LapRLS, SSDR and MMSR were selected from {2^{-5}, 2^{-4}, …, 2^{4}, 2^{5}}.
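The fourfold cross-validation used for parameter selection can be reproduced with a simple grid search. The sketch below assumes a generic fit_predict callable and is not tied to any particular algorithm in this chapter; all names and defaults are illustrative assumptions.

import numpy as np
from itertools import product

def fourfold_cv_select(X, y, fit_predict, folds=4, seed=0):
    # fit_predict(X_tr, y_tr, X_va, sigma, gamma) -> predicted labels for X_va (assumed interface)
    sigmas = [2.0 ** p for p in range(-5, 6)]
    gammas = [2.0 ** p for p in range(-5, 6)]
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), folds)
    best, best_acc = None, -1.0
    for sigma, gamma in product(sigmas, gammas):
        accs = []
        for k in range(folds):
            va = parts[k]
            tr = np.concatenate([parts[m] for m in range(folds) if m != k])
            pred = fit_predict(X[tr], y[tr], X[va], sigma, gamma)
            accs.append((pred == y[va]).mean())
        if np.mean(accs) > best_acc:
            best, best_acc = (sigma, gamma), float(np.mean(accs))
    return best, best_acc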


UCI Datasets

In this experiment, UCI datasets (i.e. Breast Cancer Wisconsin (Original), Musk (Version 1), Hayes-Roth, Ionosphere, Image Segmentation, Waveform Database Generator (Version 2) and Wine; see Bezdek and Hathaway 2002) were used to evaluate performance. Their properties are shown in Table 5.16. For convenient programming, in each dataset, the numbers of samples belonging to each class were equalised. In all experiments, 10% of the samples of each class were randomly selected to build the unlabelled set, and 50% of the samples of each class were randomly selected to build the labelled set. These unlabelled and labelled samples comprise the training set; the others comprise the testing set. For non-semi-supervised algorithms, only the labelled samples were used for training, but the test set remained the same. The recognition rates and training times are shown in Table 5.17; the number in brackets is the training time. The results show that MMSR outperformed the other six algorithms on most datasets. Compared with the other algorithms, MMSR characterises the distribution of samples in the intrinsic geometry accurately and thus obtains better results. However, the introduced algorithm cost more time, because more calculation is required when establishing the weighted matrix.

Face Image Datasets

Image data usually have an intrinsic invariant, but transformations, such as rotation and scale, will influence it. However, nearby pixels undergo similar transformations in the embedding manifold, so images can be characterised by a low-dimensional intrinsic manifold. The AR dataset (see Martinez and Benavente 1998), the PIE dataset (see Sim et al. 2003a, b),

Table 5.16 Descriptions of the UCI datasets

Dataset    | Number of classes | Number of each class | Dimension
Musk       | 2                 | 207                  | 167
Hayes-Roth | 3                 | 30                   | 4
Ionosphere | 2                 | 126                  | 34
Segment    | 7                 | 330                  | 19
Waveform   | 3                 | 1653                 | 40
Wine       | 3                 | 48                   | 13

Table 5.17 Classification performance on the UCI datasets

Dataset    | PCA             | CLPP           | SSDA           | SVM             | SSDR           | LapRLS         | MMSR
Musk       | 97.35 (0.021)   | 90.83 (1.90)   | 91.67 (6.3e-3) | 88.58 (0.15)    | 99.75 (0.012)  | 100.00 (0.014) | 100.00 (0.014)
Hayes-Roth | 56.11 (7.90e-4) | 58.89 (0.10)   | 54.17 (7.96e-4)| 43.89 (3.81e-4) | 50.83 (1.20e-3)| 48.06 (1.30e-3)| 53.06 (6.1e-3)
Ionosphere | 86.46 (5.71e-3) | 86.35 (0.74)   | 83.85 (3.1e-3) | 74.79 (3.21e-3) | 90.21 (5.6e-3) | 91.15 (6.41e-3)| 92.50 (0.056)
Segment    | 94.15 (0.25)    | 95.27 (56.05)  | 92.06 (0.19)   | 83.90 (0.058)   | 95.82 (0.45)   | 93.45 (0.47)   | 96.31 (1.43)
Waveform   | 79.31 (1.61)    | 80.49 (258.71) | 65.62 (0.83)   | 73.99 (0.68)    | 86.02 (2.43)   | 86.05 (2.57)   | 86.05 (10.61)
Wine       | 95.19 (1.62e-3) | 95.19 (0.25)   | 91.11 (1.23e-3)| 93.70 (7.17e-4) | 97.04 (2.54e-3)| 94.81 (2.57e-3)| 96.11 (0.015)


Fig. 5.21 Face images of one person in AR dataset

Table 5.18 Classification performance on the AR dataset

L      | PCA           | CLPP           | SSDA         | SVM           | SSDR         | LapRLS       | MMSR
L = 3  | 47.11 (4.64)  | 57.56 (5.43)   | 55.71 (0.10) | 47.97 (0.98)  | 58.51 (0.18) | 65.48 (0.17) | 65.77 (0.39)
L = 6  | 62.08 (8.80)  | 75.08 (21.53)  | 71.61 (0.16) | 62.44 (3.21)  | 77.32 (0.40) | 83.32 (0.36) | 83.71 (0.72)
L = 9  | 70.49 (12.07) | 83.67 (48.22)  | 80.87 (0.25) | 71.92 (6.40)  | 86.48 (0.70) | 91.15 (0.63) | 91.36 (1.11)
L = 12 | 77.30 (14.86) | 90.53 (82.55)  | 85.02 (0.36) | 77.07 (11.01) | 92.28 (1.00) | 94.85 (1.03) | 95.00 (1.75)
L = 15 | 80.61 (18.63) | 93.42 (129.02) | 88.44 (0.53) | 79.55 (16.76) | 94.33 (1.43) | 96.92 (1.59) | 97.03 (2.29)
L = 18 | 84.80 (23.13) | 96.42 (184.51) | 91.10 (0.67) | 80.87 (23.84) | 96.57 (1.98) | 98.18 (2.06) | 98.35 (3.16)
L = 21 | 86.83 (27.54) | 98.63 (250.97) | 92.71 (0.82) | 82.33 (32.04) | 98.08 (2.73) | 98.83 (2.78) | 98.96 (4.01)

the UMIST dataset (see Wechsler et al. 1998), the Yale_B dataset (see Lee et al. 2005) and the infrared image dataset (see Han et al. 2014) were used to evaluate the performance of MMSR. To avoid the small-sample-size problem, PCA was conducted before feature extraction for CLPP, SSDA and SVM; nearly 95% of the energy was preserved to select the number of principal components.

The AR dataset contained 120 people, and each person had 26 images. All images of one person in the AR dataset are shown in Fig. 5.21. All images were resized to 40 × 50. In the experiment, three images per person were randomly selected to build the unlabelled set, and 3, 6, …, 21 images per person were randomly selected to build the labelled set. The others were used for testing. The recognition rates and training times are shown in Table 5.18; the number in brackets is the training time.

The CMU PIE dataset contained 41,368 images of 68 people with different facial expressions and lighting conditions. These images were captured by 13 synchronised cameras and 21 flashes under varying pose, illumination and expression. Several images of one person in the CMU PIE dataset are shown in Fig. 5.22. All the images were resized to 32 × 32, and 80 images of each person were selected to build the dataset. In the experiment, eight images per person were randomly selected to build the unlabelled set, and 8, 16, …, 64 images per person were randomly selected to build the labelled set. The others were used for testing. The recognition rates and training times are shown in Table 5.19; the number in brackets is the training time.

The UMIST dataset contained 575 images of 20 people. For convenient programming, 19 images of each person were randomly selected to build the new dataset. Several images of one person in UMIST are shown in Fig. 5.23. All images were


Fig. 5.22 Face images of one person in PIE dataset

Table 5.19 Classification performance on the CMU PIE dataset

L      | PCA           | CLPP           | SSDA         | SVM           | SSDR         | LapRLS       | MMSR
L = 8  | 36.60 (1.59)  | 49.69 (11.07)  | 60.02 (0.12) | 34.63 (0.80)  | 52.57 (0.31) | 60.82 (0.20) | 61.45 (0.71)
L = 16 | 52.90 (2.93)  | 68.27 (43.28)  | 76.51 (0.23) | 45.31 (2.88)  | 72.66 (0.69) | 79.96 (0.67) | 80.45 (1.33)
L = 24 | 63.86 (4.71)  | 77.47 (95.03)  | 82.83 (0.46) | 57.22 (6.22)  | 82.28 (1.33) | 86.83 (1.38) | 87.14 (2.46)
L = 32 | 72.20 (5.93)  | 83.27 (166.60) | 86.61 (0.64) | 60.20 (10.85) | 87.04 (1.99) | 90.25 (2.06) | 90.46 (3.51)
L = 40 | 77.96 (7.99)  | 86.24 (259.31) | 88.21 (0.96) | 72.45 (16.81) | 89.99 (3.46) | 92.42 (3.32) | 92.52 (5.22)
L = 48 | 82.71 (10.26) | 88.67 (374.61) | 89.83 (1.27) | 77.19 (24.00) | 92.14 (4.69) | 93.59 (4.83) | 93.71 (7.25)
L = 56 | 85.68 (14.71) | 89.83 (509.55) | 90.61 (1.66) | 79.16 (32.50) | 92.79 (6.12) | 94.44 (6.62) | 94.59 (9.76)
L = 64 | 88.50 (18.07) | 91.31 (663.03) | 91.43 (2.14) | 81.62 (42.31) | 94.10 (8.48) | 95.43 (8.85) | 95.51 (12.31)

Fig. 5.23 Face images of one person in UMIST dataset

resized to 56 × 46. In the experiment, two images per person were randomly selected to build the unlabelled set, and 2, 4, …, 16 images per person were randomly selected to build the labelled set. The others were used for testing. The recognition rates and training times are shown in Table 5.20; the number in brackets is the training time.

The Yale_B dataset contained 2432 images of 38 people with different facial expressions and lighting conditions. Several images of one person in the Yale_B dataset are shown in Fig. 5.24. All images were resized to 40 × 35. In the experiment, seven images per person were randomly selected to build the unlabelled set, and 7, 14, …, 49 images per person were randomly selected to build the labelled set. The others were used for testing. The recognition rates and training times are shown in Table 5.21; the number in brackets is the training time.

The infrared dataset contained 1400 images of five people and two cups with different rotational angles. All images were captured with an infrared camera.


Table 5.20 Classification performance on the UMIST dataset

L      | PCA          | CLPP         | SSDA            | SVM           | SSDR           | LapRLS       | MMSR
L = 2  | 62.70 (0.15) | 25.00 (0.07) | 49.33 (2.0e-3)  | 56.40 (0.023) | 60.50 (6.1e-3) | 62.70 (0.01) | 63.00 (0.02)
L = 4  | 78.35 (0.28) | 72.96 (0.31) | 66.62 (3.1e-3)  | 71.19 (0.075) | 76.38 (9.0e-3) | 79.23 (0.01) | 79.38 (0.03)
L = 6  | 87.59 (0.44) | 83.59 (0.65) | 76.64 (5.5e-3)  | 70.36 (0.014) | 85.23 (0.01)   | 87.36 (0.01) | 87.67 (0.04)
L = 8  | 92.44 (0.60) | 89.94 (1.12) | 85.00 (7.24e-3) | 66.94 (0.23)  | 89.00 (0.02)   | 92.17 (0.02) | 92.39 (0.05)
L = 10 | 95.29 (0.77) | 93.36 (1.73) | 89.79 (9.24e-3) | 78.43 (0.34)  | 92.43 (0.03)   | 95.21 (0.02) | 92.39 (0.05)
L = 12 | 96.40 (0.94) | 95.10 (2.39) | 91.20 (0.010)   | 73.40 (0.45)  | 93.30 (0.03)   | 96.80 (0.03) | 96.90 (0.08)
L = 14 | 98.17 (1.13) | 97.50 (3.21) | 94.83 (0.013)   | 71.83 (0.60)  | 96.50 (0.04)   | 95.83 (0.04) | 96.00 (0.09)
L = 16 | 98.50 (1.32) | 98.00 (4.19) | 96.00 (0.014)   | 76.50 (0.74)  | 96.00 (0.05)   | 98.50 (0.05) | 98.50 (0.11)

Fig. 5.24 Face images of one person in Yale_B dataset

Table 5.21 Classification performance on the Yale_B dataset

L      | PCA          | CLPP           | SSDA          | SVM           | SSDR         | LapRLS       | MMSR
L = 7  | 29.77 (0.74) | 47.63 (2.94)   | 58.69 (0.025) | 36.59 (0.37)  | 47.59 (0.07) | 55.02 (0.09) | 55.79 (0.26)
L = 14 | 40.09 (1.21) | 60.53 (11.77)  | 71.03 (0.057) | 47.91 (1.24)  | 67.33 (0.18) | 73.83 (0.18) | 74.40 (0.42)
L = 21 | 45.99 (1.88) | 65.86 (25.88)  | 75.23 (0.11)  | 57.60 (2.63)  | 77.70 (0.67) | 83.04 (0.32) | 83.47 (0.66)
L = 28 | 49.71 (2.60) | 65.86 (25.88)  | 77.87 (0.16)  | 60.95 (4.55)  | 84.36 (0.49) | 88.02 (0.51) | 88.55 (1.00)
L = 35 | 52.14 (3.15) | 71.19 (70.75)  | 79.21 (0.24)  | 65.08 (7.00)  | 88.07 (0.76) | 90.74 (0.80) | 91.08 (1.34)
L = 42 | 52.99 (3.90) | 74.49 (101.70) | 80.84 (0.32)  | 65.77 (10.03) | 91.43 (1.05) | 92.83 (1.00) | 93.06 (1.80)
L = 49 | 54.50 (4.92) | 75.82 (137.53) | 81.61 (0.42)  | 69.15 (13.52) | 93.22 (1.46) | 94.42 (1.39) | 94.74 (2.40)

Each person or cup had 200 images. Several images of one person and one cup in the infrared dataset are shown in Fig. 5.25. All images were resized to 40 × 40. In the experiment, 20 images per class were randomly selected to build the unlabelled set, 20, 40, …, 160 images per class were randomly selected to compose the labelled set, and the others composed the testing set. The recognition rates and training times are shown in Table 5.22; the number in brackets is the training time.

To evaluate performance with the same labelled samples but with different unlabelled samples, 12 images of each person in the AR dataset were randomly selected to build the labelled set, and 3, 4, 5, 6, 7 and 8 images of each person were randomly selected to build the unlabelled set. The other images were used for testing. SSDR, LapRLS and MMSR were selected, and all algorithms were run 10 times.


Fig. 5.25 Images of one person and one cup in the Infrared dataset

Table 5.22 Classification performance on the Infrared dataset

L       | PCA          | CLPP           | SSDA          | SVM           | SSDR         | LapRLS        | MMSR
L = 20  | 94.04 (0.13) | 94.67 (0.87)   | 94.17 (0.01)  | 77.80 (0.082) | 90.86 (0.02) | 95.66 (0.02)  | 95.87 (0.10)
L = 40  | 97.95 (0.22) | 98.17 (3.25)   | 98.52 (0.02)  | 79.76 (0.29)  | 97.17 (0.05) | 98.91 (0.06)  | 98.99 (0.19)
L = 60  | 99.12 (0.35) | 99.37 (7.19)   | 99.64 (0.03)  | 82.81 (0.61)  | 99.26 (0.09) | 99.68 (0.10)  | 99.70 (0.32)
L = 80  | 99.51 (0.49) | 99.64 (12.52)  | 99.80 (0.05)  | 81.63 (1.02)  | 99.41 (0.14) | 99.81 (0.15)  | 99.81 (0.47)
L = 100 | 99.66 (0.67) | 99.80 (19.63)  | 99.93 (0.06)  | 81.18 (1.56)  | 99.79 (0.20) | 99.98 (0.21)  | 99.98 (0.64)
L = 120 | 99.95 (0.82) | 99.98 (28.29)  | 99.98 (0.09)  | 81.26 (2.21)  | 99.81 (0.28) | 100.00 (0.28) | 100.00 (0.81)
L = 140 | 99.93 (1.01) | 100.00 (39.71) | 100.00 (0.12) | 82.54 (2.98)  | 99.93 (0.39) | 100.00 (0.36) | 100.00 (1.00)
L = 160 | 99.79 (1.23) | 99.79 (52.18)  | 100.00 (0.15) | 81.21 (3.86)  | 99.64 (0.47) | 100.00 (0.49) | 100.00 (1.23)

Table 5.23 Classification performance on AR dataset with different unlabelled samples

Number of unlabelled samples | SSDR         | LapRLS        | MMSR
UL = 3                       | 91.33 (1.07) | 94.55 (1.013) | 94.68 (1.69)
UL = 4                       | 91.03 (1.22) | 94.67 (1.08)  | 94.90 (1.94)
UL = 5                       | 91.74 (1.30) | 95.56 (1.27)  | 95.61 (2.18)
UL = 6                       | 91.03 (1.85) | 94.65 (1.48)  | 94.84 (2.48)
UL = 7                       | 90.21 (1.79) | 94.50 (1.66)  | 94.69 (2.75)
UL = 8                       | 90.53 (1.83) | 94.97 (1.88)  | 95.21 (3.05)

The mean training time and mean recognition rate were used as the training time and recognition rate for each algorithm. The results are shown in Table 5.23 and Fig. 5.26. These experimental results show that MMSR achieves the highest recognition rates in most conditions.

5.5 Summary

In view of the difficulty of exploiting labelled information and sample distribution information in learning algorithms, this chapter described the disadvantages of sparse representation and random subspace methods and introduced the following three methods.


Fig. 5.26 Classification performance on AR dataset with different unlabelled samples

• SSM–RSSR was introduced for classification. In the introduced SSM–RSSR, random subspace sparse representation is used to construct a graph characterising the distribution of all samples in each random subspace. Different random subspaces are then fused from a multi-view viewpoint, and the optimal coefficients for the different random subspaces are obtained using an alternating optimisation method. Finally, an MR-based linear classifier is obtained to produce the classification results. Through fusing the random subspaces, SSM–RSSR obtains better and more stable recognition results over a wider range of random subspace dimensions and numbers of random subspaces. Experimental results on UCI datasets and face image datasets showed the effectiveness of the introduced SSM–RSSR.

• P-RSSR was introduced for classification. In the introduced P-RSSR, we calculate the distribution probability of an image and determine which features to select to build each random subspace. SR is then used in the random subspaces, and classifiers are trained under an MR framework in the random subspaces. Finally, the results of all random subspaces are fused via majority vote to obtain the classification results. Thus, P-RSSR characterises the distribution of samples more accurately. Experimental results on face image datasets showed the effectiveness of P-RSSR.

• The results reflect the effectiveness of MMSR. From these results, several conclusions can be drawn.
  – MMSR outperforms the other algorithms on most datasets in the experiments, including the UCI datasets and several face image datasets.
  – Compared to the original MR, MMSR characterises the discreteness of samples and obtains more information about the samples in the intrinsic geometry. Therefore, the introduced algorithm obtains better classification results.
  – Comparing the procedures of the original MR and MMSR, MR is a special case of MMSR. If every class is separated far enough from the others, the nearby neighbours of the unlabelled samples only contain samples from the same class, and MMSR reduces to the original MR. The experimental results on the infrared dataset support this conclusion.


References Aleksandar, D., & Qiu, K. (2010). Automatic hard thresholding for sparse signal reconstruction from NDE measurements. Review of Progress in Quantitative Nondestructive Evaluation, 29(1211), 806–813. Belhumeur, P. N., Hespanha, J. P., & Kriegman D. J. (1997). Eigenfaces versus fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine intelligence, 19(7), 711–720. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neurocomputing, 15(6), 1373–1396. Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization, a geometric framework for learning from labelled and unlabelled examples. Journal on Machine Learning Research, 7, 2399–2434. Bezdek, J. C., Hathaway, R. J. (2002). Some notes on alternating optimisation. In AFSS International Conference on Fuzzy Systems (pp. 288–300). Springer Berlin Heidelberg. http://archive.ics.uci. edu/ml/datasets.html. Cai, D. He, X., Han, J. (2007). Semi-supervised discriminant analysis. ICCV, 1–7. Cevikalp, H., Verbeek, J., Jurie, F., & Klaser, A. (2008). Semi-supervised dimensionality reduction using pairwise equivalence constraints. VISAPP, 1, 489–496. Chen, Z., & Haykin, S. (2002). On different facets of regularization theory. Neurocomputing, 14(12), 2791–2846. Chen, K., & Wang, S. H. (2011). Semi-supervised learning via regularised boosting working on multiple semi-supervised assumptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 129–143. Drori, I., Donoho, D. L. (2006). Solution of L1 minimisation problems by LARS homotopy methods. IEEE International Conference on Acoustics, Speech and Signal Processing, 636–639. Fan, M. Y., Zhang, X. Q., Lin, Z. C., Zhang, Z. F., & Bao, H. J. (2014). A regularised approach for geodesic-based semisupervised multimanifold learning. IEEE Transactions on Image Processing, 23(5), 2133–2147. Fan, M., Zhang, X., Lin, Z., Zhang, Z., Bao, H. (2012). Geodesic based semi-supervised multimanifold feature extraction. ICDM, 852–857. Fan, M. Y., Gu, N. N., Qiao, H., Zhang, B. (2011). Sparse regularization for semi-supervised classification. Pattern Recognition, 44(8), 1777–1784. Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neurocomputing, 10(6), 1455–1480. Han, J., Yue, J., Zhang, Y., & Bai, L. F. (2014). Kernel maximum likelihood-scaled locally linear embedding for night vision images. Optics & Laser Technology, 56(1), 290–298. He, X. F., Yan, S. C., Hu, Y. X., Niuogi, P., & Zhang, H. J. (2005). Face recognition using laplacian faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), 328–340. Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Aanalysis and Machine Intelligence, 20(8), 832–844. Jenatton, R., Mairal, J., Obozinski, G., Bach, F. (2010). Proximal methods for sparse hierarchical dictionary learning. International Conference on Machine Learning, 487–494. Kim, K. I., & Kwon, Y. (2010). Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1127–1133. Lai, Z. H., Wan, M. H., Jin, Z., & Yang, J. (2011). Sparse two dimensional local discriminant projections for feature extraction. Neurocomputing, 74(4), 629–637. Lai, Z. H., Wong, W. K., Jin, Z., Yang, J., & Xu, Y. (2012). Sparse approximation to the eigensubspace for discrimination. 
IEEE Transactions on Neural Networks and Learning Systems, 23(12), 1948–1960. Lee, K. C., Ho, J., & Kriegman, D. J. (2005). Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 684–698.


Li, B., Huang, D. S., Wang, C., & Liu, K. H. (2008). Feature extraction using constrained maximum variance mapping. Pattern Recognition, 41(11), 3287–3294. Liu, W. F., Tao, D. C., Cheng, J., & Tang, Y. Y. (2014). Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding, 118, 50–60. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A. (2008). Supervised dictionary learning. Conference on Neural Information Processing Systems, 1–8. Mairal, J., Jenatton, R., Obozinski, G., Bach, F. (2010). Network flow algorithms for structured sparsity. Conference on Neural Information Processing Systems, 1558–1566. Mallapragada, P. K., Jin, R., Jain, A. K., & Liu Semiboost, Y. (2009). Boosting for semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11), 2000–2014. Martinez, A. M., Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine intelligence, 23(2), 228–233. Martinez, A. M., Benavente R. (1998). The AR face database. CVC Technical Report (p. 24). Poggio, T., & Girosi, F. (1990b). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945), 978–982. Poggio, T., Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE. 78(9), 1481–1497. Protter, M., & Elad, M. (2009). Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(18), 27–35. Qiao, L. S., Chen, S. C., & Tan, X. Y. (2010). Sparsity preserving discriminant analysis for single training image face recognition. Pattern Recognition Letters, 31, 422–429. Ramirez, I., Sprechmann, P., Sapiro, G. (2010). Classification and clustering via dictionary learning with structured incoherence and shared features. IEEE Conference on Computer Vision and Pattern Recognition, 3501–3508. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimension reduction by locally linear embedding. Science, 290(5), 2323–2326. Samaria, F., Harter, A. (1994). Parametreisation of a stochastic model for human face identification. In Proceedings of the second IEEE workshop on applications of computer vision (pp. 138–142). Scholkopf, B., Herbrich, R., & Smola, A. J. (2000). A generalised representer theorem. Conference on Computational Learning Theory, 42(3), 416–426. Shi, C. J., Ruan, Q. Q., An, G. Y., Ge, C., & An, G. (2015). Semi-supervised sparse feature selection based on multi-view Laplacian regularization. Image and Vision Computing, 41, 1–10. Sim, T., Baker, S., & Bsat, M. (2003a). The CMU pose illumination and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618. Sim, T., Baker, S., & Bsat, M. (2003b). The CMU pose, illumination, and expression database. TPAMI., 25(12), 1615–1618. Tenenbaum, J. B., Silva, V. D., & Langform, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5000), 2319–2323. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, 58(1), 267–288. Tikhonov, A. N., Arsenin, V. Y. (1977). Solutions of ill-posed problems. Vh Winston. Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley. Wechsler, H., Phillips, P. J., Bruce, V., Fogelman, F., & Huang, T. S. (1998). Face recognition: from theory to applications. NATO ASI Series F. Computer and Systems Sciences, 163, 446–456. Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. 
Chemometrics and Intelligent Laboratory Systems, 2(1), 37–52. Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine intelligence, 31(2), 210–227. Wu, F., Wang, W., Yang, Y., Zhuang, Y., & Nie, F. (2010). Classification by semi-supervised discriminative regularization. Neurocomputing, 73(10), 1641–1651. Xue, H., Chen, S., & Yang, Q. (2009). Discriminatively regularised least-squares classification. Pattern Recognition, 42(1), 93–104.


Yan, S., Wang, H. (2009). Semi-supervised learning by sparse representation. SIAM International Conference on Data Mining, 792–801. Yang, W. K., Sun, C. Y., & Zhang, L. (2011). A multi-manifold discriminant analysis method for image feature extraction. Pattern Recognition, 44(8), 1649–1657. Yang, J. C., Wright, J., Huang, T., Ma, Y. (2008). Image super-resolution as sparse representation of raw image patches. IEEE Conference on Computer Vision and Pattern Recognition, 2378–2385. Yin, Q. Y., Wu, S., He, R., & Wang, L. (2015). Multi-view clustering via pairwise sparse subspace representation. Neurocomputing, 156, 12–21. Yu, G. X., Zhang, G., Domeniconi, C., Yu, Z. W., & You, J. (2012a). Semi-supervised classification based on random subspace dimensionality reduction. Pattern Recognition, 45(3), 1119–1135. Yu, G. X., Zhang, G., Yu, Z. W., Domeniconi, C., You, J., & Han, G. Q. (2012b). Semi-supervised ensemble classification in subspaces. Applied Soft Computing, 12(5), 1511–1522. Yu, G. X., Zhang, G. J., Zhang, Z. L., Yu, Z. W., & Lin, D. (2015). Semi-supervised classification based on subspace sparse representation. Knowledge and Information Systems, 43(1), 81–101. Zhao, M. B., Chow, T. W. S., Zhou, W., Zhang, Z., & Li, B. (2014). Automatic image annotation via compact graph based semi-supervised learning. Knowledge-Based Systems, 76, 148–165. Zhao, M. B., Zhan, C., Wu, Z., & Tang, P. (2015). Semi-supervised image classification based on local and global regression. IEEE Signal Processing Letters, 22(10), 1666–1670. Zhao, Z., Bai, L., Zhang, Y. et al. (2018). Probabilistic semi-supervised random subspace sparse representation for classification. Multimedia Tools and Applications. Springer, pp. 1–27. Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in Neural Information Processing Systems, 16(16), 321–328. Zhu, X. J., Ghahramani, Z. B., Lafferty, J. D. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. International Conference on Machine Learning, 912–919. Zhu, X. (2005). Semi-supervised learning literature survey.

Chapter 6

Learning-Based Night-Vision Image Recognition and Object Detection

Image recognition and detection play an important role in many fields, especially in night-vision technology. Traditional methods of image recognition are largely based on hand-selected information extracted from an image for classification. This requires users to select appropriate features and build feature representations of the original images according to the specific circumstances. Manual selection of features is laborious, heuristic work, and the features acquired in this way usually have poor robustness. To obtain image features that are more robust and generalise better, many scholars have applied machine-learning methods: through training with large volumes of data samples, the performance of recognition and detection is improved, with better accuracy and robustness. This chapter first describes the operating principle of machine learning in image recognition and object detection. It then addresses feature learning on noisy datasets and introduces a lossless-constraint denoising autoencoder (LDAE) method, which adds a lossless constraint to the reconstruction error function of traditional autoencoders. The LDAE can eliminate the influence of noise in a noisy training dataset when learning features, yielding better recognition rates. In application fields, most approaches focus on hardware design and optimisation without considering the hardware characteristics and the algorithm structure, which leads to poor flexibility and adaptability. To solve this problem, we introduce an embedded system, based on a CPU and an FPGA, for object detection using an improved deformable part model.

6.1 Machine Learning in IM

Machine learning has a wide range of studies and applications in IM using big samples. Originally, machine-learning methods used artificially designed features to compute feature representations of images and then used some classifiers to classify them. After years of development, automatic feature-learning methods, based on data


training, have become more important and useful. The following sections introduce some approaches from this area.

6.1.1 Autoencoders

The autoencoder was proposed by Rumelhart et al. (1988) and Baldi and Hornik (1989). Its main purpose is learning compressed and distributed feature representations from a given dataset. Autoencoders are typically three-layer neural networks: they provide an encoding process between the input and hidden layers and a decoding process between the hidden and output layers. The autoencoder obtains an encoding representation of the input data through the encoding operation and reconstructs the input data via the decoding operation on the hidden layer. It then defines a reconstruction error function to measure the learning effect. Constraint conditions can be added to the basic autoencoder to produce variants. Bengio et al. (2006), Schölkopf et al. (2006) and Bengio (2009) proposed the sparse autoencoder (SAE), which requires the neuron activations on the hidden layer to satisfy a certain sparsity; in this algorithm, an additional penalty factor is added to the reconstruction error function to achieve sparsity. Vincent et al. (2008) proposed the denoising autoencoder (DAE) in 2008. The main idea is to corrupt the input vector and then use the corrupted input to reconstruct the original input; if the autoencoder can reconstruct the original input from the corrupted input, then the network is robust to the input data. Bengio et al. (2007), Bengio and Delalleau (2011) and Erhan et al. (2010) used neural networks to construct deep networks, with each layer pre-trained as an autoencoder to provide better initialisation for the deep neural network and thus obtain better high-level features.
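The three-layer structure described above can be written compactly as an encode/decode pair. The sketch below uses sigmoid activations and assumed weight shapes purely for illustration; it is not tied to any specific network in this chapter.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W1, b1, W2, b2):
    # x: (d,) input; W1: (h, d), b1: (h,) encoder; W2: (d, h), b2: (d,) decoder
    hidden = sigmoid(W1 @ x + b1)        # encoding (cognitive) step
    output = sigmoid(W2 @ hidden + b2)   # decoding (reconstruction) step
    return hidden, output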

6.1.2 Feature Extraction and Classifier

Object detection is mainly used for traffic safety and unmanned aerial vehicle (UAV) surveillance and has aroused the widespread concern of many scholars. Most studies use various sensors to design intelligent systems. Whereas current GPS and radar sensors are well developed and show good performance in obstacle avoidance, the amount of information they provide is far less than that of visual sensors. With the development of visual sensors, their cost and size have been greatly reduced, making them more convenient for applications such as vehicle driving-assist systems and unmanned airborne vehicles. Simultaneously, chip technology has made considerable progress, enabling complex algorithms to be implemented on embedded platforms. A multi-feature fusion method was proposed to identify UAV aerial images in Wang (2015); whereas this method had good accuracy, its real-time performance was too poor for deployment on UAVs.


For static vehicle detection, most current algorithms are based on shape features (Yang and Yang 2014), such as HOG–LBP, Haar-like AdaBoost and DPM–SVM. To address accuracy and real-time requirements, Liu et al. (2013) presented a new detector based on an improved AdaBoost algorithm and the inter-frame difference method. Guzmán et al. (2015) combined HOG and SVM for vehicle detection in the outdoor environment, improving the success rate of classification by adjusting the SVM parameters. Based on the characteristics of HOG, Xu et al. (2016) proposed a two-stage car detection method using DPM with composite feature sets, which can identify many kinds of vehicles at different viewing angles. After a comprehensive comparison of the approaches proposed in the literature, and considering the issues of poor real-time performance, narrow detection range and low accuracy in complex environments, the DPM is a good research direction. Regarding embedded implementations of the DPM algorithm, many kinds of vehicle detection system architectures exist; most are entirely based on FPGA, with all algorithms implemented on the FPGA. These approaches focus on the details of hardware design and optimisation without considering the hardware characteristics and algorithm structure, which leads to poor flexibility and adaptability. This chapter introduces a CPU and FPGA integrated embedded system platform with a vehicle detection algorithm.

6.2 Lossless-Constraint Denoising Autoencoder Based Night-Vision Image Recognition

6.2.1 Denoising and Sparse Autoencoders

An autoencoder is a three-layer neural network with the same number of input and output nodes. Autoencoder training does not need data labels; it uses unsupervised learning, which only needs to make the input and output data as equal as possible. Assume the input data is x and the output of the autoencoder is h_{W,b}(x). The autoencoder network needs to make h_{W,b}(x) ≈ x as closely as possible. The data of the hidden layer in the middle is then the encoding result, which is somewhat similar to PCA: it reduces the dimension of the input data and extracts representative data. From the input layer to the hidden layer is the cognitive or encoding process; from the hidden layer to the output layer is the generative or reconstruction process. For an autoencoder, making h_{W,b}(x) ≈ x amounts to minimising the difference between h_{W,b}(x) and x, so the problem becomes an optimisation problem. The expression being minimised is the reconstruction error function (or loss function):


J_{AE} = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\|h_{W,b}(x^{(i)}) - x^{(i)}\right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2,    (6.1)

l1

where the first term in the definition is an average sum-of-squares error term. The second term is a regularisation term (also called a weight decay term) that tends to decrease the magnitude of the weights and helps prevent overfitting. At present, the usual method of minimising the loss function is the gradient descent method. First, the weight matrix W and the bias b are randomly initialised, and then the parameters, W and b, update at each iteration according to the following: Wij(l)  Wij(l) − α

(l) b(l) i  bi −

∂ ∂ Wij(l) ∂

∂b(l) i

JAE ,

JAE ,

(6.2)

(6.3)

where α is the learning rate. We calculate partial derivatives of JAE to W and b, and then, α as a coefficient, makes W and b fall in the reverse direction of partial derivatives at each iteration. After many iterations, W and b reach an extreme point. The key to this step is computing partial derivatives. Because JAE is a nonlinear function, it is difficult to derive the partial derivatives of W and b with simple derivation. In the late 1980s, the invention of the backpropagation (BP) algorithm of Hecht-Nielsen (1989) gave an efficient method of computing partial derivatives. Vincent et al. (2008) proposed altering the training objective from mere reconstruction to that of denoising an artificially corrupted input (i.e. learning to reconstruct the clean input from a corrupted version). Learning the identity is no longer enough: the learner must capture the structure of the input distribution to optimally undo the effects of corruption processes, with the reconstruction essentially being a nearby but higher density point. DAE adds an artificial corruption (e.g. additive isotropic Gaussian noise, masking noise or salt-and-pepper noise) in the input data and uses the corrupted data to reconstruct the original data. Therefore, the autoencoder has to learn features, which are more robust. The reconstruction error function of DAE is JDAE 

 sl +1 n sl  m  l −1   1  1 hW,b (ˆx(i) ) − x(i) 2 + λ (Wji(l) )2 , m i1 2 2 i1 j1

(6.4)

l1

where xˆ is the artificially corrupted input. SAE adds a sparse constraint item based on the autoencoder. Sparsity in the representation can be achieved by penalising the hidden unit biases, making these additive offset parameters more negative, or by directly penalising the output of the hidden unit activations, making them closer to their saturating value at 0. In this section, we penalise the output of the hidden unit activations with Kullback–Leibler (KL) divergence.

6.2 Lossless-Constraint Denoising Autoencoder Based …



KL(ρρj ) 

  ρ 1−ρ ρ log + (1 − ρ) log , ρj 1 − ρj

179

(6.5)

where ρ is the sparsity parameter, usually a small value close to 0, and ρ_j is the average activation of the j-th hidden unit. Under this constraint, when ρ_j = ρ, the KL divergence reaches its minimum value of 0. The reconstruction error function of SAE is

J_{SAE} = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\|h_{W,b}(x^{(i)}) - x^{(i)}\right\|^2 + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2 + \beta\sum_{j}\mathrm{KL}(\rho\,\|\,\rho_j).    (6.6)

We then express Eq. (6.6) more compactly as

J_{SAE} = \frac{1}{2m}\left\|h_{W,b}(X) - X\right\|^2 + \frac{\lambda}{2}\|W\|^2 + \beta\sum_{j}\mathrm{KL}(\rho\,\|\,\rho_j).    (6.7)
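For reference, Eq. (6.7) can be evaluated directly. The sketch below assumes a single hidden layer with sigmoid units and column-wise samples, which is one common convention rather than the authors' code; the parameter values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(X, W1, b1, W2, b2, lam=1e-4, beta=3.0, rho=0.05):
    # X: (d, m) data with one sample per column
    H = sigmoid(W1 @ X + b1[:, None])                    # hidden activations
    X_hat = sigmoid(W2 @ H + b2[:, None])                # reconstruction h_{W,b}(X)
    m = X.shape[1]
    recon = np.sum((X_hat - X) ** 2) / (2.0 * m)         # (1/2m) ||h(X) - X||^2
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_j = H.mean(axis=1)                               # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_j) + (1 - rho) * np.log((1 - rho) / (1 - rho_j)))
    return recon + decay + beta * kl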

6.2.2 LDAE

In view of the poor generalisation ability of existing autoencoders on noisy datasets, this section introduces a lossless-constraint denoising (LD) method. Applying LD to DAE and SAE, respectively, this section introduces the LDAE and the lossless-constraint denoising sparse autoencoder (LDSAE). These two autoencoders can be collectively called LDAE (Fig. 6.1).

LDAE

Traditional DAE needs artificially corrupted data and the original pure data to learn denoising and robust features. However, in reality, the obtained data always contain some noise (e.g. low-light-level (LLL) data). In this situation, if the autoencoder reconstructs the original noisy data, it inevitably learns the noise as a feature. To solve this problem, it is necessary to generate pure data to constrain the reconstruction output of the autoencoder. Let X_n represent the noisy training dataset and X_c represent the generated denoised dataset. Thus, we modify the reconstruction error function as

J_{NDAE} = \gamma\frac{1}{2m}\|X_n - X_c\|^2 + \frac{1}{2m}\left\|h_{W,b}(\tilde{X}_n) - X_c\right\|^2 + \frac{\lambda}{2}\|W\|^2,    (6.8)

where \tilde{X}_n is the artificially corrupted version of the original noisy input data. In this error function, there is a lossless-constraint item, γ(1/2m)‖X_n − X_c‖². It constrains X_c towards the original data to ensure that X_c does not lose information; this is why it is called the lossless-constraint item. Additionally, (1/2m)‖h_{W,b}(\tilde{X}_n) − X_c‖² constrains the reconstructed data and X_c. Thus, X_c follows two constraints: one keeps it lossless and the other denoises it to improve the SNR of X_c. The new autoencoder is LDAE.
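A sketch of the LDAE objective in Eq. (6.8) follows: the input is corrupted (here with masking noise) before encoding, and the reconstruction is compared against the current estimate X_c rather than the noisy input. The corruption type, the reconstruct callable and all parameter values are assumptions made for exposition.

import numpy as np

def corrupt(X, level=0.3, seed=0):
    # masking noise: randomly set a fraction of the entries to zero
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) >= level
    return X * mask

def ldae_loss(X_n, X_c, reconstruct, gamma=1.0, lam=1e-4, weights=()):
    # X_n: (d, m) noisy data; X_c: current denoised estimate
    # reconstruct: callable returning h_{W,b}(input) for a corrupted input
    m = X_n.shape[1]
    X_hat = reconstruct(corrupt(X_n))
    lossless = gamma * np.sum((X_n - X_c) ** 2) / (2.0 * m)      # lossless-constraint item
    denoise = np.sum((X_hat - X_c) ** 2) / (2.0 * m)             # reconstruction vs. X_c
    decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return lossless + denoise + decay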


Fig. 6.1 Flowchart of LDAE. Reprinted from Zhang et al. (2018), with permission from Elsevier

LDSAE

Similar to LDAE, because sparse encoding has a strong denoising ability for non-sparse noise, we take advantage of the denoising ability of sparsity and add the lossless-constraint item to further improve it. We modify the SAE reconstruction error function as

J_{NDSAE} = \gamma\frac{1}{2m}\|X_n - X_c\|^2 + \frac{1}{2m}\left\|h_{W,b}(X_n) - X_c\right\|^2 + \frac{\lambda}{2}\|W\|^2 + \beta\sum_{j}\mathrm{KL}(\rho\,\|\,\rho_j).    (6.9)

The difference between LDAE and LDSAE is that LDAE uses corrupted data to obtain a stronger denoising capability, whereas LDSAE takes advantage of sparsity.

Optimisation Algorithm of LDAE

Because LDAE adds an unknown variable, X_c, to the reconstruction error function, it is difficult to obtain the partial derivative of the function with respect to X_c directly, and it is also difficult to simply use gradient descent to optimise it. Thus, we use an alternating iteration method to optimise X_c and (W, b). We take the LDSAE optimisation as an example. First, we initialise X_c by X_n; in this case, the reconstruction error function of LDSAE reduces to that of a simple SAE. We randomly initialise W and b and then iteratively optimise the parameters W and b using gradient descent for several iterations (usually 200). Then, we fix the values of W and b, treat them as constants and optimise X_c. The function then becomes


J_{NDSAE} = \gamma \frac{1}{2m} \left\| X_n - X_c \right\|^2 + \frac{1}{2m} \left\| h_{W,b}(X_n) - X_c \right\|^2 + C.    (6.10)

To minimise J_{NDSAE} with respect to X_c, we take the derivative of J_{NDSAE} with respect to X_c and set it equal to 0:

\frac{\mathrm{d}}{\mathrm{d}X_c} \left[ \gamma \frac{1}{2m} \left\| X_n - X_c \right\|^2 + \frac{1}{2m} \left\| h_{W,b}(X_n) - X_c \right\|^2 \right] = 0.    (6.11)

Solving Eq. (6.11) gives

X_c = \frac{\gamma X_n + h_{W,b}(X_n)}{1 + \gamma}.    (6.12)

We update X_c, then treat it as a constant to update W and b. We repeat the above procedure, updating X_c, W and b sequentially, until convergence or until the maximum number of iterations (usually 400) is reached. The optimisation idea of LDAE is almost the same; the difference is that LDAE has no sparse constraint term but still corrupts the input data.

Task: learn denoising filters W and the denoising dataset X_c from the noisy dataset X_n.

Algorithm parameters: m: number of samples in the dataset; λ: weight decay parameter; β: weight of the sparsity penalty term; ρ: sparsity parameter; W: weights of the features; b: bias terms; X_c: denoising data.

\arg\min_{X_c, W, b} \left\{ \gamma \frac{1}{2m} \| X_n - X_c \|^2 + \frac{1}{2m} \| h_{W,b}(X_n) - X_c \|^2 + \frac{\lambda}{2} \|W\|^2 + \beta \sum_j \mathrm{KL}(\rho \,\|\, \rho_j) \right\}

• Initialisation: set X_c = X_n, so that the problem becomes a sparse autoencoder:
  \arg\min_{W, b} \left\{ \gamma \frac{1}{2m} \| X_n - X_c \|^2 + \frac{1}{2m} \| h_{W,b}(X_n) - X_c \|^2 + \frac{\lambda}{2} \|W\|^2 + \beta \sum_j \mathrm{KL}(\rho \,\|\, \rho_j) \right\}
  We use the backpropagation algorithm and gradient descent to compute W and b.
• Repeat J times:
  – Update X_c: consider W and b constant and set X_c = \frac{\gamma X_n + h_{W,b}(X_n)}{1 + \gamma}.
  – Update W and b: with the new X_c, use the gradient descent algorithm to compute the new W and b.
• Compute the encoding result: W_1 X_n + b_1.
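The following is a minimal NumPy sketch of the alternating optimisation described above, assuming a single sigmoid hidden layer; the learning rate, iteration counts and the omission of the sparsity penalty term are simplifications made for illustration, not the settings used in the experiments below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lds_autoencoder(Xn, n_hidden=200, gamma=1.0, lam=1e-4, lr=0.1,
                    inner_iters=200, outer_iters=2, seed=0):
    """Sketch of the LDSAE alternating optimisation (Eqs. 6.9 and 6.12).

    Xn : (m, d) noisy training data. Returns (W1, b1, Xc).
    The sparsity penalty of Eq. (6.9) is omitted to keep the sketch short;
    only the reconstruction, weight decay and lossless-constraint terms appear.
    """
    rng = np.random.default_rng(seed)
    m, d = Xn.shape
    W1 = rng.normal(0, 0.01, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.01, (n_hidden, d)); b2 = np.zeros(d)
    Xc = Xn.copy()                           # initialisation: Xc = Xn

    for _ in range(outer_iters):
        # (1) fix Xc; update W, b by gradient descent on 1/(2m)||h(Xn) - Xc||^2 + decay
        for _ in range(inner_iters):
            H = sigmoid(Xn @ W1 + b1)
            R = sigmoid(H @ W2 + b2)         # h_{W,b}(Xn)
            dR = (R - Xc) / m * R * (1 - R)  # gradient at the output pre-activation
            dH = dR @ W2.T * H * (1 - H)     # gradient at the hidden pre-activation
            W2 -= lr * (H.T @ dR + lam * W2); b2 -= lr * dR.sum(axis=0)
            W1 -= lr * (Xn.T @ dH + lam * W1); b1 -= lr * dH.sum(axis=0)
        # (2) fix W, b; update Xc in closed form (Eq. 6.12)
        H = sigmoid(Xn @ W1 + b1)
        Xc = (gamma * Xn + sigmoid(H @ W2 + b2)) / (1.0 + gamma)
    return W1, b1, Xc
```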


Fig. 6.2 Example images of MNIST dataset

6.2.3 Experimental Comparison

This section compares the classification performance of LDAE with SAE and DAE on different datasets and verifies that the lossless-constraint item improves the denoising ability of the autoencoder. This chapter uses the MNIST (Fig. 6.2) and Yale-B (Fig. 6.3) datasets with different noise coefficients. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalised and centred in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The full name of the Yale-B dataset is the extended Yale Face Database B; it contains 16,128 images of 28 human subjects under nine poses and 64 illumination conditions. In this section, 640 images of 10 human subjects are used as test image data, which are manually aligned, cropped and resized to 40 × 40.

Experiments in MNIST Dataset

The image size in the MNIST dataset is 28 × 28 = 784. This section first sets the number of hidden-layer nodes to 200, which means the autoencoder learns 200 filters via the encoding process from 784 dimensions to 200 dimensions. It then sends the encoding results of the hidden layer to a softmax classifier, comparing the classification performance of LDAE with SAE and DAE. Gaussian noise with variance of 0.1, 0.2, 0.3 and 0.4 is added to the MNIST dataset. Because of the strong sparsity of the MNIST dataset, SAE provides better generalisation and denoising ability than DAE. All algorithms use the same values of λ, β and ρ.


Fig. 6.3 Example images of one person in the Yale-B dataset. Reprinted from Zhang et al. (2018), with permission from Elsevier

Table 6.1 Accuracy of SAE, DAE and LDSAE in MNIST dataset

| Noise | Method | Train Acc (%) | Test Acc1 (%) | Test RMSE1 | Test Acc2 (%) | Test RMSE2 |
|-------|--------|---------------|---------------|------------|---------------|------------|
| 0.1   | SAE    | 93.688 | 94.280 | 0.2379 | 94.040 | 0.2734 |
|       | DAE    | 91.308 | 92.910 | 0.2617 | 91.830 | 0.2902 |
|       | LDSAE  | 93.903 | 94.540 | 0.2290 | 94.460 | 0.2652 |
| 0.2   | SAE    | 90.453 | 91.890 | 0.2724 | 90.820 | 0.3351 |
|       | DAE    | 88.275 | 91.580 | 0.2812 | 88.620 | 0.3394 |
|       | LDSAE  | 91.807 | 93.170 | 0.2501 | 92.270 | 0.3081 |
| 0.3   | SAE    | 86.703 | 89.030 | 0.3075 | 87.080 | 0.3908 |
|       | DAE    | 85.827 | 89.860 | 0.2893 | 86.560 | 0.3720 |
|       | LDSAE  | 89.735 | 90.740 | 0.2817 | 90.150 | 0.3433 |
| 0.4   | SAE    | 82.225 | 86.610 | 0.3399 | 82.780 | 0.4505 |
|       | DAE    | 80.547 | 84.940 | 0.3617 | 81.410 | 0.4358 |
|       | LDSAE  | 85.838 | 87.160 | 0.3296 | 86.130 | 0.3997 |

Reprinted from Zhang et al. (2018), with permission from Elsevier

Table 6.1 shows the training dataset accuracy (Train Acc), non-noise test dataset accuracy (Test Acc1), non-noise test dataset RMSE (Test RMSE1), noise test dataset accuracy (Test Acc2) and noise test dataset RMSE (Test RMSE2). RMSE is a frequently used measure of the differences between values predicted by models or estimators and the values actually observed; here it represents the sample standard deviation of the differences between predicted labels and observed labels. As shown in Table 6.1, compared with SAE and DAE, LDSAE achieves better classification accuracy and lower RMSE values under different noise coefficients, and the enhancement is more obvious at higher noise coefficients. Therefore, LDSAE has a better ability to resist the influence of noise and can effectively improve the robustness of SAE on noisy data. Figure 6.4 shows 200 filters learnt by SAE, DAE and LDSAE. From the comparison of these filters, it is easy to see that SAE and LDSAE learnt many structural features, whereas DAE did not. Features learnt by SAE are accompanied by some noise, but features learnt by LDSAE have a higher SNR.


Fig. 6.4 200 filters learnt by SAE, DAE and LDSAE. Reprinted from Zhang et al. (2018), with permission from Elsevier

Fig. 6.5 Part of data reconstructed by SAE, DAE and LDSAE. Reprinted from Zhang et al. (2018), with permission from Elsevier

Fig. 6.6 Part of filters learnt by SAE, DAE, LDSAE and LDAE. Reprinted from Zhang et al. (2018), with permission from Elsevier

Figure 6.5 shows a comparison of the data reconstructed by SAE, DAE and LDSAE. The reconstruction data of DAE and SAE includes a lot of noise, but the reconstruction data of LDSAE almost recovers the original clean data. This phenomenon also shows that the LD item works, helping the autoencoders denoise the noisy input dataset and reconstruct data with a higher SNR.


This section then applies the single-layer autoencoder to a stacked multilayer (deep) network, which has a better feature-learning ability and learns more generative features. To train the stacked autoencoders, each layer is first pre-trained by autoencoding, and the data labels are then used to fine-tune the whole network in a supervised manner. The fine-tuning process guides the network to learn features that are helpful for classification, thus improving classification performance. In the experiment on the MNIST dataset, we designed a stacked network structure of 784–200–200–10: the input layer has 784 dimensions, the first and second hidden layers each have 200 dimensions, and the output layer gives 10 classes through the softmax classifier. The first layer is changed from the original SAE to DAE and LDSAE to compare the performance of the three autoencoders.

Table 6.2 shows the non-noise test dataset accuracy before fine-tuning (Acc1a) and its RMSE (RMSE1a), the non-noise test dataset accuracy after fine-tuning (Acc1b) and its RMSE (RMSE1b), the noise test dataset accuracy before fine-tuning (Acc2a) and its RMSE (RMSE2a), and the noise test dataset accuracy after fine-tuning (Acc2b) and its RMSE (RMSE2b). As shown in Table 6.2, either before or after the fine-tuning process, when the noise coefficient becomes larger, the classification accuracy and RMSE of LDSAE are better than those of SAE and DAE. This validates the purpose of optimising the autoencoder and improves the noise immunity of the multilayer network.

Experiments in Yale-B Dataset

For the Yale-B dataset, this section designs the autoencoder structure as 1600–400. This means that the autoencoder learns 400 filters from the 1600-dimensional input data and outputs a 400-dimensional result. Because DAE displays better generalisation and denoising ability under the complex noise of the Yale-B dataset, LDAE is added to the comparison of SAE, DAE and LDSAE. As shown in Table 6.3, LDSAE and LDAE show a certain degree of improvement compared to SAE and DAE, and LDAE has the best performance at the different noise coefficients. In Fig. 6.6, the difference in performance is more intuitive: SAE and DAE can hardly learn useful features, but LDSAE and LDAE can learn many facial features.

PSNR is most commonly used to measure the quality of reconstruction of lossy compression codecs; a higher PSNR generally indicates that the reconstruction is of higher quality. As shown in Fig. 6.7, SAE has some denoising ability, and the PSNR of its reconstruction data improves a little because of the sparse constraint. Moreover, LDSAE takes advantage of this ability of SAE and enhances it with the LD method, finally obtaining reconstruction data with a higher PSNR.

In the experiment on the Yale-B dataset, we design the stacked network structure as 1600–400–400–10. Here, the input layer has 1600 dimensions, the first and second hidden layers each have 400 dimensions, and the output gives 10 classes through the softmax classifier. Table 6.4 shows the test dataset accuracy before fine-tuning (Test Acc1) and its RMSE (RMSE1), and the test dataset accuracy after fine-tuning (Test Acc2) and its RMSE (RMSE2).

Table 6.2 Accuracy of stacked SAE, DAE and LDSAE in MNIST dataset

| Noise | Method | Acc1a (%) | Acc1b (%) | Acc2a (%) | Acc2b (%) | RMSE1a | RMSE1b | RMSE2a | RMSE2b |
|-------|--------|-----------|-----------|-----------|-----------|--------|--------|--------|--------|
| 0.1   | SAE    | 92.320 | 97.790 | 92.880 | 96.660 | 0.2529 | 0.1344 | 0.3097 | 0.1677 |
|       | DAE    | 92.630 | 97.660 | 93.050 | 96.500 | 0.2562 | 0.1391 | 0.2859 | 0.1719 |
|       | LDSAE  | 93.280 | 97.860 | 92.950 | 96.680 | 0.2535 | 0.1333 | 0.3069 | 0.1668 |
| 0.2   | SAE    | 90.950 | 97.430 | 89.860 | 94.270 | 0.2831 | 0.1436 | 0.3708 | 0.2103 |
|       | DAE    | 92.040 | 97.060 | 90.450 | 94.140 | 0.2805 | 0.1578 | 0.3385 | 0.2250 |
|       | LDSAE  | 92.870 | 97.530 | 91.100 | 94.720 | 0.2685 | 0.1434 | 0.3502 | 0.2108 |
| 0.3   | SAE    | 87.200 | 96.940 | 85.620 | 92.960 | 0.3301 | 0.1599 | 0.4443 | 0.2436 |
|       | DAE    | 88.180 | 96.330 | 87.320 | 92.350 | 0.3050 | 0.1743 | 0.3879 | 0.2555 |
|       | LDSAE  | 89.600 | 97.020 | 88.590 | 93.260 | 0.2983 | 0.1587 | 0.3875 | 0.2407 |

Reprinted from Zhang et al. (2018), with permission from Elsevier


Table 6.3 Accuracy of SAE, DAE, LDSAE and LDAE in the Yale-B dataset

| Noise | Method | Test Acc (%) | RMSE   |
|-------|--------|--------------|--------|
| 0.08  | SAE    | 65.625       | 0.6901 |
|       | DAE    | 69.375       | 0.5783 |
|       | LDSAE  | 69.063       | 0.6975 |
|       | LDAE   | 71.188       | 0.5675 |
| 0.1   | SAE    | 55.937       | 0.7640 |
|       | DAE    | 59.375       | 0.7354 |
|       | LDSAE  | 65.625       | 0.6084 |
|       | LDAE   | 67.188       | 0.6055 |

Reprinted from Zhang et al. (2018), with permission from Elsevier

Fig. 6.7 PSNR of data reconstructed by SAE and LDSAE. Reprinted from Zhang et al. (2018), with permission from Elsevier

The classification performance before fine-tuning of the two-layer stacked autoencoders is even worse than that of a single autoencoder. This is because the reconstruction error propagates layer by layer; thus, two-layer autoencoders accumulate a larger reconstruction error and a worse classification accuracy. With stacked SAE, the Yale-B dataset has great illumination variety and contains many non-sparse signals, so the sparsity constraint makes the reconstruction error even bigger. However, after supervised fine-tuning, the classification accuracies of the stacked autoencoders increase substantially. By comparing the classification accuracies of the four autoencoders, it can be seen that LDSAE and LDAE improve upon SAE and DAE.


Table 6.4 Accuracy of stacked SAE, DAE, LDSAE and LDAE in the Yale-B dataset

| Noise | Method | Test Acc1 (%) | Test Acc2 (%) | RMSE1  | RMSE2  |
|-------|--------|---------------|---------------|--------|--------|
| 0.08  | SAE    | 48.125        | 72.500        | 0.7236 | 0.5297 |
|       | DAE    | 66.563        | 72.500        | 0.5703 | 0.5299 |
|       | LDSAE  | 50.625        | 71.875        | 0.7245 | 0.5336 |
|       | LDAE   | 69.375        | 73.125        | 0.5958 | 0.5339 |
| 0.1   | SAE    | 35.000        | 68.750        | 0.7814 | 0.5615 |
|       | DAE    | 54.063        | 68.000        | 0.6450 | 0.5614 |
|       | LDSAE  | 38.438        | 69.688        | 0.7790 | 0.5565 |
|       | LDAE   | 58.125        | 69.688        | 0.6262 | 0.5601 |

6.3 Integrative Embedded Night-Vision Target Detection System with DPM

This section introduces an embedded system based on CPU and FPGA integration for car detection with an improved DPM. Original images are processed and layered into multiresolution HOG feature pyramids on the CPU and are then transmitted to the FPGA for fast convolution operations; the results finally return to the CPU for statistical matching and display. Based on the architecture of the DPM algorithm, combined with the hardware characteristics of the embedded system, the overall algorithm framework is simplified and optimised. According to mathematical derivation and statistical rules, the feature dimensions and pyramid levels of the model are reduced without sacrificing accuracy, which effectively reduces the amount of calculation and data transmission. The advantages of parallel processing and pipeline design of the FPGA are fully exploited to accelerate the convolution computation, which significantly reduces the running time of the programme. Several experiments have been conducted on visible images from the UAV view and the driver assistance scene, and on infrared images captured from an overlooking perspective. Results show that the system has good real-time and accuracy performance in different situations.

6.3.1 Algorithm and Implementation of Detection System

DPM Algorithm Overview

DPM is a very successful target detection algorithm and has become an important component of many classification, segmentation and human posture and behaviour recognition tasks. DPM can be regarded as an expansion of the HOG algorithm. Figure 6.8 shows the overall algorithm flow of DPM. First, the HOG features are calculated, and then the gradient models are obtained via SVM. Trained models can be used to directly detect the target. However, single-model matching falls far short of multi-view vehicle detection, so multiple-model matching is necessary. Considering the spatial correspondence among multiple models, the position offset between each part and the root is added as a cost. Thus, we subtract the offset cost and obtain the composite score; this is essentially spatial prior knowledge linking the part models and the root model. The result of target detection can then be obtained from the statistics of the responses of each model to the target.
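As a rough illustration of the composite scoring just described, the sketch below adds, for every root location, the best part response within a search window minus a quadratic offset cost. It is a brute-force sketch under simplifying assumptions (a single resolution, quadratic deformation costs, hypothetical response maps); practical DPM implementations compute the part terms with generalised distance transforms and evaluate parts at twice the root resolution.

```python
import numpy as np

def composite_score(root_resp, part_resps, anchors, def_costs, radius=4):
    """Simplified DPM composite score: root response plus, for every part,
    the best part response near its anchor minus a quadratic offset cost.

    root_resp  : (H, W) root filter response map.
    part_resps : list of (H, W) part filter response maps.
    anchors    : list of (dy, dx) anchor offsets of each part w.r.t. the root.
    def_costs  : list of scalars weighting the quadratic deformation penalty.
    """
    H, W = root_resp.shape
    score = root_resp.copy()
    for resp, (ay, ax), d in zip(part_resps, anchors, def_costs):
        best = np.full((H, W), -np.inf)
        for y in range(H):
            for x in range(W):
                py0, px0 = y + ay, x + ax              # ideal part position
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        py, px = py0 + dy, px0 + dx
                        if 0 <= py < H and 0 <= px < W:
                            val = resp[py, px] - d * (dy * dy + dx * dx)
                            if val > best[y, x]:
                                best[y, x] = val
        score += np.where(np.isinf(best), 0.0, best)   # ignore unreachable parts
    return score
```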


Fig. 6.8 Overall algorithm flow of DPM. Reprinted from Zhang et al. (2017), with permission of Springer

Analysis of the DPM flowchart shows that the convolution response between each model and the image requires a large amount of computation, although it is basically a simple iterative calculation, whereas the extraction of image features and the calculation of the feature pyramid need little computing resource. This section therefore introduces an embedded platform composed of a CPU and an FPGA, utilising the structural characteristics of DPM to implement vehicle detection in different scenarios.

System Architecture

To implement the DPM algorithm in advanced driver assistance systems and unmanned aerial systems, the power consumption and volume of the hardware platform must meet the requirements. Low-power FPGA chips have strong parallel processing ability, which is suitable for the calculation of convolutions, and the running time of the algorithm can be reduced by utilising an FPGA pipeline architecture. Tegra X1 combines the NVIDIA Maxwell GPU architecture with an ARM CPU cluster to deliver the performance and power efficiency required by computer graphics and artificial intelligence. Designed for power- and space-constrained applications, Tegra X1 offers many kinds of peripherals and interfaces. An embedded image processing system based on the NVIDIA Tegra X1 and the XILINX Spartan6 XC6SLX100T is therefore very competitive. An overview of the whole system is shown in Figs. 6.9 and 6.10. Tegra X1 is responsible for acquiring the video stream through a USB port, including image preprocessing and scaling; it generates the feature pyramid of the images, reducing feature dimensions and pyramid levels.


Fig. 6.9 The whole architecture of the image processing system. Reprinted from Zhang et al. (2017), with permission of Springer

Fig. 6.10 The hardware detection system. Reprinted from Zhang et al. (2017), with permission of Springer

Selected feature images are transmitted to the FPGA via Ethernet for convolution calculation. The sorted images then return to Tegra X1 for statistical matching, and the results are streamed to HDMI. Owing to the complexity of the DPM algorithm and the system requirements, detailed designs must be considered for the specific system implementation. The difficulties in realising this system and the specific solutions are described below.

Feature Pyramid and Optimisation

After the original construction of the feature pyramid, the images acquired at different scales can exceed dozens of layers. If the HOG features of all these layers were computed directly and carried through the subsequent convolution


operations, the amount of computation required would be too large, resulting in severe time consumption, and real-time performance would be seriously affected. To overcome this problem, the number of pyramid levels should be reduced as far as possible without affecting detection accuracy. Dollár et al. (2014) presented a fast feature pyramid construction method based on statistical characteristics. The key insight is that finely sampled feature pyramids can be computed at a fraction of the cost, without sacrificing performance: for a broad family of features, features computed at octave-spaced scale intervals are sufficient to approximate features on a finely sampled pyramid. This method greatly reduces the number of pyramid levels that need to be computed exactly. The core idea can be described as

C_s = \Omega(I_s),    (6.13)

where I represents the original image and C represents the feature image. Using I_s to denote I captured at scale s and R(C, s) to denote C resampled by s: if C = \Omega(I) has been computed, the feature image C_s = \Omega(I_s) at a new scale s can be predicted using only C. Instead of computing C_s = \Omega(R(I, s)), we use the approximation

C_s \approx R(C, s) \cdot s^{-\lambda}.    (6.14)
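To make Eq. (6.14) concrete, the following sketch approximates a feature pyramid from a single exactly computed feature map, assuming OpenCV for resampling; the gradient-magnitude feature stands in for the HOG channels, and the λ value is an illustrative assumption rather than the parameter used in this system.

```python
import cv2

def gradient_magnitude(img):
    """Stand-in feature channel Omega(I); the real system uses HOG features."""
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
    return cv2.magnitude(gx, gy)

def approx_feature_pyramid(img, scales, lam=0.11):
    """Fast feature pyramid of Eq. (6.14): C_s ~ R(C, s) * s^(-lam).

    Features are computed exactly only at the native scale; other levels are
    approximated by resampling that feature map and applying the power law.
    """
    C = gradient_magnitude(img)
    pyramid = {}
    for s in scales:
        resized = cv2.resize(C, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        pyramid[s] = resized * (s ** (-lam))
    return pyramid

# Example: approximate levels at 1.0, 0.84, 0.71 and 0.5 of the native resolution
# img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# pyr = approx_feature_pyramid(img, scales=[1.0, 0.84, 0.71, 0.5])
```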

Figure 6.11 shows a visual demonstration of Eq. (6.14). After the original image is scaled to 1/2 resolution, the HOG features are calculated; by interpolation and scaling of these two feature maps, the other layers of the pyramid can be obtained.

Convolution of Feature Matrix

The convolution calculation is implemented on the FPGA. The main methods for implementing the FPGA design efficiently focus on the buffer scheme and the improvement of the convolution kernel module. The specific implementation process is as follows.

Convolution Accelerator

According to the needs of the algorithm, two sizes of convolution kernel (6 × 6 and 15 × 5) are built; the number of 6 × 6 kernels is 16 and the number of 15 × 5 kernels is 2. The parallel computation of the convolution is realised by shift register multiplexing and module cascading. The design of the convolution architecture is shown in Fig. 6.12. The whole convolution accelerator works as follows. The incoming pixel sequence is buffered in shift registers, and the convolution calculation starts when the first pixel arrives at the first kernel. The outputs of the convolution kernel enter the next shift register and the control logic module simultaneously, while the pixel sequence continues to be transmitted to the shift register and calculated in the convolution kernel; it is also transmitted to the control logic module for sorting and restructuring. This circulation continues until the first pixel completes its calculation in the last convolution kernel. Then, the pixel sequences of the second and third rows also complete their computation in the corresponding kernels. These results enter the control logic module, are sorted by specific rules and are output to the next module.
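As a behavioural reference for the accelerator (not the FPGA implementation itself), the following sketch applies a bank of 6 × 6 kernels to a feature map in plain Python; the kernel values and the valid-correlation choice are assumptions made for illustration.

```python
import numpy as np

def convolve_bank(feature_map, kernels):
    """Software model of the parallel kernel bank: valid 2-D correlation of
    one feature map with each kernel (e.g. sixteen 6 x 6 kernels)."""
    H, W = feature_map.shape
    responses = []
    for k in kernels:
        kh, kw = k.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x] = np.sum(feature_map[y:y + kh, x:x + kw] * k)
        responses.append(out)
    return responses

# Example: sixteen random 6 x 6 kernels applied to a 40 x 40 feature map
# resp = convolve_bank(np.random.rand(40, 40), [np.random.rand(6, 6) for _ in range(16)])
```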


Fig. 6.11 Construction of the fast feature pyramid: a original construction of the pyramid; b improved construction of the pyramid. Reprinted from Zhang et al. (2017), with permission of Springer

Fig. 6.12 The framework of convolution accelerator. Reprinted from Zhang et al. (2017), with permission of Springer


Fig. 6.13 The ping-pong operation. Reprinted from Zhang et al. (2017), with permission of Springer

Buffer Scheme

FIFO and DDR3 memory are used to cache the data to ensure that timing requirements are met. The main function of the FIFO is to maintain the continuity and integrity of the data during transmission. DDR3 is mainly used to realise the ping-pong operation shown in Fig. 6.13, which aims to hide the data transfer time behind the computation time. Figure 6.13a denotes the initial state: incoming data is stored in area A of the DDR3 until it is full. Then, the address control module sends a signal that changes the storage path to area B, and meanwhile the data stored in A is sent to the convolution kernel for computation; this is working state 1. Once area B is full, the address control module holds the incoming data while waiting for the convolution kernel to complete its calculation; the system is then in working state 2. When the convolution kernel finishes all calculations, the address control module switches the read and write addresses, giving working state 3. In this way, data is continuously transmitted and processed from the data input to the output.

Logical Architecture and Data Reuse

Prior to discussing data reuse, we should consider the computation-to-communication ratio, i.e. the ratio of computation operations to the DRAM traffic needed by a kernel in a specific system implementation; it describes the number of computation operations per memory access. Data reuse optimisation can reduce the total number of memory accesses and improve computation. Under the premise of limited bandwidth and limited on-chip resources, corresponding data multiplexing architectures are constructed for different sizes of convolution kernels. To save shift register resources, a shift register multiplexing module is built, effectively reducing the number of shift registers.

Communication

The communication between the FPGA and Tegra X1 is based on Ethernet. Owing to the large amount of data to be transmitted and the high stability requirements, there would be serious packet losses and poor data transmission stability if the UDP/IP protocol alone were used, whereas the TCP/IP protocol can solve the problem of packet loss but its data transmission speed cannot meet the needs of the algorithm. A protocol that preserves both the stability of TCP and the speed of UDP is therefore needed. We implemented pseudo-TCP encoding of the transmitted data over the UDP protocol, which provides a speed-stability trade-off.


6.3.2 Experiments and Evaluation

To evaluate the adaptability and accuracy of this image processing system in various situations, many experiments and tests were conducted. UAV aerial images were tested at different heights to assess the effectiveness of the system for the detection of targets at different scales. From the driver's view, the KITTI dataset was used for testing, which basically meets the requirements of the auxiliary driving scene. The system was also tested with infrared images.

Assessment Architecture

Precision and recall are two quality indicators used to evaluate the performance of vehicle detection, defined, respectively, as

\mathrm{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}},    (6.15)

\mathrm{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}.    (6.16)

Precision indicates the percentage of detected cars out of the total number of detected objects, which measures the accuracy of the algorithm; recall represents the ability of the algorithm to detect all the targets in the image. Together, these indicators constitute a complete assessment architecture.

Results

A total of 408 frames of UAV aerial images with 640 × 480 resolution were collected at 60-m and 100-m heights. Different sizes of positive and negative samples were selected from these images and trained in a semi-supervised way with Matlab. After testing the trained models many times, the number of aspect ratio clusters (NARC) was set to 2, so that the trained models could achieve an optimal balance between speed and accuracy.

Table 6.5 compares the local system with other systems. The introduced vehicle detection system is better than Rosenbaum et al. (2008) and Maria et al. (2016) in both precision and recall. From Fig. 6.14, we find that the background is very complex and numerous cars are shaded; in such a complicated environment, this system still achieves good precision and recall, which reflects the advantages of the described method. The algorithm in Qu et al. (2016) is slightly better than this system, because that algorithm was implemented on an Intel i5 processor at 3.4 GHz, whereas the method introduced in this section runs on an embedded system; there is a marked difference in hardware. Moreover, the images tested in Qu et al. (2016) contained few vehicles with very clear backgrounds, whereas the images detected here contain many vehicles, some of them partially occluded, which increases the difficulty of detection to a certain extent. As for the problem of limited hardware, the Tegra X1 GPU can be fully used to accelerate the construction of the pyramid to improve detection performance.
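Returning to the assessment metrics of Eqs. (6.15) and (6.16), the following is a minimal sketch of how precision and recall can be computed for one image by matching detected boxes to ground-truth boxes; the IoU threshold of 0.5 and the (x1, y1, x2, y2) box format are assumptions for illustration, not the evaluation protocol used in this chapter.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def precision_recall(detections, ground_truth, iou_thr=0.5):
    """Precision and recall (Eqs. 6.15 and 6.16) for one image."""
    matched = set()
    tp = 0
    for det in detections:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(ground_truth):
            o = iou(det, gt)
            if j not in matched and o > best_iou:
                best_j, best_iou = j, o
        if best_iou >= iou_thr:
            tp += 1
            matched.add(best_j)
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / float(tp + fp) if detections else 0.0
    recall = tp / float(tp + fn) if ground_truth else 0.0
    return precision, recall
```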


Table 6.5 Comparison of UAV images test

|               | In this chapter | Rosenbaum et al. (2008) | Maria et al. (2016) | Qu et al. (2016) |
|---------------|-----------------|-------------------------|---------------------|------------------|
| Precision (%) | 84.3            | 75.0                    | 83.7                | 91.3             |
| Recall (%)    | 81.7            | 65.0                    | 51.3                | 90.0             |

Reprinted from Zhang et al. (2017), with permission of Springer

Fig. 6.14 Results of UAV test: a, b flight altitude at 60 m; c, d flight altitude at 100 m. Reprinted from Zhang et al. (2017), with permission of Springer

For vehicle detection in complex environments, we strengthen the depth of the training sets; specifically, we increase the NARC parameter or increase the number of training samples. For the KITTI test images, totalling 379 frames with a resolution of 1392 × 512, Matlab was used to extract many positive and negative samples for training, and the NARC parameter was set to 4. The overall difficulty of detection from the driver's view is higher than from the UAV aerial view: vehicle targets have various sizes in this scenario and the differences among vehicles are more obvious, so more samples were needed for training, which greatly increased the complexity of the training sets. Furthermore, environmental conditions, such as light differences, shadows and perplexing objects, also decreased the accuracy of detection.


Fig. 6.15 Results of KITTI test. Reprinted from Zhang et al. (2017), with permission of Springer

Fig. 6.16 Detection with infrared images. Reprinted from Zhang et al. (2017), with permission of Springer

Figure 6.15b shows the precision–recall curve obtained after many experimental tests. As can be seen from the graph, our system is slightly higher than Yebes et al. (2014) in precision when recall is in the range 0–0.4 and is almost the same as Yebes et al. (2014) when recall is above 0.4; in view of the overall situation, the performance of this system is slightly better than Yebes et al. (2014). The precision of this system is slightly inferior to Felzenszwalb et al. (2010) when recall is low, but it is significantly higher than Liu et al. (2013) at higher recall rates. Overall, our system has strong adaptability in a variety of situations. To evaluate detection performance with infrared images, the models were retrained to fit infrared objects. Figure 6.16 shows the results of vehicle detection; the described algorithm still achieved good recognition in this situation.


Table 6.6 FPGA resource occupancy report

| XC6SLX100T | Used   | Available | Percentage (%) |
|------------|--------|-----------|----------------|
| Flip-flops | 86,262 | 126,576   | 68.2           |
| LUTs       | 57,352 | 63,288    | 90.6           |
| Block RAM  | 148    | 268       | 55.2           |
| DSP slices | 164    | 180       | 91.1           |

Reprinted from Zhang et al. (2017), with permission of Springer

Table 6.7 Comparison of performance and consumption

| Hardware/Performance                | Processing time (ms) | Resolution (pixels) |
|-------------------------------------|----------------------|---------------------|
| Intel i5 2.4 GHz with original DPM  | 650                  | 640*480             |
| Intel i5 2.4 GHz with optimised DPM | 185                  | 640*480             |
| In this chapter (UAV)               | 65                   | 640*480             |
| In this chapter (KITTI)             | 110                  | 1392*512            |

Reprinted from Zhang et al. (2017), with permission of Springer

The convolution acceleration module of the system was implemented on the FPGA using ISE Design Suite 14.7, and the layout and routing were provided by the ISE toolkit. ISE also generates resource occupancy reports, as shown in Table 6.6. From the table, we see that the FPGA hardware resources are used quite fully by the design of the convolution accelerator. To evaluate the overall performance of the system, extensive tests were conducted with different hardware systems, choosing some important indicators for the assessment. With a bandwidth of 20 Mb/s, a theoretical processing time can be calculated from the total FPGA clock cycles and the total time spent on Tegra X1. As shown in Table 6.7, compared to the other two kinds of hardware, this system had a significant advantage in power consumption, and the loss of frame rate and resolution was acceptable. The overall performance is good enough to meet the needs of most application scenarios.

6.4 Summary

This chapter expounds the main theories and strategies of machine learning in the field of computer vision and introduces related algorithms and applications of image recognition and object detection.

• The key insight of machine learning is the separation of complex elements from images via successful multilayer nonlinear mapping. The reason depth models can reduce parameters lies in the reuse of computing units in the middle layers.


• The introduced lossless-constraint denoising (LD) method improves the generalisation ability of existing autoencoders when training on a noisy dataset. It helps learn denoising data and then guides the autoencoder to learn denoising features. The experiments show that, for the SAE-based LDSAE and the DAE-based LDAE, classification accuracies on noisy datasets are improved. This chapter also applied the LD method to stacked autoencoders and improved their noise immunity. In the future, LD can be applied to convolutional autoencoders (Masci et al. 2011) and convolutional neural networks (Simonyan and Zisserman 2014) to provide stronger noise immunity for deep-learning networks.

• We improved and decomposed the deformable part model algorithm into several modules with detailed analysis, aiming to implement an embedded system with a CPU and an FPGA. According to mathematical derivation and statistical rules, the feature dimensions and the pyramid levels of the model were reduced without sacrificing accuracy, which effectively reduced the amount of calculation and data transmission. The advantages of parallel processing and pipeline design of the FPGA were fully used to accelerate the convolution computation, which significantly reduced the running time of the programme. Several experiments showed that this system has good real-time and accuracy performance in different situations.

References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.
Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
Bengio, Y., & Delalleau, O. (2011). On the expressive power of deep architectures. In International Conference on Discovery Science (p. 1). Springer-Verlag.
Bengio, Y., Lamblin, P., Popovici, D., et al. (2007). Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19, 153–160.
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545.
Erhan, D., Bengio, Y., Courville, A., et al. (2010). Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(3), 625–660.
Felzenszwalb, P., Girshick, R., & McAllester, D. (2010). Cascade object detection with deformable part models. In Computer Vision and Pattern Recognition (CVPR) (pp. 2241–2248). San Francisco, CA, USA: IEEE.
Guzmán, S., Gómez, A., Diez, G., & Fernández, D. S. (2015). Car detection methodology in outdoor environment based on histogram of oriented gradient (HOG) and support vector machine (SVM). In Networked and Electronic Media (LACNEM) (pp. 1–4). Medellin, Colombia: IET.
Hecht-Nielsen, R. (1989). Theory of backpropagation neural networks. In International Joint Conference on Neural Networks (Vol. 1, pp. 593–605). IEEE Xplore.
Liu, Y., Wang, H. H., Xiang, Y. L., & Lu, P. L. (2013). An approach of real-time vehicle detection based on improved Adaboost algorithm and frame differencing rule. Journal of Huazhong University of Science & Technology, 41(S1), 379–382.
Maria, G., Baccaglini, E., Brevi, D., Gavelli, M., & Scopigno, R. (2016). A drone-based image processing system for car detection in a smart transport infrastructure. In Electrotechnical Conference (MELECON) (pp. 1–5). Lemesos, Cyprus: IEEE.


Masci, J., Meier, U., Cireşan, D., et al. (2011). Stacked convolutional autoencoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks (pp. 52–59). Springer-Verlag.
Qu, Y. Y., Jiang, L., & Guo, X. P. (2016). Moving vehicle detection with convolutional networks in UAV videos. In Control, Automation and Robotics (ICCAR) (pp. 225–229). Hong Kong, China: IEEE.
Rosenbaum, D., Charmette, B., Kurz, F., Suri, S., Thomas, U., & Reinartz, P. (2008). Automatic traffic monitoring from an airborne wide angle camera system. ISPRS Archives, 37(B3b), 557–562.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. In Neurocomputing: Foundations of research (pp. 533–536). MIT Press.
Schölkopf, B., Platt, J., & Hofmann, T. (2006). Greedy layer-wise training of deep networks. In International Conference on Neural Information Processing Systems (pp. 153–160). MIT Press.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Computer Science.
Vincent, P., Larochelle, H., Bengio, Y., et al. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096–1103).
Wang, J. R. (2015). Research on multi-feature fusion of aerial photography image recognition taken by UAV. Chengdu, China: Chengdu University of Technology.
Xu, H., Huang, Q., & Jay Kuo, C. C. (2016). Car detection using deformable part models with composite features. In Image Processing (ICIP) (pp. 3812–3816). Phoenix, AZ, USA: IEEE.
Yang, X. F., & Yang, Y. (2014). A method of efficient vehicle detection based on HOG-LBP. Computer Engineering, 40(9), 210–214.
Yebes, J., Bergasa, L., Arroyo, R., & Lázaro, A. (2014). Supervised learning and evaluation of KITTI's cars detector with DPM. In Intelligent Vehicles Symposium Proceedings (pp. 768–773). Dearborn, MI, USA: IEEE.
Zhang, W., Bai, L., Zhang, Y., et al. (2017). Integrative embedded car detection system with DPM. In International Conference on Image and Graphics (pp. 174–185). Springer.
Zhang, J., Zhang, Y., Bai, L., et al. (2018). Lossless-constraint denoising based autoencoders. Signal Processing: Image Communication, 92–99.

Chapter 7

Non-learning-Based Motion Cognitive Detection and Self-adaptable Tracking for Night-Vision Videos

Motion detection and tracking technology is one of the core subjects in the field of computer vision; it is significant and has wide practical value in night-vision research. Traditional learning-based detection and tracking algorithms require many samples and a complex model, which is difficult to implement, and their robustness in complex scenes is weak. This chapter introduces a series of infrared small-target detection and non-learning motion detection and tracking methods based on imaging spatial structure, which are robust to complex scenes.

7.1 Target Detection and Tracking Methods

7.1.1 Investigation of Infrared Small-Target Detection

Infrared small-target detection is an important issue in computer vision and military fields. With the development of infrared imaging technology, infrared sensors can now produce high-resolution images, which facilitates the detection of objects. However, the task still faces a few challenges owing to the complexity of scenes. Recently, various methods have been proposed for infrared small-target detection, such as the morphological top-hat filter method by Drummond (1993). Later, the adaptive Butterworth high-pass filter method was designed for infrared small objects by Yang et al. (2004). Gu et al. (2010) proposed a kernel-based nonparametric regression method for background prediction and small-target detection. Bae et al. (2012) used the edge-directional 2D least mean square filter for infrared small and dim target detection. Li et al. (2014) proposed a novel infrared small-target detection method in a compressive domain. Although these methods have been proposed, a few common issues remain, related to the complexity of patterns in infrared images; these methods often fail to detect small objects when the scenes are more challenging. Aiming to solve this problem, a novel infrared small-target detection method using sparse errors and structure differences is introduced.


Fig. 7.1 Pipeline of infrared small-target detection

The method exploits sparse errors and structure differences between target and background regions. Then, a simple fusion method is applied to generate the final detection result (see Fig. 7.1).

7.1.2 Moving Object Detection Based on Non-learning

Accompanied by the increasing number of night-vision sensors, numerous videos are being produced; thus, it is essential to detect targets and motion automatically, which plays an important role in intelligence operations. Over the last 20 years, several methods have been proposed based on cascade classification (Viola and Jones 2001) and Bayes' rule (Muesera et al. 1999). To determine the parameters of the classifiers, traditional detection methods (Sanchez et al. 2007; Shen and Fan 2010) require a lot of learning, which leads to the problem of low real-time performance. Accordingly, methods without learning have gradually been put forward. In 2007, Takeda et al. (2007) proposed classic kernel regression to recover the high-frequency information of images, which can be used for denoising. In 2009, Milanfar (2009) studied the adaptive kernel regression method to remove noise, enhance image details and detect targets. In the same year, Seo and Milanfar (2009a, b) made further efforts and proposed the locally adaptive regression kernel method, a new non-parametric method for detecting targets. A few years later, Seo and Milanfar (2009a, b, 2011a, b, c) improved the robustness of the regression kernel in different aspects. However, the matching algorithm presented by Seo and Milanfar (2011a, b, c) (hereafter called the 'Seo algorithm') is not suitable for non-compact targets, such as human actions. The overall template, including background, was matched against the test video, which limited the choice of test video scenes; recognition accuracy relies on the background similarity between template and test video. Only when the background of a test video is quite similar to the template can the results be satisfactory. Conversely, when the view angle is changed or the scene becomes complex, the outcomes are disappointing. Wang (2012) amplified template images and divided them into many parts to detect human faces; the template only contains a face, which gives us the inspiration to remove the background. Furthermore, when actions were partially obscured by the landscape, the overall-template matching pattern could not recognise them.


Inspired by the trend towards detection in video streaming, spatiotemporal multiscale statistical matching (SMSM), based on a new weighted 3D Gaussian difference LARK (GLARK), was introduced, which more efficiently recognises actions in video.

7.1.3 Research on Target Tracking Technology

Target tracking is a key technology of scene perception. Target information, such as the moving trajectory, target position and velocity, is the basis for subsequent video image processing, such as target recognition and behavioural analysis. Many target tracking algorithms have been proposed in recent years (Ming 2016; Rong et al. 2010). The continuously adaptive mean shift (CAMSHIFT) tracking algorithm proposed by Bradski was based on colour information; it effectively solved the problem of target deformation and size scaling and consumed little time, but it was not good for tracking fast-moving targets or targets in complex backgrounds. Several improved CAMSHIFT tracking algorithms (Hong et al. 2006; Chun and Shi 2015; Xing et al. 2012; Ran 2012) then improved the stability of tracking; however, these algorithms needed distinctive target colours, high-quality images and simple backgrounds. For complex background scenes, there are now many popular online learning-based tracking algorithms. Compressive tracking and its improved algorithms (Kai 2012; Chong 2014) showed good real-time performance and good robustness under target occlusion and appearance changes, but the fixed target tracking window drifted easily when the target scaled, leading to tracking drift. A visual tracking algorithm based on spatiotemporal context information and its improvement (Zhang et al. 2013; Guo 2016) both required little time; however, owing to the simple features used to obtain the statistical relevance of the target to the surrounding area, these algorithms failed to track when the target moved too fast or was occluded. Additionally, because of low contrast, a lack of colour information and a small grayscale dynamic range, infrared image tracking has become a hot topic in tracking research. The classical mean shift tracking algorithm of Zhang et al. (2006) used grey information for real-time tracking; however, it was vulnerable to interference from backgrounds with similar grey information and failed to track when the target size changed. Meng (2015) proposed an improved mean shift algorithm with a weighted kernel histogram and a brightness–distance space; it successfully tracked rigid infrared targets but could not track a variety of non-rigid targets, such as humans and animals. The mean shift algorithm based on speeded-up robust features (SURF), proposed by Zhang et al. (2011), solved the tracking problem of target scale change in the ideal state; however, it could not track small targets or targets with little texture because of too few extracted or matched feature points. In summary, these tracking algorithms cannot perform target tracking well in a complex background. A tracking model based on global LARK feature matching and the CAMSHIFT algorithm is thus introduced in Sect. 7.3.1. Because the LARK feature is not sensitive to the grey value of each point or the specific object in the picture, it is more sensitive to the change of grey gradient and graphic structure in the picture


(Seo and Milanfar 2009a, b, 2011a, b, c; Wang 2012). It is possible to distinguish between rigid or compact targets and background areas by combining SSIM maps and colour probability distributions. To track non-compact infrared targets, we also propose a local LARK feature statistical matching method. LARK features can also describe the essential characteristics of weak-edge targets well in an infrared image. Whereas the overall structure of the same target is dissimilar in different forms, there are some similarities in local fine structure (Luo et al. 2015). Thus, global matching is transformed into local matching statistical analysis in Sect. 7.3.2. Combined with infrared image characteristics and microstructure similarity statistical analyses, the introduced method could well distinguish between backgrounds and infrared targets in different forms.

7.2 Infrared Small Object Detection Using Sparse Error and Structure Difference

7.2.1 Framework of Object Detection

Sparse Error. There exist appearance divergences between target and background regions. Generally, the boundary regions of the image are considered background regions. Background templates are constructed from these image boundaries, and the entire image is reconstructed by sparse appearance error modelling; each image patch is represented by bases learnt from a set of infrared image patches. First, a given infrared image is taken as input, and SLIC (Achanta et al. 2012) is used to segment the input image into multiple uniform regions. Each segment is described by p = [x, y, lu, g_x, g_y], where lu is the luminance information (Wang et al. 2015), g_x and g_y are the gradient information, and x and y indicate the coordinates of a pixel. The entire infrared image is represented as P = [p_1, p_2, ..., p_N], where N is the number of segments. The boundary segments d are extracted from P as the bases, and the background template set D = [d_1, d_2, d_3, ..., d_M] is constructed, where M is the number of image boundary segments. Given the background templates, the sparse error between object and background regions can be computed: there should be a large difference between object and background regions when both are represented with the same bases. The image segment is then encoded, defined by

a_i = \arg\min_{a_i} \| p_i - D a_i \|_2^2 + \lambda \| \mathrm{diag}(w_i) \, a_i \|_1,    (7.1)

where \lambda is the regularisation parameter and w_i represents the weight for segment d_i, computed as

w_i = \frac{1}{H(d_i)} \sum_{j \in H(d_i)} \exp\left( -\frac{\| d_j - d_i \|^2}{2\sigma^2} \right),    (7.2)


where H(d_i) denotes the number of d_i's neighbours. This weight measures the similarity between segment d_i and its surrounding segments: a large w_i suppresses the corresponding entry a_i towards zero, whereas a_i is less constrained when w_i is small. The weight for a background template should thus be proportional to the similarity between segments on the image boundaries. The regulated reconstruction error for each segment is then computed as

E_i = \| p_i - D a_i \|_2^2.    (7.3)

Coarse regions of interest can easily be estimated from the regulated sparse errors: segments with large reconstruction errors are the target regions, which differ from background regions.

Structure Difference. Objects often have structure information different from that of background regions. In this section, structure differences between object and background regions are exploited via region covariances. Let F denote the feature image extracted from the input image I as F = \Phi(I), where \Phi denotes a mapping function that extracts a k-dimensional feature vector from each pixel in I. A region q inside F can then be described by a k × k covariance matrix,

C_q = \frac{1}{n-1} \sum_{u=1}^{n} (q_u - \mu)(q_u - \mu)^T,    (7.4)

where q_u, u = 1, ..., n, denotes the k-dimensional feature vectors inside region q and \mu denotes the mean of these feature vectors. In this section, k = 5 features (x, y, lu, g_x, g_y) are used to build the region feature. The structure difference map is computed based on two different covariances:

G(q_u, q_v) = \psi(C_{q_u}, C_{q_v}),    (7.5)

where \psi(C_{q_u}, C_{q_v}) computes the similarity between two covariances (Karacan et al. 2013). The covariance matrix captures local image structure well and can effectively estimate the structure differences between objects and background regions; thus, object regions have higher values of G than background regions.

Fusing Error and Structure. Based on the two maps obtained from the two different views, the maps are fused to generate a final map via a linear combination, and the weights of the linear combination are learnt:

S = \sum_{t=1}^{T} \beta_t S^t,    (7.6)


Fig. 7.2 Image decomposition results: a input image; b structure component of the input image; c reconstruction error map; d our final result

where \{\beta_t\} = \arg\min \| S - \sum_t \beta_t S^t \|_F^2, T denotes the number of maps and S^t denotes the error map or the structure difference map. The weights are learnt well using a least-squares estimator; this problem is solved with the conditional random field solution of Liu et al. (2010). Finally, as in the previous method of Li et al. (2014), the final target of interest can be extracted by a threshold method, defined as

S'(x, y) = \begin{cases} 1, & S(x, y) \ge \gamma S_{\max} \\ 0, & \text{otherwise} \end{cases},    (7.7)

where S_{\max} is the maximum value of S and \gamma = 0.6 is the threshold value (Fig. 7.2).
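The following is a minimal NumPy sketch of the region covariance descriptor of Eq. (7.4), built from the five features (x, y, lu, g_x, g_y) mentioned above; the simple gradient computation and the rectangular region are assumptions made for illustration. Comparing two such matrices with the similarity ψ of Eq. (7.5) would additionally require a covariance metric such as the one used by Karacan et al. (2013).

```python
import numpy as np

def region_covariance(gray, x0, y0, x1, y1):
    """5x5 covariance descriptor of Eq. (7.4) for the region [y0:y1, x0:x1]."""
    patch = gray[y0:y1, x0:x1].astype(np.float64)
    gy, gx = np.gradient(patch)                  # simple gradient features g_y, g_x
    ys, xs = np.mgrid[y0:y1, x0:x1]              # pixel coordinates
    # k = 5 features per pixel: x, y, luminance, g_x, g_y
    F = np.stack([xs, ys, patch, gx, gy], axis=-1).reshape(-1, 5)
    mu = F.mean(axis=0)
    Q = F - mu
    return Q.T @ Q / (F.shape[0] - 1)            # (1/(n-1)) * sum (q - mu)(q - mu)^T
```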

7.2.2 Experimental Results

The empirical values σ = 3, N = 400, T = 2 and λ = 0.01 are chosen for the introduced method, and the input image is 480 × 640 pixels. The method is compared with other existing methods, including the top-hat (TH) filter in Drummond (1993) and the compressive domain (CD) method from Li et al. (2014). The signal-to-clutter ratio gain (SCRG) and the background suppression factor (BSF) are used to evaluate these methods. These two metrics are defined as


\mathrm{SCRG} = 20 \times \log_{10} \frac{(U/C)_{\mathrm{out}}}{(U/C)_{\mathrm{in}}}, \qquad \mathrm{BSF} = 20 \times \log_{10} \frac{C_{\mathrm{in}}}{C_{\mathrm{out}}},    (7.8)

where U is the average target intensity and C is the clutter standard deviation in an infrared image. In this experiment, four different scenes are selected to verify the performance of the introduced method. As shown in Fig. 7.3, the first row presents the four types of inputs, the second row illustrates the target predictions produced by TH, the third row shows the detection results of CD and the fourth row shows the detection results of the introduced method. From these visual examples, we see that the introduced method can accurately detect the infrared small targets, because it exploits sparse errors and structure differences, cues that better represent the details of the infrared target. In contrast, the TH method uses a simple top-hat filter to detect the infrared target and produces some false detections (second row of Fig. 7.3). CD improves on TH and performs well in the four scenes; however, some infrared small objects are not well detected, which increases the miss rate of object detection, so it is less accurate for infrared small object detection (third row of Fig. 7.3). To further assess the introduced method against the competing methods, SCRG and BSF are used to evaluate performance. The quantitative results for all the test images are shown in Table 7.1. It is clearly seen that the introduced method obtains the highest evaluation scores among the compared methods; thus, it has a superior ability to detect infrared small targets in complex scenes, demonstrating its effectiveness and robustness. This is because the framework is carefully designed around the characteristics of infrared small targets.
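The following is a small sketch of the SCRG and BSF metrics of Eq. (7.8), assuming binary masks that mark the target pixels and the surrounding clutter region; this mask-based estimation of U and C is an assumption made for illustration.

```python
import numpy as np

def scr(image, target_mask, clutter_mask):
    """Signal-to-clutter ratio U/C: mean target intensity over clutter std."""
    U = image[target_mask].mean()
    C = image[clutter_mask].std()
    return U / C

def scrg_bsf(img_in, img_out, target_mask, clutter_mask):
    """SCRG and BSF of Eq. (7.8) between an input image and a filtered output."""
    scrg = 20 * np.log10(scr(img_out, target_mask, clutter_mask) /
                         scr(img_in, target_mask, clutter_mask))
    bsf = 20 * np.log10(img_in[clutter_mask].std() / img_out[clutter_mask].std())
    return scrg, bsf
```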

Fig. 7.3 Qualitative visual results of four scenes: a three targets in the wild; b two targets in the sky; c three targets in the sea; d one target in the sky


Table 7.1 Quantitative evaluations (SCRG/BSF) of four infrared scenes

| Method            | A1          | A2          | A3          | A4          |
|-------------------|-------------|-------------|-------------|-------------|
| Top-Hat           | 15.22/22.47 | 13.24/20.11 | 12.41/18.45 | 6.24/15.33  |
| CD                | 20.33/29.42 | 18.24/26.40 | 21.24/24.22 | 15.22/20.41 |
| Introduced method | 25.41/35.45 | 22.58/30.49 | 25.37/28.14 | 19.69/25.38 |

Bold indicates the best performance

Using sparse errors and structure differences, the target candidate areas can be well detected. Then, a simple fusion framework can be applied to further improve the accuracy of the introduced method.

7.3 Adaptive Mean Shift Algorithm Based on LARK Feature for Infrared Image

7.3.1 Tracking Model Based on Global LARK Feature Matching and CAMSHIFT

The tracking model based on global LARK feature matching and CAMSHIFT mainly uses colour information and the LARK structure to calculate the probability distribution of the target in the processing image. The mean shift algorithm is then used to obtain the target centre and its size from the probability map. To shorten the matching time, the tracking process in each frame extracts only a region twice the size of the previous frame's target area. The model process is shown in Fig. 7.4.

Fig. 7.4 Tracking model based on global LARK feature matching and CAMSHIFT


Principle of LARK Feature Matching

The calculation principle of the local kernel values of LARK is described by Takeda et al. (2007), Milanfar (2009) and Luo et al. (2015). To reduce the influence of external interference (e.g. light), the local kernel of each point is normalised, as shown in Eq. (7.9):

k_i^l = K_i^l \Big/ \sum_{l=1}^{P^2} K_i^l, \quad i \in [1, N], \; l \in [1, P^2],    (7.9)

where K_i^l is the local kernel, N is the total number of pixels in the image and P^2 is the number of pixels in the local window. The normalised local kernel of a point in the image is sorted by column as a weight vector:

w_i = \left[ k_i^1, k_i^2, \ldots, k_i^{P^2} \right]^T \in R^{P^2 \times 1}.    (7.10)

Next, a window of size P^2 is used to traverse the entire image to obtain its weight matrix:

W = [w_1, w_2, \ldots, w_N] \in R^{P^2 \times N}.    (7.11)

After obtaining the LARK weight matrices W_Q and W_T, which represent the features of the template image and the processing image, the PCA method is used to reduce the redundancy of W_Q and preserve only the first d principal components, constituting the matrix A_Q \in R^{P^2 \times d}. The feature matrices F_Q and F_T are then calculated according to A_Q, as in Wang (2012):

F_Q = \left[ f_Q^1, f_Q^2, \ldots, f_Q^N \right] = A_Q^T W_Q,    (7.12)

F_T = \left[ f_T^1, f_T^2, \ldots, f_T^M \right] = A_Q^T W_T,    (7.13)

where N and M are the total numbers of pixels in the template and processing images, respectively. Then, the LARK feature matching method is performed. First, a moving window of the same size as the template image is moved pixel by pixel across the processing image. The cosine similarity between F_{T_i} of the moving window and F_Q is calculated, as shown in Eq. (7.14):

\rho(f_Q, f_{T_i}) = \frac{\langle f_Q, f_{T_i} \rangle}{\| f_Q \| \, \| f_{T_i} \|} = \cos\theta \in [-1, 1],    (7.14)

where f_Q and f_{T_i} are column vectors formed by sorting F_Q and F_{T_i} in pixel-column order. Lastly, a similarity graph is constructed with the mapping function shown in Eq. (7.15):

f(\rho) = \frac{\rho^2}{1 - \rho^2}.    (7.15)
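A minimal NumPy sketch of the matching step in Eqs. (7.14) and (7.15) is given below: the template feature matrix F_Q and one moving-window feature matrix F_Ti are flattened in pixel-column order, their cosine similarity is computed and the value is mapped through f(ρ); the LARK and PCA feature extraction is assumed to be available, and the small epsilon is an implementation convenience rather than part of the original formulation.

```python
import numpy as np

def cosine_similarity(fq, ft):
    """Cosine similarity of Eq. (7.14) between two flattened feature vectors."""
    return float(fq @ ft / (np.linalg.norm(fq) * np.linalg.norm(ft)))

def similarity_map_value(FQ, FTi):
    """Map one window's features to the similarity value f(rho) of Eq. (7.15).

    FQ, FTi : (d, n) feature matrices (d principal components, n pixels),
    flattened in pixel-column order before the comparison.
    """
    fq = FQ.reshape(-1, order="F")               # column vector in pixel-column order
    ft = FTi.reshape(-1, order="F")
    rho = cosine_similarity(fq, ft)
    return rho ** 2 / (1.0 - rho ** 2 + 1e-12)   # epsilon avoids division by zero
```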


Framework of Tracking Algorithm

The pseudocode of the global LARK feature matching tracking model introduced in this section contains four parts: target selection, LARK feature extraction, target probability map calculation and target position search. The process of obtaining the weighted fusion target probability graph is illustrated in Fig. 7.5.

Algorithm 1 Target tracking model based on global LARK feature matching:
1. Manually select the tracking target as a template and calculate its W_Q.
2. Extract a region twice the size of the target area as the processing image and calculate its W_T.
3. Apply PCA to W_Q and W_T to obtain the feature matrices F_Q and F_T.
4. Transform RGB into HSV and use the H component to obtain the original target probability graph.
5. Apply LARK feature matching to obtain a structure similarity graph and normalise it.
6. Obtain the weighted fusion target probability map.
7. Apply the adaptive mean shift algorithm to locate the target position.
8. Repeat steps 2–7 on each new frame to achieve tracking.

Fig. 7.5 Process diagrams: a tracking target and processing image; b structure similarity; c original target probability; d weighted fusion target probability


7.3.2 Target Tracking Algorithm Based on Local LARK Feature Statistical Matching

The previous section described a LARK feature global matching method used to obtain a structure similarity graph. When tracking non-compact targets, such as pedestrians, the deformation is random, complex and diverse. For the same target under different forms, the overall structure reflected by the LARK features is not similar, whereas the local structures are highly similar. Thus, in this section, global matching is turned into local matching, and the number of similar local structures is statistically analysed. The LARK feature can describe the basic structural features of infrared targets with weak edges, whereas infrared images have no colour information and lack rich grey information. By combining grey information with the local similarity statistics to obtain the target probability graph, we can distinguish the non-compact infrared target from the background and realise the subsequent iterative search tracking. The tracking process based on local LARK feature statistical matching is shown in Fig. 7.6.

Because infrared images have no colour information, we first obtain the original target probability graph according to the grey values. From Sect. 7.3.1, every column vector in a feature matrix represents a local structure feature of the image. To reflect the structural features with the fewest feature vectors, cosine similarity is used to reduce the redundancy of the feature matrix. The cosine similarity value of Eq. (7.14) is calculated between each f_Q^i of the feature matrix F_Q and the other N − 1 vectors. If it is greater than a threshold t1, the two vectors are considered similar and only one of them is preserved; otherwise, both vectors are preserved.

After removing the redundancy, the local features of the feature matrix F_Q = {f_Q1, f_Q2, ..., f_Qn}, n < N, are non-repetitive. This effectively avoids the influence of originally similar structures in the template image on the statistical matching. Then, the number of similar local structures between the feature matrices of the template image and the processing image is analysed statistically. The cosine similarity matrix is established by calculating the cosine similarity of each vector of F_T and F_Q, as shown in Eq. (7.16):

\rho_L = \rho(F_T, F_Q) = \begin{bmatrix} \rho_{11} & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{M1} & \cdots & \rho_{Mn} \end{bmatrix},  (7.16)

where ρ_ij is the cosine similarity of the ith column of F_T and the jth column of F_Q. The closer the cosine similarity is to 1, the more similar the local structures represented by the two vectors and the greater the likelihood of the target. Thus, the maximum value of each row is extracted from the matrix ρ_L, and its column location in F_Q is preserved in the index matrix, index_L:

\rho_L = \left[ \max(\rho_{1k_1}), \max(\rho_{2k_2}), \ldots, \max(\rho_{Mk_M}) \right]^{\mathrm{T}},  (7.17)


Fig. 7.6 Tracking model based on local LARK feature statistical matching

\mathrm{index}_L = [x_1, x_2, \ldots, x_M]^{\mathrm{T}}, \quad x_1, x_2, \ldots, x_M \in \{1, 2, \ldots, n\}.  (7.18)

The resulting position index matrix, index_L, and the maximum similarity matrix, ρ_L, are arranged in pixel-column order. A similarity threshold, t2, is set to reduce the interference of less similar local structures. If a similarity value is greater than t2, it is considered an effective similarity; otherwise, it is considered invalid, and the index value at the corresponding position in index_L is set to zero. Each index value in the matrix represents the position of a similar structure in the template image: the more distinct index values a local window contains, the more similar local structures it contains, and thus the greater the probability of the target. Finally, a fixed-size local window is used to traverse the index matrix. If the number of nonzero pixels of the original target probability map within the window is greater than a certain threshold, the number of non-repeating index values is counted; otherwise, the count for the window is recorded as zero. The matrix R_n of index-value counts is constructed from the number of non-repeating index values of each window, and a statistical matching map is obtained by normalising R_n. The target probability map is obtained by weighting it with the original target probability map, as shown in Fig. 7.7; there is a significant difference between the pixel values of the target and the background area. The target position is then obtained by applying the adaptive mean shift algorithm to the target probability map. A small sketch of the window-statistics step is given below.
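A minimal sketch of the window statistics, assuming the index matrix and the original target probability map have the same size; the window size and pixel-count threshold are illustrative values only.

```python
import numpy as np

def statistical_matching_map(index_map, target_prob, win=8, min_fg_pixels=10):
    """Count non-repeating index values in each local window of `index_map`
    (zeros mark invalid matches).  `win` and `min_fg_pixels` are illustrative;
    the chapter only states that fixed thresholds are used."""
    H, W = index_map.shape
    R_n = np.zeros((H, W))
    for i in range(0, H - win + 1):
        for j in range(0, W - win + 1):
            block_idx = index_map[i:i + win, j:j + win]
            block_prob = target_prob[i:i + win, j:j + win]
            if np.count_nonzero(block_prob) > min_fg_pixels:
                values = block_idx[block_idx > 0]
                R_n[i, j] = len(np.unique(values))   # number of distinct structures
    return R_n / (R_n.max() + 1e-12)                 # normalised statistical matching map
```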

7.3.3 Experiment and Analysis

In this section, the tracking model based on global LARK feature matching and CAMSHIFT is referred to as the global feature matching tracking (GLMT) algorithm, and the tracking model based on local LARK feature statistical matching is referred to as the LLSMT algorithm.


Fig. 7.7 Left is the original target probability map; right is the weighted fusion target probability map

Because GLMT uses global matching, it is suitable for rigid or compact target tracking. Thus, the first part uses the GLMT, compressive tracking (CT), spatiotemporal context (STC) learning and CAMSHIFT algorithms to trace human faces in a standard video library and cars in UAV video, and compares the results. Because the LLSMT algorithm uses local statistical matching, it is more suitable for non-compact targets with large changes. The second part uses LLSMT, GLMT and CAMSHIFT to trace pedestrians in infrared standard video libraries and compares the results.

Target Tracking Experiment in Visible Video

• Experiment on tracking a car

In this experiment, the trajectory of the car is a turn at the beginning and then a straight line. The experimental results are shown in Fig. 7.8. The red box is the tracking result of the GLMT algorithm, and the blue box is the tracking result of CAMSHIFT. As shown in Fig. 7.8, when the target colour and background are quite different and the target undergoes a significant rotation, both the GLMT and CAMSHIFT algorithms can effectively distinguish the target from the background, owing to their use of colour information. When the colour difference between the target and the background is small, the tracking result of the CAMSHIFT algorithm drifts or even loses the target, because it relies on colour information alone. The GLMT algorithm, however, uses both colour information and structure features, improving the contrast between target and background and giving a good tracking result.

• Experiment on tracking a human face

This experiment chooses the human face video of AVSS2007. The human face in the video undergoes rapid size changes, a slight rotation and slight occlusion, and the background in the video has a colour similar to the face. Figure 7.9 displays the tracking results of the different algorithms.


Fig. 7.8 Tracking results for different frames

Fig. 7.9 Results of different tracking algorithms for human face video

The red box is the tracking result of GLMT, the green box represents CT, the blue box represents STC and the yellow box represents CAMSHIFT. The centre location error (CLE) curves of the tracking results of the different algorithms are shown in Fig. 7.10. The abscissa represents the frame number of the video, and the ordinate represents the error between the centre position of the tracking result and the true target centre position. Table 7.2 lists the corresponding error analysis, including average CLE, DP and OP.


Fig. 7.10 Frame-by-frame comparison of CLE (in pixels) on tracking the human face

Table 7.2 Tracking result analysis on the human face video

             Average CLE (pixels)   DP (%)   OP (%)
GLMT         11.6                   87.5     71.9
CT           94.5                   4.69     4.06
STC          151                    9.69     6.88
CAMSHIFT     27.2                   55       38.8

As shown in Fig. 7.9, the target edge is not clear, owing to the rapid size changes and other objects of the same colour. Thus, the CAMSHIFT, CT and STC algorithms cannot accurately obtain the target location. Because LARK does not depend on the edge of the target and is more sensitive to the internal structure, the GLMT algorithm can better track the target by making a global match against the unchanging internal structure of the target. From Fig. 7.10, the CLE of each frame of the GLMT tracking result is lower than that of CT, STC and CAMSHIFT, and Table 7.2 shows that the average CLE of the GLMT algorithm is also the lowest. DP is computed as the relative number of frames in the sequence for which the centre location error is smaller than a certain threshold; the DP threshold is set to 20 pixels. OP is defined as the percentage of frames for which the bounding box overlap surpasses a threshold of 0.5. The DP and OP values of the GLMT algorithm are both higher than those of CT, STC and CAMSHIFT in Table 7.2. Thus, we can conclude that the GLMT algorithm tracks rigid or compact targets well. A small sketch of how these metrics can be computed is given below.
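A minimal sketch of the CLE/DP/OP evaluation, following the definitions in the text (DP threshold 20 pixels, OP threshold 0.5); boxes are assumed to be given as (x, y, width, height).

```python
import numpy as np

def tracking_metrics(pred_centres, gt_centres, pred_boxes, gt_boxes,
                     dp_thresh=20.0, op_thresh=0.5):
    """Average CLE, distance precision (DP, %) and overlap precision (OP, %)."""
    cle = np.linalg.norm(np.asarray(pred_centres) - np.asarray(gt_centres), axis=1)
    dp = 100.0 * np.mean(cle < dp_thresh)

    ious = []
    for (x1, y1, w1, h1), (x2, y2, w2, h2) in zip(pred_boxes, gt_boxes):
        ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))   # horizontal overlap
        iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))   # vertical overlap
        inter = ix * iy
        union = w1 * h1 + w2 * h2 - inter
        ious.append(inter / union if union > 0 else 0.0)
    op = 100.0 * np.mean(np.asarray(ious) > op_thresh)
    return float(cle.mean()), float(dp), float(op)
```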


Fig. 7.11 Results of different tracking algorithms for pedestrian infrared image sequence

Tracking Experiment on an Infrared Non-compact Target

This experiment chooses pedestrian infrared image sequences from VOT2016. The pedestrian in the sequence shows significant changes in posture. The experimental results are shown in Fig. 7.11. The red box is the tracking result of the LLSMT algorithm, the blue box represents GLMT and the green box represents CAMSHIFT. As seen in Fig. 7.11, the edge of the target in the infrared image is weak and its grey-level information is not rich. Thus, the tracking result of CAMSHIFT drifts because it uses only grey information. GLMT and LLSMT both use LARK, which describes the internal structure of the infrared target well, and use LARK feature matching to obtain the target region accurately. With large variations in posture, the target exhibits large changes in overall structure but only small changes in local structures (e.g. head and feet). The LLSMT algorithm divides the whole structure into local structures and performs statistical matching, so it can track such targets well. Figure 7.12 shows the CLE curves of the three tracking algorithms on the infrared pedestrian. As shown in Fig. 7.12 and Table 7.3, the CLE of the LLSMT algorithm is the lowest, and its DP and OP are higher than those of GLMT and CAMSHIFT. Thus, we can conclude that the algorithm tracks non-compact infrared targets of various postures well.


Fig. 7.12 Frame-by-frame comparison of the centre location error (in pixels) on tracking the pedestrian

Table 7.3 Tracking result analysis on the pedestrian infrared image sequence

             Average CLE (pixels)   DP (%)   OP (%)
CAMSHIFT     216                    0.385    0.265
GLMT         31.6                   33.5     33.5
LLSMT        8.6                    99.6     89.2

7.4 An SMSM Model for Human Action Detection

Considering noise, background interference and massive information, this section introduces a non-learning SMSM model based on the dense computation of a so-called space-time local adaptive regression kernel to identify non-compact human actions. This model contains three parts:
• calculation of the GLARK feature;
• a multi-scaled composite template set and
• spatiotemporal multiscale statistical matching.
Before we begin a more detailed elaboration, it is worthwhile to highlight some aspects of the introduced model in Fig. 7.13.


Fig. 7.13 Flowchart of our SMSM model on motion detection

Algorithm 1 Human action detection method based on the SMSM model:
• Construct W_Qi and W_Ti, the collections of 3D GLARKs associated with Q_i and T_i. Connect the W_Qi to form W_Q.
• Apply PCA to W_Ti and obtain F_Ti.
• Remove similar column vectors in W_Q to get F_Q.
• For every target cube T_i, where i = 1, ..., number of video frames, compute the matrix cosine similarity

  \rho_i = \frac{\langle F_Q, F_{T_i} \rangle_F}{\lVert F_Q \rVert_F \, \lVert F_{T_i} \rVert_F}, \qquad
  \rho_{3DGLK}(:,:,i) = \rho(F_Q, F_{T_i}) = \begin{bmatrix} \rho_{11} & \cdots & \rho_{1 n_T} \\ \vdots & \ddots & \vdots \\ \rho_{m_T 1} & \cdots & \rho_{m_T n_T} \end{bmatrix}

• Record the position of the column vector corresponding to each row maximum in F_Q, count the number of non-duplicate index values and obtain RM.
• Apply the non-maxima suppression parameter, τ, to RM.

A minimal sketch of the matrix cosine similarity computation is given after the algorithm.
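A small sketch of the two similarity measures used above; it assumes F_Q and F_Ti are given as 2D arrays (for the Frobenius form, with identical shapes).

```python
import numpy as np

def matrix_cosine_similarity(F_q, F_t):
    """Frobenius (matrix) cosine similarity between the template feature matrix F_Q
    and the feature matrix F_Ti of one target cube (same shape assumed)."""
    num = np.sum(F_q * F_t)                      # Frobenius inner product <F_Q, F_Ti>_F
    den = np.linalg.norm(F_q) * np.linalg.norm(F_t) + 1e-12
    return float(num / den)

def columnwise_cosine_matrix(F_q, F_t):
    """rho_3DGLK(:, :, i): cosine similarity between every column of F_Ti (rows of the
    result) and every column of F_Q (columns of the result)."""
    Q = F_q / (np.linalg.norm(F_q, axis=0, keepdims=True) + 1e-12)
    T = F_t / (np.linalg.norm(F_t, axis=0, keepdims=True) + 1e-12)
    return T.T @ Q        # shape: (n_columns_of_F_Ti, n_columns_of_F_Q)
```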


7.4.1 Technical Details of the SMSM Model

Local GLARK Feature

This part first briefly introduces the space-time locally adaptive regression kernel. The LARK proposed by Seo and Milanfar (2009a, b) captures the geometric structure effectively. The locally adaptive regression kernel definition formula (see Seo and Milanfar 2009a, b; Luo et al. 2015) is

K(C_l, X_l) = \exp(-d_s^2) = \exp\left(-X_l^{\mathrm{T}} C_l X_l\right).  (7.19)

Covariance matrix, C_l, is calculated from the simple gradient information of the image. In fact, it is difficult to describe the concrete structural feature information of a target with simple gradient information. Moreover, it is easy to ignore the weak edge of the target when the contrast of the edge region of the target is relatively small, which causes missed detections. To make up for this defect, we fully exploit the LARK feature information, introduce the difference of Gaussians (DOG) operator into the LARK feature and generate a new GLARK feature descriptor to enhance weak-edge structure information. It is mainly inspired by the receptive field structure of neurons in Fig. 7.14 (Kuffler 1953). Rodieck and Stone (1965) established the DOG model to simulate the concentric circular antagonistic receptive field of retinal ganglion cells. It is necessary to note that the Gaussian kernel operator is defined by

g(\cdot, \sigma) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{\sum_{k=1}^{D} x_k^2}{2\sigma^2}\right).  (7.20)

As for the two-dimensional (D = 2) image, different Gaussian convolution kernels, σ, as multiscale factors, are taken to make the convolution with the gradient information of the pixel points, as shown in Eq. (7.21).

Fig. 7.14 Receptive field map: the classical receptive field has a mutually antagonistic structure of the central and surrounding regions; and the non-classical receptive field is a larger area outside the classical receptive field, which removes the suppression of the classical sensory field


Fig. 7.15 Gauss kernel convolution

D(x, y, \sigma) = g(x, y, \sigma) \otimes z(x, y),  (7.21)

Z(x, y, \sigma, k) = D(x, y, k\sigma) - D(x, y, \sigma),  (7.22)

where ⊗ represents convolution, (x, y) is the spatial coordinate and z(x, y) is the gradient of the image, which has two forms of expression: z_x(x, y) and z_y(x, y). Z(x, y, σ, k) gives the Gaussian difference gradient matrices Z_x and Z_y. Figure 7.15 shows the gradient matrix, z, of a 3 × 3 region. Here, we assume a 3 × 3 Gauss kernel operator g = [1 2 3; 4 5 6; 7 8 9] and calculate the convolution as follows:

f(i, j) = g * z = \sum_{k,l} g(i-k, j-l)\, z(k, l) = \sum_{k,l} g(k, l)\, z(i-k, j-l).  (7.23)

Next, we take the centre point, (2, 2), of the 3 × 3 region as an example:

\begin{bmatrix} Z_{x22} \\ Z_{y22} \end{bmatrix} = \begin{bmatrix} Z_{x11} & \cdots & Z_{x33} \\ Z_{y11} & \cdots & Z_{y33} \end{bmatrix} \begin{bmatrix} 9 & 8 & 7 & \cdots & 2 & 1 \end{bmatrix}^{\mathrm{T}}.  (7.24)
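A minimal sketch of the DOG gradients of Eqs. (7.21)–(7.22). The chapter does not specify the gradient operator, so a Sobel derivative is assumed here, and the σ and k values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def dog_gradients(image, sigma=1.0, k=1.6):
    """Difference-of-Gaussians gradient maps Z_x, Z_y (Eqs. 7.21-7.22).
    sigma and k are illustrative values (k = 1.6 is a common DOG ratio)."""
    z_x = sobel(image.astype(float), axis=1)   # horizontal gradient z_x(x, y)
    z_y = sobel(image.astype(float), axis=0)   # vertical gradient z_y(x, y)
    Z_x = gaussian_filter(z_x, k * sigma) - gaussian_filter(z_x, sigma)
    Z_y = gaussian_filter(z_y, k * sigma) - gaussian_filter(z_y, sigma)
    return Z_x, Z_y
```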

In this section, the third dimension of the 3D GLARK is time, and X = [dx, dy, dt]^T. Thus, the new C_GLK is

C_{GLK} = \sum_{m \in \Omega_l} \begin{bmatrix} Z_x^2(m) & Z_x(m) Z_y(m) & Z_x(m) Z_T(m) \\ Z_x(m) Z_y(m) & Z_y^2(m) & Z_y(m) Z_T(m) \\ Z_x(m) Z_T(m) & Z_y(m) Z_T(m) & Z_T^2(m) \end{bmatrix},  (7.25)

where Ω_l is the space-time analysis window. Then, GLARK is defined by

K(C_{GLK}, X_l) = \exp\left(-X_l^{\mathrm{T}} C_{GLK} X_l\right).  (7.26)


Fig. 7.16 GLARK map of hat edge

GLARK highlights the edges of an object and weakens texture interference, paying particular attention to the local structure of weak edges. It is expressed as

W_x = [W_{x_1}, W_{x_2}, \cdots, W_{x_M}] \in \mathbb{R}^{N \times M},  (7.27)

where M represents the number of pixels of the GLARK feature. Figure 7.16 shows the GLARK feature descriptor. As can be seen from the figure, GLARK better describes the structural trend at weak edges, and in normal edge regions the introduced feature descriptor also goes further than the Seo algorithm.

Multi-scaled Composite Template Set

The colour video contains three colour channels, and each frame contains three dimensions; however, the third dimension of our model is time. Thus, the video is translated into a grey image sequence. Differing from supervised methods that require millions of training images, our training set is minimal. Figure 7.17 displays the templates, T1 to T5, covering pedestrian, long jump, skiing, cliff diving and javelin throw actions. The actions are resized to multiple scales, so the model can recognise actions of multiple sizes in one frame. Figure 7.18 shows the formation of the multi-scaled template set. Every template image is sequenced to obtain the original cube, and the nearest-neighbour interpolation method then resizes the original template to 0.5 times and 1.5 times its size, respectively, forming three template sequences. Only the first and second dimensions of the original template, indicating size, are resized; the third dimension, representing motion, remains the same. 3D GLARK is utilised to obtain the multi-scaled local structure template set. As shown in Fig. 7.18, the sizes of template_0.5, template_initial and template_1.5 are, respectively, m_z/2 × n_z/2 × t_z, m_z × n_z × t_z and 3m_z/2 × 3n_z/2 × t_z. Hence, we obtain three GLARK matrices: W_0.5 ∈ R^{N×M_1} with M_1 = M/2^2, W_initial ∈ R^{N×M} with M = m_z × n_z × t_z, and W_1.5 ∈ R^{N×M_3} with M_3 = M × (3/2)^2, where N = m_1 × n_1 × t_1 is the size of the 3D windows for calculating GLARK. Then, W_0.5, W_initial and W_1.5 are connected to obtain the total template matrix (a small sketch follows):

W_Q = [W_{0.5} \; W_{\mathrm{initial}} \; W_{1.5}] \in \mathbb{R}^{N \times M_t}, \quad M_t = M_1 + M + M_3.  (7.28)
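A small sketch of the template-matrix construction of Eq. (7.28). The GLARK computation itself is not reproduced here; `glark_fn` is a hypothetical callable that maps a video cube to an (N, M) feature matrix.

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_template_matrix(template_cube, glark_fn, scales=(0.5, 1.0, 1.5)):
    """Resize the template cube spatially (nearest neighbour), compute a GLARK
    matrix for each scale and concatenate the columns to obtain W_Q."""
    blocks = []
    for s in scales:
        # resize only the spatial dimensions; the temporal dimension stays unchanged
        resized = zoom(template_cube, (s, s, 1.0), order=0)
        blocks.append(glark_fn(resized))          # each block has N rows
    return np.concatenate(blocks, axis=1)         # W_Q in R^{N x M_t}
```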


Fig. 7.17 Template set of various actions (T1–T5)

Fig. 7.18 Multi-scaled template set



SMSM Model

This section studies a method based on the non-learning SMSM model, regarding W_Ts (s denotes the video frame) and deriving features by applying dimensionality reduction (i.e. PCA) to the resulting arrays (Seo and Milanfar 2009a, b). However, it needs a threshold, α, the so-called similar structure threshold, to remove the redundancy of W_Q. The model calculates the cosine of the intersection angle between each pair of column vectors in the matrix; if the cosine value is greater than α, the two vectors are similar and only one of them is retained. After reducing the dimensionality and the redundancy, F_Q is obtained together with F_Ts. The next step in the introduced framework is a decision rule based on the measurement of a matrix cosine similarity (Leibe et al. 2008). Regarding F_Ts, we calculate the cosine of the angles between each pair of column vectors, f_Ti and f_Qi. Then, we obtain the cosine similarity matrix, ρ_3DGLK:

\rho_{3DGLK}(:,:,k) = \rho(F_Q, F_T) = \begin{bmatrix} \rho_{11} & \cdots & \rho_{1 n_T} \\ \vdots & \ddots & \vdots \\ \rho_{m_T 1} & \cdots & \rho_{m_T n_T} \end{bmatrix} \quad (k = 1, 2, \ldots, t_T),  (7.29)

where m_T × n_T × t_T is the size of the test video. This section takes the maximum of each row in the matrix ρ_3DGLK(:,:,k) and records the position of the column vector corresponding to the maximum in F_Q. The location information is saved in the index_GLK matrix:

\mathrm{index}_{GLK}(:, k) = (x_1, x_2, \ldots, x_{m_T})^{\mathrm{T}}, \quad x_1, x_2, \ldots, x_{m_T} \in \{1, 2, \ldots, n_T\}.  (7.30)

We arrange the position index matrix, index_GLK(:, k), and ρ_3DGLK(:,:,k) into matrices of size m_T × n_T, according to the column order. We then need a similarity threshold, θ, whose empirical value is 0.88, to judge each element ρ'_3DGLK in ρ_3DGLK. If ρ'_3DGLK ≥ θ, the two vectors are similar; if ρ'_3DGLK < θ, they are not, and the corresponding position in index_GLK is recorded as 0. This section selects an appropriate local window of P × P × T to traverse the index_GLK matrix and counts the non-duplicate index values in the window, which can be written as num = num(Unique(index_GLK^{P×P×T})), where num represents the similarity value between the target of interest and the present region. Each index value represents the corresponding structure of the target image, like the template: the more distinct index values there are, the more similar structures the local window contains. On this basis, we construct a similarity matrix, RM_GLK; Fig. 7.19 illustrates the process from index_GLK to RM_GLK. From RM_GLK, we acquire a global similarity image, RM. Next, we employ the non-maxima suppression method of Devernay (1995) to extract the local maxima of the resemblance image and seek out the locations of people. Therefore, we need a non-maximum suppression threshold, τ, to decide whether to ignore some areas of RM; we then calculate the maximum value of the remaining possible areas.


Fig. 7.19 Process from indexGLK to RMGLK

Fig. 7.20 a Resemblance maps; b RMs with non-maximum suppression; c detection result of sampled frames

In Fig. 7.20a, the deep-red part of the map represents higher grey values, indicating the possibility of the existence of the target in the corresponding region of the test frames. It is easy to see that there is a region of high grey value. Thus, there may be a human body in the location of the test frames, T , corresponding to the maximum grey value of the region. Figure 7.20b represents RM with non-maximum suppression; Fig. 7.20c represents the detection result of a sampled frame.

7.4.2 Experiments Analysis

Parameters Analysis

The main parameters of our approach are the similar structure threshold, α, for W_Q and the non-maximum suppression threshold, τ, for RM. When removing redundancies, we judge whether two vectors are similar by comparing the cosine value of the two weighted vectors with the threshold, α. The larger the threshold α, the more weight vectors are reserved and the more complex the calculation; the smaller the threshold α, the greater the difference among the retained weight vectors and the less accurate the recognition results.


Fig. 7.21 Relationship between number of local structures and threshold α: a pedestrian; b athlete; c multiscale people in one video

Therefore, a specific threshold, α, must be selected. This section statistically analyses the number of retained weight vectors under different values of α. As shown in Fig. 7.21, the abscissa represents the threshold, α, and the ordinate represents the number of weight vectors after redundancy removal. The curves first increase slowly and then grow rapidly as the threshold α increases. Therefore, the threshold is determined by the turning point of the curve, where the slope is 0.5 (a small sketch of this selection is given below). In Fig. 7.21a, this section chooses the point (0.986, 909); thus, threshold α is set to 0.986, the number of weight vectors after de-redundancy is 909 and the new weight vector matrix is used in the similar-structure analysis. Equally, in Fig. 7.21b, we choose the point (0.994, 289): threshold α is set to 0.994 and the number of weight vectors after de-redundancy is 289. In Fig. 7.21c, we choose the point (0.996, 118): threshold α is set to 0.996 and the number of weight vectors after de-redundancy is 118.
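A small sketch of the slope-based turning-point selection; normalising both axes before computing the slope is an assumption here, since the chapter does not state the scaling under which the slope criterion of 0.5 applies.

```python
import numpy as np

def select_alpha(alphas, counts, slope_target=0.5):
    """Pick the threshold alpha at the turning point of the count-vs-alpha curve,
    i.e. the first point whose (normalised) slope reaches `slope_target`."""
    a = np.asarray(alphas, dtype=float)
    c = np.asarray(counts, dtype=float)
    # normalise both axes to [0, 1] so the slope criterion is scale independent
    a_n = (a - a.min()) / (a.max() - a.min() + 1e-12)
    c_n = (c - c.min()) / (c.max() - c.min() + 1e-12)
    slopes = np.gradient(c_n, a_n)
    idx = int(np.argmax(slopes >= slope_target))   # first index meeting the criterion
    return a[idx], int(c[idx])
```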


Fig. 7.22 a RM probability density curves; b RM probability density integrals sum curves

The probability density curves are integrated to establish the integral sum curves (see Fig. 7.22b). Although the RM probability density distribution of each target image is quite different, their integral sum curves converge at the tail. Because the tail corresponds to the larger RM values, where the target may still occur, the relatively small grey-level pixel values can be omitted to reduce the amount of computation and improve efficiency. Thus, we select the point where the integral sum curves of the RM of each image begin to converge: the portion to the right of this point is retained and the portion to the left is omitted. Based on Fig. 7.22b, τ is set to 0.75.

Test Results

This section evaluates the performance of the introduced method by testing it and the Seo algorithm on challenging scenes (i.e. a single object, a fast-moving object and multiple objects) from the Visual Tracker Benchmark.

• Single person

To evaluate the robustness of the introduced method, we first test it on visible video with one object. Figure 7.23 shows the results of searching for a walking person in a target video (597 frames of 352 × 288 pixels). The query video contains a very short walking action moving to the right (10 frames of 48 × 102 pixels). This section randomly displays the results of three frames. Obviously, the detection effect of the multispectral data is superior to the single band. Owing to its single template, the Seo algorithm can only recognise the target when its posture is similar to the template; for other frames, Seo performs poorly. The introduced algorithm enlarges the weight matrix commensurate with the abundant spectral information, so it not only improves the detection rate but also identifies targets more accurately. Simultaneously, because our GLARK focuses on the local structure information of the target, our algorithm detects the target well even when it is partially occluded, whereas the Seo algorithm, which is more concerned with the overall structure than the local structure, does not apply when part of the body structure is blocked.


Fig. 7.23 Detection results of pedestrian: a introduced algorithm; b Seo algorithm; c introduced algorithm in overshadowed scene; d Seo algorithm in overshadowed scene

• Fast motions

This section also tests the introduced method on visible video with fast motions. Figure 7.24 shows the results of detecting a female skater turning (160 frames of 320 × 240 pixels) and a male surfer (84 frames of 480 × 270 pixels). The query videos contain the fast motions of the skater (6 frames of 121 × 146 pixels) and the surfer (6 frames of 84 × 112 pixels). Figure 7.24b shows that the Seo algorithm can coarsely localise the athlete in the images, but the bounding boxes contain only parts of the person. In contrast, the turns of the athlete are detected by the introduced method even though the video contains very fast-moving parts and relatively large variability in spatial scale and appearance compared to the query Q given in (a); these bounding boxes locate the athlete well. Additionally, we test a scene in which the target scale changes during fast motion.


Fig. 7.24 Fast motions detection: a, c introduced algorithm; b, d Seo algorithm

As shown in (c) and (d), the Seo algorithm is very unstable, whereas our algorithm is robust to the changing scale of the target. The analysis suggests that this is because our templates are multi-scaled composites, while the Seo template is comparatively simple. Thus, our method overcomes the drawbacks of the Seo algorithm and produces reliable results on this dataset.

• Multiple cases and scales of pedestrians

To further evaluate the performance of the introduced method, we test it on visible video containing different people. Figure 7.25 shows the results of detecting multiple humans of different sizes that occur simultaneously in two surveillance videos. The query videos contain two pedestrians who drift ever farther away, so their sizes diminish. Figure 7.25a, c shows the results of the introduced algorithm, and Fig. 7.25b, d shows the detection results of the Seo algorithm.


Fig. 7.25 Visual comparisons of different methods with different scales: a, c introduced algorithm; b, d Seo algorithm

The woman's coat on the left of the first picture in (a) and (b) and the man's clothes in (c) and (d) have a low contrast against the cement. For such weak human edges, the Seo algorithm easily mistakes the human for background, whereas our algorithm detects the target well in this case because it enhances weak-edge information locally. Moreover, the multiple scales of the people in the same field of view in (c) and (d) limit the detection of the Seo algorithm; our algorithm does not suffer such interference in the experiments because of its multiscale template sets.

Analysis of the Introduced Method

This section first evaluates the IoU of our method and the Seo algorithm on the images shown, in Fig. 7.26a. It is clear that the introduced method achieves superior performance compared to the Seo algorithm. For target recognition systems, precision and recall are contradictory quantities.


Fig. 7.26 Comparison between Seo and the introduced algorithm: a IoU curve; b RPC curve

Receiver operating characteristic (ROC) curves (Lachiche and Flach 2003) are introduced to evaluate the effects of target recognition:

TP_r = \frac{TP}{TP + FN}, \qquad FP_r = \frac{FP}{TP + FP},  (7.31)

where TP is the correctly detected target area, FP is the area mistaken for the target and FN is the undetected target area. Hence, the RPC (Leibe et al. 2008) is utilised. As shown in Fig. 7.26b, the horizontal and vertical axes correspond to two evaluation indexes: 1-precision and recall. The Visual Tracker Benchmark has manually tagged the test sequences with 11 attributes representing the challenging aspects of visual detection. The 30 videos containing human actions in the datasets are tested by our model and the Seo algorithm. The mAP is the average precision over the 11 attributes, as shown in Fig. 7.27.

Experiments at THUMOS 2014

To make the experimental results of our algorithm more credible, we compare against two other benchmark systems: (1) S-CNN (Devernay 1995), which leverages deep networks in temporal action localisation via three segments based on 3D ConvNets; and (2) the system of Wang et al. (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html), built on iDT with FV representation and frame-level CNN features, with post-processing to refine the detection results. The temporal action detection task in the THUMOS Challenge 2014 was dedicated to localising action instances in long untrimmed videos (THUMOS 2014: http://crcv.ucf.edu/THUMOS14/download.html). Our model is unsupervised and only uses the test data. The detection task involved 20 actions, as shown in Table 7.4, and every action included many scenes. The average precision of each action was tested over all scenes with the three methods. mAP is the mean average precision over 213 videos, shown in Table 7.4. Although the mAP is slightly lower than that of S-CNN, our model is unsupervised and does not rely on a training process.




Fig. 7.27 Histogram of average precision (%) for each attribute on the visual tracker benchmark (Seo algorithm vs. the proposed algorithm)

Table 7.4 Histogram of average precision (%) for each class on THUMOS (2014)

                    S-CNN    Wang et al.   The introduced method
BaseballPitch       15.2     16            22.4
BasketballDunk      23       0.8           5.6
Billiards           16.5     2.1           16
CleanAndJerk        25       9.7           26.9
CliffDiving         27.9     7.3           1.2
CricketBowling      15.8     0.8           9
Cricketshot         13.9     0.5           26.7
Diving              18       1.8           18.3
FrisbeeCatch        15.3     0.9           15.4
GolfSwing           16.9     22            33
HammerThrow         18.3     28            15.9
HighJump            22.6     10.1          17.4
JavelinThrow        17.2     6.2           17.6
LongJump            34.8     19.7          16.8
PoleVault           31.78    17.3          25.7
Shotput             12.5     1.2           22
SoccerPenalty       18.9     4             5.7
TennisSwing         19.7     3.2           24.6
ThrowDiscus         24.8     18.5          17
VolleyballSpiking   0.48     0.18          8.6
mAP                 19.4     10.3          17.3


7.5 Summary

In view of some problems of the existing detection and tracking methods (e.g. diverse motion shapes, complex image features and weak adaptability to varied scenes), we described the following three models:

• A new method for infrared small object detection capable of detecting small objects using sparse error and structure differences. This is the first time that structure differences and sparse errors have been successfully applied to infrared small object detection. The infrared small target can be easily extracted by exploiting the sparse error and the structure differences, and the results are estimated via a simple fusion framework. Experimental results demonstrate that the introduced method is effective, performs favourably against other methods and yields superior detection results.

• The introduced algorithm, GLMT, utilises the advantages of LARK spatial structures, which are not disturbed by weak target edges, and combines colour or grey information to improve the contrast between the target and the background. The GLMT algorithm compensates for the lack of spatial information and the background interference in CAMSHIFT, and tracks compact or non-compact targets with small posture changes in videos of different spectra. Additionally, a local structure statistical matching method was introduced that exploits the sensitivity of the LARK feature to changes in weak local structures. Based on this local statistical matching, the introduced algorithm, LLSMT, can track non-compact targets with different posture changes.

• A space-time local structure statistical matching model was introduced to recognise non-compact actions and to expand the scenes of the test videos. First, a new GLARK feature was formed by introducing Gaussian difference gradients into LARK descriptors to encode local structure information. Second, a multi-scaled composite template set was applied for actions of multiple sizes. Finally, the method was applied to action videos with the space-time local structure statistical matching method to mark the actions. Experimental results demonstrate that the approach outperforms previous approaches and achieves more robust and accurate results for human action detection. Furthermore, the SMSM model differs from traditional learning methods and matches the template with the test video more efficiently.

References

Achanta, R., Shaji, A., Smith, K., et al. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274.
Bae, T. W., Zhang, F., & Kweon, I. S. (2012). Edge directional 2D LMS filter for infrared small target detection. Infrared Physics & Technology, 55(1), 137–145.
Chong, X. G. (2014). Research on online video target tracking algorithm based on target scaling and blocking. Chongqing University.


Chun, B. X., & Shi, A. W. (2015). Camshift tracking method for significant histogram model. Optics and Precision Engineering, 6, 1749–1757. Devernay, F. (1995). A non-maxima suppression method for edge detection with subpixel accuracy. Technical Report RR-2724, Institut National de Recherche en Informatique et en Automatique. Drummond, O. E. (1993). Morphology-based algorithm for point target detection in infrared backgrounds. Proceedings of SPIE-The International Society for Optical Engineering, 1954, 2–11. Gu, Y., Wang, C., Liu, B. X., et al. (2010). A kernel-based nonparametric regression method for clutter removal in infrared small-target detection applications. IEEE Geoscience and Remote Sensing Letters, 7(3), 469–473. Guo, M. C. (2016). Research on target tracking algorithm for context information and color information fusion. Zhejiang Normal University. Hong, Z. Z., Jin, H. Z., Hui, Y., & Shi, L. H. (2006). Target tracking algorithm based on CamShift. Computer Engineering and Design, 27(11), 2012–2014. Kai, H. Z. (2012). Real-time compressive tracking. Computer Vision–ECCV, 30(7), 864–877. Karacan, L., Erdem, E., & Erdem, A. (2013). Structure-preserving image smoothing via region covariances. ACM Transactions on Graphics, 32(6), 1–11. Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. Journal of Neurophysiology, 16, 37–68. Lachiche, N., & Flach, P. (2003) Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves (pp. 416–423). ICML. Leibe, B., Leonardis, A., & Schiele, B. (2008). Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77, 259–289. Li, L., Li, H., Li, T., et al. (2014). Infrared small target detection in compressive domain. Electronics Letters, 50(7), 510–512. Liu, T., Yuan, Z., Sun, J., et al. (2010). Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 353–367. Luo, F., Han, J., Qi, W., et al. (2015). Robust object detection based on local similar structure statistical matching. Infrared Physics & Technology, 68, 75–83. Meng, R. L. (2015). Research on infrared target tracking algorithm based on mean shift. Shenyang University of Technology. Milanfar, P. (2009). Adaptive regression kernels for image/video restoration and recognition. Computational Optical Sensing and Imaging. Ming, B. W. (2016). A survey of moving target tracking methods. Computer and Digital Engineering, 11, 2164–2167, 2208. Muesera, P. R., Cowanb, N., & Kim, T. (1999). A generalised signal detection model to predict rational variation in base rate use. Cognition, 69(3), 267–312. Ran, W. (2012). Research on improved algorithm of motion prediction target tracking based on camshift algorithm. Shandong University. Rodieck, R. W., & Stone, J. (1965). Analysis of receptive fields of cat retinal ganglion cells. Journal of Neurophysiology, 28(5), 833–849. Rong, T. C., Yuan, H. W., Ming, J. W., & Qin, X. W. (2010) A survey of video target tracking algorithms. Television Technology, 12, 135–138, 142. Sanchez, J. S., Mollineda, R. A., & Sotoca, J. M. (2007). An analysis of how training data complexity affects the nearest neighbour classifiers. Pattern Analysis and Applications, 10(3), 189–201. Seo, H. J., & Milanfar, P. (2009a). Training-free generic object detection using locally adaptive regression kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1688–1704. Seo, H. 
J., & Milanfar, P. (2009b). Detection of human actions from a single example. In 2009 IEEE 12th International Conference on Computer Vision (pp. 1965–1970). IEEE. Seo, H. J., & Milanfar, P. (2011a). Face verification using the lark representation. IEEE Transactions on Information Forensics and Security, 6(4), 1275–1286. Seo, H. J., & Milanfar, P. (2011b). Action recognition from one example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 867–882.


Seo, H. J., & Milanfar, P. (2011c). Robust visual recognition with locally adaptive regression kernels. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Electrical Engineering. Shen, Y., & Fan, J. (2010). Multi-task multi-label multiple instance learning. Journal of Zhejiang University-Science C (Computers &Electronics), 11, 860–871. Takeda, H., Farsiu, S., & Milanfar, P. (2007). Kernel regression for image processing and reconstruction. IEEE Transactions on Image Process, 16(2), 349–366. THUMOS (2014). http://crcv.ucf.edu/THUMOS14/download.html. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. Conference on Computer Vision and Pattern Recognition. Wang, Y. (2012). Adaboost face detection algorithm based on locally adaptive kernels regression. Nanjing University of Aeronautics and Astronautics. Wang, X., Ning, C., & Xu, L. (2015). Saliency detection using mutual consistency-guided spatial cues combination. Infrared Physics & Technology, 72, 106–116. Xing, F. G., Yao, B. M., & Qiu, J. L. (2012). Summarization of visual target tracking algorithm based on mean shift. Computer Science, 12, 16–24. Yang, L., Yang, J., & Yang, K. (2004). Adaptive detection for infrared small target under sea-sky complex background. Electronics Letters, 40(17), 1083–1085. Zhang, J., Fang, J., & Lu, J. (2011). Mean-Shift algorithm integrating with SURF for tracking. In Seventh International Conference on Natural Computation (pp. 960–963). IEEE. Zhang, B., Tian, W., & Jin, Z. (2006). Joint tracking algorithm using particle filter and mean shift with target model updating. Chinese Optics Letters, 4(10), 569–572. Zhang, K., Zhang, L., Yang, M. H., et al. (2013). Fast tracking via spatio-temporal context learning. Computer Science.

Chapter 8

Colourization of Low-Light-Level Images Based on Rule Mining

The colourization of grey images has long been a hotspot of night-vision research. Low-light-level and infrared images are the result of optoelectronic device imaging, and they are usually grayscale. The human eye has a high resolution and sensitivity for colourful images, so the colorization of night-vision images can enhance people's awareness of targets and scene information, which is significant in both military and civil fields (see Zhen 2011). Nowadays, colour spreading based on manual brushwork and colour transferring based on a reference colourful image are the two main methods of realising grayscale image colorization. This chapter first elaborates on a colorization algorithm based on the association rules in multidimensional data (CARM). We add association rule mining and image classification learning technology into the image colorization process, greatly improving the effectiveness of image colorization and realising a real-time colorization process. Simultaneously, we elaborate on a conceptual model framework for colorizing grayscale images: a multi-sparse dictionary colorization algorithm based on feature classification and detail enhancement (CEMDC). The algorithm can achieve a natural colorization of the grayscale image, consistent with human vision.

8.1 Research on Colorization of Low-Light-Level Images

Nowadays, colour spreading, based on manual brushwork, and colour transferring, based on a reference colourful image, are the two main methods of realising grayscale image colorization. Based on the assumption that 'if the brightness of spatially adjacent pixels is similar, then the colour of the pixels would be similar', Levin et al. (2004) proposed a method to spread colour based on manual brushwork. This is the most representative method of this kind and can obtain satisfactory results. Welsh et al. (2002) proposed a method to transfer colour based on a reference colourful image, matching the pixels in the reference colourful image with those in the target grayscale image via brightness and texture information, and then transferring the colour to the most similar pixel.


Colour spreading based on manual brushwork requires strict manual work from users; generally, users must operate repeatedly several times before obtaining ideal colorization results. In the colour transferring method, the calculation is complicated and the time cost is high, making it very difficult to obtain ideal colorization results. Until recently, association rules had not been applied to the field of image colorization. However, sparse representation theory is widely used in image denoising, image restoration, image classification, face recognition and other fields, and research on colorization based on sparse representation theory has also made some achievements: Kai et al. (2016) and Di and Uruma et al. (2013) succeeded in applying sparse representation theory to image colorization, proving through experiments that this kind of colorization can achieve more accurate and realistic effects than traditional algorithms. Currently, colorization based on sparse representation is divided into two groups: manual intervention (Pang et al. 2013; Mohammad and Balinsky 2009) and non-manual intervention. The latter is further divided into two types: colorization based on a single dictionary and colorization based on multiple dictionaries. The traditional colorization algorithm based on a single dictionary works well only for single-tone images, and many falsely coloured pixels appear for colourful images. To solve this problem, an improved sparse optimisation algorithm was proposed in Uruma et al. (2013) and achieved good results. Furthermore, Hai (2016) proposed a colorization algorithm based on a classification dictionary and sparse representation, consisting of two parts: training the classification dictionary and dictionary matching based on reconstruction error minimisation. A colorization algorithm based on a joint dictionary and sparse representation was also proposed in Hai (2016). These two algorithms achieve the colorization of multicontent grayscale images, and the improved algorithm works well. However, neither of these algorithms solves the problem of missing detail and blurring caused by colorization via a sparse dictionary.

8.2 CARM

8.2.1 Summary of the Principle of the Algorithm

Association rule mining (see Zhang and Guan 2005; Tian 2008) is used to discover interesting links between items in shopping basket transaction data. The most classical application is in a supermarket, to determine the placement of goods according to the association rules in the customer purchase data. Association rules have become an important research direction of data mining. Their purpose is to discover the relationships hidden in the data, the core of which is finding the attributes of frequent items. Association rules have been applied to medical image classification, and the effect is obvious.


A colorization algorithm for low-light-level images based on the association rules (see Zhang and Guan 2005; Tian 2008) in multidimensional data (i.e. CARM) is elaborated in this section through the theory of data mining (Shen 2013). By combining the Apriori algorithm (Fu 2013; Guo 2004) with classification learning technology (Wang 2001), CARM automatically obtains a colorization process with a high degree of colour revivification that runs in real time and can easily be implemented in hardware. By examining the brightness, y, together with the colour components, r, g and b, of each pixel in a colourful image, it can be found that when the brightness y is equal to a certain value, the corresponding colour components r, g and b are frequently equal to certain respective values. We focus on how to discover this association from the data and how to apply it to the image colorization process. Specifically, two problems are solved in this section: how to mine appropriate strong association rules (Zhang 2015; Wu 2008) from the reference colourful image and how to obtain the colorization result for the low-light-level image. The principle of the CARM algorithm elaborated in this section is illustrated in Fig. 8.1.

Fig. 8.1 The principle of the CARM algorithm elaborated in this section


8.2.2 Mining of Multi-attribute Association Rules in Grayscale Images

To realise the mining of the association rules, the brightness, y, and colour components, r, g and b, of every pixel in the reference colourful image are used to obtain the 'brightness–colour' strong association rules set via the Apriori algorithm. This algorithm can only obtain satisfactory colorization results for simple grayscale images in which different objects have different brightnesses. However, when the brightness distribution of the grayscale image is complex and the brightness of objects is similar or even the same, colorization errors appear. To solve this problem, we add the element 'category label' to the strong association rules set to obtain the 'brightness–category–colour' strong association rules set. The specific process is described below.

We distinguish different objects in the reference colourful image and produce training samples of different categories (BLOCK) by segmentation. We use the classification function, libsvm, to obtain an SVM classifier (model) from the feature set {HOG, mean, variance, homogeneity, entropy} of the reference colourful image and the corresponding category labels ('labeln'). We select the most appropriate image feature set, T, as the training data (traindata), as shown in Eqs. (8.1) and (8.2), and we obtain the SVM classifier (model) by observing the results of classification, as shown in Eq. (8.3) (a small sketch of this training step is given below):

trainlabel = [label1, label2, ..., labeln].  (8.1)

traindata = [T(BLOCK1), T(BLOCK2), ..., T(BLOCKn)].  (8.2)

model = svmtrain(trainlabel, traindata, '-s 0 -t 2').  (8.3)
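A small stand-in for Eqs. (8.1)–(8.3). The original uses libsvm's svmtrain, where '-s 0 -t 2' selects a C-SVC with an RBF kernel; scikit-learn's SVC with an RBF kernel is used here as an equivalent, and `feature_fn` is a hypothetical extractor returning the feature set T(BLOCK).

```python
import numpy as np
from sklearn.svm import SVC

def train_block_classifier(blocks, labels, feature_fn):
    """Train the block classifier of Eqs. (8.1)-(8.3) with an RBF-kernel C-SVC."""
    traindata = np.vstack([feature_fn(b) for b in blocks])   # Eq. (8.2)
    trainlabel = np.asarray(labels)                          # Eq. (8.1)
    model = SVC(kernel='rbf')                                # Eq. (8.3), '-s 0 -t 2'
    model.fit(traindata, trainlabel)
    return model
```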

• The corresponding transaction database is built by extracting the brightness, y, and colour components, r, g and b, of every pixel in the different reference colourful images.
• The optimised Apriori algorithm, based on the multi-attribute rules constraint, is elaborated to obtain the uniqueness of colorization. We determine the parameters and discover the frequent 4-item set. The support (sup) (Rakesh et al. 1993) parameter, s, and the confidence (con) (Wu 2016) parameter, c, are determined, and the frequent 4-item set is generated based on Eqs. (8.4) and (8.5):

Support(A ⇒ B) = P(A ∪ B),  (8.4)

Support(A ⇒ B) ≥ s.  (8.5)

Obtain all association rules from the frequent 4-item set and remove the association rules whose confidence levels are less than the user-specified threshold, c, retaining the strong association rules set to be further constrained, based on Eqs. (8.6) and (8.7).


Fig. 8.2 Mining of multi-attribute association rules in grayscale image

Then, obtain the original 'brightness–colour' strong association rules set that meets the requirements [Ori(strong rules set)], based on constraining the strong association rules set through the number of intervals and the interval distribution, as in Eq. (8.8):

Confidence(A ⇒ B) = P(B|A),  (8.6)

Confidence(A ⇒ B) ≥ c,  (8.7)

A ∈ [a1, a2], B ∈ [b1, b2].  (8.8)

We mine the category property rules and add the element ‘category label’ to the strong association rules set to obtain ‘brightness–category–colour’ strong association rules sets corresponding to different objects. Finally, we gather the strong association rules sets to obtain a complete, final strong association rules set for mapping. The whole process is illustrated in Fig. 8.2.
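A minimal sketch of the support/confidence filtering of Eqs. (8.4)–(8.7) over pixel transactions from the reference image; the interval discretisation, the number of bins and the thresholds are illustrative assumptions, and keeping one strong rule per brightness interval is a simplification of the interval constraints of Eq. (8.8).

```python
from collections import Counter

def mine_brightness_colour_rules(pixels, s=0.01, c=0.6, n_bins=50):
    """pixels: iterable of (y, r, g, b) tuples taken from the reference colourful image."""
    def bin_(v):                          # discretise a 0-255 value into an interval index
        return int(v * n_bins // 256)

    total = 0
    lhs_count = Counter()                 # occurrences of each brightness interval
    rule_count = Counter()                # occurrences of (brightness, r, g, b) intervals
    for y, r, g, b in pixels:
        total += 1
        lhs = bin_(y)
        lhs_count[lhs] += 1
        rule_count[(lhs, bin_(r), bin_(g), bin_(b))] += 1

    rules = {}
    for (lhs, rb, gb, bb), n in rule_count.items():
        support = n / total                               # Eq. (8.4)
        confidence = n / lhs_count[lhs]                   # Eq. (8.6)
        if support >= s and confidence >= c:              # Eqs. (8.5) and (8.7)
            rules[lhs] = (rb, gb, bb)                     # keep one strong rule per brightness interval
    return rules
```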

8.2.3 Colorization of Grayscale Images Based on Rule Mapping

To obtain the grayscale image colorization result, a moving window of a certain size (16 × 16) is selected in the grayscale image. Then, the image blocks in the moving windows are classified by the SVM classifier obtained above, and the category labels are assigned to the top-left corner pixels.


The 'brightness–category' set for mapping can be obtained from the brightness and category label of each pixel in the grayscale image. The 'brightness–category' set is then mapped, one by one, against the final strong association rules obtained above and is assigned the corresponding colour components, r, g and b. Finally, the colorization result is obtained. The specific process is described below (a small mapping sketch follows the equations). According to Eqs. (8.9) and (8.10), we classify the grayscale image with the SVM classifier to obtain the category label matrix, F. We extract the brightness of each pixel (y) in the grayscale image and combine it with the category label matrix, F, to obtain the 'brightness–category' set (D) for mapping. The 'brightness–category' set (D) is mapped one by one against the final strong association rules, and the colorization result is obtained by assigning the corresponding colour components, r, g and b, as shown in Eq. (8.11):

Original + model → F (classify),  (8.9)

[y1, y2, ..., yz]^T + F = D: [y1 L1; y2 L2; ...; yz Lz],  (8.10)

D: [y1 L1; y2 L2; ...; yz Lz] + strong rules set → result (search and map).  (8.11)
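A minimal sketch of the rule-mapping step of Eqs. (8.9)–(8.11), assuming the mined strong rules have been arranged as a lookup from (brightness interval, category label) to a representative (r, g, b) colour; the bin count and the grey fallback for unmatched pixels are illustrative choices.

```python
import numpy as np

def colorize(grey, labels, rules, n_bins=50):
    """grey: 2D grayscale image; labels: per-pixel category label matrix F;
    rules: dict mapping (brightness_bin, label) -> (r, g, b)."""
    h, w = grey.shape
    out = np.zeros((h, w, 3), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            key = (int(grey[i, j]) * n_bins // 256, int(labels[i, j]))
            if key in rules:
                out[i, j] = rules[key]          # assign the rule's colour components
            else:
                out[i, j] = grey[i, j]          # fall back to grey if no rule matches
    return out
```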

8.2.4 Analysis and Comparison of Experimental Results

The 'brightness–colour' model and the 'brightness–category–colour' model are elaborated in this section. The colorization effectiveness of the two models on grayscale images is compared, and the 'brightness–category–colour' model is examined in more experiments. This section also addresses practical application scenes and compares the CARM algorithm to the Welsh algorithm (He 2014) through their colorization effectiveness on low-light-level images. The reference colourful images of CARM and Welsh are adaptable to similar scenes; thus, the reference colourful images corresponding to the two algorithms for the following groups of experiments are selected first, except where specially noted, as shown in Fig. 8.3.

'Brightness–Colour' Model

The 'brightness–colour' model is applied to grayscale image colorization. It can be seen from the experimental results that, when the difference between the brightnesses of different objects is obvious, the result of image colorization is good. However, when the brightness difference is small, or when the brightness is the same, the image colorization becomes worse. This results from the left side of the 'brightness–colour' strong association rules, which has only one element: brightness.


Fig. 8.3 Reference colourful images applied in the experiments in this section; a CARM; b Welsh

The right side has three elements: the colour components r, g and b. Under the constraints of the sup and con parameters, when the brightness is equal to a certain value, the corresponding colour could be totally different. Colorization errors are thus easily caused, resulting in the colour of one object being assigned to another and reducing the effectiveness of image colorization (Figs. 8.4 and 8.5).

'Brightness–Category–Colour' Model

As noted above, colorization errors cause the colour of one object to be assigned to another, reducing the effectiveness of image colorization. To improve the effectiveness of image colorization, the number of constraints must be increased so that the elements on the left side of the final strong association rules are unique. To distinguish one object from another, the idea of classification from Ni et al. (2013) can be used. For the selection of image features for classification, this section takes the following approach after many experiments: HOG (Xu 2016), mean, homogeneity and entropy (Liu 2009) are applied to the colorization of high-SNR grayscale images taken by CMOS, and the mean is applied to the colorization of low-SNR grayscale images taken by EMCCD. It can be seen from the experimental results that the effectiveness of the image colorization is obviously improved after adding the element 'category' to the left side of the strong association rules; see Fig. 8.6c. However, two problems can also be found in Fig. 8.6. The first is that the colorization effectiveness at the boundary is not very good. This results from there being only three categories (tree, ground and road), making it difficult to correctly classify image blocks at the boundary. However, considering the application of this colorization algorithm, we only need to distinguish the different objects in low-light-level images clearly through colorization, so the colour of the boundary has little influence. The second is that the grey colour of the road differs from the real colour. This results from the number of colours being greatly decreased: the brightness and colour components are divided into intervals in CARM to reduce the time cost of mining the association rules, which leads to low colour revivification and discontinuity.


Fig. 8.4 Colorization effectiveness of ‘brightness–colour’ model on the grayscale image: a target grayscale image; b reference colourful image; c target colourful image; d result of image colorization

Fig. 8.5 Colorization effectiveness of ‘brightness–colour’ model on grayscale images: a target grayscale image; b target colourful image; c result of image colorization


Fig. 8.6 Colorization effectiveness of ‘brightness–category–colour’ model on grayscale image: a target grayscale image; b target colourful image; c result of image colorization

Relationship Between the Number of Intervals and the Discontinuity of Colour

Figure 8.7c shows the effectiveness of image colorization after dividing the brightness, y, and colour components, r, g and b, into 20 intervals, and Fig. 8.7d the effectiveness after dividing them into 50 intervals. It can be seen clearly from Fig. 8.7d that, after dividing the brightness and colour components into 50 intervals, the influence of colorization errors and failures is small in a few local areas. The strip-like colour distortion on the white dog has become more continuous and uniform, and the colours of the white dog and the green grass are closer to the true colours of nature. As the number of intervals of the brightness and colour components increases, the image colorization can become close to continuous, uniform and true.

Multiple Scene Application of Algorithms and Comparisons

In this section, two grayscale images corresponding to colourful images from two different scenes taken by CMOS are applied in the first group of experiments. The original colourful images and the colorization results are compared to illustrate the strength of the CARM algorithm elaborated in this section on grayscale images from various scenes. From the colorization results of the two grayscale images taken at two different angles from two different scenes in Figs. 8.8 and 8.9, we find that the CARM algorithm obtains good colorization effectiveness for trees, ground and road. After dividing the data into different numbers of intervals, when the number is high enough, CARM can obtain colour that is nearly continuous, uniform and true. Three low-light-level images with different levels of illumination taken by EMCCD are applied in the second group of experiments, and the colorization effectiveness compared to the Welsh algorithm shows the colorization capability of the CARM algorithm on LLL images.

Comparison of Effectiveness

It can be seen from the colorization results on the three low-light-level images with differing levels of illumination in Fig. 8.10 that the colorization capability of the CARM algorithm is obviously stronger than that of the Welsh algorithm.


Fig. 8.7 Different effectiveness of colorization by dividing the brightness and colour components, r, g and b, into different intervals: a target colourful image; b reference colourful image; c effectiveness of 20 intervals; d effectiveness of 50 intervals

that the colorization capability of the CARM algorithm is obviously stronger than that of the Welsh algorithm. The CARM algorithm can clearly distinguish the ground and the trees in Fig. 8.10a, and it obtains a good visual effect by assigning bright green to the trees and grey-yellow to the ground. The distinction between the trees and the ground is not obvious in the Welsh algorithm, and the overall visual effect is worse compared with CARM. This results from the Welsh algorithm relying on local statistical distributions to realise image colorization: image noise is common in low-light-level images, affecting the calculation of the local statistical distribution and the subsequent matching. CARM, in contrast, is based on image classification: the first step of CARM separates the different objects by the mean feature and then maps the corresponding colours, so the colorization effectiveness is much better.

Comparison of Time Cost

The time cost and time complexity of the two algorithms are compared. In the CARM algorithm, the image classifier and the strong association rule set have already been obtained; therefore, the running time is mainly spent on classifying each pixel and mapping the association rules. For an M × N low-light-level image, the time complexity of the CARM algorithm is O(M × N). Assuming


Fig. 8.8 Effectiveness 1 of colorization based on the CARM algorithm on the grayscale images from different scenes: a target grayscale images; b corresponding target colourful images; c corresponding results of the image colorization

Fig. 8.9 Effectiveness 2 of colorization based on the CARM algorithm on the grayscale images from different scenes: a target grayscale images; b corresponding target colourful images; c corresponding results of the image colorization


Fig. 8.10 Effectiveness of the CARM and Welsh algorithms on the three low-light-level images with different levels of illumination: a target low-light-level images with illumination; b corresponding effectiveness of colorization based on Welsh; c corresponding effectiveness of colorization based on the CARM algorithm

the size of the reference colourful image is m × n, the time complexity of the Welsh algorithm is O(m × n × M × N + M × N). Obviously, the time complexity of CARM is lower than that of Welsh. The experimental results show that the CARM algorithm can run in real time.

8.3 Multi-sparse Dictionary Colorization Algorithm Based on Feature Classification and Detail Enhancement

This section discusses an algorithm based on feature classification, a multi-sparse dictionary and detail enhancement to improve the effectiveness of colorizing grayscale images and to enhance the image details. Experiments show that this algorithm can achieve a natural colorized effect, reduce the probability of false colorization and solve the problem of missing details. The colorization result of our algorithm is consistent


with human vision. The result is superior to the traditional colorization algorithms based on a single dictionary or a multi-sparse dictionary, according to Liang (2016). Additionally, without manual intervention, our algorithm can achieve colorization effects comparable to those obtained with manual intervention. First, the algorithm establishes a multi-sparse dictionary classification colorization model. Then, a corresponding local constraint algorithm is elaborated to improve the classification accuracy rate. Finally, a detail enhancement based on the Laplacian pyramid for the colorized image is elaborated, which is effective in solving the problem of missing details and improving the speed of image colorization.

8.3.1 Colorization Based on a Single Dictionary

Colorization based on a single sparse dictionary is elaborated by Kai et al. (2011). First, the algorithm trains a joint dictionary containing the brightness, feature and colour information of the reference image. Then, the corresponding sparse coefficient, S', of the target grayscale image is obtained using the brightness and feature parts of the joint dictionary. Finally, the colour information of the target grayscale image is reconstructed from the sparse coefficient, S', and the colour part, D_c, of the joint dictionary. Let D = [D_gf, D_c]^T denote the joint dictionary, D_gf = [D_gray, D_f]^T denote the brightness and feature dictionary, D_c = [D_r, D_g, D_b]^T denote the colour dictionary, gf = [gray, f]^T denote the brightness and feature information and c = [r, g, b]^T denote the colour information. Assuming the original colour image of the target grayscale image is [gf, c]^T, it can be expressed under [D_gf, D_c]^T as

[gf, c]^T = [D_gf, D_c]^T × S, (8.12)

where S represents the sparse coefficient of the original colour image. Then the brightness and feature information, gf, of the target grayscale image can be expressed as

gf = D_gf × S', (8.13)

where S' ≈ S. Therefore, the colour information of the target grayscale image can be expressed as

c = D_c × S'. (8.14)
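As a rough illustration of Eqs. (8.12)–(8.14), the sketch below reconstructs the colour of one patch from a pre-trained joint dictionary. It uses scikit-learn's OrthogonalMatchingPursuit as a stand-in for the OMP solver mentioned later in the chapter, and the variable names and shapes (dictionary atoms stored in columns) are assumptions, not the authors' code.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def colourise_patch(gf_patch, D_gf, D_c, n_nonzero=8):
    # gf_patch : brightness + feature vector of the target grayscale patch
    # D_gf     : brightness/feature part of the joint dictionary, shape (len(gf_patch), n_atoms)
    # D_c      : colour part of the joint dictionary, shape (colour_dim, n_atoms)
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(D_gf, gf_patch)      # solve gf = D_gf * S'   (Eq. 8.13)
    s = omp.coef_                # sparse coefficient S'
    return D_c @ s               # c = D_c * S'           (Eq. 8.14)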

We know that colorization based on a single dictionary is ineffective for multi-content images. There are two main reasons. First, during dictionary training, the dictionary retains the main features and ignores the secondary features of the reference image. Second, the information that a single dictionary contains is limited. Liang (2016) proposed a multi-dictionary-based colorization method. However, this


method results in missing and blurred colour details. For a large image, the colorization efficiency of this method is poor. Currently, there is almost no solution to the problem of missing details and blurred edges brought by sparse representation.

8.3.2 Multi-sparse Dictionary Colorization Algorithm for Night-Vision Images, Based on Feature Classification and Detail Enhancement

For the problems articulated above, we first elaborate a novel conceptual colorization model framework for the grayscale image. Then, according to this framework, a multi-sparse dictionary colorization algorithm, based on feature classification and detail enhancement, is elaborated to solve the problems of single-dictionary colorization and of missing details and blurred edges.

A Conceptual Colorization Model Framework

A grayscale image only has brightness and feature information. Therefore, colorization can be described as follows. First, we establish the relationship among the brightness, feature and colour of the reference image. This relationship is called ‘rules’. With these rules, the colorization of the target grayscale image can be realised. According to the above theory, this section presents a conceptual colorization model framework. The framework structure is shown in Fig. 8.11. Take a grayscale image that includes a white dog and grass as an example. To colorize this grayscale image, we should select a series of reference images with white dogs and grassland. Then, we should divide them into blocks containing white dogs, grassland, etc. Finally, we should extract the brightness and feature information from each image block and combine it with the colour information to obtain a one-to-one rule over the three kinds of information:

rule = [rule1, rule2, rule3, . . . , rulen]^T. (8.15)

Let rule denote the rule set, where rule1, rule2, rule3, . . . , rulen denote the rules corresponding to the reference images. Take the image blocks above as the sample set, label each image block, use the image features as the classification standard and train the classification model with an appropriate classification algorithm. Furthermore, divide the grey image into blocks and calculate the category of each grayscale image block using the classification model, as shown in Fig. 8.12:

category = [category1, category2, category3, . . . , categoryn]^T. (8.16)
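A minimal sketch of the classification step in Eq. (8.16) follows. It substitutes scikit-learn's SVC for the libSVM classifier named later in the chapter and uses placeholder block features (the chapter uses features such as mean, entropy, LBP or HOG), so every name here is illustrative.

import numpy as np
from sklearn.svm import SVC

def block_features(block):
    # Placeholder features; the chapter uses, e.g., mean, entropy, LBP or HOG.
    return [float(block.mean()), float(block.std())]

def train_block_classifier(reference_blocks, labels):
    # Train the block classifier that produces the categories of Eq. (8.16);
    # SVC stands in for the libSVM classifier used in the chapter.
    feats = np.array([block_features(b) for b in reference_blocks])
    clf = SVC(kernel='rbf')
    clf.fit(feats, labels)
    return clf

def classify_blocks(clf, gray_blocks):
    # Predict the category of each target grayscale block (Eq. 8.16).
    feats = np.array([block_features(b) for b in gray_blocks])
    return clf.predict(feats)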


Fig. 8.11 Structure of the conceptual colorization model framework. Reprinted from Yan et al. (2018), with permission of Springer

Fig. 8.12 The block processing and classification of the target grayscale image. Reprinted from Yan et al. (2018), with permission of Springer

The grayscale image blocks can be reconstructed using the corresponding rules. Defining the colour reconstruction operator as ·, the rule corresponding to the reference image as X_rule, the category of the target grayscale image as Y_category and the reconstructed colour as F_color, we have

F_color = X_rule · Y_category. (8.17)


Sometimes, a colorization algorithm brings the problems of missing details and blurred edges, as is the case for sparse-representation-based algorithms. Thus, it is necessary to enhance the details of the colorized image. Defining the detail enhancement operator as ⊕, the enhanced colorization result can be expressed as

F'_color = F_color ⊕ Z_grayscale, (8.18)

where Z_grayscale denotes the target grayscale image. The complete framework can be summarised as follows.

Input: the target grayscale image, the reference colour images
Output: the colorized image
• Select appropriate reference images according to the content of the target grayscale image and divide the reference images into blocks according to their different contents to obtain different image blocks (categories): block = [block1, block2, block3, . . . , blockn]^T.
• Use the image blocks of the reference images to train the ‘brightness–feature–colour’ rules and label each rule, as shown in Eq. (8.15).
• Divide the grayscale image into blocks and calculate the category of each grayscale image block using the classification model, as shown in Eq. (8.16).
• Use the classification result and the reconstruction algorithm corresponding to the rule to obtain the colorized image, as shown in Eq. (8.17).
• Use the detail enhancement algorithm to enhance the detail of the colorized image, as shown in Eq. (8.18).
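The following skeleton strings the framework steps of Eqs. (8.15)–(8.18) together. The block size, the placeholder features and the caller-supplied classifier, rules and enhance objects are assumptions made only for illustration; the framework itself does not fix their form.

import numpy as np

def colourise_framework(gray_img, classifier, rules, enhance, block=8):
    # Skeleton of the framework: block division -> classification (Eq. 8.16)
    # -> per-category colour reconstruction (Eq. 8.17) -> detail enhancement (Eq. 8.18).
    # `rules` maps a category to a function that colours one grayscale block.
    h, w = gray_img.shape
    colour = np.zeros((h, w, 3), dtype=np.float32)
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            patch = gray_img[i:i + block, j:j + block]
            feats = [float(patch.mean()), float(patch.std())]          # placeholder features
            category = classifier.predict([feats])[0]                  # Eq. (8.16)
            colour[i:i + block, j:j + block] = rules[category](patch)  # Eq. (8.17)
    return enhance(colour, gray_img)                                   # Eq. (8.18)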

Multi-Sparse Dictionary Colorization Algorithm Based on Feature Classification and Detail Enhancement

Based on the colorization model framework above, this section elaborates the multi-sparse dictionary colorization algorithm CEMDC. The algorithm consists of three parts:
• the multi-sparse dictionary classification colorization algorithm (CMDC),
• classification optimisation based on local constraint (LCC) and
• detail enhancement based on the Laplacian pyramid (LPDE).
The specific process is shown in Fig. 8.13.

Multi-sparse Dictionary Classification Colorization Algorithm

First, as shown in Fig. 8.14, we divide the selected reference image into blocks and train the joint sparse dictionary set, classified by the reference image blocks. The joint sparse dictionary set constitutes the rules of the model above, D = [dictionary1, dictionary2, dictionary3, . . . , dictionaryn]^T = [D_gray, D_f, D_c]^T, where gray represents the grey part of the joint dictionary, f represents the feature part of the joint dictionary and c represents the colour part of the joint dictionary. Thus,


Fig. 8.13 Process of CEMDC. Reprinted from Yan et al. (2018), with permission of Springer

Fig. 8.14 Joint dictionary training. Reprinted from Yan et al. (2018), with permission of Springer

D =
⎧ D_gray = [dictionary1_gray, dictionary2_gray, dictionary3_gray, . . . , dictionaryn_gray]
⎨ D_f = [dictionary1_f, dictionary2_f, dictionary3_f, . . . , dictionaryn_f]
⎩ D_c = [dictionary1_c, dictionary2_c, dictionary3_c, . . . , dictionaryn_c]. (8.19)

We assume the target grayscale image is divided into m blocks, block_gray = [block1_gray, block2_gray, block3_gray, · · · , blockm_gray]. According to the characteristics of each image block, the category of each block can be obtained using the libSVM classifier. We define the classification results of the target grayscale image blocks as category = [category1, category2, category3, · · · , categoryn]. We define the target grayscale image blocks belonging to the ith class as Block_gray[index(i)], where index(i) denotes the columns occupied by the ith class of blocks and the remaining columns are set to 0. The


corresponding joint dictionary, D(i), and the sparse coefficient, S'(i), can be obtained from Eq. (8.20):

Block_gray[index(i)] = D_gray(i) × S'(i). (8.20)

Thus, the colorization of the target grayscale image blocks belonging to the ith class can be expressed as

Block_c[index(i)] = D_c(i) × S'(i). (8.21)

Then, the colorization of the entire grayscale image can be expressed as follows:

Block_c = Σ_{i=1}^{n} D_c(i) × S'(i). (8.22)
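A compact sketch of the CMDC step in Eqs. (8.20)–(8.22) is given below. It assumes per-class joint dictionaries with atoms in their columns and again uses scikit-learn's OMP in place of the improved OMP used in the chapter, so it should be read as an outline under those assumptions rather than the reference implementation.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def cmdc_colourise(gray_blocks, features, classifier, dictionaries, n_nonzero=8):
    # gray_blocks  : (m, d) matrix, one flattened grayscale block per row
    # features     : (m, f) feature matrix used for classification
    # classifier   : trained classifier with predict() (the chapter uses libSVM)
    # dictionaries : dict class_id -> (D_gray, D_c), per-class joint dictionary parts
    labels = classifier.predict(features)                 # category of each block
    colour_dim = next(iter(dictionaries.values()))[1].shape[0]
    colour_blocks = np.zeros((gray_blocks.shape[0], colour_dim))
    for cls, (D_gray, D_c) in dictionaries.items():
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
        for k in np.where(labels == cls)[0]:              # Block_gray[index(i)]
            omp.fit(D_gray, gray_blocks[k])               # Eq. (8.20)
            colour_blocks[k] = D_c @ omp.coef_            # Eq. (8.21)
    return colour_blocks                                  # union over all classes, cf. Eq. (8.22)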

Classification Optimisation Based on Local Constraint

In the process of classifying the grayscale image blocks, blocks with similar texture may occasionally be misclassified, leading to colorization errors. Therefore, we propose a classification optimisation algorithm based on local constraint (LCC). First, we construct a 2D matrix whose elements are the classification results of the corresponding grayscale image blocks, as shown in Fig. 8.15. Within a local area, the characteristics of the image are consistent; therefore, in the image domain, the classification results should also be consistent. The numbers in the figure represent categories. In the upper left corner of the figure, the red mark is 3 while the surrounding classification is 1, so it is clear that the classification at the red mark is incorrect and needs to be corrected. Let the selected region be Ω, the label value of the centre of the region Ω be F, the area size be N × N and the label value matrix of the area Ω be M. The value of F can then be corrected by

Fig. 8.15 Construct the two-dimensional matrix. Reprinted from Yan et al. (2018), with permission of Springer


Fig. 8.16 Result of LCC. Reprinted from Yan et al. (2018), with permission of Springer

E(i, j) = ⎧ 1, M(i, j) = t
          ⎩ 0, M(i, j) ≠ t,   0 ≤ i, j ≤ N, t ∈ M,

g(t) = Σ_{i=1}^{N} Σ_{j=1}^{N} E(i, j), (8.23)

F = arg max_{t ∈ M} g(t). (8.24)

Here, the range of t is all the label values in M, from minimum to maximum, and g(t) counts how many elements of M equal t. From Eqs. (8.23) and (8.24), the element at the red mark 3 in the matrix should be changed to 1, as shown in Fig. 8.16.
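In code, the correction of Eqs. (8.23) and (8.24) amounts to a majority vote over each block's N × N neighbourhood. The sketch below assumes a 2D integer label map and handles the image border by simple clipping; both choices are illustrative.

import numpy as np

def lcc_correct(label_map, n=3):
    # Replace each block label with the most frequent label in its n x n
    # neighbourhood, i.e. F = argmax_t g(t) of Eqs. (8.23)-(8.24).
    h, w = label_map.shape
    corrected = label_map.copy()
    r = n // 2
    for i in range(h):
        for j in range(w):
            window = label_map[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            values, counts = np.unique(window, return_counts=True)
            corrected[i, j] = values[np.argmax(counts)]
    return corrected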

Detail Enhancement Based on Laplacian Pyramid

The LCC can solve the problem of false colorization. However, it cannot solve the problem of missing details and blurred edges. Thus, we present a detail enhancement algorithm based on the Laplacian pyramid (LPDE). Laplacian pyramid images are a series of difference images acquired from Gaussian pyramid images; they represent the high-frequency detail information lost during the Gaussian pyramid process. Using I to denote the original image, G_0 the bottom image of the Gaussian pyramid and G_1 the image obtained from G_0 by Gaussian convolution and downsampling, the Gaussian pyramid can be expressed as in Eq. (8.25):

G_0 = I, G_{i+1} = downsample(G_i). (8.25)

Thus, the Laplacian pyramid images can be expressed as

L_i = G_i − upsample(G_{i+1}). (8.26)


L_{i_gray} represents the Laplacian pyramid images obtained by decomposing the target grayscale image via Laplacian pyramid decomposition. Using I to denote the colorized image, it is separated into three channels (R, G and B) as I_R, I_G and I_B. If the colorized image is at the top layer of the Laplacian pyramid, we can obtain the enhanced colorized image as follows:

I_{i+1_j} = I_j + L_{i+1_gray},
I_{i_j} = L_{i_gray} + upsample(I_{i+1_j}), i ≥ 0, j = R, G, B. (8.27)

If the colorized image is at the same level as the target grayscale image, i.e. the colorized image is the bottom image of the Laplacian pyramid, then the enhanced colorized image can be obtained by the following formula:

I_{0_j} = L_{0_gray} + I_j, j = R, G, B. (8.28)
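A minimal sketch of LPDE for the common case of Eq. (8.28), where the colorized image sits at the bottom level, is shown below. It uses OpenCV's pyrDown/pyrUp for the Gaussian pyramid and assumes the colorized and grayscale images have the same size; the function name and the number of pyramid levels are assumptions for illustration.

import cv2
import numpy as np

def lpde_enhance(colourised, gray, levels=3):
    # Build the Gaussian pyramid of the grayscale image (Eq. 8.25), take the
    # Laplacian differences (Eq. 8.26) and add the bottom level to each colour
    # channel of the colorized image (Eq. 8.28).
    g = [gray.astype(np.float32)]
    for _ in range(levels):
        g.append(cv2.pyrDown(g[-1]))
    lap = [g[i] - cv2.pyrUp(g[i + 1], dstsize=(g[i].shape[1], g[i].shape[0]))
           for i in range(levels)]
    out = colourised.astype(np.float32)
    for c in range(3):                       # only L0_gray is needed for Eq. (8.28)
        out[..., c] += lap[0]
    return np.clip(out, 0, 255).astype(np.uint8)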

8.3.3 Experiment and Analysis

We trained the dictionary with K-SVD; the size of the dictionary is 64 × 256. We used the improved OMP algorithm by Elad to obtain the sparse coefficients used to reconstruct the colour of the grayscale image blocks, and we used libSVM to classify the blocks. The features we chose are the mean, entropy and LBP of the image. The size of the target grayscale image block is 8 × 8. Currently, the methods of evaluating colorization are divided into two types: subjective and objective evaluation. Subjective evaluation is primary, whereas objective evaluation is secondary. Researchers usually choose the mean square error (MSE), the colour peak signal-to-noise ratio (CPSNR) and the SSIM as the objective criteria for colorization evaluation. The first two are defined as follows:

MSE = ( Σ_{c∈{r,g,b}} Σ_{0≤i≤M} Σ_{0≤j≤N} (c_{ij} − c'_{ij})² ) / (3 × M × N), (8.29)

CPSNR = 10 × log_10 (255 × 255 / MSE), (8.30)

where M × N represents the size of the original colour image, c_{ij} represents the pixel values of the three channels of the colorized image obtained with the colorization algorithm and c'_{ij} represents the pixel values of the corresponding channels of the original colour image.
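For reference, the two objective measures of Eqs. (8.29) and (8.30) can be computed directly as below; the function name and the uint8/[0, 255] value range are assumptions.

import numpy as np

def mse_cpsnr(colourised, original):
    # Objective evaluation of a colorization result (Eqs. 8.29-8.30);
    # both images are H x W x 3 arrays in the range [0, 255].
    diff = colourised.astype(np.float64) - original.astype(np.float64)
    mse = np.sum(diff ** 2) / (3 * original.shape[0] * original.shape[1])   # Eq. (8.29)
    cpsnr = 10 * np.log10(255 * 255 / mse)                                  # Eq. (8.30)
    return mse, cpsnr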

Module Analysis of Our Algorithm

The three algorithms were analysed by experiments, and the performance of each algorithm is illustrated by comparison.

(1) Verification of CMDC

We select a detail-rich image with multiple contents as the original image. The colorization results of the different algorithms are shown in Fig. 8.17, and the corresponding


Fig. 8.17 a Given reference colour image; b given grayscale image; c original image; d result of the single dictionary; e result of literature (Liang 2016); f result of this section. Reprinted from Yan et al. (2018), with permission of Springer

Table 8.1 The objective evaluation of SSIM, MSE and CPSNR

         Single dictionary colorization algorithm   Liang (2016)   CMDC
SSIM     0.7315                                     0.6992         0.9258
MSE      52.5970                                    96.5928        31.2947
CPSNR    30.9212                                    28.2814        33.1761

Reprinted from Yan et al. (2018), with permission of Springer

Fig. 8.18 a Given reference image; b original image; c result of CMDC; d result of Uruma et al. (2013); e result of LCC. Reprinted from Yan et al. (2018), with permission of Springer

objective evaluation values are shown in Table 8.1. Through visual subjective evaluation, it can be seen that the result of the single dictionary colorization algorithm has many false colour pixels. The main reason is that the information contained by a single dictionary is limited and cannot satisfy the colorization requirements of multi-content images. Liang (2016) showed that the multi-dictionary approach is better than the single dictionary. In the upper part of image (e), there are many falsely colorized pixels caused by inaccurate classification. The effect of our algorithm is better than that of the two algorithms above, both in naturalness and in detail. Additionally, the objective evaluation results show that our algorithm is superior to the other two in SSIM, MSE and CPSNR. Therefore, the combination of feature classification and a multi-sparse dictionary is effective in colouring the grayscale image.

(2) Verification of LCC

In this experiment, an evenly distributed yellow and green picture, derived from Uruma et al. (2013), is selected as the original image (Fig. 8.18). The experimental results in (c) and (e) indicate that the improved algorithm can significantly reduce incorrectly


Fig. 8.19 Colorized image as the bottom image; a original image; b given grayscale image; c result of CMDC; d result of LPDE. Reprinted from Yan et al. (2018), with permission of Springer

Fig. 8.20 Colorized image as the top image; a original image; b given grayscale image; c result of CMDC; d Uruma et al. (2013); e result of LPDE. Reprinted from Yan et al. (2018), with permission of Springer

colorized pixels. Therefore, LCC can effectively improve the classification results and make the image colorization more natural. Additionally, the colour is determined by the reference image, so the yellow and the green of the colorized image are slightly deepened. Comparing the two algorithms, we can see that the central region of (d) has obvious red false-colorized pixels and uneven colour across the whole image, whereas (e) does not have such problems. The main reason is that, although Uruma et al. (2013) improved the sparse algorithm, it still uses a single dictionary to colorize the image, whereas LCC uses a multi-dictionary method for colouring. Therefore, it has obvious advantages in image colorization.

(3) Verification of LPDE

Two experiments were carried out, with the colorized image taken as the bottom image and as the top image of the Laplacian pyramid, respectively. The first group, with the colorized image as the bottom image, is shown in Fig. 8.19; the details of the grassland are obviously enhanced. The second group, with the colorized image as the top image, is shown in Fig. 8.20. (d) is the colorized result of Uruma et al. (2013), which is even better than Welsh et al. (2002). However, the details are still fuzzy and unclear. The main reason is that detail ambiguity is an inevitable problem of colorization by sparse representation. The detail of (e) is more prominent, particularly the colour levels, which are more abundant at the bottom. This proves that LPDE effectively enhances the details of the colorized image.


Fig. 8.21 a Given grayscale image; b result of Welsh et al. (2002); c result of this section. Reprinted from Yan et al. (2018), with permission of Springer

Performance Analysis

To illustrate the effectiveness of the elaborated algorithm, a number of comparative experiments were conducted. As shown in Fig. 8.21, there are two sets of experiments. The second column shows the results of Welsh et al. (2002), and the third column shows ours. The reference image used in each experiment remains the same; that is, the colorization model is the same. It can be seen from Fig. 8.21 that the second column contains many false-colouring pixels. For example, in rows 3–6, the sky, which should be blue, is colorized yellow. The results also lack details. Consistent with the above analysis, the reason is that the classification is inaccurate and the algorithm must solve the sparse coefficients of each dictionary separately, which greatly increases the amount of calculation. However, the number of false-colouring pixels is reduced in the third column, and the details in that column are clear. Thus, our algorithm, combining feature classification and detail enhancement, is very effective. The objective evaluation results are shown in Fig. 8.22, and our


Fig. 8.22 Objective evaluation of SSIM, CPSNR and MSE. Reprinted from Yan et al. (2018), with permission of Springer

results are obviously superior to those of Welsh et al. (2002). Furthermore, in this experiment, we used the same colorization model to colour images with similar colours, thus proving that our algorithm has good repeatability. Figures 8.23 and 8.24 show the colorization results of Levin et al. (2004), Pang et al. (2013), Mohammad and Balinsky (2009) and CEMDC. These methods are related to sparse representation and are all based on artificial intervention. As described in the introduction, the main problem of artificial-intervention colorization algorithms is that the colour is chosen by a human. As in (i) and (j), the result is satisfactory for a single-colour image but poor for a mixed-colour image. The results of our method on the two types of images are better, indicating that CEMDC, using a multi-sparse dictionary, is effective, because the dictionary contains information on different textures and colours, and the colorized results are obtained by sparse reconstruction. From Pang et al. (2013) and Mohammad and Balinsky (2009), we selected the original images of the two studies as reference images and colorized the grey images of the originals. The results are shown in the last two rows, indicating that the effect of our algorithm is as good as that of these two studies.


Fig. 8.23 a–d Given grayscale images; e–h original images; i, j result of Levin et al. (2004); k result of Pang et al. (2013); l result of Mohammad and Balinsky (2009); m–p result of CEMDC

The colorization algorithm of Larsson et al. (2016) is based on deep learning. Different from CEMDC, that algorithm builds a model by training on many colour images, finally realising the colorization of grayscale images. The training set used by Larsson was from ImageNet. To ensure the fairness of the experiment, we chose an image not in ImageNet and colorized a target grayscale image similar to the reference images. In Fig. 8.25, the results obtained by our algorithm are more similar to the original image than those of Larsson et al. (2016). The main reason is that Larsson's algorithm trains on many images to obtain its colorization model; such a wide training range results in inaccurate matching and, in turn, false colour. As shown in Fig. 8.25, the colour of the petals in the result of Larsson et al. (2016) is clearly wrong in the second column. As mentioned, our algorithm trains the colorization model using reference images that are similar to the target grayscale image. Thus, the colorization results are more natural and realistic (Fig. 8.26).


Fig. 8.24 Objective evaluation of SSIM, CPSNR and MSE. Reprinted from Yan et al. (2018), with permission of Springer

Other Applications

In addition to the colorization of grayscale images, our algorithm can also be applied to colour transfer between colour images, fused image colorization and infrared image colorization.

• Colour transfer

Colour transfer is divided into two kinds: one transfers colour to a grayscale image, and the other transfers colour to a colour image. Figure 8.27 shows the results of our algorithm used for colour transfer, where the first column is colour transfer to a grayscale image and the remaining two columns are colour transfer between colour images. As shown in Fig. 8.27, our algorithm can also achieve good results in colour transfer. The effects of (h) and (k) are similar; the only difference is that there are some visible false-colouring pixels at the top of (h), whereas (k) is natural. For colour transfer between colour images, our algorithm still transfers colour to the grayscale image obtained from the original image, whereas Reinhard et al. (2001) used the original 3-channel information for colour transfer. Thus, the colours of the results differ. However, from the results in (i), (j), (l) and (m), our algorithm is clearer in the details because of the detail enhancement algorithm. See the cloud detail on the left of (l): it is clearer than in (i). The water ripples and the ship's reflection in (m) are also clearer than in (j).


Fig. 8.25 a Given grayscale image; b original image; c result of Larsson et al.; d result of CEDMC. Reprinted from Yan et al. (2018), with permission of Springer


Fig. 8.26 Objective evaluation of SSIM, CPSNR and MSE. Reprinted from Yan et al. (2018), with permission of Springer

• Fused image colorization

Figure 8.28 shows the results for image fusion. Here, we use the existing colorization models we trained to colour a fused image. Liu et al. (2016) provided a grayscale fused image (c), which had a good effect. As shown in Fig. 8.28, the colorized image is clearer and richer in content. Comparing the top of (d) with (c), it is obvious that many details are missing in (c), whereas the detail is clearly visible in (d). Therefore, our algorithm can play a significant role in image fusion, making fused images easier to observe. In the second row, the colorized image is also consistent with human vision.

• Infrared image colorization

Figure 8.29 shows the results for infrared images. Compared with the original infrared images, the colorized images, (b) and (d), are easier to observe. This experiment proves that CEMDC can be applied to IR images. However, we trained the colorization model using colour visual images as the reference, which led to the loss of thermal information. In the future, we will use an artificially coloured image as the reference image to make the infrared target more salient.


Fig. 8.27 a–c Original image; d–f given reference image; h result of Zhang (2015); i, j result of Wu (2008); k–m result of CEMDC. Reprinted from Yan et al. (2018), with permission of Springer

8.4 Summary

In view of the colorization of low-light-level images, this chapter elaborated two colorization methods based on rule mining:
• We presented a colorization algorithm based on the association rules of multidimensional data. We added association rule mining and image classification learning technology into image colorization. An optimised Apriori algorithm, based on the multi-attribute rule constraint, was also elaborated. The main contents were the mining of the strong association rule set from the ‘brightness’ and ‘colour’ information of the reference colourful images, and the addition of the element ‘label’ to generate ‘brightness–category–colour’ strong association rules. Then, the brightness and the categories of the pixels in low-light-level images were extracted. We assigned the corresponding colour components, r, g and b, to each pixel based on


Fig. 8.28 a, b, e, f Original image; c result of Liu et al. (2016); g fused result of wavelet; d, h result of CEMDC. Reprinted from Yan et al. (2018), with permission of Springer

Fig. 8.29 Colorization results of the IR image: a IR image; b result of CEMDC; c IR image; d result of CEMDC. Reprinted from Yan et al. (2018), with permission of Springer

the ‘brightness–category–colour’ strong association rule set. Finally, a colourful result image was obtained. The results show that the algorithm elaborated in this chapter not only achieved high-quality colorization with a high degree of colour revivification but also ran in real time and is easy to implement in hardware.
• We elaborated the multi-sparse dictionary colorization algorithm CEMDC. The algorithm achieved a natural colorized effect for the target grayscale image, consistent with human vision. First, the algorithm established a multi-sparse dictionary classification colorization model. Second, to address the problem of inaccurate classification, the corresponding local constraint algorithm was elaborated to improve the classification accuracy rate. Lastly, a detail enhancement based on the Laplacian pyramid for the colorized image was elaborated; it was effective in solving the problem of missing details and improving the speed of image colorization. Additionally, the algorithm not only improved the colorization of visual grayscale images but can also be applied to other areas, such as colour transfer between colour images, colorization of grey fused images and infrared image colorization.

References


Fu, S. (2013). The research and improvement of Apriori algorithm for mining association rules. Microelectronics & Computer, 30(9), 110–114.
Guo, X. J. (2004). Research on associational rules data mining algorithm (pp. 19–27). Changchun, China: Jilin University.
He, X. J. (2014). Research on Welsh algorithm-based greyscale image colourisation. 31(12), 268–271.
Kai, H., Song, M., Bu, J., et al. (2011). Natural greyscale image colorization via local sparse coding. Journal of Computer-Aided Design & Computer Graphics, 23(8), 1401–1408.
Larsson, G., Maire, M., & Shakhnarovich, G. (2016). Learning representations for automatic colorization. In Computer vision—ECCV 2016. Springer International Publishing.
Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimisation. In Computer graphics proceedings, annual conference series, ACM SIGGRAPH (pp. 689–694). New York: ACM Press.
Liang, H. (2016). Multiple-content image colorization algorithm based on multiple dictionary. Beijing, China: Beijing Jiaotong University.
Liu, L. (2009). Overview of image textural feature extraction methods. Journal of Image and Graphics, 14(4), 622–634.
Liu, Y., Chen, X., Peng, H., et al. (2016). Multi-focus image fusion with a deep convolutional neural network. Information Fusion, 36, 191–207.
Mohammad, N., & Balinsky, A. (2009). Colorization of natural images via L1 optimisation. 37(3), 1–6.
Ni, B. B., Xu, M. D., Cheng, B., et al. (2013). Learning to photograph: A compositional perspective. IEEE Transactions on Multimedia, 15(5), 1138–1151.
Pang, J., Au, O. C., Tang, K., et al. (2013). Image colorization using sparse representation. In IEEE international conference on acoustics, speech and signal processing (pp. 1578–1582). IEEE.
Rakesh, A., Tomasz, I., & Arun, S. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207–216.
Reinhard, E., Ashikhmin, M., Gooch, B., et al. (2001). Color transfer between images. IEEE Computer Graphics and Applications, 21(5), 34–41.
Shen, Y. (2013). The research of high efficient data mining algorithms for massive datasets (pp. 1–10). Zhenjiang, China: Jiangsu University.
Tian, G. (2008). Research on image mining based on association rules (pp. 19–30). Xi'an, China: Xidian University.
Uruma, K., Konishi, K., Takahashi, T., et al. (2013). An image colorization algorithm using sparse optimisation. In IEEE international conference on acoustics, speech and signal processing (pp. 1588–1592). IEEE.
Wang, G. S. (2001). The theoretical basis of support vector machine—statistical learning theory. Computer Engineering and Applications, 19, 19–20.
Welsh, T., Ashikhmin, M., & Mueller, K. (2002). Transferring colour to grayscale images. In Computer graphics proceedings, annual conference series, ACM SIGGRAPH (pp. 277–280). New York: ACM Press.
Wu, Y. Y. (2008). Efficient medical image classification algorithm based on association rules. 29(12), 3234–3236.
Wu, J. (2016, June 10). Computing exact permutation p-values for association rules. Information Sciences, 146–162.
Xu, Y. (2016). Pedestrian detection combining with SVM classifier and HOG feature extraction. Computer Engineering, 42(1), 56–65.
Zhang, X. (2015). Study and improvement of association rules in data mining (pp. 17–20). Beijing, China: Beijing University of Posts and Telecommunications.


Zhang, H., & Guan, Z. (2005). Correlative rule of image data mining in multimedia data mining. Geospatial Information, 3(6), 33–35.
Zhen, X. U. (2011). Night vision image processing based on texture transfer (pp. 1–4). Shanghai, China: Donghua University.