English Pages 630 [660] Year 2017
HANDBOOK OF NEURAL COMPUTATION
Pijush Samui, Sanjiban Sekhar Roy, and Valentina E. Balas
Handbook of Neural Computation
Edited by
Pijush Samui
Sanjiban Sekhar Roy
Valentina E. Balas
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1800, San Diego, CA 92101-4495, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2017 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-811318-9

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner
Acquisition Editor: Chris Katsaropoulos
Editorial Project Manager: Anna Valutkevich
Production Project Manager: Julie-Ann Stansfield
Designer: Mark Rogers
Typeset by VTeX
Dedicated to Joydeb Bhandari
Pijush Samui
CONTENTS

Contributors
About the Editors

CHAPTER 1 Gravitational Search Algorithm With Chaos
Seyedali Mirjalili, Amir H. Gandomi
1.1 Introduction
1.2 Gravitational Search Algorithm
1.3 Chaotic Maps for GSA
    1.3.1 Chaotic Maps
    1.3.2 Integrating Chaotic Maps With GSA
1.4 Experimental Results and Discussion
    1.4.1 Search Performance Analysis
    1.4.2 Convergence Analysis
1.5 CGSA for Engineering Design Problems
    1.5.1 Welded Beam Design
    1.5.2 Pressure Vessel Design
1.6 Conclusion
References

CHAPTER 2 Textures and Rough Sets
Murat Diker
2.1 Introduction
2.2 Fuzzy Lattices
2.3 Texture Spaces
2.4 Rough Sets
2.5 Definability
2.6 Order Preserving Functions
2.7 Approximation Spaces and Information Systems
2.8 Conclusion
References

CHAPTER 3 Hydrological Time Series Forecasting Using Three Different Heuristic Regression Techniques
Ozgur Kisi, Jalal Shiri, Vahdettin Demir
3.1 Introduction
3.2 Methods
    3.2.1 Least-Square Support Vector Regression
    3.2.2 Multivariate Adaptive Regression Spline
    3.2.3 M5 Model Tree
3.3 Applications and Results
3.4 Conclusion
References

CHAPTER 4 A Reflection on Image Classifications for Forest Ecology Management: Towards Landscape Mapping and Monitoring
Anusheema Chakraborty, Kamna Sachdeva, Pawan K. Joshi
4.1 Introduction
4.2 Background
    4.2.1 Definitions in Remote Sensing Community
    4.2.2 Importance of Using Remote Sensing Tools
    4.2.3 Types of Remote Sensing Data
    4.2.4 Uncertainty Assessments of Remote Sensing Data
4.3 Image Classification for Forest Cover Mapping
    4.3.1 Mapping Methodologies
    4.3.2 Emerging Ensemble Classifiers
4.4 A Case Study in the Himalayan Region
4.5 Future Research Outlook
References

CHAPTER 5 An Intelligent Hybridization of ABC and LM Algorithms With Constraint Engineering Applications
Erdem Dilmen, Selim Yilmaz, Selami Beyhan
5.1 Introduction
5.2 Brief Introduction of Optimization Methods
    5.2.1 Artificial Bee Colony Algorithm
    5.2.2 Levenberg–Marquardt Algorithm
    5.2.3 Proposed Hybrid Algorithm: ABC–LM
5.3 Numerical Results
    5.3.1 Unconstrained Optimization of Benchmark Functions
    5.3.2 Constrained Real-World Optimization Problems
5.4 Strengths and Limitations
5.5 Conclusion
References

CHAPTER 6 Network Intrusion Detection Model Based on Fuzzy-Rough Classifiers
Ashalata Panigrahi, Manas R. Patra
6.1 Introduction
6.2 Related Work
6.3 Methodology
    6.3.1 Fuzzy Set Theory
    6.3.2 Rough Set Theory
    6.3.3 Hybridization of Fuzzy-Rough Set Theory
    6.3.4 Fuzzy Nearest Neighbor (FNN) Classification
    6.3.5 Fuzzy-Rough Nearest Neighbor Algorithm (FRNN)
    6.3.6 Fuzzy Ownership Algorithm
    6.3.7 Vaguely Quantified Nearest Neighbors (VQNN)
    6.3.8 Ordered Weighted Average Nearest Neighbors
6.4 Experimental Setup
    6.4.1 NSL-KDD Data Set
    6.4.2 Feature Selection Process
    6.4.3 Confusion Matrix
    6.4.4 Cross-Validation
6.5 Result Analysis
6.6 Conclusions
References

CHAPTER 7 Efficient System Reliability Analysis of Earth Slopes Based on Support Vector Machine Regression Model
Subhadeep Metya, Tanmoy Mukhopadhyay, Sondipon Adhikari, Gautam Bhattacharya
7.1 Introduction
7.2 Adopted Methodologies
    7.2.1 Deterministic Slope Stability Analysis
    7.2.2 Reliability Analysis of Slope Using Critical Slip Surfaces
    7.2.3 System Reliability Analysis of Slopes Using SVM-Based MCS
7.3 Illustrative Example and Results
    7.3.1 Slope Description
    7.3.2 Deterministic Analyses and Results
    7.3.3 Reliability Analyses and Results
7.4 Summary and Conclusions
Acknowledgments
References

CHAPTER 8 Predicting Short-Term Congested Traffic Flow on Urban Motorway Networks
Taiwo Adetiloye, Anjali Awasthi
8.1 Introduction
8.2 Literature Review
8.3 Research Methodology
    8.3.1 Back-Propagation Neural Network
    8.3.2 Neuro-Fuzzy
    8.3.3 Deep Learning
    8.3.4 Random Forest
8.4 Performance Evaluation
    8.4.1 Error Measurement
    8.4.2 Model Sensitivity
8.5 Application Results
    8.5.1 Error Measurement
    8.5.2 Rank Sensitivity
8.6 Conclusions and Future Works
Acknowledgments
References
CHAPTER 9 Object Categorization Using Adaptive Graph-Based Semi-supervised Learning
Fadi Dornaika, Alirezah Bosaghzadeh, Houssam Salmane, Yassine Ruichek
9.1 Introduction
9.2 Related Work: Graph Construction Methods
9.3 Local Binary Patterns
9.4 Proposed LBP Graph Construction
    9.4.1 Weighted Regularized Least Square Minimization
    9.4.2 Two Phase WRLS (TPWRLS)
    9.4.3 Difference Between TPWRLS Graph and Existing Graphs
9.5 Performance Evaluation
    9.5.1 Graph-Based Label Propagation
    9.5.2 Experimental Results
9.6 Conclusion
Acknowledgments
References

CHAPTER 10 Hemodynamic Model Inversion by Iterative Extended Kalman Smoother
Serdar Aslan
10.1 Introduction
    10.1.1 Outline
10.2 Methods
    10.2.1 Extended Kalman Filter
    10.2.2 Extended Kalman Smoother
    10.2.3 Iteration Step
10.3 Dynamic Representation of the Hemodynamic Model System
10.4 Results
    10.4.1 Simulation Results
    10.4.2 Discussion
10.5 Real Data Case Study
    10.5.1 Test Procedure and Real Data
    10.5.2 Discussion
10.6 Conclusion
Acknowledgments
References

CHAPTER 11 Improved Sparse Approximation Models for Stochastic Computations
Tanmoy Chatterjee, Rajib Chowdhury
11.1 Introduction
11.2 Fundamentals of HDMR, Kriging, LASSO, LAR, and FS
    11.2.1 High-Dimensional Model Representation
    11.2.2 Kriging
    11.2.3 Least Absolute Shrinkage and Selection Operator (LASSO)
    11.2.4 Least Angle Regression (LAR)
    11.2.5 Forward Selection (FS)
11.3 Proposed Approaches
    11.3.1 Proposed Approach 1 (PA1)
    11.3.2 Proposed Approach 2 (PA2)
    11.3.3 Proposed Approach 3 (PA3)
11.4 Numerical Examples
    11.4.1 Problem Set 1: Analytical Test Functions
    11.4.2 Problem Set 2: Practical Problems
11.5 Summary and Conclusions
References

CHAPTER 12 Symbol Detection in Multiple Antenna Wireless Systems via Ant Colony Optimization
Manish Mandloi, Vimal Bhatia
12.1 Introduction
12.2 Overview of Ant Colony Optimization
12.3 Point-to-Point MIMO System Model
12.4 Traditional MIMO Detection Techniques
    12.4.1 Zero Forcing Detector
    12.4.2 Minimum Mean Squared Error Detector
    12.4.3 Successive Interference Cancellation Detector
12.5 Ant Colony Optimization Based MIMO Detection
12.6 Results and Discussion
12.7 Conclusions
References

CHAPTER 13 Application of Particle Swarm Optimization to Solve Robotic Assembly Line Balancing Problems
J. Mukund Nilakantan, S.G. Ponnambalam, Peter Nielsen
13.1 Introduction
13.2 Straight Robotic Assembly Line Balancing Problems
    13.2.1 Assumptions and Mathematical Model for Balancing Straight Robotic Assembly Line
    13.2.2 Particle Swarm Optimization for Solving Straight Robotic Assembly Line Balancing Problems
    13.2.3 Parameter Selection for Straight RALB Problem
    13.2.4 Computational Results for Straight Robotic Assembly Line Balancing Problems
13.3 Robotic U-Shaped Assembly Line Balancing Problems
    13.3.1 Assumptions and Mathematical Model for Balancing U-Shaped Robotic Assembly Line
    13.3.2 Particle Swarm Optimization for Solving Robotic U-Shaped Assembly Line Balancing Problems
    13.3.3 Differences Between Straight and U-Shaped Robotic Assembly Lines
    13.3.4 Parameters for PSO to Solve RUALB Problems
    13.3.5 Computational Results for RUALB Problems
13.4 Summary
References

CHAPTER 14 The Cuckoo Optimization Algorithm and Its Applications
Mohamed Arezki Mellal, Edward J. Williams
14.1 Introduction
14.2 Cuckoo Optimization Algorithm
14.3 Applications of the Cuckoo Optimization Algorithm
    14.3.1 Replacement of Obsolete Components
    14.3.2 Machining Parameters
    14.3.3 Combined Heat and Power Economic Dispatch (CHPED)
14.4 Results and Discussion
14.5 Conclusions
References

CHAPTER 15 Hybrid Intelligent Model Based on Least Squares Support Vector Regression and Artificial Bee Colony Optimization for Time-Series Modeling and Forecasting Horizontal Displacement of Hydropower Dam
Dieu Tien Bui, Kien-Trinh Thi Bui, Quang-Thanh Bui, Chinh Van Doan, Nhat-Duc Hoang
15.1 Introduction and Background
    15.1.1 Problem Statement
    15.1.2 Least-Squares Support Vector Regression for Time-Series Modeling
    15.1.3 Artificial Bee Colony Optimization
15.2 Implementation of the Hybrid Intelligent Model for Time-Series Modeling and Forecasting Horizontal Displacement of Hydropower Dam
    15.2.1 Case Study with Time-Series Monitoring Data
    15.2.2 Determining Input Factors and Appropriately Lagged Variables
    15.2.3 Configuration of the Least-Squares Support Vector Regression Model
    15.2.4 Determination of the Fitness Function for the Model Optimization
    15.2.5 Model Optimization Using Artificial Bee Colony
    15.2.6 Model Evaluation
15.3 Results and Discussion
    15.3.1 Training Results and Performance Assessment of the Hybrid Model
    15.3.2 Model Comparison
15.4 Summary and Conclusions
Acknowledgments
References

CHAPTER 16 Modeling the Axial Capacity of Bored Piles Using Multi-Objective Feature Selection, Functional Network and Multivariate Adaptive Regression Spline
Ranajeet Mohanty, Shakti Suman, Sarat Kumar Das
16.1 Introduction
16.2 Methodology
    16.2.1 Multi-Objective Feature Selection (MOFS)
    16.2.2 Functional Network (FN)
    16.2.3 Multivariate Adaptive Regression Splines (MARS)
16.3 Database and Preprocessing
16.4 Results and Discussion
    16.4.1 Prediction Models for Bored Pile
16.5 Conclusion
References

CHAPTER 17 Transient Stability Constrained Optimal Power Flow Using Chaotic Whale Optimization Algorithm
Dharmbir Prasad, Aparajita Mukherjee, V. Mukherjee
17.1 Introduction
17.2 Problem Formulation of TSCOPF
    17.2.1 OPF Problem Formulation
    17.2.2 Objective Function
    17.2.3 Constraints of the Problem
17.3 TSCOPF Problem Using Proposed CWOA
    17.3.1 Overview of WOA
    17.3.2 Implementation of TSCOPF Problem Using Proposed CWOA
17.4 Simulation Results and Discussion
    17.4.1 Input Parameters
    17.4.2 Test System 1: WSCC 3-Generator, 9-Bus Test Power System
    17.4.3 Test System 2: IEEE 30-Bus Test Power System
    17.4.4 Statistical Analysis of the Results
17.5 Conclusion and Scope of Future Work
References

CHAPTER 18 Slope Stability Evaluation Using Radial Basis Function Neural Network, Least Squares Support Vector Machines, and Extreme Learning Machine
Nhat-Duc Hoang, Dieu Tien Bui
18.1 Research Background
18.2 Literature Review
18.3 Research Method
    18.3.1 Machine Learning Approaches
    18.3.2 Historical Data Sets of Slope Assessment
18.4 Experimental Setting
18.5 Experimental Results
18.6 Conclusion
References

CHAPTER 19 Alternating Decision Trees
Melanie Po-Leen Ooi, Hong Kuan Sok, Ye Chow Kuang, Serge Demidenko
Nomenclature
19.1 Introduction
19.2 Boosting: An Overview
19.3 Alternating Decision Trees
    19.3.1 Univariate ADTree
    19.3.2 Fisher’s ADTree
XIII
301 301 302 308 308 311 311 312 312 313 314 315 315 317 317 317 317 324 328 329 331 333 333 334 335 335 337 338 341 342 343 345 345 345 347 350 351 352
XIV
Contents
19.3.3 Sparse ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.3.4 Regularized Logistic ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.4 Discussion and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.4.1 Comparison with Other Decision Tree Algorithms . . . . . . . . . . . . . . . . . . . 19.4.2 Comparison of Different Univariate ADTree Variants . . . . . . . . . . . . . . . . 19.4.3 Comparison Between SADT Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.5 Applications of ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.5.1 Defect Cluster Recognition System for Semiconductor Manufacturing . . 19.5.2 Dimension Reduction Tool in Human Detection System . . . . . . . . . . . . . . 19.6 Conclusions and Future Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.6.1 Cost-Sensitive ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.6.2 Credal ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.6.3 Regression ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.6.4 Group Lasso ADTree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 20 20.1 20.2 20.3 20.4 20.5 CHAPTER 21
21.1 21.2
21.3
21.4
21.5
354 356 359 361 362 362 365 367 368 369 369 369 369 369 370
Scene Understanding Using Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . Farzad Husain, Babette Dellen, Carme Torras Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Semantic Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.3.1 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20.4.1 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
373
Deep Learning for Coral Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ammar Mahmood, Mohammed Bennamoun, Senjian An, Ferdous Sohel, Farid Boussaid, Renae Hovey, Gary Kendrick, Robert B. Fisher Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Annotation of Coral Reefs: Methods and Challenges . . . . . . . . . . . . . . . . . . . . . . . 21.2.1 Methods for Conventional Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Automatic Coral Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3.1 Coral Classification With Hand-Crafted Features . . . . . . . . . . . . . . . . . . . . 21.3.2 Coral Classification With Learned Features . . . . . . . . . . . . . . . . . . . . . . . . . Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4.2 Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4.3 Going Deeper With Neural Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Deep Learning for Coral Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.5.1 Hybrid and Quantized Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.5.2 Coral Population Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.5.3 Cost-Sensitive Learning for Corals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
383
373 373 374 375 377 379 380 380
383 384 384 385 387 387 389 389 389 390 392 393 394 396 397
Contents
21.5.4 CNN With Fluorescent Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.6 Future Prospects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 22 22.1 22.2 22.3
22.4 22.5
22.6
CHAPTER 23
23.1 23.2 23.3
23.4
23.5 23.6
A Deep Learning Framework for Classifying Sounds of Mysticete Whales . . . . Stavros Ntalampiras Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acoustic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.3.1 Frequency Domain Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.3.2 Wavelet Domain Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Classification Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.4.1 Reservoir Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.5.1 The Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.5.2 Framework Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.5.3 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Unsupervised Deep Learning for Data-Driven Reliability and Risk Analysis of Engineered Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Jiang, Mojtaba Maghrebi, Alan Crosky, Serkan Saydam Introduction . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Reliability and Risk Analysis of Engineered Systems . . . . . . . . . . . . . . . . . . . . . . . Deep Learning: Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.3.1 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.3.2 Hyperparameters Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.4.1 State Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23.4.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Applying Machine Learning Algorithms in Landslide Susceptibility Assessments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paraskevas Tsangaratos, Ioanna Ilia 24.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.2.1 Constructing the Inventory Database – Identifying the Landslide Related Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.2.2 Preprocessing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
XV
397 398 398 399 399 403 403 404 405 405 407 408 408 410 410 411 411 413 414 414 417 417 419 420 420 426 426 427 427 427 428 429
CHAPTER 24
433 433 435 435 436
XVI
Contents
24.2.3 Implementing the Machine Learning Methods – Producing the Landslide Susceptibility Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.2.4 Validation and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.3 Study Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.3.1 Implementing the Methodology – Results . . . . . . . . . . . . . . . . . . . . . . . . . . 24.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 25
25.1 25.2 25.3
25.4
25.5 CHAPTER 26
26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8
26.9
MDHS–LPNN: A Hybrid FOREX Predictor Model Using a Legendre Polynomial Neural Network with a Modified Differential Harmony Search Technique . . . . . Rajashree Dash, Pradipta K. Dash Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MDHS–LPNN: A Hybrid FOREX Predictor Model . . . . . . . . . . . . . . . . . . . . . . . . 25.3.1 Legendre Polynomial Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.3.2 Modified Differential Harmony Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.3.3 Detailed Steps of Currency Exchange Rate Prediction Using MDHS–LPNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.3.4 Computational Complexity of LPNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.4.1 Data Set Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.4.2 Input Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.4.3 Parameter Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.4.4 Performance Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25.4.5 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
A Neural Model of Attention and Feedback for Computing Perceived Brightness in Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ashish Bakshi, Kuntal Ghosh Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Brightness Illusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . General Structure of the Eye–Brain System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lateral Inhibition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Modeling Brightness Illusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Motivation for the Attentive Vision Filter Model . . . . . . . . . . . . . . . . . . . . . . . . . . . The ECRF Filter and Attentive Vision Filter (AVF) . . . . . . . . . . . . . . . . . . . . . . . . Sample Results from the AVF filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.8.1 Simultaneous Brightness Contrast (SBC) . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.8.2 White’s Illusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.8.3 Shifted White’s Illusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.8.4 Sinusoidal Grating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Necessity of Non-linearity in Explaining Brightness Illusions . . . . . . . . . . . . . . . .
437 438 438 440 450 452 453 459 459 461 464 464 466 467 469 470 470 470 472 473 474 479 485 487 487 488 489 491 492 493 495 497 497 497 497 497 498
Contents
26.9.1 The Mach Band Illusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.9.2 Necessity of Discontinuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.9.3 Scaling Properties of Mach Bands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 27 27.1 27.2 27.3
27.4
27.5 27.6
27.7
27.8 CHAPTER 28 28.1 28.2 28.3 28.4 28.5
28.6 CHAPTER 29
XVII
499 504 505 509 511
Support Vector Machine: Principles, Parameters, and Applications . . . . . . . . Raoof Gholami, Nikoo Fakhari Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Support Vector Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.3.1 Linearly Separable Case (Hard Margin) . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.3.2 Linearly Non-separable Case (Soft Margin) . . . . . . . . . . . . . . . . . . . . . . . . 27.3.3 Non-linear Case (Kernel Machine) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.4.1 Classic Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.4.2 Linear Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.4.3 Non-linear Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . Step by Step with SVMs for Classification and Regression Data Analysis . . . . . Strength and Weakness of SVMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.6.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.6.2 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.7.1 Permeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.7.2 Rock Mass Classification (RMR) System . . . . . . . . . . . . . . . . . . . . . . . . . . 27.7.3 Shear Wave Velocity . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.7.4 Interfacial Tension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27.7.5 Compressive Strength . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
515
Evolving Radial Basis Function Networks Using Moth–Flame Optimizer . . . . . Hossam Faris, Ibrahim Aljarah, Seyedali Mirjalili Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Radial Basis Function (RBF) Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . Moth–Flame Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MFO for Optimizing RBFN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28.5.1 Comparison with Other Metaheuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28.5.2 Comparison with Newrb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
537
515 516 517 517 519 520 522 522 523 524 525 526 526 527 527 527 529 531 532 532 533 533
537 538 539 541 543 543 544 548 549
Application of Fuzzy Methods in Power System Problems . . . . . . . . . . . . . . . . 551 Sajad Madadi, Morteza Nazari-Heris, Behnam Mohammadi-Ivatloo 29.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
XVIII
Contents
29.1.1 Definition of FL Membership . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.1.2 Propositions of FL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.1.3 Implication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.1.4 Fuzzy Interface System (FIS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.2 Mathematical Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.2.1 Islanding Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.3 Mathematical Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.3.1 Dynamic Line Rating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 30
30.1 30.2 30.3
30.4 30.5 CHAPTER 31
31.1 31.2
31.3 31.4 31.5 31.6 CHAPTER 32
Application of Particle Swarm Optimization Algorithm in Power System Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Milad Zamani-Gargari, Morteza Nazari-Heris, Behnam Mohammadi-Ivatloo Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Applications of PSO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hydrothermal Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30.3.1 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30.3.2 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Optimum Design of Composite Concrete Floors Using a Hybrid Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohamed G. Sahab, Vassili V. Toropov, Amir H. Gandomi Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problem Formulation of Cost Optimization of Composite Floors . . . . . . . . . . . . 31.2.1 Design Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.2 Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . 31.2.3 Design Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.4 Solution Method of Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . Introducing Optimality Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiobjective Design Optimization of Steel–Concrete Composite Floor . . . . . Comparative Design Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
553 553 554 554 556 556 559 559 567 570 571
571 574 574 574 575 578 578 579 581 581 581 582 582 582 583 585 585 586 587 588 588
A Comparative Study of Image Segmentation Algorithms and Descriptors for Building Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591 Fadi Dornaika, Abdelmalik Moujahid, Youssef El Merabet, Yassine Ruichek 32.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591 32.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Contents
32.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2.1 General Purpose Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2.2 Building Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.3 Proposed Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.3.1 Overview of the Proposed Framework and Main Differences with Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.3.2 Studied Image Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.3.3 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.4.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.4.2 Evaluation Metrics and Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.4.3 Segmentation Algorithms and Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . 32.4.4 Classifiers Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 33
33.1 33.2 33.3
33.4 33.5 Subject Index
Object-Oriented Random Forest for High Resolution Land Cover Mapping Using Quickbird-2 Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Taskin Kavzoglu Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Study Area and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.3.1 Object-Based Image Analysis (OBIA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.3.2 Random Forest Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33.3.3 McNemar’s Test for Comparison of Classifier Performances . . . . . . . . . . Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .............................................................
XIX
592 592 592 593 593 593 594 597 597 598 598 599 603 603 605 605 607 607 609 610 611 612 613 613 617 618 621
CONTRIBUTORS

Taiwo Adetiloye
Concordia Institute for Information and Systems Engineering, Montreal, QC, Canada

Sondipon Adhikari
College of Engineering, Swansea University, Swansea, United Kingdom

Ibrahim Aljarah
King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan

Senjian An
The University of Western Australia, Crawley, WA, Australia

Serdar Aslan
McLean Imaging Center, McLean Hospital, Belmont, MA, United States; Harvard Medical School, Boston, MA, United States

Anjali Awasthi
Concordia Institute for Information and Systems Engineering, Montreal, QC, Canada

Ashish Bakshi
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India

Mohammed Bennamoun
The University of Western Australia, Crawley, WA, Australia

Selami Beyhan
Hacettepe University, Ankara, Turkey

Vimal Bhatia
Discipline of Electrical Engineering, Indian Institute of Technology Indore, Indore, India

Gautam Bhattacharya
Indian Institute of Engineering Science and Technology (IIEST), Shibpur, India

Alirezah Bosaghzadeh
University of the Basque Country UPV/EHU, San Sebastian, Spain; Shahid Rajaee Teacher Training University, Tehran, Iran

Farid Boussaid
The University of Western Australia, Crawley, WA, Australia
Dieu Tien Bui
Geographic Information System Group, University College of Southeast Norway (USN), Bø i Telemark, Norway

Kien-Trinh Thi Bui
Geomatics Center, Water Resources University, Hanoi, Vietnam

Quang-Thanh Bui
Faculty of Geography, VNU University of Science, Hanoi, Vietnam

Anusheema Chakraborty
TERI University, New Delhi, India

Tanmoy Chatterjee
Indian Institute of Technology Roorkee, Roorkee, India

Rajib Chowdhury
Indian Institute of Technology Roorkee, Roorkee, India

Alan Crosky
School of Materials Science and Engineering, UNSW Australia, Sydney, NSW, Australia

Sarat Kumar Das
National Institute of Technology, Rourkela, India

Pradipta K. Dash
Siksha O Anusandhan University, Bhubaneswar, India

Rajashree Dash
Siksha O Anusandhan University, Bhubaneswar, India

Babette Dellen
RheinAhrCampus der Hochschule Koblenz, Remagen, Germany

Serge Demidenko
School of Engineering and Advanced Technology, Massey University, Auckland, New Zealand

Vahdettin Demir
Engineering Faculty, KTO Karatay University, Konya, Turkey

Murat Diker
Hacettepe University, Ankara, Turkey

Erdem Dilmen
Pamukkale University, Denizli, Turkey

Chinh Van Doan
Faculty of Surveying and Mapping, Le Quy Don Technical University, Hanoi, Vietnam
Fadi Dornaika University of the Basque Country UPV/EHU, San Sebastian, Spain; IKERBASQUE, Basque Foundation for Science, Bilbao, Spain Nikoo Fakhari Shahrood University of Technology, Shahrood, Iran Hossam Faris King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan Robert B. Fisher University of Edinburgh, Edinburgh, United Kingdom Amir H. Gandomi BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, United States; School of Business, Stevens Institute of Technology, Hoboken, NJ, United States Raoof Gholami Curtin University, Miri, Malaysia Kuntal Ghosh Machine Intelligence Unit, Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India Nhat-Duc Hoang Faculty of Civil Engineering, Institute of Research and Development, Duy Tan University, Danang, Vietnam Renae Hovey The University of Western Australia, Crawley, WA, Australia Farzad Husain Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain Catchoom, Barcelona, Spain Ioanna Ilia School of Mining and Metallurgical Engineering, National Technical University of Athens, Zografou, Greece Peng Jiang School of Materials Science and Engineering, UNSW Australia, Sidney, NSW, Australia Pawan K. Joshi School of Environmental Sciences, Jawaharlal Nehru University, New Delhi, India Taskin Kavzoglu Gebze Technical University, Gebze, Turkey
Gary Kendrick The University of Western Australia, Crawley, WA, Australia Ozgur Kisi Faculty of Natural Sciences and Engineering, Ilia State University, Tbilisi, Georgia Ye Chow Kuang Advanced Engineering Platform and Electrical and Computer Systems Engineering, School of Engineering, Monash University, Bandar Sunway, Malaysia Sajad Madadi Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran Mojtaba Maghrebi Ferdowsi University of Mashhad, Khorasan Razavi, Iran Ammar Mahmood The University of Western Australia, Crawley, WA, Australia Manish Mandloi Discipline of Electrical Engineering, Indian Institute of Technology Indore, Indore, India Mohamed Arezki Mellal LMSS, Faculty of Engineering Sciences (FSI), M’Hamed Bougara University, Boumerdes, Algeria Youssef El Merabet Faculté des Sciences, Université Ibn Tofail, Kénitra, Morocco Subhadeep Metya College of Engineering, Swansea University, Swansea, United Kingdom; Indian Institute of Engineering Science and Technology (IIEST), Shibpur, India Seyedali Mirjalili School of Information and Communication Technology, Griffith University, Brisbane, QLD, Australia Behnam Mohammadi-Ivatloo Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran Ranajeet Mohanty National Institute of Technology, Rourkela, India Abdelmalik Moujahid Carlos III University of Madrid (UC3M), Madrid, Spain
Aparajita Mukherjee Indian Institute of Technology (Indian School of Mines), Dhanbad, India V. Mukherjee Indian Institute of Technology (Indian School of Mines), Dhanbad, India Tanmoy Mukhopadhyay College of Engineering, Swansea University, Swansea, United Kingdom J. Mukund Nilakantan Aalborg University, Aalborg, Denmark Morteza Nazari-Heris Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran Peter Nielsen Aalborg University, Aalborg, Denmark Stavros Ntalampiras Politecnico di Milano, Milan, Italy Melanie Po-Leen Ooi Advanced Engineering Platform and Electrical and Computer Systems Engineering, School of Engineering, Monash University, Bandar Sunway, Malaysia; School of Engineering and Physical Sciences, Heriot-Watt University, Putrajaya, Malaysia Ashalata Panigrahi Berhampur University, Berhampur, India Manas R. Patra Berhampur University, Berhampur, India S.G. Ponnambalam School of Engineering, Monash University Malaysia, Bandar Sunway, Malaysia Dharmbir Prasad Asansol Engineering College, Asansol, India Yassine Ruichek IRTES-SET, UTBM, Belfort, France Kamna Sachdeva TERI University, New Delhi, India Mohamed G. Sahab School of Civil Engineering, Tafresh University, Tafresh, Iran Houssam Salmane IRTES-SET, UTBM, Belfort, France
Serkan Saydam School of Mining Engineering, UNSW Australia, Sidney, NSW, Australia Jalal Shiri Faculty of Agriculture, University of Tabriz, Tabriz, Iran Ferdous Sohel Murdoch University, Murdoch, WA, Australia Hong Kuan Sok Advanced Engineering Platform and Electrical and Computer Systems Engineering, School of Engineering, Monash University, Bandar Sunway, Malaysia Shakti Suman National Institute of Technology, Rourkela, India Vassili V. Toropov School of Engineering and Materials Science, Queen Mary University of London, London, United Kingdom Carme Torras Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain Paraskevas Tsangaratos School of Mining and Metallurgical Engineering, National Technical University of Athens, Zografou, Greece Edward J. Williams College of Engineering and Computer Science, University of Michigan, Dearborn, MI, United States; College of Business, University of Michigan, Dearborn, MI, United States Selim Yilmaz Hacettepe University, Ankara, Turkey Milad Zamani-Gargari Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
ABOUT THE EDITORS Pijush Samui is an Associate Professor in Civil Engineering at NIT Patna, Bihar, India. He graduated in 2000, with a BTech in Civil Engineering from the Indian Institute of Engineering Science and Technology, Shibpur, India. He received his MSc in Geotechnical Earthquake Engineering from the Indian Institute of Science, Bangalore, India, in 2004. Since 2008 he holds a PhD degree in Geotechnical Earthquake Engineering from the Indian Institute of Science, Bangalore, India. Pijush Samui was a postdoctoral fellow at the University of Pittsburgh, United States, during 2008–2009 and at Tampere University of Technology, Finland, during 2009–2010. At the University of Pittsburgh, he worked on designing an efficient tool for rock cutting and on applications of Support Vector Machines (SVMs) when designing a geostructure. At Tampere University of Technology, he worked on designing railway embankments, ensuring slope reliability, and on site characterization. In 2010, Dr. Samui joined the Center for Disaster Mitigation and Management at VIT University as an Associate Professor. He was promoted to a Professor in 2012. Dr. Samui’s research focuses on the application of Artificial Intelligence for designing civil engineering structures, design of foundations, stability of railway embankments, reliability analysis, site characterization, earthquake engineering, and big data. In 2009 Dr. Samui was the recipient of Finland’s prestigious CIMO Fellowship for his integrated research on the design of railway embankments. He was awarded the Shamsher Prakash Research Award in 2011 by IIT Roorkee for his innovative research on the applications of Artificial Intelligence in designing civil engineering structures. He was selected as the recipient of the IGS Sardar Resham Singh Memorial Award in 2013 for his innovative research on an infrastructure project. He was elected Fellow of International Congress of Disaster Management in 2010. Dr. 
Samui served as a guest editor for the Disaster Advances Journal. He also serves as an editorial board member of several international journals. Dr. Samui is active in a variety of professional organizations, including the Indian Geotechnical Society, Indian Science Congress, Institution of Engineers, World Federation of Soft Computing, and Geotechnical Engineering for Disaster Mitigation and Rehabilitation. He has organized numerous workshops and conferences on the applications of artificial intelligence in civil engineering design. Dr. Sanjiban Sekhar Roy is an Associate Professor in the School of Computer Science and Engineering, VIT University, Vellore, India. He joined VIT University in 2009 as an Assistant Professor. He has more than 7 years of teaching and research experience in the area of Computer Science and Engineering. During these years, he worked on applications of various machine learning techniques (Artificial Neural Networks, Support Vector Machines, Least Squares Support Vector Machines, Relevance Vector Machines, Extreme Learning Machines, Deep Learning, Minimax Probability Machines) to various Computer Science domains, stock market forecasting, Biological Science, Civil Engineering, Environmental Engineering and carried out international and national collaboration on various challenging problems. He has published numerous journal and conference articles on neural computing, support vector machines, rough sets, image processing, pattern recognition, and deep learning. He is an editorial board member of the International Journal of Advanced Intelligence Paradigms, Interscience. Dr. Roy has received research publication awards many times for his research contributions from VIT University.
Valentina E. Balas is currently a Full Professor in the Department of Automatics and Applied Software at the Faculty of Engineering, “Aurel Vlaicu” University of Arad, Romania. She holds a PhD degree in Applied Electronics and Telecommunications from the Polytechnic University of Timisoara. Dr. Balas is an author of more than 250 research papers in refereed journals and international conferences. Her research interests are in Intelligent Systems, Fuzzy Control, Soft Computing, Smart Sensors, Information Fusion, Modeling and Simulation. She is the Editor-in-Chief of the International Journal of Advanced Intelligence Paradigms (IJAIP) and of the International Journal of Computational Systems Engineering (IJCSysE), member of the Editorial Board of several national and international journals and is an evaluator expert for national and international projects. Dr. Balas is the director of the Intelligent Systems Research Centre in the “Aurel Vlaicu” University of Arad. She served as General Chair of the International Workshop Soft Computing and Applications (SOFA) for seven editions during 2005–2016 held in Romania and Hungary. Dr. Balas participated in many international conferences as an Organizer, Honorary Chair, Session Chair and member in Steering, Advisory or International Program Committees. She is a member of EUSFLAT, SIAM and a Senior Member IEEE, member in TC – Fuzzy Systems (IEEE CIS), member in TC – Emergent Technologies (IEEE CIS), member in TC – Soft Computing (IEEE SMCS). Dr. Balas was Vice-president (Awards) of IFSA International Fuzzy Systems Association Council during 2013–2015 and is a Joint Secretary of the Governing Council of Forum for Interdisciplinary Mathematics (FIM) – a Multidisciplinary Academic Body, India.
CHAPTER 1
GRAVITATIONAL SEARCH ALGORITHM WITH CHAOS
Seyedali Mirjalili∗, Amir H. Gandomi†
∗School of Information and Communication Technology, Griffith University, Brisbane, QLD, Australia
†BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, United States
1.1 INTRODUCTION
The No Free Lunch (NFL) theorem establishes that no single algorithm is effective for solving all optimization problems [1]. In other words, a heuristic algorithm may provide better solutions for some specific problems, but worse solutions for others. The NFL theorem therefore implies that there is always room for modifying or improving an algorithm in order to solve a new set of problems. Examples of such algorithms are the Genetic Algorithm (GA) [2], Differential Evolution (DE) [3,4], Particle Swarm Optimization (PSO) [5,6], and Ant Colony Optimization (ACO) [7]. The Gravitational Search Algorithm (GSA) is one of the more recent algorithms, proposed by Rashedi et al. [8]. It has been shown that this algorithm provides promising results compared to other well-known algorithms in this field. Generally, evolutionary (or population-based) algorithms are similar in that a population of random solutions is first created. This population is evolved through a pre-defined number of steps according to rules that depend on the structure of the algorithm. For instance, PSO uses the social behavior of bird flocks and GA utilizes Darwin’s theory of evolution. Regardless of structure, these algorithms also divide the search process into two main phases, exploration and exploitation, to increase the probability of finding the global optimum. Exploration refers to searching different regions of the search space, whereas exploitation is the convergence towards the best solution among the promising solutions attained in the exploration phase. The ultimate goal is to find an efficient trade-off between exploitation and exploration to ensure finding the global optimum. However, finding a proper balance is challenging due to the stochastic behavior of evolutionary algorithms. Moreover, exploration and exploitation are in conflict: strengthening one weakens the other.
A recent popular way of improving the exploration phase is to utilize random walk methods [9,10]. Basically, large random movements are useful for exploring all parts of the search space. For improving the exploitation phase, however, local search [11,12] and gradient descent [13,14] have mostly been utilized. Both of these methods are able to improve performance, but they bring additional computational cost. Moreover, gradient descent methods are ill-defined for non-differentiable search spaces. These problems should be taken into account, especially when solving real-world engineering problems.

Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00001-6 Copyright © 2017 Elsevier Inc. All rights reserved.

The literature shows that one of the cheapest methods for boosting both exploration and exploitation is to utilize chaos theory [15–23]. Chaos theory refers to the study of chaotic dynamical systems. Chaotic systems are nonlinear dynamical systems that are highly sensitive to their initial conditions. In other words, small changes in initial conditions result in big changes in the final outcome of the system. Chaotic systems appear to behave randomly, but a system does not need randomness in order to exhibit chaotic behavior [24]. In other words, deterministic systems are also able to show chaotic behavior. Recently, these characteristics have been utilized for improving the performance of heuristic optimization algorithms. In 2009, Alatas et al. embedded 12 chaotic maps into PSO [15]. They showed that the performance of PSO can be improved by chaos. Alatas also showed that chaos is able to improve the performance of the Bee Colony Algorithm (BCA) [16] as well as Harmony Search (HS) [17]. In 2012, a chaos-enhanced version of accelerated PSO was proposed by Gandomi et al. [18]. Some of the other chaos-enhanced heuristic algorithms are chaotic Genetic Algorithms [19], chaotic Differential Evolution [20], chaotic Simulated Annealing [21], and the chaotic Firefly Algorithm [22,23]. The results of these studies serve as evidence of the successful applicability of chaos in heuristic algorithms.

In this study 10 chaotic maps are employed to improve the performance of GSA, as it suffers from trapping in local minima and slow convergence. Basically, the search process in GSA slows down as iterations pass. Because of the direct effect of the fitness function on mass, search agents become heavier in proportion to the iteration number. This prevents them from exploring new regions of the search space and from rapidly exploiting the best optimum in the last iterations. These problems have been mentioned and tackled, directly or indirectly, in several works [25–29]. All of these improvements in the literature serve as evidence that GSA suffers from trapping in local minima and slow exploitation. In this study chaos is embedded into GSA in order to overcome these problems. The rest of the chapter is organized as follows.
Section 1.2 presents a brief introduction to the GSA algorithm, and Section 1.3 discusses the chaotic maps and integrates them into GSA. The experimental results for benchmark and classical engineering design problems are provided in Sections 1.4 and 1.5, respectively. Finally, Section 1.6 concludes the work and suggests some directions for future research.
1.2 GRAVITATIONAL SEARCH ALGORITHM
This algorithm was proposed in 2009 and simulates the interaction of masses in our universe [8]. In fact, this algorithm considers each search agent as a mass with weight proportional to its fitness value. During optimization, search agents attract each other using the gravitational force between them. The gravitational force between the solutions is calculated using the following equations:

$$F_i^d(t) = \sum_{j=1,\, j \neq i}^{N} \mathrm{rand}_j \, F_{ij}^d(t) \tag{1.1}$$

where $\mathrm{rand}_j$ is a random number in the interval [0, 1] and $F_{ij}^d(t)$ is calculated as follows:

$$F_{ij}^d(t) = G(t) \, \frac{M_{pi}(t) \times M_{aj}(t)}{R_{ij}(t) + \varepsilon} \left( x_j^d(t) - x_i^d(t) \right) \tag{1.2}$$

where $M_{aj}$ is the active mass of the jth solution, $M_{pi}$ is the passive mass of the ith solution, $G(t)$ is the gravitational constant at time t, $\varepsilon$ is a small constant, and $R_{ij}(t)$ is the Euclidean distance between the ith and jth agents.
In the above, the gravitational constant (G) and the Euclidean distance between two agents are calculated as follows:

$$G(t) = G_0 \times \exp(-\alpha \times \mathit{iter} / \mathit{maxiter}) \tag{1.3}$$

$$R_{ij}(t) = \left\| X_i(t), X_j(t) \right\|_2 \tag{1.4}$$

where $\alpha$ is the descending coefficient, $G_0$ is the initial gravitational constant, iter is the current iteration, and maxiter is the maximum number of iterations. After calculating the forces, the next position of a solution is defined by the Newtonian law of motion as follows:

$$a_i^d(t) = \frac{F_i^d(t)}{M_{ii}(t)} \tag{1.5}$$

$$v_i^d(t+1) = \mathrm{rand}_i \times v_i^d(t) + a_i^d(t) \tag{1.6}$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \tag{1.7}$$

where d indexes the problem’s dimensions (variables), $\mathrm{rand}_i$ is a random number in the interval [0, 1], t is the current iteration, and $M_{ii}$ is the inertial mass of the ith agent. What drives the convergence of GSA is the fact that solutions with better fitness are heavier and therefore attract lighter solutions. This mechanism gravitates the solutions towards promising regions of the search landscape. Because of the direct relation between mass and fitness function, a normalization method should be adopted to scale masses as follows:

$$m_i(t) = \frac{\mathit{fit}_i(t) - \mathit{worst}(t)}{\mathit{best}(t) - \mathit{worst}(t)} \tag{1.8}$$

$$M_i(t) = \frac{m_i(t)}{\sum_{j=1}^{N} m_j(t)} \tag{1.9}$$

where $\mathit{fit}_i(t)$ is the fitness value of agent i at time t, $\mathit{best}(t)$ is the fitness of the fittest agent at time t, and $\mathit{worst}(t)$ is the fitness of the weakest agent at time t. In the next section, we propose a mechanism to integrate chaotic maps and improve the performance of GSA.
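Eqs. (1.1) to (1.9) can be read as a single iteration of the algorithm. The following Python sketch is one possible reading, not the authors' implementation. It assumes a minimization problem and assumes that the active, passive, and inertial masses of an agent all equal $M_i$, so that the mass of agent i cancels when Eq. (1.2) is divided by $M_{ii}$ in Eq. (1.5). The function names, the value of `eps`, the even split of mass when all agents are equally fit, and the value `g0 = 100.0` in the usage part are illustrative assumptions.

```python
import math
import random


def gravitational_constant(g0, alpha, iteration, max_iter):
    """Eq. (1.3): exponentially decaying gravitational constant."""
    return g0 * math.exp(-alpha * iteration / max_iter)


def masses(fitness_values):
    """Eqs. (1.8)-(1.9) for minimization: lower fitness gives a larger mass."""
    best, worst = min(fitness_values), max(fitness_values)
    if best == worst:  # assumption: equally fit agents share the mass evenly
        return [1.0 / len(fitness_values)] * len(fitness_values)
    raw = [(f - worst) / (best - worst) for f in fitness_values]
    total = sum(raw)
    return [m / total for m in raw]


def gsa_step(positions, velocities, fitness_values, g, rng=random, eps=1e-12):
    """One update of all agents following Eqs. (1.1), (1.2), (1.4)-(1.7).

    Assumption: M_ai = M_pi = M_ii = M_i, so the mass of agent i cancels
    in F_i / M_ii and agents with zero mass cause no division by zero.
    """
    mass = masses(fitness_values)
    n, dim = len(positions), len(positions[0])
    new_positions, new_velocities = [], []
    for i in range(n):
        acc = [0.0] * dim
        for j in range(n):
            if j == i:
                continue
            dist = math.dist(positions[i], positions[j])  # Eq. (1.4)
            weight = rng.random() * g * mass[j] / (dist + eps)
            for d in range(dim):
                acc[d] += weight * (positions[j][d] - positions[i][d])
        r = rng.random()
        vel = [r * velocities[i][d] + acc[d] for d in range(dim)]  # Eq. (1.6)
        new_velocities.append(vel)
        new_positions.append(
            [positions[i][d] + vel[d] for d in range(dim)]  # Eq. (1.7)
        )
    return new_positions, new_velocities


if __name__ == "__main__":
    rng = random.Random(0)
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(5)]
    vel = [[0.0, 0.0] for _ in pos]
    for t in range(50):
        fit = [sum(v * v for v in p) for p in pos]  # sphere function
        g = gravitational_constant(100.0, 20.0, t, 50)
        pos, vel = gsa_step(pos, vel, fit, g, rng)
```

Because the sketch has no elitism, the best fitness in the population is not guaranteed to decrease monotonically from one iteration to the next.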
1.3 CHAOTIC MAPS FOR GSA
In this section, the utilized chaotic maps are first presented. Then, the method of embedding these maps into GSA is proposed.
Table 1.1 Chaotic Maps

No.  Name              Chaotic map                                                                                   Range
1    Chebyshev [32]    $x_{i+1} = \cos(i \cos^{-1}(x_i))$                                                            (−1, 1)
2    Circle [33]       $x_{i+1} = \mathrm{mod}\left(x_i + b - \frac{a}{2\pi} \sin(2\pi x_i),\, 1\right)$, a = 0.5 and b = 0.2   (0, 1)
3    Gauss/mouse [34]  $x_{i+1} = \begin{cases} 1 & x_i = 0 \\ \frac{1}{\mathrm{mod}(x_i, 1)} & \text{otherwise} \end{cases}$   (0, 1)
4    Iterative [35]    $x_{i+1} = \sin\left(\frac{a\pi}{x_i}\right)$, a = 0.7                                        (−1, 1)
5    Logistic [35]     $x_{i+1} = a x_i (1 - x_i)$, a = 4                                                            (0, 1)
6    Piecewise [36]    $x_{i+1} = \begin{cases} \frac{x_i}{P} & 0 \le x_i < P \\ \frac{x_i - P}{0.5 - P} & P \le x_i < 0.5 \\ \frac{1 - P - x_i}{0.5 - P} & 0.5 \le x_i < 1 - P \\ \frac{1 - x_i}{P} & 1 - P \le x_i < 1 \end{cases}$, P = 0.4   (0, 1)
7    Sine [37]         $x_{i+1} = \frac{a}{4} \sin(\pi x_i)$, a = 4                                                  (0, 1)
8    Singer [38]       $x_{i+1} = \mu (7.86 x_i - 23.31 x_i^2 + 28.75 x_i^3 - 13.302875 x_i^4)$, μ = 1.07            (0, 1)
9    Sinusoidal [39]   $x_{i+1} = a x_i^2 \sin(\pi x_i)$, a = 2.3                                                    (0, 1)
10   Tent [40]         $x_{i+1} = \begin{cases} \frac{x_i}{0.7} & x_i < 0.7 \\ \frac{10}{3}(1 - x_i) & x_i \ge 0.7 \end{cases}$   (0, 1)
FIGURE 1.1 Visualization of chaotic maps.
1.3.1 CHAOTIC MAPS
We chose 10 different chaotic maps, which are presented in Table 1.1 and illustrated in Fig. 1.1. Note that the initial point of all chaotic maps is 0.7.
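As a small illustration of how such maps generate a sequence from the initial point 0.7, the following Python sketch implements four of the maps in Table 1.1 (Circle, Logistic, Sine, and Sinusoidal) with the parameter values listed there. The function names and the `iterate` helper are hypothetical and chosen only for this example.

```python
import math


def logistic_map(x, a=4.0):
    """Logistic map: x_{i+1} = a * x_i * (1 - x_i)."""
    return a * x * (1.0 - x)


def sine_map(x, a=4.0):
    """Sine map: x_{i+1} = (a / 4) * sin(pi * x_i)."""
    return (a / 4.0) * math.sin(math.pi * x)


def sinusoidal_map(x, a=2.3):
    """Sinusoidal map: x_{i+1} = a * x_i^2 * sin(pi * x_i)."""
    return a * x * x * math.sin(math.pi * x)


def circle_map(x, a=0.5, b=0.2):
    """Circle map: x_{i+1} = mod(x_i + b - (a / 2pi) * sin(2pi * x_i), 1)."""
    return (x + b - (a / (2.0 * math.pi)) * math.sin(2.0 * math.pi * x)) % 1.0


def iterate(chaotic_map, x0=0.7, steps=10):
    """Return the first `steps` values generated from the initial point x0."""
    values, x = [], x0
    for _ in range(steps):
        x = chaotic_map(x)
        values.append(x)
    return values


if __name__ == "__main__":
    for name, m in [("logistic", logistic_map), ("sine", sine_map),
                    ("sinusoidal", sinusoidal_map), ("circle", circle_map)]:
        print(name, [round(v, 4) for v in iterate(m, steps=5)])
```

Each map is fully deterministic: calling `iterate` twice with the same initial point yields the same sequence, even though the sequence looks irregular.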
FIGURE 1.2 Gravitational constant (G): t indicates current iteration and T is the maximum number of iterations.
FIGURE 1.3 General steps of chaos-based GSA.
1.3.2 INTEGRATING CHAOTIC MAPS WITH GSA
As may be seen in Section 1.2, the gravitational constant (G) defines the intensity of the total gravitational force between the masses. In other words, this variable defines the movement step for masses. This variable is depicted in Fig. 1.2. Fig. 1.2 shows that G has an adaptive value which balances exploration and exploitation. In other words, masses tend to move with big steps in initial iterations, whereas they move slowly in final iterations. We target this variable and replace it with the chaotic maps. Therefore, chaotic maps balance exploration and exploitation. Note that we normalize the return value of the chaotic maps to (0, 10) since their ranges lie in (0, 1) or (−1, 1). The general steps of chaos-based GSA algorithms are shown in Fig. 1.3. To see how the chaotic maps are theoretically efficient, some remarks are noted as follows:
• Chaotic maps do not follow any special ascending or descending pattern, so they provide different values for G over the course of iterations.
• Chaotic maps suddenly change the value of G, which assists the trapped masses to release themselves from local minima.
• Chaotic maps resolve the neutralized situation when the masses stick together in the last iterations, resulting in better convergence speed.
Table 1.2 Benchmark Functions

Benchmark function                                                                 Dim   Range            fmin
Unimodal
$F_1(x) = \sum_{i=1}^{n} (x_i + 40)^2$                                             30    [−100, 100]      −80
$F_2(x) = \sum_{i=1}^{n} |x_i + 7| + \prod_{i=1}^{n} |x_i + 7|$                    30    [−10, 10]        −80
$F_3(x) = \sum_{i=1}^{n} \left( \sum_{j=1}^{i} (x_j + 60) \right)^2$               30    [−100, 100]      −80
$F_4(x) = \max_i \{ |x_i + 60|,\ 1 \le i \le n \}$                                 30    [−100, 100]      −80
$F_5(x) = \sum_{i=1}^{n} \left( [(x_i + 60) + 0.5] \right)^2$                      30    [−100, 100]      −80
Multimodal
$F_6(x) = \sum_{i=1}^{n} i (x_i + 0.5)^4 + \mathrm{random}[0, 1)$                  30    [−1.28, 1.28]    −80
$F_7(x) = \sum_{i=1}^{n} -(x_i + 300) \sin\left( \sqrt{|x_i + 300|} \right)$       30    [−500, 500]      −418.9829 × 32
$F_8(x) = \sum_{i=1}^{n} \left[ (x_i + 2)^2 - 10 \cos(2\pi (x_i + 2)) + 10 \right]$   30    [−5.12, 5.12]    −80
$F_9(x) = \frac{1}{4000} \sum_{i=1}^{n} (x_i + 400)^2 - \prod_{i=1}^{n} \cos\left( \frac{x_i + 400}{\sqrt{i}} \right) + 1$   30    [−600, 600]      −80
$F_{10}(x) = 0.1 \left\{ \sin^2(3\pi (x_1 + 30)) + \sum_{i=1}^{n} ((x_i + 30) - 1)^2 \left[ 1 + \sin^2(3\pi (x_i + 30) + 1) \right] + ((x_n + 30) - 1)^2 \left[ 1 + \sin^2(2\pi (x_n + 30)) \right] \right\} + \sum_{i=1}^{n} u((x_i + 30), 5, 100, 4)$   30    [−50, 50]        −80
• Chaotic maps balance exploration and exploitation randomly, so each iteration emphasizes either exploration or exploitation randomly as well.
• There is no clear border between exploration and exploitation when chaotic maps define G. In other words, the current iteration does not necessarily have less exploration than the previous iteration, as is the case in GSA.
These remarks suggest that, in theory, chaos-based GSA is able to provide superior results compared to GSA. In the following sections, various benchmark functions and two real engineering problems are employed to probe the effectiveness of the proposed method in action.
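The replacement of G by a normalized chaotic value can be sketched as follows. This is a minimal sketch under stated assumptions: the rescaling to (0, 10) is taken to be a linear min-max mapping from the nominal range of the map, and the normalized chaotic value is used directly as G in each iteration. The text does not specify whether the exponential decay of Eq. (1.3) is retained, so this is only one possible reading. The names `normalize` and `chaotic_g_schedule` are hypothetical.

```python
def logistic_map(x, a=4.0):
    """Logistic map from Table 1.1, with nominal range (0, 1)."""
    return a * x * (1.0 - x)


def normalize(value, lo, hi, new_lo=0.0, new_hi=10.0):
    """Linearly rescale `value` from (lo, hi) to (new_lo, new_hi)."""
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)


def chaotic_g_schedule(chaotic_map, map_range, max_iter, x0=0.7):
    """Return one value of G per iteration, driven by a chaotic map.

    Assumption: the normalized chaotic value is used directly as G(t).
    """
    lo, hi = map_range
    x, schedule = x0, []
    for _ in range(max_iter):
        x = chaotic_map(x)
        schedule.append(normalize(x, lo, hi))
    return schedule


if __name__ == "__main__":
    g_values = chaotic_g_schedule(logistic_map, (0.0, 1.0), max_iter=500)
    print([round(g, 3) for g in g_values[:5]])
```

A map with range (−1, 1), such as the Chebyshev map, would be passed with `map_range=(-1.0, 1.0)`, so that a raw value of 0 maps to G = 5.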
1.4 EXPERIMENTAL RESULTS AND DISCUSSION
To evaluate the performance of the proposed chaos-based GSA algorithms, which we name CGSA algorithms, 10 standard benchmark functions are employed in this section [30–32]. We shift and bias these benchmark functions because most of them have their global optimum at [0, 0, . . . , 0] with the value 0. This makes them more challenging than the standard versions of the benchmark functions. The benchmark functions are unimodal or multimodal and are presented in Table 1.2. Fig. 1.4 and Fig. 1.5 illustrate the benchmark functions’ landscapes. The GSA algorithms have several parameters which must be defined before a run; they are listed in Table 1.3. Table 1.4 and Table 1.5 include the experimental results. The results are averaged over 20 independent runs, and the best results are indicated in bold type. The mean and standard deviation (std) of the best solutions obtained in the last iteration are reported. Note that CGSA1 to CGSA10 utilize Chebyshev, Circle, Gauss/mouse, Iterative, Logistic, Piecewise, Sine, Singer, Sinusoidal, and Tent maps, respectively. We have performed a statistical test called Wilcoxon’s rank-sum test [33] at the 5% significance level and report the p-values to see how significant the results are. In the tables, N/A indicates “not applicable,” meaning that the corresponding algorithm cannot be statistically compared with itself in the rank-sum test. According to Derrac et al. [34], p-values less than 0.05 can be considered strong evidence against the null hypothesis. Note that the p-values greater than 0.05 are underlined. In the following subsections, the simulation results of the benchmark functions are explained and discussed in terms of search performance and convergence behavior.

FIGURE 1.4 Search landscape of unimodal benchmark functions.

FIGURE 1.5 The 2-D versions of multimodal benchmark functions.

Table 1.3 Initial Parameters of GSA

Parameter           Value
Number of masses    30
G0                  Defined by chaos maps
α                   20
Max iterations      500
End criteria        Max iteration
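To make the role of the p-values concrete, the following Python sketch computes a two-sided Wilcoxon rank-sum p-value for two samples of final best fitness values. It is an illustrative sketch, not the procedure used to produce the tables. It assumes the normal approximation with a continuity correction, assigns average ranks to ties, and omits the tie correction of the variance. The function name `rank_sum_test` is hypothetical.

```python
import math


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (w, p), where w is the rank sum of sample_a.
    Simplification: no tie correction is applied to the variance.
    """
    combined = sorted(
        (value, group)
        for group, sample in enumerate((sample_a, sample_b))
        for value in sample
    )
    # Assign 1-based ranks, averaging the ranks of tied values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1, n2 = len(sample_a), len(sample_b)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = max(abs(w - mean_w) - 0.5, 0.0) / sd_w  # continuity correction
    p_value = math.erfc(z / math.sqrt(2.0))  # two-sided tail probability
    return w, p_value


if __name__ == "__main__":
    runs_a = [1.2, 0.9, 1.1, 1.0, 0.8]  # made-up values for illustration
    runs_b = [2.3, 2.1, 1.9, 2.5, 2.2]
    print(rank_sum_test(runs_a, runs_b))
```

A p-value below 0.05 would then be read as evidence that the two algorithms differ on that benchmark function, in line with the threshold stated above.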
1.4.1 SEARCH PERFORMANCE ANALYSIS
According to the results in Table 1.4, the CGSA9 algorithm provides the best values for mean, standard deviation, median, best and worst in four out of five unimodal benchmark functions. The reported p-values show that the results of CGSA9 are significantly better than the others on F1, F3, and F4. Moreover, the only unimodal function on which CGSA9 is not able to significantly outperform CGSA8 is F2, where CGSA8 attains the best mean and the p-value for CGSA9 is 0.969850. It should be noted that the unimodal benchmark functions have only one global solution and no local solutions, so they are highly suitable for examining exploitation. Therefore, these results indicate that the chaotic maps, especially the Sinusoidal map, remarkably improve the exploitation of GSA.

Table 1.4 Minimization Results for Unimodal Benchmark Functions

F1
Algorithm   Mean          Std           p-value
GSA         8535.725      1752.356      0.000183
CGSA1       5967.269      1509.351      0.000183
CGSA2       10,007.44     1844.353      0.000183
CGSA3       36,668.71     4324.225      0.000183
CGSA4       7639.627      2216.408      0.000183
CGSA5       6216.604      1887.684      0.000183
CGSA6       7322.941      2073.264      0.000183
CGSA7       7473.488      1927.256      0.000183
CGSA8       2248.686      1035.216      0.001706
CGSA9       685.5303      644.2982      N/A
CGSA10      7104.503      1985.21       0.000183

F2
Algorithm   Mean          Std           p-value
GSA         −17.7401      23.12693      0.000183
CGSA1       −68.9393      10.93766      0.031209
CGSA2       −37.275       28.75754      0.000330
CGSA3       73.79774      14.34458      0.000183
CGSA4       −53.8428      12.47357      0.000183
CGSA5       −55.1541      19.89842      0.000583
CGSA6       −62.5237      28.5446       0.025748
CGSA7       −48.0431      22.13837      0.000183
CGSA8       −77.8988      2.148916      N/A
CGSA9       −77.7143      2.823254      0.969850
CGSA10      −45.5366      37.1637       0.001706

F3
Algorithm   Mean          Std           p-value
GSA         8,784,707     1,025,225     0.000183
CGSA1       6,287,325     738,983.5     0.000183
CGSA2       7,104,232     1,302,761     0.000183
CGSA3       12,085,511    4,812,475     0.000183
CGSA4       6,310,414     1,189,323     0.000183
CGSA5       6,218,108     1,035,978     0.000183
CGSA6       5,946,812     1,365,596     0.000183
CGSA7       7,037,586     1,040,121     0.000183
CGSA8       2,496,080     509,080       0.000769
CGSA9       1,570,748     385,831.2     N/A
CGSA10      6,042,804     1,560,898     0.000183

F4
Algorithm   Mean          Std           p-value
GSA         −21.0901      3.943662      0.007285
CGSA1       −20.0642      2.648003      0.000583
CGSA2       −19.9401      2.638131      0.000769
CGSA3       −7.74999      2.469668      0.000183
CGSA4       −20.1518      1.741123      0.000583
CGSA5       −21.4954      3.165185      0.003611
CGSA6       −21.2829      3.923953      0.004586
CGSA7       −19.2442      3.250645      0.000769
CGSA8       −22.901       3.762992      0.037635
CGSA9       −26.2492      2.640308      N/A
CGSA10      −19.9083      2.14822       0.001008

F5
Algorithm   Mean          Std           p-value
GSA         37,225.14     3797.144      0.000183
CGSA1       37,335.69     4233.137      0.000183
CGSA2       41,753.73     4898.706      0.000183
CGSA3       92,523.34     10,773.51     0.000183
CGSA4       37,039.6      4607.813      0.000183
CGSA5       35,789.25     2084.051      0.000183
CGSA6       35,718.06     4357.043      0.000183
CGSA7       36,736.27     3954.342      0.000183
CGSA8       24,890.27     2975.678      0.000330
CGSA9       17,761.91     3059.145      N/A
CGSA10      35,745.48     3914.82       0.000183
Table 1.5 Minimization Results for Multimodal Benchmark Functions

F6
Algorithm   Mean          Std           p-value
GSA         −79.0508      1.747479      0.000769
CGSA1       −79.8414      0.043478      0.000183
CGSA2       −79.8529      0.038672      0.000183
CGSA3       −79.9555      0.011943      N/A
CGSA4       −79.8323      0.035578      0.000183
CGSA5       −79.8002      0.099999      0.000183
CGSA6       −79.8389      0.044786      0.000183
CGSA7       −79.786       0.080533      0.000183
CGSA8       −79.7652      0.09554       0.000183
CGSA9       −79.8462      0.04868       0.000183
CGSA10      −79.8278      0.057373      0.000183

F7
Algorithm   Mean          Std           p-value
GSA         −5328.37      872.6405      0.791337
CGSA1       −5049.34      894.5103      0.273036
CGSA2       −5026.92      986.9673      0.384673
CGSA3       −4868.63      702.0189      0.307489
CGSA4       −5360.47      843.4144      N/A
CGSA5       −5246.09      731.4918      0.909722
CGSA6       −4457.83      486.5712      0.01133
CGSA7       −4860.42      759.6471      0.212294
CGSA8       −5019.25      865.1992      0.57075
CGSA9       −4945.18      526.0627      0.307489
CGSA10      −4755.94      726.6477      0.10411

F8
Algorithm   Mean          Std           p-value
GSA         −21.9435      10.75352      0.520523
CGSA1       −23.9506      9.192801      0.909722
CGSA2       −7.66509      13.00643      0.003611
CGSA3       29.88285      9.637889      0.000183
CGSA4       −20.156       13.09336      0.161972
CGSA5       −17.7316      9.143735      0.088973
CGSA6       −25.2354      10.71948      1.000000
CGSA7       −25.9154      10.80922      N/A
CGSA8       −25.3564      8.890016      1.000000
CGSA9       −22.3142      6.219141      0.791337
CGSA10      −16.5256      12.39437      0.212294

F9
Algorithm   Mean          Std           p-value
GSA         859.5055      113.7342      0.520523
CGSA1       887.1435      80.5729       0.344704
CGSA2       851.2776      49.06254      0.677585
CGSA3       1136.725      161.5977      0.001706
CGSA4       867.0919      78.38844      0.520523
CGSA5       916.1836      95.82443      0.140465
CGSA6       853.3361      57.08126      0.677585
CGSA7       894.9651      88.05422      0.273036
CGSA8       859.7405      111.5811      0.57075
CGSA9       918.3724      101.866       0.185877
CGSA10      837.1875      137.4515      N/A

F10
Algorithm   Mean          Std           p-value
GSA         10,734,747    6,366,970     0.000183
CGSA1       261,468.4     696,918.3     0.000183
CGSA2       1,767,873     1,907,562     0.000183
CGSA3       3.96E+08      90,004,134    0.000183
CGSA4       284,115       309,348.3     0.000183
CGSA5       3,236,132     8,758,580     0.000183
CGSA6       853,252.7     1,924,391     0.000183
CGSA7       699,812.7     721,204.6     0.000183
CGSA8       −45.6361      4.46637       0.000183
CGSA9       −69.4133      3.281476      N/A
CGSA10      127,283.1     133,609.9     0.000183
On the other hand, multimodal test functions (F6 to F10) have many local minima, so they are well suited to benchmarking the exploration of an algorithm. The results for these test functions are presented in Table 1.5. The results in this table fluctuate considerably, which is due to the complex nature of these functions. According to these results, each of the Gauss/mouse, Iterative, Sine, Sinusoidal, and Tent maps provides the best results for one of the multimodal benchmark functions. However, the p-values show that only the Gauss/mouse and Sinusoidal maps provide significantly better results, for F6 and F10, respectively. To sum up, the results on unimodal functions strongly indicate that CGSA algorithms have superior exploitation. The results obtained on multimodal functions also show that the chaotic maps are capable of exploring the search space efficiently. Among the CGSA algorithms, CGSA9, which uses the Sinusoidal map, provides significantly better results on 40% (4 out of 10) of the benchmark functions. In the next subsection the convergence behavior of the algorithms is investigated.

FIGURE 1.6 Convergence curves for unimodal benchmark functions.
1.4.2 CONVERGENCE ANALYSIS The convergence curves are illustrated in Fig. 1.6 and Fig. 1.7. As may be seen from these figures, CGSA9 has the best convergence rate on unimodal benchmark functions. Generally, the convergence curves that belong to CGSA9 accelerate as iterations are being run. This behavior and the characteristic of unimodal functions indicate that the proposed method successfully improves the convergence rate of GSA and consequently exploitation. The convergence curves for multimodal benchmark functions follow a different scenario. The CGSA algorithms have worse convergence speed than GSA in the initial iterations. However, the search process is accelerated during iterations for CGSA algorithms. This acceleration helps CGSAs to surpass GSA at the end. This is the effect of chaotic maps which makes CGSA capable of balancing
1.5 CGSA FOR ENGINEERING DESIGN PROBLEMS
11
FIGURE 1.7 Convergence curves for multimodal benchmark functions.
exploration and exploitation properly to find the global optimum. Overall, these results show that the chaotic maps are able to significantly mitigate the drawbacks of GSA. In the next section, the performance of the CGSA algorithms is examined on constrained classical engineering problems to draw the final conclusions.
1.5 CGSA FOR ENGINEERING DESIGN PROBLEMS
In this section, two engineering design problems are employed to further benchmark the performance of the CGSA algorithms in solving constrained real problems.
1.5.1 WELDED BEAM DESIGN
In this problem, the objective is to minimize the fabrication cost of a welded beam, as shown in Fig. 1.8 [35,36]. The constraints are shear stress (τ), bending stress in the beam (σ), buckling load on the bar (Pc), end deflection of the beam (δ), and side constraints. This problem has four variables: thickness of the weld (h), length of the attached part of the bar (l), height of the bar (t), and thickness of the bar (b).
FIGURE 1.8 Structure of welded beam design.
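The mathematical formulation is as follows:

$$
\begin{aligned}
&\text{Consider } \vec{x} = [x_1\ x_2\ x_3\ x_4] = [h\ l\ t\ b],\\
&\text{Minimize } f(\vec{x}) = 1.10471\,x_1^2 x_2 + 0.04811\,x_3 x_4 (14.0 + x_2),\\
&\text{Subject to } g_1(\vec{x}) = \tau(\vec{x}) - \tau_{\max} \le 0,\\
&\qquad g_2(\vec{x}) = \sigma(\vec{x}) - \sigma_{\max} \le 0,\\
&\qquad g_3(\vec{x}) = \delta(\vec{x}) - \delta_{\max} \le 0,\\
&\qquad g_4(\vec{x}) = x_1 - x_4 \le 0,\\
&\qquad g_5(\vec{x}) = P - P_c(\vec{x}) \le 0,\\
&\qquad g_6(\vec{x}) = 0.125 - x_1 \le 0,\\
&\qquad g_7(\vec{x}) = 1.10471\,x_1^2 + 0.04811\,x_3 x_4 (14.0 + x_2) - 5.0 \le 0,\\
&\text{Variable range } 0.1 \le x_1 \le 2,\quad 0.1 \le x_2 \le 10,\quad 0.1 \le x_3 \le 10,\quad 0.1 \le x_4 \le 2,
\end{aligned}
\tag{1.10}
$$

where

$$
\begin{aligned}
\tau(\vec{x}) &= \sqrt{(\tau')^2 + 2\tau'\tau''\frac{x_2}{2R} + (\tau'')^2}, \qquad
\tau' = \frac{P}{\sqrt{2}\,x_1 x_2}, \qquad
\tau'' = \frac{MR}{J}, \qquad
M = P\left(L + \frac{x_2}{2}\right),\\
R &= \sqrt{\frac{x_2^2}{4} + \left(\frac{x_1 + x_3}{2}\right)^2}, \qquad
J = 2\left\{\sqrt{2}\,x_1 x_2\left[\frac{x_2^2}{4} + \left(\frac{x_1 + x_3}{2}\right)^2\right]\right\},\\
\sigma(\vec{x}) &= \frac{6PL}{x_4 x_3^2}, \qquad
\delta(\vec{x}) = \frac{6PL^3}{E x_3^2 x_4}, \qquad
P_c(\vec{x}) = \frac{4.013\,E\sqrt{\dfrac{x_3^2 x_4^6}{36}}}{L^2}\left(1 - \frac{x_3}{2L}\sqrt{\frac{E}{4G}}\right),
\end{aligned}
$$

P = 6000 lb, L = 14 in, δmax = 0.25 in, E = 30 × 10^6 psi, G = 12 × 10^6 psi, τmax = 13600 psi, σmax = 30000 psi.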
Table 1.6 Comparison Results for Welded Beam Design Problem

| Algorithm | h | l | t | b | Optimum cost |
|---|---|---|---|---|---|
| CGSA3 (Gauss/mouse) | 0.180146 | 4.133338 | 9.06236 | 0.205738 | 1.774739 |
| GSA | 0.182129 | 3.856979 | 10.0000 | 0.202376 | 1.879952 |
| GA (Carlos and Coello) | N/A | N/A | N/A | N/A | 1.8245 |
| GA (Deb) | N/A | N/A | N/A | N/A | 2.3800 |
| GA (Deb) | 0.2489 | 6.1730 | 8.1789 | 0.2533 | 2.4331 |
| HS (Lee and Geem) | 0.2442 | 6.2231 | 8.2915 | 0.2443 | 2.3807 |
| Random | 0.4575 | 4.7313 | 5.0853 | 0.6600 | 4.1185 |
| Simplex | 0.2792 | 5.6256 | 7.7512 | 0.2796 | 2.5307 |
| David | 0.2434 | 6.2552 | 8.2915 | 0.2444 | 2.3841 |
| APPROX | 0.2444 | 6.2189 | 8.2915 | 0.2444 | 2.3815 |

The columns h, l, t, and b are the optimum variables.
FIGURE 1.9 Pressure vessel problem.
In the literature, Carlos and Coello [37] and Deb [38,39] employed GA, whereas Lee and Geem [40] utilized HS to solve this problem. There are also mathematical approaches that have been adopted by Ragsdell and Phillips [41] for this problem. The comparison results are provided in Table 1.6. The results show that the Gauss/mouse map has the highest performance for this problem. Considering the behavior of the Gauss/mouse map in Fig. 1.1, it can be seen that this algorithm emphasizes exploration in the first iterations and then focuses on exploitation. Meanwhile, the sudden changes in the value of G prevent the algorithm from being trapped in local minima.
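The formulation in Eq. (1.10) can be sketched in code as a quick sanity check on Table 1.6. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, E is assumed to be 30 × 10^6 psi, and the expressions for δ and J follow Eq. (1.10) as printed in this chapter.

```python
import math

# Constants of Eq. (1.10); E = 30e6 psi is an assumption about the intended value.
P, L = 6000.0, 14.0
E, G = 30e6, 12e6
TAU_MAX, SIGMA_MAX, DELTA_MAX = 13600.0, 30000.0, 0.25


def welded_beam_cost(x):
    """Fabrication cost f(x) for x = (h, l, t, b)."""
    h, l, t, b = x
    return 1.10471 * h**2 * l + 0.04811 * t * b * (14.0 + l)


def welded_beam_constraints(x):
    """Return [g1, ..., g7]; a design is feasible when every entry is <= 0."""
    h, l, t, b = x
    tau_p = P / (math.sqrt(2.0) * h * l)                      # tau'
    m = P * (L + l / 2.0)                                     # M
    r = math.sqrt(l**2 / 4.0 + ((h + t) / 2.0) ** 2)          # R
    j = 2.0 * (math.sqrt(2.0) * h * l * (l**2 / 4.0 + ((h + t) / 2.0) ** 2))
    tau_pp = m * r / j                                        # tau''
    tau = math.sqrt(tau_p**2 + 2.0 * tau_p * tau_pp * l / (2.0 * r) + tau_pp**2)
    sigma = 6.0 * P * L / (b * t**2)
    delta = 6.0 * P * L**3 / (E * t**2 * b)
    p_c = (4.013 * E * math.sqrt(t**2 * b**6 / 36.0) / L**2) * (
        1.0 - (t / (2.0 * L)) * math.sqrt(E / (4.0 * G))
    )
    return [
        tau - TAU_MAX,
        sigma - SIGMA_MAX,
        delta - DELTA_MAX,
        h - b,
        P - p_c,
        0.125 - h,
        1.10471 * h**2 + 0.04811 * t * b * (14.0 + l) - 5.0,
    ]


# CGSA3 (Gauss/mouse) row of Table 1.6.
cgsa3 = (0.180146, 4.133338, 9.06236, 0.205738)
print(welded_beam_cost(cgsa3))  # should be close to the 1.774739 reported in Table 1.6
```

Evaluating the cost at the tabulated variables is a cheap way to confirm that a reported optimum and its variables are mutually consistent; the constraint function is included so that the same design can be checked for feasibility.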
1.5.2 PRESSURE VESSEL DESIGN
The objective of this problem is to minimize the total cost (material, forming, and welding) of a cylindrical vessel, as shown in Fig. 1.9 [42]. Both ends of the vessel are capped, and the head has a hemispherical shape. There are four variables in this problem: thickness of the shell (Ts), thickness of the head (Th), inner radius (R), and length of the cylindrical section without considering the head (L). This problem is subject to four constraints.
Table 1.7 Comparison Results for Pressure Vessel Design Problem

| Algorithm | Ts | Th | R | L | Optimum cost |
|---|---|---|---|---|---|
| CGSA9 (Sinusoidal) | 0.890827 | 0.439804 | 45.86797 | 135.1164 | 6143.1240 |
| GSA | 1.080581 | 7.897191 | 55.988659 | 84.4542025 | 48,807.290 |
| GA (Coello) | 0.812500 | 0.434500 | 40.323900 | 200.000000 | 6288.7445 |
| GA (Deb and Gene) | 0.937500 | 0.500000 | 48.329000 | 112.679000 | 6410.3811 |
| Lagrangian Multiple (Kannan) | 1.125000 | 0.625000 | 58.291000 | 43.6900000 | 7198.0428 |
| branch-bound (Sandgren) | 1.125000 | 0.625000 | 47.700000 | 117.701000 | 8129.1036 |

The columns Ts, Th, R, and L are the optimum variables.
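These constraints and the problem are formulated as follows:

$$
\begin{aligned}
&\text{Consider } \vec{x} = [x_1\ x_2\ x_3\ x_4] = [T_s\ T_h\ R\ L],\\
&\text{Minimize } f(\vec{x}) = 0.6224\,x_1 x_3 x_4 + 1.7781\,x_2 x_3^2 + 3.1661\,x_1^2 x_4 + 19.84\,x_1^2 x_3,\\
&\text{Subject to } g_1(\vec{x}) = -x_1 + 0.0193\,x_3 \le 0,\\
&\qquad g_2(\vec{x}) = -x_2 + 0.00954\,x_3 \le 0,\\
&\qquad g_3(\vec{x}) = -\pi x_3^2 x_4 - \tfrac{4}{3}\pi x_3^3 + 1296000 \le 0,\\
&\qquad g_4(\vec{x}) = x_4 - 240 \le 0,
\end{aligned}
\tag{1.11}
$$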
Variable range: 0 ≤ x1 ≤ 99, 0 ≤ x2 ≤ 99, 10 ≤ x3 ≤ 200, 10 ≤ x4 ≤ 200.

This problem has also been optimized in the literature. Some of the adopted methods are GA [43,44], the augmented Lagrangian multiplier method [45], and branch-and-bound [46]. The comparison results of the algorithms are presented in Table 1.7. According to this table, the CGSA9 (Sinusoidal) algorithm outperforms the other algorithms. In addition, GSA surprisingly provides the worst result by a wide margin. A possible reason is that GSA is not able to discover new feasible areas in the search space when masses must cross an infeasible area to reach them. Moreover, infeasible masses are extremely light when constraints are handled with a penalty function or by assigning a large fitness value, so they are attracted back once they leave feasible areas. However, the Sinusoidal map helps masses to cross infeasible areas of the search space through sudden changes in G, as illustrated in Fig. 1.1. In addition, the final value of G lies in (5, 10) for the Sinusoidal map, meaning that this map favors exploration. This helps CGSA9 to explore more feasible areas and provide the best solution. Therefore, it can be stated that the chaotic maps help masses not only to accelerate toward the best solution but also to discover new feasible areas. To sum up, the results show that the CGSA algorithms outperform GSA on the majority of the benchmark functions. Furthermore, the results on classical engineering problems show that CGSA3 and CGSA9 are capable of providing very competitive results, indicating the superior performance of these algorithms in solving constrained problems as well. According to this comparative study, we
confidently state that the chaotic maps, especially Sinusoidal followed by Gauss/mouse, have merit in improving the performance of GSA in terms of avoiding local minima and convergence speed.
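The pressure vessel formulation in Eq. (1.11) can be checked against Table 1.7 in the same spirit. The following is a minimal sketch with hypothetical function names; it assumes that the second constraint bounds the head thickness, g2 = −x2 + 0.00954 x3, as in the usual statement of this problem.

```python
import math


def pressure_vessel_cost(x):
    """Total cost f(x) for x = (Ts, Th, R, L)."""
    ts, th, r, length = x
    return (0.6224 * ts * r * length + 1.7781 * th * r**2
            + 3.1661 * ts**2 * length + 19.84 * ts**2 * r)


def pressure_vessel_constraints(x):
    """Return [g1, g2, g3, g4]; a design is feasible when every entry is <= 0."""
    ts, th, r, length = x
    return [
        -ts + 0.0193 * r,
        -th + 0.00954 * r,  # assumption: head thickness x2, not x3
        -math.pi * r**2 * length - (4.0 / 3.0) * math.pi * r**3 + 1296000.0,
        length - 240.0,
    ]


# CGSA9 (Sinusoidal) row of Table 1.7.
cgsa9 = (0.890827, 0.439804, 45.86797, 135.1164)
print(pressure_vessel_cost(cgsa9))  # should be close to the 6143.1240 reported in Table 1.7
print(all(g <= 0.0 for g in pressure_vessel_constraints(cgsa9)))
```

Small differences from the tabulated cost are expected because the variables in Table 1.7 are rounded.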
1.6 CONCLUSION
In this work, the performance of the GSA algorithm was improved by chaotic maps. Ten different chaotic maps were applied to GSA. The proposed CGSA algorithms showed superior performance on 10 benchmark functions in terms of avoiding local minima and convergence rate. The conducted Wilcoxon's rank sum test allowed us to identify the most significant chaotic maps (Sinusoidal and Gauss/mouse). Furthermore, two classical engineering design problems were employed to evaluate the performance of the Sinusoidal and Gauss/mouse maps in solving constrained real problems. The results also verified the superior performance of both maps in optimizing constrained problems. For future studies, it would be interesting to employ CGSA algorithms for solving real-world engineering problems. In addition, other chaotic maps are also worth applying to GSA.
REFERENCES
[1] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evolut. Comput. 1 (1) (1997) 67–82.
[2] J.H. Holland, Genetic algorithms, Sci. Am. 267 (1) (1992) 66–72.
[3] K. Price, R. Storn, Differential evolution, Dr. Dobb’s J. 22 (4) (1997) 18–20.
[4] K.V. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to Global Optimization, Springer-Verlag, New York Inc., 2005.
[5] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of IEEE International Conference on Neural Networks, IEEE, 1995.
[6] Y. Shi, R. Eberhart, A modified particle swarm optimizer, in: IEEE International Conference on Evolutionary Computation, IEEE, Anchorage, Alaska, 1998.
[7] M. Dorigo, G. Di Caro, Ant Colony Optimization: A New Meta-Heuristic, IEEE, 1999.
[8] E. Rashedi, H. Nezamabadi-Pour, S. Saryazdi, GSA: a gravitational search algorithm, Inform. Sci. 179 (13) (2009) 2232–2248.
[9] H. Sivaraj, G. Gopalakrishnan, Random walk based heuristic algorithms for distributed memory model checking, Electron. Notes Theor. Comput. Sci. 89 (1) (2003) 51–67.
[10] X.S. Yang, S. Deb, Eagle strategy using Levy walk and firefly algorithms for stochastic optimization, in: Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), 2010, pp. 101–111.
[11] N. Noman, H. Iba, Accelerating differential evolution using an adaptive local search, IEEE Trans. Evolut. Comput. 12 (1) (2008) 107–125.
[12] J. Chen, et al., Particle Swarm Optimization with Local Search, IEEE, 2005.
[13] S. Chen, et al., Identification of Nonlinear System Based on a New Hybrid Gradient-Based PSO Algorithm, IEEE, 2007.
[14] N. Meuleau, M. Dorigo, Ant colony optimization and stochastic gradient descent, Artif. Life 8 (2) (2002) 103–121.
[15] B. Alatas, E. Akin, A.B. Ozer, Chaos embedded particle swarm optimization algorithms, Chaos Solitons Fractals 40 (4) (2009) 1715–1734.
[16] B. Alatas, Chaotic bee colony algorithms for global numerical optimization, Expert Syst. Appl. 37 (8) (2010) 5682–5687.
[17] B. Alatas, Chaotic harmony search algorithms, Appl. Math. Comput. 216 (9) (2010) 2687–2699.
[18] A.H. Gandomi, et al., Chaos-enhanced accelerated particle swarm optimization, Commun. Nonlinear Sci. Numer. Simul. (2012).
[19] J. Yao, et al., A new optimization approach-chaos genetic algorithm, Syst. Eng. 1 (2001) 015.
[20] G. Zhenyu, et al., Self-adaptive chaos differential evolution, Adv. Nat. Comput. (2006) 972–975.
[21] J. Mingjun, T. Huanwen, Application of chaos in simulated annealing, Chaos Solitons Fractals 21 (4) (2004) 933–941.
[22] A. Gandomi, et al., Firefly algorithm with chaos, Commun. Nonlinear Sci. Numer. Simul. (2012).
[23] L.S. Coelho, V.C. Mariani, Firefly algorithm approach based on chaotic Tinkerbell map applied to multivariable PID controller tuning, Comput. Math. Appl. (2012).
[24] S.H. Kellert, In the Wake of Chaos: Unpredictable Order in Dynamical Systems, University of Chicago Press, 1993.
[25] Z. Chen, et al., Improved gravitational search algorithm for parameter identification of water turbine regulation system, Energy Convers. Manag. 78 (2014) 306–315.
[26] S. Mirjalili, S.Z.M. Hashim, H.M. Sardroudi, Training feedforward neural networks using hybrid particle swarm optimization and gravitational search algorithm, Appl. Math. Comput. 218 (22) (2012) 11125–11137.
[27] S. Mirjalili, A. Lewis, Adaptive gbest-guided gravitational search algorithm, Neural Comput. Appl. 25 (7–8) (2014) 1569–1584.
[28] C. Purcaru, et al., Hybrid PSO-GSA robot path planning algorithm in static environments with danger zones, in: System Theory, Control and Computing (ICSTCC), 2013 17th International Conference, IEEE, 2013.
[29] C. Wang, K. Gao, J. Guo, An improved gravitational search algorithm based on neighbor search, in: 2013 Ninth International Conference on Natural Computation (ICNC), IEEE, 2013.
[30] X. Yao, Y. Liu, G. Lin, Evolutionary programming made faster, IEEE Trans. Evolut. Comput. 3 (2) (1999) 82–102.
[31] J. Digalakis, K. Margaritis, On benchmarking functions for genetic algorithms, Int. J. Comput. Math. 77 (4) (2001) 481–506.
[32] S. Mirjalili, S.Z.M. Hashim, A New Hybrid PSOGSA Algorithm for Function Optimization, IEEE, 2010.
[33] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (6) (1945) 80–83.
[34] J. Derrac, et al., A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evolut. Comput. (2011).
[35] A.H. Gandomi, X.-S. Yang, A.H. Alavi, Mixed variable structural optimization using Firefly Algorithm, Comput. Struct. 89 (23–24) (2011) 2325–2336.
[36] A. Gandomi, et al., Bat algorithm for constrained optimization tasks, Neural Comput. Appl. 22 (6) (2013) 1239–1255.
[37] A. Carlos, C. Coello, Constraint-handling using an evolutionary multiobjective optimization technique, Civil Eng. Syst. 17 (4) (2000) 319–346.
[38] K. Deb, Optimal design of a welded beam via genetic algorithms, AIAA J. 29 (11) (1991) 2013–2015.
[39] K. Deb, An efficient constraint handling method for genetic algorithms, Comput. Methods Appl. Mech. Engrg. 186 (2) (2000) 311–338.
[40] K.S. Lee, Z.W. Geem, A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice, Comput. Methods Appl. Mech. Engrg. 194 (36) (2005) 3902–3933.
[41] K. Ragsdell, D. Phillips, Optimal design of a class of welded structures using geometric programming, ASME J. Eng. Ind. 98 (3) (1976) 1021–1025.
[42] A. Gandomi, X.-S. Yang, A. Alavi, Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems, Eng. Comput. 29 (1) (2013) 17–35.
[43] C.A. Coello Coello, Use of a self-adaptive penalty approach for engineering optimization problems, Comput. Ind. 41 (2) (2000) 113–127.
[44] K. Deb, A.S. Gene, A robust optimal design technique for mechanical component design, in: D. Dasgupta, Z. Michalewicz (Eds.), Evolutionary Algorithms in Engineering Applications, Springer-Verlag, Berlin, 1997, pp. 497–514.
[45] B. Kannan, S.N. Kramer, An augmented Lagrange multiplier based method for mixed integer discrete continuous optimization and its applications to mechanical design, J. Mech. Des. 116 (1994) 405.
[46] E. Sandgren, Nonlinear Integer Discret. Program. Mech. Des. (1988).
CHAPTER 2
TEXTURES AND ROUGH SETS
Murat Diker∗
∗Hacettepe University, Ankara, Turkey
2.1 INTRODUCTION
This chapter presents the motivation and basic results on textures and rough sets, which are scattered in the literature. The concept of a texture space was introduced by Lawrence M. Brown at the 2nd BUFSA Conference in 1992 as an alternative crisp point-set based setting for fuzzy sets under the name of a fuzzy structure. A p-set of a fuzzy structure plays an important role in this setting, just as a cell does in a biological tissue, and hence the name texture space was suggested for a fuzzy structure by Siegfried Gottwald. The first paper about textures was published by Brown and Diker in 1998 in the Special Issue of the Journal of Fuzzy Sets and Systems entitled Topics of the Mathematics of Fuzzy Objects [2]. A texture space, or a texture for short, provides a useful environment for the study of complement-free mathematical concepts. The p-set and q-set enable us to determine the dual concepts in textures. In this respect, textures combine bitopological, topological, and fuzzy topological concepts using ditopology (dichotomous topology) [2,3,5,6,14,15,18,19,26,30–33]. There is a strong connection between the theories of rough sets and texture spaces [8–17,30,31]. Here, we present two main points related to approximation operators and definability in rough set theory: the first is sections and presections as lower and upper approximations; the second is difunctions in terms of definability.
2.2 FUZZY LATTICES
A texture is a completely distributive lattice with respect to the set inclusion order. Therefore, the basic concepts of lattice theory are essential for the theory of textures. A relation $\le$ on a non-empty set $L$ is called a partial order if for all $u, v, w \in L$ the following conditions hold: $u \le u$ (reflexivity), $u \le v$ and $v \le u \Rightarrow u = v$ (anti-symmetry), and $u \le v$ and $v \le w \Rightarrow u \le w$ (transitivity). Then the pair $(L, \le)$ is called a partially ordered set. Now let $A$ be a non-empty subset of $L$. The following are basic concepts for partially ordered sets: $u \in L$ is called a lower bound of $A$ if $u \le a$ for all $a \in A$, and an upper bound of $A$ if $a \le u$ for all $a \in A$.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00002-8 Copyright © 2017 Elsevier Inc. All rights reserved.
Further, $v \in L$ is called the greatest lower bound of $A$ if $v$ is a lower bound of $A$ and $u \le v$ for every lower bound $u$ of $A$, and the least upper bound of $A$ if $v$ is an upper bound of $A$ and $v \le u$ for every upper bound $u$ of $A$. A partially ordered set is called a lattice if for any two elements $u, v \in L$, the set $\{u, v\}$ has a least upper bound and a greatest lower bound. A lattice $L$ is said to be complete if for each non-empty subset $A$ of $L$, the least upper bound $\bigvee A$ (join) and the greatest lower bound $\bigwedge A$ (meet) exist. Completely distributive lattices play an important role in algebra and topology. A lattice $L$ is called completely distributive if for every index set $K$, we have

$$\bigwedge_{k \in K} \bigvee_{j \in J_k} a_{jk} = \bigvee_{\gamma \in C} \bigwedge_{k \in K} a_{\gamma(k)k}$$

where $C$ is the set of all choice functions $\gamma : K \to \bigcup_{k \in K} J_k$ such that $\gamma(k) \in J_k \subseteq K$ and $a_{jk} \in L$. On the other hand, $L$ is called a fuzzy lattice [34] if it is completely distributive and there is a mapping $' : L \to L$, called an order reversing involution, satisfying the conditions
(a) $(u')' = u$, $\forall u \in L$,
(b) $u \le v \Rightarrow v' \le u'$, $\forall u, v \in L$.
A non-zero element $m$ of a lattice $L$ is called a molecule if $\forall a, b \in L$, $m \le a \vee b \Rightarrow m \le a$ or $m \le b$. A natural example of a fuzzy lattice is the family $\mathcal{P}(U)$ of all subsets of the universe $U$ with the set inclusion order $\subseteq$. Clearly, the molecules of $\mathcal{P}(U)$ are exactly the singleton sets $\{u\}$, $u \in U$.
The Unit Interval [0, 1]
The unit interval $[0, 1]$ equipped with the usual ordering $\le$ is also a fuzzy lattice (see Example 1.1.35(i) in [34]). Here, the molecules of $[0, 1]$ are the non-zero elements, that is, the set of all molecules is the interval $M_{[0,1]} = (0, 1]$. Now for $u \in [0, 1]$, consider the set $\hat{u} = \{v \mid v \in (0, 1] \text{ and } v \le u\}$. Clearly, $\hat{u} = (0, u]$. Here, for $u = 0$, we have $\hat{u} = \emptyset$. Then the family $\mathcal{M}_{[0,1]} = \{\hat{u} \mid u \in [0, 1]\}$ has the following properties:
(i) We have $u \le v \iff \hat{u} \subseteq \hat{v}$, and then the mapping $\hat{\cdot} : [0, 1] \to \mathcal{M}_{[0,1]}$ is a lattice isomorphism. Hence, $(\mathcal{M}_{[0,1]}, \subseteq)$ is a complete lattice containing $(0, 1]$ and $\emptyset$. Arbitrary meets coincide with intersections. However, for the family $\{(0, 1 - 1/n] \mid n \in \mathbb{N}\} \subseteq \mathcal{M}_{[0,1]}$, we have

$$\bigvee_{n \in \mathbb{N}} (0, 1 - 1/n] = (0, 1] \quad \text{and} \quad \bigcup_{n \in \mathbb{N}} (0, 1 - 1/n] = (0, 1),$$

and then only finite joins coincide with unions in $\mathcal{M}_{[0,1]}$.
(ii) The mapping $\hat{\cdot} : [0, 1] \to \mathcal{M}_{[0,1]}$ is a lattice isomorphism and, hence, $(\mathcal{M}_{[0,1]}, \subseteq)$ is also completely distributive.
(iii) $\mathcal{M}_{[0,1]}$ separates the points of $U = (0, 1]$ in the following sense: let $u \ne v$ in $U$ and suppose that $u < v$. Then for $(0, u] \in \mathcal{M}_{[0,1]}$, we have $u \in (0, u]$ and $v \notin (0, u]$.
2.3 TEXTURE SPACES
By the arguments in the preceding section, on the one hand, we have the fuzzy lattice $[0, 1]$ with its molecules; on the other hand, we have a set system $((0, 1], \mathcal{M}_{[0,1]})$ with crisp points and crisp sets which represents the fuzzy lattice $[0, 1]$ under the lattice isomorphism $u \mapsto \hat{u}$. This leads us to the following concept:

Definition 2.1. Let $U$ be a set. Then $\mathcal{U} \subseteq \mathcal{P}(U)$ is called a texturing of $U$, and $(U, \mathcal{U})$ is called a texture space, or simply a texture, if
(i) $(\mathcal{U}, \subseteq)$ is a complete lattice containing $U$ and $\emptyset$, which has the property that arbitrary meets coincide with intersections, and finite joins coincide with unions, that is, for every index set $K$,

$$\bigwedge_{k \in K} A_k = \bigcap_{k \in K} A_k$$

and for every finite index set $K$,

$$\bigvee_{k \in K} A_k = \bigcup_{k \in K} A_k$$

where $\{A_k \mid k \in K\} \subseteq \mathcal{U}$.
(ii) $\mathcal{U}$ is completely distributive.
(iii) $\mathcal{U}$ separates the points of $U$. That is, given $u \ne v$ in $U$ there exists $A \in \mathcal{U}$ such that $u \in A$, $v \notin A$, or $u \notin A$, $v \in A$.

A texture may not be closed under set complementation. However, we may take a mapping $c_U : \mathcal{U} \to \mathcal{U}$ as a complementation satisfying the conditions
(i) $\forall A \in \mathcal{U}$, $c_U(c_U(A)) = A$,
(ii) $\forall A, B \in \mathcal{U}$, $A \subseteq B \Rightarrow c_U(B) \subseteq c_U(A)$.
Then the triple $(U, \mathcal{U}, c_U)$ is called a complemented texture. A texture is a lattice with respect to set inclusion. Hence, we may speak of the molecules of a texture: a non-empty set $A \in \mathcal{U}$ is a molecule if for all $B, C \in \mathcal{U}$, $A \subseteq B \cup C \Rightarrow A \subseteq B$ or $A \subseteq C$. One of the important points in textures is the equivalence of complete distributivity. To discuss this equivalence, let us consider the following basic tools of textures:

Definition 2.2. For $u \in U$, the p-set and q-set are defined by

$$P_u = \bigcap \{A \in \mathcal{U} \mid u \in A\} \quad \text{and} \quad Q_u = \bigvee \{A \in \mathcal{U} \mid u \notin A\},$$

respectively.

Theorem 2.3. For $u \in U$, $Q_u = \bigvee \{P_v \mid u \notin P_v\}$.

Proof. First, note that for all $A \in \mathcal{U}$, we have $A = \bigvee_{w \in A} P_w$. Indeed, by the definition of the p-set, for $w \in A$, we have $P_w \subseteq A$ and then $\bigvee_{w \in A} P_w \subseteq A$. Further, we clearly have $A \subseteq \bigvee_{w \in A} P_w$. Now for some
$v \in U$, if $u \notin P_v$, then by the definition of $Q_u$ we have $P_v \subseteq Q_u$. Hence, $\bigvee \{P_v \mid u \notin P_v\} \subseteq Q_u$. Conversely, if $u \notin A$ and $A \in \mathcal{U}$, then for all $w \in A$ we have $u \notin P_w$. This implies that for all $w \in A$, we have $P_w \subseteq \bigvee \{P_v \mid u \notin P_v\}$, that is, $A = \bigvee_{w \in A} P_w \subseteq \bigvee \{P_v \mid u \notin P_v\}$. Then we obtain $Q_u \subseteq \bigvee \{P_v \mid u \notin P_v\}$.

Complete distributivity of textures can be written in terms of the p-set and q-set:

Theorem 2.4. [5,7] Let $(\mathcal{U}, \subseteq)$ be a complete lattice. The following statements are equivalent:
(i) $(U, \mathcal{U})$ is completely distributive.
(ii) For $A, B \in \mathcal{U}$, if $A \not\subseteq B$ then there exists $u \in U$ with $A \not\subseteq Q_u$ and $P_u \not\subseteq B$.
(iii) $\forall A \in \mathcal{U}$, $A = \bigvee \{P_u \mid A \not\subseteq Q_u\}$.
(iv) $\forall A \in \mathcal{U}$, $A = \bigcap \{Q_u \mid P_u \not\subseteq A\}$.

Example 2.5. (i) The pair $(U, \mathcal{P}(U))$ is a complemented texture, where $\mathcal{P}(U)$ is the power set of $U$. It is called a discrete texture. Indeed, since $\mathcal{P}(U)$ is closed under intersection, by Theorem 6 in [1], it is a complete lattice. Further, we have $P_u = \{u\}$ and $Q_u = U \setminus \{u\}$. If $A, B \subseteq U$ and $A \not\subseteq B$, then for some $u \in U$ we have $u \in A$ and $u \notin B$. This implies that $A \not\subseteq U \setminus \{u\} = Q_u$ and $P_u \not\subseteq B$. Hence, by Theorem 2.4, $\mathcal{P}(U)$ is completely distributive. Clearly, $\mathcal{P}(U)$ separates the points of $U$. The mapping $c_U : \mathcal{P}(U) \to \mathcal{P}(U)$ defined by $c_U(A) = U \setminus A$, $\forall A \in \mathcal{P}(U)$, is the ordinary complementation on $(U, \mathcal{P}(U))$.
(ii) The pair $((0, 1], \mathcal{M}_{[0,1]})$ is a texture, where $\mathcal{M}_{[0,1]} = \{(0, r] \mid r \in [0, 1]\}$. It is called a fuzzy texture. Clearly, $(0, 1], \emptyset \in \mathcal{M}_{[0,1]}$ and it is closed under set intersection. Then by Theorem 6 in [1], it is a complete lattice. It is easy to see that the p-set and q-set of $\mathcal{M}_{[0,1]}$ are $P_r = (0, r] = Q_r$ for all $r \in (0, 1]$. Further, let $A, B \in \mathcal{M}_{[0,1]}$ and $A \not\subseteq B$. Then we have $A = (0, r]$ and $B = (0, q]$ for some $r, q \in (0, 1]$. Now $(0, r] \not\subseteq (0, q]$ implies that $q < r$, and then for some $t \in (0, 1]$ with $q < t < r$, we obtain that $A = (0, r] \not\subseteq (0, t] = Q_t$ and $P_t = (0, t] \not\subseteq (0, q] = B$. Therefore, by Theorem 2.4, $\mathcal{M}_{[0,1]}$ is completely distributive. The point separability is immediate. The fuzzy texture $(M_{[0,1]}, \mathcal{M}_{[0,1]})$ is complemented since the mapping $c_{M_{[0,1]}} : \mathcal{M}_{[0,1]} \to \mathcal{M}_{[0,1]}$ defined by $c_{M_{[0,1]}}((0, r]) = (0, 1 - r]$, $\forall r \in (0, 1]$, is a complementation on $\mathcal{M}_{[0,1]}$.
(iii) Let $U = \{a, b, c\}$. It is easy to see that the family $\mathcal{U} = \{\emptyset, \{a\}, \{a, b\}, U\}$ is a texture on $U$. Further, $P_a = \{a\}$, $P_b = \{a, b\}$, $P_c = U$ and $Q_a = \emptyset$, $Q_b = \{a\}$, $Q_c = \{a, b\}$. The mapping $c_U : \mathcal{U} \to \mathcal{U}$ defined by $c_U(\emptyset) = U$, $c_U(U) = \emptyset$, $c_U(\{a\}) = \{a, b\}$, $c_U(\{a, b\}) = \{a\}$ is a complementation on $(U, \mathcal{U})$.
(iv) The pair $(I, \mathcal{I})$, where $I = [0, 1]$ and $\mathcal{I} = \{[0, r) \mid r \in I\} \cup \{[0, r] \mid r \in I\}$, is also a texture (the unit texture). The family $\mathcal{I}$ is closed under intersection and hence, by Theorem 6 in [1], it is a complete lattice. We have $P_r = [0, r]$ and $Q_r = [0, r)$ for all $r \in [0, 1]$. Using Theorem 2.4, it can be easily checked that it is also completely distributive. The mapping $c_I : \mathcal{I} \to \mathcal{I}$ defined by $\forall r \in I$, $c_I([0, r]) = [0, 1 - r)$, $c_I([0, r)) = [0, 1 - r]$ is a complementation on $(I, \mathcal{I})$.

A texture $(U, \mathcal{U})$ is called simple if all molecules of the space are p-sets.

Example 2.6. Clearly, the texture $(U, \mathcal{U})$ given in Example 2.5(iii) and the textures $(U, \mathcal{P}(U))$ and $((0, 1], \mathcal{M}_{[0,1]})$ are simple, but $(I, \mathcal{I})$ is not simple since the set $Q_r$ is also a molecule for all $r \in [0, 1]$.

In fact, there is a one-to-one correspondence between simple textures and fuzzy lattices. Now for any fuzzy lattice $L$, let $M_L = \{m \mid m \text{ is a molecule in } L\}$. Let us consider the family $\mathcal{M}_L = \{\hat{a} \mid a \in L\}$ where $\hat{a} = \{m \mid m \in M_L \text{ and } m \le a\}$ for every $a \in L$. We may give the following representation theorem:

Theorem 2.7. [4] The mapping $\hat{\cdot} : L \to \mathcal{M}_L$ defined by $\forall a \in L$, $a \mapsto \hat{a}$ is a lattice isomorphism and the triple $(M_L, \mathcal{M}_L, c_{M_L})$ is a complemented simple texture space. Conversely, every complemented simple texture may be obtained in this way from a suitable fuzzy lattice.
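For a finite texture, the p-sets and q-sets of Definition 2.2 can be computed directly. The sketch below is only an illustration with hypothetical helper names; it assumes the texturing is given as a finite family of frozensets that is closed under intersection, so that the join of a family is the smallest member containing its union. It reproduces the values stated in Example 2.5(iii).

```python
from functools import reduce


def p_set(texturing, universe, u):
    """P_u: intersection of all members of the texturing that contain u."""
    members = [a for a in texturing if u in a]
    return reduce(lambda x, y: x & y, members, frozenset(universe))


def join(texturing, universe, family):
    """Join of a family: the smallest member of the texturing containing its union."""
    union = frozenset().union(*family)
    uppers = [a for a in texturing if union <= a]
    return reduce(lambda x, y: x & y, uppers, frozenset(universe))


def q_set(texturing, universe, u):
    """Q_u: join of all members of the texturing that do not contain u."""
    return join(texturing, universe, [a for a in texturing if u not in a])


# The texture of Example 2.5(iii).
U = frozenset("abc")
texturing = [frozenset(), frozenset("a"), frozenset("ab"), U]

p = {u: p_set(texturing, U, u) for u in U}
q = {u: q_set(texturing, U, u) for u in U}
for u in sorted(U):
    print(u, sorted(p[u]), sorted(q[u]))
```

The same helpers can be used to check Theorem 2.3 on this example, since `join(texturing, U, [p[v] for v in U if u not in p[v]])` should coincide with `q[u]` for every point `u`.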
Products
The product of textures can be defined in a natural way. In the sequel, we will see that morphisms (direlations) between any two textures $(U_1, \mathcal{U}_1)$ and $(U_2, \mathcal{U}_2)$ are pairs whose components are elements of the product of a discrete texture $(U_1, \mathcal{P}(U_1))$ and a texture $(U_2, \mathcal{U}_2)$. Moreover, this product will be used in determining the texture corresponding to fuzzy sets. Therefore, not only for the sake of simplicity, but also to study fuzzy sets, we consider the product of two textures. For the product of arbitrary families of textures we refer to [4]. Now let us consider the family

$$\mathcal{A} = \{A \times U_2 \mid A \in \mathcal{U}_1\} \cup \{U_1 \times B \mid B \in \mathcal{U}_2\}$$

and define

$$\mathcal{B} = \Big\{\bigcup_{j \in J} E_j \;\Big|\; \{E_j\}_{j \in J} \subseteq \mathcal{A}\Big\}.$$

Then it is easy to see that the family of arbitrary intersections of the elements of $\mathcal{B}$, that is,

$$\mathcal{U}_1 \otimes \mathcal{U}_2 = \Big\{\bigcap_{k \in K} D_k \;\Big|\; \{D_k\}_{k \in K} \subseteq \mathcal{B}\Big\},$$
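is a texture on $U_1 \times U_2$. Note that for all $A \in \mathcal{U}_1$ and for all $B \in \mathcal{U}_2$, we have $A \times B \in \mathcal{U}_1 \otimes \mathcal{U}_2$. Further, for the p-sets and q-sets of $\mathcal{U}_1 \otimes \mathcal{U}_2$, we have the following:

Theorem 2.8. (i) $P_{(u_1,u_2)} = P_{u_1} \times P_{u_2}$. (ii) $Q_{(u_1,u_2)} = (U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2)$.

Proof. (i) By the definition of the p-set, we have $P_{(u_1,u_2)} = \bigcap \{A \times B \in \mathcal{U}_1 \otimes \mathcal{U}_2 \mid (u_1, u_2) \in A \times B\}$. Since $u_1 \in P_{u_1}$ and $u_2 \in P_{u_2}$, we have $(u_1, u_2) \in P_{u_1} \times P_{u_2}$. Further, $P_{u_1} \times P_{u_2} \in \mathcal{U}_1 \otimes \mathcal{U}_2$ and this implies that $P_{(u_1,u_2)} \subseteq P_{u_1} \times P_{u_2}$. Suppose that $P_{u_1} \times P_{u_2} \not\subseteq P_{(u_1,u_2)}$. Then for some $A \in \mathcal{U}_1$ and $B \in \mathcal{U}_2$ with $u_1 \in A$ and $u_2 \in B$, we have $P_{u_1} \times P_{u_2} \not\subseteq A \times B$. However, this gives that $P_{u_1} \not\subseteq A$ or $P_{u_2} \not\subseteq B$, that is, we obtain the contradiction $u_1 \notin A$ or $u_2 \notin B$.

(ii) Suppose that $Q_{(u_1,u_2)} \not\subseteq (U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2)$. Then by Theorem 2.3, for some $w_1 \in U_1$ and $w_2 \in U_2$, we have $P_{(u_1,u_2)} \not\subseteq P_{(w_1,w_2)}$ and $P_{(w_1,w_2)} \not\subseteq (U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2)$. By (i), we have $P_{u_1} \not\subseteq P_{w_1}$ or $P_{u_2} \not\subseteq P_{w_2}$, that is, $u_1 \notin P_{w_1}$ or $u_2 \notin P_{w_2}$. By the definition of a q-set, $P_{w_1} \subseteq Q_{u_1}$ or $P_{w_2} \subseteq Q_{u_2}$. However, since $P_{(w_1,w_2)} \not\subseteq (U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2)$, we obtain the contradiction $P_{w_2} \not\subseteq Q_{u_2}$ and $P_{w_1} \not\subseteq Q_{u_1}$. For the reverse inclusion, let $W = \{\omega = (w_1, w_2) \mid (u_1, u_2) \notin P_{(w_1,w_2)}\}$. For $A \in \mathcal{U}_1$ and $B \in \mathcal{U}_2$, let us define the sets $E(1, A) = A \times U_2$ and $E(2, B) = U_1 \times B$. Note that by Theorem 2.3 and (i) we have

$$
\begin{aligned}
Q_{(u_1,u_2)} &= \bigvee_{\omega \in W} P_\omega = \bigvee_{\omega \in W} (P_{w_1} \times P_{w_2})\\
&= \bigvee_{\omega \in W} \bigcap_{j=1}^{2} E(j, P_{w_j})\\
&= \bigcap_{\gamma \in \{1,2\}^W} \bigvee_{\omega \in W} E(\gamma(\omega), P_{w_{\gamma(\omega)}})\\
&= \bigcap_{\gamma \in \{1,2\}^W} \bigcup_{j=1}^{2} E\Big(j, \bigvee \{P_{w_j} \mid \gamma(\omega) = j\}\Big).
\end{aligned}
$$

Now suppose that $(U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2) \not\subseteq Q_{(u_1,u_2)}$. Then for some $\gamma \in \{1, 2\}^W$ we have

$$(U_1 \times Q_{u_2}) \cup (Q_{u_1} \times U_2) \not\subseteq \bigcup_{j=1}^{2} E\Big(j, \bigvee \{P_{w_j} \mid \gamma(\omega) = j\}\Big).$$

Without loss of generality, we may assume that

$$
\begin{aligned}
U_1 \times Q_{u_2} &\not\subseteq \bigcup_{j=1}^{2} E\Big(j, \bigvee \{P_{w_j} \mid \gamma(\omega) = j\}\Big)\\
&= E\Big(1, \bigvee \{P_{w_1} \mid \gamma(\omega) = 1\}\Big) \cup E\Big(2, \bigvee \{P_{w_2} \mid \gamma(\omega) = 2\}\Big).
\end{aligned}
$$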
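This means that for some $u_1' \in U_1$ we have $P_{u_1'} \not\subseteq \bigvee \{P_{w_1} \mid \gamma(\omega) = 1\}$, and for some $u_2' \in U_2$ we have $u_2 \notin P_{u_2'}$ and $P_{u_2'} \not\subseteq \bigvee \{P_{w_2} \mid \gamma(\omega) = 2\}$. Clearly, for $\omega = (u_1', u_2')$ we have $(u_1, u_2) \notin P_{(u_1',u_2')}$ and hence $\omega \in W$. But then we have $\gamma(\omega) = 1$ or $\gamma(\omega) = 2$, and this is a contradiction.

Notation. For the product texture $\mathcal{P}(U) \otimes \mathcal{V}$, we denote the p-set by $\overline{P}_{(u,v)}$ and the q-set by $\overline{Q}_{(u,v)}$.

Corollary 2.9. For every $u \in U$ and $v \in V$, the p-set and q-set of the product texture $(U \times V, \mathcal{P}(U) \otimes \mathcal{V})$ are $\overline{P}_{(u,v)} = \{u\} \times P_v$ and $\overline{Q}_{(u,v)} = ((U \setminus \{u\}) \times V) \cup (U \times Q_v)$, respectively.

Proof. For the discrete texture $\mathcal{P}(U)$, we have $P_u = \{u\}$ and $Q_u = U \setminus \{u\}$. Then by Theorem 2.8, we have $\overline{P}_{(u,v)} = P_u \times P_v = \{u\} \times P_v$ and $\overline{Q}_{(u,v)} = (U \times Q_v) \cup (Q_u \times V) = (U \times Q_v) \cup ((U \setminus \{u\}) \times V)$.

The following result will often be used for the main results in the sequel.

Lemma 2.10. $\overline{P}_{(u,v)} \not\subseteq \overline{Q}_{(u',v')} \iff u = u'$ and $P_v \not\subseteq Q_{v'}$.

Proof. $\overline{P}_{(u,v)} = \{u\} \times P_v \not\subseteq ((U \setminus \{u'\}) \times V) \cup (U \times Q_{v'}) = \overline{Q}_{(u',v')} \iff \{u\} \times P_v \not\subseteq (U \setminus \{u'\}) \times V$ and $\{u\} \times P_v \not\subseteq U \times Q_{v'} \iff u = u'$ and $P_v \not\subseteq Q_{v'}$.

If $c_U$ and $c_V$ are complementations on the textures $(U, \mathcal{U})$ and $(V, \mathcal{V})$, respectively, then the complementation $c_{U \times V}$ on the product $\mathcal{U} \otimes \mathcal{V}$ is defined by

$$c_{U \times V}(A \times B) = \bigcap_{(u,v) \in A \times B} (U \times c_V(P_v)) \cup (c_U(P_u) \times V).$$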
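Table 2.1 Correspondences Between Textures and Fuzzy Lattices

| Texture | Universe | Texturing | Fuzzy Lattice |
|---|---|---|---|
| Discrete | $U$ | $\mathcal{P}(U)$ | $\{\{\{u\} \mid u \in A\} \mid A \in \mathcal{P}(U)\}$ |
| Fuzzy | $(0, 1]$ | $\mathcal{M} = \{(0, r] \mid r \in [0, 1]\}$ | $[0, 1]$ |
| Fuzzy Set | $U \times (0, 1]$ | $\mathcal{P}(U) \otimes \mathcal{M}$ | $\mathcal{F}(U)$ |

Example 2.11. Consider the discrete texture $(U, \mathcal{P}(U))$ where $U = \{a, b, c\}$ and the texture $(V, \mathcal{V})$ where $V = \{d, e\}$ and $\mathcal{V} = \{V, \emptyset, \{d\}\}$. It is easy to see that the p-sets and q-sets of the product texture $(U \times V, \mathcal{P}(U) \otimes \mathcal{V})$ are

$$
\begin{aligned}
\overline{P}_{(a,d)} &= \{a\} \times P_d = \{(a, d)\}, & \overline{P}_{(a,e)} &= \{a\} \times P_e = \{(a, d), (a, e)\},\\
\overline{P}_{(b,d)} &= \{b\} \times P_d = \{(b, d)\}, & \overline{P}_{(b,e)} &= \{b\} \times P_e = \{(b, d), (b, e)\},\\
\overline{P}_{(c,d)} &= \{c\} \times P_d = \{(c, d)\}, & \overline{P}_{(c,e)} &= \{c\} \times P_e = \{(c, d), (c, e)\},
\end{aligned}
$$

and

$$
\begin{aligned}
\overline{Q}_{(a,d)} &= ((U \setminus \{a\}) \times V) \cup (U \times Q_d) = (\{b, c\} \times V) \cup (U \times \emptyset) = \{(b, d), (b, e), (c, d), (c, e)\},\\
\overline{Q}_{(a,e)} &= ((U \setminus \{a\}) \times V) \cup (U \times Q_e) = (\{b, c\} \times V) \cup (U \times \{d\}) = \{(b, d), (b, e), (c, d), (c, e), (a, d)\},\\
\overline{Q}_{(b,d)} &= ((U \setminus \{b\}) \times V) \cup (U \times Q_d) = (\{a, c\} \times V) \cup (U \times \emptyset) = \{(a, d), (a, e), (c, d), (c, e)\},\\
\overline{Q}_{(b,e)} &= ((U \setminus \{b\}) \times V) \cup (U \times Q_e) = (\{a, c\} \times V) \cup (U \times \{d\}) = \{(a, d), (a, e), (c, d), (c, e), (b, d)\},\\
\overline{Q}_{(c,d)} &= ((U \setminus \{c\}) \times V) \cup (U \times Q_d) = (\{a, b\} \times V) \cup (U \times \emptyset) = \{(a, d), (a, e), (b, d), (b, e)\},\\
\overline{Q}_{(c,e)} &= ((U \setminus \{c\}) \times V) \cup (U \times Q_e) = (\{a, b\} \times V) \cup (U \times \{d\}) = \{(a, d), (a, e), (b, d), (b, e), (c, d)\},
\end{aligned}
$$

respectively.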
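The sets in Example 2.11 follow mechanically from Corollary 2.9, which the following sketch illustrates. The names `p_bar` and `q_bar` are hypothetical, and the p-sets and q-sets of $(V, \mathcal{V})$ are entered by hand ($P_d = \{d\}$, $P_e = V$, $Q_d = \emptyset$, $Q_e = \{d\}$). The final loop checks Lemma 2.10 on every pair of points of this small example.

```python
from itertools import product

U = {"a", "b", "c"}
V = {"d", "e"}

# p-sets and q-sets of the texture (V, {V, {}, {d}}), entered by hand.
P_V = {"d": {"d"}, "e": {"d", "e"}}
Q_V = {"d": set(), "e": {"d"}}


def p_bar(u, v):
    """p-set of (u, v) in the product of the discrete texture on U with V (Corollary 2.9)."""
    return {(u, y) for y in P_V[v]}


def q_bar(u, v):
    """q-set of (u, v) in the product of the discrete texture on U with V (Corollary 2.9)."""
    return set(product(U - {u}, V)) | set(product(U, Q_V[v]))


print(sorted(p_bar("a", "e")))
print(sorted(q_bar("a", "e")))

# Lemma 2.10: p_bar(u, v) is not contained in q_bar(u2, v2)
# exactly when u == u2 and P_v is not contained in Q_{v2}.
for (u, v), (u2, v2) in product(product(U, V), repeat=2):
    lhs = not p_bar(u, v) <= q_bar(u2, v2)
    rhs = (u == u2) and not (P_V[v] <= Q_V[v2])
    assert lhs == rhs
```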
Fuzzy Set Texture
Now we may determine the fuzzy set texture (for comparison, see Table 2.1). Let $U$ be a non-empty set and $L$ be a fuzzy lattice. An $L$-fuzzy set is a mapping $\mu : U \to L$. Then the family $\mathcal{F}(U)$ of all $L$-fuzzy sets is also a fuzzy lattice with the order $\le$ defined by

$$\mu, \eta \in \mathcal{F}(U), \quad \mu \le \eta \iff \forall u \in U,\ \mu(u) \le \eta(u).$$

In this case, for any family $A = \{\mu_j \mid j \in J\} \subseteq \mathcal{F}(U)$, the greatest lower bound and the least upper bound of $A$ are defined by

$$\Big(\bigwedge A\Big)(u) = \Big(\bigwedge_{j \in J} \mu_j\Big)(u) = \bigwedge_{j \in J} \mu_j(u)$$

and

$$\Big(\bigvee A\Big)(u) = \Big(\bigvee_{j \in J} \mu_j\Big)(u) = \bigvee_{j \in J} \mu_j(u),$$

respectively. By Proposition 2.1.5 in [34], $\mathcal{F}(U)$ is also a fuzzy lattice. Further, by Theorem 2.7, there is a simple texture $(W, \mathcal{W})$ which is isomorphic to $\mathcal{F}(U)$. We call $(W, \mathcal{W})$ the fuzzy set texture. Now we have the following:

Theorem 2.12. The textures $(W, \mathcal{W})$ and $\mathcal{P}(U) \otimes \mathcal{M}_L$ are isomorphic.

Proof. By the definition of the product of textures, every element of $\mathcal{P}(U) \otimes \mathcal{M}_L$ is an arbitrary intersection of sets of the form $(A \times L) \cup (U \times \hat{a})$
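where $A \subseteq U$ and $a \in L$. We show that $(A \times L) \cup (U \times \hat{a}) \in \mathcal{W}$. Consider the fuzzy set $\mu : U \to L$ defined by

$$\mu(u) = \begin{cases} 1, & \text{if } u \in A,\\ a, & \text{if } u \in U \setminus A. \end{cases}$$

On the other hand, the fuzzy points $v_\lambda$ of $\mathcal{F}(U)$ are the only molecules of $\mathcal{F}(U)$. Clearly, for $u \in A$ and $\lambda \in L$, we have $u_\lambda \le \mu$. Further, for $\lambda \le a$, we also have $v_\lambda \le \mu$. Then we obtain

$$\hat{\mu} = \{v_\lambda \mid v_\lambda \le \mu\} = \{(v, \lambda) \mid v_\lambda \le \mu\} = (A \times L) \cup (U \times \hat{a}) \in \mathcal{W}$$

since $\hat{\cdot}$ is a mapping from $\mathcal{F}(U)$ to $\mathcal{W}$. Now we show that for $\mu \in \mathcal{F}(U)$,

$$\hat{\mu} = \bigcap_{u \in U} \big((U \setminus \{u\}) \times L\big) \cup \big(U \times \widehat{\mu(u)}\big).$$

Suppose that for some fuzzy point $v_\lambda$, we have $v_\lambda \in \hat{\mu}$, that is, $v_\lambda \le \mu$, and $v_\lambda \notin \bigcap_{u \in U} ((U \setminus \{u\}) \times L) \cup (U \times \widehat{\mu(u)})$. Then $v_\lambda \notin ((U \setminus \{u\}) \times L) \cup (U \times \widehat{\mu(u)})$ for some $u \in U$. That is, $v_\lambda \notin (U \setminus \{u\}) \times L$ and $v_\lambda \notin U \times \widehat{\mu(u)}$. This clearly implies that $v = u$ and $\mu(u) < \lambda$. However, this is a contradiction since $v_\lambda \le \mu$. The reverse inclusion is similar.

Recall that a fuzzy point $u_\lambda$ and a fuzzy co-point $u^\lambda$ of $\mathcal{F}(U)$ are defined by

$$u_\lambda(z) = \begin{cases} \lambda, & \text{if } z = u,\\ 0, & \text{if } z \ne u, \end{cases} \qquad u^\lambda(z) = \begin{cases} \lambda, & \text{if } z = u,\\ 1, & \text{if } z \ne u, \end{cases}$$

for all $z \in U$, respectively [29]. By the above theorem, we may take $(U \times L, \mathcal{P}(U) \otimes \mathcal{M}_L)$ as a fuzzy set texture corresponding to the fuzzy lattice $\mathcal{F}(U)$. Therefore, if we denote the p-sets and q-sets of the texture $\mathcal{P}(U) \otimes \mathcal{M}_L$ by $\overline{P}_{(u,\lambda)}$ and $\overline{Q}_{(u,\lambda)}$, then it is easy to see that $\overline{P}_{(u,\lambda)} = \{u\} \times (0, \lambda]$ and $\overline{Q}_{(u,\lambda)} = ((U \setminus \{u\}) \times (0, 1]) \cup (U \times (0, \lambda])$. By the lattice isomorphism $\hat{\cdot} : \mathcal{F}(U) \to \mathcal{P}(U) \otimes \mathcal{M}_L$, we immediately have that $\widehat{u_\lambda} = \overline{P}_{(u,\lambda)}$ and $\widehat{u^\lambda} = \overline{Q}_{(u,\lambda)}$. If $A \in \mathcal{P}(U) \otimes \mathcal{M}_L$, then for some $\alpha \in \mathcal{F}(U)$, we have $A = \hat{\alpha}$ and, therefore,

$$\hat{\alpha} = \bigvee_{\hat{\alpha} \not\subseteq \overline{Q}_{(u,\lambda)}} \overline{P}_{(u,\lambda)} = \bigcap_{\overline{P}_{(u,\lambda)} \not\subseteq \hat{\alpha}} \overline{Q}_{(u,\lambda)}.$$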
In view of the above equalities, we can show that every fuzzy set can be written in terms of fuzzy points and fuzzy co-points: Theorem 2.13. For every fuzzy set α ∈ F(U ), we have α=
λ2000 kN) of pile capacities. But in this case, MOFS model is found to be better throughout the range of axial capacity of bored piles.
16.4.1.2 FN Model
An FN model with degree 3 and a tangent (tan) basis function (BF) was adopted. Fig. 16.3 gives the associative FN used for the prediction model of bored piles. The corresponding prediction equation is given by:
$$
\begin{aligned}
Q_{u(p)} ={}& 7424.584\tan(D) + 127.369\tan(2D) + 23.243\tan(3D) + 2735.177\tan(L) \\
& - 712.15\tan(2L) - 826.503\tan(q_{c\text{-tip}}) + 1816.634\tan(q_{c\text{-shaft}}) \\
& - 48.344\tan(2q_{c\text{-shaft}}) - 100.483
\end{aligned}
\tag{16.6}
$$
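As a minimal sketch, Eq. (16.6) can be evaluated directly. The function name and argument names below are illustrative, not from the chapter, and the function assumes the four inputs have already been normalized to [0, 1]; the normalization itself is not reproduced here.

```python
import math

def fn_pile_capacity(D, L, qc_tip, qc_shaft):
    """Evaluate Eq. (16.6); inputs are assumed to be normalized to [0, 1].

    Returns the predicted capacity Qu(p) in kN.
    """
    return (7424.584 * math.tan(D)
            + 127.369 * math.tan(2 * D)
            + 23.243 * math.tan(3 * D)
            + 2735.177 * math.tan(L)
            - 712.15 * math.tan(2 * L)
            - 826.503 * math.tan(qc_tip)
            + 1816.634 * math.tan(qc_shaft)
            - 48.344 * math.tan(2 * qc_shaft)
            - 100.483)

# With every input at zero all tangent terms vanish and only the constant remains.
print(fn_pile_capacity(0.0, 0.0, 0.0, 0.0))  # -100.483
```

Because the tangent has poles at odd multiples of π/2, terms such as tan(2D) and tan(3D) change very rapidly for some normalized inputs inside [0, 1], so the output can be sensitive in those regions.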
In Eq. (16.6), the inputs take their normalized values in the range [0, 1]. As per Table 16.4, the values of R in training and testing for the FN model are 0.975 and 0.986, respectively. The AAE and RMSE values are 368.20 kN and 494.01 kN in training and 1585.78 kN and 2484.11 kN in testing (Table 16.4). Also from Table 16.4, the overfitting ratio of the FN model is 5.028, which indicates that the model generalizes poorly. Fig. 16.4 shows the plot between the measured and the predicted pile capacities of bored piles. During training (Fig. 16.4A), when the bearing capacity of bored piles is less than 4000 kN, the FN model gave an equal distribution of predictions around the line of equality, and when the capacity is greater than 4000 kN, it gave mostly underpredicted values. During testing (Fig. 16.4B), for pile capacities of less than 4000 kN, the developed model gave equal under- and over-predictions with minimum deviation, whereas for values greater than 4000 kN it gave highly overpredicted values. Thus, it can be concluded that the FN model is poor in predicting higher values of pile capacity.

FIGURE 16.4 Plot between the measured and the predicted interpreted failure load for the bored piles for (A) training and (B) testing data.
16.4.1.3 MARS Model
A MARS model with nine BFs was adopted for the prediction of bearing capacity of bored piles, and the corresponding prediction equation for the adopted MARS model can be presented as:
$$
\begin{aligned}
Q_{u(p)} ={}& 710.309 - 6415.86\max(0, L - 0.299) \\
& - 25267.387\max(0, D - 0.299)\max(0, q_{c\text{-shaft}} - 0.428) \\
& - 34984.611\max(0, D - 0.299)\max(0, 0.428 - q_{c\text{-shaft}}) \\
& + 30802.82\max(0, L - 0.299)\max(0, q_{c\text{-tip}} - 0.518) \\
& + 17104.621\max(0, L - 0.299)\max(0, 0.518 - q_{c\text{-tip}}) \\
& - 48021.794\max(0, D - 0.299)\max(0, 0.159 - L) \\
& + 8575.553\max(0, D - 0.074) \\
& + 30577.833\max(0, D - 0.074)\max(0, q_{c\text{-shaft}} - 0.439) \\
& + 9279.336\max(0, D - 0.074)\max(0, L - 0.215)
\end{aligned}
\tag{16.7}
$$
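Eq. (16.7) is a sum of products of hinge functions max(0, ·), so it can be coded term by term. The sketch below is only an illustration of the equation; the names `hinge` and `mars_pile_capacity` are hypothetical, and the inputs are assumed to be already normalized to [0, 1].

```python
def hinge(x):
    """Hinge basis function max(0, x) used by MARS."""
    return max(0.0, x)

def mars_pile_capacity(D, L, qc_tip, qc_shaft):
    """Evaluate Eq. (16.7); inputs are assumed to be normalized to [0, 1].

    Returns the predicted capacity Qu(p) in kN.
    """
    return (710.309
            - 6415.86 * hinge(L - 0.299)
            - 25267.387 * hinge(D - 0.299) * hinge(qc_shaft - 0.428)
            - 34984.611 * hinge(D - 0.299) * hinge(0.428 - qc_shaft)
            + 30802.82 * hinge(L - 0.299) * hinge(qc_tip - 0.518)
            + 17104.621 * hinge(L - 0.299) * hinge(0.518 - qc_tip)
            - 48021.794 * hinge(D - 0.299) * hinge(0.159 - L)
            + 8575.553 * hinge(D - 0.074)
            + 30577.833 * hinge(D - 0.074) * hinge(qc_shaft - 0.439)
            + 9279.336 * hinge(D - 0.074) * hinge(L - 0.215))

# At the origin every hinge term (or one factor of each product) is zero,
# so only the intercept remains.
print(mars_pile_capacity(0.0, 0.0, 0.0, 0.0))  # 710.309
```

Each hinge is active only on one side of its knot (for example D > 0.074), which is why the model is piecewise linear in each input.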
In Eq. (16.7), the inputs take their normalized values in the range [0, 1]. As per Table 16.4, the values of R in training and testing for the MARS model are 0.995 and 0.994, respectively, indicating a strong correlation according to Smith [27]. The AAE and RMSE values of the MARS model are 168.95 kN and 225.62 kN in training and 559.84 kN and 790.80 kN in testing (Table 16.4). The overfitting ratio (Table 16.4) of the MARS model is 3.505, which indicates that the model generalizes poorly. Fig. 16.4 shows the plot between measured and predicted bearing capacities of bored piles. During training (Fig. 16.4A), when the bearing capacity of bored piles is less than 4000 kN, the MARS model gave an equal distribution of predictions around the line of equality with small deviations, and when the capacity is greater than 4000 kN, the predicted values match the measured values exactly. During testing (Fig. 16.4B), when the pile capacity is less than 2000 kN, the developed model gave equal under- and over-predicted values with minimum deviations, whereas for values greater than 2000 kN it gave underpredicted values.

The performance of the AI models developed in this study varies according to different statistical criteria (Table 16.4). In terms of R, the MARS model is better, as its values are larger for both training and testing. In terms of AAE and RMSE, the MARS model has the least error during training, but in testing the MOFS model has the least error. Similarly, the overfitting ratio of the MOFS model is much closer to unity than those of the other models.

FIGURE 16.5 Cumulative probability distribution for bored piles.
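The reported overfitting ratios are numerically consistent with the ratio of testing RMSE to training RMSE. The short check below assumes that definition (it is not restated in this section) and uses only the RMSE values quoted above; the function name is illustrative.

```python
def overfitting_ratio(rmse_train, rmse_test):
    """Assumed definition: testing RMSE divided by training RMSE."""
    return rmse_test / rmse_train

fn_ratio = overfitting_ratio(494.01, 2484.11)    # FN model RMSE values (kN)
mars_ratio = overfitting_ratio(225.62, 790.80)   # MARS model RMSE values (kN)

print(round(fn_ratio, 3))    # 5.028
print(round(mars_ratio, 3))  # 3.505
```

A ratio close to 1 means the testing error is similar to the training error; both values here are well above 1.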
16.4.1.4 Comparison with Other AI Models Available in the Literature
According to the first ranking criterion (R1), i.e. best fit calculations of Qu(p)/Qu(m), the MOFS model is ranked 1 (R1 = 1), as both R and E are closer to unity than for the other AI models (Table 16.5). The MOFS model is trailed by the MARS model (R1 = 2), which is followed by the GEP model [4] with R1 = 3. The least efficient model is the FN, with R1 = 5. For the second ranking criterion (R2), i.e. arithmetic calculations of Qu(p)/Qu(m), the trend is the same. The MOFS model leads the pack with R2 = 1 (Table 16.5). Its μ value is closer to unity with σ approaching zero as compared to others. It is followed by the MARS model, and the most underperforming model is again the FN. Based on the P50 and P90 values, which form the third ranking criterion (R3), the MOFS model is assigned rank 1 (R3 = 1), as indicated in Table 16.5. The P50 value of the MOFS model is much closer to 1, with the least difference between the P50 and P90 values in comparison to the other models. The least performing model is the FN. The cumulative probability distributions of the different AI models are presented in Fig. 16.5. As per Table 16.5, in the overall ranking system the MOFS model is ranked 1 with an RI value of 3, followed by the MARS model and then by the GEP model [4]. The model that came last is the FN (RI = 15).
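Table 16.5 Ranking of Various AI Models for Bored Piles Based on Ranking Index
(R, E, R1: best fit calculations; μ, σ, R2: arithmetic calculations of Qu(p)/Qu(m); P50, P90, R3: cumulative probability of Qu(p)/Qu(m); RI, final rank: overall rank)

| Model | R | E | R1 | μ | σ | R2 | P50 | P90 | R3 | RI | Final rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FN (present study) | 0.94 | 0.77 | 5 | 1.02 | 0.51 | 5 | 0.97 | 1.78 | 5 | 15 | 5 |
| MARS (present study) | 0.99 | 0.97 | 2 | 1.04 | 0.29 | 2 | 1.00 | 1.47 | 2 | 6 | 2 |
| MOFS (present study) | 0.99 | 0.98 | 1 | 1.05 | 0.25 | 1 | 1.00 | 1.37 | 1 | 3 | 1 |
| ANN [1] | 0.95 | 0.89 | 4 | 1.07 | 0.38 | 4 | 0.94 | 1.52 | 4 | 12 | 4 |
| GEP [4] | 0.97 | 0.93 | 3 | 1.05 | 0.37 | 3 | 0.98 | 1.30 | 3 | 9 | 3 |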
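The RI values in Table 16.5 equal the sum of the three criterion ranks (R1 + R2 + R3), and the final rank orders the models by ascending RI. A small sketch of that bookkeeping, using only the ranks from the table (the variable names are illustrative):

```python
# (R1, R2, R3) for each model, as listed in Table 16.5.
criterion_ranks = {
    "FN": (5, 5, 5),
    "MARS": (2, 2, 2),
    "MOFS": (1, 1, 1),
    "ANN [1]": (4, 4, 4),
    "GEP [4]": (3, 3, 3),
}

# Ranking index: sum of the individual criterion ranks.
ranking_index = {model: sum(r) for model, r in criterion_ranks.items()}

# Final rank: position after sorting by ascending ranking index.
ordered = sorted(ranking_index, key=ranking_index.get)
final_rank = {model: pos + 1 for pos, model in enumerate(ordered)}

for model in ordered:
    print(model, ranking_index[model], final_rank[model])
```

This reproduces RI = 3 for MOFS and RI = 15 for FN, in line with the text.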
16.5 CONCLUSION
The present research dealt with the application of various AI techniques, namely multi-objective feature selection (MOFS), functional network (FN), and multivariate adaptive regression splines (MARS), for estimating the axial capacity of bored piles. Identification of the subset of features responsible for the predictive capacity of the model was addressed by treating it as a multi-objective optimization problem. The feature selection algorithm was successful in identifying the influential features involved in this problem, which is a good indicator that it can also be applied to problems where identification of the controlling parameters is of prime importance, especially in geotechnical engineering, where determination of the important variables is a must due to the complex nature of the soil and its surrounding elements. The use of a multi-objective optimization tool for a trade-off between the inputs and the accuracy can be applied to other complex problems in geotechnical engineering such as liquefaction, slope stability, etc. Distinct prediction models were developed for bored piles. The results of the current study were compared with results previously reported in the literature in terms of statistical parameters. Based on the results of the ranking system for the developed models, the MOFS model was found to be the best for bored piles, with an RI value of 3, followed by the MARS model (RI = 6). Prediction equations for the MOFS, FN, and MARS models were provided, which can be used by practicing geotechnical engineers.
REFERENCES
[1] M.A. Shahin, Intelligent computing for modelling axial capacity of pile foundations, Can. Geotech. J. 47 (2010) 230–243.
[2] M.Y. Abu-Farsakh, H.H. Titi, Assessment of direct cone penetration test methods for predicting the ultimate capacity of friction driven piles, J. Geotech. Geoenviron. Eng. 130 (9) (2004) 935–944.
[3] M.A. Shahin, Use of evolutionary computing for modelling some complex problems in geotechnical engineering, Geomech. Geoeng., Int. J. 10 (2) (2015) 109–125.
[4] I. Alkroosh, H. Nikraz, Correlation of pile axial capacity and CPT data using gene expression programming, Geotech. Geol. Eng. 29 (2011) 725–748.
[5] M.H. Baziar, A. Kashkooli, A. Saeedi-Azizkandi, Prediction of pile shaft resistance using cone penetration tests (CPTs), Comput. Geotech. 45 (2012) 74–82.
[6] A. Kordjazi, F.P. Nejad, M.B. Jaksa, Prediction of ultimate axial load-carrying capacity of piles using a support vector machine based on CPT data, Comput. Geotech. 55 (2014) 91–102.
[7] I. Alkroosh, H. Nikraz, Simulating pile load–settlement behavior from CPT data using intelligent computing, Cent. Eur. J. Eng. 1 (3) (2011) 295–305.
[8] M.A. Shahin, Load–settlement modelling of axially loaded steel driven piles using CPT-based recurrent neural networks, Soils Found. 54 (3) (2014) 515–522.
[9] M.A. Shahin, Load–settlement modeling of axially loaded drilled shafts using CPT-based recurrent neural networks, Int. J. Geomech. 14 (6) (2014), http://dx.doi.org/10.1061/(ASCE)GM.1943-5622.0000370.
[10] S.K. Das, Artificial neural networks in geotechnical engineering: modeling and application, in: X. Yang, A.H. Gandomi, S. Talatahari, A.H. Alavi (Eds.), Metaheuristics in Water, Geotechnical and Transport Engineering, Elsevier, London, 2013, pp. 231–270, Chapter 10.
[11] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[12] Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: Fourteenth International Conference on Machine Learning, vol. 97, ICML’97, Nashville, Tennessee, USA, 1997, pp. 412–420.
[13] G. Forman, An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res. 3 (2003) 1289–1305.
[14] F.R. Bach, Bolasso: model consistent Lasso estimation through the bootstrap, in: A. McCallum, S.T. Roweis (Eds.), 25th International Conference on Machine Learning, ICML2008, Helsinki, Finland, 2008, pp. 33–40.
[15] H. Zare, G. Haffari, A. Gupta, R.R. Brinkman, Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis, BMC Genom. 14 (2013), art. no. S14.
[16] X. He, Q. Zhang, N. Sun, Y. Dong, Feature selection with discrete binary differential evolution, in: International Conference on Artificial Intelligence and Computational Intelligence, vol. 4, AICI 2009, Shanghai, 2013, pp. 327–330, art. no. 5376334.
[17] Z. Zhu, Y.S. Ong, M. Dash, Wrapper-filter feature selection algorithm using a memetic framework, IEEE Trans. Syst. Man Cybern., Part B, Cybern. 37 (1) (2007) 70–76.
[18] K. Neshatian, M. Zhang, Pareto front feature selection: using genetic programming to explore feature space, in: 11th Annual Conference on Genetic and Evolutionary Computation, GECCO’09, ACM, New York, NY, USA, 2009, pp. 1027–1034.
[19] L. Cervante, B. Xue, M. Zhang, L. Shang, Binary particle swarm optimisation for feature selection: a filter based approach, in: 2012 IEEE Congress Evolutionary Computation, CEC, Brisbane, QLD, 2012, pp. 881–888, art. no. 6256452.
[20] B. Xue, L. Cervante, L. Shang, W.N. Browne, M. Zhang, Binary PSO and rough set theory for feature selection: a multiobjective filter based approach, Int. J. Comput. Intell. Appl. 13 (2) (2014), art. no. 1450009.
[21] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[22] E. Castillo, A. Cobo, R. Gomez-Nesterkin, A.S. Hadi, A general framework for functional networks, Networks 35 (1) (2000) 70–82.
[23] S.K. Das, S. Suman, Prediction of lateral load capacity of pile in clay using multivariate adaptive regression spline and functional network, Arab. J. Sci. Eng. 40 (6) (2015) 1565–1578.
[24] J. Friedman, Multivariate adaptive regression splines, Ann. Stat. 19 (1991) 1–141.
[25] S.K. Das, N. Sivakugan, Discussion of: intelligent computing for modeling axial capacity of pile foundations, Can. Geotech. J. 47 (2010) 928–930.
[26] S.K. Das, P.K. Basudhar, Undrained lateral load capacity of piles in clay using artificial neural network, Comput. Geotech. 33 (2006) 454–459.
[27] G.N. Smith, Probability and Statistics in Civil Engineering: An Introduction, Collins, London, 1986.
CHAPTER 17
TRANSIENT STABILITY CONSTRAINED OPTIMAL POWER FLOW USING CHAOTIC WHALE OPTIMIZATION ALGORITHM
Dharmbir Prasad∗, Aparajita Mukherjee†, V. Mukherjee†
∗ Asansol Engineering College, Asansol, India
† Indian Institute of Technology (Indian School of Mines), Dhanbad, India
17.1 INTRODUCTION
An optimization-oriented outlook is now widely accepted. The key idea of optimization is to obtain the greatest return with the least loss. Based on different nature inspired phenomena, a number of metaheuristic algorithms have been invented [1,2] and applied to different nonlinear complex optimization problems, with the aim of improving computational efficiency and resolving large scale problems. The optimal power flow (OPF) may be considered as one of those challenges. The OPF problem may be stated as optimization of the considered objective function by proper adjustment of control variables while satisfying certain equality and inequality constraints [3]. As the demand for electricity grows day by day, the OPF problem is becoming more critical in power system optimization, where consumers' energy demands must be met at the minimum cost of energy production. Many classical numerical programming techniques such as linear programming, nonlinear programming, quadratic programming, Newton method, interior point method, etc. have been applied in the literature for solving the OPF problem while only considering the algebraic functions. Generally, these techniques have some restrictions such as the inability to consider dynamic characteristics, failure to obtain global optimal solutions in a short time, etc. To overcome such drawbacks, many heuristic optimization techniques such as simulated annealing (SA), evolutionary programming (EP) [4], particle swarm optimization (PSO), biogeography based optimization [5], etc. have been proposed for the solution of OPF problems. Recently, the OPF problem has been extended with various additional constraints such as the voltage stability constraint, security constraint, harmonic constraint, etc.
The transient stability constrained OPF (TSCOPF) problem is, mainly, a nonlinear optimization problem in which both algebraic and differential equations are considered. Other metaheuristic optimization algorithms like PSO [6], differential evolution (DE) [7], EP [8], improved genetic algorithm (GA), etc. have also been applied for solving the TSCOPF problem.

Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00017-X Copyright © 2017 Elsevier Inc. All rights reserved.

Recently, a large number of novel swarm intelligence based algorithms have been proposed (such as the krill herd algorithm (KHA), which is based on the herding behavior of krill swarms in nature [9], the interior search algorithm [10], etc.) and have secured a strong position in almost every area of science and engineering, with applications such as the combined heat and power dispatch problem [11], the short-term load forecasting problem [12], reactive power and voltage control oriented problems [13], and so on. In 2016, Mirjalili and Lewis introduced a novel swarm intelligence based stochastic optimization approach and named it the whale optimization algorithm (WOA) [14]. It is based on the simulation of the hunting behavior of humpback whales searching for the best agent in nature. The objective function, used in the WOA analogy, is assumed to be a coordinated sequence of the bubble-net hunting strategy of the humpback whales. The hunting behavior of humpback whales may be described in three steps [14], viz. (a) encircling the target, (b) bubble-net attacking method, and (c) search for the target. To the best of the authors' knowledge, neither WOA nor any of its variants has yet been applied to the OPF and TSCOPF problems of power systems. With the recent developments in the theories and applications of nonlinear dynamics, chaotic activity is drawing a lot of attention in diverse fields. The chaos concept has been integrated with various metaheuristic optimization methods in the literature, such as GA [15], PSO [16], accelerated PSO [17], cuckoo search (CS) [18], harmony search algorithm [19], ant colony optimization [20], firefly algorithm [21], CS with elitism strategy [22], artificial bee colony optimization (ABC) [23], SA [24], bat algorithm [25], KHA [26,27], imperialist competitive algorithm [28], and so on.
These hybridization works have shown good performance and increased accuracy in different fields of engineering applications. In this chapter, the chaotic concept is combined with the basic WOA to gain some benefits, i.e. enhanced computational speed and an improved convergence profile, when applied to the TSCOPF problem of power systems. Chaotic WOA (CWOA) is applied in the present work to the Western Systems Coordinating Council (WSCC) 3-generator, 9-bus and IEEE 30-bus test power systems for the purpose of TSCOPF study. The simulation results yielded by the proposed CWOA technique are compared to those offered by other computational intelligence based techniques reported in the recent state-of-the-art literature, including the basic WOA. The rest of this chapter is organized as follows. In Section 17.2, the mathematical problem formulation of the TSCOPF work is presented. Section 17.3 describes the implementation of the TSCOPF problem using the proposed CWOA, along with a brief overview of WOA and the chosen chaotic map. Simulation results are reported and discussed in Section 17.4. Finally, the conclusion and the scope of future work are presented in Section 17.5.
17.2 PROBLEM FORMULATION OF TSCOPF
17.2.1 OPF PROBLEM FORMULATION
OPF problem formulation is concerned with the optimal setting of control variables for the steady-state performance of the power system with respect to a predefined objective function, subject to various equality and inequality constraints. Mathematically, the OPF problem may be represented as (17.1) [6]
$$
\text{Minimize } f(u, v) \quad \text{subject to: } \begin{cases} eq(u, v) = 0 \\ ieq(u, v) \le 0 \end{cases}
\tag{17.1}
$$
where f(u, v) is the objective function to be optimized, eq(u, v) is the vector set of equality constraints, ieq(u, v) is the vector set of inequality constraints, u is the set of dependent variables, and v is the set of control variables. The vector of dependent variables (u in (17.1)) may be represented by (17.2)
$$
u^T = [P_{G_1}, V_{L_1}, \cdots, V_{L_{NL}}, Q_{G_1}, \cdots, Q_{G_{NG}}, S_{l_1}, \cdots, S_{l_{NTL}}]
\tag{17.2}
$$
Similarly, the vector of control variables (v in (17.1)) may be written as (17.3)
$$
v^T = [P_{G_2}, \cdots, P_{G_{NG}}, V_{G_1}, \cdots, V_{G_{NG}}, T_1, \cdots, T_{NT}, Q_{C_1}, \cdots, Q_{C_{NC}}]
\tag{17.3}
$$
In (17.2)–(17.3), P_G1 is the slack bus power, V_L is the load bus voltage, Q_G is the reactive power output of the generators, S_l is the transmission line flow, NL is the number of load buses, NG is the number of generator buses, NTL is the number of transmission lines, NT is the number of tap changing transformers, NC is the number of shunt VAR compensators, V_G is the terminal voltage at the generator bus, P_G is the active power output of the generator, T is the tap setting of the tap changing transformer, and Q_C is the output of the shunt VAR compensator [27].
17.2.2 OBJECTIVE FUNCTION
The objective function formulated here is to minimize the total fuel cost of active power generation, which may be defined by a quadratic function. Mathematically, it may be expressed by (17.4)
$$
C_T = \sum_{p=1}^{NG} \left( a_p P_{G_p}^2 + b_p P_{G_p} + c_p \right)
\tag{17.4}
$$
where ap , bp and cp are the fuel cost coefficients of the pth generator, PGp is the active power output of the pth generator, and NG is the total number of generators. Moreover, additional terms like valve point loading effect may be included to achieve more flexible, accurate and stable operation [5].
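As a sketch of how (17.4) is evaluated, the total cost is simply the sum of one quadratic per generator. The coefficient and output values below are made-up placeholders for illustration only; they are not the data of either test system, and the function name is hypothetical.

```python
def total_fuel_cost(p_g, a, b, c):
    """Total fuel cost of Eq. (17.4): sum over generators of a*P^2 + b*P + c.

    p_g: active power outputs; a, b, c: fuel cost coefficients,
    all given as equally long sequences (one entry per generator).
    """
    return sum(ap * p ** 2 + bp * p + cp
               for p, ap, bp, cp in zip(p_g, a, b, c))

# Hypothetical two-generator example (illustrative numbers only).
cost = total_fuel_cost(p_g=[100.0, 50.0],
                       a=[0.01, 0.02],
                       b=[2.0, 1.5],
                       c=[10.0, 20.0])
print(cost)
```

For these placeholder values the two generators contribute 310 and 145, giving a total of 455 in the cost unit implied by the coefficients.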
17.2.3 CONSTRAINTS OF THE PROBLEM
17.2.3.1 Equality Constraints
In (17.1), eq is the set of equality constraints given by the power flow equations (17.5) [27]
$$
\begin{cases}
P_{G_p} - P_{L_p} = \displaystyle\sum_{q=1}^{NB} |V_p||V_q| \left( G_{pq}\cos\delta_{pq} + B_{pq}\sin\delta_{pq} \right) \\[2ex]
Q_{G_p} - Q_{L_p} = \displaystyle\sum_{q=1}^{NB} |V_p||V_q| \left( G_{pq}\sin\delta_{pq} - B_{pq}\cos\delta_{pq} \right)
\end{cases}
\tag{17.5}
$$
where PGp , QGp are the injected active and reactive powers at the pth bus, respectively; PLp , QLp are the active and reactive power demands at the pth bus, respectively; Gpq , Bpq are the transfer conductance and susceptance between the pth and the qth buses, respectively, and NB is the number of buses.
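The equality constraints (17.5) are usually checked as residuals (mismatches) that should vanish at a solved operating point. The sketch below computes these residuals for one bus; it assumes, as is conventional but not stated explicitly in the text, that δ_pq = δ_p − δ_q is the voltage angle difference between buses p and q. The function name and the tiny two-bus data are hypothetical.

```python
import math

def power_mismatch(p, V, delta, G, B, P_net, Q_net):
    """Residuals of Eq. (17.5) at bus p (both are zero when the bus is balanced).

    V, delta: voltage magnitudes and angles (rad) of all buses.
    G, B: conductance and susceptance matrices as nested lists.
    P_net, Q_net: net injections P_G - P_L and Q_G - Q_L at bus p.
    Assumes delta_pq = delta[p] - delta[q].
    """
    P_calc = 0.0
    Q_calc = 0.0
    for q in range(len(V)):
        d = delta[p] - delta[q]
        P_calc += V[p] * V[q] * (G[p][q] * math.cos(d) + B[p][q] * math.sin(d))
        Q_calc += V[p] * V[q] * (G[p][q] * math.sin(d) - B[p][q] * math.cos(d))
    return P_net - P_calc, Q_net - Q_calc

# Hypothetical two-bus network with a flat voltage profile: no power flows,
# so zero net injection gives zero mismatch.
G = [[1.0, -1.0], [-1.0, 1.0]]
B = [[-5.0, 5.0], [5.0, -5.0]]
dP, dQ = power_mismatch(0, [1.0, 1.0], [0.0, 0.0], G, B, 0.0, 0.0)
print(dP, dQ)
```

A Newton–Raphson power flow drives these residuals to zero for every bus.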
17.2.3.2 Inequality Constraints
In (17.1), ieq is the set of system inequality constraints presented below.
(i) Generator capacity constraints: The generator voltages, the active and reactive power outputs of all generators (including the slack bus), and the transformer tap settings must be restricted by their respective lower and upper limits, as stated in (17.6) [6].
$$
\begin{cases}
P_{G_p}^{\min} \le P_{G_p} \le P_{G_p}^{\max}, & p = 1, 2, \cdots, N_{PV} \\
Q_{G_p}^{\min} \le Q_{G_p} \le Q_{G_p}^{\max}, & p = 1, 2, \cdots, N_{PV} \\
V_{G_p}^{\min} \le V_{G_p} \le V_{G_p}^{\max}, & p = 1, 2, \cdots, N_{PV} \\
T_p^{\min} \le T_p \le T_p^{\max}, & p = 1, 2, \cdots, N_T
\end{cases}
\tag{17.6}
$$
(ii) Security constraints: These constraints include the load bus voltages and transmission line loadings. Each of these must be restricted by its respective operating limits, as expressed in (17.7) and (17.8), in that order [6].
$$
V_{L_p}^{\min} \le V_{L_p} \le V_{L_p}^{\max}, \quad p = 1, 2, \cdots, N_{PQ}
\tag{17.7}
$$
$$
S_{l_p} \le S_{l_p}^{\max}, \quad p = 1, 2, \cdots, N_L
\tag{17.8}
$$
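All of the limits in (17.6)–(17.8) are box constraints, so a candidate solution can be screened with one small helper. This is only an illustrative sketch: the helper name is hypothetical, and the sample voltages are invented, with 0.95 and 1.05 p.u. used as example limits.

```python
def limit_violations(values, lower, upper):
    """Return the indices p for which lower[p] <= values[p] <= upper[p] fails."""
    return [p for p, (x, lo, hi) in enumerate(zip(values, lower, upper))
            if not (lo <= x <= hi)]

# Hypothetical load bus voltages (p.u.) checked against 0.95-1.05 p.u. limits.
v_load = [1.00, 1.07, 0.93, 1.05]
violated = limit_violations(v_load, [0.95] * 4, [1.05] * 4)
print(violated)  # [1, 2]
```

For the one-sided line flow limit (17.8), the same helper can be used with a lower bound of zero.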
17.2.3.3 Transient Stability Constraints
The transient stability constraints of the TSCOPF problem constitute a set of differential algebraic equations [29]. The generator rotor angle deviation with respect to the center of inertia (COI) is expressed in the form of inequality constraints, as stated in (17.9) [6].
$$
\delta^{\min} \le |\delta_k - \delta_{COI}| \le \delta^{\max}, \quad k \in S_G
\tag{17.9}
$$
The position of the COI may be expressed as in (17.10) [6].
$$
\delta_{COI} = \frac{\sum_{k=1}^{N_{PV}} M_k \delta_k}{\sum_{k=1}^{N_{PV}} M_k}
\tag{17.10}
$$
where M_k is the inertia constant of the kth generator and δ_k is the rotor angle of the kth generator.
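Equation (17.10) is an inertia-weighted average of the rotor angles, and (17.9) bounds each generator's deviation from it. A minimal sketch follows; the inertia constants and angles are invented for illustration, and the function names are hypothetical.

```python
def center_of_inertia(M, delta):
    """Inertia-weighted mean rotor angle, Eq. (17.10)."""
    return sum(m * d for m, d in zip(M, delta)) / sum(M)

def coi_deviations(M, delta):
    """Absolute rotor angle deviation |delta_k - delta_COI| of each generator."""
    coi = center_of_inertia(M, delta)
    return [abs(d - coi) for d in delta]

# Hypothetical three-generator snapshot (angles in degrees).
M = [2.0, 1.0, 1.0]
delta = [10.0, 20.0, 30.0]
print(center_of_inertia(M, delta))  # 17.5
print(coi_deviations(M, delta))     # [7.5, 2.5, 12.5]
```

In a transient stability check, these deviations would be evaluated at every integration time step and compared with the limits in (17.9).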
17.3 TSCOPF PROBLEM USING PROPOSED CWOA
17.3.1 OVERVIEW OF WOA
WOA [14] is a new metaheuristic swarm intelligence based optimization algorithm inspired by the hunting behavior of humpback whales searching for food in nature. WOA combines exploration (random search) and exploitation (local search), and these two characteristics play a significant role in achieving high performance in solving optimization problems. This metaheuristic approach requires only two main internal parameters (viz. D and C), which reflects the simplicity of the method and makes WOA more reliable, robust, and flexible. WOA consists of three steps that guide the search directions to improve the objective function value. These steps are [14]:
(a) Encircling the target,
(b) Bubble-net attacking method, and
(c) Searching for the target.
17.3.1.1 Encircling the Target
Humpback whales are highly competent at identifying the location of the target and encircling it. As the location of the optimum in the search space is not known in advance, the current best candidate solution is treated as the target. After the best search agent is selected, the other search agents try to move their positions towards it. This behavior may be expressed by (17.11) and (17.12), respectively [14].
$$
\vec{B} = \left| \vec{C} \cdot \vec{U}^*(t) - \vec{U}(t) \right|
\tag{17.11}
$$
$$
\vec{U}(t + 1) = \vec{U}^*(t) - \vec{D} \cdot \vec{B}
\tag{17.12}
$$
where $\vec{C}$ and $\vec{D}$ are the coefficient vectors, t denotes the current iteration, $\vec{U}$ is the position vector, $\vec{U}^*$ is the position vector of the best solution obtained so far, '| |' denotes the absolute value, and '.' denotes element-by-element multiplication. The coefficient vectors $\vec{D}$ and $\vec{C}$ may be computed by using (17.13) and (17.14), respectively [14].
$$
\vec{D} = 2\vec{d} \cdot \vec{r} - \vec{d}
\tag{17.13}
$$
$$
\vec{C} = 2 \cdot \vec{r}
\tag{17.14}
$$
where the value of d is set to 2 at the initial stage and is linearly decreased to 0 over the course of the iterations (through both the exploration and exploitation phases), and $\vec{r}$ is a random vector (i.e. 0 ≤ r ≤ 1).
Here, Eq. (17.12) may be implemented to update the position of the search agents in the area of the current best solution. Thereafter, the simulation step (a) (viz. encircling the target) becomes successful.
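The encircling update of (17.11)–(17.14) can be sketched in a few lines. To keep the example deterministic, the random numbers are passed in as arguments instead of being drawn inside the function; using separate random vectors for D and C is an assumption of this sketch, and the function name is hypothetical.

```python
def encircle_step(U, U_best, d, r_D, r_C):
    """One encircling update for a single search agent.

    U, U_best: current and best-so-far positions (lists of floats).
    d: scalar decreased linearly from 2 to 0 over the iterations.
    r_D, r_C: random numbers in [0, 1], one per dimension (assumed to be
    drawn separately for D and C).
    """
    new_position = []
    for u, u_best, rd, rc in zip(U, U_best, r_D, r_C):
        D = 2.0 * d * rd - d          # Eq. (17.13)
        C = 2.0 * rc                  # Eq. (17.14)
        B = abs(C * u_best - u)       # Eq. (17.11)
        new_position.append(u_best - D * B)  # Eq. (17.12)
    return new_position

# One-dimensional example: D = 0.5, C = 1, B = 1, so the agent moves to 0.5.
print(encircle_step([0.0], [1.0], d=1.0, r_D=[0.75], r_C=[0.5]))  # [0.5]
```

Since |D| ≤ d, a smaller d keeps the new position closer to the best solution.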
17.3.1.2 Bubble-net Attacking Method (Exploitation Phase)
The bubble-net attacking method of the humpback whales may be modeled mathematically by two mechanisms. These are:
(a) Shrinking encircling method
This behavior may be achieved by decreasing the value of d (refer to Eq. (17.13)) [14]. As a result, the fluctuation range of $\vec{D}$ is also reduced. Here, the value of d is decreased from 2 to 0 throughout the iterations. $\vec{D}$ may be taken as a random value that lies within the interval [−1, 1]. The new position of the search agent may then lie anywhere between its original position and the current best position.
(b) Spiral updating position
To compute the spiral updating position, a spiral equation is established between the position of the whale and the target to mimic the movement of the humpback whales. It may be written as (17.15) [14]
$$
\vec{U}(t + 1) = \vec{B}' \cdot e^{al} \cdot \cos(2\pi l) + \vec{U}^*(t)
\tag{17.15}
$$
where $\vec{B}' = |\vec{U}^*(t) - \vec{U}(t)|$ specifies the distance between the whale and the target, a is a constant that defines the shape of the logarithmic spiral, and l is a random number (i.e. −1 ≤ l ≤ 1). The humpback whales swim around the target within a shrinking circle and along a spiral shaped path simultaneously. To model this, each of the two above-mentioned techniques (viz. the shrinking encircling method and the spiral modeling method) is chosen with a probability of 50% when updating the position of the whales. The mathematical expression may be given as (17.16) [14]
$$
\vec{U}(t + 1) =
\begin{cases}
\vec{U}^*(t) - \vec{D} \cdot \vec{B} & \text{if } \alpha < 0.5 \\
\vec{B}' \cdot e^{al} \cdot \cos(2\pi l) + \vec{U}^*(t) & \text{if } \alpha \ge 0.5
\end{cases}
\tag{17.16}
$$
where α is a random number that lies within the interval [0, 1].
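The spiral update (17.15) can be sketched as follows. The random number l is passed in as an argument so that the example is reproducible, and the default a = 1.0 is an assumed value for the spiral shape constant, not one taken from this chapter.

```python
import math

def spiral_step(U, U_best, l, a=1.0):
    """Spiral position update of Eq. (17.15) for a single search agent.

    U, U_best: current and best-so-far positions (lists of floats).
    l: random number in [-1, 1].
    a: spiral shape constant (1.0 is an assumed default).
    """
    factor = math.exp(a * l) * math.cos(2.0 * math.pi * l)
    return [abs(u_best - u) * factor + u_best for u, u_best in zip(U, U_best)]

# With l = 0 the factor is exactly 1, so the agent lands at U_best + |U_best - U|.
print(spiral_step([0.0], [1.0], l=0.0))  # [2.0]
```

In (17.16) this update is used when α ≥ 0.5, and the shrinking encircling update of (17.12) is used otherwise.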
17.3.1.3 Searching for Target (Exploration Phase)
In this step, the target may be searched for by using the variation of the $\vec{D}$ vector. The humpback whales carry on the search process in a random manner according to the positions of each other. Hence, the value of the $\vec{D}$ vector may be set randomly (greater than 1 or less than −1) to force the search agents to move far away from a reference whale [14]. After that, the position of the search agent is updated with respect to a randomly chosen search agent. The update has the same form as in the exploitation phase, except that a randomly chosen search agent replaces the best one. The above-mentioned method and $|\vec{D}| > 1$ emphasize the exploration process and permit the proposed CWOA to accomplish a global search. The mathematical model may be expressed by (17.17) and (17.18), respectively [14]
$$
\vec{B} = \left| \vec{C} \cdot \vec{U}_{rand} - \vec{U} \right|
\tag{17.17}
$$
$$
\vec{U}(t + 1) = \vec{U}_{rand} - \vec{D} \cdot \vec{B}
\tag{17.18}
$$
where $\vec{U}_{rand}$ is a random position vector that may be selected from the current population.
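The exploration update of (17.17)–(17.18) can be sketched in the same style. The coefficient values D and C are supplied by the caller so that the example is deterministic; the function name is hypothetical.

```python
def explore_step(U, U_rand, D, C):
    """Exploration update of Eqs. (17.17)-(17.18) for a single search agent.

    U: current position; U_rand: position of a randomly chosen agent.
    D, C: coefficient values per dimension (|D| >= 1 in the exploration phase).
    """
    new_position = []
    for u, u_rand, Dj, Cj in zip(U, U_rand, D, C):
        B = abs(Cj * u_rand - u)             # Eq. (17.17)
        new_position.append(u_rand - Dj * B)  # Eq. (17.18)
    return new_position

# With D = 1.5 the agent overshoots the reference whale: 2 - 1.5 * 2 = -1.
print(explore_step([0.0], [2.0], D=[1.5], C=[1.0]))  # [-1.0]
```

Because |D| ≥ 1 here, the new position can fall outside the segment between the agent and the reference whale, which is what drives the global search.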
17.3.2 IMPLEMENTATION OF TSCOPF PROBLEM USING PROPOSED CWOA
Chaos is a dynamic behavior of nonlinear systems that has the properties of non-repetition and ergodicity [25]. Owing to the increasing interest in chaos, it has been applied in different fields such as pattern recognition, synchronization, chaos control, etc. [30]. In the present work, the concept of chaos is combined with the basic WOA (termed CWOA) and is applied to the TSCOPF problem of power systems. To improve the performance of the basic WOA on the considered test systems, a chaotic variable based logistic map is selected in the present work and is applied for tuning the inertia weights of the basic WOA technique. The logistic map is one of the simplest maps; it was proposed by May in 1976 and has been applied to nonlinear dynamics problems of biological populations [31]. Mathematically, this chaotic map may be defined by (17.19)
$$
v_{b+1} = c\, v_b (1 - v_b)
\tag{17.19}
$$
where v_b is the bth chaotic number (here, b denotes the iteration number).
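The logistic map (17.19) is a one-line recurrence. In the sketch below the control parameter c = 4 is an assumption (a value commonly used to obtain chaotic behavior); the value used by the authors is not stated in this section, and the function name is hypothetical.

```python
def logistic_sequence(v0, n, c=4.0):
    """Generate n successive chaotic numbers from Eq. (17.19).

    v0: initial value in (0, 1); c: control parameter (4.0 is an assumed value).
    """
    sequence = []
    v = v0
    for _ in range(n):
        v = c * v * (1.0 - v)
        sequence.append(v)
    return sequence

# Starting from 0.1 the first three values are approximately
# 0.36, 0.9216 and 0.28901376.
print(logistic_sequence(0.1, 3))
```

With c = 4 the values stay within [0, 1], so they can stand in for uniform random numbers; initial values such as 0, 0.25, 0.5, 0.75 and 1 should be avoided because they lead to fixed points.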
17.3.2.1 Implementation of CWOA for TSCOPF Problem
The main steps of the proposed CWOA approach, as applied to the TSCOPF problem of power systems, are described in Fig. 17.1.
17.4 SIMULATION RESULTS AND DISCUSSION
In this work, the applicability of the proposed CWOA approach to the TSCOPF problem is tested on two test power systems, viz. the WSCC 3-generator, 9-bus test system and the IEEE 30-bus test system. For this TSCOPF study, a classical generator model and a constant impedance model are used for the synchronous generators and the loads, respectively [7]. The simulation program is coded in the MATLAB 2008a computing environment on a 2.63 GHz Pentium IV personal computer with 3 GB RAM. The simulation results are reported and discussed in this section.
17.4.1 INPUT PARAMETERS
The number of fitness function evaluations (NFFEs) is set to 100 (for both WOA and the proposed CWOA) for all the simulated test cases. Bus 1 is taken as the slack bus for both power networks. The population size (Np) is chosen as 50 for both the WOA and the proposed CWOA methods. One hundred independent trial runs have been carried out for this simulation study. For transient stability simulation, the integration time step is set to 0.01 s, whereas the total simulation time is 3.0 s and 1.0 s for the WSCC 3-generator, 9-bus and IEEE 30-bus test power systems, respectively. The value of d is set to 2 for the exploration stage (i.e. at the initial stage) and is then reduced linearly to 0 for exploitation [14].
17.4.2 TEST SYSTEM 1: WSCC 3-GENERATOR, 9-BUS TEST POWER SYSTEM
The WSCC 3-generator, 9-bus test system is considered as test system 1 (refer to Fig. 17.2A for its single-line diagram). The system data are taken from [33]. The fuel cost coefficients and the ratings of the generators are taken from [34]. The lower and upper limits of all bus voltage magnitudes are taken as 0.95 p.u. and 1.05 p.u., respectively. Three cases are considered for this test system, including the base load one:
(a) Case study 1.1 Base load condition: without considering transient stability [35].
(b) Case study 1.2 Contingency 1: A 3-phase to ground fault occurs at bus 7, on the line between buses 7 and 5. The fault clearing time is 0.35 s [35].
(c) Case study 1.3 Contingency 2: A 3-phase to ground fault occurs at bus 9, on the line between buses 6 and 9. The fault clearing time is 0.3 s [35].

Algorithm 1: Implementation of the proposed CWOA for TSCOPF problem
Step 1  Read the parameters of the defined power systems and of CWOA. Specify the minimum and maximum value of each variable.
Step 2  Fitness evaluation: Randomly generate the position set of humpback whales and evaluate the fitness function value (defined in (17.4)) for each of them based on the results of Newton–Raphson power flow analysis [32].
Step 3  while the termination criterion is not satisfied or m < NFFEs
            for each search agent
                Update the parameters such as d, D, C, l, α.
                if (α < 0.5)
                    if (|D| < 1)
                        Update the position of current candidate solution using (17.11).
                    else if (|D| ≥ 1)
                        Select a random search agent (Urand) and update the position of current candidate solution using (17.18).
                    end if
                else if (α ≥ 0.5)
                    Update the position of current candidate solution using (17.15).
                end if
            end for
            Update the internal parameters (D and C) using chaotic maps.
            Sort the population from best to worst and find the current best.
            m = m + 1;
        end while
Step 4  Check for the constraints (i.e. equality constraints mentioned in (17.5) and inequality constraints including transient stability constraints presented in (17.6)–(17.9) of the problem).
Step 5  Go to Step 2 until stopping criterion is met.
FIGURE 17.1 Implementation of the proposed CWOA for TSCOPF problem.
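The control flow of Algorithm 1 (Fig. 17.1) can be illustrated with a short sketch. It is not the authors' implementation. Because Eqs. (17.11), (17.15), and (17.18) are not reproduced in this section, the position updates below follow the standard WOA rules of [14] written in this chapter's symbols (d, D, C, l, α). The logistic map [31] is assumed as the chaotic map, and it drives D and C inside the agent loop for brevity. The sphere function is a stand-in for the fitness function (17.4), so no power flow, constraint handling, or transient stability check is included. All function names are hypothetical.

```python
import math
import random


def sphere(x):
    """Stand-in objective (sum of squares); the chapter's fitness is (17.4)."""
    return sum(v * v for v in x)


def logistic_map(c):
    """One step of the logistic map, assumed here as the chaotic sequence."""
    return 4.0 * c * (1.0 - c)


def chaotic_woa(fitness, dim, lower, upper, n_agents=50, n_iter=100, seed=0):
    """Minimize `fitness` over [lower, upper]^dim with a chaotic WOA sketch."""
    rng = random.Random(seed)

    def clip(v):
        return min(max(v, lower), upper)

    agents = [[rng.uniform(lower, upper) for _ in range(dim)]
              for _ in range(n_agents)]
    best = min(agents, key=fitness)[:]
    chaos = 0.7  # initial chaotic state (assumed value)
    for m in range(n_iter):
        d = 2.0 * (1.0 - m / n_iter)  # d decreases linearly from 2 to 0
        for i, x in enumerate(agents):
            chaos = logistic_map(chaos)
            D = 2.0 * d * chaos - d   # chaotic value replaces a uniform draw
            chaos = logistic_map(chaos)
            C = 2.0 * chaos
            l = rng.uniform(-1.0, 1.0)
            alpha = rng.random()
            if alpha < 0.5:
                # |D| < 1: move around the best agent (exploitation);
                # otherwise move around a random agent (exploration).
                ref = best if abs(D) < 1.0 else rng.choice(agents)
                new = [r - D * abs(C * r - v) for r, v in zip(ref, x)]
            else:
                # Spiral move toward the best agent (spiral constant b = 1).
                new = [abs(b - v) * math.exp(l) * math.cos(2.0 * math.pi * l) + b
                       for b, v in zip(best, x)]
            agents[i] = [clip(v) for v in new]
        candidate = min(agents, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate[:]
    return best, fitness(best)


best_x, best_f = chaotic_woa(sphere, dim=3, lower=-5.0, upper=5.0)
print(best_f)
```

The defaults mirror the settings of Section 17.4.1 (Np = 50, 100 iterations). Keeping the best agent outside the population means the returned fitness can never be worse than that of the initial population.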
FIGURE 17.2 Single-line diagram of (A) WSCC 3-generator, 9-bus test system, and (B) IEEE 30-bus test system.
Table 17.1 Best Control Variable Settings for Fuel Cost Minimization Objective of WSCC 9-Bus Test Power System (Case Study 1.1)

| Control variables / Base load result offered by | TS [34] | DE [7] | TDS [36] | BPD [37] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|---|
| PG1, MW | 106.19 | 105.94 | 105.94 | 105.94 | 107.15 | 106.25 | 107.05 |
| PG2, MW | 112.96 | 113.04 | 113.04 | 113.04 | 114.18 | 113.36 | 113.96 |
| PG3, MW | 99.20 | 99.29 | 99.24 | 99.23 | 96.74 | 98.51 | 97.18 |
| VG1, pu | NR | 1.0500 | NR | 1.0500 | 1.0330 | 1.0500 | 1.0500 |
| VG2, pu | NR | 1.0500 | NR | 1.0500 | 1.0240 | 1.0500 | 1.0500 |
| VG3, pu | NR | 1.0400 | NR | 1.0400 | 1.0280 | 1.0400 | 1.0400 |
| Minimum cost, $/h | 1132.59 | 1132.30 | 1132.18 | 1132.17 | 1131.87 | 1131.15 | 1130.53 |
| Maximum cost, $/h | NR | 1132.71 | NR | NR | 1132.22 | 1132.15 | 1131.67 |
| Average cost, $/h | NR | 1132.32 | NR | NR | 1132.04 | 1131.45 | 1130.98 |
| CPU time, s | NR | NR | NR | NR | NR | 19.31 | 16.73 |

NR means not reported in the referred literature and the result of interest is bold faced
(a) Case study 1.1
The base load condition is the one in which the transient stability constraint is not addressed. The only objective in this case is to minimize the fuel cost of the entire power network. The CWOA based results for the fuel cost minimization objective of this test system are presented in Table 17.1, where they are compared with those offered by trajectory sensitivities (TS) [34], DE [7], time domain simulation (TDS) [36], base case (BPD) [37], ABC [35], and the studied WOA. The table shows that the minimum fuel cost offered by the proposed CWOA is 1130.53 $/h. The comparative convergence profile of fuel cost ($/h) for this power system, as yielded by WOA and the proposed CWOA, is presented in Fig. 17.3A. The figure shows that the fuel cost converges smoothly, and at a smaller value of NFFEs, for the proposed CWOA than for the basic WOA.

(b) Case study 1.2
Table 17.2 presents the optimal values of the control variables offered by the proposed CWOA for the TSCOPF problem after a contingency in which a 3-phase to ground fault occurs at bus 7, on the line between buses 7 and 5 of this test system. The fault clearing time is assumed to be 0.35 s. In Table 17.2, the TSCOPF results of the proposed CWOA are compared with those reported in the literature for TS [34], TSCOPF detailed model (TSCOPF_DM) [37], DE [7], TSCOPF classical model (TSCOPF_CM) [37], TDS [36], improved group search optimization (IGSO) [38], TSCOPF detailed model-well tuned (TSCOPF_DMWT) [37], and ABC [35], and with those of WOA. The table shows a fuel cost reduction of 0.1879% relative to the previous best result of 1133.18 $/h, obtained with the ABC method in [35]. Note that the cost in the TSCOPF case is higher than that of the base case OPF; this is the price of achieving stable performance of the power system network. Fig. 17.3B shows the comparative convergence characteristics of WOA and the proposed CWOA for the fuel cost minimization objective of this test system. The relative rotor angle deviation curve with respect to COI is shown in Fig. 17.4A. The characteristics yielded by the proposed CWOA for both generators (other than the one at the slack bus) are stable after the disturbance.

FIGURE 17.3 Comparative convergence profiles of fuel cost for TSCOPF based fuel cost minimization objective of WSCC 9-bus test power system pertaining to (A) Case study 1.1, (B) Case study 1.2, and (C) Case study 1.3.

Table 17.2 Best Control Variable Settings for Fuel Cost Minimization Objective of WSCC 9-Bus Test Power System (Case Study 1.2)

| Control variables | TS [34] | TSCOPF_DM [37] | DE [7] | TSCOPF_CM [37] | TDS [36] | IGSO [38] | TSCOPF_DMWT [37] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|---|---|---|---|
| PG1, MW | 170.20 | 138.47 | 130.94 | 119.75 | 117.85 | 118.04 | 116.25 | 117.69 | 112.20 | 113.19 |
| PG2, MW | 48.94 | 94.20 | 94.46 | 106.35 | 103.50 | 103.51 | 106.26 | 105.89 | 106.66 | 105.35 |
| PG3, MW | 98.74 | 85.01 | 93.09 | 91.81 | 96.66 | 96.43 | 95.48 | 94.23 | 99.32 | 99.33 |
| VG1, pu | NR | 1.0500 | 0.9590 | 1.0500 | NR | 1.0450 | 1.0500 | 1.0250 | 1.0495 | 1.0310 |
| VG2, pu | NR | 1.0500 | 1.0139 | 1.0500 | NR | 1.0480 | 1.0500 | 1.0700 | 1.0482 | 1.0100 |
| VG3, pu | NR | 1.0400 | 1.0467 | 1.0400 | NR | 1.0410 | 1.0400 | 1.0700 | 1.0490 | 1.0410 |
| Minimum cost, $/h | 1191.56 | 1143.42 | 1140.06 | 1134.20 | 1134.01 | 1133.96 | 1133.34 | 1133.18 | 1132.1399 | 1131.05 |
| Maximum cost, $/h | NR | NR | 1141.57 | NR | NR | 1134.52 | NR | 1138.8 | 1134.8678 | 1133.6547 |
| Average cost, $/h | NR | NR | 1140.65 | NR | NR | 1134.12 | NR | 1135.9 | 1132.6778 | 1131.78 |
| CPU time, s | NR | NR | NR | NR | NR | 46.1 | NR | NR | 49.65 | 64.23 |

NR means not reported in the referred literature and the result of interest is bold faced

FIGURE 17.4 Relative rotor angle trajectories for TSCOPF solution of WSCC 9-bus test power system pertaining to (A) Case study 1.2, and (B) Case study 1.3.

(c) Case study 1.3
The best control variable settings for the fuel cost minimization objective of this test system, as yielded by the proposed CWOA after the occurrence of the fault at bus 9, on the line between buses 6 and 9, are reported in Table 17.3. The fault clearing time is taken as 0.3 s. The simulation results obtained with CWOA are compared with those offered by other optimization techniques, namely TS [34], DE [7], TDS [36], ABC [35], and the basic WOA. The table shows a 0.3313% reduction in fuel cost for the proposed CWOA (1134.01 $/h) relative to the ABC based result (1137.78 $/h) reported in [35]. The comparative convergence profile of the fuel cost minimization objective for WOA and the proposed CWOA on this test system is shown in Fig. 17.3C. The figure shows that the fuel cost converges smoothly, and at a smaller value of NFFEs, for the proposed CWOA than for WOA.
Table 17.3 Best Control Variable Settings for Fuel Cost Minimization Objective of WSCC 9-Bus Test Power System (Case Study 1.3)

| Control variables | TS [34] | DE [7] | TDS [36] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| PG1, MW | 164.38 | 130.01 | 120.01 | 121.23 | 122.34 | 124.89 |
| PG2, MW | 112.44 | 127.17 | 121.13 | 120.63 | 113.67 | 118.58 |
| PG3, MW | 41.00 | 60.72 | 76.84 | 75.94 | 81.79 | 74.33 |
| VG1, pu | NR | 1.0495 | NR | 1.0190 | 1.0499 | 1.0494 |
| VG2, pu | NR | 1.0481 | NR | 1.0410 | 1.0476 | 1.0499 |
| VG3, pu | NR | 1.0327 | NR | 1.0440 | 1.0444 | 1.0495 |
| Minimum cost, $/h | 1179.95 | 1147.77 | 1137.82 | 1137.78 | 1134.1198 | 1134.01 |
| Maximum cost, $/h | NR | 1151.37 | NR | 1146.70 | 1148.8000 | 1156.88 |
| Average cost, $/h | NR | 1148.58 | NR | 1142.20 | 1141.9800 | 1143.67 |
| CPU time, s | NR | NR | NR | NR | 43.65 | 59.35 |

NR means not reported in the referred literature and the result of interest is bold faced
Fig. 17.4B shows the relative rotor angle deviation characteristic, which is stable after the fault is cleared.

17.4.3 TEST SYSTEM 2: IEEE 30-BUS TEST POWER SYSTEM
The IEEE 30-bus test system, consisting of six generating units interconnected with forty-one transmission lines and four transformers, is chosen as test system 2. The single-line diagram of this power system is shown in Fig. 17.2B. The system data, such as bus data, line data, and initial values of control variables, are taken from [39]. The fuel cost coefficients and the ratings of the generators are the same as in [35]. The total load demands of this test system are PLoad = 189.2 MW and QLoad = 107.2 MVAr at 100 MVA base. The three chosen case studies, including the base load condition, are as follows:
(a) Case study 2.1 Base load condition: without considering transient stability [6].
(b) Case study 2.2 Contingency 1: A 3-phase to ground fault occurs at bus 2, on the line between buses 2 and 5. The fault clearing time is 0.18 s [40].
(c) Case study 2.3 Contingency 2: A 3-phase to ground fault occurs at bus 2, on the line between buses 2 and 5. The fault clearing time is 0.35 s [40].
(a) Case study 2.1
The objective in this case is to minimize the fuel cost without considering the transient stability constraint. Table 17.4 lists the best control variable settings obtained by the studied WOA and the proposed CWOA for this test system, together with those yielded by other optimization techniques, namely OPF-GA [6], GA [6], and PSO [6]. The table shows that the fuel cost obtained with CWOA is 32.53 $/h lower than the earlier best result, obtained with PSO in [6]. The comparative convergence behaviour of WOA and the proposed CWOA is shown in Fig. 17.5A, which indicates that the CWOA based objective function value converges smoothly and reaches the near-global optimal value in fewer NFFEs.

FIGURE 17.5 Comparative convergence profiles of fuel cost for TSCOPF based fuel cost minimization objective of IEEE 30-bus test power system pertaining to (A) Case study 2.1, (B) Case study 2.2, and (C) Case study 2.3.

Table 17.4 Best Control Variable Settings for Fuel Cost Minimization Objective of IEEE 30-Bus Test Power System (Case Study 2.1)

| Control variables / Base load result offered by | OPF-GA [6] | GA [6] | PSO [6] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|
| PG1, MW | 41.54 | 42.30 | 42.22 | 42.65 | 41.77 |
| PG2, MW | 55.40 | 55.61 | 55.98 | 55.24 | 56.10 |
| PG13, MW | 16.20 | 23.04 | 22.83 | 21.77 | 23.14 |
| PG22, MW | 22.74 | 37.62 | 37.75 | 37.59 | 37.76 |
| PG23, MW | 16.27 | 16.51 | 15.91 | 16.21 | 16.93 |
| PG27, MW | 39.91 | 16.96 | 17.35 | 18.58 | 16.34 |
| VG1, pu | NR | NR | NR | 0.9800 | 1.0200 |
| VG2, pu | NR | NR | NR | 0.9700 | 1.0400 |
| VG13, pu | NR | NR | NR | 1.0300 | 0.9700 |
| VG22, pu | NR | NR | NR | 1.0200 | 0.9800 |
| VG23, pu | NR | NR | NR | 0.9800 | 1.0100 |
| VG27, pu | NR | NR | NR | 1.0200 | 1.0000 |
| T6−9, pu | 1.00 | 1.02 | 1.01 | 0.97 | 0.98 |
| T6−10, pu | 1.00 | 0.96 | 0.97 | 0.96 | 1.00 |
| T4−12, pu | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| T28−27, pu | 1.00 | 0.98 | 0.97 | 1.02 | 1.01 |
| Minimum cost, $/h | 576.89 | 576.64 | 576.63 | 544.67 | 544.10 |
| Maximum cost, $/h | NR | 576.91 | 576.87 | 546.65 | 544.98 |
| Average cost, $/h | NR | 576.75 | 576.69 | 544.97 | 544.24 |
| CPU time, s | 29.4 | 27.8 | 22.5 | 25.34 | 21.60 |

NR means not reported in the referred literature and the result of interest is bold faced

(b) Case study 2.2
The optimal settings of control variables offered by WOA and the proposed CWOA for the fuel cost minimization objective are presented in Table 17.5 for a 3-phase to ground fault at bus 2, on the line between buses 2 and 5. The fault clearing time is taken as 0.18 s for the sake of comparison with [40]. The table also includes the results offered by GA [6], PSO [6], ABC [40], and chaotic ABC (CABC) [40]. The fuel cost obtained with the proposed CWOA is 564.1243 $/h, and that obtained with WOA is 564.86 $/h. The CWOA result corresponds to a fuel cost reduction of 2.338% relative to the previous best algorithm, CABC [40]. The comparative convergence profile of fuel cost ($/h) for WOA and the proposed CWOA, shown in Fig. 17.5B, indicates that CWOA converges to a near-global solution in fewer NFFEs. The relative rotor angle deviation curve, shown in Fig. 17.6A, indicates that the considered system remains stable after the occurrence of the fault when the proposed CWOA method is adopted.

Table 17.5 Best Control Variable Settings for Fuel Cost Minimization Objective of IEEE 30-Bus Test Power System (Case Study 2.2)

| Control variables | GA [6] | PSO [6] | ABC [40] | CABC [40] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| PG1, MW | 41.88 | 43.63 | 40.5512 | 41.4823 | 41.6772 | 42.3443 |
| PG2, MW | 56.38 | 58.05 | 51.9248 | 55.3017 | 57.1053 | 56.7890 |
| PG13, MW | 22.94 | 23.29 | 18.9168 | 17.0909 | 18.3953 | 17.9230 |
| PG22, MW | 37.63 | 32.49 | 23.8110 | 20.8952 | 21.3213 | 18.9767 |
| PG23, MW | 16.70 | 17.04 | 16.8010 | 17.0019 | 17.5919 | 20.8343 |
| PG27, MW | 16.53 | 17.54 | 40.000 | 40.4145 | 35.9138 | 35.1727 |
| VG1, pu | NR | NR | 0.9858 | 0.9723 | 1.0213 | 1.0241 |
| VG2, pu | NR | NR | 0.9780 | 0.9738 | 1.0335 | 1.0234 |
| VG13, pu | NR | NR | 1.0601 | 1.0785 | 1.0205 | 1.0069 |
| VG22, pu | NR | NR | 1.0191 | 1.0241 | 1.0262 | 1.0133 |
| VG23, pu | NR | NR | 1.0400 | 1.0340 | 1.0430 | 1.0102 |
| VG27, pu | NR | NR | 1.0639 | 1.0610 | 1.0400 | 1.0382 |
| T6−9, pu | 1.01 | 1.01 | NR | NR | 0.95 | 0.97 |
| T6−10, pu | 0.95 | 0.96 | NR | NR | 1.01 | 0.99 |
| T4−12, pu | 1.00 | 1.01 | NR | NR | 1.00 | 0.96 |
| T28−27, pu | 0.97 | 0.97 | NR | NR | 0.97 | 1.01 |
| Minimum cost, $/h | 585.62 | 585.17 | 577.78 | 577.63 | 564.86 | 564.1243 |
| Maximum cost, $/h | 585.71 | 585.69 | 583.90 | 580.83 | 587.088 | 582.78 |
| Average cost, $/h | 585.66 | 585.34 | 580.84 | 579.23 | 568.98 | 565.90 |
| CPU time, s | 752.3 | 576.4 | 115 | 182 | 153.58 | 197.89 |

NR means not reported in the referred literature and the result of interest is bold faced

(c) Case study 2.3
Table 17.6 lists the best settings of control variables for fuel cost minimization of this test system when a 3-phase to ground fault occurs at bus 2, on the line between buses 2 and 5. The fault clearing time is taken as 0.35 s [40]. The TSCOPF results offered by EP [6], EP incorporating neural network (EPNN) [6], ABC [40], CABC [40], and WOA are included in the same table for comparison. The table shows a 2.324% reduction in fuel cost relative to the previous best result, offered by CABC [40]. The comparative convergence profile of fuel cost for WOA and the proposed CWOA is shown in Fig. 17.5C. The figure indicates that the CWOA based approach is a promising one for this case of the test system. Fig. 17.6B shows the relative rotor angle deviation curve, from which it may be inferred that the system remains stable after the occurrence of the fault when the proposed CWOA is adopted.
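The relative fuel cost reductions quoted for the contingency cases of both test systems can be reproduced from the tabulated minimum costs. The short check below is an illustration only; the helper name `percent_reduction` is hypothetical, and the values it prints agree with the quoted percentages to within rounding of the last digit.

```python
def percent_reduction(previous_best: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `previous_best`, in %."""
    return 100.0 * (previous_best - proposed) / previous_best


# (previous best minimum cost, CWOA minimum cost) in $/h, from Tables 17.2-17.6.
cases = {
    "1.2 (ABC [35] vs CWOA)": (1133.18, 1131.05),
    "1.3 (ABC [35] vs CWOA)": (1137.78, 1134.01),
    "2.2 (CABC [40] vs CWOA)": (577.63, 564.1243),
    "2.3 (CABC [40] vs CWOA)": (577.47, 564.05),
}
for name, (previous, proposed) in cases.items():
    print(f"Case study {name}: {percent_reduction(previous, proposed):.4f}%")

# Case study 2.1 is quoted as an absolute saving over PSO [6], in $/h.
saving_2_1 = 576.63 - 544.10
print(f"Case study 2.1 (PSO [6] vs CWOA): {saving_2_1:.2f} $/h")
```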
FIGURE 17.6 Relative rotor angle trajectories for TSCOPF solution of IEEE 30-bus test power system pertaining to (A) Case study 2.2, and (B) Case study 2.3.
17.4.4 STATISTICAL ANALYSIS OF THE RESULTS
The t-test method is a statistical evaluation of the substantial deviation between two algorithms. Mathematically, it may be represented as (17.20) [41]:

$$ t = \frac{\bar{\alpha}_2 - \bar{\alpha}_1}{\sqrt{\dfrac{\sigma_2^2}{\xi + 1} + \dfrac{\sigma_1^2}{\xi + 1}}} \qquad (17.20) $$

where $\bar{\alpha}_1$ and $\bar{\alpha}_2$ are the mean values of the first and second algorithms, respectively; $\sigma_1$ and $\sigma_2$ are the standard deviations of the first and second algorithms, respectively; and $\xi$ is the value of the degree of freedom.
Table 17.6 Best Control Variable Settings for Fuel Cost Minimization Objective of IEEE 30-Bus Test Power System (Case Study 2.3)

| Control variables | EP [6] | EPNN [6] | ABC [40] | CABC [40] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| PG1, MW | NR | 48.95 | 40.6796 | 42.7411 | 41.6443 | 40.9711 |
| PG2, MW | NR | 38.41 | 55.0000 | 54.7034 | 55.4773 | 56.9119 |
| PG13, MW | NR | 23.34 | 15.4663 | 14.8287 | 15.9703 | 14.8773 |
| PG22, MW | NR | 24.65 | 22.1919 | 24.3374 | 21.3843 | 22.7991 |
| PG23, MW | NR | 17.61 | 19.9917 | 17.9055 | 18.5905 | 17.5997 |
| PG27, MW | NR | 38.99 | 38.8263 | 37.6754 | 39.0933 | 39.0091 |
| VG1, pu | NR | 0.9900 | 0.9700 | 0.9733 | 1.0273 | 1.0446 |
| VG2, pu | NR | 0.9900 | 0.9715 | 0.9653 | 1.0466 | 1.0237 |
| VG13, pu | NR | 1.0700 | 1.0752 | 1.0620 | 1.0078 | 1.0178 |
| VG22, pu | NR | 0.9900 | 1.0187 | 1.0106 | 1.0100 | 1.0336 |
| VG23, pu | NR | 1.0200 | 1.0264 | 1.0242 | 1.0366 | 1.0200 |
| VG27, pu | NR | 1.0500 | 1.0643 | 1.0685 | 1.0007 | 1.0034 |
| T6−9, pu | NR | 1.01 | NR | NR | 0.98 | 1.01 |
| T6−10, pu | NR | 0.98 | NR | NR | 1.01 | 0.97 |
| T4−12, pu | NR | 1.03 | NR | NR | 1.04 | 1.04 |
| T28−27, pu | NR | 1.04 | NR | NR | 1.01 | 1.03 |
| Minimum cost, $/h | 585.15 | 585.12 | 577.71 | 577.47 | 564.72 | 564.05 |
| Maximum cost, $/h | 586.86 | 586.73 | 583.26 | 580.74 | 570.43 | 567.75 |
| Average cost, $/h | 585.83 | 585.84 | 580.21 | 579.10 | 566.54 | 564.65 |
| CPU time, s | 451.13 | 220.17 | 125 | 195 | 145.34 | 215.24 |

NR means not reported in the referred literature and the result of interest is bold faced
A positive value of t signifies that the first algorithm is better than the second one, and vice versa. When the t-value is greater than 1.645 with ξ = 49, the difference is significant at a 95% confidence level [41]. The statistical analysis, carried out over 100 independent trial runs, is presented in Table 17.7. The table shows that t-values of at least 15.383 (ξ = 49) are obtained for all the compared techniques, which indicates a substantial difference between the proposed CWOA and the other methods at a 99% confidence level.
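As a worked instance of (17.20), the sketch below recomputes two of the t-values of Table 17.7 from the tabulated means and standard deviations, taking CWOA as the first algorithm and WOA as the second. The function name `t_value` is hypothetical. The square root in the denominator is an assumption of this sketch, adopted because it reproduces the tabulated values.

```python
import math


def t_value(mean_1, std_1, mean_2, std_2, xi):
    """t-statistic of Eq. (17.20); a positive value favours algorithm 1."""
    return (mean_2 - mean_1) / math.sqrt(
        std_2 ** 2 / (xi + 1) + std_1 ** 2 / (xi + 1)
    )


# Case study 1.1: CWOA (algorithm 1) against WOA (algorithm 2), xi = 49.
t_case_1_1 = t_value(mean_1=1130.98, std_1=0.0052,
                     mean_2=1131.45, std_2=0.0073, xi=49)
print(round(t_case_1_1, 1))  # 370.8

# Case study 1.2: CWOA (algorithm 1) against WOA (algorithm 2), xi = 49.
t_case_1_2 = t_value(mean_1=1131.78, std_1=0.085,
                     mean_2=1132.6778, std_2=0.09, xi=49)
print(round(t_case_1_2, 2))  # 51.28
```

Both values exceed the 1.645 threshold quoted above by a wide margin.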
Table 17.7 Comparison Among Different Optimization Algorithms for TSCOPF of WSCC 9-Bus Test System and IEEE 30-Bus Test System (Rank 1: Best, Rank 4: Worst)

WSCC 9-bus: Case study 1.1

| Parameters | TS [34] | DE [7] | TDS [36] | BPD [37] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|---|
| Minimum cost, $/h | 1132.59 | 1132.30 | 1132.18 | 1132.17 | 1131.87 | 1131.15 | 1130.53 |
| Maximum cost, $/h | NR | 1132.71 | NR | NR | 1132.22 | 1132.15 | 1131.67 |
| Average cost, $/h | NR | 1132.32 | NR | NR | 1132.04 | 1131.45 | 1130.98 |
| Standard deviation, $/h | NR | 0.01 | NR | NR | NR | 0.0073 | 0.0052 |
| t-test | NR | 840.658 | NR | NR | NR | 370.8 | NA |
| Rank | NA | 3 | NA | NA | NA | 2 | 1 |

WSCC 9-bus: Case study 1.2

| Parameters | TS [34] | TSCOPF_DM [37] | DE [7] | TSCOPF_CM [37] | TDS [36] | IGSO [38] | TSCOPF_DMWT [37] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|---|---|---|---|
| Minimum cost, $/h | 1191.56 | 1143.42 | 1140.06 | 1134.20 | 1134.01 | 1133.96 | 1133.34 | 1133.18 | 1132.1399 | 1131.05 |
| Maximum cost, $/h | NR | NR | 1141.57 | NR | NR | 1134.52 | NR | 1138.8 | 1134.8678 | 1133.6547 |
| Average cost, $/h | NR | NR | 1140.65 | NR | NR | 1134.12 | NR | 1135.9 | 1132.6778 | 1131.78 |
| Standard deviation, $/h | NR | NR | 0.456 | NR | NR | 0.14 | NR | NR | 0.09 | 0.085 |
| t-test | NR | NR | 135.2156 | NR | NR | 101.025 | NR | NR | 51.28 | NA |
| Rank | NA | NA | 4 | NA | NA | 3 | NA | NA | 2 | 1 |

WSCC 9-bus: Case study 1.3

| Parameters | TS [34] | DE [7] | TDS [36] | ABC [35] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| Minimum cost, $/h | 1179.95 | 1147.77 | 1137.82 | 1137.78 | 1134.1198 | 1134.01 |
| Maximum cost, $/h | NR | 1151.37 | NR | 1146.70 | 1148.8000 | 1156.88 |
| Average cost, $/h | NR | 1148.58 | NR | 1142.20 | 1141.9800 | 1143.67 |
| Standard deviation, $/h | NR | 0.7 | NR | NR | 0.49 | 0.35 |
| t-test | NR | 44.36 | NR | NR | 15.383 | NA |
| Rank | NA | 3 | NA | NA | 2 | 1 |

IEEE 30-bus: Case study 2.1

| Parameters | OPF-GA [6] | GA [6] | PSO [6] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|
| Minimum cost, $/h | 576.89 | 576.64 | 576.63 | 544.67 | 544.10 |
| Maximum cost, $/h | NR | 576.91 | 576.87 | 546.65 | 544.98 |
| Average cost, $/h | NR | 576.75 | 576.69 | 544.97 | 544.24 |
| Standard deviation, $/h | NR | NR | NR | 0.075 | 0.063 |
| t-test | NR | NR | NR | 58.8021 | NA |
| Rank | NA | NA | NA | 2 | 1 |

IEEE 30-bus: Case study 2.2

| Parameters | GA [6] | PSO [6] | ABC [40] | CABC [40] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| Minimum cost, $/h | 585.62 | 585.17 | 577.78 | 577.63 | 564.86 | 564.1243 |
| Maximum cost, $/h | 585.71 | 585.69 | 583.90 | 580.83 | 587.088 | 582.78 |
| Average cost, $/h | 585.66 | 585.34 | 580.84 | 579.23 | 568.98 | 565.90 |
| Standard deviation, $/h | NR | NR | NR | NR | 0.095 | 0.087 |
| t-test | NR | NR | NR | NR | 169.067 | NA |
| Rank | NA | NA | NA | NA | 2 | 1 |

IEEE 30-bus: Case study 2.3

| Parameters | EP [6] | EPNN [6] | ABC [40] | CABC [40] | WOA [Studied] | CWOA [Proposed] |
|---|---|---|---|---|---|---|
| Minimum cost, $/h | 585.15 | 585.12 | 577.71 | 577.47 | 564.72 | 564.05 |
| Maximum cost, $/h | 586.86 | 586.73 | 583.26 | 580.74 | 570.43 | 567.75 |
| Average cost, $/h | 585.83 | 585.84 | 580.21 | 579.10 | 566.54 | 564.65 |
| Standard deviation, $/h | NR | NR | NR | NR | 0.51 | 0.47 |
| t-test | NR | NR | NR | NR | 19.27 | NA |
| Rank | NA | NA | NA | NA | 2 | 1 |

NR means not reported in the referred literature, NA means not applicable for this algorithm, and the results of interest are bold faced

17.5 CONCLUSION AND SCOPE OF FUTURE WORK
The present work investigates CWOA, a novel evolutionary optimization technique, and its potential benefit when applied to the TSCOPF problem of power systems considering multiple contingencies. Two test power systems, the WSCC 3-generator, 9-bus system and the IEEE 30-bus system, are taken into consideration. The simulation results offered by the proposed CWOA are compared with those of other popular techniques reported in the recent state-of-the-art literature. Basic WOA based results are also studied and included in the present work. From the simulation results, it may be concluded that the inclusion of chaotic variables in the basic WOA method yields a considerable improvement in the transient performance of the power system. CWOA may be applied to other practical problems, such as hydrothermal scheduling, geometric optimization, resource-constrained project scheduling, image processing, and path planning, to explore its efficiency and its convergence behaviour. Other optimization strategies, such as opposition-based learning, quantum theory, and tabu search, may also be embedded by future researchers to prove and explain the convergence of their proposed methods. Recently reported metaheuristic algorithms, such as the earthworm optimization algorithm, monarch butterfly optimization, elephant herding optimization, the firefly algorithm, cuckoo search, and the bat algorithm, may also be applied by future researchers to solve the TSCOPF problem of power systems.
REFERENCES
[1] X.S. Yang, Z. Cui, R. Xiao, A.H. Gandomi, M. Karamanoglu, Swarm Intelligence and Bio-Inspired Computation: Theory and Applications, Elsevier Science, The Netherlands, 2013 (ISBN: 9780124051638).
[2] A.H. Gandomi, X.S. Yang, S. Talatahari, A.H. Alavi, Metaheuristic algorithms in modeling and optimization, in: Metaheuristic Applications in Structures and Infrastructures, 2013, pp. 1–24.
[3] J. Carpentier, Optimal power flows, Int. J. Electr. Power Energy Syst. 1 (1) (1979) 3–15.
[4] J. Yuryevich, K.P. Wong, Evolutionary programming based optimal power flow algorithm, IEEE Trans. Power Syst. 14 (4) (1999) 1245–1250.
[5] A. Bhattacharya, P.K. Chattopadhyay, Application of biogeography-based optimization to solve different optimal power flow problems, IET Gener. Transm. Distrib. 5 (1) (2011) 70–80.
[6] N. Mo, Z.Y. Zou, K.W. Chan, T.Y.G. Pong, Transient stability constrained optimal power flow using particle swarm optimisation, IET Gener. Transm. Distrib. 1 (3) (2007) 476–483.
[7] H.R. Cai, C.Y. Chung, K.P. Wong, Application of differential evolution algorithm for transient stability constrained optimal power flow, IEEE Trans. Power Syst. 23 (2) (2008) 719–728.
[8] K. Tangpatiphan, A. Yokoyama, Evolutionary programming incorporating neural network for transient stability constrained optimal power flow, in: Proc. Joint International Conference on Power System Technology and IEEE Power India Conference, POWERCON, 2008, pp. 1–8.
[9] A.H. Gandomi, A.H. Alavi, Krill Herd: a new bio-inspired optimization algorithm, Commun. Nonlinear Sci. Numer. Simul. 17 (12) (2012) 4831–4845.
[10] A.H. Gandomi, Interior search algorithm (ISA): a novel approach for global optimization, ISA Trans. 53 (4) (2014) 1168–1183.
[11] S.S.S. Hosseini, A. Jafarnejad, A.H. Behrooz, A.H. Gandomi, Combined heat and power economic dispatch by mesh adaptive direct search algorithm, Expert Syst. Appl. 38 (6) (2011) 6556–6564.
[12] S.S.S. Hosseini, A.H. Gandomi, Short-term load forecasting of power systems by gene expression programming, Neural Comput. Appl. 21 (2) (2012) 377–389.
[13] S.S.S. Hosseini, A.H. Gandomi, A. Nemati, S.H.S. Hosseini, Reactive power and voltage control based on mesh adaptive direct search algorithm, in: N.D. Lagaros, M. Papadrakakis (Eds.), Engineering and Applied Sciences Optimization, Computational Methods in Applied Sciences, vol. 38, Springer, 2015, pp. 217–231.
[14] S. Mirjalili, A. Lewis, The whale optimization algorithm, Adv. Eng. Softw. 95 (2016) 51–67.
[15] L.-J. Yang, T.-L. Chen, Application of chaos in genetic algorithms, Commun. Theor. Phys. 38 (2) (2002) 168–172.
[16] B. Alatas, E. Akin, A. Bedri Ozer, Chaos embedded particle swarm optimization algorithms, Chaos Solitons Fractals 40 (4) (2009) 1715–1734.
[17] A.H. Gandomi, G.J. Yun, X.S. Yang, S. Talatahari, Chaos-enhanced accelerated particle swarm optimization, Commun. Nonlinear Sci. Numer. Simul. 18 (2) (2013) 327–340.
[18] G.G. Wang, S. Deb, A.H. Gandomi, Z. Zhang, A.H. Alavi, Chaotic cuckoo search, Soft Comput. 20 (9) (2016) 3349–3362.
[19] B. Alatas, Chaotic harmony search algorithms, Appl. Math. Comput. 216 (9) (2010) 2687–2699.
[20] W. Gong, S. Wang, Chaos ant colony optimization and application, in: Proc. Fourth International Conference on Internet Computing for Science and Engineering, ICICSE, 2009, pp. 301–303.
[21] A.H. Gandomi, X.-S. Yang, S. Talatahari, A.H. Alavi, Firefly algorithm with chaos, Commun. Nonlinear Sci. Numer. Simul. 18 (1) (2013) 89–98.
[22] G.G. Wang, S. Deb, A.H. Gandomi, Z. Zhang, A.H. Alavi, A novel cuckoo search with chaos theory and elitism scheme, in: Proc. International Conference on Soft Computing & Machine Intelligence, 2014, pp. 64–69.
[23] B. Wu, S. Fan, Improved artificial bee colony algorithm with chaos, Commun. Comput. Inform. Sci. Springer 158 (2011) 51–56.
[24] J. Mingjun, T. Huanwen, Application of chaos in simulated annealing, Chaos Solitons Fractals 21 (4) (2004) 933–941.
[25] A.H. Gandomi, X.-S. Yang, Chaotic bat algorithm, J. Comput. Sci. 5 (2) (2014) 224–232.
[26] G.-G. Wang, L. Guo, A.H. Gandomi, G.-S. Hao, H. Wang, Chaotic krill herd algorithm, Inf. Sci. 274 (2014) 17–34.
[27] A. Mukherjee, V. Mukherjee, Solution of optimal power flow using chaotic krill herd algorithm, Chaos Solitons Fractals 78 (2014) 10–21.
[28] S. Talatahari, B.F. Azar, R. Sheikholeslami, A.H. Gandomi, Imperialist competitive algorithm combined with chaos for global optimization, Commun. Nonlinear Sci. Numer. Simul. 17 (3) (2012) 1312–1319.
[29] P. Kundur, Power System Stability and Control, McGraw Hill Inc., 1994.
[30] Y.-Y. He, J.-Z. Zhou, X.-Q. Xiang, H. Chen, H. Qin, Comparison of different chaotic maps in particle swarm optimization algorithm for long-term cascaded hydroelectric system scheduling, Chaos Solitons Fractals 42 (5) (2009) 3169–3176.
[31] R.M. May, Simple mathematical models with very complicated dynamics, Nature 261 (1976) 459–467.
[32] X.F. Wang, Y. Song, M. Irving, Modern Power System Analysis, Springer, New York, 2008.
[33] P.W. Sauer, M.A. Pai, Power System Dynamics and Stability, Prentice-Hall, Englewood Cliffs, NJ, 1998.
[34] T. Nguyen, M.A. Pai, Dynamic security-constrained rescheduling of power systems using trajectory sensitivities, IEEE Trans. Power Syst. 18 (2) (2003) 848–854.
[35] K. Ayan, U. Kilic, Solution of transient stability-constrained optimal power flow using artificial bee colony algorithm, Turk. J. Electr. Eng. Comput. Sci. 21 (2013) 360–372.
[36] R. Zarate-Minano, T.V. Cutsem, F. Milano, A.J. Conejo, Securing transient stability using time-domain simulations within an optimal power flow, IEEE Trans. Power Syst. 25 (1) (2010) 243–253.
[37] H. Ahmadi, H. Ghasemi, A.M. Haddadi, H. Lesani, Two approaches to transient stability-constrained optimal power flow, Int. J. Electr. Power Energy Syst. 47 (2013) 181–192.
[38] S.W. Xia, B. Zhou, K.W. Chan, Z.Z. Guo, An improved GSO method for discontinuous non-convex transient stability constrained optimal power flow with complex system model, Int. J. Electr. Power Energy Syst. 64 (2015) 483–492.
[39] R.D. Zimmerman, C.E. Murillo-Sanchez, D. Gan, MATPOWER: a Matlab power system simulation package, available at: http://www.pserc.cornell.edu/matpower/, accessed on 13 September 2014.
[40] K. Ayan, U. Kilic, B. Barakli, Chaotic artificial bee colony algorithm based solution of security and transient stability constrained optimal power flow, Int. J. Electr. Power Energy Syst. 64 (2015) 136–147.
[41] A. Chatterjee, S.P. Ghoshal, V. Mukherjee, Transient performance improvement of grid connected hydro system using distributed generation and capacitive energy storage unit, Int. J. Electr. Power Energy Syst. 43 (1) (2012) 210–221.
CHAPTER 18
SLOPE STABILITY EVALUATION USING RADIAL BASIS FUNCTION NEURAL NETWORK, LEAST SQUARES SUPPORT VECTOR MACHINES, AND EXTREME LEARNING MACHINE
Nhat-Duc Hoang∗, Dieu Tien Bui†
∗ Faculty of Civil Engineering, Institute of Research and Development, Duy Tan University, Danang, Vietnam
† Geographic Information System Group, University College of Southeast Norway (USN), Bø i Telemark, Norway
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00018-1 Copyright © 2017 Elsevier Inc. All rights reserved.

18.1 RESEARCH BACKGROUND
Slope collapses are complex geotechnical phenomena that represent a serious natural hazard in many regions around the globe. These hazards cause hundreds of millions of US dollars in damage to public and private property, as well as human casualties, every year [1]. Population expansion and economic development in many countries around the world lead to the construction of road networks and residential areas in hilly or mountainous regions [2,3]. As a consequence, slope stability assessment becomes an urgent task, and tools for analyzing slopes are necessary to prevent and mitigate the damage caused by slope failure [4–7]. Slope analysis is very helpful because it can be used by various parties (e.g. government agencies, land-use planners, etc.) to identify collapse-prone areas. Based on such analyses, financial resources can be appropriately allocated to construct retaining structures or to establish evacuation plans effectively [2,8]. Currently, slope failure prediction models based on machine learning have been shown to be effective tools for assisting decision-making in hazard prevention planning, especially in the design and construction of highways, open pits, and earth dams [9]. The reason for this effectiveness is that machine learning techniques possess an intrinsic capability to mine valuable information hidden in records of real slope cases from the past. Owing to the complex and multi-factorial interactions between the factors that affect slope stability, slope assessment remains a significant challenge for civil engineers. This research carries out a comparative study of machine learning solutions for slope stability assessment based on three advanced artificial intelligence methods: Radial Basis Function Neural Network (RBFNN), Least Squares Support Vector Machines (LSSVM), and the Extreme Learning Machine (ELM). Furthermore, two data sets with actual slope collapse events have been collected for this study. The reasons for selecting these three approaches are as follows: RBFNN [10,11], LSSVM [12–15], and ELM [16–19] have been shown to be capable pattern classifiers; however, the performance of these three approaches in slope assessment has rarely been discussed in the literature. In addition, a recent comparative work [20] has shown that LSSVM achieves superior prediction accuracy in slope classification; therefore, a comparison among the three models can provide helpful information for readers, including both academic researchers and practicing engineers. The remaining part of this chapter is organized as follows. The second section reviews pertinent works in the literature. The research framework is described in the third section, followed by the experimental results. Conclusions of this chapter are stated in the final section.
18.2 LITERATURE REVIEW
Previous studies indicate that methods based on advanced machine learning frameworks help to improve the credibility of slope predictions [9,20,21]. Machine learning based models for slope evaluation are generally established by combining supervised learning techniques with historical cases of slope performance. Using such models, slope stability prediction can be formulated as a classification task in which the target outputs are either "collapse" or "non-collapse." Lu and Rosenbaum [22], Zhou and Chen [23], Jiang [24], Das et al. [25], and Wang et al. [1] utilized the Artificial Neural Network (ANN) to forecast slope stability. Zhao et al. [9] and Hoang and Tien-Bui [5] employed the Relevance Vector Machine (RVM) to explore the nonlinear relationship between slope stability and its influence factors. Prediction models of slope assessment employing the Support Vector Machine (SVM) were developed by Samui [26], Li and Wang [27], Li and Dong [28], Tien Bui et al. [29], and Cheng and Hoang [30]. The Evolutionary Polynomial Regression [31] and the Least Squares Support Vector Machine (LSSVM) [13] have been employed to model the mapping function between the input pattern and the factor of safety of slopes against failure. A probabilistic slope assessment model that utilizes the Gaussian Process was established by Ching et al. [32] for analyzing slopes along mountain roads. Hoang and Pham [20] constructed a slope classification model based on a hybridization of LSSVM and the Firefly Algorithm. Recently, Cheng and Hoang [2] have put forward a probabilistic slope assessment model based on a Bayesian framework. Yan and Li [33] established a method for predicting the stability of open pit slopes based on the Bayes Discriminant Analysis. A swarm-optimized fuzzy instance based classifier has been proposed for predicting slope collapses occurring along road sections in mountainous regions [34]. These previous studies indicate that machine learning can provide a competent tool for establishing a structured representation of the slope system, which allows accurate predictions of slope stability.
18.3 RESEARCH METHOD
18.3.1 MACHINE LEARNING APPROACHES
18.3.1.1 Radial Basis Function Neural Network
A Radial Basis Function Neural Network (RBFNN) [10,35] is a feedforward neural network that consists of one input layer, one hidden layer, and one output layer, each containing a certain number of neurons. The number of neurons in the input layer equals the number of input features $N$ (i.e., the dimension of the input data). The number of neurons in the hidden layer equals the number of centroids $M$, and each hidden neuron stores the location (i.e., the coordinates) of one centroid in the learning space. The learning phase of the RBFNN divides the data set into a certain number of groups; accordingly, a centroid is a representative of a group of data. The main purpose of the hidden layer is to map the input data from their original space onto the network space through a radial basis function $\varphi(\cdot)$:

$$z_j(x) = \varphi\left(\lVert x - c_j \rVert\right) \tag{18.1}$$

where $c_j$ denotes the coordinates of the $j$th centroid, $x$ represents the input data, and $\lVert x - c_j \rVert$ denotes the norm of the difference between the data point and the centroid (i.e., their distance). Since the task at hand is binary pattern recognition, the output of the neuron in the output layer is converted into binary values through the sigmoid function as follows:

$$f(u) = \begin{cases} 0 & \text{if } S(u) < t \\ 1 & \text{if } S(u) \ge t \end{cases} \tag{18.2}$$

where $S(\cdot)$ represents the sigmoid function and $t$ denotes a threshold value of 0.2 used to convert the real-valued input into binary outputs. The sigmoid function is written as follows:

$$S(u) = \frac{1}{1 + e^{-u}} \tag{18.3}$$

In the RBFNN, the weights between the hidden and output layers $w_j$, the centroid locations $c_j$, and the number of centroids $M$ are determined so that the prediction error of the model is minimized. Thus, a least squares objective function for this learning problem can be defined as follows:

$$E = \sum_{i=1}^{D} \left(T(i) - y(i)\right)^2 \tag{18.4}$$

where $T(i)$ denotes the desired (target) output, $D$ is the number of training data, and $y(i)$ denotes the network output. The network output is computed as the sum of products of the network's weights and the hidden-layer outputs for the input vector, and it can be expressed in the following form:

$$y = \sum_{j=1}^{M} w_j z_j(x) \tag{18.5}$$
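Equations (18.1) to (18.5) can be illustrated with a minimal NumPy sketch. It assumes a Gaussian form for $\varphi$ and takes the centroids as given, whereas the chapter selects them by Orthogonal Least Squares; the function names, the `width` parameter, and the toy data are hypothetical and not taken from the authors' Matlab code.

```python
import numpy as np

def rbf_hidden(X, centers, width=1.0):
    """Eq. (18.1): z_j(x) = phi(||x - c_j||), with a Gaussian phi (an assumption)."""
    # Pairwise Euclidean distances between samples (rows of X) and centroids.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dist ** 2) / (2.0 * width ** 2))

def sigmoid(u):
    """Eq. (18.3)."""
    return 1.0 / (1.0 + np.exp(-u))

def fit_output_weights(Z, targets):
    """Weights w_j minimizing the squared error of Eq. (18.4) for y = Z @ w (Eq. (18.5))."""
    w, *_ = np.linalg.lstsq(Z, targets, rcond=None)
    return w

def predict(X, centers, w, t, width=1.0):
    """Eq. (18.2): threshold the sigmoid of the network output at t."""
    y = rbf_hidden(X, centers, width) @ w
    return (sigmoid(y) >= t).astype(int)

# Toy data (hypothetical): two well-separated groups with targets -1 / +1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
targets = np.array([-1.0, -1.0, 1.0, 1.0])
centers = np.array([[0.0, 0.5], [4.0, 4.5]])   # assumed given, not OLS-selected

w = fit_output_weights(rbf_hidden(X, centers), targets)
labels = predict(X, centers, w, t=0.5)          # t = 0.5 is used here for illustration only
```

The threshold is left as an explicit argument because the chapter reports $t = 0.2$, while the toy call uses 0.5 so that network outputs near −1 and +1 fall on opposite sides of it.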
18.3.1.2 Least Squares Support Vector Machines
The Least Squares Support Vector Machine (LS-SVM) [36] is a supervised machine learning technique for solving classification problems that relies on the principles of statistical learning theory. Given a training data set $\{x_k, y_k\}_{k=1}^{N}$ with input data $x_k \in \mathbb{R}^n$, where $N$ is the number of training data points and $n$ is the number of dimensions of the input data, and with corresponding class labels $y_k \in \{-1, +1\}$, the LS-SVM for classification is formulated as follows:

$$\min \; J_p(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2 \tag{18.6}$$

$$\text{subject to } \; y_k \left( w^T \varphi(x_k) + b \right) = 1 - e_k, \quad k = 1, \ldots, N \tag{18.7}$$

where $w \in \mathbb{R}^n$ is the normal vector to the classification hyperplane, $b \in \mathbb{R}$ is the bias, $e_k \in \mathbb{R}$ are error variables, and $\gamma > 0$ denotes a regularization constant. The Lagrangian is given by:

$$L(w, b, e, \alpha) = J_p(w, e) - \sum_{k=1}^{N} \alpha_k \left\{ y_k \left[ w^T \varphi(x_k) + b \right] - 1 + e_k \right\} \tag{18.8}$$

where $\alpha_k$ are Lagrange multipliers and $\varphi(\cdot)$ denotes the nonlinear mapping of the input into the feature space associated with the kernel. Applying the KKT conditions of optimality and eliminating $e$ and $w$, the above optimization problem is equivalent to the following linear system:

$$\begin{bmatrix} 0 & y^T \\ y & \omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \tag{18.9}$$

in which $y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and $\omega$ is the $N \times N$ matrix with entries $\omega_{kl} = y_k y_l K(x_k, x_l)$, where $K$ represents a kernel function. The LS-SVM classification model is written in the following form:

$$y(x) = \operatorname{sign}\left( \sum_{k=1}^{N} \alpha_k y_k K(x_k, x) + b \right) \tag{18.10}$$

where $\alpha_k$ and $b$ denote the solution to the linear system (Eq. (18.9)). The most commonly used kernel function is the Radial Basis Function (RBF) kernel, which is described as follows:

$$K(x_k, x_l) = \exp\left( -\frac{\lVert x_k - x_l \rVert^2}{2\sigma^2} \right) \tag{18.11}$$

where $\sigma$ is the kernel function parameter.
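Training an LS-SVM classifier amounts to solving the single linear system of Eq. (18.9), which the following minimal NumPy sketch makes concrete. It is not the LS-SVMlab implementation used in the chapter; the function names and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Eq. (18.11): K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, gamma, sigma):
    """Assemble and solve the (N+1) x (N+1) system of Eq. (18.9)."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)   # omega_kl = y_k y_l K(x_k, x_l)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                             # bias b, multipliers alpha

def lssvm_predict(X_new, X, y, b, alpha, sigma):
    """Eq. (18.10): sign of the kernel expansion over the training points."""
    return np.sign(rbf_kernel(X_new, X, sigma) @ (alpha * y) + b)

# Toy data (hypothetical): two separated groups labelled -1 and +1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
b, alpha = lssvm_fit(X, y, gamma=10.0, sigma=1.0)
pred = lssvm_predict(X, X, y, b, alpha, sigma=1.0)
```

Because the equality constraints of Eq. (18.7) replace the inequality constraints of the standard SVM, no quadratic programming solver is needed; a dense linear solve is sufficient for data sets of the size considered here.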
18.3.1.3 Extreme Learning Machine
The Extreme Learning Machine (ELM) [37] is a novel method for pattern classification as well as function approximation. This method is essentially a single-hidden-layer feedforward neural network; its structure consists of a single layer of hidden nodes, where the weights between inputs and hidden nodes are randomly
assigned and remain constant during the training and prediction phases. In contrast, the weights that connect hidden nodes to outputs can be trained very fast. Experimental studies in the literature [16,37,38] showed that ELMs can produce acceptable predictive performance at a computational cost much lower than that of networks trained by the back-propagation algorithm. The task at hand is to construct a classification model from a data set $X = \{x_t \in \mathbb{R}^p\}$, $t = 1, \ldots, n$, with $n$ samples and $p$ input features. Given a network with $p$ input units, $q$ hidden neurons, and $c$ outputs, the output of the ELM model is written as follows [39]:

$$o_i(t) = m_i^T h(t) \tag{18.12}$$

where $m_i \in \mathbb{R}^q$, $i \in \{1, \ldots, c\}$, denotes the weight vector that connects the hidden neurons to the $i$th output neuron, and $h(t) \in \mathbb{R}^q$ represents the vector of outputs of the hidden neurons for a certain input pattern $x(t) \in \mathbb{R}^p$. Then $h(t)$ can be written in the following form:

$$h(t) = \left[ f\left(w_1^T x(t) + b_1\right), f\left(w_2^T x(t) + b_2\right), \ldots, f\left(w_q^T x(t) + b_q\right) \right]^T \tag{18.13}$$

where $b_k$ ($k = 1, 2, \ldots, q$) denotes the bias of the $k$th hidden neuron, $w_k \in \mathbb{R}^p$ represents the weight vector of the $k$th hidden neuron, and $f(\cdot)$ denotes a sigmoidal activation function. It is worth noting that the weight vectors $w_k$ as well as the biases $b_k$ are generated randomly from a Gaussian distribution. Given $w_k$ and $b_k$, the next step is to establish the matrix of hidden layer outputs $H$. Here $H$ is a $q \times n$ matrix whose $t$th column is the hidden layer output vector $h(t)$. Accordingly, the weight matrix $M = [m_1, m_2, \ldots, m_c]$ can be calculated via the Moore–Penrose pseudo-inverse method as follows:

$$M = \left( H H^T \right)^{-1} H D^T \tag{18.14}$$

where $D = [d(1), d(2), \ldots, d(n)]$ denotes a $c \times n$ matrix whose $t$th column is the actual target vector $d(t) \in \mathbb{R}^c$. With the network's parameters fully specified, the class label for a new input pattern is determined as follows:

$$Y = \arg\max_{i = 1, \ldots, c} \{ o_i \} \tag{18.15}$$
where Y denotes the predicted class label.
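The ELM training procedure of Eqs. (18.12) to (18.15) is short enough to sketch in NumPy. This is an illustrative sketch rather than the code of Huang [44]; the function names, the one-hot target encoding, and the toy data are assumptions. The output weights are computed with `numpy.linalg.pinv`, which coincides with Eq. (18.14) whenever $H H^T$ is invertible.

```python
import numpy as np

def hidden_output(X, W, b):
    """Eq. (18.13): q x n matrix H whose t-th column is h(t), sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(W @ X.T + b)))

def elm_fit(X, D, q, rng):
    """X: n x p inputs, D: c x n target matrix, q: number of hidden neurons."""
    p = X.shape[1]
    W = rng.normal(size=(q, p))       # random input weights, never trained
    b = rng.normal(size=(q, 1))       # random hidden biases, never trained
    H = hidden_output(X, W, b)
    M = np.linalg.pinv(H.T) @ D.T     # q x c output weights, cf. Eq. (18.14)
    return W, b, M

def elm_predict(X, W, b, M):
    O = M.T @ hidden_output(X, W, b)  # c x n outputs, Eq. (18.12)
    return np.argmax(O, axis=0)       # Eq. (18.15)

# Toy data (hypothetical): two groups with class indices 0 and 1.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
              [0.8, 0.9], [0.9, 0.8], [0.85, 0.75]])
labels = np.array([0, 0, 0, 1, 1, 1])
D = np.eye(2)[labels].T               # c x n one-hot targets (an assumed encoding)

W, b, M = elm_fit(X, D, q=3, rng=np.random.default_rng(0))
pred = elm_predict(X, W, b, M)
```

Only the output weights $M$ are fitted, and they are obtained in closed form, which is the reason ELM training is fast compared with back-propagation.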
18.3.2 HISTORICAL DATA SETS OF SLOPE ASSESSMENT
Two data sets that record slope performance, collected from the literature, are employed in this study to construct and verify the machine learning approaches. The first data set (Data Set 1) consists of 168 historical cases [20]; six influencing factors, namely unit weight (kN/m³), soil cohesion (kPa), internal friction angle (°), slope angle (°), slope height (m), and pore pressure ratio ($R_u$), are employed to characterize an earth slope. The second data set (Data Set 2) includes 109 historical cases [32] collected along Taiwan Provincial Highway No. 18; ten input factors (slope direction, slope angle, slope height, road curvature, strata type, thickness of canopy cover, catchment area, height of toe cutting, change of slope grade, and peak ground acceleration) are employed to predict slope performance.
Table 18.1 Slope Influencing Factors and Their Statistical Descriptions of Data Set 1

Factors  Notation  Definition                   Max      Average  Std.     Min
X1       γ         Unit weight (kN/m³)          31.30    21.76    4.13     12.00
X2       C         Soil cohesion (kPa)          300.00   34.12    45.82    0.00
X3       ϕ         Internal friction angle (°)  45.00    28.72    10.58    0.00
X4       β         Slope angle (°)              59.00    36.10    10.22    16.00
X5       H         Slope height (m)             511.00   104.19   132.68   3.60
X6       Ru        Pore pressure ratio          45.00    0.48     3.45     0.00
Table 18.2 Slope Influencing Factors and Their Statistical Descriptions of Data Set 2

Factors  Definition                       Max        Average   Std.      Min
X1       Slope direction (°)              345.00     171.33    93.95     0.00
X2       Slope angle (°)                  90.00      62.20     11.74     30.00
X3       Slope height (m)                 100.00     22.39     14.86     5.00
X4       Road curvature (1/m)             0.03       0.00      0.02      −0.05
X5       Strata type                      5.00       4.40      1.06      1.00
X6       Thickness of canopy cover (m)    4.50       2.07      1.05      0.50
X7       Catchment area (m²)              255743.00  19706.81  39956.69  406.00
X8       Height of toe cutting (m)        50.00      7.00      6.52      2.00
X9       Change of slope grade (°)        35.00      9.08      10.36     0.00
X10      Peak ground acceleration (gal)   391.90     251.00    115.71    0.00
Table 18.1 provides the influencing factors of Data Set 1 and their statistical descriptions; Table 18.2 provides the same information for Data Set 2. The data sets are provided in Appendices 1 and 2, in which an output of −1 indicates a non-collapsed slope and an output of +1 denotes a collapsed slope. For a more detailed explanation of the influencing factors, readers are referred to the previous works of Hoang and Pham [20] for Data Set 1 and Ching et al. [32] for Data Set 2. Furthermore, scatter plots of all input variables of the two data sets, with class labels distinguished, are shown in Fig. 18.1 and Fig. 18.2. A preliminary observation from these figures is that the two classes overlap to a high degree within each input feature of both data sets.
18.4 EXPERIMENTAL SETTING
As mentioned earlier, this study employs two data sets of slope performance to construct and verify the machine learning models. Data Set 1 and Data Set 2 contain 168 and 109 records, respectively. The numbers of collapsed and stable slopes are 84 and 84 for Data Set 1 and 55 and 54 for Data Set 2. Herein, 90% of each data set is used to train the machine learning models, and the remaining 10% is reserved for the testing phase. Additionally, to alleviate the bias in data selection, a ten-fold cross-validation process is employed. Three artificial intelligence methods are utilized to construct slope assessment models in this study: RBFNN, LSSVM, and ELM.
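The ten-fold cross-validation described above can be sketched as a generic index splitter. The chapter does not state how the folds were drawn (for example, whether they were stratified by class), so the random, unstratified partition below, the function name, and the seed are assumptions made purely for illustration.

```python
import numpy as np

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then cut the permutation into n_folds nearly equal parts.
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

# Data Set 1 has 168 records: each fold holds out 16 or 17 of them.
for train_idx, test_idx in ten_fold_indices(168):
    pass  # fit a model on train_idx and evaluate it on test_idx here
```

Reported means and standard deviations are then taken over the ten held-out folds.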
FIGURE 18.1 Data distribution of Data Set 1.
FIGURE 18.2 Data distribution of Data Set 2.
For the RBFNN, the parameters $M$ and $c_j$ are determined by the method of Orthogonal Least Squares (OLS) [40,41]. In OLS, each data point is initially treated as a candidate centroid location ($c_j$); the network error is computed with each training data point in turn acting as the new centroid of the corresponding cluster, and the data point that reduces the RBFNN error the most is selected as the new centroid [10]. This assessment procedure is repeated until the network error reaches an acceptable value. In our study, the threshold value of the RBFNN training phase is set to 90%; that is, the training process of the RBFNN terminates when the classification accuracy rate reaches 0.9. Our observation is that a lower threshold results in under-trained models, whereas a higher threshold causes overfitting. In this study, the RBFNN model is coded in Matlab by the authors.
To construct an LSSVM model, it is necessary to specify the regularization constant ($\gamma$) and the kernel function parameter ($\sigma$). Previous works [20,42] point out that these two parameters affect the learning performance of LSSVM considerably. Therefore, in this study, a grid search procedure is used to set their values. Each of the two parameters is allowed to take values from the following set: [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]. The training set is further divided into two sets: Set 1 (90%), used for model construction, and Set 2 (10%), used for computing the fitness of each pair of hyper-parameters. The LSSVM model is implemented via the LS-SVMlab Toolbox [43], and the grid search procedure is coded by the authors.
In the case of ELM, the sigmoid function is often employed as the activation function. The only hyper-parameter to be set for this model is the number of neurons in the hidden layer.
Herein, the number of neurons is allowed to vary from the number of input features to 100. For the purpose of model selection, the training data are also split into two sets (Set 1 and Set 2) in the same manner as in the parameter setting of LSSVM. The number of neurons that maximizes ELM performance is selected for the testing phase. The ELM model is implemented in the Matlab environment with the program codes provided by Huang [44].
Furthermore, besides the classification accuracy rate (CAR), the following four metrics can be used to measure the classification performance [15]: true positive rate TPR (the percentage of positive instances correctly classified), true negative rate TNR (the percentage of negative instances correctly classified), false positive rate FPR (the percentage of negative instances misclassified), and false negative rate FNR (the percentage of positive instances misclassified). These four metrics are computed as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{18.16}$$

$$\mathrm{TNR} = \frac{TN}{TN + FP} \tag{18.17}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{18.18}$$

$$\mathrm{FNR} = \frac{FN}{TP + FN} \tag{18.19}$$
where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative, respectively.
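As a worked example of Eqs. (18.16) to (18.19), the short Python function below computes the four rates from confusion counts. The CAR expression (correct predictions divided by all predictions) is the usual definition and is assumed here, since the chapter does not write it out; the function name is hypothetical. The counts in the example are the mean test-fold counts reported for LSSVM on Data Set 2 in Table 18.3.

```python
def classification_rates(tp, tn, fp, fn):
    """Return CAR, TPR, TNR, FPR, and FNR as fractions in [0, 1]."""
    return {
        "CAR": (tp + tn) / (tp + tn + fp + fn),  # assumed definition of accuracy
        "TPR": tp / (tp + fn),                   # Eq. (18.16)
        "TNR": tn / (tn + fp),                   # Eq. (18.17)
        "FPR": fp / (fp + tn),                   # Eq. (18.18)
        "FNR": fn / (tp + fn),                   # Eq. (18.19)
    }

# Mean test counts for LSSVM on Data Set 2 (Table 18.3).
rates = classification_rates(tp=6.0, tn=5.4, fp=0.9, fn=0.4)
print({name: round(value, 4) for name, value in rates.items()})
```

By construction TPR + FNR = 1 and TNR + FPR = 1, so two of the four rates carry all the information. Rates computed from mean counts, as done here, need not coincide exactly with the mean of the per-fold rates.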
Table 18.3 Experimental Results

                       Data Set 1                            Data Set 2
Model  Phase  Stat.    CAR (%)  TP     FP    FN     TN       CAR (%)  TP     FP    FN    TN
RBFNN  Train  Mean     85.19    62.00  9.40  12.70  65.30    99.69    48.30  0.00  0.30  47.70
              Std.     5.91     8.82   3.34  8.47   3.06     0.50     0.67   0.00  0.48  0.48
       Test   Mean     81.00    7.00   1.30  2.30   8.00     85.08    5.40   0.90  1.00  5.40
              Std.     10.96    1.76   1.06  2.11   1.15     7.94     0.70   0.88  0.67  0.84
LSSVM  Train  Mean     94.52    70.50  4.00  4.20   70.70    95.23    45.90  1.90  2.70  45.80
              Std.     3.50     3.87   2.11  4.02   1.95     3.98     2.77   2.47  2.75  2.57
       Test   Mean     87.06    8.30   1.40  1.00   7.90     89.95    6.00   0.90  0.40  5.40
              Std.     5.27     1.42   0.84  1.15   0.74     8.13     0.47   1.10  0.52  0.97
ELM    Train  Mean     88.22    66.00  8.90  8.70   65.80    93.37    43.10  0.90  5.50  46.80
              Std.     3.01     2.91   2.56  3.02   2.74     4.00     3.07   0.99  3.34  0.79
       Test   Mean     84.00    7.60   1.30  1.70   8.00     90.49    5.50   0.30  0.90  6.00
              Std.     7.50     1.35   0.95  1.57   1.15     5.08     0.85   0.48  0.57  0.47
18.5 EXPERIMENTAL RESULTS
Experimental results obtained from the ten-fold cross-validation procedure of the three machine learning models with the two data sets are reported in Table 18.3. To evaluate the predictive capability of each model, it is reasonable to focus on the testing outcomes, which are reported in the Test rows of Table 18.3. The figures presented in this table are the average results obtained from the cross-validation process. Considering Data Set 1, LSSVM is the best method (CAR = 87.06%), followed by ELM (CAR = 84%) and RBFNN (CAR = 81%). In the case of Data Set 2, ELM (CAR = 90.49%) and LSSVM (CAR = 89.95%) are both superior to RBFNN (CAR = 85.08%); ELM is slightly better than LSSVM. In addition, the average true positive, true negative, false positive, and false negative rates of all prediction models (RBFNN, LSSVM, and ELM) for the two data sets are shown in Figs. 18.3 and 18.4, respectively. When predicting Data Set 1, LSSVM achieves the highest TPR and TNR (both rates are 0.89); ELM attains the second best TPR and TNR (both rates are 0.82); RBFNN has the worst rates, with TPR = 0.75 and TNR = 0.78. In the case of Data Set 2, the TPR (0.94) and TNR (0.93) obtained from LSSVM are also the most desirable; with TPR = 0.86 and TNR = 0.87, ELM is ranked after LSSVM, followed by RBFNN (TPR and TNR are both 0.84). Moreover, in slope assessment, false negative cases (collapsed slopes wrongly classified as safe slopes) are considered to be more dangerous. From this point of view, LSSVM is deemed more desirable since its FNR is the lowest on both data sets (0.11 for Data Set 1 and 0.06 for Data Set 2). Meanwhile, the FNRs of ELM are higher than those of LSSVM: 0.18 for Data Set 1 and 0.14 for Data Set 2.
FIGURE 18.3 True positive, true negative, false positive, false negative rates for Data Set 1.
FIGURE 18.4 True positive, true negative, false positive, false negative rates for Data Set 2.
18.6 CONCLUSION
This chapter has investigated the capabilities of RBFNN, LSSVM, and ELM in slope stability assessment with two historical data sets. To accurately evaluate each model’s performance, a ten-fold cross-validation process has been carried out. Results obtained from experiments demonstrate that LSSVM and ELM are superior methods for tackling the problem at hand. The performance of RBFNN is significantly worse than that of the other two models. Considering the CAR, LSSVM is the best
method for Data Set 1, and ELM is the most desirable method for Data Set 2. Nevertheless, when FNR is taken into account, LSSVM is deemed the most suitable for slope evaluation since it yields the lowest FNR on both data sets. Overall, LSSVM and ELM are highly recommended as intelligent tools to assist the decision-making process in slope assessment. Further directions of the current study may include: (1) investigating other advanced machine learning approaches in slope evaluation, (2) combining feature selection techniques with LSSVM and ELM, and (3) enhancing prediction accuracy by ensemble and boosting methods.
REFERENCES
[1] H.B. Wang, W.Y. Xu, R.C. Xu, Slope stability evaluation using Back Propagation Neural Networks, Eng. Geol. 80 (2005) 302–315.
[2] M.-Y. Cheng, N.-D. Hoang, Slope collapse prediction using Bayesian framework with K-Nearest Neighbor density estimation: case study in Taiwan, J. Comput. Civ. Eng. 30 (2016) 04014116.
[3] H.-M. Lin, S.-K. Chang, J.-H. Wu, C.H. Juang, Neural network-based model for assessing failure potential of highway slopes in the Alishan, Taiwan Area: pre- and post-earthquake investigation, Eng. Geol. 104 (2009) 280–289.
[4] A. Manouchehrian, J. Gholamnejad, M. Sharifzadeh, Development of a model for analysis of slope stability for circular mode failure using genetic algorithm, Environ. Earth Sci. 71 (2014) 1267–1277.
[5] N.-D. Hoang, D. Tien-Bui, A novel relevance vector machine classifier with cuckoo search optimization for spatial prediction of landslides, J. Comput. Civ. Eng. 30 (2016) 04016001.
[6] C.-I. Wu, H.-Y. Kung, C.-H. Chen, L.-C. Kuo, An intelligent slope disaster prediction and monitoring system based on WSN and ANP, Expert Syst. Appl. 41 (2014) 4554–4562.
[7] E. Salmi, S. Hosseinzadeh, Slope stability assessment using both empirical and numerical methods: a case study, Bull. Eng. Geol. Environ. 74 (2015) 13–25.
[8] P. Luciano, L. Serge, Assessment of slope stability, in: Geotechnical Engineering State of the Art and Practice, 2012, pp. 122–156.
[9] H. Zhao, S. Yin, Z. Ru, Relevance vector machine applied to slope stability analysis, Int. J. Numer. Anal. Meth. Geomech. 36 (2012) 643–652.
[10] K.-W. Liao, J.-C. Fan, C.-L. Huang, An artificial neural network for groutability prediction of permeation grouting with microfine cement grouts, Comput. Geotech. 38 (2011) 978–986.
[11] D. Tien Bui, D.T. Quach, V.H. Pham, I. Revhaug, V.L. Ngo, T.H. Tran, et al., Spatial prediction of landslide hazard along the National Road 32 of Vietnam: a comparison between Support Vector Machines, Radial Basis Function neural networks, and their ensemble, in: Proceedings of the Thematic Session, 49th CCOP Annual Session, 22–23 October 2013, Sendai, Japan, 2013.
[12] P. Samui, J. Karthikeyan, Determination of liquefaction susceptibility of soil: a least square support vector machine approach, Int. J. Numer. Anal. Meth. Geomech. 37 (2013) 1154–1161.
[13] P. Samui, D.P. Kothari, Utilization of a least square support vector machine (LSSVM) for slope stability analysis, Sci. Iran. 18 (2011) 53–58.
[14] M.-Y. Cheng, N.-D. Hoang, Groutability prediction of microfine cement based soil improvement using evolutionary LSSVM inference model, J. Civ. Eng. Manag. 20 (2014) 1–10.
[15] N.-D. Hoang, D. Tien Bui, Predicting earthquake-induced soil liquefaction based on a hybridization of kernel Fisher discriminant analysis and a least squares support vector machine: a multi-dataset study, Bull. Eng. Geol. Environ. (2016) 1–14.
[16] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: a review, Neural Netw. 61 (2015) 32–48.
[17] D. Avci, A. Doğantekin, An expert diagnosis system for Parkinson disease based on genetic wavelet kernel extreme learning machine, Parkinson’s Dis. 2016 (2016) 5264743.
[18] P. Samui, J. Jagan, R. Hariharan, An alternative method for determination of liquefaction susceptibility of soil, Geotech. Geol. Eng. 34 (2016) 735–738.
[19] O. Anicic, S. Jović, H. Skrijelj, B. Nedić, Prediction of laser cutting heat affected zone by extreme learning machine, Opt. Laser Eng. 88 (2017) 1–4.
[20] N.-D. Hoang, A.-D. Pham, Hybrid artificial intelligence approach based on metaheuristic and machine learning for slope stability assessment: a multinational data analysis, Expert Syst. Appl. 46 (2016) 60–68.
[21] F. Kang, J. Li, Artificial bee colony algorithm optimized support vector regression for system reliability analysis of slopes, J. Comput. Civ. Eng. 30 (2015) 04015040.
[22] P. Lu, M.S. Rosenbaum, Artificial neural networks and grey systems for the prediction of slope stability, Nat. Hazards 30 (2003) 383–398.
[23] K.-p. Zhou, Z.-Q. Chen, Stability prediction of tailing dam slope based on neural network pattern recognition, in: Proc. of the Second International Conference on Environmental and Computer Science, ICECS ’09, 28–30 Dec. 2009, Dubai, the United Arab Emirates, 2009, pp. 380–383.
[24] J.-P. Jiang, BP neural networks for Prediction of the factor of safety of slope stability, in: Proc. of the International Conference on Computing, Control and Industrial Engineering, CCIE, 20–21 Aug. 2011, Wuhan, China, 2011.
[25] S.K. Das, R.i. Biswal, N. Sivakugan, B. Das, Classification of slopes and prediction of factor of safety using differential evolution neural networks, Environ. Earth Sci. 64 (2011) 201–210.
[26] P. Samui, Slope stability analysis: a support vector machine approach, Environ. Geol. 56 (2008) 255–267.
[27] J. Li, F. Wang, Study on the forecasting models of slope stability under data mining, in: Proc. of the Earth and Space 2012: Engineering, Science, Construction, and Operations in Challenging Environments, Honolulu, Hawaii, United States, ASCE, 2010, pp. 765–776.
[28] J. Li, M. Dong, Method to predict slope safety factor using SVM, in: Proc. of the Earth and Space 2012: Engineering, Science, Construction, and Operations in Challenging Environments, Pasadena, California, United States, ASCE, 2012, pp. 888–899.
[29] D. Tien Bui, B. Pradhan, O. Lofman, I. Revhaug, Landslide susceptibility assessment in Vietnam using support vector machines, decision tree, and naive Bayes models, Math. Probl. Eng. 2012 (2012) 26.
[30] M.-Y. Cheng, N.-D. Hoang, Typhoon-induced slope collapse assessment using a novel bee colony optimized support vector classifier, Nat. Hazards 78 (2015) 1961–1978.
[31] A. Ahangar-Asr, A. Faramarzi, A.A. Javadi, A new approach for prediction of the stability of soil and rock slopes, Eng. Comput. 27 (2010) 878–893.
[32] J. Ching, H.-J. Liao, J.-Y. Lee, Predicting rainfall-induced landslide potential along a mountain road in Taiwan, Geotechnique 61 (2011) 153–166.
[33] X. Yan, X. Li, Bayes discriminant analysis method for predicting the stability of open pit slope, in: Proc. of the International Conference on Electric Technology and Civil Engineering, ICETCE, 22–24 April 2011, Lushan, China, 2011, pp. 147–150.
[34] M.-Y. Cheng, N.-D. Hoang, A Swarm-Optimized Fuzzy Instance-based Learning approach for predicting slope collapses in mountain roads, Knowl.-Based Syst. 76 (2015) 256–263.
[35] S. Chen, C.F.N. Cowan, P.M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Trans. Neural Netw. 2 (1991) 302–309.
[36] J. Suykens, J.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least Square Support Vector Machines, World Scientific Publishing Co. Pte. Ltd., Singapore, 2002.
[37] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[38] G. Li, P. Niu, Y. Ma, H. Wang, W. Zhang, Tuning extreme learning machine by an improved artificial bee colony to model and optimize the boiler efficiency, Knowl.-Based Syst. 67 (2014) 278–289.
[39] A.S.C. Alencar, A.R. Rocha Neto, J.P.P. Gomes, A new pruning method for extreme learning machines via genetic algorithms, Appl. Soft Comput. 44 (2016) 101–107.
[40] F.M. Ham, I. Kostanic, Principles of Neurocomputing for Science and Engineering, McGraw-Hill, New York, United States, 2001.
[41] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, 2001.
[42] G.S. Dos Santos, L.G.J. Luvizotto, V.C. Mariani, L. dos Santos Coelho, Least squares support vector machines with tuning based on chaotic differential evolution approach applied to the identification of a thermal process, Expert Syst. Appl. 39 (2012) 4805–4812.
[43] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, et al., LS-SVMlab Toolbox User’s Guide Version 1.8, Internal Report 10-146, ESAT-SISTA, K.U. Leuven, Leuven, Belgium, 2010.
[44] G.-B. Huang, Basic ELM algorithms, http://www.ntu.edu.sg/home/egbhuang/elm_codes.html, 2016.
CHAPTER 19
ALTERNATING DECISION TREES
Melanie Po-Leen Ooi∗,†, Hong Kuan Sok∗, Ye Chow Kuang∗, Serge Demidenko‡
∗ Advanced Engineering Platform and Electrical and Computer Systems Engineering, School of Engineering, Monash University, Bandar Sunway, Malaysia
† School of Engineering and Physical Sciences, Heriot-Watt University, Putrajaya, Malaysia
‡ School of Engineering and Advanced Technology, Massey University, Auckland, New Zealand
NOMENCLATURE
x       Input sample x which consists of p features. It is a column vector such that x = [x1, ..., xp]^T.
y       Class label of input sample x such that it assumes one of the labels {1, ..., K}.
n       Total number of labeled training samples.
p       Total number of features in input sample x or length of column vector x.
X       The design matrix of all n input samples such that X = (x^(1), ..., x^(n))^T.
y       Class label vector of length n such that y = [y^(1), ..., y^(n)]^T.
β       Discriminative vector such that β = [β1, ..., βp]^T.
ϑ       A threshold splitting value.
θ       An optimal score vector such that θ = [θ1, ..., θK]^T.
λ1      Regularization parameter for Lasso penalty.
λ2      Regularization parameter for Ridge penalty.
ε       Error of the weak classifier.
(edge)  Edge of the weak classifier, i.e. the amount by which the error ε falls below 0.5.
f(x)    Weak classifier (or base learner) that returns a prediction given the input sample x.
g(x)    Regression function that returns a regression value given the input sample x.
r(x)    Decision rule of ADTree.
w       Weight distribution of the training data set such that w = [w1, ..., wn]^T.
W       A diagonal matrix such that each diagonal entry i encodes the weight of the ith training sample, i.e. wi.
T       Total number of boosting procedures.
z       Working response of LogitBoost such that z = [z1, ..., zn]^T.
c       Base condition.
C       Base condition set.
P       Precondition set.
¬       Logical NOT operation.
∧       Logical AND operation.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00019-3 Copyright © 2017 Elsevier Inc. All rights reserved.
19.1 INTRODUCTION
Decision trees are one of the most popular classifiers used in real-world applications due to their superior knowledge representation, robustness, ability to handle missing values, nonparametric induction, and ease of interpretation. Learning algorithms are used to automatically induce a decision tree for specific domain problems. Under the typical supervised learning framework, a number of correctly labeled samples are used as a training data set for learning purposes. A common approach to growing a decision tree is to recursively divide and conquer the data samples by splitting them according to their outputs. The decision tree model can be interpreted as a set of decision rules, which can lead to knowledge discovery. There are numerous decision tree learning algorithms available in the literature, such as ID3 [38], C4.5 [39], CART [4], and their newer variants [1,41].
Graphically (Fig. 19.1A), a decision tree is a directed acyclic model consisting of nodes and edges. It begins with a root node with no incoming edges. Internal nodes (or decision nodes) are the nodes with outgoing edges. Terminal nodes (leaf nodes) are those without outgoing edges. A decision node contains a decision function f(x) that sorts an input sample x down one of the outgoing edges. The leaf nodes contain the final output. To predict the output for an unlabeled sample, prediction starts at the root node and the sample is sorted down the tree until it reaches a leaf node, which holds the decision.
FIGURE 19.1 Graphical representation of: (A) classical decision tree consisting of decision nodes as internal nodes and leaf nodes as terminal nodes; (B) decision tree ensemble consisting of a large network of decision trees.
In order to further improve the classification performance of decision trees, meta-learning algorithms can be applied to create an ensemble of decision trees. Bagging [2] and boosting [20] are the main ensemble techniques that are combined with decision tree algorithms to produce a forest of decision trees. The bagging, or bootstrap aggregating, algorithm [2] generates a prediction model by aggregating several prediction models built on different bootstrap examples. A bootstrap example is obtained by sampling the original training set with replacement. The idea is to reduce the variance by averaging the prediction models created. Random Forest [3] is a learning algorithm that combines bagging with random selection of features to produce the ensemble prediction model. Boosting algorithms reweight the training set distribution iteratively after a weak prediction model is obtained.
It places more emphasis on misclassified training samples with less emphasis on the correctly classified ones. A new weak prediction model is obtained based on the reweighted training samples from a previous iteration. Boosted Decision Tree [26] and Boosted Ant Colony Decision Forest [30] are two examples of the boosted decision trees. An ensemble of decision trees (Fig. 19.1B) often results in a large, complex, and difficult to interpret classifier. This reduces the interpretability of the decision tree, whereby its original structure (Fig. 19.1A) represents the knowledge in a flowchart-like manner that follows human logic and comprehension [19].
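The bagging step described above can be sketched in a few lines of NumPy: a bootstrap example is drawn with replacement, and the member models' predictions are aggregated, here by a majority vote over ±1 labels. The function names are hypothetical, and the base learner that would be trained on each bootstrap example is omitted.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw n samples with replacement from a training set of size n."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

def majority_vote(predictions):
    """Aggregate an (n_models, n_samples) array of +1/-1 votes; ties go to +1."""
    return np.where(predictions.sum(axis=0) >= 0, 1, -1)

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)
y = np.array([1, -1, 1, -1, 1])
X_boot, y_boot = bootstrap_sample(X, y, rng)   # one bootstrap example

# Votes of three hypothetical member models on three samples.
votes = np.array([[1, -1, 1],
                  [1, 1, -1],
                  [-1, -1, 1]])
combined = majority_vote(votes)
```

Because some samples are drawn repeatedly and others not at all, each member model sees a slightly different training set, which is the source of the variance reduction when their predictions are averaged.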
FIGURE 19.2 Graphical representation of: (A) alternating decision tree that can be used to represent the standard one shown in Fig. 19.1A to make the same prediction; (B) boosting in the ADTree by adding additional decision stumps to any existing prediction nodes (represented by the ellipse) to obtain majority-voted decisions.
The large size and incomprehensibility of the boosted decision trees led to the invention of the Alternating Decision Tree (ADTree) [19]. It uniquely performs boosting within a single decision tree to facilitate the comprehensibility (see Fig. 19.2A). Decision stumps are one-level decision trees with one decision node. The ADTree supports multiple decision stumps (consisting of a decision node and two prediction nodes) under the same prediction node to perform majority-voted decisions (Fig. 19.2B) whereby all dotted edges of the arrived prediction nodes must be taken. Voted Decision Trees are equivalent to an ensemble of decision trees which improves the overall prediction accuracy. In this manner, the ADTree is a generalization of the decision trees, voted decision trees, and voted decision stumps.
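The way an ADTree is evaluated can be sketched with two small Python classes. The scoring rule used below (sum the values of every prediction node reached and classify by the sign of the sum) follows the standard ADTree formulation and is an assumption here, since this passage describes only the structure; the class names, conditions, and prediction values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PredictionNode:
    value: float
    # Any number of decision stumps may hang under one prediction node.
    splitters: List["DecisionNode"] = field(default_factory=list)

@dataclass
class DecisionNode:
    condition: Callable[[dict], bool]
    if_true: PredictionNode
    if_false: PredictionNode

def adtree_score(node: PredictionNode, x: dict) -> float:
    """Sum the prediction values along every path the sample x can take."""
    total = node.value
    for splitter in node.splitters:   # all attached stumps are followed
        child = splitter.if_true if splitter.condition(x) else splitter.if_false
        total += adtree_score(child, x)
    return total

# A root prediction node with two decision stumps attached to it.
root = PredictionNode(0.5, [
    DecisionNode(lambda x: x["a"] < 4.5, PredictionNode(-0.7), PredictionNode(0.2)),
    DecisionNode(lambda x: x["b"] > 1.0, PredictionNode(0.4), PredictionNode(-0.3)),
])

score = adtree_score(root, {"a": 3.0, "b": 2.0})   # 0.5 - 0.7 + 0.4
label = 1 if score >= 0 else -1
```

Unlike a classical decision tree, a sample does not follow a single path: every stump under a reached prediction node contributes, which is how boosting is accommodated inside one tree.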
19.2 BOOSTING: AN OVERVIEW

Two different boosting algorithms are used to induce the ADTree variants described in Section 19.3: AdaBoost (Algorithm 19.1) and LogitBoost (Algorithm 19.2). A training data set $D$ contains $n$ labeled training samples, where each sample is a column vector $x$ of $p$ features. AdaBoost maintains a weight distribution $w = [w_1, \ldots, w_n]^T$ over the training data set. This distribution is initialized to be uniform, with each weight $w_i = 1/n$. At the $t$th boosting iteration, a weak learner is called to induce a weak classifier $f_t(x)$, together with its coefficient $\alpha_t$, from the weighted training data set. The current weight distribution $w_{i,t}$ is then updated to a new weight distribution $w_{i,t+1}$ for the next boosting iteration. The indicator function $I(\cdot)$ returns 1 if the Boolean expression inside it evaluates to true and 0 otherwise. A linear combination of weak classifiers $F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$ is obtained after $T$ boosting iterations.

LogitBoost (Algorithm 19.2) initializes a uniform weight distribution in the same fashion as AdaBoost. In addition, it keeps track of the estimated probability of the positive class $p(x^{(i)})$ and the regression value $G(x^{(i)})$ for each sample at every boosting iteration. The working response (or pseudo-label) $z$ and the weight distribution $w$ are updated at the start of every boosting iteration. A regression function $g_t(x)$ is then fitted by solving a weighted least-squares regression problem. At the end of every boosting iteration, the regression values of all training samples are updated and used to calculate the new probability estimate of each sample. The output is a regression function $G(x)$, and classification is achieved by taking the sign of the summed regression value.

Algorithm 19.1 AdaBoost algorithm
Input: training dataset $D$
1. Initialize: $w_i = 1/n \;\forall i$
2. For $t = 1, \ldots, T$
   a. Obtain $f_t(x)$ from the weak learner
   b. Determine $\alpha_t = \frac{1}{2}\log\left(\frac{1-\varepsilon}{\varepsilon}\right)$ with $\varepsilon = \sum_{i=1}^{n} w_{i,t}\, I\left(f_t(x^{(i)}) \neq y^{(i)}\right)$
   c. Update the weight distribution $w_{i,t+1} = w_{i,t}\exp\left(-y^{(i)}\alpha_t f_t(x^{(i)})\right)$
Output: $F(x) = \sum_{t=1}^{T}\alpha_t f_t(x)$

Algorithm 19.2 LogitBoost algorithm
Input: training dataset
1. Initialize: $y^{(i)*} = (y^{(i)} + 1)/2$, $w_i = 1/n$, $G(x^{(i)}) = 0$, and $p(x^{(i)}) = 0.5$
2. For $t = 1, \ldots, T$
   2.1. Compute the working response and weights
        $z_i = \dfrac{y^{(i)*} - p(x^{(i)})}{p(x^{(i)})\,(1 - p(x^{(i)}))}, \qquad w_i = p(x^{(i)})\,(1 - p(x^{(i)}))$
   2.2. Fit $g_t(x)$ by a weighted least-squares regression of $z_i$ to $x^{(i)}$ using weights $w_i$
   2.3. Update $G(x^{(i)}) \leftarrow G(x^{(i)}) + \frac{1}{2} g_t(x^{(i)})$ and $p(x^{(i)}) = \dfrac{\exp(G(x^{(i)}))}{\exp(G(x^{(i)})) + \exp(-G(x^{(i)}))}$
Output: $G(x) = \sum_{t=1}^{T} g_t(x)$
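The AdaBoost loop of Algorithm 19.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's implementation: the exhaustive decision-stump weak learner (`fit_stump`), the clipping of the error to avoid a logarithm of zero, and the renormalization of the weights after each update are all assumptions introduced for the example.

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict +polarity where feature j exceeds thr, -polarity elsewhere."""
    return np.where(X[:, j] > thr, polarity, -polarity)

def fit_stump(X, y, w):
    """Hypothetical weak learner: exhaustive search for the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                err = np.sum(w[stump_predict(X, j, thr, polarity) != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost(X, y, T=5):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: uniform weights
    model = []
    for _ in range(T):                           # step 2
        err, j, thr, pol = fit_stump(X, y, w)    # step 2a
        err = np.clip(err, 1e-10, 1 - 1e-10)     # assumption: guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)    # step 2b
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-y * alpha * pred)        # step 2c
        w = w / w.sum()                          # assumption: keep w a distribution
        model.append((alpha, j, thr, pol))
    return model

def predict(model, X):
    """Sign of the linear combination F(x) of the weak classifiers."""
    F = sum(alpha * stump_predict(X, j, thr, pol) for alpha, j, thr, pol in model)
    return np.sign(F)

# Toy data: one feature, separable at x > 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([-1, -1, 1, 1])
model = adaboost(X_toy, y_toy, T=3)
```

On this toy set a single stump already separates the classes, so every boosting round selects the same stump with a large positive coefficient.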
Bridging the Gap Between the Boosting and Decision Trees

Different variants of the ADTree model can be grown using different boosting algorithms, each with its own strengths and drawbacks. For example, the original univariate ADTree was grown using AdaBoost [19], while AdaBoost.MH has been used to induce a univariate multi-label ADTree model [13]. Section 19.3 presents four variants of the ADTree model that cover univariate and multivariate problems, and Section 19.4 discusses their characteristics. The ADTree model returns comprehensibility to the boosting algorithm by arranging the weak classifiers in a tree structure. Mathematically, the ADTree is a set of decision rules as shown in (19.1), where each decision rule takes the form (19.2) (see Fig. 19.3). Fig. 19.4 shows an example of the ADTree model, with its set of decision rules given in Table 19.1. The advantages of the ADTree are immediately apparent here: each decision rule is known explicitly and can be analyzed in isolation. Weak classifiers near the root prediction node act as preconditions to deeper weak classifiers. Each weak classifier is encoded as a decision node. Two corresponding outcomes ($\alpha^+$ and $\alpha^-$) are encoded
FIGURE 19.3 ADTree decision rule representation.
Table 19.1 Equivalent Decision Rules Set of ADTree Shown in Fig. 19.4. Advantages of the Decision Rule Interpretation: Each Decision Rule Can Be Analyzed in Isolation With Confidence-Rated Prediction via a Score Margin

r0(x)   If true then {if true then +0.2 else 0} else 0
r1(x)   If true then {if (x1 > 0.5) then +0.2 else −0.3} else 0
r2(x)   If ¬(x1 > 0.5) then {if (x2 > 0.3) then +0.6 else −0.4} else 0
r3(x)   If true then {if (x2 > 0.1) then +0.1 else −0.2} else 0
FIGURE 19.4 ADTree with three decision stumps (each consisting of a decision node and two prediction nodes). All dotted edges of every prediction node reached must be traversed. The decision rules are provided in Table 19.1.
as prediction nodes. Together, a decision node and its two prediction nodes form a decision stump. Each boosting iteration adds one decision stump to the existing ADTree model under one of the existing prediction nodes. The root decision rule $r_0(x)$ in Table 19.1 describes the root prediction node of the ADTree model in Fig. 19.4. The first decision rule $r_1(x)$ describes the decision stump that consists of the decision function $f_1(x)$ and its two prediction nodes directly underneath the root prediction node. The second decision rule $r_2(x)$ describes the decision stump with the decision function $f_2(x)$ and its two prediction nodes, attached beneath the prediction node with the value −0.3. The third decision rule $r_3(x)$ describes the decision stump with the decision function $f_3(x)$, whose precondition is true, so it is also attached beneath the root prediction node.
FIGURE 19.5 Weak learners for: (A) univariate ADTree with a univariate base condition formed using a single feature $x_j$; (B) Fisher’s ADTree with a multivariate base condition formed using all presented features; (C) Sparse ADTree with many zero elements in its multivariate base condition; (D) Regularized LADTree where the multivariate base condition is a regression function $g(x)$ instead of a Boolean function $f(x)$.
Given an input sample $x = [0.6, 0]^T$, evaluating all the decision rules in Table 19.1 returns the prediction scores +0.2, +0.2, 0, and −0.2, respectively. Taking the sign of their sum, $\mathrm{sign}(+0.2 + 0.2 + 0 - 0.2) = \mathrm{sign}(+0.2) = +1$, classifies this input sample as positive, and the magnitude of the sum (0.2) indicates the confidence of the prediction.

$$\text{ADTree model} := \sum_{t=0}^{T} r_t(x), \tag{19.1}$$

where

$$r_t(x):\ \text{if (precondition) then [if (condition) then } \alpha^+ \text{ else } \alpha^- \text{] else } 0. \tag{19.2}$$
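The rule-set view in (19.1) and (19.2) translates directly into code. The sketch below encodes the four rules of Table 19.1 as (precondition, condition, $\alpha^+$, $\alpha^-$) tuples and reproduces the worked example for $x = [0.6, 0]^T$. The names `RULES`, `rule_scores`, and `classify` are illustrative, and treating a zero margin as the positive class is an arbitrary assumption.

```python
# Each rule follows (19.2): (precondition, condition, alpha_plus, alpha_minus).
RULES = [
    (lambda x: True,             lambda x: True,       0.2,  0.0),   # r0 (root)
    (lambda x: True,             lambda x: x[0] > 0.5, 0.2, -0.3),   # r1
    (lambda x: not (x[0] > 0.5), lambda x: x[1] > 0.3, 0.6, -0.4),   # r2
    (lambda x: True,             lambda x: x[1] > 0.1, 0.1, -0.2),   # r3
]

def rule_scores(x, rules=RULES):
    """Evaluate every decision rule; a rule whose precondition fails contributes 0."""
    scores = []
    for precondition, condition, a_plus, a_minus in rules:
        if precondition(x):
            scores.append(a_plus if condition(x) else a_minus)
        else:
            scores.append(0.0)
    return scores

def classify(x, rules=RULES):
    """Return (label, margin); the label is the sign of the summed scores (19.1)."""
    margin = sum(rule_scores(x, rules))
    label = 1 if margin >= 0 else -1   # assumption: a zero margin maps to +1
    return label, margin

scores = rule_scores([0.6, 0.0])
label, margin = classify([0.6, 0.0])
```

Because each rule is a separate tuple, any one of them can be inspected or removed without touching the rest, which is the interpretability argument made above.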
19.3 ALTERNATING DECISION TREES

The original univariate ADTree [19] is described in Section 19.3.1. At the beginning of every boosting iteration, it exhaustively generates a set of univariate base conditions $C$, each based on a different feature, given the weight distribution (Fig. 19.5A). The selected $j$th feature $x_j$ is thresholded at $\vartheta$ to form the univariate split. Fisher’s ADTree [47], described in Section 19.3.2, extends the work on multivariate decision trees by applying Fisher’s discriminant analysis [17] to the existing univariate ADTree. Fisher’s ADTree sits at the opposite extreme from the univariate ADTree in that it uses all the presented features in a single decision node rather than an individual $j$th feature $x_j$. Fisher’s discriminant analysis yields a vector $\beta$ that forms an artificial feature $x^T\beta$, which results in a multivariate base condition. The Sparse ADTree [46] in Section 19.3.3 applies Sparse Linear Discriminant Analysis (SLDA) [9], which allows the $\beta$ vector to be sparse (i.e., $\beta$ may have many zero elements, so that redundant features are removed). All three ADTree variants in Fig. 19.5A–C are grown using the AdaBoost algorithm, in which the weight distribution must be explicitly incorporated into the weak learner. The Regularized LADTree [47] uses LogitBoost [21], in which the weights are intrinsically part of the iteratively reweighted least-squares regression problem [21]. This means that feature selection can be incorporated seamlessly into the tree by applying any of the well-developed regularization techniques in the literature [27]. Correspondingly, the multivariate base condition is formed as a regression function $g(x)$ instead of a Boolean function $f(x)$ (Fig. 19.5D), as discussed in detail in Section 19.3.4.
19.3.1 UNIVARIATE ADTREE

Algorithm 19.3 shows the implementation of the Real AdaBoost (Algorithm 19.1) to learn an ADTree model. Two important variables are maintained throughout the induction in Algorithm 19.3: the precondition set $P$ and the base condition set $C$. The quality of any decision tree depends heavily on how the input space is partitioned. The precondition set $P$ keeps track of all the preconditions, each of which refers to a partition of the input space populated by the training samples. The set $C$ holds candidate base models, termed base conditions, each of which can split a particular input space region into two new partitions.

Algorithm 19.3 ADTree learning with Real AdaBoost
Input: training dataset $D$
1. Initialization
   1.1. Set $w_{i,t=0} = 1/n \;\forall i$ and $P_{t=1} = \{\text{true}\}$
   1.2. The first decision rule $r_0(x)$: {if (true) then [if (true) then $\alpha_0 = \frac{1}{2}\ln\frac{W_+(\text{true})}{W_-(\text{true})}$ else 0] else 0}
   1.3. Update $w_{i,t=1} = w_{i,t=0}\exp\left(-r_0(x^{(i)})\,y^{(i)}\right)$
2. Repeat for boosting cycle $t = 1, \ldots, T$
   2.1. For each precondition $c_1 \in P_t$ and each condition $c_2 \in C$, evaluate
        $Z(c_1, c_2) = 2\left(\sqrt{W_+(c_1 \wedge c_2)\,W_-(c_1 \wedge c_2)} + \sqrt{W_+(c_1 \wedge \neg c_2)\,W_-(c_1 \wedge \neg c_2)}\right) + W(\neg c_1)$
   2.2. Calculate $\alpha_t^+$ and $\alpha_t^-$ for the selected $c_1^*$ and $c_2^*$ that minimize $Z$, with $\delta = 1$:
        $\alpha_t^+ = \frac{1}{2}\ln\dfrac{W_+(c_1^* \wedge c_2^*) + \delta}{W_-(c_1^* \wedge c_2^*) + \delta}, \qquad \alpha_t^- = \frac{1}{2}\ln\dfrac{W_+(c_1^* \wedge \neg c_2^*) + \delta}{W_-(c_1^* \wedge \neg c_2^*) + \delta}$
   2.3. Update $P_{t+1} = P_t \cup \{c_1^* \wedge c_2^*,\; c_1^* \wedge \neg c_2^*\}$
   2.4. Update $w_{i,t+1} = w_{i,t}\exp\left(-r_t(x^{(i)})\,y^{(i)}\right)$
Output: $F(x) = \sum_{t=0}^{T} r_t(x)$
All training samples are equally weighted in step 1.1 in Algorithm 19.3 by assigning wi,t=0 = 1/n. The symbols i and t are indexes to the training sample and boosting step. The precondition set P is also initialized to include the precondition true which defines the entire input space. With the root decision rule, an additional prediction value α0 is derived from the ratio between positive and negative training samples, acting as a prior classifier. The total weight of the positive samples that satisfy a condition c is denoted as W+ (c), and W− (c) for the negative samples. The training data set is reweighted according to this root decision rule (step 1.3) before a new boosting procedure begins. Step 2 learns the rest of the decision rules sequentially based on the iteratively reweighted training data set. Each boosting cycle adds a new decision stump to expand the ADTree model whereby a split evaluation (step 2.1) is performed to decide the best combination of the precondition and condition. The precondition refers to the choice of a prediction node that is selected for the inclusion into the ADTree. The condition refers to the decision node of the decision stump.
The splitting criterion used in the univariate ADTree weighs the weighted error of a base condition (split) under a specific precondition (first term) against whether it is worthwhile to expand the selected precondition at all (second term, the total weight of the samples that do not satisfy the precondition). In step 2.2 (node partition), the combination of precondition and condition that minimizes the splitting criterion is selected. The two prediction values, which correspond to the two prediction nodes of the decision stump, are then calculated to complete the new decision rule. Steps 2.3 and 2.4 update the precondition set to include the two newly formed preconditions and then update the weight distribution over the training data set based on the newly added decision rule. This guides the next weak learner when it generates a new set of base conditions. The process repeats until the maximum number of boosting steps $T$ is reached.
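Steps 2.1 and 2.2 of Algorithm 19.3 can be sketched as follows. Preconditions and conditions are represented as Boolean masks over the training samples, which is an assumption made for the example, and the helper names are illustrative. The smoothing constant `delta` defaults to 1 as in Algorithm 19.3.

```python
import numpy as np

def class_weights(w, y, mask):
    """Return (W_plus, W_minus): total weight of positive / negative samples in mask."""
    return np.sum(w[mask & (y == 1)]), np.sum(w[mask & (y == -1)])

def split_criterion(w, y, pre, cond):
    """Z(c1, c2) from step 2.1; pre and cond are Boolean masks for c1 and c2."""
    wp_t, wm_t = class_weights(w, y, pre & cond)
    wp_f, wm_f = class_weights(w, y, pre & ~cond)
    return 2.0 * (np.sqrt(wp_t * wm_t) + np.sqrt(wp_f * wm_f)) + np.sum(w[~pre])

def prediction_values(w, y, pre, cond, delta=1.0):
    """alpha_plus and alpha_minus from step 2.2."""
    wp_t, wm_t = class_weights(w, y, pre & cond)
    wp_f, wm_f = class_weights(w, y, pre & ~cond)
    a_plus = 0.5 * np.log((wp_t + delta) / (wm_t + delta))
    a_minus = 0.5 * np.log((wp_f + delta) / (wm_f + delta))
    return a_plus, a_minus

w = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
pre = np.ones(4, dtype=bool)                    # the precondition "true"
good = np.array([True, True, False, False])     # separates the classes perfectly
poor = np.array([True, False, True, False])     # carries no class information
z_good = split_criterion(w, y, pre, good)
z_poor = split_criterion(w, y, pre, poor)
a_plus, a_minus = prediction_values(w, y, pre, good)
```

A pure split drives both square-root terms to zero, while an uninformative split leaves $Z$ at the total weight under the precondition, so minimizing $Z$ favors the purer partition.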
19.3.2 FISHER’S ADTREE

Rather than splitting on an individual feature, Fisher’s ADTree splits on an artificial feature that is a linear combination of the original features obtained through Fisher’s discriminant analysis [17]. Fisher’s discriminant analysis projects the training samples onto a subspace spanned by $\beta$. The information gained is then used to find the best split point on the resulting one-dimensional artificial feature. The objective is to maximize the between-class covariance $\beta^T \Sigma_b \beta$ relative to the within-class covariance $\beta^T \Sigma_w \beta$ of the projected samples. This forms Fisher’s ratio $J(\beta)$ in (19.3), which can be maximized by solving a generalized eigenvalue problem. The number of dimensions of the subspace is determined by the total number of classes $K$. For binary class problems, it results in a single discriminative vector $\beta$.

$$J(\beta) = \frac{\beta^T \Sigma_b \beta}{\beta^T \Sigma_w \beta}, \tag{19.3}$$

where $\Sigma_b$ and $\Sigma_w$ in (19.3) are the between-class and within-class covariances of the original data set, respectively. They are estimated from the training data set using (19.4) and (19.5). The mean vector of the entire training data set is denoted as $\mu$, while the mean vector of the training samples from class $k$ is denoted as $\mu_k$.

$$\Sigma_b = \sum_{k=1}^{K} (\mu_k - \mu)(\mu_k - \mu)^T \tag{19.4}$$

$$\Sigma_w = \sum_{k=1}^{K} \sum_{x^{(i)} \in y_k} \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \tag{19.5}$$
This technique has been applied to decision trees to form Fisher’s decision tree [31], a multivariate extension of the classical C4.5 algorithm [39] (Fig. 19.6). Fisher’s ADTree differs from Fisher’s decision tree by allowing several decision stumps to be boosted under the same prediction node. Boosting on discriminant analysis is a separate but related field that extends Fisher’s discriminant analysis through models other than decision trees (Fig. 19.7). Examples include boosting discriminant analysis [32,33] and boosted discriminant projections for k-nearest neighbor classification [35]. Fisher’s ADTree differs from these works by representing the final ensemble model as a decision tree structure. Fig. 19.6 shows the genealogy of Fisher’s ADTree in relation to the related algorithms.
FIGURE 19.6 Genealogy of the Fisher’s ADTree (highlighted in gray).
FIGURE 19.7 Genealogy of sparse ADTree (highlighted box). Sparse ADTree can be viewed as a sparse version of Fisher’s ADTree. It is also an alternative approach to induce the univariate version of the ADTree.
In the original Fisher’s discriminant estimation, the weight distribution is not part of the optimization formulation, so an identical $\beta$ would be produced at every boosting cycle, defeating the purpose of boosting. Weighted versions of $\Sigma_b$ and $\Sigma_w$, shown in (19.6) and (19.7), are therefore used to find the discriminative vector $\beta$, where the weighted $k$th class mean vector and the overall mean vector are given in (19.8) and (19.9), respectively. The weighted Fisher’s discriminant, which forms the weak learner for Fisher’s ADTree, is presented in Algorithm 19.4.

$$\Sigma_b = \sum_{k=1}^{K} \frac{1}{\sum_{i \in y_k} w_i}\,(\mu_k - \mu)(\mu_k - \mu)^T \tag{19.6}$$

$$\Sigma_w = \sum_{k=1}^{K} \sum_{x^{(i)} \in y_k} w_i \left(x^{(i)} - \mu_k\right)\left(x^{(i)} - \mu_k\right)^T \tag{19.7}$$

$$\mu_k = \frac{1}{\sum_{i \in y_k} w_i} \sum_{i \in y_k} w_i\, x^{(i)} \tag{19.8}$$

$$\mu = \frac{1}{K} \sum_{k=1}^{K} \mu_k \tag{19.9}$$
Algorithm 19.4 Weighted Fisher’s discriminant
Input: a training dataset $[X, y]$ and weight distribution $w$
Statistical procedure to extract information based on $[X, y]$:
1. Calculate the weighted means of the positive and negative classes, $\mu_1$ and $\mu_2$ respectively, using (19.8);
2. Calculate the weighted between-class covariance matrix $\Sigma_b$ using (19.6);
3. Calculate the weighted within-class covariance matrix $\Sigma_w$ using (19.7);
4. Maximize Fisher’s ratio below by solving the generalized eigenvalue problem $\Sigma_b \beta = \lambda \Sigma_w \beta$:
   $$\frac{\beta^T \Sigma_b \beta}{\beta^T \Sigma_w \beta},$$
   where $\lambda$ and $\beta$ are the eigenvalue and eigenvector, respectively, of the eigendecomposition problem.
Output: $\beta$
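A NumPy sketch of Algorithm 19.4 is given below under several assumptions: the generalized eigenvalue problem is solved by applying `numpy.linalg.eig` to $\Sigma_w^{-1}\Sigma_b$, a small ridge term is added to $\Sigma_w$ so that the linear solve is well conditioned, the returned $\beta$ is scaled to unit length, and the per-class scaling of $\Sigma_b$ follows (19.6) as printed. The function name `weighted_fisher` is illustrative.

```python
import numpy as np

def weighted_fisher(X, y, w, ridge=1e-8):
    """Return a unit-length discriminative vector beta for weighted samples."""
    classes = np.unique(y)
    p = X.shape[1]
    means, totals = [], []
    for k in classes:
        m = (y == k)
        wk = w[m].sum()
        means.append((w[m][:, None] * X[m]).sum(axis=0) / wk)   # (19.8)
        totals.append(wk)
    mu = np.mean(means, axis=0)                                  # (19.9)

    Sb = np.zeros((p, p))
    Sw = np.zeros((p, p))
    for k, mu_k, wk in zip(classes, means, totals):
        d = (mu_k - mu)[:, None]
        Sb += (d @ d.T) / wk                                     # (19.6)
        m = (y == k)
        Xc = X[m] - mu_k
        Sw += (w[m][:, None] * Xc).T @ Xc                        # (19.7)
    Sw += ridge * np.eye(p)            # assumption: keeps Sw invertible

    # Solve Sb beta = lambda Sw beta through the eigenvectors of Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    beta = np.real(evecs[:, np.argmax(np.real(evals))])
    return beta / np.linalg.norm(beta)

# Two classes that differ only along the first feature.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0],
              [3, 0], [4, 1], [3, 1], [4, 0]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
w = np.full(8, 1.0 / 8)
beta = weighted_fisher(X, y, w)
```

In the binary case the direction of $\beta$ does not depend on how $\Sigma_b$ is scaled, since $\Sigma_b$ has rank one; changing the boosting weights `w` shifts the class means and $\Sigma_w$, which is what lets successive boosting cycles produce different projections.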
19.3.3 SPARSE ADTREE

Models with a sparse representation are easier to interpret, less complex, and have lower memory requirements. In the regression context, where the outputs are continuous, a sparse representation of the weight vector $\beta$ is desirable for both interpretability and automatic feature selection when explaining the relationship between the features and the continuous output [26]. The discussion of the Sparse ADTree in this section follows the approach in [46] to incorporate sparsity within the multivariate ADTree. Fig. 19.7 shows the genealogy of the Sparse ADTree in relation to other algorithms.

The Sparse ADTree is grown using Sparse Linear Discriminant Analysis (SLDA) [9]. SLDA uses the optimal scoring formulation [25] shown in (19.10), which recasts the classification problem of Fisher’s discriminant as a regression problem. For supervised binary classification problems, the provided class label vector $y$ is categorical: each element $y_i$ is the class label of the $i$th sample, either +1 or −1. To perform the regression, a class indicator matrix $Y$ is used instead of $y$, and an optimal score vector $\theta$ is applied to form a real-valued output $Y\theta$. The rows of $Y$ correspond to the samples and the columns to the class labels. Given the training data set $X$ and the class indicator matrix $Y$, optimal scoring is a regression-type approach that finds a linear regression model $X\beta$ that best approximates the output response $Y\theta$. Minimizing the squared residual between $Y\theta$ and $X\beta$ is equivalent to maximizing Fisher’s ratio. The constraint $\theta^T D_\pi \theta = 1$ in (19.10) prevents a null solution.

$$\min_{\theta,\beta}\; n^{-1}\left\|Y\theta - X\beta\right\|_2^2 \quad \text{subject to } \theta^T D_\pi \theta = 1, \quad \text{where } D_\pi = n^{-1} Y^T Y \tag{19.10}$$
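SLDA applies the Elastic Net penalty to the optimal scoring formulation to create a sparse representation [9]. The Elastic Net formulation (19.12) is a Lasso-type penalization technique whose sparsity-inducing property is well examined and documented [55]. The Lasso-constrained optimal scoring is shown in (19.11),

$$\min_{\theta,\beta}\; n^{-1}\left\|Y\theta - X\beta\right\|_2^2 + \lambda_1\left\|\beta\right\|_1 \quad \text{subject to } \theta^T D_\pi \theta = 1, \tag{19.11}$$

where $\lambda_1$ is the regularization parameter for the Lasso penalty. The Elastic Net formulation adds a Ridge penalty with regularization parameter $\lambda_2$:

$$\min_{\theta,\beta}\; n^{-1}\left\|Y\theta - X\beta\right\|_2^2 + \lambda_2\left\|\beta\right\|_2^2 + \lambda_1\left\|\beta\right\|_1 \quad \text{subject to } \theta^T D_\pi \theta = 1 \tag{19.12}$$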
The SLDA paper [9] proposes a simple iterative algorithm to solve (19.12) by alternately optimizing two parameters: the optimal score vector $\theta$ and the discriminative vector $\beta$. First, $\theta$ is held fixed while optimizing $\beta$; then $\beta$ is held fixed while solving for $\theta$. This process is repeated until convergence. Algorithm 19.5 shows the detailed implementation.

Algorithm 19.5 SLDA
Input: a training dataset, i.e., $X$ and $Y$
1. Initialize a trivial optimal score vector $\theta_0$ consisting of all 1s. Set $\theta = (I - \theta_0\theta_0^T D_\pi)\theta^*$, where $\theta^*$ is a random vector, and then normalize $\theta$ such that $\theta^T D_\pi \theta = 1$.
2. Repeat until convergence:
   2.1. Solve the Elastic Net regression (19.12) for fixed $\theta$ using the LARSEN algorithm [55]
   2.2. Obtain the Ordinary Least Squares (OLS) solution $\theta = D_\pi^{-1} Y^T X\beta$ for fixed $\beta$. The optimal score vector $\theta$ is then ortho-normalized to make it orthogonal to $\theta_0$.
Output: $\beta$ and $\theta$

The Elastic Net regression problem (19.12) is reformulated as the Lasso regression (19.13) by constructing the augmented artificial training data set $(X^*, Y^*)$ shown in (19.14).

$$\min_{\beta}\; n^{-1}\left\|Y^*\theta - X^*\beta\right\|_2^2 + \lambda_1\left\|\beta\right\|_1 \tag{19.13}$$

$$X^*_{(n+p)\times p} = (1 + \lambda_2)^{-1/2}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix}, \qquad Y^*_{(n+p)} = \begin{pmatrix} Y \\ 0 \end{pmatrix} \tag{19.14}$$
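The augmentation in (19.14) can be checked numerically. The sketch below builds the augmented pair for a fixed response vector (standing in for $Y\theta$) and verifies that the plain squared residual on the augmented data equals the squared residual plus the Ridge term on the original data once $\beta$ is rescaled by $\sqrt{1+\lambda_2}$. The function name `augment_elastic_net` and the random test data are illustrative assumptions.

```python
import numpy as np

def augment_elastic_net(X, y, lam2):
    """Build (X*, y*) as in (19.14) for a fixed response vector y."""
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1.0 + lam2)
    y_aug = np.concatenate([y, np.zeros(p)])
    return X_aug, y_aug

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)          # stands in for Y @ theta with theta held fixed
beta = rng.normal(size=3)
lam2 = 0.7

X_aug, y_aug = augment_elastic_net(X, y, lam2)

# Residual on the augmented data, with beta rescaled by sqrt(1 + lam2) ...
lhs = np.sum((y_aug - X_aug @ (np.sqrt(1.0 + lam2) * beta)) ** 2)
# ... equals the residual plus the Ridge penalty on the original data.
rhs = np.sum((y - X @ beta) ** 2) + lam2 * np.sum(beta ** 2)
```

The Ridge part of the penalty is thus absorbed into the data, so a Lasso solver can be used on the augmented problem.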
When LARS is used to obtain the regularization path of (19.13), it is termed LARSEN. The only difference between LARS and LARSEN (besides the augmented training data set $(Y^*\theta, X^*)$) is that LARSEN uses Cholesky factorization [55] to invert the Gram matrix $G_A$ for faster computation. The SLDA algorithm does not accommodate additional weight inputs when finding the optimal $\beta$.
Therefore, a simple modification is made to the LARSEN algorithm [55] to facilitate boosting by focusing the learning of $\beta$ on samples with higher importance, i.e., those with higher weight values. This is shown in step 2.2 of Algorithm 19.6. Algorithm 19.6 starts with the null solution $\beta^{(k=0)} = 0$. The active set $A$ indexes the active features (non-zero entries of $\beta$), and the inactive set $A^c$ is the complement of $A$. An estimated model $\mu^{(k)} = X^*\beta^{(k)}$ that best approximates the output $Y^*\theta$ is built gradually, one feature at a time. The difference between these two terms is defined as the residual. The $k$th residual vector for the entire training data set, $\varepsilon^{(k)} = Y^*\theta - \mu^{(k)}$ with length $n$, is updated in step 2.1. To select the feature that enters the estimated model, the correlation between the features and the residual vector is calculated as $\mathrm{corr}(\mu^{(k)}) = X^{*T}\varepsilon^{(k)}$. To account for boosting, the weight vector $w^{(t)}$ of length $n$ at the $t$th boosting step is used directly to reweight the residual vector: the element-wise product $w^{(t)} \cdot \varepsilon^{(k)}$ is used to calculate the correlation vector $\mathrm{corr}(\mu^{(k)}) = X^{*T}(w^{(t)} \cdot \varepsilon^{(k)})$. In this manner, the feature selection is based on both the weight distribution and the residual vector. Next, the vector $\beta^{(k)}$ is updated to accommodate the new feature, as shown in step 2.2.

Given the updated active feature set, the next estimated model is obtained by calculating $\beta^{(k+1)}$ such that $\mu^{(k+1)} = \mu^{(k)} + \gamma u_A$. The equiangular vector $u_A$ is a linear combination of the active features, which essentially determines the increment added to $\beta$. The new estimated model moves in the direction of $u_A$ for a step length $\gamma$. As $\gamma$ increases, the correlation between the active features and the updated residual vector decreases accordingly. The step stops when an inactive feature becomes equally important in terms of correlation, and the step length $\gamma$ is calculated on this basis. This process is shown in steps 2.3 and 2.4. Step 2.5 calculates the shortest length $\gamma^*$ at which the coefficient of any active feature reaches zero. Lasso regression requires the sign of each correlation to match the sign of the corresponding $\beta_j$. Hence, once this sign restriction is violated, i.e., $\gamma^* < \gamma$ (step 2.6), the update of $\mu^{(k+1)}$ stops at length $\gamma^*$ (rather than $\gamma$) and feature $j^*$ is removed from the active set $A$. Otherwise, the update proceeds as normal with length $\gamma$. Step 2.7 increments the index $k$ by 1.

Applying SLDA with Algorithm 19.6 results in a final regularization path, represented by a series of $\beta$ solutions with different numbers of features. To select the $\beta$ that forms an optimal multivariate base condition for the ADTree, a model selection criterion such as generalized cross-validation (GCV) [7], the Akaike Information Criterion (AIC) [28], or the Bayesian Information Criterion (BIC) [43] can be used, generating different variants of the Sparse ADTree with different decision node complexities and tree sizes.
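The effect of reweighting the residual before the correlation step can be seen in a small sketch. It works on an un-augmented design matrix for simplicity, assumes the weight vector has the same length as the residual, and uses the illustrative name `select_feature`; it covers only the feature-selection part of step 2.2, not the full LARSEN path.

```python
import numpy as np

def select_feature(X, residual, w, active=()):
    """Pick the inactive feature with the largest weighted absolute correlation."""
    corr = X.T @ (w * residual)          # element-wise reweighting of the residual
    score = np.abs(corr)
    for j in active:                     # features already active are skipped
        score[j] = -np.inf
    return int(np.argmax(score)), corr

# Feature 0 explains samples 0 and 2, feature 1 explains samples 1 and 3.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
residual = np.ones(4)

w_a = np.array([0.4, 0.1, 0.4, 0.1])     # boosting emphasises samples 0 and 2
w_b = np.array([0.1, 0.4, 0.1, 0.4])     # boosting emphasises samples 1 and 3
j_a, corr_a = select_feature(X, residual, w_a)
j_b, corr_b = select_feature(X, residual, w_b)
j_c, _ = select_feature(X, residual, w_b, active=(1,))
```

With identical residuals, the selected feature changes from 0 to 1 purely because the boosting weights moved to different samples.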
19.3.4 REGULARIZED LOGISTIC ADTREE

The Sparse ADTree has some drawbacks. Specifically, SLDA reformulates the classification problem as a regression through optimal scoring [25] in order to induce a multivariate decision node. This conversion requires a series of iterative optimization steps over both $\beta$ and $\theta$ in each boosting iteration, which can be computationally demanding. The regularized logistic ADTree (rLADTree) offers a more elegant way to induce a sparse version of the multivariate ADTree: only one regularization path for $\beta$ is required per boosting iteration. In addition, rLADTree serves as a framework that allows $\beta$ to be shaped further through a variety of penalization functions. Fig. 19.8 shows the genealogy of rLADTree in relation to other algorithms.
Algorithm 19.6 LARSEN
Input: $(Y^*\theta, X^*)$
1. Initialize: $k = 0$, a null model $\mu^{(k=0)} = X^*\beta^{(k=0)}$ with $\beta^{(k=0)} = 0$, an active set $A = \{\}$ (empty set), and an inactive set $A^c = \{1, \ldots, p\}$.
2. Repeat until the inactive set $A^c$ is empty:
   2.1. Update the residual vector $\varepsilon^{(k)} = Y^*\theta - \mu^{(k)}$
   2.2. Find the maximal correlation $r_j = \max|\mathrm{corr}(\mu^{(k)})|$ and move the corresponding feature from $A^c$ to $A$
   2.3. To compute the equiangular vector $u_A = X^*_A \omega_A$, calculate the following:
        $X^*_A = (\ldots\, s_j X^*_j\, \ldots)_{j \in A}$, where $s_j = \mathrm{sign}(\mathrm{corr}(\mu_j^{(k)}))$
        $G_A = X_A^{*T} X^*_A$ (Gram matrix)
        $a = (1_A^T G_A^{-1} 1_A)^{-1}$, where $1_A$ is a vector of 1’s with length $|A|$
        $\omega_A = a\, G_A^{-1} 1_A$
   2.4. Compute an analytical step size in the direction of $u_A$:
        $\gamma = \min^{+}_{j \in A^c}\left\{\dfrac{r - r_j}{a - a_j},\; \dfrac{r + r_j}{a + a_j}\right\}$
        where $r$, $a$, $r_j$ and $a_j$ are the correlation values between $X^*_A$ and $Y^*\theta$, $X^*_A$ and $u_A$, $X^*_j$ and $Y^*\theta$, as well as $X^*_j$ and $u_A$
   2.5. For $d$ being a vector of length $p$ equaling $s_j\omega_A$ for $j \in A$ and 0 elsewhere, compute $\gamma^* = \min_{\gamma_j > 0}\{\gamma_j = -\beta_j^{(k)}/d_j\}$
   2.6. Update $\mu^{(k+1)} = \mu^{(k)} + \gamma^* u_A$ and remove $j^*$ from $A$ if $\gamma^* < \gamma$; otherwise, update $\mu^{(k+1)} = \mu^{(k)} + \gamma u_A$
   2.7. Increment $k = k + 1$
Output: A series of $\beta^{(k)}$ (regularization path)

A general implementation of LADTree (without regularization) is presented in Algorithm 19.7. Step 1 describes the formulation of the root decision rule. The rest of the decision rules are learned sequentially via LogitBoost until the predetermined number of boosting steps is reached. The Iteratively Reweighted Least Squares (IRLS) formulation of LogitBoost [21] updates the working response and weight of every training sample (step 2.1). The weighted least-squares problem is solved in step 2.2, and the resulting regression model(s) are added to the base condition set $C$. Step 2.3 expands the LADTree by evaluating all possible combinations of a precondition $c_1 \in P$ and a base condition $c_2 \in C$. $W_+(c)$ and $W_-(c)$ denote the total weight of the positive and negative samples, respectively, that satisfy the predicate $c$. The combination that yields the lowest splitting criterion $Z_t(c_1, c_2)$ is added as a new decision rule, and the resulting real-valued predictions are computed in step 2.4. In step 2.5, the probability of $y = 1$ for the $i$th training sample, $p(x^{(i)})$, is updated with every new decision rule. The precondition set $P$ is expanded to include the two new input subspaces in step 2.6.

The multivariate regularized LADTree can be induced by first restricting the regression model $g_m(x)$ to a standard linear regression of the form $g_m(x): x^T\beta$, which results in (19.15). $W$ is an $n \times n$ diagonal matrix in which each diagonal entry is the weight of one training sample. The weight is incorporated directly, such that the output and design matrix are $W^{1/2}z$ and $W^{1/2}X$, respectively.

$$\min_{\beta}\; n^{-1}\left\|W^{1/2}z - W^{1/2}X\beta\right\|_2^2 \tag{19.15}$$
FIGURE 19.8 Genealogy of regularized LADTree. It is a flexible regularization framework which allows an incorporation of any penalization functions without modifications to the solver thus offering higher efficiency and more elegant solutions compared to the other multivariate ADTrees.
Unfortunately, (19.15) alone is insufficient, since a constraint or penalization function must be placed on $\beta$ to shape the characteristics of the ADTree decision node (e.g., feature selection). Therefore, a penalization function $J(\beta)$ is added to (19.15) to obtain the constrained regression solution shown in (19.16). From a Bayesian perspective, this is equivalent to placing a prior on $\beta$ and maximizing the posterior.

$$\min_{\beta}\; n^{-1}\left\|W^{1/2}z - W^{1/2}X\beta\right\|_2^2 + J(\beta) \tag{19.16}$$
The regularized LADTree assimilates the weight distribution into the linear regression problem by minimizing the residual between $W^{1/2}z$ and $W^{1/2}X\beta$. Expressing the problem in the form of (19.16) makes it possible to use regularized linear regression directly, without any modification to accommodate the boosting weight distribution. The classical approaches include Ridge ($\|\beta\|_2^2$), Lasso ($\|\beta\|_1$), and Elastic Net (a convex combination of Ridge and Lasso, $\lambda_2\|\beta\|_2^2 + \lambda_1\|\beta\|_1$). The modularity and flexibility of the proposed regularized LADTree are its greatest advantages over other ADTree designs. Users of rLADTree can apply any of the classical penalization techniques [27] and select any number of features that they wish to incorporate in order to customize the tree for their specific application. Other popular regularization techniques in the literature include the smoothly clipped absolute deviation (SCAD) [16] and the Dantzig selector [6]. Extensions of the Lasso include the Group Lasso [53], Sparse Group Lasso [23], Graphical Lasso [22], and Nonnegative Lasso [52]. There are also Elastic Net (EN) extensions such as the Cluster Elastic Net (CEN) [50] and Robust Elastic Net (REN) [49]. These can be used to generate different penalty functions to achieve particular rLADTree properties.
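As one concrete instance of (19.16), the sketch below takes the Ridge penalty $J(\beta) = \lambda_2\|\beta\|_2^2$, for which the minimizer has the closed form $\beta = (X^T W X + n\lambda_2 I)^{-1} X^T W z$. The weights enter only through the rescaled pair $(W^{1/2}X, W^{1/2}z)$, so an ordinary Ridge solver is enough. The function name `weighted_ridge` and the random data are illustrative assumptions; Lasso or Elastic Net penalties would need an iterative solver instead.

```python
import numpy as np

def weighted_ridge(X, z, w, lam2):
    """Minimise n^-1 ||W^(1/2) z - W^(1/2) X beta||^2 + lam2 ||beta||^2."""
    n, p = X.shape
    sw = np.sqrt(w)
    Xw = sw[:, None] * X          # W^(1/2) X
    zw = sw * z                   # W^(1/2) z
    return np.linalg.solve(Xw.T @ Xw + n * lam2 * np.eye(p), Xw.T @ zw)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
z = rng.normal(size=20)                  # working responses
w = rng.uniform(0.1, 1.0, size=20)       # positive per-sample weights

beta_unpenalised = weighted_ridge(X, z, w, 0.0)
beta_ridge = weighted_ridge(X, z, w, 0.5)

# Gradient of the objective at the Ridge solution; it should vanish.
sw = np.sqrt(w)
grad = (2.0 / 20) * (sw[:, None] * X).T @ (sw[:, None] * X @ beta_ridge - sw * z) \
       + 2.0 * 0.5 * beta_ridge
```

With `lam2 = 0` the function reduces to the weighted least-squares problem (19.15), and increasing `lam2` shrinks the coefficient vector.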
Algorithm 19.7 General LADTree algorithm
Input: a training dataset
Procedure:
1. Initialize:
   1.1. Set probability estimates $p(x^{(i)}) = 0.5 \;\forall i$
   1.2. Set precondition set $P_1 = \{\text{true}\}$
   1.3. Set root decision rule $r_0(x)$: {if (true) then {if (true) then $\alpha_0 = \frac{1}{2}\ln\frac{W_+(\text{true})}{W_-(\text{true})}$ else 0} else 0}
2. Repeat for $t = 1, 2, \ldots, T$
   2.1. Compute the working response and weight $\forall i$
        $z_i = \dfrac{y^{(i)*} - p(x^{(i)})}{p(x^{(i)})\,(1 - p(x^{(i)}))}, \qquad w_i = p(x^{(i)})\,(1 - p(x^{(i)}))$
   2.2. Solve the weighted least-squares regression of $z_i$ to $x^{(i)}$ using the weight $w_i$
   2.3. For each precondition $c_1 \in P$ and each condition $c_2 \in C$, evaluate
        $Z_t(c_1, c_2) = 2\left(\sqrt{W_+(c_1 \wedge c_2)\,W_-(c_1 \wedge c_2)} + \sqrt{W_+(c_1 \wedge \neg c_2)\,W_-(c_1 \wedge \neg c_2)}\right) + W(\neg c_1)$
   2.4. Select the combination of $c_1^*$ and $c_2^*$ that minimizes $Z_t(c_1, c_2)$, and form a new decision rule $r_t(x)$ whose precondition is $c_1^*$, whose condition is $c_2^*$, and whose two prediction values are
        $\alpha_t^+ = \frac{1}{2}\ln\dfrac{W_+(c_1^* \wedge c_2^*) + \varepsilon}{W_-(c_1^* \wedge c_2^*) + \varepsilon}, \qquad \alpha_t^- = \frac{1}{2}\ln\dfrac{W_+(c_1^* \wedge \neg c_2^*) + \varepsilon}{W_-(c_1^* \wedge \neg c_2^*) + \varepsilon}$
   2.5. Update $F(x^{(i)}) \leftarrow F(x^{(i)}) + \frac{1}{2} r_t(x^{(i)})$ and $p(x^{(i)}) = \dfrac{\exp(F(x^{(i)}))}{\exp(F(x^{(i)})) + \exp(-F(x^{(i)}))} \;\forall i$
   2.6. Update the precondition set $P_{t+1} = P_t \cup \{c_1^* \wedge c_2^*,\; c_1^* \wedge \neg c_2^*\}$
Output: $\mathrm{sign}[F(x)] = \mathrm{sign}\left[\sum_{t=0}^{T} r_t(x)\right]$
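Steps 2.1 and 2.5 of Algorithm 19.7 are simple element-wise computations. A minimal sketch follows; the function names are illustrative, and the lower bound `eps` on the weights is an assumption added to avoid division by zero when a probability estimate saturates.

```python
import numpy as np

def working_response_and_weights(y, p, eps=1e-12):
    """Step 2.1: y holds labels in {-1, +1}, p the current estimates of P(y = 1)."""
    y_star = (y + 1) / 2.0                    # map labels to {0, 1}
    w = np.clip(p * (1.0 - p), eps, None)     # assumption: floor the weights at eps
    z = (y_star - p) / w
    return z, w

def update_probability(F):
    """Step 2.5: turn the additive score F into a probability estimate."""
    return np.exp(F) / (np.exp(F) + np.exp(-F))

y = np.array([1, -1])
p0 = np.array([0.5, 0.5])                     # initial estimates from step 1.1
z, w = working_response_and_weights(y, p0)
p_zero = update_probability(np.array([0.0]))
p_large = update_probability(np.array([3.0]))
```

At the initial estimate of 0.5 every sample has weight 0.25 and a working response of ±2; as the score of a sample grows, its probability approaches 1 and its weight shrinks, so later rounds concentrate on the uncertain samples.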
19.4 DISCUSSION AND COMPARISON Three discussions are presented here. First, the ADTree variants described in Section 19.3 are compared against other types of decision trees in terms of the prediction accuracy, induction time, decision tree size, and decision node complexity. Comprehensibility can be viewed as a trade-off between the decision tree size and decision node complexity. Second, since the SADT and rLADTree can generate univariate versions, these are compared against the UADT. Finally, different SADT variants which can be grown using different node complexity tuning and model selection criteria are discussed. The data sets used in this study are publicly available from the University of California, Irvine (UCI) (Frank and Asuncion [18]) and University of Eastern Finland (UEF) (“University of Eastern Finland, Spectral Color Research Group,” [48]). They have been shortlisted (Table 19.2) and preprocessed to center each feature to zero and a standard deviation of one. All experiments were conducted on a PC with Intel® Core™ 3.2 GHz i5 CPU and 4 GB RAM. A standard 10-times 10-fold stratified cross-validation was used. For each data set, the best performing algorithm was given the rank value of 1, the second – the rank value of 2, and so on. The
360
CHAPTER 19 ALTERNATING DECISION TREES
Table 19.2 Data Sets No
Dataset
Number of Samples
Number of Features
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Breast cancer Blood transfusion Liver disorder Vertebral Pimaindian Heart MAGIC gamma Parkinson Haberman ILPD Ionosphere Spambase Wilt QSAR Climate Banknote Woodchip (UEF) Forest (UEF) Paper (UEF)
569 748 345 310 768 267 19020 195 306 579 351 4601 4839 1055 540 1372 10000 707 180
30 4 6 6 8 44 10 22 3 10 33 57 5 41 18 4 26 93 31
ranks were averaged if the performances were tied. An average rank of each learning algorithm was calculated over multiple data sets as shown in (19.17), where ril represents the rank of lth algorithm for ith data set. The total number of data sets is M while the total number of algorithms is L. Rl =
1 ril M
(19.17)
i
The widely used non-parametric Friedman test (19.18) with the post-hoc Nemenyi test [14] was used to assess the performance of the multivariate ADTrees against the other algorithms. The test is based on the average ranks of the classification models and detects whether there are statistically significant differences among the classifiers. The null hypothesis of the Friedman test is that there is no statistically significant difference among the tested learning algorithms. When the null hypothesis was rejected, the post-hoc Nemenyi test [14] was conducted to determine the pairs of algorithms that were statistically different based on the critical difference (CD) (19.19), where $q_\alpha$ denotes the critical value for the Nemenyi test.

$$\chi_F^2 = \frac{12M}{L(L+1)} \left[ \sum_{l} R_l^2 - \frac{L(L+1)^2}{4} \right] \tag{19.18}$$

$$\mathrm{CD} = q_\alpha \sqrt{\frac{L(L+1)}{6M}} \tag{19.19}$$
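As a worked illustration of (19.18) and (19.19), the sketch below evaluates both expressions for the average accuracy ranks reported in Table 19.3 (three algorithms, 19 data sets). The function names are illustrative. The value q_alpha = 2.343 is the tabulated Nemenyi critical value for three algorithms at the 0.05 level given in [14]; using that level here is an assumption made only for the example.

```python
import math


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic (19.18) from the average ranks R_l."""
    L = len(avg_ranks)
    M = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * M / (L * (L + 1)) * (sum_sq - L * (L + 1) ** 2 / 4.0)


def critical_difference(q_alpha, n_algorithms, n_datasets):
    """Nemenyi critical difference (19.19)."""
    L, M = n_algorithms, n_datasets
    return q_alpha * math.sqrt(L * (L + 1) / (6.0 * M))


# Average accuracy ranks of UADT, SADT_UNI and rLADT_UNI (Table 19.3), M = 19.
ranks = [1.37, 2.16, 2.47]
chi2 = friedman_statistic(ranks, n_datasets=19)

# Assumed significance level for this example: alpha = 0.05, three algorithms.
cd = critical_difference(2.343, n_algorithms=3, n_datasets=19)

# Two algorithms differ significantly if their average ranks differ by more than CD.
uadt_vs_sadt_uni = abs(ranks[0] - ranks[1]) > cd
```

With these inputs the rank gap between UADT and SADT_UNI (0.79) slightly exceeds the critical difference of about 0.76.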
FIGURE 19.9 The average ranks of the algorithms (rank values are shown in round brackets next to the algorithms) and the corresponding statistical comparisons for: (A) prediction accuracy, (B) decision tree size, and (C) split complexity (total number of nonzero feature coefficients). Groups of algorithms that are not statistically significantly different are connected by a bold line. CD refers to a critical difference in terms of the rank value.
19.4.1 COMPARISON WITH OTHER DECISION TREE ALGORITHMS

The experimental results (Fig. 19.9) show no critical difference in prediction accuracy among the algorithms across all data sets. This is consistent with the No-Free-Lunch Theorem [51], which suggests that no single algorithm can outperform all others on all possible data sets. Nonetheless, a well-suited decision tree can be grown by matching the characteristics of the data set to those of the decision tree algorithm. The research in [47] shows that when the domain problem has a small number of highly discriminative features, classical decision tree algorithms such as C4.5 and CART can generate optimal models. Yet when these algorithms are applied to a domain problem with complex and correlated features, they result in large, incomprehensible, and inaccurate trees. When the domain problem contains data with unknown characteristics, the ADTree algorithms show their true strength. They are equipped with boosting techniques to match the complexity of the
problem. While ensembles of decision trees (e.g., random forests) can achieve similar complexity matching, they often come with an expensive trade-off in induction time and comprehensibility. ADTrees, on the other hand, result in a single tree whose comprehensibility is of great importance to practitioners who require transparency in their classification problem. For example, the univariate ADTree gives the fastest induction time for domain problems with few discriminative features [47]. The Fisher's ADTree was proven to be the boosted alternative to multivariate decision trees such as Fisher's decision tree. The proposed sparse ADTree incorporates a sparseness criterion into the multivariate ADTree to allow better comprehension through feature selection, and was shown to be a nonparametric extension of SLDA. The regularized LADTree is a boosted multivariate tree with superior classification performance at little to no cost in tree size and node complexity [47]. For example, in applications that contain features with complex interactions, the regularized LADTree is able to build a more accurate and much smaller tree with its multivariate nodes than C4.5 and CART. At the same time, its node complexity remains small due to the use of regularization techniques. Furthermore, its modularity allows a wide range of established linear regularization techniques to be implemented seamlessly, effectively linking the decision tree to powerful regularization research fields.
19.4.2 COMPARISON OF DIFFERENT UNIVARIATE ADTREE VARIANTS

SADT and rLADT can generate univariate versions by using only one active feature when computing the regularization path. In general, therefore, univariate ADTrees are a subclass of the SADT and rLADT algorithms. The experimentally observed performance differences between these univariate ADTree variants are presented in Table 19.3. The test shows that UADT is statistically more accurate (at the 0.01 significance level) than SADT_UNI and rLADT_UNI. However, it is inferior to the other two, with a statistically significant difference, in decision tree size and split complexity. This leads to the conclusion that rLADT_UNI and SADT_UNI are generally more comprehensible than UADT. Furthermore, UADT uses an exhaustive approach to generate a set of univariate base conditions, and the results accordingly show that SADT_UNI and rLADT_UNI are consistently faster to induce than UADT on all data sets. Thus, for domain problems with single discriminative features, UADT should be used when prediction accuracy is prized over induction time and comprehensibility. However, SADT and rLADT should be used when the reverse is desirable.
19.4.3 COMPARISON BETWEEN SADT VARIANTS

SADT allows manual fine-tuning of the decision node complexity, resulting in slightly different ADTree models. In this discussion, the regularization path is stopped at the user-defined parameter ρ, and GCV model selection is used to select one of the solutions with 1 to ρ features. The ρ values selected for this study are 1, 3, 5, and p (the total number of features in each data set), and the respective variants are SADT_UNI, SADT_3, SADT_5, and SADT. Table 19.4 shows the average ranks of these variants for each performance metric. Algorithms are listed top-down based on their average ranks (shown on the left), with the best performing algorithm placed at the top of the list.
Table 19.3 Comparisons Between Univariate ADTree (UADT), Univariate Version of SADT (SADT_UNI), and Univariate Version of rLADT (rLADT_UNI). The Average Ranks Are Shown in the Last Row of Each Part

Prediction Accuracy
ID | UADT | SADT_UNI | rLADT_UNI
1 | 94.68 ±0.80 | 93.68 ±0.67 | 93.84 ±0.58
2 | 77.38 ±0.41 | 76.09 ±0.38 | 76.21 ±0.00
3 | 62.34 ±1.37 | 62.35 ±1.39 | 61.05 ±1.77
4 | 82.71 ±1.24 | 79.03 ±0.59 | 80.52 ±1.42
5 | 72.54 ±0.91 | 73.32 ±0.67 | 72.40 ±0.94
6 | 78.57 ±1.66 | 78.87 ±1.05 | 79.43 ±0.00
7 | 78.59 ±0.10 | 75.12 ±0.01 | 74.75 ±0.00
8 | 88.86 ±1.57 | 80.09 ±0.84 | 75.42 ±0.00
9 | 71.91 ±1.59 | 72.24 ±1.00 | 72.84 ±0.91
10 | 71.23 ±0.52 | 71.51 ±0.00 | 71.51 ±0.00
11 | 84.18 ±1.39 | 78.58 ±1.00 | 81.62 ±0.70
12 | 93.63 ±0.16 | 91.18 ±0.64 | 90.96 ±0.19
13 | 93.37 ±0.11 | 94.61 ±0.00 | 94.61 ±0.00
14 | 84.04 ±0.92 | 82.15 ±0.72 | 80.17 ±0.48
15 | 91.07 ±0.44 | 91.49 ±0.00 | 91.49 ±0.00
16 | 88.80 ±0.23 | 87.87 ±0.10 | 84.55 ±0.00
17 | 67.82 ±0.16 | 63.81 ±0.02 | 63.92 ±0.00
18 | 83.71 ±0.38 | 77.47 ±0.20 | 78.79 ±0.43
19 | 96.51 ±1.67 | 80.08 ±0.00 | 80.08 ±0.00
Rank | 1.37 | 2.16 | 2.47

Induction Time
ID | UADT | SADT_UNI | rLADT_UNI
1 | 1.43 ±0.39 | 0.12 ±0.05 | 0.18 ±0.03
2 | 0.07 ±0.03 | 0.01 ±0.01 | 0.00 ±0.00
3 | 0.11 ±0.05 | 0.05 ±0.01 | 0.08 ±0.02
4 | 0.19 ±0.09 | 0.01 ±0.01 | 0.03 ±0.01
5 | 0.12 ±0.07 | 0.02 ±0.01 | 0.07 ±0.03
6 | 0.32 ±0.14 | 0.06 ±0.02 | 0.01 ±0.00
7 | 16.25 ±0.79 | 0.21 ±0.03 | 0.14 ±0.02
8 | 1.25 ±0.16 | 0.02 ±0.01 | 0.01 ±0.00
9 | 0.05 ±0.03 | 0.02 ±0.02 | 0.03 ±0.01
10 | 0.03 ±0.03 | 0.01 ±0.00 | 0.00 ±0.00
11 | 1.96 ±0.82 | 0.08 ±0.03 | 0.14 ±0.04
12 | 41.23 ±2.72 | 6.64 ±0.62 | 2.12 ±0.20
13 | 1.17 ±0.52 | 0.01 ±0.00 | 0.01 ±0.00
14 | 6.68 ±1.42 | 0.43 ±0.11 | 0.27 ±0.08
15 | 0.77 ±0.45 | 0.02 ±0.03 | 0.01 ±0.00
16 | 0.37 ±0.09 | 0.02 ±0.00 | 0.01 ±0.01
17 | 1.25 ±0.02 | 0.05 ±0.00 | 0.02 ±0.00
18 | 6.23 ±1.10 | 0.04 ±0.03 | 0.18 ±0.02
19 | 1.34 ±0.26 | 0.01 ±0.00 | 0.00 ±0.00
Rank | 3.00 | 1.37 | 1.63

Decision Tree Size
ID | UADT | SADT_UNI | rLADT_UNI
1 | 71.65 ±12.25 | 43.75 ±9.41 | 93.22 ±8.47
2 | 31.93 ±7.48 | 5.92 ±3.21 | 4.00 ±0.00
3 | 33.28 ±6.64 | 35.53 ±6.30 | 75.31 ±15.21
4 | 51.91 ±14.21 | 7.66 ±2.63 | 30.82 ±9.87
5 | 25.75 ±7.54 | 11.92 ±2.67 | 46.51 ±12.77
6 | 23.08 ±5.33 | 20.29 ±4.43 | 4.00 ±0.00
7 | 137.65 ±3.80 | 13.00 ±0.00 | 16.00 ±0.00
8 | 90.25 ±8.48 | 9.79 ±4.03 | 5.38 ±4.36
9 | 29.98 ±10.19 | 16.84 ±9.33 | 30.91 ±8.22
10 | 8.05 ±4.32 | 4.12 ±0.38 | 4.00 ±0.00
11 | 64.06 ±15.41 | 29.5 ±6.58 | 82.60 ±11.92
12 | 139.87 ±3.49 | 119.62 ±7.58 | 125.77 ±6.81
13 | 27.58 ±8.89 | 4.00 ±0.00 | 4.00 ±0.00
14 | 104.56 ±11.39 | 64.57 ±10.88 | 70.54 ±10.22
15 | 37.00 ±17.02 | 6.55 ±8.06 | 4.00 ±0.00
16 | 64.18 ±8.55 | 10.39 ±0.40 | 4.00 ±0.00
17 | 30.55 ±0.32 | 4.00 ±0.00 | 4.00 ±0.00
18 | 86.62 ±8.22 | 6.70 ±4.48 | 57.40 ±8.06
19 | 76.12 ±7.46 | 4.00 ±0.00 | 4.00 ±0.00
Rank | 2.79 | 1.63 | 1.58

Split Complexity
ID | UADT | SADT_UNI | rLADT_UNI
1 | 23.55 ±4.08 | 14.25 ±3.14 | 30.74 ±2.82
2 | 10.31 ±2.49 | 1.64 ±1.07 | 1.00 ±0.00
3 | 10.76 ±2.21 | 11.51 ±2.10 | 24.77 ±5.07
4 | 16.97 ±4.74 | 2.22 ±0.88 | 9.94 ±3.29
5 | 8.25 ±2.51 | 3.64 ±0.89 | 15.17 ±4.26
6 | 7.36 ±1.78 | 6.43 ±1.48 | 1.00 ±0.00
7 | 45.55 ±1.27 | 4.00 ±0.00 | 5.00 ±0.00
8 | 29.75 ±2.83 | 2.93 ±1.34 | 1.46 ±1.45
9 | 9.66 ±3.40 | 5.28 ±3.11 | 9.97 ±2.74
10 | 2.35 ±1.44 | 1.04 ±0.13 | 1.00 ±0.00
11 | 21.02 ±5.14 | 9.50 ±2.19 | 27.20 ±3.97
12 | 46.29 ±1.16 | 39.54 ±2.53 | 41.59 ±2.27
13 | 8.86 ±2.96 | 1.00 ±0.00 | 1.00 ±0.00
14 | 34.52 ±3.80 | 21.19 ±3.63 | 23.18 ±3.41
15 | 12.00 ±5.67 | 1.85 ±2.69 | 1.00 ±0.00
16 | 21.06 ±2.85 | 3.13 ±0.13 | 1.00 ±0.00
17 | 9.85 ±0.11 | 1.00 ±0.00 | 1.00 ±0.00
18 | 28.54 ±2.74 | 1.90 ±1.49 | 18.80 ±2.69
19 | 25.04 ±2.49 | 1.00 ±0.00 | 1.00 ±0.00
Rank | 2.79 | 1.63 | 1.58
Table 19.4 Average Ranks of SADT With Different ρ Values for Each Performance Metric
Prediction Accuracy | Induction Time | Decision Tree Size | Split Complexity
2.08 SADT | 1.16 SADT_UNI | 1.74 SADT | 1.05 SADT_UNI
2.26 SADT_3 | 2.82 SADT | 1.82 SADT_UNI | 2.66 SADT_3
2.37 SADT_5 | 2.84 SADT_3 | 3.08 SADT_5 | 3.13 SADT_5
3.29 SADT_UNI | 3.18 SADT_5 | 3.37 SADT_3 | 3.16 SADT
In terms of prediction accuracy, SADT ranked at the top, followed by SADT_3, SADT_5, and SADT_UNI. The statistical test showed that SADT was statistically better (at the 0.05 significance level) than SADT_UNI; they are the only pair of algorithms with a statistically significant difference. This is unsurprising, since SADT has a full regularization path with all possible models to choose from, while SADT_UNI is limited to just one solution. The trade-off comes in split complexity, where SADT has the worst average rank because the inclusion of more features comes at the expense of node complexity. Only SADT_UNI is statistically less complex than all other variants (at the 0.01 significance level), owing to its univariate decision nodes.

In terms of decision tree size, SADT has the best average rank, followed by SADT_UNI, SADT_5, and SADT_3. In addition, SADT is statistically smaller (at the 0.01 significance level) than both SADT_3 and SADT_5. Similarly, SADT_UNI is statistically smaller than SADT_3.

The induction time is directly affected by the decision tree size and by ρ, which determines the length of the regularization path computation. It is no surprise that SADT_UNI has the fastest induction time, given its least complex nodes and small decision tree size. The surprising result is SADT, which is the second fastest: although it has the highest node complexity, it also has the smallest tree size. This suggests that tree size has a greater impact on induction time than node complexity. In general, SADT_3 and SADT_5 are a trade-off between SADT_UNI and SADT in terms of prediction accuracy and split complexity.

Besides the ρ-controlled SADT variants, other SADT variants can be grown by using different model selection techniques. Table 19.5 shows the performance of SADT variants grown using GCV and AIC. There is no statistically significant difference between these SADT variants for any of the performance metrics, indicating that the choice of model selection technique has a negligible impact.
19.5 APPLICATIONS OF ADTREE

ADTrees have been successfully used in various fields such as engineering, finance, medicine, and astronomy. For example, Creamer and Freund [10] present a representative ADT (a combination of M cross-validated ADTrees) to learn a Balanced Scorecard (BSC) for improving corporate performance. Important relations between corporate governance variables and firm performance were extracted in the ADTree representation. A multi-stock automated trading system is proposed by Creamer and Freund [11]. Historical stock price data are used to train several ADTree models. An online learning layer is used to make predictions on new stock prices. It is also combined with a risk management module to make investment decisions.
Table 19.5 Experimental Results on the Sparse ADTree With AIC Model Selection in Comparison to the Sparse ADTree With GCV Model Selection. The Average Ranks Are Shown in the Last Row of Each Part

Prediction Accuracy
ID | SADT AIC | SADT GCV
1 | 96.84 ±2.08 | 96.89 ±0.37
2 | 76.21 ±0.41 | 76.21 ±0.00
3 | 62.84 ±6.75 | 62.81 ±0.81
4 | 83.39 ±6.87 | 83.35 ±0.57
5 | 76.01 ±4.69 | 76.01 ±0.41
6 | 76.90 ±8.42 | 76.36 ±1.77
7 | 78.90 ±0.80 | 78.90 ±0.03
8 | 81.67 ±7.87 | 81.67 ±1.01
9 | 71.28 ±4.88 | 71.28 ±0.81
10 | 71.35 ±1.01 | 71.39 ±0.26
11 | 84.85 ±5.50 | 84.76 ±0.92
12 | 90.90 ±1.10 | 90.91 ±0.08
13 | 94.59 ±0.19 | 94.59 ±0.06
14 | 85.11 ±2.95 | 85.17 ±0.36
15 | 91.49 ±0.79 | 91.49 ±0.00
16 | 98.21 ±1.13 | 98.21 ±0.05
17 | 99.58 ±0.18 | 99.58 ±0.01
18 | 96.45 ±2.54 | 96.42 ±0.36
19 | 100.00 ±0.00 | 100.00 ±0.00
Rank | 1.47 | 1.53

Induction Time
ID | SADT AIC | SADT GCV
1 | 0.32 ±0.56 | 0.44 ±0.22
2 | 0.01 ±0.00 | 0.01 ±0.00
3 | 0.01 ±0.01 | 0.01 ±0.01
4 | 0.02 ±0.02 | 0.02 ±0.01
5 | 0.04 ±0.04 | 0.04 ±0.02
6 | 2.86 ±2.48 | 2.88 ±0.68
7 | 1.26 ±2.68 | 1.26 ±0.77
8 | 0.17 ±0.22 | 0.25 ±0.15
9 | 0.03 ±0.04 | 0.03 ±0.01
10 | 0.07 ±0.13 | 0.08 ±0.05
11 | 0.34 ±0.57 | 0.44 ±0.25
12 | 7.08 ±6.92 | 6.82 ±2.75
13 | 0.03 ±0.03 | 0.03 ±0.01
14 | 2.30 ±2.41 | 2.85 ±1.13
15 | 0.05 ±0.10 | 0.05 ±0.04
16 | 0.01 ±0.01 | 0.01 ±0.01
17 | 0.57 ±0.05 | 0.39 ±0.05
18 | 8.86 ±11.78 | 11.17 ±5.35
19 | 0.20 ±0.61 | 0.11 ±0.02
Rank | 1.39 | 1.61

Decision Tree Size
ID | SADT AIC | SADT GCV
1 | 9.79 ±14.39 | 11.59 ±4.84
2 | 4.00 ±0.00 | 4.00 ±0.00
3 | 5.62 ±6.54 | 5.62 ±2.64
4 | 7.96 ±8.77 | 7.63 ±2.33
5 | 13.36 ±12.79 | 13.24 ±4.32
6 | 53.23 ±43.51 | 50.56 ±12.02
7 | 15.94 ±25.00 | 15.94 ±8.25
8 | 12.19 ±14.80 | 16.30 ±7.90
9 | 18.85 ±24.88 | 18.22 ±5.43
10 | 15.49 ±23.55 | 15.88 ±8.10
11 | 12.82 ±17.79 | 14.02 ±4.67
12 | 12.01 ±11.00 | 11.80 ±4.52
13 | 5.02 ±4.23 | 5.02 ±1.27
14 | 28.81 ±30.42 | 33.31 ±11.11
15 | 6.88 ±14.13 | 5.68 ±3.55
16 | 4.33 ±2.76 | 4.33 ±0.85
17 | 4.00 ±0.00 | 4.00 ±0.00
18 | 6.49 ±8.29 | 10.30 ±4.03
19 | 4.00 ±0.00 | 4.00 ±0.00
Rank | 1.50 | 1.50

Split Complexity
ID | SADT AIC | SADT GCV
1 | 73.86 ±123.83 | 90.86 ±166.57
2 | 4.00 ±0.00 | 4.00 ±0.00
3 | 9.21 ±13.09 | 9.20 ±13.04
4 | 13.64 ±17.50 | 12.98 ±12.43
5 | 30.15 ±32.15 | 29.83 ±32.14
6 | 341.99 ±287.10 | 263.09 ±251.29
7 | 46.54 ±80.58 | 46.54 ±80.58
8 | 74.80 ±100.65 | 100.57 ±135.98
9 | 14.07 ±20.61 | 13.44 ±18.77
10 | 42.95 ±72.15 | 40.83 ±68.14
11 | 118.02 ±182.20 | 128.69 ±173.76
12 | 204.66 ±207.32 | 200.70 ±193.33
13 | 6.56 ±7.09 | 6.56 ±7.09
14 | 364.87 ±407.13 | 422.98 ±452.77
15 | 25.10 ±81.08 | 17.59 ±53.92
16 | 3.77 ±3.69 | 3.77 ±3.69
17 | 26.00 ±0.00 | 26.00 ±0.00
18 | 162.52 ±245.13 | 274.06 ±515.69
19 | 29.13 ±1.30 | 28.83 ±1.58
Rank | 1.61 | 1.39
In addition, Bagged ADTrees (BADTrees) are used in the medical field to detect important sets of single nucleotide polymorphisms (SNPs) associated with diseases [24]. A number of ADTrees are built on different bootstrapped samples of the training data using the bagging technique [2]. In dengue fever diagnosis, ADTree is used to identify influential clinical symptoms and laboratory features [40]. In astronomy, ADTree is employed to identify whether multi-wavelength data from optical, near-infrared, and X-ray bands belong to an active galactic nucleus (AGN) or a non-AGN [54].

Two different applications of the ADTree are discussed in the remainder of this section: semiconductor wafer defect cluster recognition [37] (Section 19.5.1) and computer vision based human detection [45] (Section 19.5.2).
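The bagging step mentioned above for BADTrees [2,24] can be sketched generically. The example below is only an illustration of bootstrap aggregation with majority voting: the base learner `train_stump` is a hypothetical one-dimensional threshold classifier standing in for an ADTree learner, and all names and data are made up for the example.

```python
import random
from collections import Counter


def train_stump(samples):
    """Hypothetical base learner (stand-in for an ADTree): a threshold placed
    halfway between the class means of 1-D inputs with labels -1 / +1."""
    neg = [x for x, label in samples if label == -1]
    pos = [x for x, label in samples if label == +1]
    if not neg or not pos:
        # Degenerate bootstrap sample containing a single class.
        only = +1 if pos else -1
        return lambda x: only
    mean_neg, mean_pos = sum(neg) / len(neg), sum(pos) / len(pos)
    threshold = (mean_neg + mean_pos) / 2.0
    upper = +1 if mean_pos > mean_neg else -1  # class predicted above the threshold
    return lambda x: upper if x > threshold else -upper


def bagging(samples, n_models, train, seed=0):
    """Train n_models base learners, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train(bootstrap))
    return models


def vote(models, x):
    """Majority vote over the predictions of all bagged models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Made-up, well-separated toy data: class -1 near 0..9, class +1 near 20..29.
data = [(float(v), -1) for v in range(10)] + [(float(v), +1) for v in range(20, 30)]
models = bagging(data, n_models=11, train=train_stump)
```

An odd number of models is used so that a two-class majority vote cannot tie.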
19.5.1 DEFECT CLUSTER RECOGNITION SYSTEM FOR SEMICONDUCTOR MANUFACTURING

In a collaborative industrial project with Freescale Semiconductor, the challenging task was to correctly identify different types of defect clusters formed on fabricated semiconductor wafers [37]. A large number of Integrated Circuits (ICs) are produced simultaneously on a wafer. Failed ICs often tend to form unique and systematic patterns (defect clusters) on the fabricated wafer. Recognizing the cluster types is particularly important for improving the manufacturing yield: it enables real-time statistical process control, and thus the identification and elimination of the root causes of production defects.

Besides accurate recognition, the comprehensibility of the implemented classifier is of the utmost importance. This is because the company must perform an additional thorough study on the cause and effect of applying the defect cluster recognition system to explain what leads to the IC chip rejections. Such a report is necessary for customer audits, particularly since some of the company's clients are from safety-critical industries. A black-box classifier does not provide explicit knowledge of how the defect clusters are recognized, making it difficult to study the cause of each classification and the effect of a wrongly classified case. Furthermore, it does not give any clue about the classification confidence margin, and therefore does not satisfy the company's six-sigma process improvement system.

A unique characteristic of this challenge is that the manufacturing test data could be physically and meaningfully interpreted as a two-dimensional "image." The implementation of ADTree in two dimensions was achieved by setting ρ = 2 for the ρ-controlled SADT and rLADT. The industry partner provided the necessary resources for testing and analysis. These included several million ICs from mainstream high-volume production lines. They were chosen from six different semiconductor device families of different implementation technologies and circuit complexities to ensure that the developed system was applicable across a range of manufactured products. Columns 1–5 in Table 19.6 show the selected devices and their characteristics. For the sake of confidentiality, the true names and functional descriptions of the ICs have intentionally been changed here to A–E. The rightmost column shows the accuracy of the system using the two-dimensional ADTree algorithm. The overall system achieved an accuracy of up to 95%, depending on the product type [37]. The set of interpretable rules is not provided here due to the confidentiality agreement. However, the rules were passed to the company for thorough analysis and verification by semiconductor test specialists in order to qualify the system for implementation on the production floor. The success of the study shows the value of ADTree as a set of interpretable decision rules for specific problems, as compared to other "black-box" type classifiers.
Table 19.6 Devices Used for Experimental Trials
Device | Technology, Half-Pitch Size (nm) | Metallisation Layers | ICs per Wafer | Total Number of ICs | Accuracy of Defect Recognition (%)
A | 250 | 3 | 984 | 2,460,000 | 90.3
B | 250 | 3 | 794 | 1,985,000 | 94.7
C | 130 | 6 | 402 | 1,105,500 | 81.5
D | 90 | 7 | 384 | 960,000 | 95.9
E | 250 | 3 | 235 | 587,500 | 84.1
19.5.2 DIMENSION REDUCTION TOOL IN HUMAN DETECTION SYSTEM

ADTree can be applied as a feature subset selection tool [15]. Here, ADTree was used to reduce the hardware requirements, in terms of memory size, of the Histogram of Oriented Gradients-Support Vector Machine (HOG-SVM) [12] human detection system, which locates humans in images or videos (streams of images). More often than not, the difference in resource requirements can mean the difference between a full-fledged general-purpose CPU and a low-end (and low-cost) field programmable gate array (FPGA).

In this study, the extracted HOG feature had 540 dimensions. HOG features form a better representation of an image than the pixel values themselves. Shape is widely used as a reliable cue for detecting a human because it is invariant to illumination [29]. The potency of HOG lies in the way it uses gradient magnitudes and orientations to construct feature vectors that capture detailed and important shape information. Together with a linear SVM, HOG forms a reliable human detection system, as described in the literature. However, SVM requires the storage of a subset of training samples, known as support vectors, for classification. Since each original support vector had 540 dimensions, the aim was to reduce the dimension of the support vectors through ADTree for two reasons: (1) to reduce the number of parameters for lower memory requirements, and (2) to reduce the number of floating point operations to fit into FPGA-based hardware.

ADTree was implemented by selecting one of the 540 features at a time to form a decision rule. This was achieved by setting ρ = 1 for the ρ-controlled SADT and rLADT, through which eight salient features were discovered. The univariate nature of each decision node revealed the most discriminative feature in forming the decision rule. This "knowledge discovery" allowed users to employ lower-dimensional support vectors for the SVM, which led to a faster and cheaper hardware implementation of the human detection algorithm. The experiment showed that the "trimmed" HOG-SVM system with eight features achieved an average accuracy of 98.99 ± 0.44%, an insignificant 0.8% drop from the "complete" HOG-SVM system with 540 features [45]. Most importantly, the average number of parameters (9 versus 541) and floating point operations (16 versus 1080) were much smaller than for the "complete" HOG-SVM system.

This example demonstrates an important option for the many applications in which a lower implementation cost and a higher classification speed are more important than a marginal improvement in classification accuracy. It also highlights the strength of ADTree as a set of interpretable decision rules that leads to savings in computational and memory requirements.
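The parameter and operation counts quoted above are consistent with one simple reading, assumed here rather than taken from [45]: a linear SVM evaluated as a single weight vector plus a bias, f(x) = w·x + b. Under that assumption a d-dimensional input needs d + 1 parameters and 2d floating point operations (d multiplications and d additions). The function names below are illustrative.

```python
def linear_svm_cost(n_features):
    """Parameter and floating point operation counts of f(x) = w.x + b
    for an n_features-dimensional input (assumed cost model)."""
    n_parameters = n_features + 1   # one weight per feature plus the bias
    n_flops = 2 * n_features        # n multiplications and n additions
    return n_parameters, n_flops


def decision_value(w, b, x):
    """Linear SVM decision value; the sign gives the predicted class."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


trimmed = linear_svm_cost(8)      # eight features selected through ADTree
complete = linear_svm_cost(540)   # full HOG descriptor
```

Under this cost model the trimmed system gives (9, 16) and the complete system gives (541, 1080), matching the figures in the text.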
19.6 CONCLUSIONS AND FUTURE PERSPECTIVES

As a generalization of decision trees, voted decision trees, and voted decision stumps, ADTree deserves significant attention. Its underlying boosting implementation is an important advantage, as different kinds of ADTree can be grown easily to suit a wide range of applications. Some of the most powerful variants have been presented above, allowing the user to build non-parametric decision trees that are equipped with boosting and regularization techniques to better match the complexity of given data sets. Using these algorithms, users can build ADTree models for the following types of domain problems:
1. Data sets with few highly discriminative features;
2. Data sets with correlated features;
3. Data sets that require multiple models.

There is a wide range of boosting algorithms that differ in their theoretical foundation, inductive bias, loss function, optimization approach, etc., to handle many types of real-world problems in areas such as data mining, computational biology, image processing, and finance. It is possible to design various ADTrees with unique attributes using different boosting implementations, in addition to the incorporation of regularization reported in this work. This is an advantage over classical decision trees, which often require a new learning mechanism to achieve the required properties. In future research, a more powerful class of learning algorithms that bridges boosting, decision tree, and regularization techniques can be created. Some of these ideas are outlined below.
19.6.1 COST-SENSITIVE ADTREE

In the work by Cieslak and Chawla [8], the split criterion was modified in order to deal with imbalanced data sets. Fraud detection is a common imbalanced problem, and the work in [42] develops a cost-sensitive decision tree to handle it. By changing the underlying boosting algorithm to cost-sensitive boosting [36], a new variant of ADTree can be designed to handle imbalanced data sets.

19.6.2 CREDAL ADTREE

Credal-C4.5 [34] is an extension of C4.5 that is designed to address noisy data by using a new splitting criterion called the Imprecise Information Gain Ratio. This splitting criterion can be incorporated within the LogitBoost algorithm with errors-in-variables [44] to build a new Credal ADTree.
19.6.3 REGRESSION ADTREE

All the variants of ADTree presented in this study are for classification purposes. By implementing the least-squares boosting algorithm, L2Boost [5], a novel regression ADTree can be designed.
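To make the least-squares boosting idea concrete, the following sketch shows only the boosting loop: each round fits a base learner to the current residuals and adds a shrunken copy of it to the model. It is not a regression ADTree. The base learner (a regression stump on one-dimensional inputs), the shrinkage value, and all names are assumptions made for illustration, and the inputs must contain at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs.
    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def l2_boost(xs, ys, n_rounds=50, shrinkage=0.5):
    """Boosting with the squared-error loss: repeatedly fit the residuals."""
    offset = sum(ys) / len(ys)      # start from the mean response
    preds = [offset] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        lv, rv = shrinkage * lv, shrinkage * rv  # store the shrunken stump
        stumps.append((t, lv, rv))
        preds = [p + (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return offset, stumps


def predict(model, x):
    offset, stumps = model
    return offset + sum(lv if x <= t else rv for t, lv, rv in stumps)


def mse(pred_fn, xs, ys):
    return sum((pred_fn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up toy regression problem: y = x^2 on ten points.
xs = [float(v) for v in range(10)]
ys = [x * x for x in xs]
model = l2_boost(xs, ys)
baseline_mse = mse(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mse(lambda x: predict(model, x), xs, ys)
```

In an ADTree setting, each fitted base learner would correspond to a decision node with two prediction values, which is what makes this loss a natural candidate for a regression variant.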
19.6.4 GROUP LASSO ADTREE

Mixed input feature measurements are common in real-life practice. By using Group Lasso [53], it is possible to extend the current research to deal with both categorical and real-valued features simultaneously. In situations where there are known groups among the features, applying Group Lasso achieves sparsity among non-overlapping groups of features.
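The group-wise sparsity mentioned above can be illustrated with a small sketch of the Group Lasso penalty, lam * sum over groups of sqrt(p_g) * ||beta_g||_2, and of the corresponding block soft-thresholding step, which sets a whole group of coefficients to zero at once. The function names, the grouping, and the numbers are illustrative assumptions, and the sketch is not tied to any ADTree implementation.

```python
import math


def group_lasso_penalty(beta, groups, lam):
    """lam * sum_g sqrt(p_g) * ||beta_g||_2 over non-overlapping index groups."""
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total


def group_soft_threshold(beta, groups, lam):
    """Block soft-thresholding: shrink each group as a unit; a group whose
    norm is below lam * sqrt(p_g) becomes exactly zero."""
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        threshold = lam * math.sqrt(len(idx))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out


# Two groups of two coefficients each (made-up values).
beta = [3.0, 4.0, 0.1, 0.1]
groups = [[0, 1], [2, 3]]
penalty = group_lasso_penalty(beta, groups, lam=1.0)
shrunk = group_soft_threshold(beta, groups, lam=1.0)
```

Here the second group is removed entirely while the first is only shrunk, which is the behaviour that would let a categorical feature encoded by several columns be kept or dropped as a whole.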
REFERENCES
[1] R.C. Barros, M.P. Basgalupp, A.C.P.L.F. de Carvalho, A.A. Freitas, A survey of evolutionary algorithms for decision tree induction, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 42 (2011) 291–312.
[2] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[3] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[4] L. Breiman, J.H. Friedman, R.A. Olshen, Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[5] P. Bühlmann, B. Yu, Boosting with the L2 loss: regression and classification, J. Am. Stat. Assoc. 98 (2003) 324–339.
[6] E. Candes, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Stat. 35 (2007) 2313–2351.
[7] Y. Chen, P. Du, Y. Wang, Variable selection in linear models, Wiley Interdiscip. Rev.: Comput. Stat. 6 (2014) 1–9.
[8] D. Cieslak, N. Chawla, Learning decision trees for unbalanced data, in: European Conference on Machine Learning, 2008, pp. 241–256.
[9] L. Clemmensen, T. Hastie, D. Witten, B. Ersbøll, Sparse discriminant analysis, Technometrics 53 (2011) 406–413.
[10] G. Creamer, Y. Freund, Learning a board balanced scorecard to improve corporate performance, Decis. Support Syst. 49 (2010) 365–385.
[11] G. Creamer, Y. Freund, Automated trading with boosting and expert weighting, Quant. Finance 10 (2010) 401–420.
[12] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[13] F. De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, in: 3rd International Conference on Machine Learning and Data Mining in Pattern Recognition, 2003, pp. 35–49.
[14] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[15] M. Drauschke, W. Förstner, Comparison of adaboost and adtboost for feature subset selection, in: 8th International Workshop on Pattern Recognition in Information Systems, 2008, pp. 113–122.
[16] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001) 1348–1360.
[17] R. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[18] A. Frank, A. Asuncion, UCI Machine Learning Repository [WWW Document], http://archive.ics.uci.edu/ml.
[19] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in: 16th International Conference on Machine Learning, 1999, pp. 124–133.
[20] Y. Freund, R. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139.
[21] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2000) 337–407.
[22] J. Friedman, T. Hastie, R. Tibshirani, Sparse inverse covariance estimation with the graphical Lasso, Biostatistics 9 (2007) 432–441.
[23] J. Friedman, T. Hastie, R. Tibshirani, A Note on the Group Lasso and a Sparse Group Lasso, Cornell University Library, 2010 [WWW Document], https://arxiv.org/pdf/1001.0736v1.pdf.
[24] R. Guy, P. Santago, C. Langefeld, Bootstrap aggregating of alternating decision trees to detect sets of SNPs that associate with disease, Genet. Epidemiol. 36 (2012) 99–106.
[25] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23 (1995) 73–102.
[26] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer-Verlag, New York, 2009.
[27] T. Hesterberg, N.H. Choi, L. Meier, C. Fraley, Least angle and ℓ1 penalized regression: a review, Stat. Surv. 2 (2008) 61–93.
[28] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, 1971.
[29] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 34 (2004) 334–352.
[30] J. Kozak, U. Boryczka, Multiple boosting in the ant colony decision forest meta-classifier, Knowl.-Based Syst. 75 (2015) 141–151.
[31] A. López-Chau, J. Cervantes, L. López-García, F.G. Lamont, Fisher's decision tree, Expert Syst. Appl. 40 (2013) 6283–6291.
[32] J. Lu, Boosting linear discriminant analysis for face recognition, IEEE Int. Conf. Image Proc. 1 (2003) 657–660.
[33] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, S.Z. Li, Ensemble-based discriminant learning with boosting for face recognition, IEEE Trans. Neural Netw. 17 (2006) 166–178.
[34] C.J. Mantas, J. Abellán, Credal-C4.5: decision tree based on imprecise probabilities to classify noisy data, Expert Syst. Appl. 41 (2014) 4625–4637.
[35] D. Masip, J. Vitrià, Boosted discriminant projections for nearest neighbor classification, Pattern Recognit. 39 (2006) 164–170.
[36] H. Masnadi-Shirazi, N. Vasconcelos, Cost-sensitive boosting, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 294–309.
[37] M.P.-L. Ooi, H.K. Sok, Y.C. Kuang, S. Demidenko, C. Chan, Defect cluster recognition system for fabricated semiconductor wafers, Eng. Appl. Artif. Intell. 26 (2013) 1029–1043.
[38] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1986) 81–106.
[39] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publisher, San Francisco, 1993.
[40] V.S.H. Rao, M.N. Kumar, A new intelligence-based approach for computer-aided diagnosis of dengue fever, IEEE Trans. Inf. Technol. Biomed. 16 (2012) 112–118.
[41] L. Rokach, O. Maimon, Top-down induction of decision trees classifiers - a survey, IEEE Trans. Syst. Man Cybern., Part C, Appl. Rev. 35 (2005) 476–487.
[42] Y. Sahin, S. Bulkan, E. Duman, A cost-sensitive decision tree approach for fraud detection, Expert Syst. Appl. 40 (2013) 5916–5923.
[43] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.
[44] J. Sexton, P. Laake, LogitBoost with errors-in-variables, Comput. Stat. Data Anal. 52 (2008) 2549–2559.
[45] H.K. Sok, M.S. Chowdhury, M.P.-L. Ooi, Y.C. Kuang, S. Demidenko, Using the ADTree for feature reduction through knowledge discovery, in: IEEE International Instrumentation and Measurement Technology Conference, 2013, pp. 1040–1044.
[46] H.K. Sok, M.P.-L. Ooi, Y.C. Kuang, Sparse alternating decision tree, Pattern Recognit. Lett. 60–61 (2015) 57–64.
[47] H.K. Sok, M.P.-L. Ooi, Y.C. Kuang, S. Demidenko, Multivariate alternating decision trees, Pattern Recognit. 50 (2016) 195–209.
[48] University of Eastern Finland, Spectral Color Research Group [WWW Document], https://www.uef.fi/spectral/spectral-database.
[49] L. Wang, H. Cheng, Z. Liu, C. Zhu, A robust elastic net approach for feature learning, J. Vis. Commun. Image Represent. 25 (2014) 313–321.
[50] D.M. Witten, A. Shojaie, F. Zhang, The cluster elastic net for high-dimensional regression with unknown variable grouping, Technometrics 56 (2014) 112–122.
[51] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1997) 67–82.
[52] L. Wu, Y. Yang, H. Liu, Nonnegative-lasso and application in index tracking, Comput. Stat. Data Anal. 70 (2014) 116–126.
[53] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. B 68 (2006) 49–67.
[54] Y. Zhang, H. Zheng, Y. Zhao, Preselecting AGN candidates from multi-wavelength data by ADTree, in: Proceedings of the International Astronomical Union, 2005, pp. 481–484.
[55] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. B 67 (2005) 301–320.
CHAPTER 20
SCENE UNDERSTANDING USING DEEP LEARNING

Farzad Husain∗,‡, Babette Dellen†, Carme Torras∗
∗Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain
†RheinAhrCampus der Hochschule Koblenz, Remagen, Germany
‡Catchoom, Barcelona, Spain
20.1 INTRODUCTION

Automation based on artificial intelligence becomes necessary when agents such as robots are deployed to perform complex tasks. A detailed representation of a scene makes robots more aware of their surroundings, thereby making it possible to accomplish different tasks in a successful and safe manner. Tasks that involve planning of actions and manipulation of objects require identification and localization of different surfaces in dynamic environments [1–3]. The use of depth sensing devices based on structured light has gained much attention in the past decade, because they are low-cost and capture data in the form of dense depth maps in addition to color images. Convolutional Neural Networks (CNNs) provide a robust way to extract useful information from the data acquired using these devices [4–7]. In this chapter we discuss the basic idea behind standard feedforward CNNs (Section 20.2) and their application in semantic segmentation (Section 20.3) and action recognition (Section 20.4). Further in-depth analysis and state-of-the-art solutions for these applications can be found in our recent publications [6] and [7].
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00020-X Copyright © 2017 Elsevier Inc. All rights reserved.

20.2 CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks are directed acyclic graphs. Such networks are capable of learning highly non-linear functions. A neuron is the most basic unit inside a CNN. Each layer inside a CNN is composed of several neurons. These neurons are hooked together so that the output of neurons at layer l becomes the input of neurons at layer l + 1, i.e.,

$$a^{(l+1)} = f\left(W^{(l)} a^{(l)} + b^{(l)}\right), \tag{20.1}$$

where $W^{(l)}$ is the weight matrix of layer l, $b^{(l)}$ is the bias term, and f is the activation function. The activation for layer l is denoted by $a^{(l)}$. Training a CNN requires learning W and b for each layer such that a cost function is minimized. Formally, given a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of m training examples, the weights W and bias b need to be determined that will minimize the cost, i.e., the difference between the desired output y and the actual output $h_{W,b}(x)$. The cost function for one training example is defined as:

$$J(W, b; x, y) = \frac{1}{2} \left\lVert h_{W,b}(x) - y \right\rVert^{2}, \tag{20.2}$$

where $h_{W,b}(x)$ gives the activations of the last layer. The minimization is done iteratively using a gradient descent approach, which involves computing the partial derivatives of the cost function with respect to the weights and updating the weights accordingly. One iteration of gradient descent updates the parameters W, b as:

$$W^{(l)} = W^{(l)} - \alpha \frac{\partial}{\partial W^{(l)}} J(W, b), \tag{20.3}$$

$$b^{(l)} = b^{(l)} - \alpha \frac{\partial}{\partial b^{(l)}} J(W, b), \tag{20.4}$$

where $\alpha$ is the learning rate.
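The update rule of Eqs. (20.3) and (20.4) can be illustrated with a minimal sketch for a single layer. The sigmoid activation, the toy input and target, and the learning rate of 0.5 are assumptions made for this example only; they are not taken from the chapter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, x):
    """Eq. (20.1) for one layer: a = f(W x + b), with f assumed to be the sigmoid."""
    return [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def cost(W, b, x, y):
    """Eq. (20.2): J = 1/2 * ||h(x) - y||^2 for one training example."""
    h = forward(W, b, x)
    return 0.5 * sum((h_i - y_i) ** 2 for h_i, y_i in zip(h, y))

def gradient_step(W, b, x, y, alpha):
    """One update of Eqs. (20.3)-(20.4) for a single sigmoid layer."""
    h = forward(W, b, x)
    # dJ/dz_i = (h_i - y_i) * f'(z_i), and f'(z) = h * (1 - h) for the sigmoid.
    delta = [(h_i - y_i) * h_i * (1.0 - h_i) for h_i, y_i in zip(h, y)]
    new_W = [[w_ij - alpha * d_i * x_j for w_ij, x_j in zip(row, x)]
             for row, d_i in zip(W, delta)]
    new_b = [b_i - alpha * d_i for b_i, d_i in zip(b, delta)]
    return new_W, new_b

# Hypothetical toy example: 3 inputs, 2 outputs.
x, y = [0.5, -1.0, 2.0], [1.0, 0.0]
W, b = [[0.1, 0.2, -0.1], [0.0, -0.3, 0.2]], [0.0, 0.0]
costs = [cost(W, b, x, y)]
for _ in range(100):
    W, b = gradient_step(W, b, x, y, alpha=0.5)
    costs.append(cost(W, b, x, y))
print(costs[0], costs[-1])
```

For a network with several layers, the same update applies per layer, with the partial derivatives obtained by backpropagation rather than by the closed-form expression used here.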
The backpropagation algorithm is used to compute the partial derivatives of the cost function. Fully connected layers have all the hidden units connected to all the input units. This increases the number of connections tremendously when dealing with high-dimensional data such as images. If we take the number of pixels in an image as its dimension, then connecting each input pixel to each neuron becomes computationally expensive. An image as small as 100 × 100 pixels would need $10^4 \times N$ connections at the input layer, where N is the number of neurons in the first layer. Convolutional layers make the connections sparse and share parameters across neurons. Compared to fully connected layers, convolutional layers have fewer parameters, so they are easier to train. This comes at the cost of a slight decrease in performance [8]. Commonly used CNNs for image recognition consist of several convolutional layers followed by a few fully connected layers [8,9]. Such networks are often termed deep networks.
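The difference in parameter count can be made concrete with a small calculation. This is only a sketch: the single input channel, the value N = 1000, and the 11 × 11 kernel (borrowed from the filter size quoted for Fig. 20.1) are assumptions for illustration, and bias terms are ignored.

```python
def fully_connected_weights(height, width, channels, n_neurons):
    """Every input value is connected to every neuron."""
    return height * width * channels * n_neurons

def conv_weights(kernel, in_channels, n_filters):
    """Each filter reuses one kernel x kernel x in_channels weight set at all positions."""
    return kernel * kernel * in_channels * n_filters

N = 1000  # hypothetical number of first-layer neurons (or filters)
fc = fully_connected_weights(100, 100, 1, N)  # 10^4 * N weights
conv = conv_weights(11, 1, N)                 # 121 * N weights
print(fc, conv)
```

Under these assumptions the fully connected layer needs roughly 80 times more weights than the convolutional layer with the same number of output units or filters.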
20.3 SEMANTIC LABELING
Dense semantic labeling of a scene requires assigning a label to each pixel in an image. The label must represent some semantic class. Such labeling is also referred to as object class segmentation because it divides the image into smaller segments, where each segment represents a particular class. Semantic labeling is challenging because naturally occurring indoor and outdoor scenes are highly unconstrained, leaving little room for discovering patterns and structures. The semantic classes can be abstract, such as “furniture,” or more descriptive, such as “table,” “chair,” etc. The more descriptive the labeling we aim for, the harder the task becomes. Convolutional Neural Networks provide a robust way to learn semantic classes. A CNN architecture used for semantic labeling typically consists of convolution and pooling layers only [5,6]. The number of channels in the last layer is equal to the number of object classes that we want to learn. Fig. 20.1 shows a basic example of a deep network architecture used for semantic segmentation.

FIGURE 20.1 A model architecture for pixelwise semantic labeling. The network consists of four convolutional layers and two max pooling layers. The output Layer 4 has the number of channels equal to the number of class labels that need to be learned. The filter sizes in each layer have been set to (11 × 11). Finally, the output feature maps obtained are upsampled to be of the same size as the input image.

Table 20.1 Individual Classes of NYU v2 (Four Classes) and Overall Average. Accuracy (%)
| Method | Floor | Struct | Furniture | Prop | Class Average | Pixel Average |
| Couprie et al. [18] | 87.3 | 86.1 | 45.3 | 35.5 | 64.5 | 63.5 |
| Khan et al. [19] | 87.1 | 88.2 | 54.7 | 32.6 | 69.2 | 65.6 |
| Stückler et al. [20] | 90.7 | 81.4 | 68.1 | 19.8 | 70.9 | 67.0 |
| Müller and Behnke [21] | 94.9 | 78.9 | 71.1 | 42.7 | 72.3 | 71.9 |
| Wolf et al. [22] | 96.8 | 77.0 | 70.8 | 45.7 | 72.6 | 74.1 |
| Eigen and Fergus [4] (AlexNet) | 93.9 | 87.9 | 79.7 | 55.1 | 79.1 | 80.6 |
| Husain et al. [6] | 95.0 | 81.9 | 72.8 | 67.2 | 79.2 | 78.0 |

A CNN is usually trained to minimize a multiclass cross entropy loss function [4]. Formally, given an image X of a scene, the objective is to obtain a label $\hat{y}_p \in C$ for each pixel location $x_p \in X$ that corresponds to the object class at the pixel location. The loss function L can now be written as:

$$L = -\sum_{i \in X} \sum_{b \in C} c_{i,b} \ln(\hat{c}_{i,b}),$$

where $\hat{c}_{i,\cdot}$ is the predicted class distribution at location i, and $c_{i,\cdot}$ is the respective ground truth class distribution.
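The pixelwise cross entropy loss above can be written out directly as a double sum. The following is a minimal sketch, assuming each pixel carries a list of class probabilities; the three-pixel example and the small constant `eps` (which only guards against taking the logarithm of zero) are illustrative choices, not part of the formula.

```python
import math

def pixelwise_cross_entropy(truth, predicted, eps=1e-12):
    """L = -sum_i sum_b c[i][b] * ln(c_hat[i][b]).

    truth, predicted: one class distribution (list of floats) per pixel.
    """
    loss = 0.0
    for c_i, c_hat_i in zip(truth, predicted):
        for c, c_hat in zip(c_i, c_hat_i):
            if c > 0.0:  # terms with c = 0 contribute nothing
                loss -= c * math.log(max(c_hat, eps))
    return loss

# Hypothetical 3-pixel image with 4 classes and one-hot ground truth.
truth = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
uniform = [[0.25] * 4 for _ in truth]
confident = [[0.97, 0.01, 0.01, 0.01],
             [0.01, 0.97, 0.01, 0.01],
             [0.01, 0.01, 0.01, 0.97]]
loss_uniform = pixelwise_cross_entropy(truth, uniform)
loss_confident = pixelwise_cross_entropy(truth, confident)
print(loss_uniform, loss_confident)
```

With one-hot ground truth, a uniform prediction over four classes costs ln 4 per pixel, while a prediction that puts most of its mass on the correct class gives a much smaller loss.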
20.3.1 RELATED RESEARCH
Several improvements have been proposed for learning rich features from color images. One approach is to use image region proposals for training CNNs [10]. Another is to explore contextual information between different image segments [11]. Classification of superpixels at multiple scales has also been investigated [12]. Another possibility is to train a network end-to-end by attaching a sequence of deconvolution and unpooling layers [13]. Recently, joint training of a decoupled deep network for segmentation and image classification was shown to improve semantic segmentation results [14].
FIGURE 20.2 Some examples of semantic labeling, (A) color image, (B) ground truth labeling, (C) distance-from-wall, (D) predicted labels without distance-from-wall, and (E) predicted labels with distance-from-wall. White color in Figs. (B), (D), and (E) represents the unknown label. Figure reproduced from Husain et al. [6].
Different ideas for semantic labeling have been proposed that also utilize the depth information in RGB-D images. A depth normalization scheme in which the furthest point is assigned a relative depth of one is proposed in [15]. Using height above the ground plane as an additional feature was investigated in [16,17]. A bounding hull heuristic to exploit indoor properties was proposed in [15]. In our recent study [6], we proposed a novel feature, distance-from-wall, which highlights objects that are usually found in close proximity to the walls detected in indoor scenes.
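One simple reading of the depth normalization mentioned above is to divide every depth by the largest depth in the frame, so that the furthest point gets a relative depth of one. The sketch below follows that reading only; the exact procedure of [15] may differ, and treating a value of 0 as a missing reading is an assumption about the sensor output.

```python
def normalize_depth(depth_map):
    """Scale depths so that the furthest valid point has a relative depth of 1.

    depth_map: list of rows of depths in metres; 0 marks a missing reading
    (an assumption made for this sketch).
    """
    furthest = max(d for row in depth_map for d in row)
    if furthest <= 0:
        raise ValueError("depth map has no valid readings")
    return [[d / furthest for d in row] for row in depth_map]

depth = [[1.2, 2.4, 0.0], [3.6, 4.8, 2.4]]  # hypothetical readings
relative = normalize_depth(depth)
print(relative)
```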
FIGURE 20.3 Illustration of a CNN network used for recognizing actions. Features from each frame are extracted using a CNN and averaged. K is the number of action categories. The final feature vector gives a probability for each action.

Table 20.2 Average Accuracy on the UCF-101 Data Set (3-Fold)
| Algorithm | Accuracy |
| CNN with transfer learning [29] | 65.4% |
| LRCN (RGB) [36] | 71.1% |
| Spatial stream ConvNet [28] | 72.6% |
| LSTM composite model [37] | 75.8% |
| C3D (1 net) [30] | 82.3% |
| Temporal stream ConvNet [28] | 83.7% |
| C3D (3 nets) [30] | 85.2% |
| Combined ordered and improved trajectories [38] | 85.4% |
| Stacking classifiers and CRF smoothing [39] | 85.7% |
| Improved dense trajectories [40] | 85.9% |
| Improved dense trajectories with human detection [41] | 86.0% |
| 2D followed by 3D convolutions [7] | 86.7% |
| Spatial and temporal stream fusion [28] | 88.0% |
Commonly used data sets for benchmarking image segmentation approaches include the PASCAL Visual Object Classes data set [23] and, for RGB-D data, the NYU-v2 data set [24] and the SUN RGB-D data set [25]. Table 20.1 shows some state-of-the-art results for the NYU-v2 data set for four semantic classes as defined by Silberman and Fergus [24]. These classes are defined according to the physical role they play in the scene: “floor”; “structures” such as walls, ceilings, and columns; “furniture” such as tables, dressers, and counters; and “props,” which are easily movable objects. Fig. 20.2 shows some examples of semantic labeling results achieved by Husain et al. [6].
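Table 20.1 reports both a class average and a pixel average. The chapter does not define them, so the sketch below assumes the usual definitions: the class average is the mean of the per-class accuracies, and the pixel average is the fraction of all labeled pixels that are classified correctly. The confusion counts are hypothetical.

```python
def class_and_pixel_accuracy(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p."""
    per_class = []
    correct = total = 0
    for t, row in enumerate(confusion):
        n = sum(row)
        correct += row[t]
        total += n
        if n > 0:  # skip classes that do not occur in the ground truth
            per_class.append(row[t] / n)
    class_average = sum(per_class) / len(per_class)
    pixel_average = correct / total
    return class_average, pixel_average

# Hypothetical pixel counts for floor, structure, furniture, prop.
confusion = [
    [90, 5, 5, 0],
    [10, 170, 15, 5],
    [0, 20, 70, 10],
    [0, 10, 20, 20],
]
class_avg, pixel_avg = class_and_pixel_accuracy(confusion)
print(class_avg, pixel_avg)
```

The two numbers differ whenever the classes are unbalanced: a rare class such as “prop” weighs as much as “floor” in the class average but contributes few pixels to the pixel average.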
20.4 ACTION RECOGNITION
Recognizing human actions from videos is of central importance in understanding dynamic scenes. Recognition is typically performed by processing a video containing a particular action and predicting a label as the output. Action recognition is a challenging task because similar actions can be performed at different speeds and recorded from different viewpoints, under different lighting conditions, and against different backgrounds. Convolutional Neural Networks provide a way to recognize actions from videos. The most basic approach using CNNs involves treating each frame of the video as an image and predicting the action for each frame, followed by averaging over all the predictions. Fig. 20.3 shows a basic action recognition pipeline using a CNN. It has been shown in the past that a CNN model trained on one data set can be transferred to other visual recognition tasks [26,27]. We also see this transfer learning technique being applied successfully for recognizing actions. This is achieved by using a pretrained image recognition model for the individual frames of videos [7,28,29].

FIGURE 20.4 Some results for top-5 predicted action labels for the UCF-101 data set [34]. First row (green color) shows the ground-truth followed by predictions in decreasing level of confidence. Blue and red (dark gray and light gray in print version) show correct and incorrect predictions, respectively. The figure is taken from Husain et al. [7].

FIGURE 20.5 Confusion matrix for the action sequences in the HMDB data set [35] using the approach as described in one of our previous studies (Husain et al. [7]).
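The basic per-frame approach described in this section, predicting an action for each frame and averaging over all the predictions, can be sketched as follows. The per-frame probabilities stand in for the softmax output of a frame-level CNN; the labels and numbers are hypothetical.

```python
def average_frame_predictions(frame_probs):
    """Average per-frame class probabilities into one video-level distribution."""
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    return [sum(frame[k] for frame in frame_probs) / n_frames
            for k in range(n_classes)]

def predict_action(frame_probs, labels):
    """Return the label with the highest averaged probability."""
    video_probs = average_frame_predictions(frame_probs)
    best = max(range(len(labels)), key=lambda k: video_probs[k])
    return labels[best], video_probs

# Hypothetical outputs of a frame-level CNN for K = 3 action categories.
labels = ["archery", "biking", "diving"]
frame_probs = [
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
]
action, video_probs = predict_action(frame_probs, labels)
print(action, video_probs)
```

Averaging lets confident frames outvote an ambiguous one (the second frame here), but it discards the temporal order of the frames, which is what optical flow, 3D convolutions, and recurrent models try to recover.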
20.4.1 RELATED RESEARCH
Attempts have been made to make action recognition invariant to different kinds of situations. These include the use of optical flow as additional information [28] and the use of 3D (spatiotemporal) convolutional kernels [7,30]. Recurrent Neural Networks have also been explored to learn long-term dependencies in different types of actions [31]. Learning action representations in an unsupervised way has also been proposed [32]. This involved using Long Short-Term Memory (LSTM) networks for encoding videos and afterward reconstructing them. Recently, the concept of a dynamic image was proposed [33]. The dynamic image encodes the temporal evolution of a video and is used for the task of action recognition. In our recent study, we demonstrated how human action recognition can be achieved using the transfer learning technique coupled with a deep network comprising 3D convolutions [7]. Commonly used data sets for benchmarking different approaches include the UCF-101 data set [34], the HMDB data set [35], and the Sports 1M data set [29]. Table 20.2 shows some state-of-the-art results for the UCF-101 data set for 101 action classes. Fig. 20.4, reproduced from one of our previous studies [7], shows the top-5 predictions for selected sequences from the UCF-101 data set. It can be observed that the actions performed in visually similar environments are often predicted with a high probability. Consider, for example, Fig. 20.4(c6) vs. Fig. 20.4(b3). Fig. 20.5 shows the confusion matrix for the action sequences from the HMDB data set using our approach as described in [7]. It can be observed that similar actions such as “sword exercise” and “draw sword” have some degree of confusion.
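Both evaluation tools used in this section, top-5 predictions and the confusion matrix, are straightforward to compute from classifier outputs. The sketch below uses hypothetical labels and scores, and a top-2 list instead of top-5 because the toy example has only three classes.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[t][p] counts sequences of true class t predicted as class p."""
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def top_k(probs, classes, k=5):
    """Class labels sorted by decreasing confidence, truncated to k entries."""
    ranked = sorted(zip(probs, classes), reverse=True)
    return [name for _, name in ranked[:k]]

classes = ["draw sword", "sword exercise", "wave"]
true_labels = ["draw sword", "draw sword", "sword exercise",
               "sword exercise", "wave", "wave"]
predicted = ["draw sword", "sword exercise", "sword exercise",
             "draw sword", "wave", "wave"]
matrix = confusion_matrix(true_labels, predicted, classes)
best_two = top_k([0.1, 0.6, 0.3], classes, k=2)
print(matrix, best_two)
```

Off-diagonal entries of the matrix show which pairs of actions are confused with each other, as in the “sword exercise” versus “draw sword” case noted above.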
20.5 CONCLUSIONS
We introduced the basic idea behind Convolutional Neural Networks for the task of semantic labeling. We discussed different ways to further enhance the segmentation results by extracting additional features from the scene, such as distance-from-wall. Semantic labeling can serve as a useful prior for object discovery methods, as shown in one of our previous studies [42]. We also explained the basic approach for recognizing actions in videos using Convolutional Neural Networks and different ways to make it more robust.
REFERENCES [1] A. Dragan, N. Ratliff, S. Srinivasa, Manipulation planning with goal sets using constrained trajectory optimization, in: Int. Conf. on Robotics and Automation (ICRA), 2011, pp. 4582–4588. [2] D. Martínez, G. Alenyà, C. Torras, Planning robot manipulation to clean planar surfaces, Eng. Appl. Artif. Intell. 39 (2015) 23–32. [3] F. Husain, A. Colome, B. Dellen, G. Alenya, C. Torras, Realtime tracking and grasping of a moving object from range video, in: Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 2617–2622. [4] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: Int. Conf. on Computer Vision (ICCV), 2015, pp. 2650–2658. [5] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Conf. on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440. [6] F. Husain, H. Schulz, B. Dellen, C. Torras, S. Behnke, Combining semantic and geometric features for object class segmentation of indoor scenes, IEEE Robot. Autom. Lett. 2 (1) (2016) 49–55. [7] F. Husain, B. Dellen, C. Torras, Action recognition based on efficient deep feature learning in the spatio-temporal domain, IEEE Robot. Autom. Lett. 1 (2) (2016) 984–991. [8] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[9] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556. [10] J. Dai, K. He, J. Sun, BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: The IEEE International Conference on Computer Vision (ICCV), 2015. [11] G. Lin, C. Shen, A. van dan Hengel, I. Reid, Efficient piecewise training of deep structured models for semantic segmentation, in: Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. [12] C. Farabet, C. Couprie, L. Najman, Y. Lecun, Scene parsing with multiscale feature learning, purity trees, and optimal covers, in: Int. Conf. on Machine Learning (ICML), ACM, New York, NY, 2012, pp. 575–582. [13] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Int. Conf. on Computer Vision (ICCV), 2015. [14] S. Hong, H. Noh, B. Han, Decoupled deep neural network for semi-supervised semantic segmentation, in: Advances in Neural Information Processing Systems (NIPS), 2015, pp. 1495–1503. [15] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light sensor, in: ICCV Workshop on 3D Representation and Recognition, 2011. [16] H. Schulz, N. Höft, S. Behnke, Depth and height aware semantic RGB-D perception with convolutional neural networks, in: Eur. Conf. on Neural Networks (ESANN), 2015. [17] S. Gupta, R. Girshick, P. Arbelaez, J. Malik, Learning rich features from RGB-D images for object detection and segmentation, in: Eur. Conf. on Computer Vision (ECCV), vol. 8695, 2014, pp. 345–360. [18] C. Couprie, C. Farabet, L. Najman, Y. LeCun, Indoor semantic segmentation using depth information, in: Int. Conf. on Learning Representations (ICLR), 2013, pp. 1–8. [19] S. Khan, M. Bennamoun, F. Sohel, R. Togneri, Geometry driven semantic labeling of indoor scenes, in: Eur. Conf. on Computer Vision (ECCV), 2014, pp. 679–694. [20] J. Stückler, B. Waldvogel, H. Schulz, S. 
Behnke, Dense real-time mapping of object-class semantics from RGB-D video, J. Real-Time Image Process. (2015) 599–609. [21] A. Müller, S. Behnke, Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images, in: Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 6232–6237. [22] D. Wolf, J. Prankl, M. Vincze, Fast semantic segmentation of 3D point clouds using a dense CRF with learned parameters, in: Int. Conf. on Robotics and Automation (ICRA), 2015, pp. 4867–4873. [23] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338. [24] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: Eur. Conf. on Computer Vision (ECCV), 2012, pp. 746–760. [25] S. Song, S. Lichtenberg, J. Xiao, SUN RGB-D: a RGB-D scene understanding benchmark suite, in: Conf. on Computer Vision and Pattern Recognition (CVPR), 2015. [26] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: CVPR, Columbus, OH, 2014. [27] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: a deep convolutional activation feature for generic visual recognition, in: ICML, 2014. [28] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: NIPS, Curran Associates, Inc., 2014, pp. 568–576. [29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: CVPR, 2014, pp. 1725–1732. [30] D. Tran, L.D. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: ICCV, 2015, pp. 4489–4497. [31] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. 
Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Conf. on Computer Vision and Pattern Recognition (CVPR), 2015. [32] N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, in: Int. Conf. on Machine Learning (ICML), 2015. [33] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, S. Gould, Dynamic image networks for action recognition, in: Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. [34] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, arXiv: 1212.0402.
[35] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in: ICCV, 2011, pp. 2556–2563. [36] J. Donahue, L. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, T. Darrell, K. Saenko, Long-term recurrent convolutional networks for visual recognition and description, in: CVPR, 2015, pp. 2625–2634. [37] N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using LSTMs, arXiv: 1502.04681. [38] O.R. Murthy, R. Goecke, Combined ordered and improved trajectories for large scale human action recognition, in: ICCV Workshop on Action Recognition with a Large Number of Classes, 2013. [39] S. Karaman, L. Seidenari, A. Bagdanov, A. Bimbo, L1-regularized logistic regression stacking and transductive CRF smoothing for action recognition in video, in: ICCV Workshop on Action Recognition with a Large Number of Classes, 2013. [40] H. Wang, C. Schmid, Action recognition with improved trajectories, in: ICCV, 2013, pp. 3551–3558. [41] H. Wang, D. Oneata, J. Verbeek, C. Schmid, A robust and efficient video representation for action recognition, Int. J. Comput. Vis. (IJCV) 3 (2015) 219–338. [42] G. Martín García, F. Husain, H. Schulz, S. Frintrop, C. Torras, S. Behnke, Semantic segmentation priors for object discovery, in: Int. Conf. on Pattern Recognition (ICPR), 2016, pp. 549–554.
CHAPTER 21
DEEP LEARNING FOR CORAL CLASSIFICATION
Ammar Mahmood∗, Mohammed Bennamoun∗, Senjian An∗, Ferdous Sohel†, Farid Boussaid∗, Renae Hovey∗, Gary Kendrick∗, Robert B. Fisher‡
∗ The University of Western Australia, Crawley, WA, Australia
† Murdoch University, Murdoch, WA, Australia
‡ University of Edinburgh, Edinburgh, United Kingdom
21.1 INTRODUCTION
Coral reefs are a vital part of marine ecosystems. They provide a nutrient-rich habitat and a safe shelter for many marine organisms. They are a rich source of nitrogen and other essential nutrients for benthic species. They also play an essential part in the recycling of nutrients and in protecting coastlines from the devastating effects of waves and sea storms. Coral reefs help sustain a growing fishing industry, since many fish and other species are found close to the reefs. Shallow sea coral reefs such as the Great Barrier Reef of Australia also benefit the tourism industry. Marine scientists have reported a worldwide decline in the coral population. According to a 2011 study, 19% of the coral reefs have been lost and 75% are now threatened [1]. Global warming, urbanization, human population growth, heavy use of the sea for shipping, exploration for minerals, recreational uses such as boating, and industrial trade and activities have had a huge impact on coral reefs, both positive and negative [2]. Increased water temperatures are responsible for the bleaching and death of corals [3,4]. This has resulted in a rapid decline in our planet’s marine biodiversity [5]. In order to minimize the negative impact of these activities on the sea, marine ecosystems need to be monitored regularly. That is where underwater optical imaging comes to the rescue. Long-term monitoring of large areas, remote sensing, and tracking of marine species and their associated habitats are now standard requirements in most management strategies. As a result, the automatic annotation of collected marine data is now at the forefront of management applications and thus a research priority [6]. With the development of underwater optical imaging techniques, standard protocols can be developed for analyzing and curtailing the negative impacts on the sustainability of the marine environment.
Additionally, the exponential increase in the use of digital cameras and video creates a need for storage and automated analysis of such data. Marine scientists have a massive amount of imagery of coral reefs that is yet to be annotated. Monitoring systems like the Integrated Marine Observing System (IMOS) collect millions of images of coral reefs around Australia every year. However, only a small percentage of these images, typically less than 5%, get analyzed by marine experts. Moreover, manual annotation is a tediously repetitive task and demands large human resources and time. Automated technologies to monitor marine ecosystems are crucial for continuous monitoring without feedback from human experts. With these stated facts in mind, automatic annotation of underwater images can achieve visually comprehensible results. The proposed results could help in curbing a global threat. The extent of the usefulness of this research can be seen from the attention this research field is getting. In this chapter, we aim to address the challenges associated with the automatic analysis of marine data and explore the applications of deep learning for automatic annotation of coral reef images. The rest of this chapter is organized as follows. Section 21.2 presents existing methods for the manual annotation of corals used by marine scientists and discusses the challenges involved with coral reef classification. Section 21.3 highlights previous coral classification research. Section 21.4 presents a brief introduction to deep learning and the state of the art in deep networks. Section 21.5 summarizes current coral classification studies with deep learning. Finally, Section 21.6 outlines the future prospects and applications of deep learning for deep sea image analysis, and Section 21.7 concludes this chapter.

Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00021-1 Copyright © 2017 Elsevier Inc. All rights reserved.
21.2 ANNOTATION OF CORAL REEFS: METHODS AND CHALLENGES
Coral reefs are one of the most diverse and economically important ecosystems of our planet. There are three main types of coral reefs: fringing, barrier, and atoll. Fringing reefs (e.g., reefs off the coast of Eilat, Israel) are found closer to the shores and they often form borders around islands and coastlines. Barrier reefs (e.g., The Great Barrier Reef, Australia) are separated from coastlines by water. This results in a lagoon of water between the shore and the reef itself. Atoll reefs (e.g., Lighthouse Atoll Reefs, Belize) are found deep below the sea level and are usually circular or oval in shape. Three coral reef examples are shown in Fig. 21.1A–C. The main goal of long-term monitoring of coral reefs is to investigate how the reefs are changing over time due to the phenomenon of coral bleaching (shown in Fig. 21.1D). This investigation is done on local and global scales. Coral reef data generally consist of the following:
• Site survey (e.g., information about location, depth, water temperatures, and turbidity)
• Coral species survey (e.g., hard corals, soft corals, bleached corals, and dead corals)
• Substrate survey (non-coral species: e.g., macroalgae, sponges, sand, rock, and rubble)
Corals have a large number of subspecies. The Great Barrier Reef in Australia alone has more than 600 subspecies. They are a diverse group and are found in a variety of sizes, shapes, and colors. The two main categories of corals are hard corals and soft corals. Hard corals have a limestone skeleton, whereas soft corals are flexible and are often mistaken for plants due to the lack of a skeleton. Hard corals are the best indicator of the health of any coral reef. Their percentage cover is the most commonly used parameter to quantify the coral reef population. Adverse environmental effects such as pollution and increased temperature at the sea floor result in the bleaching of healthy corals and eventually their death.
FIGURE 21.1 (A) Fringing reef off the coast of Eilat, Israel. (B) The Great Barrier Reef, Australia. (C) Lighthouse Atoll Reef, Belize. (D) Coral Bleaching: Healthy corals on left and bleached corals on right.

21.2.1 METHODS FOR CONVENTIONAL ANNOTATION
Underwater imaging platforms such as autonomous underwater vehicles (AUVs) have tremendously increased the amount of marine data that are available for analysis. However, the process of manually annotating these data is cumbersome and inefficient [7]. In practice, marine scientists usually adopt random point annotation, whereby a predefined number of random points, as few as 20 or as many as 200, are selected on each image. Afterwards, a marine expert assigns a label to each of the individual points, as shown in Fig. 21.2. A single image can take up to 30 minutes to annotate fully. Repeating the same procedure for millions of images is obviously a tedious and challenging task because the class boundaries are ambiguous and difficult to define in terms of color, shape, or texture [7]. This annotation scheme is often facilitated by software such as Coral Point Count with Excel extensions (CPCe) [9]. It is free software developed by the National Coral Reef Institute (NCRI) for experts and researchers working in the management and monitoring of coral reefs. CPCe overlays a given image with a predefined set of random pixels. A marine expert then assigns a class label to these random pixels. Furthermore, water turbidity and underwater illumination render the images difficult to analyze [8]. Also, coral reef images consist of an assemblage of corals and non-corals of irregular shapes and sizes. Manually labeling full segmentation ground truth for every image is exhausting and time consuming. Bounding box annotations are prone to leave out key details. Assigning one label per image hinders subspecies classification. As a result, well-known labeling techniques such as bounding boxes, boundary segmentation, and whole-image labeling are impractical. Random point sampling and labeling is the least cumbersome and most efficient of these techniques.
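The random point annotation scheme, and the percentage cover derived from it, can be sketched in a few lines. This is an illustration of the general idea only, not of how CPCe is implemented; the image size, the number of points, and the expert labels are hypothetical.

```python
import random
from collections import Counter

def sample_points(width, height, n_points, seed=None):
    """Draw n_points distinct random pixel locations (x, y) on an image."""
    rng = random.Random(seed)
    flat = rng.sample(range(width * height), n_points)
    return [(i % width, i // width) for i in flat]

def percent_cover(point_labels):
    """Percentage of annotated points assigned to each class."""
    counts = Counter(point_labels)
    total = len(point_labels)
    return {label: 100.0 * n / total for label, n in counts.items()}

points = sample_points(width=1360, height=1024, n_points=50, seed=0)
# Hypothetical expert labels, one per sampled point.
labels = (["hard coral"] * 20 + ["soft coral"] * 5
          + ["macroalgae"] * 15 + ["sand"] * 10)
cover = percent_cover(labels)
print(cover)
```

The percentage cover of hard corals estimated this way is only as precise as the number of points allows: with 50 points per image, each point accounts for 2% of the estimate.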
FIGURE 21.2 Sample coral reef image from Benthoz15 data set illustrating random point sampling annotation method.

21.2.2 CHALLENGES
Sea floor exploration and imaging have provided us with a great opportunity to look into the vast and complex marine ecosystem. Data acquisition from the sea bed is vital for the scientific understanding of these intricate ecosystems, but is often hampered by logistical constraints associated with working underwater. Advanced underwater cameras and an increasing interest in exploring underwater environments have created the need for improvements in imaging techniques. Seabed observations, archaeology, marine geology, marine biology, and biodiversity studies are mainly conducted by optical imaging [1,10,11]. Digital images of the sea floor are now commonly collected with the help of Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) [12]. To make the marine data useful for analysis, there has to be an accurate automatic annotation system instead of the reliance on manual labeling. Any system that works in theory has to face specific challenges in a real underwater environment. In the same manner, when a sufficiently large number of training images are available, the methods derived for real-world object classification can be used to analyze structured textures and objects, but they fall short in a real underwater world. In order to achieve higher classification accuracy, many issues have to be addressed, such as blurring, scattering, sun flicker, and color attenuation. Therefore, automatic annotation for underwater scene classification is a difficult and challenging topic. Underwater digital imaging and automatic species classification is an extremely challenging task. Training data sets are created on the basis of underwater classes that are different in terms of shape, color, texture, size, rotation, illumination, view angle, camera distance, and light conditions. These challenges mainly include:
• significant intra-class and inter-site diversity of the acquired images
• complex and ambiguous spatial borders between classes
• manual annotation that varies from expert to expert
• variations in the spatial and spectral resolution limits, view points, and image quality of the cameras
• partial or complete occlusion of objects of interest
• gradual changes in the structures of the marine seabed over longer periods of time
• lighting artefacts due to refraction from waves and variable depth-dependent optical properties
• variable water turbidity, color distortions, and inadequate illumination conditions.
FIGURE 21.3 Sample marine images from the Western Australian seabed under different illumination conditions and color distortion.
Four marine images are shown in Fig. 21.3 to illustrate some of these challenges. These pictures were captured at the same sites but under different illumination conditions. They also portray a significant color distortion. In the next section, we will explore the previous work done for coral classification.
21.3 AUTOMATIC CORAL CLASSIFICATION
21.3.1 CORAL CLASSIFICATION WITH HAND-CRAFTED FEATURES
Color and texture are the key discriminating factors for classifying corals. Hence, researchers have extensively studied the extraction of color and texture based hand-crafted features for image representation. Features that encode shape information are less suitable because corals have arbitrary shapes and the class boundaries are unclear. Usually a combination of color and texture based features is preferred. No single combination of features is expected to work for every coral data set. The features are often selected based on the discriminating characteristics of the corals and non-corals present in a given data set. In this section, we highlight some of the prominent studies on coral image classification with hand-crafted features.
• Marcos [13] used Normalized Chromaticity Coordinates (NCC) for color and Local Binary Patterns (LBP) for texture. A 3-layer feed-forward back-propagation neural network was used to classify five classes: living corals, dead corals, corals with algae, abiotics, and algae. It was proposed that NCC color features are invariant to illumination conditions and that LBP is robust to brightness changes. However, the NCC and LBP features are not discriminative enough for complex underwater images. This method was then tested on the three coral classes with only 300 images. A higher performance was reported for the combination of LBP with hue information than for the combination of LBP with NCC.
• Stokes and Deane [14] used normalized color histograms as color descriptors and a discrete cosine transform (DCT) based feature vector for texture. Their training set consisted of 3000 images and 18 distinct classes. A novel approach titled “probability density weighted mean distance (PDWMD)” was proposed for classification. This method is fast and easy to implement. However, the weights of the color and texture features are set manually during feature extraction. Also, DCT descriptors are neither very robust nor very accurate as texture descriptors.
• Pizarro [15] used a feature vector based on an NCC histogram, a bag of words (BoW) over Scale-Invariant Feature Transform (SIFT) descriptors, and hue histograms. A subset of training samples is used to construct the BoW vocabulary, and each test image is then described in terms of this vocabulary of words. Classification was performed by voting for the best matches. In their method, each image is assigned to one of eight classes. A total of 453 images were used for training and vocabulary learning. This annotation method does not perform well for pixel annotations and is prone to leaving out key details. Sub-image level classification is not addressed in this work. A BoW over SIFT features is not an efficient way to describe texture in complex underwater conditions.
• Beijbom [8] introduced the Moorea Labeled Coral (MLC) data set (with four non-coral and five coral classes) and used a Maximum Response (MR) filter bank followed by texton maps for feature extraction at multiple scales. A subset of training images was used along with k-means clustering to generate a texture dictionary. They also showed that preprocessing the images in the L*a*b* color space yields superior performance compared to RGB. They used an SVM classifier with a Radial Basis Function (RBF) kernel for classification. Coral images from three different years were automatically annotated to yield coral maps across the reef sites.
• In [16], a combination of hand-crafted features and multiple classifiers was analyzed to achieve the best classification accuracy on multiple benthic data sets. The descriptors used include Completed Local Binary Patterns (CLBP), the grey level co-occurrence matrix (GLCM), Gabor features, and opponent angle and hue channel color histograms. All the feature vectors used in this work were scale invariant and robust to color distortion and low illumination. Support vector machines (SVM), k-nearest neighbors (KNN), neural networks, and probability density weighted mean distance (PDWMD) were the selected classifiers. Different combinations of features and classifiers were also employed to obtain the best performance on the six test data sets. However, issues such as how to choose an optimal scale for patch extraction and how to identify overlapping classes were not addressed in this work.
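The color and texture descriptors named in the list above can be illustrated with a minimal NumPy sketch. It computes normalized chromaticity coordinates and a basic 8-neighbor LBP histogram, and checks the brightness properties attributed to them. This is only an illustration under simple assumptions (a plain 3 × 3 LBP without interpolation, a synthetic patch); it is not the exact feature pipeline of [13], and the function names are ours.

```python
import numpy as np

def ncc(rgb):
    """Normalized chromaticity coordinates: each channel divided by R+G+B."""
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    total[total == 0] = 1.0  # avoid division by zero on black pixels
    return rgb / total

def lbp_histogram(gray):
    """Basic 8-neighbor LBP codes of the interior pixels, as a 256-bin histogram."""
    g = gray.astype(np.float64)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int64)
    # Offsets of the 8 neighbors, visited clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

# Synthetic patch standing in for a coral image region (hypothetical data).
rng = np.random.default_rng(0)
patch = rng.integers(1, 120, size=(16, 16, 3))
green = patch[..., 1]

# NCC is unchanged when every channel is scaled by the same factor.
ncc_invariant = np.allclose(ncc(patch), ncc(2 * patch))
# LBP is unchanged under a monotonic brightness change.
lbp_invariant = np.allclose(lbp_histogram(green), lbp_histogram(2 * green + 10))
```

Both flags evaluate to true here because NCC divides out a common intensity factor and LBP depends only on the ordering of neighboring pixel values.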
Table 21.1 summarizes the feature vectors and the number of classes of the methods explained above.
Table 21.1 Summary of Hand-Crafted Feature Based Methods for Coral Classification
Features | Number of Classes | Ref.
NCC histogram for color and LBP for texture | 5 | [13]
RGB histogram for color and DCT + LBP for texture | 18 | [14]
NCC histogram for color and bag of words SIFT | 8 | [15]
L*a*b* color space and MR filter bank + texton maps | 9 | [8]
CLBP + GLCM + Gabor filter | Multiple data sets | [16]
21.3.2 CORAL CLASSIFICATION WITH LEARNED FEATURES
Deep neural networks are a powerful category of machine learning algorithms built by stacking layers of neurons, extending smaller architectures in both depth and width. In recent years, deep networks have demonstrated strong discriminative and representation learning capabilities over a wide range of applications. Researchers in machine learning are expanding the horizons of deep learning by seeking its prospective applications in other diverse domains. One such emerging domain is marine scene classification. Deep networks require a large amount of annotated data for training. With efficient training algorithms, deep neural networks are capable of separating millions of labeled images into their classes. Moreover, the trained network can also be used for learning efficient image representations for other similar benthic data sets. Before discussing applications of deep learning in coral classification, we give a brief review of deep learning and its state-of-the-art architectures in the next section.
21.4 DEEP NEURAL NETWORKS
Excellent performance on any image or video processing task (e.g., classification, object detection, scene understanding) relies on the extraction of discriminative features or image representations from the input data. Domain-specific hand-crafted image representations have been used extensively in computer vision for decades. In recent years, features learned by machine learning algorithms, an approach known as representation learning, have shown better performance than the traditional hand-crafted representations. Deep learning algorithms employ conventional neural networks with increased complexity and depth. Neural networks with many hidden layers [17] are capable of extracting high levels of abstraction from raw data. Many state-of-the-art systems in computer vision owe their success to this ability. Neural networks were popular in the 1990s, but support vector machines took center stage in the 2000s and outperformed them. Deep neural networks became very popular in computer vision after the seminal work in [18].
21.4.1 CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) [18] are an important class of neural networks used to learn image representations that can be applied to numerous computer vision problems. Deep CNNs, in particular, consist of multiple layers of linear and non-linear operations that are learned simultaneously, in an end-to-end manner. To solve a particular task, the parameters of these layers are learned over
FIGURE 21.4 VGGnet Architecture: 16 weighted layers. C1 to C5 are 5 convolutional layers with sublayers. FC are 3 fully connected layers.
several iterations. CNN based methods have become popular in recent years for feature extraction from image and video data. A CNN consists of convolutional layers and pooling layers occurring in an alternating fashion. Sparse connectivity, parameter sharing, subsampling, and local receptive fields are the key factors that give CNNs a degree of invariance to shifting, scaling, and distortions of the input data. Sparse connectivity is achieved by making the kernel smaller than the input image, which reduces the number of connections between the input and the output layer. Inter-channel and intra-channel redundancy can be exploited to maximize sparsity. Moreover, computing the output requires fewer operations and less memory to store the weights.
In a non-convolutional neural network, each weight is multiplied by a single input element and never used again. In a convolutional layer, however, every element of the kernel is reused at many positions of the input. The convolutional layers consist of stacks of filters of predefined sizes that are convolved with the input of the layer. The parameter sharing used by convolutional layers is more efficient (it requires less computation and memory) than a dense matrix multiplication. Parameter sharing also makes the convolutional layers equivariant to translations (i.e., any shift in the input results in the same shift in the output). However, convolutional layers are not equivariant to changes in scale or rotation.
The depth of a CNN can be increased by feeding the output of a pooling layer into the next convolutional layer. CNNs with smaller filter sizes (3 × 3) and deeper architectures have shown increased performance. One such example is the VGGnet [19] (shown in Fig. 21.4). Its authors reported a significant improvement over the prior-art configurations by pushing the depth to 16–19 weight layers. They secured the first and the second places in the ImageNet Challenge 2014 in the localization and classification tracks, respectively.
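Parameter sharing and translation equivariance can be checked numerically with a small NumPy sketch. We assume circular (wrap-around) borders so that the equivariance holds exactly; practical CNN layers usually use zero padding, for which it holds only away from the image border. The image size, kernel, and shift are arbitrary illustrative choices.

```python
import numpy as np

def conv2d_circular(image, kernel):
    """Cross-correlate image with kernel using wrap-around (circular) borders."""
    out = np.zeros_like(image, dtype=np.float64)
    kh, kw = kernel.shape
    for i in range(kh):
        for j in range(kw):
            # Roll so that input pixel (y + i, x + j) lines up with output (y, x).
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
kernel = rng.normal(size=(3, 3))

# Shifting the input and then filtering equals filtering and then shifting.
shift_then_filter = conv2d_circular(np.roll(image, (5, 7), axis=(0, 1)), kernel)
filter_then_shift = np.roll(conv2d_circular(image, kernel), (5, 7), axis=(0, 1))
equivariant = np.allclose(shift_then_filter, filter_then_shift)

# Weight counts for mapping a 32 x 32 input to a 32 x 32 output.
conv_weights = kernel.size                 # 9 weights shared across all positions
dense_weights = image.size * image.size    # one weight per input-output pair
```

The same nine weights are reused at all 1024 positions, whereas a dense layer between the same input and output would need 1024 × 1024 separate weights.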
21.4.2 REPRESENTATION LEARNING
Learning discriminative image representations from data has evolved into a promising research area. A powerful image representation captures the prior distribution of the data by learning the image features. These features are usually hierarchical in nature (low and high level features) and hence the image representations learn to define the more abstract concepts in terms of the less abstract ones. A good learned representation should be simple (usually linearly dependent), sparse, and possess spatial and temporal coherence. The depth of a network is also an important aspect of representation learning. Representations learned from the higher layers of deep networks encode high level features of the data. Image representations extracted from CNNs, trained on large data sets such as ImageNet [18] and fine-tuned on domain specific data sets, have shown state-of-the-art performance in numerous image
classification problems [20]. These learned features can be used as universal image representations and have produced outstanding performance in computer vision tasks, e.g., image classification, object detection, fine-grained recognition, attribute detection, and instance retrieval. The activations of the first fully connected layer of a CNN are the preferred choice of most researchers. However, the activations of intermediate convolutional layers have also shown comparable performance. In [21], subarrays of convolutional layer activations are extracted and used as region descriptors in a ‘local feature’ setting. The extracted local features from two consecutive convolutional layers are then pooled together and included in the resulting feature vector. This approach, termed “cross-convolutional layer pooling,” achieved significant performance improvements in scene classification tasks [21].
Why do these CNN features perform so well across diverse domains? Despite their outstanding performance, the intrinsic behavior of these deep networks is somewhat of a mystery. A visualization technique was proposed in [22] which investigated the relationship between the outputs of various layers of the CNN architecture proposed in [18] and the input image. The outputs of different convolutional layers were analyzed and the following conclusions were drawn: Layer 2 responds to corners and edges, Layer 3 captures complex invariances such as texture and mesh patterns, Layer 4 is more class-specific, and Layer 5 captures entire objects irrespective of pose variations. Visualization methods that help us understand image representations in general, and learned deep representations in particular, are gaining popularity in the computer vision community. Before the development of these methods, CNN based image representations were regarded as black boxes for deep feature extraction. A new visualization method was introduced recently in [23].
This method is based on natural-looking pre-images (a pre-image is an image obtained by the inverse transform of the learned representation) which have prominent image representations. Such images are termed “natural pre-images.” Three image visualizations were used to investigate the effectiveness of standard hand-crafted representations and CNN representations: inversion, activation maximization, and caricaturization. It was demonstrated that representations like HOG can be inverted more precisely than CNN features. However, different layers of a CNN retain relevant information about the input image along with different pose and illumination variations. Deep layers of a CNN preserved object-specific information and global variances. Moreover, fully connected layers captured large variations in the object layouts. Intermediate convolutional layers seemed to preserve local variances and structures such as lines, edges, curves, and parts. These conclusions were a big step towards understanding generic deep features.
To further enhance the invariance of deep features without decreasing their discriminative power, multi-scale orderless pooling (MOP-CNN) was introduced in [24]. CNN features are pooled locally at multiple scales using Vector of Locally Aggregated Descriptors (VLAD) pooling. The final feature vector is then obtained by concatenating these local feature vectors and can be used as a generic descriptor for supervised or unsupervised recognition tasks, image classification, or scene understanding. MOP-CNN features have consistently shown better performance than global CNN activations. They have also eliminated the need for joint training of prediction layers for a particular domain.
For object detection tasks, regions with CNN features (R-CNN, in short) [25] is a powerful variant of CNNs that has recently become very popular.
R-CNN combines two key concepts: (1) region proposals combined with deep CNNs in order to localize and segment objects, and (2) supervised pretraining followed by domain-specific fine-tuning for smaller training data sets. This method yielded a significant performance improvement in the case of object detection tasks.
A pretrained conventional deep network requires a fixed input image size (e.g., 224 × 224 for VGGnet). This “artificial” requirement may reduce the recognition accuracy for images of arbitrary size. To overcome this limitation of CNNs, a novel pooling scheme called “spatial pyramid pooling (SPP)” was introduced in [26]. With SPP-nets, a fixed-length feature vector can be obtained irrespective of the input image size. SPP also made the network robust to pose variations and scale deformations. SPP-net achieved state-of-the-art classification results using full-image representations and without any fine-tuning. Feature maps are computed from the entire image only once, which avoids repeated computation of the convolutional features.
Deep learning methods have achieved state-of-the-art performance on many computer vision tasks. However, these tasks remain challenging and the methods have plenty of room for improvement. The history of deep learning applications in computer vision demonstrates that deeper networks improve performance [18].
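The fixed-length property of spatial pyramid pooling can be sketched in a few lines of NumPy. The pyramid levels (1 × 1, 2 × 2, 4 × 4) and the way the bin edges are rounded are assumptions made for this illustration and do not reproduce the exact window arithmetic of [26].

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over n x n grids and concatenate the results."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.normal(size=(8, 13, 17))   # feature map from a small input (hypothetical)
large = rng.normal(size=(8, 24, 24))   # feature map from a larger input (hypothetical)

vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Both vectors have length 8 × (1 + 4 + 16) = 168 although the two feature maps differ in spatial size, which is what allows a fixed-size classifier to follow convolutional layers applied to inputs of arbitrary size. The sketch assumes that each spatial dimension is at least as large as the finest pyramid level.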
21.4.3 GOING DEEPER WITH NEURAL NETS
A detailed study explaining why deep networks outperform shallow ones was presented in [27] and [28]. In [27], the number of distinct linear regions was used as a key measure of the complexity of a function encoded by a deep network. It was established that for any given layer of a deep network, the ability to encode pieces of input information was exponential in nature. The functions computed by the deeper layers were more complex but they still possessed an intrinsic rigidity caused by replicating the hidden layers. This rigidity helps deep networks to generalize to unseen input samples better than shallow models. In [28], a novel method of understanding the expressiveness of a model was presented based on computational geometry for piecewise linear functions. Deep and narrow rectifier MLPs generated more regions of linearity than shallow networks with the same number of computational units. Increasing the depth and width of a CNN greatly increases the computational cost of training (there is a larger number of weight parameters to adjust). The staggering success of CNNs over the last five years can be explained by these factors: larger data sets, deeper models, faster hardware, and, last but not least, novel algorithms for optimization and efficient training of deeper networks.
Conventional CNNs come with two basic functionalities: partitioning and abstraction. Partitioning can be improved by using very small filters at the start of the network and then increasing the filter size as we go deeper. In a standard CNN, a linear classifier and a non-linear activation function are employed to yield an abstraction from the input patch. These abstractions are not discriminative enough. In [29], a novel structure called “Network in network (NIN)” was proposed to enhance the strength of these abstractions. The micro neural networks are instantiated as multilayer perceptrons (MLPs).
This micro network can be viewed as an additional 1 × 1 convolutional layer, typically followed by a rectified linear activation layer. The 1 × 1 convolutions (small filters) have a twofold advantage: they reduce the dimension of the input, which allows the width of the network to be increased, and they reduce the computational cost. This combination approximates any given function more effectively than a linear classifier followed by a non-linear activation function. These micro networks are then slid over the input, like convolutional filters, within a larger network, hence the name “network in network.” Stacking these structures repeatedly results in deep NINs. Fully connected layers are also replaced with global average pooling layers, which are less prone to overfitting. Deep NINs demonstrated state-of-the-art performance on the CIFAR-10, CIFAR-100, and SVHN data sets.
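A 1 × 1 convolution is the same linear map applied independently at every pixel, and global average pooling reduces each channel to a single number. The following NumPy sketch illustrates both operations; the channel counts and the ten output classes are hypothetical values chosen for the example, not the configuration of [29].

```python
import numpy as np

def conv1x1(feature_map, weights, bias):
    """1 x 1 convolution on a (C_in, H, W) map: one linear map shared by all pixels."""
    out = np.einsum("oc,chw->ohw", weights, feature_map)
    return out + bias[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def global_average_pool(feature_map):
    """Average each channel over all spatial positions: (C, H, W) -> (C,)."""
    return feature_map.mean(axis=(1, 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8, 8))                      # 64 input channels
w1, b1 = rng.normal(size=(16, 64)), np.zeros(16)     # reduce 64 -> 16 channels
w2, b2 = rng.normal(size=(10, 16)), np.zeros(10)     # one map per (hypothetical) class

hidden = relu(conv1x1(x, w1, b1))      # shape (16, 8, 8)
class_maps = conv1x1(hidden, w2, b2)   # shape (10, 8, 8)
scores = global_average_pool(class_maps)
```

The final pooling step has no trainable weights, which is one reason global average pooling is less prone to overfitting than a fully connected output layer.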
Inspired by the NIN architecture, Google introduced a deep network codenamed ‘Inception’ [30]. This network utilized the concept of depth in two ways: (1) it increased the number of layers, and (2) it introduced an “inception module” that adds a new level of organization along the width of the network. The performance of any deep network can be enhanced by increasing the depth of the network (adding more layers) and increasing the width (adding more channels at each layer). However, the resulting deeper and wider network is more prone to overfitting and is also computationally expensive. A logical approach to solve these bottlenecks is to make the network connections sparse instead of fully connected. When the optimal sparse structure [31] was approximated by readily available dense building blocks, the resulting network outperformed shallower networks with a similar computational budget. Small filters (1 × 1 convolutions) were placed before the bigger filters to reduce the dimension of their input. The inception modules were only added to the higher layers of GoogLeNet to keep the computational cost low. This was a promising start towards creating deeper and sparser networks.
Another prominent class of architectures for training deeper networks is worth mentioning: ‘highway networks’ [32]. In this novel architecture, optimization is performed using a learned gating mechanism inspired by the Long Short Term Memory (LSTM) recurrent neural networks [33]. This gating mechanism results in arbitrary paths for information flow between multiple layers. These paths are termed ‘information highways.’ The switching information for the gates is learned from the training set and, since only some of the neurons are activated at any given iteration, the computational cost is reduced as well. Highway networks as deep as 900 layers can be optimized easily using this approach.
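The gating mechanism of a highway layer can be written as y = H(x) · T(x) + x · (1 − T(x)), where H is the usual non-linear transformation and T is a sigmoid “transform gate.” The NumPy sketch below uses fully connected layers with a tanh transformation as simplifying assumptions; the layer width and the gate bias of −20 are illustrative values only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = H(x) * T(x) + x * (1 - T(x)), with a sigmoid transform gate T."""
    h = np.tanh(x @ w_h + b_h)      # candidate transformation H(x)
    t = sigmoid(x @ w_t + b_t)      # transform gate T(x), values in (0, 1)
    return h * t + x * (1.0 - t)    # the gate blends the transformed and raw input

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
w_h, b_h = rng.normal(size=(16, 16)), np.zeros(16)

# A strongly negative gate bias closes the gate: the layer passes x through.
w_t, b_t = np.zeros((16, 16)), np.full(16, -20.0)
carried = highway_layer(x, w_h, b_h, w_t, b_t)

# A strongly positive gate bias opens the gate: the layer behaves like a plain layer.
transformed = highway_layer(x, w_h, b_h, w_t, np.full(16, 20.0))
```

When the gate is closed the input is carried forward almost unchanged, so many such layers can be stacked without blocking the flow of information, which is the intuition behind the ‘information highways.’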
So far, we have established that the computational difficulty of training a neural network grows with its depth. The accuracy of a deep network saturates if we keep stacking layers beyond a certain depth. However, if the training computations are optimized effectively, an increased depth can result in higher performance. One such approach was presented in [34] and named residual networks (ResNets). A residual network consists of a number of residual blocks, each being a small CNN itself. These residual blocks are not just stacked together; the output of each block is also carried by a shortcut connection and added to the output of the next block. An example of a residual block is shown in Fig. 21.5. The shortcut connections add neither parameters nor computational complexity, and ResNets are less complex than VGG nets: a 34-layer ResNet requires 3.6 billion multiply-add operations whereas a 19-layer VGGnet requires 19.6 billion. ResNets are also easier to train, and their training accuracy does not saturate as the depth increases. Improved results on CIFAR-10 were reported in a subsequent study [35] using a 1001-layer deep ResNet.
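A residual block computes y = x + F(x), so the stacked layers only have to learn a correction F to the identity mapping. The sketch below is a simplified fully connected version written in NumPy (the blocks of [34] are convolutional and include batch normalization, which we omit). It also reproduces the ratio of the two operation counts quoted above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)), where F(x) = relu(x @ w1) @ w2 has two weight layers."""
    return relu(x + relu(x @ w1) @ w2)

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 16)))   # non-negative input, so relu(x) == x

# With zero weights F(x) = 0 and every block reduces to the identity mapping,
# so even a stack of 50 blocks passes the input through unchanged.
zeros = np.zeros((16, 16))
out = x
for _ in range(50):
    out = residual_block(out, zeros, zeros)
identity_preserved = np.array_equal(out, x)

# Ratio of the multiply-add counts quoted in the text (34-layer ResNet vs. VGG-19).
cost_ratio = 3.6e9 / 19.6e9
```

Here `cost_ratio` is about 0.18, i.e., the 34-layer ResNet needs roughly 18% of the multiply-add operations of the 19-layer VGGnet.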
21.5 DEEP LEARNING FOR CORAL CLASSIFICATION
Coral reefs exhibit significant within-class variations, complex between-class boundaries, and inconsistent image clarity. The accuracy of any classification algorithm depends on the discriminating power of the features extracted from the images. In light of the challenges outlined in Section 21.2, hand-crafted features have a number of limitations. Hand-crafted features usually encode one or two aspects of data such as color, shape, or texture. Creating a novel hand-crafted feature representation which addresses all of the challenges involved in marine images is an uphill task. It is far more feasible to rely on off-the-shelf CNN features extracted from a deep network pretrained on a large image data
FIGURE 21.5 A residual block of ResNets shown in a red rectangle. The output of one residual block acts as the input of the next block and is also added to the output of the next residual block.
set. CNN features have shown their discriminating power when transferred to a different domain [20]. Combining the CNN features with the domain-specific hand-crafted features to improve the classification performance presents an interesting research problem (as shown below).
21.5.1 HYBRID AND QUANTIZED FEATURES
The idea of combining CNN and hand-crafted features has been used for action classification in videos [36]. Most of the entries in the THUMOS challenge [36] combined CNN based features with hand-crafted features for this task. Hand-crafted features are usually encoded using Fisher vectors [37] or the Vector of Locally Aggregated Descriptors (VLAD) [38] before they are combined with the CNN features. Wang et al. [39] cascaded morphology and texture based hand-crafted features with CNN features for mitosis detection. They trained three classifiers: one for CNN features, one for hand-crafted features, and a third classifier for the test samples that are misclassified by the first two classifiers. This approach is computationally expensive and impractical for applications with large data sets. Jin et al. [40] showed that CNN and hand-crafted features complement each other and reported promising results for RGB-D object detection. They combined Locality-constrained Linear Coding (LLC) based spatial pyramid matching features with the CNN features.
CNN features cannot be used directly in coral image classification since benthic data sets come with pixel (instead of bounding box) annotations. Until recently, deep features had not been explored for the coral reef classification problem. In [41], generic deep features extracted from VGGnet were combined with hand-crafted features for coral reef classification to take advantage of the complementary strengths of the two representation types. The data set was not big enough to train a CNN from random initialization. Therefore, pretrained CNN based features were extracted from patches centered at labeled pixels at multiple scales, and a local variant of SPP was implemented to render the image representations scale-invariant.
Texture and color based hand-crafted features extracted from the same patches were used to complement the CNN features. A memory efficient 2-bit feature representation scheme was investigated to reduce the memory requirements by a factor of 16.
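The multi-scale patch extraction around a labeled pixel can be sketched as follows. A simple per-channel histogram stands in for the pretrained CNN so that the example is self-contained; the patch sizes, the reflect padding at the image border, and the names `extract_patch`, `toy_descriptor`, and `multiscale_feature` are assumptions made for illustration and are not taken from [41].

```python
import numpy as np

def extract_patch(image, row, col, size):
    """Square patch of side `size` centered at (row, col), reflect-padded at borders."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, the original pixel (row, col) sits at (row + half, col + half).
    return padded[row:row + size, col:col + size]

def toy_descriptor(patch, bins=8):
    """Placeholder for a CNN feature extractor: per-channel intensity histograms."""
    feats = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    feats = np.concatenate(feats).astype(np.float64)
    return feats / feats.sum()

def multiscale_feature(image, row, col, scales=(28, 56, 112), describe=toy_descriptor):
    """Describe the patch at each scale, then max-pool element-wise across scales."""
    per_scale = np.stack([describe(extract_patch(image, row, col, s)) for s in scales])
    return per_scale.max(axis=0)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3))   # synthetic stand-in for a reef image

center_feature = multiscale_feature(image, 128, 128)
corner_feature = multiscale_feature(image, 0, 0)    # labeled pixel on the image border
```

Because the descriptor has the same length at every scale, the element-wise maximum yields one fixed-length vector per labeled pixel, to which hand-crafted color and texture features from the same patches could then be concatenated.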
FIGURE 21.6 Block diagram of coral classification method of [41]: (A) the pipeline for CNN feature extraction, (B) the pipeline for the hybrid features, (C) the pipeline for the quantized features.
The proposed method achieved a classification accuracy that is higher than the state-of-the-art methods on the MLC benchmark data set for corals. The hybrid (hand-crafted and learned) features performed the best. The results also suggest that the CNN features and the hybrid features addressed the problem of class imbalance more efficiently. In the case of corals, the most abundant class overshadows the less frequent classes when the patches are extracted at one scale. Since the patches were extracted at different scales and then max-pooled, the less abundant classes are made more prominent in the resulting feature vectors. This was demonstrated by the experimental results, and it helps the classifier to cope with the inherent class imbalance problem effectively. Fig. 21.6 outlines the block diagrams of the different classification pipelines.
Fig. 21.7 shows the confusion matrices (CM) from the experiments in [41]. The rows correspond to the ground truth and the columns correspond to the predicted class assignments. An ideal confusion matrix has 1s on its diagonal and 0s elsewhere. In practice, high values on the diagonal of the CM indicate a good quality classifier. A better classification performance was observed when these confusion matrices are compared with the ones in [8]. The presence of high non-zero elements in the first column implies an imbalance towards class 1, which is the most abundant class in our data set. When the first column is compared with the corresponding first columns of [8], we note that the proposed method copes better with the class imbalance. It was concluded that the local-SPP scheme takes care of the class imbalance problem to an extent.
Memory overhead is an important aspect when dealing with large data sets. Feature representations for larger data sets require a lot of storage space. Efficient encoding schemes are necessary to compress these representations without losing the essential information. Therefore, we propose to quantize the feature vector to a low bit representation to encode the CNN based features. The resulting feature
Therefore, we propose to quantize the feature vector to a low bit representation to encode the CNN based features. The resulting feature
FIGURE 21.7 Confusion matrices for coral classification on MLC data set: (A–C) Baseline performance [8] for experiments 1, 2 and 3 respectively. (D–F) Our combined features (CF) for three experiments: (1) Trained and tested on 2008 images. (2) Trained on 2008 images and tested on 2009. (3) Trained on 2008 and 2009 images while tested on 2010 images.
vector takes up to 16 times less storage space. Compact feature representations with lower memory requirements are preferred in the case of CNN based features, and they should lead to faster training and testing times. In CNNs, the magnitude of a neuron's activation is not as important as the spatial location of that neuron in the network. To test this, the individual elements of our combined feature vector were quantized into three values, i.e., 0, 1, and −1. The positive elements were replaced by 1 and the negative elements by −1, while zero elements remained 0. Consequently, only two bits are required to store each individual element compared to the commonly used 32-bit single precision floating point format. This quantization reduced the memory required to store the feature vectors by a factor of 16 (32 bits replaced by 2 bits after quantization). This efficient utilization of memory was achieved at the cost of a slight decrease in the classification accuracy. The resulting accuracy for the quantized features (QF) is still comparable with the baseline performance on the MLC data set.
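The factor of 16 can be verified with a short NumPy sketch that quantizes a float32 feature vector by sign and packs four ternary values into each byte. The particular 2-bit codes (0 → 0, +1 → 1, −1 → 2) and the packing order are assumptions made for this illustration; the text above only states that two bits per element suffice.

```python
import numpy as np

def quantize_ternary(features):
    """Map each element to -1, 0, or +1 according to its sign."""
    return np.sign(features).astype(np.int8)

def pack_2bit(ternary):
    """Store four ternary values per byte using the codes 0 -> 0, +1 -> 1, -1 -> 2."""
    codes = np.where(ternary < 0, 2, ternary).astype(np.uint8)
    # Pad with zeros so that the length is a multiple of four.
    codes = np.concatenate([codes, np.zeros(-len(codes) % 4, dtype=np.uint8)])
    codes = codes.reshape(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def unpack_2bit(packed, length):
    """Invert pack_2bit and return the first `length` ternary values."""
    codes = (packed[:, None] >> np.array([0, 2, 4, 6], dtype=np.uint8)) & 3
    codes = codes.reshape(-1)[:length].astype(np.int8)
    return np.where(codes == 2, -1, codes).astype(np.int8)

rng = np.random.default_rng(0)
features = rng.normal(size=1000).astype(np.float32)   # hypothetical feature vector

ternary = quantize_ternary(features)
packed = pack_2bit(ternary)
compression = features.nbytes / packed.nbytes          # 4000 bytes / 250 bytes
restored = unpack_2bit(packed, len(features))
```

The packed vector occupies 250 bytes instead of 4000, and unpacking recovers the ternary vector exactly, so the only information lost is the magnitude of each activation.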
21.5.2 CORAL POPULATION ANALYSIS
The proposed classification algorithm of [41] was also evaluated on the Benthoz15 data set [42]. This data set consists of an expert-annotated set of geo-referenced benthic images and associated sensor data,
FIGURE 21.8 Block diagram of proposed framework of [41].
captured by an autonomous underwater vehicle (AUV) across multiple sites from all over Australia. The whole data set contains 407,968 expert-labeled points, on 9874 distinct images collected at different depths from nine sites around Australia over the past few years. There are almost 40 distinct class labels in this data set, which make it quite challenging to annotate automatically. A subset of this data set containing images from Western Australia (WA) was used to train the classifier in [43]. Fig. 21.8 outlines the general approach of their proposed framework. The multi-scale features were extracted using a deep network. The coral population of the Abrolhos Islands (located off the west coast of Western Australia) was also analyzed by automatically annotating the unlabeled mosaics using our best classifier. Coral cover maps were then generated and validated by a marine expert as ground-truth labels were not available. This method detected a decreasing trend in the coral population in this region. It was an important step towards investigating the long-term effects of environmental change on the effective sustenance of marine ecosystems automatically.
21.5.3 COST-SENSITIVE LEARNING FOR CORALS
Like most real-world computer vision data sets, marine data sets also exhibit class imbalance. Non-coral classes exist in abundance and hence the class distribution is skewed towards the non-coral classes. This imbalance hinders the classifier from learning distinct class boundaries, and a performance drop occurs. A cost-sensitive deep network was proposed in [44] to address this issue. This network's architecture was based on VGGnet (16-layer version). Instead of altering the original class distributions (e.g., by oversampling or undersampling), a cost-learning layer was introduced before the soft-max layer of the classifier. An optimization algorithm was proposed to optimize the network parameters and the cost-sensitive layer parameters. This approach was tested on many data sets that exhibit class imbalance (including a coral data set, MLC). It performed better than the baseline performance on MLC reported in [8]. However, this performance is lower than the performance reported in [41].
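To give a feel for cost-sensitive training, the sketch below implements a much simpler, fixed scheme: a cross-entropy loss in which each class is weighted by the inverse of its frequency. This is a generic technique and is not the learned cost layer of [44]; the 90/10 class split and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by N / (K * count), so that rare classes cost more."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    counts[counts == 0] = 1.0                        # guard against empty classes
    return len(labels) / (num_classes * counts)

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of the per-sample negative log-likelihoods."""
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(labels)), labels])
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())

# Hypothetical imbalanced labels: 90 non-coral samples (class 0), 10 coral (class 1).
labels = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(labels, num_classes=2)

# A classifier that always favors the abundant class.
biased_logits = np.tile(np.array([2.0, 0.0]), (100, 1))
plain_loss = weighted_cross_entropy(biased_logits, labels, np.ones(2))
weighted_loss = weighted_cross_entropy(biased_logits, labels, weights)
```

With these weights the ten coral samples contribute as much to the loss as the ninety non-coral samples, so a classifier that always predicts the abundant class is penalized more heavily (`weighted_loss` exceeds `plain_loss`).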
21.5.4 CNN WITH FLUORESCENT IMAGES
Most common deep networks work with color images and hence the input layer has three distinct channels (R, G, and B). In theory, however, a CNN can have an arbitrary number of input channels to encode additional information. One such approach was proposed in [45]. RGB images were combined with reflectance images and fluorescent images. A pixel-wise average function was used to obtain the final image. The fluorescent images had rich contrast information for the corals and the reflectance images provided context for the non-fluorescent substrates. After registration, the input image had five channels and a CNN was trained with these additional channels. This CNN's architecture was similar to the CIFAR10 architecture defined in Caffe. Patches of 128 × 128 were extracted and resized to 32 × 32
before passing them through three consecutive rounds of convolutional layers. The 5-channel CNN performed better than the corresponding traditional CNN for coral classification. The performance of this 5-channel CNN was also compared with the baseline performance of [8]. A 22% reduction of classification error-rate was demonstrated when both reflectance and fluorescent images were used compared to the case when only reflectance images were used.
21.6 FUTURE PROSPECTS
Deep learning solutions to ecological studies can provide a truly objective measure to detect, discriminate and identify species, and their behavior and morphology. This will reduce common sources of variation and bias in human observer studies caused by subjective interpretation or the lack of skill or experience. Such automated processing tools will ensure transparency of the study results and standardization of methods for analysts. They will also facilitate comparisons of studies across individuals, populations and species in a systematic and objective manner, and enable the processing of data sets at considerably higher speeds compared to human experts. This is particularly relevant for tediously repetitive tasks. Freeing human resources for more complex tasks is becoming increasingly important in budget-limited and data-intensive studies. Other transformational ecological outcomes include: (i) rapid quantitative surveying of the massive amount (95%) of the acquired underwater imagery that is yet to be processed. This will enable the construction of large-scale spatially extensive image baselines for marine habitats. Such data could then be used to make a quantitative assessment of the impact of climate change; (ii) monitoring of the growth, mortality, and recruitment rates and competitive abilities of marine species (e.g. coral reefs, lobsters, kelp) associated with warming and acidification; and (iii) improved knowledge of marine ecosystems about which very little is known.
Some other future prospects of this research are:
• To develop deep learning methods, not only limited to corals, to classify a huge amount of marine data automatically.
• To compare different deep learning methods to form a solid basis for efficient assessment of marine ecosystems.
• To develop an automatic annotation system that works with diverse data sets while saving human resources that are necessary for manual labeling.
• To investigate the resilience of marine ecosystems to environmental impacts (global warming, marine pollution, resource extraction, coastal development) through economically sustainable monitoring programs.
• To analyze the relationships between marine species and to quantify the trends in the population dynamics.
21.7 CONCLUSION
In this chapter, we presented a concise survey on the evolution of deep learning and state-of-the-art deep neural network architectures. We introduced sea floor exploration and the challenges involved in
collecting and analyzing marine data. Next, we presented a brief literature survey on marine image classification techniques. We further explored the potential applications of deep learning for benthic image classification by discussing the most recent studies which have been conducted by our group and other researchers. We also discussed a few future research directions in the fields of deep learning and underwater scene understanding. We expect that this chapter will encourage researchers from computer vision and marine societies to collaborate on similar long-term joint ventures.
ACKNOWLEDGMENTS
This research was partially supported by Australian Research Council Grants (DP150104251 and DE120102960) and the Integrated Marine Observing System (IMOS) through the Department of Innovation, Industry, Science and Research (DIISR), National Collaborative Research Infrastructure Scheme. The authors also thank NVIDIA for providing a Titan-X GPU for the experiments involved in this research.
REFERENCES [1] J.D. Hedley, C.M. Roelfsema, I. Chollett, A.R. Harborne, S.F. Heron, S. Weeks, W.J. Skirving, A.E. Strong, C.M. Eakin, T.R. Christensen, V. Ticzon, Remote sensing of coral reefs for monitoring and management: a review, Remote Sens. 8 (2) (2016) 118. [2] S.C. Doney, M. Ruckelshaus, J.E. Duffy, J.P. Barry, F. Chan, C.A. English, H.M. Galindo, J.M. Grebmeier, A.B. Hollowed, N. Knowlton, J. Polovina, Climate change impacts on marine ecosystems, Marine Sci. (2012) 4. [3] O. Hoegh-Guldberg, P.J. Mumby, A.J. Hooten, R.S. Steneck, P. Greenfield, E. Gomez, C.D. Harvell, P.F. Sale, A.J. Edwards, K. Caldeira, N. Knowlton, Coral reefs under rapid climate change and ocean acidification, Science 318 (5857) (2007) 1737–1742. [4] T.P. Hughes, A.H. Baird, D.R. Bellwood, M. Card, S.R. Connolly, C. Folke, R. Grosberg, O. Hoegh-Guldberg, J.B. Jackson, J. Kleypas, J.M. Lough, Climate change, human impacts, and the resilience of coral reefs, Science 301 (5635) (2003) 929–933. [5] B. Worm, E.B. Barbier, N. Beaumont, J.E. Duffy, C. Folke, B.S. Halpern, J.B. Jackson, H.K. Lotze, F. Micheli, S.R. Palumbi, E. Sala, Impacts of biodiversity loss on ocean ecosystem services, Science 314 (5800) (2006) 787–790. [6] F. Shafait, A. Mian, M. Shortis, B. Ghanem, P.F. Culverhouse, D. Edgington, D. Cline, M. Ravanbakhsh, J. Seager, E.S. Harvey, Fish identification from videos captured in uncontrolled underwater environments, ICES J. Marine Sci.: J. Conseil (2016) 106. [7] M. Bewley, B. Douillard, N. Nourani-Vatani, A. Friedman, O. Pizarro, S. Williams, Automated species detection: an experimental approach to kelp detection from sea-floor AUV images, in: Proc Australas Conf Rob Autom 2012, 2012. [8] O. Beijbom, P.J. Edmunds, D.I. Kline, B.G. Mitchell, D. Kriegman, Automated annotation of coral reef survey images, in: IEEE Conference on 2012 Jun 16, Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1170–1177. [9] K.E. Kohler, S.M. 
Gill, Coral Point Count with Excel extensions (CPCe): a Visual Basic program for the determination of coral and substrate coverage using random point count methodology, Comput. Geosci. 32 (9) (2006) 1259–1269. [10] M. Solan, J.D. Germano, D.C. Rhoads, C. Smith, E. Michaud, D. Parry, F. Wenzhöfer, B. Kennedy, C. Henriques, E. Battle, D. Carey, Towards a greater understanding of pattern, scale and process in marine benthic systems: a picture is worth a thousand worms, J. Exp. Mar. Biol. Ecol. 285 (2003) 313–338. [11] M.F. Dolan, V.L. Lucieer, A review of marine geomorphometry, the quantitative study of the seafloor, Hydrol. Earth Syst. Sci. 20 (8) (2016) 3207. [12] M.R. Patterson, N.J. Relles, Autonomous underwater vehicles resurvey Bonaire: a new tool for coral reef management, in: Proceedings of the 11th International Coral Reef Symposium, 2008 Jul, 2008, pp. 539–543. [13] M.S. Marcos, M. Soriano, C. Saloma, Classification of coral reef images from underwater video using neural networks, Opt. Express 13 (22) (2005) 8766–8771. [14] M.D. Stokes, G.B. Deane, Automated processing of coral reef benthic images, Limnol. Oceanogr., Methods 7 (157) (2009) 157–168.
[15] O. Pizarro, P. Rigby, M. Johnson-Roberson, S.B. Williams, J. Colquhoun, Towards image-based marine habitat classification, in: In OCEANS 2008, 2008 Sep 15, IEEE, 2008, pp. 1–7. [16] A.S. Shihavuddin, N. Gracias, R. Garcia, A.C. Gleason, B. Gintert, Image-based coral reef classification and thematic mapping, Remote Sens. 5 (4) (2013) 1809–1841. [17] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507. [18] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105. [19] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, arXiv:1409.1556, 2014 Sep 4. [20] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813. [21] L. Liu, C. Shen, A. van den Hengel, The treasure beneath convolutional layers: cross-convolutional-layer pooling for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4749–4757. [22] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision 2014 Sep 6, Springer International Publishing, 2014, pp. 818–833. [23] A. Mahendran, A. Vedaldi, Visualizing deep convolutional neural networks using natural pre-images, Int. J. Comput. Vis. (2016) 1–23. [24] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: European Conference on Computer Vision 2014 Sep 6, Springer International Publishing, 2014, pp. 392–407. [25] R. Girshick, J. Donahue, T. Darrell, J. 
Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587. [26] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: European Conference on Computer Vision 2014 Sep 6, Springer International Publishing, 2014, pp. 346–361. [27] G.F. Montufar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in Neural Information Processing Systems, 2014, pp. 2924–2932. [28] R. Pascanu, G. Montufar, Y. Bengio, On the number of inference regions of deep feed forward networks with piece-wise linear activations, in: International Conference on Learning Representations, 2014 Apr, 2014. [29] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint, arXiv:1312.4400, 2013 Dec 1. [30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9. [31] S. Arora, A. Bhaskara, R. Ge, T. Ma, Provable bounds for learning some deep representations, in: International Conference on Machine Learning (ICML), 2014 Jun 21, 2014, pp. 584–592. [32] R.K. Srivastava, K. Greff, J. Schmidhuber, Highway networks, in: International Conference on Machine Learning (ICML), 2015. [33] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780. [34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. [35] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, arXiv preprint, arXiv:1603.05027, 2016 Mar 16. [36] Z. Xu, L. Zhu, Y. Yang, A.G. Hauptmann, Uts-cmu at thumos 2015, in: THUMOS Challenge, 2015. [37] J. Sánchez, F. Perronnin, T. 
Mensink, J. Verbeek, Image classification with the Fisher vector: theory and practice, Int. J. Comput. Vis. 105 (3) (2013) 222–245. [38] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010 Jun 13, pp. 3304–3311. [39] H. Wang, A. Cruz-Roa, A. Basavanhally, H. Gilmore, N. Shih, M. Feldman, J. Tomaszewski, F. Gonzalez, A. Madabhushi, Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features, J. Med. Imaging. 1 (3) (2014) 034003. [40] L. Jin, S. Gao, Z. Li, J. Tang, Hand-crafted features or machine learnt features? Together they improve rgb-d object recognition, in: 2014 IEEE International Symposium on Multimedia (ISM), IEEE, 2014 Dec 10, pp. 311–319. [41] A. Mahmood, M. Bennamoun, S. An, F. Sohel, F. Boussaid, R. Hovey, G. Kendrick, R.B. Fisher, Coral classification with hybrid feature representations, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016 Sep 25, IEEE, 2016, pp. 519–523.
[42] M. Bewley, A. Friedman, R. Ferrari, N. Hill, R. Hovey, N. Barrett, O. Pizarro, W. Figueira, L. Meyer, R. Babcock, L. Bellchambers, Australian sea–floor survey data, with images and expert annotations, Sci. Data (2015) 2. [43] A. Mahmood, M. Bennamoun, S. An, F. Sohel, F. Boussaid, R. Hovey, G. Kendrick, R.B. Fisher, Automatic annotation of coral reefs using deep learning, in: OCEANS 2016, 2016 Sep 20, IEEE, 2016. [44] S.H. Khan, M. Bennamoun, F. Sohel, R. Togneri, Cost sensitive learning of deep feature representations from imbalanced data, arXiv preprint, arXiv:1508.03422, 2015 Aug 14. [45] O. Beijbom, T. Treibitz, D.I. Kline, G. Eyal, A. Khen, B. Neal, Y. Loya, B.G. Mitchell, D. Kriegman, Improving automated annotation of benthic survey images using wide-band fluorescence, Sci. Rep. (2016) 6.
CHAPTER 22
A DEEP LEARNING FRAMEWORK FOR CLASSIFYING SOUNDS OF MYSTICETE WHALES
Stavros Ntalampiras
Politecnico di Milano, Milan, Italy
22.1 INTRODUCTION
Bioacoustic signal processing has attracted a lot of attention during the last decade as it is able to offer robust solutions to problems with diverse needs [7,17,19]. The ultimate goal of frameworks processing bioacoustic signals is to provide a complete and accurate picture of the biodiversity of the habitat of interest toward its conservation [15]. Without such automatic frameworks, monitoring is carried out by human experts through thorough observation of the recorded data. Even though the quality of the work done by a human expert is superior to that offered by a machine, monitoring carried out by humans has many drawbacks: (a) it requires more time, whereas an algorithm is able to run faster than real time, (b) the needed expeditions are costly or even impossible in dangerous, inaccessible areas, (c) only a limited number of habitats can be analyzed, and (d) the observers may interfere with the species of interest and alter its behavior. An application domain falling under the umbrella of bioacoustic signal processing deals with the automatic categorization of marine mammal sounds. The related literature shows that this domain is still not as well explored as others, such as the processing of bird calls [18,20]. This is mainly because underwater sound recording requires more sophisticated equipment and resources in general. However, recent technological advancements in automatic recording units have facilitated the capturing of underwater sounds, so one may nowadays easily access vast amounts of the associated audio signals. These databases can be used for the development of automated methods which achieve biodiversity monitoring toward a better analysis of underwater life.
This study addresses the problem of classifying sounds coming from mysticetes based on the hypothesis that the mammalian cortex uses a form of hierarchical decomposition for processing sound stimuli [24,25]. This results in a consistent distribution of the energy produced by each audio signal with respect to specific parts of the spectrum. Our aim is to design acoustic features able to capture this distribution and subsequently model them (and/or their evolution in time) for automatic classification. The mysticete species included in the present study are: (a) Blue whales, (b) Bowhead whales, (c) Fin whales, (d) Humpback whales, and (e) Southern Right whales.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00022-3 Copyright © 2017 Elsevier Inc. All rights reserved.
The problem is quite challenging since the related signals may exhibit similar temporal and spectral characteristics. Thus one must search for features able to capture even slight differences among the signals belonging to the above-mentioned species. Another issue is that the data of a specific class may exhibit distributions with varying characteristics, mainly due to the noise coexisting with the signals of interest. Here we use information coming from both the cepstral and wavelet domains. The resulting features are subsequently modeled by a method exploiting a discriminative classifier based on deep learning. The pattern modeling technique is adaptive while taking into account the following issues: (a) limited data associated with one or more classes, and (b) a data set exhibiting imbalances with respect to one or more classes, i.e. data quantities among the classes are unequal. We followed a thorough experimental procedure using a publicly available data set and reached quite encouraging classification rates. The rest of this article is organized as follows. Section 22.2 provides an overview of the related literature. Section 22.3 explains the acoustic features. Section 22.4 analyzes the modules which comprise the proposed classification framework with special attention to the universal background and reservoir modeling. The next section examines the capabilities of the proposed approach in a thorough and concise way. Finally, Section 22.6 offers our conclusions as well as ideas for future work.
22.2 RELATED LITERATURE
Processing of sounds coming from mysticete and/or odontocete species has attracted the interest of the audio signal processing community quite recently. Halkias et al. [5] designed a method able to classify mysticete sounds of five species (Blue whale, Bowhead whale, Fin whale, Humpback whale, and Southern Right whale) under the presence of noise (ambient noise, mechanical noise, other species). Their method is based on a Restricted Boltzmann Machine and a Sparse Auto-Encoder fed with spectrogram ROIs, and provides a recognition accuracy of 69% and 80% with and without the presence of noise, respectively. Three time-frequency methods for recognizing fin and blue whale calls are presented in [12]. The methods include spectrogram matching, dynamic time warping, and vector quantization, where the latter two operate on the frequency contour. The data set was recorded by the authors, who emphasize the strong and weak points of each method. Study [23] processes sounds produced by ten killer whales and eight pilot whales close to the coasts of Norway, Iceland, and the Bahamas (Whale FM project). They were automatically analyzed and the killer whales were classified as Icelandic or Norwegian while the pilot ones were separated into Norwegian long-finned and Bahamas' short-finned pilot whales. The audio features are extracted from the spectrogram while the classification is based on a distance metric weighted by Fisher discriminant scores. It is interesting to note that the proposed method performed better than the analysis carried out by the citizens. In [1] the authors used the short-time Fourier and wavelet packet transforms along with a multi-layer perceptron (MLP) to analyze blue whale calls. The proposed system is able to classify the vocalizations into A, B, and D blue whale classes.
Paper [13] employs two types of neural networks (based on competitive learning and Kohonen feature mapping respectively) in order to analyze the repertoire of false killer whale vocalizations. The authors used duty cycle measurements and peak frequency as signal characteristics while three major categories were discovered: ascending whistles, low-frequency pulse trains, and high-frequency pulse
trains. It should be noted that the vocalizations were captured from two false killer whales, one male and one female, located at Sea Life Park, Oahu, Hawaii. Brown and Miller [3] applied four Dynamic Time Warping algorithms on a set of calls by Northern Resident whales which may be categorized into seven different classes. Their features included the low-frequency contour (LFC), the high-frequency contour (HFC), their derivatives, and weighted sums of the distances corresponding to LFC with HFC, LFC with its derivative, and HFC with its derivative. Subsequently Brown and Smaragdis [4] used hidden Markov models (HMMs) and Gaussian mixture models (GMMs) to classify seven types of calls coming from Northern Resident killer whales. Their feature set was a time-frequency decomposition of the recorded signals. An interesting approach based on spectrogram correlation is presented in [11]. The corpus consisted of end notes of bowhead whale (Balaena mysticetus) songs recorded in Alaska in 1986 and 1988, and the method outperformed three other methods (matched filters, neural networks, and hidden Markov models). Roch et al. [10] explain a method for classification of free-ranging delphinid vocalizations. The feature extraction concerned cepstral vectors associated with multisecond segments. The authors trained one Gaussian mixture model for each of the following three species: short-beaked and long-beaked common (Delphinus delphis and Delphinus capensis), Pacific white-sided (Lagenorhynchus obliquidens), and bottlenose (Tursiops truncatus). Last but not least, Wilcock [28] followed a fundamentally different approach and performed tracking of fin whales in the northeastern Pacific Ocean using measurements coming from a seafloor seismic network. To the best of our knowledge there are no approaches in the literature exploiting a deep learning classifier in combination with a multidomain set of features for the classification of five mysticete species.
22.3 ACOUSTIC FEATURES
This section explains the parameterization of the audio signals coming from the five whale species. Both frequency and wavelet domains were employed toward obtaining a comprehensive picture of the involved sound events. For convenience, the features extracted from a Blue whale sound event are depicted in Fig. 22.1.
FIGURE 22.1 A representation of the feature sets used in this work. Both frequency and wavelet domains are considered.
FIGURE 22.2 The block diagram of the process extracting the wavelet packet integration feature set.
22.3.1 FREQUENCY DOMAIN FEATURES
The first feature set exploits the spectrogram of the sound event since it may reveal important information for its characterization. The Fast Fourier Transform is used while the signal is windowed in order to minimize the effect of spectral leakage, i.e. the finite-length sequence is tapered at its ends, aiming at a periodic structure without discontinuities. There exists a gamut of window functions with very different spectral properties, e.g. main lobe widths and side lobe amplitudes [6]. Here we have employed the following windowing techniques to reduce edge effects in the FFT: (a) Blackman, (b) Hamming, (c) Hanning, and (d) rectangular. Since classification of whale sounds is a relatively new task for the audio signal processing community and a standard windowing technique has not been established, we performed a series of examinations to determine the optimal one. Early experimentation showed that Blackman windowing offers the best spectral representation with respect to classification accuracy, thus it is favored in this work. As one may see in Fig. 22.1, the high-energy parts of the spectrum are more emphasized when using the Blackman window than when using the Hamming one (Hamming was chosen for comparison since it is commonly used in audio processing applications). Thus the final feature vector consists of the energies of the short-time Fourier transform of the Blackman-windowed signal.
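A minimal sketch of this feature set, assuming NumPy, is given below: the signal is cut into overlapping frames, each frame is multiplied by the chosen window, and the squared magnitude of its FFT is kept. The frame length, hop size, and function name are illustrative assumptions; the chapter does not state the values that were actually used.

```python
import numpy as np

# The four window types compared in the text, as provided by NumPy.
WINDOWS = {
    "blackman": np.blackman,
    "hamming": np.hamming,
    "hanning": np.hanning,
    "rectangular": np.ones,
}


def stft_energies(signal, frame_len=256, hop=128, window="blackman"):
    """Return an (n_frames, frame_len // 2 + 1) array of windowed STFT energies."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    win = WINDOWS[window](frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * win for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)  # one-sided spectrum of each frame
    return np.abs(spectrum) ** 2


# Toy example: a 20 Hz tone sampled at 256 Hz, so 20 Hz falls exactly on bin 20.
fs = 256
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 20 * t)
energies = stft_energies(tone, frame_len=256, hop=128, window="blackman")
```

Swapping the `window` argument makes it straightforward to repeat the comparison between window types on one's own recordings.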
22.3.2 WAVELET DOMAIN FEATURES
This group is extracted after a critical band-based multiresolution analysis of the signal takes place. Wavelets have become a common tool in many signal processing applications (bioacoustic signal enhancement [21], audio fingerprinting [2], speech/music discrimination [14], etc.). The uniqueness of the wavelet transform comes from its ability to process time series which include non-stationary power at many different frequencies. While the Fourier transform is based on smooth and predictable sinusoid functions, wavelets tend to be irregular and asymmetric. The wavelet transform is a dynamic windowing technique processing low and high frequency information content with different levels of analysis. Wavelet packet (WP) analysis breaks up the signal and transforms it into shifted and scaled variants of the original (or mother) function. In this article we employed the Daubechies 1 (or Haar) function. The proposed methodology applies the discrete wavelet transform three consecutive times, which is equivalent to a three-stage filtering, while we retain both low and high frequency content. The feature extraction process is depicted in Fig. 22.2. With this set we wish to obtain a vector with a complete analysis of the audio signal across different spectral areas as approximated by WP. This set takes into account that not all parts of the spectrum contain valuable information while some parts are highly contaminated with noise. After manual inspection of the recordings, we employed a filterbank with the frequency ranges denoted in Table 22.1 using Gabor bandpass filters based on a Gaussian kernel. Subsequently we extract three-level wavelet packets out of each spectral band while applying downsampling, as the Nyquist theorem permits, in order not to end up with double the amount of data. During the next stage we compute the autocorrelation envelope area with respect to each segmented wavelet coefficient and we normalize it by half the segment size.
Finally we form a vector comprised of N normalized integration parameters, where N is the total number of the frequency bands multiplied by the number of the wavelet coefficients (5 × 8 = 40). This is the WP-integration feature vector and the block diagram for its computation is demonstrated in Fig. 22.2. They capture the variations exhibited by each wavelet coefficient within a group of predefined frequency bands. The normalized autocorrelation envelope area was chosen as the whale signals show differences in the content of the frequency bands we utilized.

Table 22.1 The Frequency Limits of the Wavelet Packet Integration Analysis
Band Number    Lower (Hz)    Center (Hz)    Upper (Hz)
1              1             5              10
2              10            15             20
3              20            25             30
4              30            35             40
5              40            45             50
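The WP-integration pipeline can be sketched end to end as follows. This is a rough illustration under several assumptions that the text leaves open: the Gabor filter is approximated by a Gaussian gain applied in the frequency domain with a standard deviation of half the bandwidth, the "autocorrelation envelope area" is taken as the sum of the absolute normalized autocorrelation, and each wavelet packet node is treated as one segment. All function names are hypothetical; only the band limits (Table 22.1), the Haar wavelet, the three decomposition levels, and the normalization by half the segment size come from the text.

```python
import numpy as np

# (lower, center, upper) limits in Hz, taken from Table 22.1.
BANDS_HZ = [(1, 5, 10), (10, 15, 20), (20, 25, 30), (30, 35, 40), (40, 45, 50)]


def gaussian_bandpass(x, fs, lower, center, upper):
    """Band-pass x with a Gaussian frequency response (assumed filter shape)."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    sigma = (upper - lower) / 2.0  # assumed mapping from bandwidth to sigma
    gain = np.exp(-0.5 * ((freqs - center) / sigma) ** 2)
    return np.fft.irfft(np.fft.rfft(x) * gain, n=len(x))


def haar_packets(x, levels=3):
    """Full Haar wavelet packet tree; returns the 2**levels leaf nodes."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        split = []
        for node in nodes:
            node = node[: len(node) // 2 * 2]  # drop an odd trailing sample
            even, odd = node[0::2], node[1::2]
            split.append((even + odd) / np.sqrt(2.0))  # low-pass, downsampled by 2
            split.append((even - odd) / np.sqrt(2.0))  # high-pass, downsampled by 2
        nodes = split
    return nodes


def autocorr_envelope_area(coeffs):
    """Area under |normalized autocorrelation|, divided by half the segment size."""
    ac = np.correlate(coeffs, coeffs, mode="full")[len(coeffs) - 1:]
    if ac[0] <= 0:
        return 0.0
    return float(np.abs(ac / ac[0]).sum() / (len(coeffs) / 2.0))


def wp_integration_features(x, fs):
    """Return the 5 bands x 8 packets = 40-dimensional feature vector."""
    features = []
    for lower, center, upper in BANDS_HZ:
        band = gaussian_bandpass(x, fs, lower, center, upper)
        for coeffs in haar_packets(band, levels=3):
            features.append(autocorr_envelope_area(coeffs))
    return np.array(features)


# Toy example: white noise standing in for a recording sampled at 200 Hz.
rng = np.random.default_rng(0)
noise = rng.standard_normal(1600)
feature_vector = wp_integration_features(noise, fs=200.0)
```

Because each Haar split halves the number of samples per node, the eight leaf nodes together hold as many coefficients as the band-limited input, which is the effect of the downsampling step mentioned in the text.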
22.4 THE CLASSIFICATION FRAMEWORK
We decided to apply a classification methodology approaching the problem from the following perspective: the Reservoir Network (RN) tries to determine the hyperplanes which separate the feature space while projecting them to a multidimensional space. RNs basically comprise recurrent neural networks, i.e., a deep learning architecture whose main purpose is to capture the characteristics of high-level abstractions existing in the acquired data by designing multiple processing layers of complicated formations, i.e. non-linear functions. The advantage of the RN is that the calculations involved in its readout layer are linear, and thus of limited computational complexity, which keeps the training process relatively short. Reservoir computing argues that since back propagation is computationally complex but typically does not influence the internal layers severely, it may be totally excluded from the training process. The readout layer, in contrast, is a generalized linear classification/regression problem associated with low complexity. In addition, any potential network instability is avoided by enforcing a simple constraint on the random parameters of the internal layers. In the following we provide a brief description of the RN.
22.4.1 RESERVOIR NETWORK The trend in acoustic modeling suggests the usage of Reservoir Computing (RC) techniques [26]. An RN comprises an a priori fixed Recurrent Neural Network (RNN), the output layer of which is linear. An RN, whose topology is depicted in Fig. 22.3, includes neurons with non-linear activation functions which are connected to the inputs (input connections) and to each other (recurrent connections). These two types of connections have randomly generated weights, which are kept fixed during both the training and operational phase. Finally, a linear function is associated with each output node. Its parameters are the weights of the output connections and are trained to achieve a specific result, e.g. that a particular output node produces high values for observations of a particular class. The output weights are learned by means of linear regression and are called readouts since they “read” the reservoir state. Details about the RN training and the echo state property can be found in [9].
FIGURE 22.3 A standard reservoir network consisting of three layers: (A) the input, (B) the reservoir, and (C) the readout. The second layer includes neurons with non-linear activation functions. The weights of the input and the recurrent connections are randomly fixed. The weights to the output nodes are the only ones being trained.
As a general formulation of the RNs, we assume that the network has K inputs, G neurons (usually called the reservoir size), and L outputs, while the matrices $W_{in}$ (K × G), $W_{res}$ (G × G), and $W_{out}$ (G × L) include the connection weights. The RN system equations are the following:
\[ x(k) = f_{res}\big(W_{in}\, u(k-1) + W_{res}\, x(k-1)\big) \tag{22.1} \]
\[ y(k) = f_{out}\big(W_{out}\, x(k)\big) \tag{22.2} \]
where $u(k)$, $x(k)$, and $y(k)$ denote the values of the inputs, reservoir outputs, and the readout nodes at time k, respectively; $f_{res}$ and $f_{out}$ are the activation functions of the reservoir and the output nodes, respectively. In this work we consider $f_{res}(x) = \tanh(x)$ and $f_{out}(x) = x$ and we fix L = 5, equal to the number of the sound classes. Linear regression is used to determine the weights $W_{out}$:
\[ W_{out} = \arg\min_{W} \left( \frac{1}{N_{tr}} \lVert XW - D \rVert^{2} + \epsilon \lVert W \rVert^{2} \right) \tag{22.3} \]
\[ W_{out} = (X^{T} X + \epsilon I)^{-1} (X^{T} D) \tag{22.4} \]
where $XW$ and $D$ are the computed and the desired outputs, $I$ is the identity matrix, $N_{tr}$ is the number of training samples, and $\epsilon$ is a regularization term. The recurrent weights are randomly generated by a zero-mean Gaussian distribution with variance v, which essentially controls the spectral radius (SR) of the reservoir. The largest absolute eigenvalue of $W_{res}$ is proportional to v and is particularly important for the dynamical behavior of the reservoir [8,27]. $W_{in}$ is randomly drawn from a uniform distribution [−InputScalingFactor, +InputScalingFactor], which emphasizes/deemphasizes the inputs in the activation of the reservoir neurons. The significance of this parameter decreases as the reservoir size increases.
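The reservoir update of Eq. (22.1) and the ridge-regression readout of Eq. (22.4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the configuration used in the chapter: the reservoir size, spectral radius, input scaling, and regularization value are placeholder assumptions, the spectral radius is set by rescaling the Gaussian weights rather than by choosing their variance directly, and the input weights are stored as a G × K array so that the update can be written with column vectors.

```python
import numpy as np


def make_reservoir(n_inputs, n_neurons, spectral_radius=0.9, input_scaling=1.0, seed=0):
    """Draw fixed random input and recurrent weights (never trained)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-input_scaling, input_scaling, size=(n_neurons, n_inputs))
    w_res = rng.normal(0.0, 1.0, size=(n_neurons, n_neurons))
    # Rescale so that the largest absolute eigenvalue equals the requested SR.
    w_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_res)))
    return w_in, w_res


def run_reservoir(w_in, w_res, inputs):
    """Apply x(k) = tanh(W_in u(k-1) + W_res x(k-1)); inputs is T x K, output T x G."""
    states = np.zeros((len(inputs), w_res.shape[0]))
    x = np.zeros(w_res.shape[0])
    for k in range(1, len(inputs)):
        x = np.tanh(w_in @ inputs[k - 1] + w_res @ x)
        states[k] = x
    return states


def train_readout(states, targets, eps=1e-2):
    """Ridge solution W_out = (X^T X + eps I)^(-1) X^T D, as in Eq. (22.4)."""
    n_neurons = states.shape[1]
    return np.linalg.solve(states.T @ states + eps * np.eye(n_neurons), states.T @ targets)


# Toy two-class example: the class of frame k is defined by the sign of u(k-1).
u = np.concatenate([np.ones(100), -np.ones(100)])[:, None]
labels = (u[:-1, 0] < 0).astype(int)
w_in, w_res = make_reservoir(n_inputs=1, n_neurons=20)
X = run_reservoir(w_in, w_res, u)[1:]          # skip the all-zero initial state
D = -np.ones((len(labels), 2))                 # desired outputs: -1 everywhere ...
D[np.arange(len(labels)), labels] = 1.0        # ... except +1 for the true class
w_out = train_readout(X, D)
accuracy = np.mean((X @ w_out).argmax(axis=1) == labels)
```

Only `train_readout` involves learning, and it reduces to solving one linear system, which is why training a reservoir network is cheap compared with back propagation through time.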
In this work the RN is used for assigning classes to a certain sequence of features coming from audio signals. To this end it is trained so as to achieve an output state where a particular output node is high for observations of a specific class (e.g. Bowhead, Southern Right, etc.) and low for observations of any other class. Thus, the regression layer minimizes the mean squared error between $y_t$ (readout vector) and $d_t$ (desired output), where all the elements belonging to $d_t$ are −1, except the one corresponding to the desired state which is equal to +1. Following the work presented in Richard and Lippmann [22], the scaled readout $0.5 + 0.5\,y_{t,q}$ well approximates the posterior probability $P(q|u_t)$, where q corresponds to any given data class. At this point a small adjustment was introduced ensuring that the probabilities are positive. The readouts are calculated using the following formula:
\[ \max\left(\frac{y_{t,q} + 1}{2},\ \delta\right), \quad 0 < \delta \]

CHAPTER 24 APPLYING MACHINE LEARNING ALGORITHMS
24.3 STUDY AREA
36°. Eight classes (8) for aspect: 0°–45°, 45°–90°, 90°–135°, 135°–180°, 180°–225°, 225°–270°, 270°–315°, and 315°–360° (Fig. 24.5C). The plan and profile curvature were each classified into three (3) classes (< −0.25, −0.24–+0.25, > +0.26) (Fig. 24.5D–E), and finally the TWI layer was also classified into three (3) classes (10.1) (Fig. 24.5F). Concerning the distance from river network, it was classified into four (4) zones of influence: (a) 401 m (Fig. 24.5G), and finally the fault distribution was classified into four (4) categories: 751 m (Fig. 24.5H). Table 24.1 shows the calculated number of grid cells within each class of each landslide related variable and also the spatial distribution of landslides among the classes.
FIGURE 24.5 The landslide related variables of the study area: (A) Elevation, (B) Slope, (C) Aspect, (D) Plan curvature, (E) Profile curvature, (F) TWI, (G) Distance from river network, (H) Distance from faults.
As proposed by the methodology, training (70% of the total number) and validation data sets were randomly produced from the total number of landslide and non-landslide areas, the spatial distribution of which is shown in Fig. 24.6.
The next phase of the methodology involves the multi-collinearity analysis. The analysis of VIFs yields values between 1.156 and 2.300, with three values slightly above the threshold of 2. Also, the minimum value of tolerance is 0.443 (lithology), which is greater than the threshold of 0.10. These two metrics indicate that there is no serious multi-collinearity between the independent variables (Table 24.2).
The next phase involved the implementation of the two algorithms. As already discussed, the optimal performance of the SVM is strongly influenced by the selection of the parameters C and γ. A large value of the cost factor C may lead to overfitting, whereas a small value leads to underfitting [44]. Using the training data set, the tuning process estimated the best C and γ values as 1.0 and 0.50, respectively. The tuning process used the grid search method, which is widely accepted as the most reliable optimization method [54]. Fig. 24.7 illustrates the landslide susceptibility map produced by the SVM classifier.
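Returning to the multi-collinearity check mentioned above, the two metrics can be computed directly from their definitions: for each variable j, regress it on all the other variables, take the coefficient of determination $R_j^2$, and set $\mathrm{VIF}_j = 1/(1 - R_j^2)$ with tolerance $= 1/\mathrm{VIF}_j$. The NumPy sketch below is a generic illustration on synthetic data; the function name is hypothetical and the chapter does not state which software was used for this step.

```python
import numpy as np


def vif_and_tolerance(X):
    """Return (VIF, tolerance) for every column of the n x p matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other variables
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs, 1.0 / vifs


# Synthetic example: three independent variables, then a fourth that nearly copies the first.
rng = np.random.default_rng(0)
independent = rng.standard_normal((2000, 3))
vifs, tolerance = vif_and_tolerance(independent)

near_copy = independent[:, 0] + 0.1 * rng.standard_normal(2000)
collinear = np.column_stack([independent, near_copy])
vifs_c, tolerance_c = vif_and_tolerance(collinear)
```

For the independent columns the VIFs stay close to 1, while the near-duplicate column pushes the VIF of the first variable far above the thresholds quoted in the text and its tolerance below 0.10.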
To implement the RF method successfully, two quantities must be estimated: the minimum number of trees required to minimize the Out-of-Bag error, and the number of variables randomly sampled as candidates at each split. As illustrated in Fig. 24.8, the Out-of-Bag error (black line) fluctuates less once the number of trees exceeds 150, while the tuning process indicated that four was the optimal number of variables to sample at each split.
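The Out-of-Bag error relies on the fact that each tree is grown on a bootstrap sample, so a share of the training points is left out of every tree and can serve as test cases for it. The Python sketch below estimates how large that share is, assuming sampling with replacement from the 112 training points reported in the discussion. The function names and the random seed are hypothetical, and this is an illustration of the bootstrap step only, not the RF implementation used in the study.

```python
import random


def expected_oob_fraction(n_samples):
    """Probability that a given point is never drawn in
    n_samples draws with replacement."""
    return (1.0 - 1.0 / n_samples) ** n_samples


def simulated_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of points left out of each tree's bootstrap sample."""
    rng = random.Random(seed)
    total_left_out = 0
    for _ in range(n_trees):
        # Indices drawn (with replacement) for one tree; the set keeps unique ones.
        drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
        total_left_out += n_samples - len(drawn)
    return total_left_out / (n_samples * n_trees)


n_train = 112  # training points per model, as reported in the discussion
n_trees = 150  # tree count beyond which the OOB error is reported to settle

expected = expected_oob_fraction(n_train)
simulated = simulated_oob_fraction(n_train, n_trees)
print(round(expected, 4), round(simulated, 4))
```

Roughly a third of the training points are out-of-bag for any single tree, which is why the error estimate becomes steadier as more trees are added.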
CHAPTER 24 APPLYING MACHINE LEARNING ALGORITHMS
Table 24.1 Spatial Relation Between Each Landslide Related Variable and Landslides
Landslide Related Variable | Class | Number of Grid Cells | Number of Landslides
Lithology | Loose fine grained sediments | 50,281 | 1
Lithology | Loose coarse grained deposits | 573,094 | 11
Lithology | Loose deposits of mixed phases | 431,606 | 4
Lithology | Cohesive, coarse grained formations | 318,123 | 2
Lithology | Cohesive formations of mixed phases | 40,294 | 1
Lithology | Coarse grained sediments | 743,141 | 5
Lithology | Fine grained sediments | 799,042 | 22
Lithology | Flysch formations | 453,295 | 9
Lithology | Cretaceous limestones | 1,434,101 | 20
Lithology | Triassic–Jurassic limestones | 3369 | 0
Lithology | Schist, sandstones and cherts | 318,015 | 5
Elevation | … | 1,041,366 | 12
Elevation | … | 938,977 | 10
Elevation | … | 1,138,930 | 34
Elevation | … | 1,290,998 | 21
Elevation | … 1151 m | 754,090 | 3
Slope | … | 1,484,026 | 15
Slope | … | 611,145 | 8
Slope | … | 1,394,526 | 15
Slope | … | 1,444,144 | 33
Slope | … 36° | 230,520 | 9
Aspect | 0°–45° | 868,838 | 27
Aspect | 46°–90° | 767,334 | 8
Aspect | 91°–135° | 524,577 | 6
Aspect | 136°–180° | 345,562 | 7
Aspect | 181°–225° | 372,906 | 3
Aspect | 226°–275° | 590,337 | 4
Aspect | 276°–315° | 798,803 | 9
Aspect | 316°–360° | 896,004 | 16
TWI | … | 2,180,064 | 37
TWI | … | 2,246,282 | 37
TWI | … 10.1 | 738,015 | 6
Plan curvature | < −0.25 | 386,078 | 4
Plan curvature | −0.24–+0.25 | 4,312,900 | 67
Plan curvature | > +0.26 | 465,383 | 9
Profile curvature | < −0.25 | 471,320 | 8
Profile curvature | −0.24–+0.25 | 4,197,608 | 62
Profile curvature | > +0.26 | 495,433 | 10
Distance from faults | … | 1,236,630 | 19
Distance from faults | … | 911,522 | 16
Distance from faults | … | 637,235 | 14
Distance from faults | … 751 m | 2,378,974 | 31
Distance from river network | … | 1,640,425 | 22
Distance from river network | … | 1,365,154 | 23
Distance from river network | … | 1,034,518 | 19
Distance from river network | … 301 m | 1,124,264 | 16
FIGURE 24.6 The spatial distribution of landslide and non-landslide areas of the training and validation data sets.
Insights into the influence of the landslide related variables on the predicted stability condition of the research area were obtained from the RF model. Specifically, Fig. 24.9 illustrates the nine variables ordered by mean decrease in accuracy and mean decrease in Gini value. The mean decrease in Gini measures how much each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest, while the mean decrease in accuracy caused by a variable is determined during the Out-of-Bag error calculation phase. The more the accuracy of the RF decreases when a variable is excluded, the more important that variable is assumed to be; thus
Table 24.2 Collinearity Statistics Results
Variable | Tolerance | VIF
Lithology | 0.443 | 2.258
Slope | 0.460 | 2.173
Aspect | 0.786 | 1.272
Elevation | 0.435 | 2.300
Plan curvature | 0.847 | 1.181
Profile curvature | 0.782 | 1.279
Topographic wetness index | 0.539 | 1.856
Distance from river network | 0.724 | 1.381
Distance from faults | 0.865 | 1.156
FIGURE 24.7 The SVM landslide susceptibility map.
variables with a large mean decrease in accuracy are more important. According to those two metrics, the three most important variables are elevation, aspect, and lithology. Fig. 24.10 illustrates the landslide susceptibility map constructed with the RF method. Visual analysis of the map suggests that it follows the pattern of altitude, lithology, and distance to the river network. Zones of high and very high susceptibility are located along the road network, mainly in the western and eastern mountainous areas, while the central area is characterized by very low to low susceptibility values. The results of the two models were validated on the validation data set using the confusion matrix, ROC graphs, which are summarized by the calculation of
FIGURE 24.8 The optimal number of trees.
FIGURE 24.9 Mean decrease accuracy and mean decrease Gini.
FIGURE 24.10 The RF landslide susceptibility map.
Table 24.3 Models Validation – Statistical Evaluation Measures
Performance Index | SVM | RF
True positive | 22 | 23
False positive | 3 | 1
True negative | 21 | 24
False negative | 2 | 0
Accuracy | 0.8958 | 0.9792
Precision | 0.9167 | 0.9583
Recall | 0.8800 | 1.0000
F-measure | 0.8980 | 0.9787
Cohen’s kappa index | 0.7916 | 0.9583
AUC values, and the calculation of the Cohen’s kappa index, an index that expresses the reliability of the models [20,105] (Table 24.3). Both classifiers have good prediction capabilities. The higher accuracy was achieved by the RF classifier (0.9792), while the SVM algorithm achieved a lower value (0.8958). The RF classifier also showed the higher AUC value (0.9831), with the SVM algorithm slightly lower (0.9531). According to the Cohen’s kappa index, the performance of the SVM algorithm is rated as substantial (0.7916) and that of the RF algorithm as almost perfect (0.9583).
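The measures in Table 24.3 follow from the four confusion-matrix counts. The Python sketch below recomputes them under the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN)) and labels the kappa value with the widely used Landis and Koch agreement bands. The function names are hypothetical, and it is an assumption that the chapter's "substantial" and "almost perfect" labels refer to that scale.

```python
def evaluate(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


def kappa_band(kappa):
    """Landis and Koch agreement bands (assumed to be the scale in use)."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Counts as listed in Table 24.3.
svm = evaluate(tp=22, fp=3, tn=21, fn=2)
rf = evaluate(tp=23, fp=1, tn=24, fn=0)

for name, scores in (("SVM", svm), ("RF", rf)):
    rounded = {key: round(value, 4) for key, value in scores.items()}
    print(name, rounded, kappa_band(scores["kappa"]))
```

The accuracy, F-measure, and kappa values match the table to rounding. Under these definitions the SVM counts give a precision of 0.8800 and a recall of 0.9167, the reverse of the order printed in Table 24.3, so the two labels (or the false positive and false negative counts) may be transposed there.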
FIGURE 24.11 The success rate and prediction rate curves of the SVM and RF models.
The next step was to estimate how well the two models had classified the research area, according to the landslide susceptibility classes and the cumulative percentage of observed landslide occurrence. The validation was performed by comparing each landslide susceptibility map with the actual landslide locations, using the success rate and the prediction rate methods. Fig. 24.11 illustrates the success and prediction rate curves for the two models. The AUC values of the two models were of similar magnitude. The efficiency of the RF classifier was the higher of the two (AUC = 0.8328), followed by the SVM classifier (AUC = 0.7917). The predictive power of the RF classifier was also the higher of the two (AUC = 0.7198), followed by the SVM classifier (AUC = 0.6581). When applying RF analysis, the percentage of landslides located within the zones of high and very high susceptibility is estimated to be 78.75%, while for RF the percentage is 67.50% (Fig. 24.12).
Performing the Wilcoxon signed-rank test at the 0.05 significance level (95% confidence), the p-value was estimated to be 0.000 (less than 0.05), while the z value (−4.615) fell outside the critical range of −1.96 to +1.96, indicating that the performance of the two susceptibility models was significantly different.
To assess further the susceptibility values produced by the two models, 1000 random points covering the entire research area were generated and their susceptibility values were extracted. The SVM model appeared to produce higher susceptibility values than the RF model: at approximately 60% of the 1000 points, the SVM value was the higher of the two. The linear regression analysis and the accompanying analysis of variance revealed moderate evidence of linear correlation between the two landslide susceptibility maps, with a p-value below 0.0001 and an R² value of 0.401.
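The success rate and prediction rate curves mentioned above are usually built the same way: cells are ranked from most to least susceptible, and the cumulative share of landslide cells is plotted against the cumulative share of area. In the common convention the success rate uses the training landslides and the prediction rate uses the validation landslides. The Python sketch below shows the construction and the trapezoidal area under the curve. The function names and the ten-cell example are hypothetical, and ties in susceptibility are not treated specially.

```python
def rate_curve(susceptibility, is_landslide):
    """Cumulative share of landslide cells captured as the map is swept
    from the most susceptible cell to the least susceptible one."""
    order = sorted(range(len(susceptibility)),
                   key=lambda i: susceptibility[i], reverse=True)
    total_cells = len(order)
    total_slides = sum(is_landslide)
    xs, ys = [0.0], [0.0]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += is_landslide[i]
        xs.append(rank / total_cells)      # cumulative share of area
        ys.append(captured / total_slides)  # cumulative share of landslides
    return xs, ys


def area_under_curve(xs, ys):
    """Trapezoidal rule over consecutive curve points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


# Hypothetical ten-cell map: susceptibility score and landslide flag per cell.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
slides = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

xs, ys = rate_curve(scores, slides)
auc = area_under_curve(xs, ys)
print(round(auc, 4))
```

An area close to 1 means that most landslides fall in the small, most susceptible part of the map, while an area near 0.5 is what a random ranking would give.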
FIGURE 24.12 The percentage of landslides in each landslide susceptibility class.
The R² value indicates that 40.10% of the variability in the SVM model can be explained by variation in the RF model (Fig. 24.13A). Fig. 24.13B plots the landslide susceptibility values estimated by the two models for the training data set only. Two points stand out. First, there is a clear trend in the low values: where the RF model produces low values, the SVM model produces much higher ones, an indication of overestimation and of a reduced ability to identify non-landslide locations; linearity between the two models is rather low here (R² = 0.3414). Second, the two models predict the high values, which correspond to landslide locations, in a much more similar manner, and linearity between them is higher (R² = 0.4972).
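For a simple linear regression with one predictor, R² is the share of variance explained by the fitted line and equals the squared Pearson correlation, so the reported R² of 0.401 corresponds to a correlation of about 0.63 in absolute value. The Python sketch below computes R² from paired susceptibility values. The function name and the five sample pairs are hypothetical and are not taken from the study.

```python
import math


def r_squared(x, y):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


# Hypothetical paired susceptibility values at five sample points.
rf_values = [0.1, 0.2, 0.3, 0.4, 0.5]
svm_values = [0.3, 0.2, 0.5, 0.4, 0.6]

r2 = r_squared(rf_values, svm_values)
implied_r = math.sqrt(0.401)  # |r| implied by the reported R² of 0.401
print(round(r2, 4), round(implied_r, 3))
```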
24.4 DISCUSSION
The main objective of the study was to predict trends and patterns in response to the evolution of landslide processes and to produce a landslide susceptibility map by applying machine learning methods within a GIS environment. Specifically, the SVM and RF methods were used because of their ability to model non-linear relationships among the landslide related variables. The results obtained on the validation data set showed that the RF model had better generalization power, since its AUC value, when comparing the produced landslide susceptibility map with the actual landslide locations, was higher than that of the SVM model. This outcome may have been influenced by the lack of a sufficient number of training data and by the parameter tuning process of the SVM. According to Jain et al. [51], as a rule of thumb, a minimum of 10 · d · C training samples is required for a d-dimensional classification problem with C classes. The higher the dimensionality d, the higher the complexity of the model
FIGURE 24.13 (A) SVM vs. RF (1000 random points), (B) SVM vs. RF (landslide and non-landslide).
and the larger the volume of training data needed [110]. The nature of the landslide phenomenon often makes it difficult to obtain the required number of training data, and thus the complexity of the model, which depends on the number of variables used, must be tuned to an optimal level by selecting the most appropriate variables [15,50,22]. Each model in our case had 112 training points, considerably fewer than the number suggested by the above rule of thumb (at least 180 training points, i.e., 10 × 9 × 2 for nine variables and two classes). In addition, the RF has no significant parameters to tune, whereas the SVM is strongly influenced by the selection of the parameters C and γ.
According to the estimated mean decrease in accuracy and mean decrease in Gini value, the three most important variables are elevation, aspect, and lithology. This outcome is consistent with prior knowledge concerning landslide susceptibility in Greece [60,97,111]. Specifically, the elevation of a surface is considered to be formed by the combined action of tectonic activity, weathering, and erosion processes, and is also related to climatic conditions through a complex interactive influence [24]. Altitude can therefore be considered a variable that contributes indirectly to the manifestation of slope failure. The descriptive analysis performed in our study showed that areas with elevations greater than 350 m experience a considerably higher chance of landslide occurrence. In areas between 351 and 650 m in particular, which are covered mainly by Plio-Pleistocene sediments, the observed landslides accounted for 42.5% of the total. Moreover, previous studies in Greece have shown that slides are usually abundant on N, NNE, and SSW oriented slopes, a fact attributed mainly to climatic factors [57,59,94]. Certain slope orientations are associated with increased snow concentrations and consequently longer periods of freeze and thaw action.
These slopes can favor higher erosion and weathering processes as the climatic conditions facilitate the cyclic alternation of dry and wet periods. The SE–SW (135°–225°) oriented slopes are mostly affected by rainfalls and the NNW–NNE (315°–045°) oriented slopes are
mostly sunless, affected by the increased snow concentrations. In the research area, approximately 85% of the total landslides were found on the SE–SW and NNW–NNE oriented slopes.
Finally, regarding lithology, a high percentage of landslide occurrence is observed in Plio-Pleistocene sediments, flysch formations, loose fine and coarse grained deposits, and Cretaceous limestones, findings that agree with previous studies [93,96]. Fine-grained Plio-Pleistocene sediments, which consist of alternations of clayey marls, marls, silty sands, and weak sandstones, appear to be much more susceptible to rotational slides, earth flows, and creep movements, while the evolution of these types of landslides is largely influenced by the heterogeneous structure and the degree of looseness. In areas covered by flysch formations, the manifestation of rotational and translational slides, creep movements, and also rockfalls is attributed mainly to the anisotropic geotechnical behavior of the formation. Flysch formations are characterized by intensively folded sediments and in places exhibit a considerable thickness of weathering mantle. The Cretaceous limestones appear susceptible to rockfalls, which are influenced by the degree of weathering and fragmentation, the orientation of the discontinuity surfaces, and the intense morphological relief.
24.5 CONCLUSIONS
In the present study, an SVM and an RF model were applied to construct landslide susceptibility maps in NW Peloponnese, Greece. A total of nine conditioning factors were analyzed, namely: lithology, elevation, slope, aspect, plan curvature, profile curvature, TWI, distance to rivers, and distance to faults. The inventory database contained 80 landslide locations, which were divided into two subsets, one for training (70% of the total number of areas) and one for validating the models, and 80 non-landslide locations, which were likewise partitioned into training and validation data sets. In the preprocessing phase, the multi-collinearity analysis revealed no serious multi-collinearity among the nine conditioning variables, while according to the outcomes of the RF model, elevation, aspect, and lithology were ranked as the three most influential variables contributing to landslide manifestation in the research area. The non-parametric analysis also revealed a statistically significant difference between the produced susceptibility maps. The most accurate model was the RF, which correctly identified 97.92% of the instances during the validation phase, followed by the SVM (89.58%). The area under the ROC curve for the RF was calculated to be 0.9831, while the SVM model showed a slightly lower value of 0.9531. The RF also proved to have a greater ability to generalize, since its AUC values, when comparing the produced landslide susceptibility map with the actual landslide locations, were higher than those of the SVM. Concerning the potential linear correlation between the two models, the analysis revealed moderate evidence of linear correlation, with 40.10% of the variability in the SVM model explained by variation in the RF model.
Finally, a clear trend was found in the values predicted by the SVM model: the SVM model tends to overestimate susceptibility in non-landslide areas, providing higher values than the RF model, while the two models predict the landslide areas in a much more similar manner.
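The data partition summarized above (80 landslide and 80 non-landslide locations, 70% of each used for training) can be sketched as a stratified random split. The Python example below is a minimal illustration under that reading. The function name, the integer identifiers, and the seed are hypothetical, and the authors' actual sampling procedure may differ.

```python
import random


def stratified_split(n_landslide, n_non_landslide, train_share=0.70, seed=42):
    """Split each class separately so both subsets keep the class balance."""
    rng = random.Random(seed)
    train, validation = [], []
    for label, count in ((1, n_landslide), (0, n_non_landslide)):
        # Each location is represented as (class label, index within class).
        ids = [(label, i) for i in range(count)]
        rng.shuffle(ids)
        cut = round(train_share * count)
        train.extend(ids[:cut])
        validation.extend(ids[cut:])
    return train, validation


train, validation = stratified_split(80, 80)
train_landslides = sum(label for label, _ in train)
print(len(train), len(validation), train_landslides)
```

This reproduces the 112 training points per model mentioned in the discussion and leaves 48 locations for validation, the same total as the confusion-matrix counts in Table 24.3.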
REFERENCES
[1] E. Aguado, J. Burt, Understanding Weather and Climate, sixth edition, Prentice Hall, Upper Saddle River, New Jersey, 2012, p. 576.
[2] A. Akgun, C. Kincal, B. Pradhan, Application of remote sensing data and GIS for landslide risk assessment as an environmental threat to Izmir city (west Turkey), Environ. Monit. Assess. 184 (2012) 5453–5470.
[3] P. Aleotti, R. Chowdhury, Landslide hazard assessment: summary review and new perspectives, Bull. Eng. Geol. Environ. 58 (1) (1999) 21–44.
[4] J. Aubouin, Contribution à l’étude géologique de la Grèce septentrionale: les confins de l’Épire et de la Thessalie, Ann. Géol. Pays Hellén. 10 (1959) 1–483.
[5] L. Ayalew, H. Yamagishi, The application of GIS-based logistic regression for landslide susceptibility mapping in the Kakuda-Yahiko Mountains, Central Japan, Geomorphology 65 (2005) 15–31.
[6] C. Ballabio, S. Sterlacchini, Support vector machines for landslide susceptibility mapping: the Staffora River basin case study, Italy, Math. Geosci. 44 (1) (2012) 47–70.
[7] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[8] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, 1984.
[9] A. Brenning, Spatial prediction models for landslide hazards: review, comparison and evaluation, Nat. Hazards Earth Syst. Sci. 5 (2005) 853–862.
[10] J.H. Brunn, Contribution à l’étude géologique du Pinde septentrional et d’une partie de la Macédoine occidentale, Ann. Géol. Pays Hellén., 1re Série, vol. 7, 1956, 358 pp., 20 pl.
[11] T. Can, H.A. Nefeslioglu, C. Gokceoglu, H. Sonmez, T.Y. Duman, Susceptibility assessments of shallow earthflows triggered by heavy rainfall at three catchments by logistic regression analysis, Geomorphology 72 (1–4) (2005) 250–271.
[12] D. Caniani, S. Pascale, F. Sdao, A. Sole, Neural networks and landslide susceptibility: a case study of the urban area of Potenza, Nat. Hazards 45 (2008) 55–72.
[13] A. Carrara, F. Guzzetti, M. Cardinali, P. Reichenbach, Use of GIS technology in the prediction and monitoring of landslide hazard, Nat. Hazards 20 (2–3) (1999) 117–135.
[14] F. Catani, D. Lagomarsino, S. Segoni, V. Tofani, Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues, Nat. Hazards Earth Syst. Sci. 13 (2013) 2815–2831.
[15] J. Chacon, C. Irigaray, T. Fernandez, R. El Hamdouni, Engineering geology maps: landslides and geographical information systems, Bull. Eng. Geol. Environ. 65 (2006) 341–411.
[16] W. Chen, H. Chai, Z. Zhao, Q. Wang, H. Hong, Landslide susceptibility mapping based on GIS and support vector machine models for the Qianyang County, China, Environ. Earth Sci. 75 (2016) 474, http://dx.doi.org/10.1007/s12665-015-5093-0.
[17] V. Cherkassky, F. Mulier, Learning from Data: Concepts, Theory, and Methods, Wiley, New York, 2007.
[18] J. Choi, H.J. Oh, J.S. Won, S. Lee, Validation of an artificial neural network model for landslide susceptibility mapping, Environ. Earth Sci. 60 (2010) 473–483.
[19] C.J.F. Chung, A.G. Fabbri, Validation of spatial prediction models for landslide hazard mapping, Nat. Hazards 30 (3) (2003) 451–472.
[20] J. Cohen, A coefficient of agreement for nominal scales, Educ. Psychol. Meas. 20 (1) (1960) 37–46, http://dx.doi.org/10.1177/001316446002000104.
[21] M. Conforti, S. Pascale, G. Robustelli, F. Sdao, Evaluation of prediction capability of the artificial neural networks for mapping landslide susceptibility in the Turbolo River catchment (northern Calabria, Italy), Catena 113 (2014) 236–250.
[22] D. Costanzo, E. Rotigliano, C. Irigaray, J.D. Jiménez-Perálvarez, J. Chacón, Factors selection in landslide susceptibility modelling on large scale following the GIS matrix method: application to the River Beiro Basin (Spain), Nat. Hazards Earth Syst. Sci. 12 (2012) 327–340.
[23] D.M. Cruden, D.J. Varnes, Landslide types and processes, in: A.K. Turner, R.L. Schuster (Eds.), Landslides: Investigation and Mitigation, in: Transp Res Board, Spec Rep, vol. 247, 1996, pp. 36–75.
[24] F.C. Dai, C.F. Lee, Y.Y. Ngai, Landslide risk assessment and management: an overview, Eng. Geol. 64 (1) (2002) 65–87.
[25] R. Dikau, D. Brunsden, L. Schrott, M. Ibsen, Landslide Recognition. Identification, Movement and Causes, Wiley & Sons, Chichester, 1996, p. 274.
[26] C.F. Dormann, J. Elith, S. Bacher, C. Buchmann, G. Carl, G. Carré, J.R.G. Marquéz, B. Gruber, B. Lafourcade, P.J. Leitão, T. Münkemüller, C. McClean, P.E. Osborne, B. Reineking, B. Schröder, A.K. Skidmore, D. Zurell, S. Lautenbach, Collinearity: a review of methods to deal with it and a simulation study evaluating their performance, Ecography 36 (2013) 27–46.
[27] M. Ercanoglu, C. Gokceoglu, Assessment of landslide susceptibility for a landslide prone area (north of Yenice, NW Turkey) by fuzzy approach, Environ. Geol. 41 (2002) 720–730.
[28] M. Ercanoglu, C. Gokceoglu, Use of fuzzy relations to produce landslide susceptibility map of a landslide prone area West Black Sea region, Turkey, Eng. Geol. 75 (3–4) (2004) 229–250.
[29] L. Ermini, F. Catani, N. Casagli, Artificial neural networks applied to landslide susceptibility assessment, Geomorphology 66 (2005) 327–343.
[30] ESRI, ArcGIS Desktop: Release 10.1, Environmental Systems Research Institute, Redlands, CA, 2013.
[31] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27 (2006) 861–874.
[32] B. Feizizadeh, T. Blaschke, GIS-multicriteria decision analysis for landslide susceptibility mapping: comparing three methods for the Urmia lake basin, Iran, Nat. Hazards 65 (3) (2013) 2105–2128.
[33] B. Feizizadeh, T. Blaschke, M.S. Roodposhti, Integrating GIS based fuzzy set theory in multicriteria evaluation methods for landslide susceptibility mapping, Int. J. Geoinf. 9 (3) (2013) 49–57.
[34] B. Feizizadeh, M.S. Roodposhti, P. Jankowski, T. Blaschke, A GIS-based extended fuzzy multi-criteria evaluation for landslide susceptibility mapping, Comput. Geosci. 73 (2014) 208–221.
[35] A.M. Felicisimo, A. Cuartero, J. Remondo, E. Quiros, Mapping landslide susceptibility with logistic regression, multiple adaptive regression splines, classification and regression trees, and maximum entropy methods: a comparative study, Landslides 10 (2) (2013) 175–189.
[36] R. Fell, J. Corominas, C. Bonnard, L. Cascini, E. Leroi, W. Savage, Guidelines for landslide susceptibility, hazard and risk zoning for land-use planning, Eng. Geol. 102 (2008) 99–111.
[37] P. Flentje, D. Stirling, R.N. Chowdhury, Landslide susceptibility and hazard derived from a landslide inventory using data mining – an Australian case study, in: Proceedings of the First North American Landslide Conference, Landslides and Society: Integrated Science, Engineering, Management and Mitigation, 2007, pp. 1–10.
[38] J.N. Goetz, A. Brenning, H. Petschko, P. Leopold, Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling, Comput. Geosci. 81 (2015) 1–11.
[39] F. Guzzetti, A. Carrara, M. Cardinali, P. Reichenbach, Landslide hazard evaluation: a review of current techniques and their application in a multi-scale study, Central Italy, Geomorphology 31 (1999) 181–216.
[40] F. Guzzetti, P. Reichenbach, M. Cardinali, M. Galli, F. Ardizzone, Probabilistic landslide hazard assessment at the basin scale, Geomorphology 72 (2005) 272–299.
[41] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, ISBN 0-387-95284-5, 2008.
[42] H. Hong, S.A. Naghibi, H. Pourghasemi, B. Pradhan, GIS-based landslide spatial modeling in Ganzhou City, China, Arabian J. Geosci. 9 (2016) 112, http://dx.doi.org/10.1007/s12517-015-2094-y.
[43] H. Hong, H.R. Pourghasemi, Z.S. Pourtaghi, Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models, Geomorphology 259 (2016) 105–118.
[44] H. Hong, B. Pradhan, C. Xu, D. Tien Bui, Spatial prediction of landslide hazard at the Yihuang area (China) using two-class kernel logistic regression, alternating decision tree and support vector machines, Catena 133 (2015) 266–281.
[45] J.N. Hutchinson, Keynote paper: landslide hazard assessment, in: Proceedings 6th International Symposium on Landslides, Christchurch, Balkema, Rotterdam, 1995, pp. 1805–1841.
[46] IGME, Geological map of Greece at a scale of 1:50,000, Patras sheet, Athens, 1980.
[47] IGME, Geological map of Greece, at a scale of 1:50,000, Aigion sheet, Athens, 2005.
[48] I. Ilia, P. Tsangaratos, Applying weight of evidence method and sensitivity analysis to produce a landslide susceptibility map, Landslides 13 (2) (2016) 379–397.
[49] G. Koukis, Macroseismic observations and geotechnical foundation conditions in the region of Aitoloakarnania followed the earthquakes of March–April 1983, Bull. KEDE 3 (1983) 199–207 (in Greek).
[50] C. Irigaray, T. Fernández, R. El Hamdouni, J. Chacón, Evaluation and validation of landslide-susceptibility maps obtained by a GIS matrix method: examples from the Betic Cordillera (southern Spain), Nat. Hazards 41 (2007) 61–79.
[51] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 4–37.
[52] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, 2013, pp. 316–321.
[53] P.D. Jones, I. Harris, Climatic Research Unit (CRU) Time-Series Data Sets of Variations in Climate with Variations in Other Phenomena, University of East Anglia Climatic Research Unit, NCAS British Atmospheric Data Centre, 2008.
[54] T. Kavzoglu, I. Colkesen, A kernel functions analysis for support vector machines for land cover classification, Int. J. Appl. Earth Obs. Geoinf. 11 (2009) 352–359.
[55] P. Kayastha, M.R. Dhital, F. De Smedt, Landslide susceptibility mapping using the weight of evidence method in the Tinau watershed, Nepal, Nat. Hazards 63 (2) (2012) 479–498.
[56] O. Korup, A. Stolle, Landslide prediction from machine learning, Geol. Today 30 (1) (2014) 26–33.
[57] G. Koukis, C. Ziourkas, Slope instability phenomena in Greece: a statistical analysis, Bull. Int. Assoc. Eng. Geol. 43 (1) (1991) 47–60.
[58] G. Koukis, D. Rozos, Geotechnical conditions and landslide phenomena in Greek territory, in relation with geological structure and geotectonic evolution, Mineral Wealth 16 (1982) 53–69 (in Greek, with summary in English).
[59] G. Koukis, D. Rozos, I. Hatzinakos, Relationship between rainfall and landslides in the formations of Achaia County, Greece, in: Proc of International Symposium of IAEG in Engineering Geology and the Environment, Balkema, Rotterdam, vol. 1, 1997, pp. 793–798.
[60] G. Koukis, N. Sabatakakis, N. Nikolaou, C. Loupasakis, Landslides hazard zonation in Greece, in: Proc of Open Symp. on Landslides Risk Analysis and Sustainable Disaster Management by International Consortium on Landslides, Washington USA, 2005, pp. 291–296, Chapter 37.
[61] I.K. Koukouvelas, T. Doutsos, The effects of active faults on the generation of landslides in NW Peloponnese, Greece, in: Proc of International Symposium of IAEG in Engineering Geology and the Environment, Balkema, Rotterdam, vol. 1, 1997, pp. 799–804.
[62] M. Kouli, C. Loupasakis, P. Soupios, D. Rozos, F. Vallianatos, Landslide susceptibility mapping by comparing the WLC and WofE multi-criteria methods in the West Crete Island, Greece, Environ. Earth Sci. 72 (12) (2014) 5197–5219.
[63] S. Lee, Application of likelihood ratio and logistic regression models to landslide susceptibility mapping using GIS, Environ. Manag. 34 (2004) 223–232.
[64] M. Marjanović, M. Kovacevic, B. Bajat, V. Vozenilek, Landslide susceptibility assessment using SVM machine learning algorithm, Eng. Geol. 123 (2011) 225–234.
[65] D. Marquardt, Generalized inverses, ridge regression, biased linear estimation, and non-linear estimation, Technometrics 12 (1970) 605–607.
[66] C. Melchiorre, M. Matteucci, A. Azzoni, A. Zanchi, Artificial neural networks and cluster analysis in landslide susceptibility zonation, Geomorphology 94 (3–4) (2008) 379–400.
[67] N. Micheletti, L. Foresti, S. Robert, M. Leuenberger, A. Pedrazzini, M. Jaboyedoff, M. Kanevski, Machine learning feature selection methods for landslide susceptibility mapping, Math. Geosci. 46 (2014) 33–57, http://dx.doi.org/10.1007/s11004-013-9511-0.
[68] A.S. Miner, P. Vamplew, D.J. Windle, P. Flentje, P. Warner, A comparative study of various data mining techniques as applied to the modeling of landslide susceptibility on the Bellarine Peninsula, Victoria, Australia, in: A.L. Williams, G.M. Pinches, C.Y. Chin, T.J. McMorran (Eds.), Geologically Active, CRC Press, New York, NY, USA, 2010, p. 352.
[69] J.M. Moguerza, A. Munoz, Support vector machines with applications, Stat. Sci. 21 (2006) 322–336, http://dx.doi.org/10.1214/088342306000000493.
[70] K. Muthu, M. Petrou, C. Tarantino, P. Blonda, Landslide possibility mapping using fuzzy approaches, IEEE Trans. Geosci. Remote Sens. 46 (2008) 1253–1265.
[71] NEAK, New Hellenic Anti-Seismic Code, Athens, 2004.
[72] H.A. Nefeslioglu, C. Gokceoglu, H. Sonmez, An assessment on the use of logistic regression and artificial neural networks with different sampling strategies for the preparation of landslide susceptibility maps, Eng. Geol. 97 (2008) 171–191.
[73] H.A. Nefeslioglu, E. Sezer, C. Gokceoglu, A.S. Bozkir, T.Y. Duman, Assessment of landslide susceptibility by decision trees in the metropolitan area of Istanbul, Turkey, Math. Probl. Eng. (2010) 901095, http://dx.doi.org/10.1155/2010/901095.
[74] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison–Wesley/Pearson Education, Harlow, England, 2002, p. 394.
[75] H.J. Oh, B. Pradhan, Application of a neuro-fuzzy model to landslide susceptibility mapping in a tropical hilly area, Comput. Geosci. 37 (3) (2011) 1264–1276.
[76] B.T. Pham, B. Pradhan, D. Tien Bui, I. Prakash, M.B. Dholakia, A comparative study of different machine learning methods for landslide susceptibility assessment: a case study of Uttarakhand area, Environ. Model. Softw. 84 (2016) 240–250.
[77] C.P. Poudyal, C. Chang, H.J. Oh, S. Lee, Landslide susceptibility maps comparing frequency ratio and artificial neural networks: a case study from the Nepal Himalaya, Environ. Earth Sci. 61 (5) (2010) 1049–1064.
[78] H.R. Pourghasemi, H.R. Moradi, S.M. Fatemi Aghda, Landslide susceptibility mapping by binary logistic regression, analytical hierarchy process, and statistical index models and assessment of their performances, Nat. Hazards 69 (1) (2013) 749–779.
[79] H.R. Pourghasemi, A.G. Jirandeh, B. Pradhan, C. Xu, C. Gokceoglu, Landslide susceptibility mapping using support vector machine and GIS at the Golestan Province, Iran, J. Earth Syst. Sci. 122 (2) (2013) 349–369.
[80] H.R. Pourghasemi, M. Mohammady, B. Pradhan, Landslide susceptibility mapping using index of entropy and conditional probability models in GIS: Safarood Basin, Iran, Catena 97 (2012) 71–84.
[81] H.R. Pourghasemi, B. Pradhan, C. Gokceoglu, Application of fuzzy logic and analytical hierarchy process (AHP) to landslide susceptibility mapping at Haraz watershed, Iran, Nat. Hazards 63 (2) (2012) 965–996.
[82] B. Pradhan, Use of GIS-based fuzzy logic relations and its cross application to produce landslide susceptibility maps in three test areas in Malaysia, Environ. Earth Sci. 63 (2) (2011) 329–349.
[83] B. Pradhan, Manifestation of an advanced fuzzy logic model coupled with geoinformation techniques for landslide susceptibility analysis, Environ. Ecol. Stat. 18 (3) (2011) 471–493.
[84] B. Pradhan, S. Lee, Delineation of landslide hazard areas on Penang Island, Malaysia, by using frequency ratio, logistic regression, and artificial neural network models, Environ. Earth Sci. 60 (2010) 1037–1054.
[85] B. Pradhan, S. Lee, Landslide susceptibility assessment and factor effect analysis: back-propagation artificial neural networks and their comparison with frequency ratio and bivariate logistic regression modelling, Environ. Model. Softw. 25 (2010) 747–759.
[86] B. Pradhan, S. Lee, Regional landslide susceptibility analysis using back-propagation neural network model at Cameron Highland, Malaysia, Landslides 7 (1) (2010) 13–30.
[87] B. Pradhan, S. Lee, M.F. Buchroithner, A GIS-based back-propagation neural network model and its cross application and validation for landslide susceptibility analyses, Comput. Environ. Urban Syst. 34 (2010) 216–235.
[88] B. Pradhan, E.A. Sezer, C. Gokceoglu, M.F. Buchroithner, Landslide susceptibility mapping by neuro-fuzzy approach in a landslide-prone area (Cameron Highlands, Malaysia), IEEE Trans. Geosci. Remote Sens. 48 (12) (2010) 4164–4177.
[89] D.H. Radbruch-Hall, D.J. Varnes, W.Z. Savage, Gravitational spreading of steep-sided ridges (“sackung”) in Western United States, Bull. Int. Assoc. Eng. Geol. 14 (1976) 23–35.
[90] N.R. Regmi, J.R. Giardino, E.V. McDonald, J.D. Vitek, A comparison of logistic regression-based models of susceptibility to landslides in western Colorado, USA, Landslides 11 (2014) 247–262.
[91] N.R. Regmi, J.R. Giardino, J.D. Vitek, Modeling susceptibility to landslides using the weight of evidence approach: Western Colorado, USA, Geomorphology 115 (2010) 172–187.
[92] N.R. Regmi, J.R. Giardino, J.D. Vitek, Assessing susceptibility to landslides: using models to understand observed changes in slopes, Geomorphology 122 (2010) 25–38.
[93] D. Rozos, Engineering-Geological Conditions in the Achaia County. Geomechanical Characteristics of the Plio-Pleistocene Sediments, PhD Thesis, University of Patras, Patras, 1989, p. 453 (in Greek, with extensive summary in English).
[94] D. Rozos, L. Pyrgiotis, S. Skias, P. Tsagaratos, An implementation of rock engineering system for ranking the instability potential of natural slopes in Greek territory: an application in Karditsa County, Landslides 5 (3) (2008) 261–270.
[95] D. Rozos, P. Tsagaratos, K. Markantonis, S. Skias, An application of rock engineering system (RES) method for ranking the instability potential of natural slopes in Achaia County, Greece, in: Proc. of XIth International Congress of the Society for Mathematical Geology, University of Liege, Belgium, 2006, pp. S08–S10.
[96] D. Rozos, G.D. Bathrellos, H.D. Skillodimou, Comparison of the implementation of rock engineering system and analytic hierarchy process methods, upon landslide susceptibility mapping, using GIS: a case study from the Eastern Achaia County of Peloponnesus, Greece, Environ. Earth Sci. 63 (2011) 49–63.
[97] N. Sabatakakis, G. Koukis, E. Vassiliades, S. Lainas, Landslide susceptibility zonation in Greece, Nat. Hazards 65 (1) (2013) 523–543.
[98] H. Saito, D. Nakayama, H. Matsuyama, Comparison of landslide susceptibility based on a decision-tree model and actual landslide occurrence: the Akaishi mountains, Japan, Geomorphology 109 (3–4) (2009) 108–121.
[99] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[100] F. Sdao, D.S. Lioi, S. Pascale, D. Caniani, I.M. Mancini, Landslide susceptibility assessment by using a neuro-fuzzy model: a case study in the Rupestrian heritage rich area of Matera, Nat. Hazards Earth Syst. Sci. 13 (2013) 395–407.
[101] A.E. Sezer, B. Pradhan, C. Gokceoglu, Manifestation of an adaptive neuro-fuzzy model on landslide susceptibility mapping: Klang valley, Malaysia, Expert Syst. Appl. 38 (7) (2011) 8208–8219.
[102] A.N. Strahler, Quantitative analysis of watershed geomorphology, Trans. Am. Geophys. Union 38 (6) (1957) 913–920, http://dx.doi.org/10.1029/tr038i006p00913.
[103] D. Tien Bui, B. Pradhan, O. Lofman, I. Revhaug, O.B. Dick, Spatial prediction of landslide hazards in Vietnam: a comparative assessment of the efficacy of evidential belief functions and fuzzy logic models, Catena 96 (2012) 28–40.
REFERENCES
457
[104] D. Tien Bui, B. Pradhan, O. Lofman, I. Revhaug, O.B. Dick, Landslide susceptibility assessment in the Hoa Binh province of Vietnam using Artificial Neural Network, Geomorphology 171–172 (2012) 12–19. [105] D. Tien Bui, B. Pradhan, O. Lofman, I. Revhaug, Landslide susceptibility assessment in Vietnam using support vector machines, decision tree, and Naïve Bayes models, Math. Probl. Eng. 2012 (2012) 974638, http://dx.doi.org/10.1155/ 2012/974638, 26 pages, 2012. [106] D. Tien Bui, T. Tuan, H. Klempe, B. Pradhan, I. Revhaug, Spatial prediction models for shallow landslide hazards: a comparative assessment of the efficacy of support vector machines, artificial neural networks, kernel logistic regression, and logistic model tree, Landslides (2015), http://dx.doi.org/10.1007/s10346-015-0557-6. [107] D. Tsagkas, Geomorphological Investigation and Mass Movements in Northern Peloponnese: Area of Xylokastro – Diakofto, PhD Thesis, 2011, p. 361. [108] P. Tsangaratos, A. Benardos, Estimating landslide susceptibility through an artificial neural network classifier, Nat. Hazards 74 (3) (2014) 1489–1516. [109] P. Tsangaratos, I. Ilia, Landslide susceptibility mapping using a modified decision tree classifier in the Xanthi Perfection, Greece, Landslides 13 (2) (2016) 379–397. [110] P. Tsangaratos, I. Ilia, Comparison of a logistic regression and Naïve Bayes classifier in landslide susceptibility assessments: the influence of models complexity and training data set size, Catena 145 (2016) 164–179. [111] P. Tsangaratos, I. Ilia, D. Rozos, Case Event System for landslide susceptibility analysis, in: Margottini, Canuti, Sassa (Eds.), Landslide Science and Practice, Springer, Berlin, Heidelberg, 2013, pp. 585–593. [112] A.K. Turner, R.L. Schuster, Landslides – Investigation and Mitigation: National Research Council, Transportation Research Board Special Report 247, National Academy Press, Washington, DC, 1996, 673 pp. [113] M.H. Vahidnia, A.A. Alesheikh, A. Alimohammadi, F. 
Hosseinali, A GIS-based neuro-fuzzy procedure for integrating knowledge and data in landslide susceptibility mapping, Comput. Geosci. 36 (29) (2010) 1101–1114. [114] V.N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998. [115] D.J. Varnes, International Association of Engineering Geology Commission on Landslides and Other Mass Movements on Slopes: Landslide Hazard Zonation: A Review of Principles and Practice, UNESCO, Paris, 1984, 63 pp. [116] S. Weisberg, J. Fox, An R Companion to Applied Regression, Sage Publications, Incorporated, Los Angeles, London, New Delhi, Singapore, Washington, DC, 2010. [117] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (6) (1945) 80–83. [118] C. Xu, F. Dai, X. Xu, Y.H. Lee, GIS-based support vector machine modeling of earthquake-triggered landslide susceptibility in the Jianjiang River watershed, China, Geomorphology 145–146 (2012) 70–80. [119] C. Xu, x. Xu, F. Dai, Z. Wu, H. He, F. Shi, X. Wu, S. Xu, Application of an incomplete landslide inventory, logistic regression model and its validation for landslide susceptibility mapping related to the may 12, 2008 Wenchuan earthquake of China, Nat. Hazards 68 (2013) 883–900. [120] A. Yalcin, S. Reis, A.C. Aydinoglu, T. Yomralioglu, A GIS-based comparative study of frequency ratio, analytical hierarchy process, bivariate statistics and logistics regression methods for landslide susceptibility mapping in Trabzon, NE Turkey, Catena 85 (2011) 274–287. [121] X. Yao, L.G. Tham, F.C. Dai, Landslide susceptibility mapping based on Support Vector Machine: a case study on natural slopes of Hong Kong, China, Geomorphology 101 (2008) 572–582. [122] Y.K. Yeon, J.G. Han, K.H. Ryu, Landslide susceptibility mapping in Injae, Korea, using a decision tree, Eng. Geol. 16 (3–4) (2010) 274–283. [123] I. 
Yilmaz, Comparison of landslide susceptibility mapping methodologies for Koyulhisar, Turkey: conditional probability, logistic regression, artificial neural networks, and support vector machine, Environ. Earth Sci. 61 (2010) 821–836. [124] A.M. Youssef, H.R. Pourghasemi, Z. Pourtaghi, M.M. Al-Katheeri, Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir region, Saudi Arabia, Landslides (2015), http://dx.doi.org/10.1007/s10346-015-0614-1. [125] M. Zare, H. Pourghasemi, M. Vafakhah, B. Pradhan, Landslide susceptibility mapping at Vaz Watershed (Iran) using an artificial neural network model: a comparison between multilayer perceptron (MLP) and radial basic function (RBF) algorithms, Arab. J. Geosci. 6 (8) (2013) 2873–2888. [126] A-X. Zhu, R. Wang, J. Qiao, C.-Z.Qin, Y. Chen, J. Liu, F. Du, Y. Lin, T. Zhu, An expert knowledge-based approach to landslide susceptibility mapping using GIS and fuzzy logic, Geomorphology 214 (2014) 128–138.
CHAPTER 25
MDHS–LPNN: A HYBRID FOREX PREDICTOR MODEL USING A LEGENDRE POLYNOMIAL NEURAL NETWORK WITH A MODIFIED DIFFERENTIAL HARMONY SEARCH TECHNIQUE
Rajashree Dash, Pradipta K. Dash
Siksha O Anusandhan University, Bhubaneswar, India
25.1 INTRODUCTION
FOREX (Foreign Currency Exchange) is concerned with the exchange rates of foreign currencies relative to one another. Currency trading in the international monetary market is highly influenced by these exchange rates. As a consequence of economic globalization and the interactions between the economic systems of different countries, exchange rate data are being generated and accumulated at an unprecedented rate. The rapidly growing volume of such data has far exceeded the ability of human beings to analyze them manually. Moreover, exchange rates are strongly influenced by many highly interrelated external factors: economic, political, social, and even the psychological behavior of investors. The continuous growth of such highly fluctuating and irregular data has created a critical need for more automated approaches that can analyze massive data efficiently and extract meaningful statistics from them. The main motivation of this study is to predict FOREX rate changes more accurately and thereby increase profitability. The initial models suggested in the literature for FOREX prediction are based on statistical methods, which assume that the data are correlated and linear in nature. In practice, however, foreign exchange rates do not satisfy such assumptions. As a result, statistical models are not adequate for capturing the inherent nonlinear and dynamic behavior of exchange rate time series. Hence, developing more realistic models that predict FOREX rate changes effectively and accurately, and thus lead to high profitability, is of great research interest in financial data mining. To enhance predictive power, numerous computational intelligence based FOREX predictors have been proposed in the literature. The Artificial Neural Network (ANN) is one of the popular techniques.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00025-9
Copyright © 2017 Elsevier Inc. All rights reserved.
Owing to their inherent capability to identify complex nonlinear relationships in time series data from historical observations, and to approximate any nonlinear function to a high degree of accuracy, ANNs are increasingly applied to modeling economic conditions. A survey of the literature indicates that among the different types of ANNs, the Multi-Layer Perceptron (MLP) network [11,9], the Radial Basis Function (RBF) neural network [31], and the Functional Link Artificial Neural Network (FLANN) [18,19,13] are the most popular tools for predicting currency exchange rates. In this chapter, a predictor model is developed using a Legendre Polynomial Neural Network (LPNN) for predicting daily FOREX rate values. Despite the application of many types of neural networks to financial time series prediction, the literature hardly reports any application of the LPNN to the prediction of highly fluctuating and dynamic currency exchange rates. The LPNN resembles a single hidden layer neural network, but it maps the nonlinearity of the input samples to the output through a functional expansion block instead of hidden layers with a number of neurons, as in the MLP network. The functional expansion block enhances the dimensionality of the input pattern by using a set of Legendre orthogonal functions. The main advantage of the LPNN is its simple structure and reduced computational complexity compared to the MLP: nonlinearity is introduced by expanding the input pattern with a set of linearly independent nonlinear functions rather than by hidden layers. As parameter tuning is another important issue in the design of any neural network, a modified differential harmony search (MDHS) technique is proposed for estimating the unknown weights of the network by minimizing the prediction error.
The modified differential harmony search technique is a variant of the original Harmony Search algorithm in which the pitch adjustment of harmonies is accomplished by adopting the current-to-best mutation strategy of differential evolution algorithms. Further, instead of using fixed controlling parameters for all the harmony vectors in each iteration, the parameters are adapted iteratively according to their previous successful experience. The modified approach improves both the convergence speed and the predictive ability of the network. The application of the MDHS technique to parameter estimation of the LPNN is also new in the scope of currency exchange rate prediction. The proposed learning scheme is compared with other evolutionary learning schemes, namely the Particle Swarm Optimization (PSO), Differential Evolution (DE), and Harmony Search (HS) algorithms. To test the model performance, the exchange rates of the US Dollar (USD) against four other currencies, the Australian Dollar (AUD), British Pound (GBP), Indian Rupee (INR), and Japanese Yen (JPY), are taken as experimental data. Popular benchmark error metrics, namely the Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Mean Square Error (MSE), and the coefficient of determination ($R^2$), are used for model validation. Experimental results obtained by applying the model to the four currency exchange pairs demonstrate the effectiveness of the proposed model compared to the existing FOREX prediction models included in the study. The rest of the chapter is organized as follows. Section 25.2 reviews the literature in the area of exchange rate prediction using neural networks. In Section 25.3 the basic LPNN architecture and the MDHS learning algorithm are discussed in detail.
The simulation study for demonstrating the prediction performance of the proposed model along with a comparative result of various learning approaches is presented in section 25.4. Finally, conclusions are drawn in the last section.
25.2 LITERATURE SURVEY
In recent years, many publications applying intelligent techniques such as neural networks, fuzzy inference systems, neuro-fuzzy inference systems, and a number of hybrid models have appeared in the literature on exchange rate prediction. Leung et al. [16] applied a general regression neural network (GRNN) to predict the monthly exchange rates of different currencies. A comparison of the GRNN with a multi-layered feedforward network (MLFN), a multivariate transfer function, and random walk models shows that it achieves higher forecasting accuracy than the other approaches tested in the study. Motivated by the promising results of the GRNN for monthly exchange rate prediction, Chen and Leung [3] proposed a hybrid approach using the GRNN and a multivariate econometric model for currency exchange rate prediction. In the first stage of the hybrid approach, estimates of the exchange rates are generated using a time series model; in the next stage, the errors of these estimates are corrected using the GRNN. In simulations on three exchange rates (British pound/US dollar, Canadian dollar/US dollar, and Japanese yen/US dollar), the out-of-sample performance statistics and the regression tests indicate that the hybrid model outperforms the single-stage econometric and GRNN models. Kondratenko and Kuperin [14] suggested a recurrent network that takes time series data and technical indicators such as moving averages as input for forecasting the exchange rates between the American Dollar and four other major currencies: the Japanese Yen, Swiss Franc, British Pound, and Euro. The network is validated with respect to various statistical estimates of forecast quality. The network performance is also analyzed on the basis of different linear and statistical data preprocessing, such as the Kolmogorov–Smirnov test and Hurst exponents for each currency.
Ince and Trafalis [11] presented a two-stage forecasting model combining parametric and nonparametric techniques to improve exchange rate forecasting performance. They presented a comparative study of two nonparametric models, ANN and support vector regression (SVR), with two input selection techniques, ARIMA and VAR. Their experimental analysis illustrates that the SVR method outperforms the MLP networks for each input selection algorithm. Yu et al. [31] proposed a multistage nonlinear radial basis function (RBF) neural network ensemble forecasting model for foreign exchange rate prediction. In the ensemble modeling process, a large number of single RBF neural network models are first produced. The appropriate ensemble members are then chosen by a conditional generalized variance (CGV) minimization method. Finally, another RBF network combines the selected members to produce the ensemble prediction. The experimental results show that the proposed RBF neural network ensemble forecasting model is consistently superior to the individual RBF model and four existing ensemble forecasting models for the testing cases of four main currencies in terms of both level-prediction and direction-prediction measurements. Two efficient, low-complexity neural network based forecasting models for exchange rate prediction were proposed by Majhi et al. [18]. The first model is a simple FLANN with a single-layer, single-neuron architecture, and the second is a cascaded FLANN (CFLANN) with two stages. The output of the first stage undergoes nonlinear expansion and is then fed to the second FLANN for predicting the exchange rate. Both models produce better prediction results than the LMS model; however, the CFLANN offers superior performance compared to the FLANN model. The work is extended by Majhi et al.
[19] by developing two robust forecasting models, the Wilcoxon artificial neural network (WANN) and the Wilcoxon functional link artificial neural network (WFLANN), for efficient prediction of different exchange rates several months ahead. The weights of the models are estimated by minimizing a robust norm called the Wilcoxon norm. Training with the Wilcoxon norm yields robust exchange rate predictions under different densities of outliers in the training samples. A knowledge guided artificial neural network (KGANN) was developed by Jena et al. [13] for efficient prediction of exchange rates. The new structure combines a least mean square (LMS) trained adaptive linear combiner with an adaptive FLANN model and provides more accurate and efficient predictions than either the LMS or the FLANN model individually. Ni and Yin [22] developed a hybrid model combining recurrent self-organizing maps (RSOM) and support vector regression (SVR) with a few trading indicators, such as moving average convergence/divergence and the relative strength index, for modeling and predicting foreign exchange rate time series. In this hybridization, the nonstationary time series are first partitioned into coherent groups using the RSOM. The grouped samples are then modeled using SVR, and prediction is done by the best fitting local models. A genetic algorithm is applied to further merge the trading rules with the local regressive models to produce a final hybrid model comprising a number of local models, each with a set of learnt weighting factors. Empirically, the model has shown better results in terms of profitability than global modeling techniques such as generalized autoregressive conditional heteroscedasticity (GARCH). Another simple hybrid prediction model combining an adaptive autoregressive moving average (ARMA) architecture and differential evolution (DE) based training was proposed by Rout et al. [30] for predictions one to fifteen months ahead of three exchange rates.
In this study a sliding window of past data is used to extract statistical features for each exchange rate, which are employed as input to the prediction model. A comparison with other evolutionary training techniques such as Particle Swarm Optimization (PSO), Cat Swarm Optimization (CSO), and Bacterial Foraging Optimization (BFO) demonstrates the superior short- and long-range prediction potential of the ARMA-DE exchange rate prediction model. A new model comprising the Differential EMD algorithm and SVR was proposed by Premanode and Toumazou [24] for filtering, smoothing, and predicting nonlinear and nonstationary currency exchange rates. Differential EMD smooths the input data and reduces its noise, whereas the SVR model trained on the filtered data set improves prediction accuracy. The model outperformed simulations with state-of-the-art MS-GARCH and Markov switching regression (MSR) models. A neuro-evolutionary algorithm based on the Cartesian Genetic Programming evolved Artificial Neural Network (CGPANN) was proposed by Rehman et al. [27] for FOREX prediction. Flexibility in real-time feature selection, in network architecture, and in selecting a feedforward or recurrent connectivity pattern for prediction are the key features behind the efficient performance of the system. The results also demonstrate that the network accuracy increases with the number of feedback paths, thus improving the capability of the network to predict future data. Galeshchuk [9] applied an MLP for forecasting the exchange rates of three currency pairs. Instead of comparing the MLP with other networks and training algorithms, the author simply applied the neural network to predict different exchange rates in daily, monthly, and quarterly setups.
Most previous researchers have employed different types of neural networks for financial time series prediction, but applications of the Legendre polynomial neural network (LPNN) to the prediction of highly fluctuating and dynamic currency exchange rates are rare in the literature. The LPNN has been successfully implemented in other application areas such as function approximation, pattern recognition, air pollution parameter prediction, stock price prediction, nonlinear channel equalization, nonlinear active noise control, and nonlinear dynamic system identification [21,10,17,29,5]. The main advantage of the LPNN is its simple structure and reduced computational complexity, obtained by increasing the dimensionality of the input pattern with a set of linearly independent orthogonal Legendre polynomials [28,23,4]. As with any other neural network, parameter tuning is one of the important issues in designing an LPNN. The well-known Backpropagation algorithm used for training the LPNN suffers from a slow convergence rate and a tendency to fall into local minima. To cope with these drawbacks and to increase accuracy, several optimization techniques such as the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), and Harmony Search (HS) algorithm have been proposed for the training step of the neural network. All these meta-heuristic techniques are based on four main operations: initialization, fitness calculation, improvisation, and selection. Their performance depends strongly on how these basic operations are implemented and on the controlling parameters that affect them, which ultimately affect the time and space complexity of the algorithm. In Kulluk et al. [15], the performance of a multilayer perceptron network trained with variants of the Harmony Search algorithm is analyzed for a classification problem. In Rout et al. [29], a Differential Evolution based learning approach is proposed for the LPNN for predicting the closing price of stock indices. A comparative study of two variants of Particle Swarm Optimization and Differential Evolution based learning of the LPNN is presented in [5] for stock price prediction.
In [7], a hybrid learning framework integrating the global learning capability of a self-adaptive differential harmony search technique with the generalization ability of extreme learning machines (ELM) is proposed for single hidden layer feedforward neural networks (SLFN). The proposed learning algorithm, applied to two SLFNs, i.e., an RBF network and a low-complexity functional link artificial neural network (CEFLANN), for predicting the closing price and volatility of five different stock indices, clearly shows better performance than other learning schemes such as DE-OELM, DE, SADHS, SGHS, HS, ELM, and BP. In Naik et al. [20], an improved variant of harmony search (HS), called self-adaptive harmony search (SAHS), is combined with gradient descent learning for training a functional link artificial neural network (FLANN) for classification of biological data. Inspired by these successful applications of harmony search algorithms, in this study a variant of the Harmony Search algorithm, i.e., MDHS, is applied to address the intricacy of adjusting the unknown weights of the LPNN. The proposed modified approach can improve the performance of the harmony search algorithm by incorporating the mutation scheme of differential evolution in the pitch adjustment operation and by eliminating the fixed controlling parameters. The mechanism of adapting the controlling parameters of harmony search from previous successful iterations is inspired by the strategy of adapting the crossover rate and mutation scale in the ZADE approach proposed in [32,12]. The MDHS has been successfully applied to parameter estimation of a self-evolving recurrent neuro-fuzzy inference system for predicting stock price indices [6]. However, the application of the MDHS technique to parameter estimation of the LPNN is new in the scope of currency exchange rate prediction. Employing these strategies has been found to have considerable influence on the quality of solutions compared with previously available alternatives.
FIGURE 25.1 Architecture of Legendre Polynomial Neural Network (LPNN).
25.3 MDHS–LPNN: A HYBRID FOREX PREDICTOR MODEL
FOREX (Foreign Currency Exchange) data consist of the exchange rates of foreign currencies observed successively at a particular time interval. The interval may be of any duration, such as hourly, daily, weekly, or monthly. Being a time series, the FOREX data with $N$ points can be represented by $X(t) = \{x_1, x_2, \ldots, x_N\}$, where $x_k$ for $k = 1, 2, \ldots, N$ denotes the exchange rate value of one currency with respect to another at the $k$th time instant. The aim of the predictor model is to predict the future exchange rate value $x_{N+h}$ using the preceding observation sequence over a window of size $W$ and a prediction horizon of $h$, i.e., $[x_{N-W+h}, \ldots, x_{N-1}, x_N]^T$. During training of the model, past data points are used as outputs, and the window-sized preceding sequences created for each of these data points are applied as inputs. Training continues over a large number of iterations until the chosen error metric between the calculated output and the desired output attains a minimum value. Once training is complete, the model can be used to predict future values.
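The construction of input and output patterns described above can be sketched as follows. This is a minimal illustration rather than the chapter's own code: the function name `make_patterns` is hypothetical, and it assumes each input is a block of `window` consecutive observations whose target lies `horizon` steps after the last value in the block.

```python
import numpy as np

def make_patterns(series, window, horizon=1):
    """Build (inputs, targets) from a 1-D series.

    Each input row holds `window` consecutive observations ending at index t;
    the matching target is the observation `horizon` steps after t.
    (Assumed reading of the windowing scheme; names are illustrative.)
    """
    x = np.asarray(series, dtype=float)
    n_patterns = len(x) - window - horizon + 1
    if n_patterns <= 0:
        raise ValueError("series too short for this window and horizon")
    inputs = np.stack([x[i:i + window] for i in range(n_patterns)])
    first_target = window + horizon - 1
    targets = x[first_target:first_target + n_patterns]
    return inputs, targets

# Toy series standing in for daily exchange rates.
rates = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
X, y = make_patterns(rates, window=3, horizon=1)
```

With `window=3` and `horizon=1`, the first input row is the first three observations and its target is the fourth.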
25.3.1 LEGENDRE POLYNOMIAL NEURAL NETWORK
The Legendre Polynomial Neural Network (LPNN) has the same structure as a functional link artificial neural network (FLANN) with a functional expansion block. Fig. 25.1 shows the architecture of the LPNN. The functional expansion block enhances the dimensionality of the input pattern by using a set of orthogonal Legendre polynomials, in contrast to the trigonometric sine and cosine functions used in existing FLANNs. The architecture of the LPNN resembles a single hidden layer neural network, but it maps the nonlinearity of the input samples to the output through the functional expansion block instead of hidden layers with a number of neurons, as in the Multi-Layer Perceptron (MLP) network. The main advantage of the LPNN is its simple structure and reduced computational complexity compared to the MLP, obtained by increasing the dimensionality of the input pattern with a set of linearly independent nonlinear functions [4,23]. The Legendre polynomials are denoted by $L_p(x)$, where $p$ is the order and $-1 < x < 1$ is the argument of the polynomial. They constitute a set of orthogonal polynomials that are solutions to the differential equation
$$\frac{d}{dx}\left[\left(1 - x^2\right)\frac{dy}{dx}\right] + p(p+1)\,y = 0 \qquad (25.1)$$
The zeroth- and first-order Legendre polynomials are respectively given by $L_0(x) = 1$ and $L_1(x) = x$. The higher order polynomials are
$$L_2(x) = \tfrac{1}{2}\left(3x^2 - 1\right), \qquad L_3(x) = \tfrac{1}{2}\left(5x^3 - 3x\right), \qquad L_4(x) = \tfrac{1}{8}\left(35x^4 - 30x^2 + 3\right) \qquad (25.2)$$
The recursive formula to generate higher order Legendre polynomials is given by
$$L_{p+1}(x) = \frac{1}{p+1}\left[(2p+1)\,x\,L_p(x) - p\,L_{p-1}(x)\right] \qquad (25.3)$$
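As a quick check, assuming the recurrence is read in its standard form with the factor $x$ multiplying $L_p(x)$, setting $p = 1$ and $p = 2$ with $L_0(x) = 1$ and $L_1(x) = x$ reproduces the second- and third-order polynomials listed in (25.2):

```latex
\begin{aligned}
L_2(x) &= \tfrac{1}{2}\left[\,3x \cdot x - 1 \cdot 1\,\right]
        = \tfrac{1}{2}\left(3x^2 - 1\right), \\
L_3(x) &= \tfrac{1}{3}\left[\,5x \cdot \tfrac{1}{2}\left(3x^2 - 1\right) - 2x\,\right]
        = \tfrac{1}{3} \cdot \tfrac{1}{2}\left(15x^3 - 9x\right)
        = \tfrac{1}{2}\left(5x^3 - 3x\right).
\end{aligned}
```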
Hence, with order $p$, any $d$-dimensional input pattern $X = [x_1, x_2, \ldots, x_d]^T$ is expanded to a $k$-dimensional pattern $Y$ by Legendre functional expansion as
$$Y = [1, L_1(x_1), L_2(x_1), \ldots, L_p(x_1), L_1(x_2), L_2(x_2), \ldots, L_p(x_2), \ldots, L_1(x_d), L_2(x_d), \ldots, L_p(x_d)]^T$$
where $k = p \cdot d + 1$. The recursive polynomial operations of the Legendre functions produce a polynomial matrix as follows:
$$\begin{bmatrix} L_0(X) \\ L_1(X) \\ L_2(X) \\ L_3(X) \\ L_4(X) \end{bmatrix} = \begin{bmatrix} 1 \\ X \\ \frac{3X^2}{2} - \frac{1}{2} \\ \frac{5X^3}{2} - \frac{3X}{2} \\ \frac{35X^4}{8} - \frac{15X^2}{4} + \frac{3}{8} \end{bmatrix} \qquad (25.4)$$
The predicted sample can be represented as a weighted sum of these nonlinear polynomial arrays. The inherent nonlinearities in the polynomials attempt to accommodate the nonlinear causal relation of the future sample with the samples prior to it.
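The functional expansion can be illustrated with a short sketch. The function name `legendre_expand` is hypothetical, and the code assumes the standard three-term Legendre recurrence and inputs already scaled into the interval where the polynomials are used. It returns the leading constant 1 followed by the terms of each input component in the order given for $Y$.

```python
import numpy as np

def legendre_expand(x, order):
    """Expand a d-dimensional pattern into 1 + order * d Legendre features.

    Illustrative helper (not from the chapter); requires order >= 1.
    """
    if order < 1:
        raise ValueError("order must be at least 1")
    x = np.asarray(x, dtype=float)
    prev, curr = np.ones_like(x), x.copy()   # L0 and L1, element-wise
    terms = [curr]
    for p in range(1, order):
        # Recurrence: L_{p+1} = ((2p + 1) x L_p - p L_{p-1}) / (p + 1)
        prev, curr = curr, ((2 * p + 1) * x * curr - p * prev) / (p + 1)
        terms.append(curr)
    # terms[i][j] holds L_{i+1}(x_j); transpose so each input's terms are contiguous.
    features = np.stack(terms, axis=0).T.ravel()
    return np.concatenate(([1.0], features))

# Two inputs expanded to order 3 give k = 3 * 2 + 1 = 7 features.
Y = legendre_expand([0.5, -0.2], order=3)
```

Computing the terms by recurrence avoids hard-coding the closed forms, so the same helper works for any order.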
At any instance $t$, using Legendre polynomials, the weighted sum of the components of the enhanced input pattern $X$ is obtained using the following formula:
$$u(t)_{\text{Legendre}} = w_{(0)}(t) + \begin{bmatrix} w_{(1,1)}(t) \\ w_{(2,1)}(t) \\ \vdots \\ w_{(p-1,1)}(t) \\ w_{(p,1)}(t) \end{bmatrix}^T \begin{bmatrix} L_1(x_1(t)) \\ L_2(x_1(t)) \\ \vdots \\ L_{p-1}(x_1(t)) \\ L_p(x_1(t)) \end{bmatrix} + \cdots + \begin{bmatrix} w_{(1,d)}(t) \\ w_{(2,d)}(t) \\ \vdots \\ w_{(p-1,d)}(t) \\ w_{(p,d)}(t) \end{bmatrix}^T \begin{bmatrix} L_1(x_d(t)) \\ L_2(x_d(t)) \\ \vdots \\ L_{p-1}(x_d(t)) \\ L_p(x_d(t)) \end{bmatrix} \qquad (25.5)$$
$$\text{Weighted sum} = u(t)_{\text{Legendre}} = \sum_{i=1,\,j=1}^{i=p,\,j=d} w_{(i,j)}(t) \times L_i(x_j(t)) \qquad (25.6)$$
Then the weighted sum is passed through an activation function $S(\cdot)$ to produce an output $y$ as follows:
$$y(t) = S\left(u(t)_{\text{Legendre}}\right) \qquad (25.7)$$
In this study the hyperbolic tangent ($\tanh(\cdot)$) nonlinear function is used as $S(\cdot)$. The error obtained by comparing the output with the desired output is used to update the weights of the network by a weight-updating algorithm during the learning process. With Backpropagation learning, in each iteration the gradient of the cost function with respect to the weights is determined and the weights are incremented by a fraction of the negative gradient. The updating rule for the weight $w_{ij}$ becomes
$$w_{ij}(t+1) = w_{ij}(t) + \alpha\, e_{i,t}\left(1 - y_{i,t}^2\right) L_j(X) \qquad (25.8)$$
where $L_j(X)$ is the value of the $j$th expanded unit, $\alpha$ is the fraction of the gradient applied at each step, $e_{i,t}$ is the error, and $1 - y_{i,t}^2$ is the derivative of the tanh activation.
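A minimal sketch of the forward pass and one gradient step for a single output unit is given below. The function names, the learning rate value, and the toy feature vector are illustrative assumptions. The update uses the standard tanh derivative $1 - y^2$, and the bias weight is handled by including the constant 1 as the first feature.

```python
import numpy as np

def lpnn_forward(weights, features):
    """Output y = tanh(w . features) for one expanded pattern."""
    return np.tanh(np.dot(weights, features))

def lpnn_update(weights, features, target, lr=0.1):
    """One gradient-descent step on the squared error 0.5 * (target - y)**2."""
    y = lpnn_forward(weights, features)
    error = target - y
    # d tanh(u)/du = 1 - y**2, so the step is lr * error * (1 - y**2) * feature.
    return weights + lr * error * (1.0 - y ** 2) * features

# Toy expanded pattern: constant term, L1(0.5), L2(0.5).
features = np.array([1.0, 0.5, -0.125])
w = np.zeros(3)
target = 0.4
before = abs(target - lpnn_forward(w, features))
for _ in range(200):
    w = lpnn_update(w, features, target)
after = abs(target - lpnn_forward(w, features))
```

Repeating the step on one pattern drives the output toward the target. In practice the update would be cycled over all training patterns.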
25.3.2 MODIFIED DIFFERENTIAL HARMONY SEARCH
The Modified Differential Harmony Search (MDHS) technique is a variant of the existing Harmony Search (HS) algorithm, obtained by adopting the mutation scheme of differential evolution in the pitch adjustment operation and by eliminating the use of fixed controlling parameters for all harmonies. The pitch adjustment operation of the original HS algorithm intrinsically performs a fixed-step-size mutation with a prespecified execution probability, which cannot adapt to the searching landscapes of different
problems at different searching stages. On the other hand, the mutation scheme in DE provides a spontaneous self-adaptability to the searching landscape. Therefore, incorporating the mutation strategy of DE into the pitch adjustment operation of HS can enhance the population variance of HS and intensify its explorative power [2,26,25,8]. A number of mutation strategies, such as rand1, rand2, best1, best2, and current-to-best, are available for DE in the literature. Existing techniques use the rand1 mutation strategy. In the proposed modified approach, the pitch adjustment operation of the original HS algorithm is replaced by the current-to-best mutation strategy of DE. The modified approach improves the convergence speed of the network. As with any meta-heuristic technique, the performance of the MDHS algorithm relies on the values of a few controlling parameters: the size of the harmony memory (HMS), the maximum number of iterations (NI), the harmony memory consideration rate (HMCR), the pitch adjustment rate (PAR), and the scaling factor ($F$). The optimal settings of these parameters are problem-dependent, so it is often necessary to tune them to achieve the desired results. The parameter HMCR controls the balance between exploration and exploitation and takes a value between 0 and 1. PAR determines whether further adjustment by the mutation operator is required and can be viewed as a local search. The scaling factor $F$ controls the magnitude of the mutation. Rather than using fixed controlling parameters for all the harmony vectors during each iteration, the modified approach adapts the parameters iteratively according to their previous successful experience. The mechanism of adapting the controlling parameters of harmony search from previous successful iterations is inspired by the strategy of adapting the crossover rate and mutation scale in the ZADE approach proposed in [32,12].
Initially, the HMCR, PAR, and F values associated with each individual are generated from a normal or Cauchy distribution with means μHMCR, μPAR, and μF, respectively. At the end of each generation, these means are updated from the HMCR, PAR, and F values that produced successful trial vectors in that generation. As the search progresses, the parameters should gradually approach the optimal values for the given problem. The flow diagram specifying the steps of MDHS is shown in Fig. 25.2.
25.3.3 DETAILED STEPS OF CURRENCY EXCHANGE RATE PREDICTION USING MDHS–LPNN
Step 1. Collect historical currency exchange rate values within a period.
Step 2. Normalize the currency exchange rates between 0 and 1 using the min-max normalization
$$y = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{25.9}$$
where y = normalized value, x = value to be normalized, xmin = minimum value of the series to be normalized, and xmax = maximum value of the series to be normalized.
Step 3. With a chosen window size and prediction horizon, prepare the input and output patterns.
Step 4. Expand the input pattern using the order p and the Legendre orthogonal functions.
Step 5. Initialize the parameters HMS, μHMCR, μPAR, μF, and NI.
CHAPTER 25 MDHS–LPNN: A HYBRID FOREX PREDICTOR MODEL
FIGURE 25.2 Detailed steps of Modified Differential Harmony Search technique (MDHS).
Step 6. Initialize the harmony memory HM randomly, specifying the unknown parameters of the model according to the harmony memory size.
Step 7. Set t = 0.
Step 8. Find the fitness function value of each harmony vector in HM, i.e. the root mean square error between the actual and predicted output values obtained by applying the weights specified in each harmony vector to the Legendre functional expansion and using the nonlinear tanh( ) function at the output unit.
Step 9. Generate the HMCR, PAR, and the mutation scaling factor F using normal and Cauchy distributions with mean values μHMCR, μPAR, and μF as follows:
$$\begin{aligned} HMCR_i &= \mathrm{randn}_i(\mu_{HMCR}, 0.1) \\ PAR_i &= \mathrm{randn}_i(\mu_{PAR}, 0.1) \\ F_i &= \mathrm{randc}_i(\mu_{F}, 0.1) \end{aligned} \tag{25.10}$$
Step 10. Improvise a new harmony (Xnew) from HM as follows:
Step 10.1: for (j = 1 to n) do
  Step 10.2: if (rand(0, 1) < HMCR) then Xnew(j) = Xa(j), where a ∈ (1, 2, ..., HMS)
    Step 10.3: if (rand(0, 1) < PAR) then
$$X_{new}(j) = X_{new}(j) + F \times (X_{best}(j) - X_{new}(j)) + F \times (X_{b}(j) - X_{c}(j)) \tag{25.11}$$
    where b, c ∈ (1, 2, ..., HMS)
  Step 10.4: else Xnew(j) = LB(j) + rand(0, 1) × (UB(j) − LB(j))
Step 11. If f(Xnew) is better than f(Xworst), update the HM as Xworst = Xnew.
Step 12. Update the mean HMCR, PAR, and scaling factor as follows:
$$\begin{aligned} \mu_{HMCR} &= (1 - c)\,\mu_{HMCR} + c \times \mathrm{mean}_A(S_{HMCR}) \\ \mu_{PAR} &= (1 - c)\,\mu_{PAR} + c \times \mathrm{mean}_A(S_{PAR}) \\ \mu_{F} &= (1 - c)\,\mu_{F} + c \times \mathrm{mean}_L(S_{F}) \end{aligned} \tag{25.12}$$
where c ∈ (0, 1) is a positive constant; SHMCR, SPAR, and SF denote the sets of all successful HMCR, PAR, and mutation scaling factor F values, respectively; and meanA and meanL represent the arithmetic and Lehmer mean, respectively. The Lehmer mean is given by
$$\mathrm{mean}_L(S_F) = \frac{\sum_{i=1}^{|S_F|} F_i^{2}}{\sum_{i=1}^{|S_F|} F_i} \tag{25.13}$$
Step 13. Set t = t + 1.
Step 14. Repeat steps 8 to 13 until t = NI is reached.
Step 15. Save the best harmony vector Xbest in the HM to represent the parameters of the model and use the model for testing.
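Steps 6 to 15 can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the authors' implementation. It assumes a minimisation problem, one trial harmony per memory slot in every iteration (so that the success sets of Step 12 can hold several values), initial means of 0.9, 0.5, and 0.5, c = 0.1, clipping of new harmonies to the bounds, and the JADE-style handling of F (redraw if non-positive, cap at 1). The function names `lehmer_mean`, `improvise`, and `mdhs` are hypothetical.

```python
import numpy as np


def lehmer_mean(values):
    """Lehmer mean of Eq. (25.13): sum(F_i^2) / sum(F_i)."""
    values = np.asarray(values, dtype=float)
    return float(np.sum(values ** 2) / np.sum(values))


def improvise(hm, fitness, lb, ub, hmcr, par, f, rng):
    """Step 10: build one new harmony from the harmony memory (minimisation)."""
    hms, n = hm.shape
    best = hm[np.argmin(fitness)]
    x_new = np.empty(n)
    for j in range(n):
        if rng.random() < hmcr:
            # Step 10.2: memory consideration, copy component j of a random harmony.
            x_new[j] = hm[rng.integers(hms), j]
            if rng.random() < par:
                # Step 10.3: current-to-best mutation of Eq. (25.11).
                # b and c are memory indices, unrelated to the constant c of Eq. (25.12).
                b, c = rng.choice(hms, size=2, replace=False)
                x_new[j] += f * (best[j] - x_new[j]) + f * (hm[b, j] - hm[c, j])
        else:
            # Step 10.4: random value inside the bounds.
            x_new[j] = lb[j] + rng.random() * (ub[j] - lb[j])
    return np.clip(x_new, lb, ub)  # assumption: keep the harmony inside [LB, UB]


def mdhs(objective, lb, ub, hms=30, ni=200, c=0.1, seed=0):
    """Minimise `objective`; returns (best vector, best fitness, best-so-far history)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    hm = lb + rng.random((hms, lb.size)) * (ub - lb)      # Step 6
    fitness = np.array([objective(x) for x in hm])        # Step 8
    mu_hmcr, mu_par, mu_f = 0.9, 0.5, 0.5                 # assumed initial means (Step 5)
    history = []
    for t in range(ni):
        s_hmcr, s_par, s_f = [], [], []
        for _ in range(hms):
            # Step 9: sample the control parameters around their current means.
            hmcr = float(np.clip(rng.normal(mu_hmcr, 0.1), 0.0, 1.0))
            par = float(np.clip(rng.normal(mu_par, 0.1), 0.0, 1.0))
            f = 0.0
            while f <= 0.0:                               # assumption: redraw non-positive F
                f = mu_f + 0.1 * rng.standard_cauchy()
            f = min(f, 1.0)                               # assumption: cap F at 1
            x_new = improvise(hm, fitness, lb, ub, hmcr, par, f, rng)
            f_new = objective(x_new)
            worst = int(np.argmax(fitness))
            if f_new < fitness[worst]:                    # Step 11
                hm[worst], fitness[worst] = x_new, f_new
                s_hmcr.append(hmcr)
                s_par.append(par)
                s_f.append(f)
        if s_f:                                           # Step 12, Eq. (25.12)
            mu_hmcr = (1 - c) * mu_hmcr + c * np.mean(s_hmcr)
            mu_par = (1 - c) * mu_par + c * np.mean(s_par)
            mu_f = (1 - c) * mu_f + c * lehmer_mean(s_f)
        history.append(float(fitness.min()))
    best = int(np.argmin(fitness))
    return hm[best].copy(), float(fitness[best]), history


# Usage on a toy objective (sum of squares) with weights bounded in [-1, 1].
x_best, f_best, history = mdhs(lambda x: float(np.sum(x ** 2)),
                               lb=[-1.0] * 5, ub=[1.0] * 5, hms=10, ni=50)
```

For the predictor itself, `objective` would be the RMSE of the LPNN output over the in-sample patterns, evaluated with the weights stored in the harmony vector. Because only the worst harmony is ever replaced, the best fitness in memory never deteriorates from one iteration to the next.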
25.3.4 COMPUTATIONAL COMPLEXITY OF LPNN
Having a single-layer structure, the LPNN offers a reduced computational complexity compared to a Multilayer Perceptron (MLP) network. The reduction is achieved by using a simple polynomial expansion in place of the hidden layers, and the neurons in each of them, employed in an MLP. For a simple MLP with n input nodes, a single hidden layer of m neurons, and 1 output neuron, the number of weights to be estimated during the learning stage is [m(n + 1) + m + 1], whereas for an LPNN with expansion order m, n input nodes, and 1 output neuron, the number of weights to be estimated is [nm + 1], which is 2m fewer than for the MLP with a single hidden layer.
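The weight counts above can be checked with a short sketch of the functional expansion. The layout used here is an assumption rather than something stated in the chapter: each of the n inputs is expanded into the Legendre polynomials P_1 to P_m, and a single shared bias term is added, which gives nm + 1 expanded terms. The helper names are hypothetical; `numpy.polynomial.legendre.legvander` is used to evaluate the polynomials.

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_expand(x, order):
    """Map n inputs to n*order Legendre terms plus one bias term."""
    x = np.asarray(x, dtype=float)
    # legvander returns P_0(x)..P_order(x) for every input; the constant P_0
    # column is dropped and replaced by a single shared bias (assumed layout).
    terms = legendre.legvander(x, order)[:, 1:]
    return np.concatenate(([1.0], terms.ravel()))


def lpnn_predict(x, weights, order):
    """Single-layer LPNN output: tanh of the weighted sum of the expanded inputs."""
    return float(np.tanh(np.dot(weights, legendre_expand(x, order))))


def lpnn_weight_count(n, m):
    return n * m + 1


def mlp_weight_count(n, m):
    return m * (n + 1) + m + 1


for order in (2, 3, 4):
    print(order, lpnn_weight_count(6, order), mlp_weight_count(6, order))
```

With six inputs, expansion orders 2, 3, and 4 give 13, 19, and 25 weights, which matches the parameter space sizes reported in Tables 25.2 to 25.5.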
25.4 EMPIRICAL STUDY
In this study, the performance of a Legendre Polynomial Neural Network trained with a modified differential harmony search technique is analyzed for one-day-ahead prediction of the daily exchange rates of the US Dollar (USD) against four other currencies: Australian Dollar (AUD), British Pound (GBP), Indian Rupee (INR), and Japanese Yen (JPY). The performance of the hybrid model is also compared with that of other evolutionary learning algorithms, namely DE, PSO, and HS.
25.4.1 DATA SET DESCRIPTION
For the experiments, sample data sets comprising the daily exchange rate prices of the US Dollar (USD) against four other currencies, Australian Dollar (AUD), British Pound (GBP), Indian Rupee (INR), and Japanese Yen (JPY), during the period 1/1/2014 to 31/5/2015 are gathered and passed through a windowing process with a chosen window size and prediction horizon to form the inputs and outputs of the model. The gathered data are plotted in Fig. 25.3 for USD/AUD, in Fig. 25.4 for USD/GBP, in Fig. 25.5 for USD/INR, and in Fig. 25.6 for USD/JPY. To measure the generalization ability of the model, hold-out cross-validation is applied: each data set is divided into an in-sample set, used for training and validation, and an out-sample set, used for testing. For all the data sets, the daily exchange rate prices collected from 1/1/2014 to 11/12/2014 are used as in-sample data and those from 12/12/2014 to 31/5/2015 are used as out-sample data. The same time period is considered for all four currencies when analyzing the performance of the proposed model. The details of the in-sample and out-sample data for the four currency exchange rates are given in Table 25.1. To improve performance, all the collected exchange rate prices are first scaled between 0 and 1 using the min-max normalization given in Eq. (25.9).
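The scaling of Eq. (25.9) is simple enough to show directly. The sketch below is illustrative: the function names are hypothetical, and the inverse mapping is added only as a convenience for reading predictions back in the original units.

```python
import numpy as np


def min_max_normalize(series):
    """Scale a series to [0, 1] as in Eq. (25.9); also return x_min and x_max."""
    series = np.asarray(series, dtype=float)
    x_min, x_max = series.min(), series.max()
    return (series - x_min) / (x_max - x_min), x_min, x_max


def denormalize(y, x_min, x_max):
    """Invert Eq. (25.9) to map normalized values back to exchange rates."""
    return np.asarray(y, dtype=float) * (x_max - x_min) + x_min


# Illustrative values only, not taken from the data sets of this chapter.
rates = [61.9, 62.4, 63.1, 62.0]
scaled, lo, hi = min_max_normalize(rates)
restored = denormalize(scaled, lo, hi)
```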
25.4.2 INPUT SETUP
Using the direct method of prediction, the window size is set to 5 for the simulation, corresponding to one week of exchange rate prices, and the prediction horizon is set to 1, specifying one-day-ahead prediction. With the chosen window size and prediction horizon, the input patterns with their corresponding output patterns are prepared from the normalized data set. The first input pattern is prepared by taking the normalized exchange rates of the first five days and the simple moving average calculated over those five days. The normalized exchange rate of the sixth day is used as the corresponding output pattern. Subsequently, the sliding window is shifted by one position to extract the second input–output pattern, and the same process is repeated until all input–output patterns are extracted. In this way a total of 437 input–output patterns are prepared for USD/AUD, USD/GBP, and USD/JPY, and a total of 436 patterns are prepared for USD/INR. Out of these patterns, two thirds are used as in-sample data and the remaining are used as out-sample data. Corresponding to the input–output patterns, the number of input layer nodes of the LPNN model is set to 6, representing the daily exchange rates of the previous five days and their simple moving average, and the number of output nodes is set to 1, representing the exchange rate price of the sixth day.
FIGURE 25.3 Exchange rate of USD/AUD with daily setup.
FIGURE 25.4 Exchange rate of USD/GBP with daily setup.
FIGURE 25.5 Exchange rate of USD/INR with daily setup.
FIGURE 25.6 Exchange rate of USD/JPY with daily setup.
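The sliding-window procedure of Section 25.4.2 can be sketched as follows. The function name `make_patterns` is hypothetical, and the evenly spaced series only stands in for 442 normalized daily rates. The split point of 291 in-sample patterns follows Table 25.1.

```python
import numpy as np


def make_patterns(series, window=5, horizon=1):
    """Build (inputs, targets): `window` past values plus their simple moving average."""
    series = np.asarray(series, dtype=float)
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        past = series[start:start + window]
        inputs.append(np.append(past, past.mean()))          # 5 rates + moving average
        targets.append(series[start + window + horizon - 1])  # rate of the sixth day
    return np.array(inputs), np.array(targets)


series = np.linspace(0.0, 1.0, 442)   # placeholder for 442 normalized daily rates
X, y = make_patterns(series)
X_in, y_in = X[:291], y[:291]         # in-sample patterns
X_out, y_out = X[291:], y[291:]       # out-sample patterns
print(X.shape, X_in.shape, X_out.shape)
```

With 442 samples the loop yields 437 patterns of 6 inputs each, and 441 samples yield 436, in line with the counts given for the four data sets.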
25.4.3 PARAMETER SETUP
The complexity of the LPNN is greatly influenced by the expansion order used in the functional expansion block, which determines the number of unknown weights to be tuned and may affect the overall performance of the network. Hence, instead of keeping the expansion order fixed, the simulation is initially carried out with three different expansion orders. With different
Table 25.1 Details of Data Taken for Daily Exchange Rate Prediction of USD to 4 Other Exchange Rates

Data Set   Date Range              Total Samples   Total Patterns   In-Sample Date Range     In-Sample Patterns   Out-Sample Date Range     Out-Sample Patterns
USD/AUD    1/1/2014 to 31/5/2015   442             437              1/1/2014 to 11/12/2014   291                  12/12/2014 to 31/5/2015   146
USD/GBP    1/1/2014 to 31/5/2015   442             437              1/1/2014 to 11/12/2014   291                  12/12/2014 to 31/5/2015   146
USD/INR    1/1/2014 to 31/5/2015   441             436              1/1/2014 to 11/12/2014   291                  12/12/2014 to 31/5/2015   145
USD/JPY    1/1/2014 to 31/5/2015   442             437              1/1/2014 to 11/12/2014   291                  12/12/2014 to 31/5/2015   146

Total Samples: total available samples. Total Patterns: total input–output patterns generated.
USD/AUD: US Dollar to Australian Dollar rate. USD/GBP: US Dollar to British Pound rate. USD/INR: US Dollar to Indian Rupee rate. USD/JPY: US Dollar to Japanese Yen rate.
expansion orders the parameter space size differs, so two harmony memory sizes, 30 and 50, are used for the MDHS technique. The harmony vectors are then randomly initialized to values between −1 and 1, where each vector represents the weight vector whose length equals the parameter space size for the chosen expansion order of the network. For network convergence, the number of iterations is set to 200. The controlling parameters of any evolutionary algorithm are application-dependent and have no fixed values, so the controlling parameters of the evolutionary algorithms are first determined through a number of trial simulations. The RMSE is taken as the fitness function for all the evolutionary learning algorithms included in the study.
25.4.4 PERFORMANCE EVALUATION CRITERIA
The Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), and Mean Square Error (MSE) are used to compare the performance of the model in predicting the exchange rate values of USD against AUD, GBP, INR, and JPY one day in advance with different learning algorithms. The error metrics are defined as follows:
$$RMSE = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(y_k - \hat{y}_k\right)^2} \tag{25.14}$$
$$MAPE = \frac{1}{N}\sum_{k=1}^{N}\left|\frac{y_k - \hat{y}_k}{y_k}\right| \times 100 \tag{25.15}$$
$$MAE = \frac{1}{N}\sum_{k=1}^{N}\left|y_k - \hat{y}_k\right| \tag{25.16}$$
$$MSE = \frac{1}{N}\sum_{k=1}^{N}\left(y_k - \hat{y}_k\right)^2$$
where yk = actual exchange rate price on the kth day, ŷk = predicted exchange rate price on the kth day, and N = number of data samples.
The coefficient of determination (R²) between the actual and predicted values is also used for model validation. It is calculated using the following formula:
$$R^2 = 1 - \frac{\sum_{k=1}^{N}\left(y_k - \hat{y}_k\right)^2}{\sum_{k=1}^{N}\left(y_k - \frac{1}{N}\sum_{j=1}^{N} y_j\right)^2} \tag{25.17}$$
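The five criteria translate directly into NumPy. The sketch below follows Eqs. (25.14) to (25.17) and the unnumbered MSE definition; the function name `forecast_metrics` is hypothetical.

```python
import numpy as np


def forecast_metrics(y_true, y_pred):
    """Return RMSE, MAPE (in percent), MAE, MSE and R^2 for two equal-length series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                         # numerator of Eq. (25.17)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # denominator of Eq. (25.17)
    return {
        "RMSE": float(np.sqrt(mse)),
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100.0),
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(mse),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Illustrative values only.
metrics = forecast_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
```

MAPE divides by the actual value, so it is undefined when an actual value is zero; this can happen for the minimum of a series after min-max normalization.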
Additionally, regression error characteristic (REC) analysis, a powerful visualization tool proposed by Bi and Bennett [1], is applied to validate and compare the different prediction models. The REC curve estimates the cumulative distribution of the prediction error: the error tolerance is plotted on the x axis and the accuracy of the prediction method, i.e. the fraction of samples predicted within that tolerance, on the y axis.
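A REC curve can be computed in a few lines. The sketch below assumes the absolute deviation as the error measure (the squared error is another common choice) and uses a hypothetical function name, `rec_curve`.

```python
import numpy as np


def rec_curve(y_true, y_pred, tolerances):
    """For each tolerance, return the fraction of samples whose absolute error is within it."""
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    tolerances = np.asarray(tolerances, dtype=float)
    accuracy = np.array([(errors <= t).mean() for t in tolerances])
    return tolerances, accuracy


# Illustrative values only: absolute errors are 0, 0.5, 1 and 0.
tol, acc = rec_curve([1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 2.0, 4.0], [0.0, 0.5, 1.0])
```

Plotting `acc` against `tol` for several models on the same axes gives the kind of comparison described above: a curve that lies above another reaches a given accuracy at a smaller error tolerance.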
25.4.5 RESULT ANALYSIS
The performance of the hybrid MDHS–LPNN model depends on several factors, such as the expansion order of the network and the controlling parameter values of the MDHS technique. To evaluate the prediction ability of the model, simulations are carried out with different expansion orders and controlling parameters of the MDHS technique. For a good predictor model, the out-sample prediction result has higher significance than the in-sample output, and lower forecasting errors normally indicate higher prediction ability. Hence the three forecasting errors obtained on out-sample data using the hybrid predictor model with three different expansion orders are reported in Table 25.2 for the USD/AUD data set, in Table 25.3 for USD/GBP, in Table 25.4 for USD/INR, and in Table 25.5 for USD/JPY. Because the parameter space size differs with the expansion order, two harmony memory sizes, 30 and 50, are used for the MDHS technique. For the USD/AUD and USD/JPY data sets, better forecasting results are observed with expansion order 3 and harmony memory size 30, whereas for the USD/GBP and USD/INR data sets, better results are observed with expansion order 3 and harmony memory size 50. For all the data sets, the hybrid model gives its lowest errors at expansion order 3; the higher expansion order of 4 brings no improvement in prediction accuracy. Moreover, a higher expansion order enlarges the parameter space and accordingly lengthens the training time. The convergence curves of the MDHS–LPNN model for the four data sets are shown in Figs. 25.7 to 25.10. From the convergence analysis it is clear that the model is converging
Table 25.2 Performance Statistics of MDHS–LPNN for the Out-Sample Forecasts of USD/AUD Data Set

Expansion Order   Harmony Memory Size   Parameter Space Size   RMSE     MAPE     MAE
2                 30                    13                     0.0335   0.8807   0.0384
2                 50                    13                     0.0441   1.1764   0.0515
3                 30                    19                     0.0310   0.8273   0.0355
3                 50                    19                     0.0368   0.9618   0.0419
4                 30                    25                     0.0761   1.8929   0.0825
4                 50                    25                     0.0482   1.2115   0.0527
Table 25.3 Performance Statistics of MDHS–LPNN for the Out-Sample Forecasts of USD/GBP Data Set

Expansion Order   Harmony Memory Size   Parameter Space Size   RMSE     MAPE     MAE
2                 30                    13                     0.0320   0.5852   0.0346
2                 50                    13                     0.0402   0.7584   0.0449
3                 30                    19                     0.0323   0.6000   0.0354
3                 50                    19                     0.0314   0.5796   0.0342
4                 30                    25                     0.0415   0.7866   0.0461
4                 50                    25                     0.0361   0.6708   0.0395
Table 25.4 Performance Statistics of MDHS–LPNN for the Out-Sample Forecasts of USD/INR Data Set

Expansion Order   Harmony Memory Size   Parameter Space Size   RMSE     MAPE     MAE
2                 30                    13                     0.0339   0.3786   0.0367
2                 50                    13                     0.0342   0.3834   0.0371
3                 30                    19                     0.0272   0.3172   0.0306
3                 50                    19                     0.0257   0.2960   0.0285
4                 30                    25                     0.0440   0.4521   0.0438
4                 50                    25                     0.0362   0.3950   0.0383
within 100 to 130 iterations for the USD/AUD, USD/INR, and USD/JPY data sets and within 50 to 80 iterations for the USD/GBP data set. For all the data sets, the RMSE converges to a value in the range 0.01 to 0.025. The performance of the hybrid model is also compared with that of other evolutionary learning algorithms, namely DE, PSO, and HS. As evolutionary algorithms are stochastic, the performance metrics are collected from ten independent executions of each algorithm with the same in-sample and out-sample data. The best, average, and standard deviation over these 10 runs are reported for all evolutionary algorithms and all data sets. Tables 25.6, 25.7, 25.8, and 25.9 list the forecast statistics of the LPNN model trained using different learning algorithms for one day
Table 25.5 Performance Statistics of MDHS–LPNN for the Out-Sample Forecasts of USD/JPY Data Set

Expansion Order   Harmony Memory Size   Parameter Space Size   RMSE     MAPE     MAE
2                 30                    13                     0.0214   0.4828   0.0224
2                 50                    13                     0.0217   0.5011   0.0232
3                 30                    19                     0.0198   0.4699   0.0217
3                 50                    19                     0.0219   0.5191   0.0240
4                 30                    25                     0.0257   0.6263   0.0289
4                 50                    25                     0.0307   0.7819   0.0361
FIGURE 25.7 Convergence curve of MDHS–LPNN for USD/AUD data set.
ahead prediction of exchange rates of USD/AUD, USD/GBP, USD/INR, and USD/JPY, respectively. The results in Tables 25.6 to 25.9 clearly illustrate that the MDHS learning approach outperforms the other three evolutionary learning algorithms used for the LPNN over the out-sample data. The minimum to average RMSE obtained using the MDHS–LPNN model over out-sample data is within the range 0.03 to 0.05 for USD/AUD, 0.03 to 0.05 for USD/GBP, 0.02 to 0.04 for USD/INR, and 0.01 to 0.03 for the USD/JPY data set. The minimum to average MAPE obtained using the MDHS–LPNN model over out-sample data is within the range 0.8 to 1.3 for USD/AUD, 0.5 to 0.9 for USD/GBP, 0.2 to 0.5 for USD/INR, and 0.4 to 0.6 for the USD/JPY data set. The minimum to average MAE obtained using the MDHS–LPNN model over out-sample data is within the range 0.03 to 0.06 for USD/AUD, 0.03 to 0.06 for USD/GBP, 0.02 to 0.05 for USD/INR, and 0.02 to 0.03 for the USD/JPY data set. The actual and predicted exchange rate values for both in-sample and out-sample data of the USD/AUD, USD/GBP, USD/INR, and USD/JPY data sets using the MDHS–LPNN model are compared in Figs. 25.11 to 25.14. Analysis of the prediction results demonstrates that the LPNN network not only provides a higher
FIGURE 25.8 Convergence curve of MDHS–LPNN for USD/GBP data set.
FIGURE 25.9 Convergence curve of MDHS–LPNN for USD/INR data set.
degree of forecasting accuracy with the MDHS learning technique but also performs statistically better than with the other learning techniques evaluated in the study. The performance of the proposed model is also compared with the RBF, MLP, and linear regression models. Tables 25.10, 25.11, 25.12, and 25.13 list the forecast statistics of the predictor models for one-day-ahead prediction of the exchange rates of USD/AUD, USD/GBP, USD/INR, and USD/JPY, respectively. Tables 25.10 to 25.13 include the out-sample forecast statistics such as RMSE,
FIGURE 25.10 Convergence curve of MDHS–LPNN for USD/JPY data set.
Table 25.6 Performance Statistics of LPNN with Different Learning Algorithms for USD/AUD Data Set

Learning Algorithm   Statistic   RMSE In-Sample   RMSE Out-Sample   MAPE Out-Sample   MAE Out-Sample
MDHS                 min         0.0147           0.0310            0.8273            0.0355
MDHS                 avg         0.0164           0.0457            1.2280            0.0530
MDHS                 std         0.0012           0.0180            0.4841            0.0212
HS                   min         0.0151           0.0406            1.0516            0.0458
HS                   avg         0.0164           0.0613            1.6890            0.0734
HS                   std         0.0013           0.0252            0.7633            0.0333
DE                   min         0.0141           0.0402            1.0716            0.0466
DE                   avg         0.0146           0.0557            1.5089            0.0659
DE                   std         0.0006           0.0140            0.4087            0.0180
PSO                  min         0.0138           0.0407            1.0734            0.0465
PSO                  avg         0.0149           0.0588            1.5883            0.0692
PSO                  std         0.0007           0.0209            0.6342            0.0278
MAPE, MAE, MSE, R², and correlation coefficient values obtained by the different predictor models. The tables show that the MDHS–LPNN model provides lower error statistics and a higher R² than all other models for all the data sets; its correlation coefficient is the highest for USD/INR and USD/JPY, while linear regression (USD/AUD) and RBF (USD/GBP) attain slightly higher correlations on the other two data sets. Further, to compare all the models in a single graph, the REC curves for each data set are depicted in Figs. 25.15 to 25.18. In the REC curve analysis, the curve of the MDHS–LPNN model dominates the other curves, which indicates its better prediction performance compared to the MLP, RBF, and linear regression models.
Table 25.7 Performance Statistics of LPNN with Different Learning Algorithms for USD/GBP Data Set

Learning Algorithm   Statistic   RMSE In-Sample   RMSE Out-Sample   MAPE Out-Sample   MAE Out-Sample
MDHS                 min         0.0142           0.0314            0.5796            0.0342
MDHS                 avg         0.0152           0.0471            0.8924            0.0528
MDHS                 std         0.0008           0.0137            0.2852            0.0169
HS                   min         0.0163           0.0407            0.8183            0.0478
HS                   avg         0.0177           0.0616            1.1986            0.0705
HS                   std         0.0011           0.0169            0.3213            0.0193
DE                   min         0.0145           0.0348            0.6495            0.0383
DE                   avg         0.0150           0.0497            0.9408            0.0557
DE                   std         0.0008           0.0117            0.2193            0.0131
PSO                  min         0.0136           0.0353            0.6595            0.0388
PSO                  avg         0.0147           0.0496            0.9298            0.0550
PSO                  std         0.0011           0.0123            0.2295            0.0137
Table 25.8 Performance Statistics of LPNN with Different Learning Algorithms for USD/INR Data Set

Learning Algorithm   Statistic   RMSE In-Sample   RMSE Out-Sample   MAPE Out-Sample   MAE Out-Sample
MDHS                 min         0.0248           0.0257            0.2960            0.0285
MDHS                 avg         0.0256           0.0379            0.4373            0.0423
MDHS                 std         0.0011           0.0098            0.1180            0.0115
HS                   min         0.0254           0.0297            0.3385            0.0327
HS                   avg         0.0265           0.0408            0.4557            0.0441
HS                   std         0.0012           0.0114            0.1165            0.0114
DE                   min         0.0237           0.0359            0.4077            0.0395
DE                   avg         0.0248           0.0418            0.4754            0.0461
DE                   std         0.0007           0.0043            0.0527            0.0051
PSO                  min         0.0235           0.0336            0.3774            0.0365
PSO                  avg         0.0250           0.0404            0.4478            0.0434
PSO                  std         0.0021           0.0054            0.0522            0.0051
25.5 CONCLUSION
This chapter focuses on a hybrid FOREX predictor model developed using the Legendre Polynomial Neural Network with a learning strategy based on modified differential harmony search. The introduction of the LPNN with MDHS learning adds a new dimension to exchange rate prediction. The main advantages of the LPNN are its ability to learn from experience using a polynomial functional expansion block with a small expansion order, rather than hidden layers with several neurons, and a smaller
Table 25.9 Performance Statistics of LPNN with Different Learning Algorithms for USD/JPY Data Set

Learning Algorithm   Statistic   RMSE In-Sample   RMSE Out-Sample   MAPE Out-Sample   MAE Out-Sample
MDHS                 min         0.0182           0.0198            0.4699            0.0217
MDHS                 avg         0.0195           0.0275            0.6486            0.0301
MDHS                 std         0.0011           0.0097            0.2387            0.0112
HS                   min         0.0169           0.0279            0.6501            0.0302
HS                   avg         0.0180           0.0321            0.7891            0.0365
HS                   std         0.0011           0.0071            0.2287            0.0104
DE                   min         0.0149           0.0217            0.4998            0.0232
DE                   avg         0.0167           0.0291            0.7040            0.0327
DE                   std         0.0018           0.0076            0.2083            0.0097
PSO                  min         0.0151           0.0233            0.5441            0.0252
PSO                  avg         0.0184           0.0360            0.8481            0.0393
PSO                  std         0.0027           0.0124            0.3005            0.0138
FIGURE 25.11 Predicted exchange rate of USD/AUD data set using MDHS–LPNN model.
computational complexity compared to the popular MLP network. Further, the modified pitch adjustment operation and the controlling parameter adaptation incorporated into the harmony search technique as MDHS improve both the convergence speed and the predictive ability of the network. The proposed model is tested for one-step-ahead prediction of the exchange rates of the US Dollar (USD) against four other currencies: Australian Dollar (AUD), British Pound (GBP), Indian Rupee (INR), and Japanese Yen (JPY) over the period 1/1/2014 to 31/5/2015. The model verification demonstrates that the proposed network not only provides a higher degree of forecasting accuracy with
FIGURE 25.12 Predicted exchange rate of USD/GBP data set using MDHS–LPNN model.
FIGURE 25.13 Predicted exchange rate of USD/INR data set using MDHS–LPNN model.
MDHS learning technique but also performs statistically better than the other learning techniques and predictor models included in the study. Although the proposed model is able to achieve higher prediction accuracy for highly dynamic FOREX time series data, it still suffers from some limitations. In this study no systematic approach is considered for selecting the expansion order of the LPNN and the controlling parameters of the MDHS technique. The expansion order of the LPNN and the controlling parameters of the MDHS technique are decided through
FIGURE 25.14 Predicted exchange rate of USD/JPY data set using MDHS–LPNN model.
Table 25.10 Performance Comparison of Different Predictor Models for the Out-Sample Forecasts of USD/AUD Data Set

Predictor Model     RMSE     MAPE     MAE      MSE      R²       Correlation
MDHS–LPNN           0.0310   0.8273   0.0355   0.0019   0.8174   0.9211
RBF                 0.0383   0.9970   0.0434   0.0029   0.7334   0.9174
MLP                 0.0428   1.1478   0.0498   0.0037   0.6519   0.8204
Linear regression   0.0450   1.2394   0.0534   0.0041   0.6820   0.9231
Table 25.11 Performance Comparison of Different Predictor Models for the Out-Sample Forecasts of USD/GBP Data Set

Predictor Model     RMSE     MAPE     MAE      MSE      R²       Correlation
MDHS–LPNN           0.0314   0.5796   0.0342   0.0020   0.8369   0.9313
RBF                 0.0486   0.9061   0.0538   0.0047   0.6655   0.9457
MLP                 0.0341   0.6388   0.0376   0.0023   0.8074   0.9041
Linear regression   0.0517   1.0290   0.0604   0.0054   0.6417   0.9171
Table 25.12 Performance Comparison of Different Predictor Models for the Out-Sample Forecasts of USD/INR Data Set

Predictor Model     RMSE     MAPE     MAE      MSE      R²       Correlation
MDHS–LPNN           0.0257   0.2960   0.0285   0.0013   0.8961   0.9493
RBF                 0.0323   0.3571   0.0344   0.0021   0.8363   0.9170
MLP                 0.0474   0.4982   0.0484   0.0045   0.6845   0.9023
Linear regression   0.0503   0.6111   0.0588   0.0051   0.6753   0.9171
Table 25.13 Performance Comparison of Different Predictor Models for the Out-Sample Forecasts of USD/JPY Data Set

Predictor Model     RMSE     MAPE     MAE      MSE      R²       Correlation
MDHS–LPNN           0.0198   0.4699   0.0217   0.0008   0.7164   0.8683
RBF                 0.0216   0.5334   0.0246   0.0009   0.6713   0.8376
MLP                 0.0240   0.5625   0.0260   0.0012   0.5889   0.7998
Linear regression   0.0284   0.7254   0.0335   0.0016   0.5520   0.8611
FIGURE 25.15 REC curve for USD/AUD.
FIGURE 25.16 REC curve for USD/GBP.
FIGURE 25.17 REC curve for USD/INR.
simulation by trial and error. Moreover, the proposed MDHS learning algorithm performs only parameter learning, so the prediction capability of the proposed model may be data-dependent. Future research will focus on both parameter and structure learning of the model. More systematic approaches
FIGURE 25.18 REC curve for USD/JPY.
will be explored for optimal network size. The present approach will also be enhanced for multiple step ahead prediction of exchange rates.
REFERENCES
[1] J. Bi, K.P. Bennett, Regression error characteristic curves, in: Twentieth International Conference on Machine Learning, ICML-2003, 2003.
[2] P. Chakraborty, G.G. Roy, S. Das, D. Jain, A. Abraham, An improved harmony search algorithm with differential mutation operator, Fundam. Inform. 95 (2009) 1–26.
[3] A.S. Chen, M.T. Leung, Regression neural network for error correction in foreign exchange forecasting and trading, Comput. Oper. Res. 31 (7) (2004) 1049–1068.
[4] K.K. Das, J.K. Satapathy, Legendre neural network for nonlinear active noise cancellation with nonlinear secondary path, in: IEEE Conference on Multimedia, Signal Processing and Communication Technologies, IMPACT, 2011, 2011, pp. 40–43.
[5] R. Dash, P.K. Dash, Prediction of financial time series data using hybrid evolutionary Legendre neural network: evolutionary LENN, Int. J. Appl. Evol. Comput. 7 (1) (2016) 16–32.
[6] R. Dash, P. Dash, Efficient stock price prediction using a self-evolving recurrent neuro-fuzzy inference system optimized through a modified differential harmony search technique, Expert Syst. Appl. 52 (2016) 75–90.
[7] R. Dash, P.K. Dash, R. Bisoi, A self-adaptive differential harmony search based optimized extreme learning machine for financial time series prediction, Swarm Evol. Comput. 19 (2014) 25–42.
[8] R. Dash, P.K. Dash, R. Bisoi, A differential harmony search based hybrid interval type2 fuzzy EGARCH model for stock market volatility prediction, Internat. J. Approx. Reason. 59 (2015) 81–104.
[9] S. Galeshchuk, Neural networks performance in exchange rate prediction, Neurocomputing 172 (2016) 446–452.
[10] N.V. George, G. Panda, A reduced complexity adaptive Legendre neural network for nonlinear active noise control, in: IEEE Conference on Systems, Signals and Image Processing, IWSSIP-2012, 19th International Conference, 2012, pp. 560–563.
[11] H. Ince, T.B. Trafalis, A hybrid model for exchange rate prediction, Decis. Support Syst. 42 (2) (2006) 1054–1062.
[12] S.M. Islam, S. Das, S. Ghosh, S. Roy, P.N. Suganthan, An adaptive differential evolution algorithm with novel mutation and crossover strategies for global numerical optimization, IEEE Trans. Syst. Man Cybern., Part B, Cybern. 42 (2) (2012) 482–500.
[13] P.R. Jena, R. Majhi, B. Majhi, Development and performance evaluation of a novel knowledge guided artificial neural network (KGANN) model for exchange rate prediction, J. King Saud Univ., Comput. Inf. Sci. 27 (4) (2015) 450–457.
[14] V.V. Kondratenko, Y.A. Kuperin, Using recurrent neural networks to forecasting of forex, preprint, arXiv:cond-mat/0304469, 2003.
[15] S. Kulluk, L. Ozbakir, A. Baykasoglu, Training neural networks with harmony search algorithms for classification problems, Eng. Appl. Artif. Intell. 25 (1) (2012) 11–19.
[16] M.T. Leung, A.S. Chen, H. Daouk, Forecasting exchange rates using general regression neural networks, Comput. Oper. Res. 27 (11) (2000) 1093–1110.
[17] F. Liu, J. Wang, Fluctuation prediction of stock market index by Legendre neural network with random time strength function, Neurocomputing 83 (2012) 12–21.
[18] R. Majhi, G. Panda, G. Sahoo, Efficient prediction of exchange rates with low complexity artificial neural network models, Expert Syst. Appl. 36 (1) (2009) 181–189.
[19] B. Majhi, M. Rout, R. Majhi, G. Panda, P.J. Fleming, New robust forecasting models for exchange rates prediction, Expert Syst. Appl. 39 (16) (2012) 12658–12670.
[20] B. Naik, J. Nayak, H.S. Behera, A. Abraham, A self-adaptive harmony search based functional link higher order ANN for non-linear data classification, Neurocomputing 179 (2016) 69–87.
[21] S.K. Nanda, D.P. Tripathy, S.S. Mahapatra, Application of Legendre neural network for air quality prediction, in: Proceedings of the 5th PSU-UNS International Conference on Engineering and Technology, ICET’11, 2011, pp. 267–272.
[22] H. Ni, H. Yin, Exchange rate prediction using hybrid neural networks and trading indicators, Neurocomputing 72 (13) (2009) 2815–2823.
[23] J.C. Patra, C. Bornand, Nonlinear dynamic system identification using Legendre neural network, in: IEEE International Joint Conference on Neural Networks, IJCNN-2010, 2010, pp. 1–7.
[24] B. Premanode, C. Toumazou, Improving prediction of exchange rates using differential EMD, Expert Syst. Appl. 40 (1) (2013) 377–384.
[25] N. Poursalehi, A. Zolfaghari, A. Minuchehr, Differential harmony search algorithm to optimize PWRs loading pattern, Nucl. Eng. Des. 257 (2013) 161–174.
[26] A.K. Quin, F. Forbes, Harmony search with differential mutation based pitch adjustment, in: 13th Annual Conference on Genetic and Evolutionary Computation, 2011, pp. 545–552.
[27] M. Rehman, G.M. Khan, S.A. Mahmud, Foreign currency exchange rates prediction using CGP and recurrent neural network, IERI Proc. 10 (2014) 239–244.
[28] N. Rodríguez, Multiscale Legendre neural network for monthly anchovy catches forecasting, in: Third International Symposium on Intelligent Information Technology Application, 2009, IITA 2009, vol. 2, IEEE, 2009, pp. 598–601.
[29] A.K. Rout, P.K. Dash, R. Dash, R. Bisoi, Forecasting financial time series using a low complexity recurrent neural network and evolutionary learning approach, J. King Saud Univ., Comput. Inf. Sci. (2015), in press, https://doi.org/10.1016/j.jksuci.2015.06.002.
[30] M. Rout, B. Majhi, R. Majhi, G. Panda, Forecasting of currency exchange rates using an adaptive ARMA model with differential evolution based training, J. King Saud Univ., Comput. Inf. Sci. 26 (1) (2014) 7–18.
[31] L. Yu, K.K. Lai, S. Wang, Multistage RBF neural network ensemble learning for exchange rates forecasting, Neurocomputing 71 (16) (2008) 3295–3302.
[32] J. Zhang, A.C. Sanderson, JADE: adaptive differential evolution with optional external archive, IEEE Trans. Evol. Comput. 13 (5) (2009) 945–958.
CHAPTER 26
A NEURAL MODEL OF ATTENTION AND FEEDBACK FOR COMPUTING PERCEIVED BRIGHTNESS IN VISION
Ashish Bakshi∗, Kuntal Ghosh†
∗ Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
† Machine Intelligence Unit, Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India
26.1 INTRODUCTION
We human beings normally believe that whatever we perceive through our senses is “reality” itself, as it exists outside of our senses. This belief is so strong that we are willing to entrust our lives to it. It would be impossible to cross a road, for example, if we could not trust our senses to report the presence of an approaching vehicle. However, with scientific progress, we now know that reality is far too vast and far too detailed to be perceived by any finite being. From the fine details of the microscopic world, which our eyes do not have the resolving power to see, to the vast spectrum of electromagnetic waves, which our retina cannot sense, we miss many of the signals from the external world that impinge upon us. Through evolutionary trial and error over eons, we have come to sense just enough to survive and propagate our species. For this purpose, evolution has equipped us with a brain whose purpose is to make inferences about the real world around us from whatever limited data we receive through our senses. However, the inferences made by our brain are just “estimates” of what may be “out there.” These “estimates” are what constitute our perception.
Under normal circumstances, our perceptual experiences are completely determined by the state of the external world that the brain intends to know about. So, if we were to somehow record a person’s perceptual experiences, that record would only tell us about the state of the external world and nothing about that person’s brain. Under certain circumstances, however, the inferences drawn by the brain have been found to be incorrect or inaccurate. These inaccuracies are anomalies in our perceptual experiences. Recording these anomalies would then reveal information about the perceptual apparatus, instead of merely reflecting the state of reality.
Sometimes these inferential errors are only temporary, but at other times they are persistent over time and consistent from person to person. These errors of estimation are termed illusions. When they pertain to the sense of vision they are called visual illusions. These anomalies, i.e. illusions, offer us a window into the inner workings of the brain without having to physically probe it using instruments. By performing simple psychophysical experiments, we can get insights about some of the underlying mechanisms of the brain. Psychophysical experimentation involves subjecting volunteers to a variety of external stimuli
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00026-0 Copyright © 2017 Elsevier Inc. All rights reserved.
CHAPTER 26 A NEURAL MODEL OF ATTENTION AND FEEDBACK
FIGURE 26.1 The Checker shadow illusion by Edward H. Adelson. In the figure on the left the squares marked A and B are of equal luminance, yet A clearly appears darker than B. The figure on the right proves that the two squares are equiluminant by connecting them with solid bars of uniform luminance.
and recording their responses describing their subjective experiences of those stimuli. Sometimes these perceptual anomalies also show a degree of variation from person to person. This variation, too, may sometimes reveal information about the state of the perceptual apparatus of a person (e.g., detecting conditions such as color-blindness, or even some diseases like diabetes). Insights into the human visual system and its underlying mechanisms can also help us develop new computational models, which can be important tools for solving various problems of computer vision and image processing. These computational models can also assist us in mimicking the high robustness and sensitivity characteristics (such as ‘identification,’ ‘discrimination,’ etc.) of human vision. Another practical use of computational models of perception is in the compact representation of visual data for efficient transmission and storage, i.e. image and video compression. In this chapter we describe the Attentive Vision Filter (AVF) model of brightness perception, the only brightness perception model so far which considers attention as an important factor in determining brightness perception. We show that it has the capacity to explain a variety of brightness illusions. While demonstrating this, we also show, via scale considerations, why a non-linear model is necessary to explain brightness perception. Finally, we conclude by noting how both of these may be achieved in the brain by feedback signals from the cortex to the LGN, a scheme that may require, for its implementation, the incorporation of a computational feedback mechanism into the highly successful Deep Convolutional Neural Networks in Computer Vision.
26.2 BRIGHTNESS ILLUSIONS Brightness illusions are a special kind of visual illusion in which different surfaces within the field of view of the eye, having equal luminance, are perceived to be of different brightness depending upon the rest of the field of view [32]. A very famous example of this type of illusion is the Checker shadow illusion created by Edward H. Adelson, shown in Fig. 26.1.
The existence of such illusions implies that the perceived luminance of a region is not determined only by the actual luminance of that region; instead the perceived luminance can be modulated by the observer’s brain depending on the area surrounding the region concerned. Fig. 26.2 shows several types of brightness illusions. A careful look at Fig. 26.2 will reveal that the various types of brightness illusions can be broadly categorized into two contrary types of brightness illusions known as the brightness-contrast and brightness-assimilation illusions. In the brightness-contrast type, the apparent brightness of a region changes in a direction so as to enhance the contrast with respect to the surrounding regions. In other words the perceived luminance gets modulated opposite to the direction of the surrounding regions. Examples of this include the Simultaneous Brightness Contrast (SBC) illusion (Fig. 26.2A) [25] and the Grating illusion (Fig. 26.2B) [15]. In the brightness-assimilation type of illusions, the apparent brightness changes in the same direction as its surroundings, as if it were assimilating the intensity of its surroundings. Examples of this include the White effect [54] (Fig. 26.2C) and the Checkerboard illusion [12] (Fig. 26.2D).
26.3 GENERAL STRUCTURE OF THE EYE–BRAIN SYSTEM The eye–brain complex may be roughly compared to a camera–computer system where the light signals are first transduced by the camera (eye) into an electrical signal which is then transmitted to the computer (brain) to be processed and/or recorded, either in real time or at a later time. The light rays, after being focused by the lens, fall upon the retina, which is embedded with photosensitive receptor cells that convert the light energy into electrochemical signals. These electrochemical signals, after passing through a cascade of retinal neurons ending in the ganglion cells, are collected into a bundle of ganglion-cell axons, known as the optic nerve, which carries the signal into the brain. (Strictly speaking, the eye could be considered an extension of the brain.) In the brain these signals arrive at the thalamus, a structure in the diencephalon that is often called the gateway to the cortex. In particular, a substructure within the thalamus, known as the Lateral Geniculate Nucleus (LGN), receives most of the visual sensory signals. The LGN acts like a relay center to the brain. From the LGN the signals are forwarded into the cerebral cortex, where most of the higher level processing, such as object recognition, is performed. The main region within the cerebral cortex that receives and processes the visual sensory data from the LGN is the Primary Visual Cortex (V1), located in the Occipital Lobe at the back of the brain. Importantly, for this chapter, it has been shown that this flow of visual signals from retina to the cortex is not strictly feedforward. It is well known that there exist significant corticothalamic feedback lines from the cortex back into the LGN [24,27]. These feedback lines can in turn modify the feedforward signals. The necessity of feedback can be clearly understood in the form of an analogy. Imagine one is listening to the radio and the volume is not sufficiently high.
One then reaches out to the volume knob in order to turn up the volume. Thus one has fed a signal back into the radio, which modifies the sound coming from the radio that we wish to hear clearly, until we are satisfied with the volume level. This is a continuous process, i.e. in the next radio programme, the sound levels may change again and we may once again adjust the volume knob. The neural pathways that carry visual signals from the retina to the brain can be divided into at least three types, viz. Parvocellular, Magnocellular, and Koniocellular, originating from three different types (P, M, and K) of retinal ganglion cells [11,10,53,56], which give rise to P, M, and K channels segregated anatomically, physiologically, and behaviorally [49,52]. These channels send visual information from the retina to the cerebral cortex via the LGN in the thalamus.
FIGURE 26.2 Examples of brightness induction illusions. (A) Simultaneous brightness contrast (SBC) illusion: The two gray squares have the same real intensity but the square on the dark background apparently looks brighter than the square on the light background. This is an example of brightness-contrast, i.e. the apparent brightness changes in a direction opposite to its background. (B) Grating induction: the gray strip has uniform intensity but appears to have undulating brightness. This is also an example of brightness-contrast, since here the gray-patch brightness changes in a direction opposite to most of its surroundings considering each of the black or white columns on which it lies. (C) White effect: The two gray bars have the same intensity but the one on the black stripe looks brighter than the one on the white stripe. The bar on the black stripe shares a greater boundary with its two neighboring white stripes than with the black stripe. Similarly the bar on the white stripe shares a greater boundary with its two neighboring black stripes than with the white stripe. So, unlike cases (A) and (B) above, here the brightness changes in the same direction as most of its surroundings. Therefore this is an example of brightness assimilation. (D) Checkerboard illusion: the two gray squares have the same intensity but the one surrounded by white squares looks brighter than the one surrounded by the black squares. This is another example of brightness-assimilation as the change in brightness is in the same direction as the surrounding brightness. (E) Shifted White effect: The two gray bars have the same intensity but the one surrounded on all sides by black regions looks brighter than the one surrounded by white. So, unlike cases (A) and (B) above, here the brightness changes in the same direction as most of its surroundings. Therefore this is an example of brightness assimilation. 
(F) Mach Band Illusion: This consists of steps of intensity-plateaus separated by intensity-gradients. Bright and dark bands can be observed along the lines where the gradients meet the plateaus. These bright and dark bands are illusory brightness peaks and troughs, respectively, in the brightness profile, as the input stimulus has a monotonic intensity profile devoid of any peaks or troughs.
The visual cortex is also thought to be divided into two pathways, one of which is specialized for motion processing and the other for color or form information processing. Several studies [38] provide indirect evidence that the M channel in the subcortical pathway feeds the motion pathway and the P channel drives the color or form pathway of the visual cortex. By selectively blocking the neuronal response of either the P or the M channel in the LGN of macaque monkeys (Macaca fascicularis and M. nemestrina), Ferrera et al. [13] showed, however, that there is an intermixing of P and M channel contributions in the visual area V4 and also in many units of V1, providing evidence that both M and P channels probably make a substantial contribution to the neuronal response in the color or form pathway. It is not unlikely, therefore, that all three channels (including K) may be involved in the process of brightness perception.
26.4 LATERAL INHIBITION Lateral inhibition is the phenomenon in which a neuron’s response to a stimulus is inhibited by the excitation of a neighboring neuron. Lateral inhibition has been experimentally observed in the retina and the LGN of organisms [47]. Lateral inhibition makes neurons more sensitive to a spatially varying stimulus than to a spatially uniform stimulus. This is because a neuron stimulated by a spatially uniform stimulus is also inhibited by its equally excited surrounding neurons, which suppresses its response. On the other hand, a neuron subjected to a spatially varying stimulus is less inhibited by those of its neighbors that are not excited, and thus produces a stronger response. Therefore in the case of visual neurons, lateral inhibition makes them more sensitive to edges in the scene. Although usually described for visual neurons, lateral inhibition is also found in other sensory systems, such as the auditory and olfactory systems. The total region to which a particular neuron is sensitive is called the receptive field of the neuron.
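The edge-enhancing effect described above can be illustrated with a minimal one-dimensional sketch. It assumes a row of units, each excited by its own input and inhibited by its two immediate neighbours; the function name and the inhibition weight of 0.4 are hypothetical choices made only for illustration and are not taken from the chapter.

```python
import numpy as np

def lateral_inhibition_response(stimulus, inhibition=0.4):
    """Each unit is excited by its own input and inhibited by its two neighbours.

    The inhibition weight is an illustrative assumption, not a measured value.
    """
    kernel = np.array([-inhibition, 1.0, -inhibition])
    # Replicate the border values so the ends of the row behave like the interior.
    padded = np.pad(np.asarray(stimulus, dtype=float), 1, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

uniform = np.full(20, 1.0)                            # spatially uniform stimulus
step = np.concatenate([np.zeros(10), np.ones(10)])    # dark-to-bright edge

r_uniform = lateral_inhibition_response(uniform)
r_step = lateral_inhibition_response(step)

print("uniform response:", r_uniform[:3])
print("response around the edge:", r_step[8:12])
```

In this sketch the unit just on the bright side of the edge responds more strongly than units in the uniformly bright interior, because one of its neighbours is silent and therefore does not inhibit it.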
FIGURE 26.3 Typical shape of the DOG function. The x-axis is the spatial distance from a given point. The y-axis is the response produced at the origin by a spot of light falling at a distance x from the origin.
26.5 MODELING BRIGHTNESS ILLUSIONS Brightness perception, and specifically the perception of brightness illusions, has always been a challenge to model mathematically. Brightness illusions have generally been modeled using a spatial filtering function applied to the input stimulus. Such models have been justified by the experimental observation of lateral inhibition in organisms. Lateral inhibition can be qualitatively modeled using the convolution operation of signal processing. Using a convolution kernel that takes positive values at the center and negative values in the surround, we can simulate the effect of lateral inhibition on the response signal. Thus a strong signal in the surrounding parts will inhibit the response signal at the center. Although lateral inhibition can partially explain some brightness-contrast illusions such as SBC, it cannot explain brightness-assimilation illusions such as the White effect. Several approaches have been proposed to explain brightness illusions, but among them the ‘Difference of Gaussian (DOG)’ model of the visual receptive field was the first to successfully explain a series of brightness illusions. The DOG model and its variants portray the ganglion receptive field as the difference of two concentric Gaussians with different spreads [47,57,3], as described by the equation:
$$\mathrm{DOG}(r;\sigma_1,\sigma_2) = A_1 e^{-\frac{r^2}{2\sigma_1^2}} - A_2 e^{-\frac{r^2}{2\sigma_2^2}} \tag{26.1}$$
where r is the radial distance from the center of the receptive field (Fig. 26.3). An on-centered ganglion cell receptive field produces a positive response when stimulated in its central region and a negative response when stimulated in the outskirt regions. In order that Eq. (26.1) possess this property, we must have σ2 > σ1 and A1 > A2. Notice that the DOG model is a linear model. This is because the convolution operation is a linear operation. A linear operation is an operation whose output response to a sum of two inputs is the sum of the individual responses to each of those inputs. An operation which does not follow this linearity property is called a non-linear operation.
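As a hedged illustration of Eq. (26.1) and of the linearity property just stated, the sketch below samples a one-dimensional DOG profile and checks numerically that filtering the sum of two stimuli gives the sum of the individual filtered stimuli. The amplitudes and spreads used here (A1 = 1, A2 = 0.3, σ1 = 1, σ2 = 3) are arbitrary values chosen only to satisfy σ2 > σ1 and A1 > A2; they are not the parameters of any model cited in this chapter.

```python
import numpy as np

def dog(r, a1, a2, sigma1, sigma2):
    """Difference of Gaussians of Eq. (26.1), evaluated at radial distance r."""
    return (a1 * np.exp(-r**2 / (2 * sigma1**2))
            - a2 * np.exp(-r**2 / (2 * sigma2**2)))

# Sampled 1D profile; parameter values are illustrative assumptions only.
r = np.arange(-15, 16, dtype=float)
kernel = dog(r, a1=1.0, a2=0.3, sigma1=1.0, sigma2=3.0)

# Two arbitrary test stimuli.
rng = np.random.default_rng(0)
s1 = rng.random(200)
s2 = rng.random(200)

# Linearity: response to (s1 + s2) equals response to s1 plus response to s2.
response_to_sum = np.convolve(s1 + s2, kernel, mode="same")
sum_of_responses = (np.convolve(s1, kernel, mode="same")
                    + np.convolve(s2, kernel, mode="same"))

print("centre value:", kernel[15], "surround value at r = 3:", kernel[18])
print("linear:", np.allclose(response_to_sum, sum_of_responses))
```

The kernel is positive at the centre and negative in the surround, which is the on-centre profile sketched in Fig. 26.3.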
One variant of this is the well-known model of Blakeslee and McCourt [3], which demonstrated that a linear combination of seven differentially weighted isotropic DOG filters can account for both the Grating Induction (GI) (Fig. 26.2B) and the Simultaneous Brightness Contrast (SBC) [25] illusions (Fig. 26.2A). They proposed their model as a description of cortical filtering. Several other phenomena, such as the Hermann Grid illusion [26] and different variants of the Grating Induction illusion [37], can also be accounted for by the multiscale DOG model of Blakeslee and McCourt [3]. However, the DOG model fails to account for brightness-assimilation illusions, which include the White effect (Fig. 26.2C) and the Shifted White effect (Fig. 26.2E). The brightness-assimilation illusions are a distinct class of illusions from the brightness-contrast illusions, to which the previously mentioned Grating Induction (GI) and Simultaneous Brightness Contrast (SBC) illusions belong. Perception of these types of illusions is mainly influenced by the direction of brightness induction. When two gray patches of the same gray-value are separately placed on a white bar and a black bar of a square-wave grating (see Fig. 26.2C), the gray patch on the black bar looks brighter than the gray patch on the white bar, and this effect is called White’s illusion [54]. The Shifted White illusion is similar to White’s illusion, except that the horizontal sections of the square-wave background containing the gray patches are shifted so that their intensities undergo a phase change of 180 degrees with respect to the upper and lower regions. To explain these kinds of orientation-dependent illusions along with the previously mentioned boundary-contrast-dependent illusions (like GI and SBC), Blakeslee and McCourt [4,6] proposed a new ‘Oriented Difference of Gaussians (ODOG)’ model.
The ODOG model consists of a set of forty-two anisotropic DOG functions with seven different length scales and six different orientation directions. The outputs from each of the oriented DOG filters are then non-linearly combined, by adding their RMS normalized values, to produce the final output. It is because of this last step that the ODOG model is a non-linear model. This spatial filtering model was capable of explaining a large number of illusions. One major problem with ODOG is that its computation is very complex, and physiological evidence does not support the proposed level of complexity, especially the last non-linear RMS normalization step. To model the low-level visual mechanism in a more physiologically plausible manner with reduced complexity, Ghosh et al. [17–21] proposed an extended classical receptive field (ECRF) model. The ECRF model extends the concept of the classical receptive field by proposing that beyond the region of negative response of a particular ganglion cell there lies a further region of positive response, which is small in magnitude. Mathematically, this effect is produced by adding one more positive Gaussian term to the DOG expression of Eq. (26.1). Evidence from recent literature has shown that the extended surround is likely to play a crucial role in the attention-driven mechanisms of the visual system [33]. Despite the successes and failures of the above-mentioned models, no solid biological foundation has been found for any of them. A mathematical model based upon direct neurophysiological data is yet to be formulated.
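The reason an RMS normalization step makes a filter-bank model non-linear can be seen in a small sketch. It is not an implementation of ODOG: the two one-dimensional kernels below are hypothetical stand-ins for the oriented filters, and the normalization is applied exactly as paraphrased in this section (each filter output is divided by its own RMS value before summation); the full procedure is specified in [4,6].

```python
import numpy as np

def filter_bank_responses(stimulus, kernels):
    """Linear stage: convolve the stimulus with every kernel in the bank."""
    return [np.convolve(stimulus, k, mode="same") for k in kernels]

def rms_normalised_sum(responses):
    """Non-linear stage: divide each response by its RMS value, then add."""
    total = np.zeros_like(responses[0], dtype=float)
    for resp in responses:
        rms = np.sqrt(np.mean(resp**2))
        if rms > 0:                      # skip silent filters to avoid 0/0
            total += resp / rms
    return total

# Placeholder kernels (assumptions), standing in for the oriented DOG filters.
kernels = [np.array([-0.5, 1.0, -0.5]),
           np.array([-0.25, -0.25, 1.0, -0.25, -0.25])]

rng = np.random.default_rng(1)
stimulus = rng.random(100)

out = rms_normalised_sum(filter_bank_responses(stimulus, kernels))
out_scaled = rms_normalised_sum(filter_bank_responses(3.0 * stimulus, kernels))

# A linear model would give 3 * out for a stimulus that is three times stronger.
print("unchanged by scaling the input:", np.allclose(out, out_scaled))
```

Because each output is divided by its own RMS value, tripling the stimulus leaves the combined output unchanged instead of tripling it, which violates the linearity property defined in Section 26.5.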
26.6 MOTIVATION FOR THE ATTENTIVE VISION FILTER MODEL The role of attention has always been an ignored factor in modeling of brightness perception. Carrasco et al. [8] have demonstrated that attention can enhance both contrast sensitivity and spatial resolution. McAlonan et al. [36] have shown that attention can differentially modulate the various channels of
visual information from retina to brain. From these it can be reasonably expected that attention must also play an important role in illusion perception. The Attentive Vision Filter (AVF) model attempts to incorporate the role of visual attention as a contributing factor in explaining brightness illusions. The modeling of brightness induction as the effect of frequency-selective spatial filtering on the one hand [6], and experimental evidence of frequency-dependent contrast sensitivity of subcortical parallel channels in determining color or form of the visual world on the other [38], motivated Ghosh [21] to explore whether the three ECRF-based spatial filters proposed by Ghosh et al. [19] can represent an appropriate model for the three parallel channels, viz. Parvocellular, Magnocellular, and Koniocellular, in the central visual pathway with respect to explaining both the brightness-contrast and brightness-assimilation illusions at three different parameter settings, simultaneously verifying the observations of Blakeslee & McCourt [4,6] and Blakeslee et al. [5] from this perspective. Ghosh [21] has already demonstrated that the ECRF model can be linked to the three parallel channels, especially in view of the fact that the Magnocellular channel, with its higher conduction velocity and other favorable characteristics, may play a leading role in driving the attention mechanism towards brightness perception. However, Ghosh [21] could not draw any conclusion on how the attention mechanism may come into play in combining the channels, at least the two major complementary channels [38], i.e. Parvo and Magno, in the central visual pathway. Our proposed algorithm in this chapter is inspired by a two-pass model of attentive vision as described below. As mentioned earlier, the visual information entering the brain from the eyes can be separated broadly into the Magnocellular and Parvocellular pathways.
It has been known for a long time that signals are conducted faster through the Magnocellular channels than through the other channels [29,48]. Activity transferred through the Magnocellular neurons of the LGN reaches area V1 some 20 ms earlier than the activity transferred through the Parvocellular neurons of the LGN [40], which shows that despite the two channels converging beyond layer 4C, M activity precedes P activity in the different layers of V1. Based on latencies of visual responses of neurons [35] in different cortical areas, Bullier [7] argues that characteristics of the M channel such as high contrast sensitivity, poor chromatic selectivity, larger receptive fields, and lower spatial resolution are well suited for a first-pass ‘vision at a glance.’ So, although the Parvocellular pathway carries much more detail than the Magnocellular pathway, owing to the higher spatial resolution of the midget cells of this channel, the Magnocellular pathway can carry overall holistic information much faster than the Parvocellular pathway [28,38,1,30]. According to the 2-pass model of attentive vision the visual process is divided into two stages. In the first stage, called ‘vision at a glance,’ the brain first interprets the contents of the Magnocellular pathway. If it can find sufficient detail in this stage itself, then it virtually ignores the contents of the Parvocellular pathway. In other words, if the brain can obtain sufficient information content about its environment from this channel alone then it does not bother interpreting the other channel. If it cannot find sufficient detail then it enters the second stage, which is called ‘vision with scrutiny’; in this stage the brain examines the contents of the Parvocellular pathway to find further details in those regions where sufficient details were not found. In our model we implement the Magnocellular and Parvocellular pathways using linear filters, which we call M and P filters respectively.
The M filter has a larger spatial sampling interval, reflecting the fact that the Magnocellular pathway has lower spatial resolution, whereas the P filter has much finer spatial resolution, just as in the biological visual system. The above-mentioned two-stage process can be depicted as shown in Fig. 26.4.
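The control flow of the two-pass scheme can be sketched as follows. This is only a schematic reading of the description above, under loudly stated assumptions: the chapter does not say how "sufficient detail" is measured, so the criterion used here (magnitude of the M response compared with a threshold) and the names `two_pass_percept` and `detail_threshold` are hypothetical, and both channel responses are assumed to have been resampled onto a common grid. The AVF model itself replaces this hard switch by the weighted combination of Eq. (26.3).

```python
import numpy as np

def two_pass_percept(m_response, p_response, detail_threshold):
    """Schematic two-pass selection between M and P channel responses.

    Pass 1 ('vision at a glance'): accept the M response wherever it carries
    enough detail. Pass 2 ('vision with scrutiny'): consult the P response
    only in the remaining regions.
    """
    m_response = np.asarray(m_response, dtype=float)
    p_response = np.asarray(p_response, dtype=float)
    # Hypothetical sufficiency criterion: weak M response means 'not enough detail'.
    needs_scrutiny = np.abs(m_response) < detail_threshold
    percept = np.where(needs_scrutiny, p_response, m_response)
    return percept, needs_scrutiny

# Toy responses on a common four-sample grid.
m = np.array([0.0, 2.0, 0.1, 3.0])
p = np.array([10.0, 20.0, 30.0, 40.0])
percept, scrutinised = two_pass_percept(m, p, detail_threshold=0.5)
print(percept, scrutinised)
```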
FIGURE 26.4 Flow chart showing the 2-pass model of attentive vision which motivates the AVF model.
26.7 THE ECRF FILTER AND ATTENTIVE VISION FILTER (AVF) The ECRF filter [21] was developed in response to the somewhat inconsistent performance of the DOG, ODOG, and FLODOG filters [45,46]. None of them is able to explain all the brightness illusions involving either brightness-assimilation or brightness-contrast. Another major shortcoming concerns the physiological plausibility of these filters [21]. The responses of the ECRF are inspired by the Parvocellular (P), Magnocellular (M), and Koniocellular (K) channels of the human visual system [28]. The M channel drives the motion pathway and the P channel drives the color pathway of the visual cortex. However, selective blocking studies have shown an intermixing of the P and M channel responses [13]. This intermixing of the responses of the different channels has been exploited to derive a mathematical model of the filter. As explained earlier, the DOG filter could explain many brightness-contrast illusions including SBC and Grating Induction, but it performed poorly when dealing with brightness-assimilation illusions such as White’s illusion. The ECRF filter [21] is similar to the DOG filter but with an additional Gaussian term added to it, and just like the DOG model the ECRF model is a linear convolutional model:
$$\mathrm{ECRF}(A,B,C,\sigma_1,\sigma_2,\sigma_3) = A e^{-\frac{r^2}{2\sigma_1^2}} - B e^{-\frac{r^2}{2\sigma_2^2}} + C e^{-\frac{r^2}{2\sigma_3^2}} \tag{26.2}$$
According to the earlier work of Ghosh [21], in the ECRF filter model the parameters σ1, σ2, and σ3 represent the spreads of the classical center, the classical antagonistic surround, and the non-classical extended disinhibitory surround, respectively [21]. We also choose σ1 = 0.7, σ2 = 3σ1, σ3 = 9.3σ1 for the Attentive Vision Filter (AVF), since according to Shou et al. [51] the diameter of the extended non-classical surround is at least ten times that of the classical center [21]. Here the underlying hypothesis is that Eq. (26.2) with different parameter values (A, B, C and σ) mimics the role of the Parvocellular and Magnocellular pathways, thus considering the two major complementary channels in the central visual pathway [38]. It is also assumed, as in Ghosh [21], that the initial sensory perception in the visual system is performed by the M channel and the detailed analysis is later done by the P channel. It is also well known that the M channel has lower spatial resolution than the P channel. This means that the photosensitive cells which relay visual information to the M channel nerve fibers have a greater spatial distance between them than the photosensitive cells which relay visual information to the P channel nerve fibers. This fact must be important in how the information relayed by the M and P channels is further processed in the LGN and cortex. In order to mimic this biological fact in our computational model, we sample the input stimulus at a lower resolution before passing it to the M channel filter (as modeled by the ECRF filter with the parameter values shown in Table 26.1) and at a higher resolution before passing it to the P channel filter. The sampling intervals we use for the M and P channels are given in Table 26.1. The final brightness percept is formed in the visual cortex through a linear combination of the M and P channels.
The degree to which the delayed P output is combined with the initial M output depends upon the contextual role of the attention mechanism in the visual pathway. The more important the role of the attention, the greater the component of P in the brightness percept. This we model by introducing a parameter α, to which we refer as the Factor of Attention (FOA). So in the AVF model, the outputs of the M channel and the P channel are finally combined through the following equation:
$$\mathrm{AVF}(\alpha) = \alpha P + (1 - \alpha) M \tag{26.3}$$
where P and M denote the outputs of the P and M channels, obtained from their respective Gaussian expressions of Eq. (26.2), and α is a weight value representing the Factor of Attention. We are going to show that α, which varies between 0 and 1, is either very high (close to 1) or medium (close to 0.5), according to whether the illusion is of the brightness-contrast or the brightness-assimilation type. By Eq. (26.3), this implies that in the case of brightness-contrast, attentive vision through P plays the major role, whereas in brightness-assimilation P plays an almost competing role with M in producing the final brightness percept. Such a theory corroborates the prevalent view that brightness-contrast appears when the stimulus mainly loses its low frequency content due to spatial filtering by channels tuned to high spatial frequencies, which is P as per our assumption [21]; on the other hand, if the channels tuned to low spatial frequencies, i.e. M according to our assumption [21], mainly filter out the high frequency content of the stimulus, which happens when the value of α decreases to 0.5 or less, the result is brightness-assimilation [6]. The AVF model, although it is a linear combination of two linear convolutional kernels, is itself a non-linear model because the coefficients α and (1 − α) assume variable values, as stated above, depending on the input stimulus. Therefore, if we were to add two input stimuli with two different values of the Factor of Attention (α), then the resultant response would not, in general, be the sum of the responses to the individual stimuli. If on the other hand α were assigned a fixed constant value for every input stimulus then the AVF model would reduce to a linear model.
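The linearity argument above can be checked numerically with a short sketch of Eq. (26.3). The two kernels below are placeholder linear filters standing in for the P and M channel filters (they are not the ECRF kernels of Table 26.1), and the function names are hypothetical; the values α = 0.90 and α = 0.45 are the ones quoted in the captions of Figs. 26.5 and 26.6.

```python
import numpy as np

def avf(p_response, m_response, alpha):
    """Eq. (26.3): weighted combination of the P and M channel outputs."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha (Factor of Attention) must lie in [0, 1]")
    return alpha * p_response + (1.0 - alpha) * m_response

# Placeholder linear filters (assumptions), not the chapter's ECRF kernels.
p_kernel = np.array([-0.5, 1.0, -0.5])
m_kernel = np.array([0.25, 0.5, 0.25])

def channels(stimulus):
    """Return the (P, M) responses of a stimulus under the placeholder filters."""
    return (np.convolve(stimulus, p_kernel, mode="same"),
            np.convolve(stimulus, m_kernel, mode="same"))

rng = np.random.default_rng(2)
s1, s2 = rng.random(50), rng.random(50)

# Case 1: alpha fixed for every stimulus, so the model is additive (linear).
fixed = 0.90
additive_fixed = np.allclose(
    avf(*channels(s1 + s2), fixed),
    avf(*channels(s1), fixed) + avf(*channels(s2), fixed))

# Case 2: alpha depends on the stimulus (0.90 for s1, 0.45 for s2).
sum_of_responses = avf(*channels(s1), 0.90) + avf(*channels(s2), 0.45)
additive_variable = any(
    np.allclose(avf(*channels(s1 + s2), a), sum_of_responses)
    for a in (0.45, 0.90))

print("fixed alpha additive:", additive_fixed)
print("stimulus-dependent alpha additive:", additive_variable)
```

With a fixed α the response to the summed stimulus equals the sum of the responses, whereas with stimulus-dependent α neither of the two candidate values reproduces that sum for these test stimuli.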
Table 26.1 AVF Filter Coefficients Used to Implement the P and M Channels
Channel   A    B    C      σ1    σ2 (= 3σ1)   σ3 (= 9.3σ1)   Sampling Interval
P         30   1    0.01   0.7   2.1          6.51           0.13
M         20   1    0.08   0.7   2.1          6.51           0.43
The AVF filter coefficients used to implement the P channel and the M channel are shown in Table 26.1.
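A minimal sketch of how the P and M kernels could be tabulated from Eq. (26.2) with the coefficients of Table 26.1 is given below. Two points are assumptions rather than statements of the chapter: the radial distance r, the spreads σ, and the sampling interval are treated as being expressed in the same (unspecified) unit, and the kernel support of ±20 units (`half_width`) is an arbitrary truncation. The helper name `sampled_kernel` is hypothetical.

```python
import numpy as np

def ecrf(r, a, b, c, sigma1, sigma2, sigma3):
    """ECRF profile of Eq. (26.2): centre, antagonistic surround, extended surround."""
    return (a * np.exp(-r**2 / (2 * sigma1**2))
            - b * np.exp(-r**2 / (2 * sigma2**2))
            + c * np.exp(-r**2 / (2 * sigma3**2)))

SIGMA1 = 0.7
# Coefficients from Table 26.1.
PARAMS = {
    "P": dict(a=30.0, b=1.0, c=0.01,
              sigma1=SIGMA1, sigma2=3 * SIGMA1, sigma3=9.3 * SIGMA1),
    "M": dict(a=20.0, b=1.0, c=0.08,
              sigma1=SIGMA1, sigma2=3 * SIGMA1, sigma3=9.3 * SIGMA1),
}
SAMPLING_INTERVAL = {"P": 0.13, "M": 0.43}

def sampled_kernel(channel, half_width=20.0):
    """Tabulate the channel's ECRF profile on a grid with its sampling interval.

    Assumption: r, sigma and the sampling interval share one unit.
    """
    step = SAMPLING_INTERVAL[channel]
    n = int(half_width / step)
    r = step * np.arange(-n, n + 1)
    return r, ecrf(r, **PARAMS[channel])

r_p, k_p = sampled_kernel("P")
r_m, k_m = sampled_kernel("M")
print("P samples:", k_p.size, "M samples:", k_m.size)
print("P profile at r = 0, 3, 15:",
      [float(ecrf(x, **PARAMS["P"])) for x in (0.0, 3.0, 15.0)])
```

With these coefficients the P profile is positive at the centre, negative in the classical surround, and weakly positive again in the extended surround, which is the three-zone structure of the ECRF described in Section 26.5.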
26.8 SAMPLE RESULTS FROM THE AVF FILTER In this section we illustrate the performance of the AVF filter for a few brightness illusions. We show that the AVF filter, even with its relatively simple design, can explain both brightness-contrast and brightness-assimilation illusions with a single model.
26.8.1 SIMULTANEOUS BRIGHTNESS CONTRAST (SBC) Fig. 26.5 shows an SBC stimulus along with the AVF response output. The two gray patches have equal intensity but appear unequally bright. The AVF response profile is plotted in Fig. 26.5B, from which it can be seen that the mean brightness of the gray patch on the left is lower than the mean brightness of the gray patch on the right, i.e. contrast gets enhanced.
26.8.2 WHITE’S ILLUSION Fig. 26.6 shows White’s stimulus along with its AVF response profile. As can be seen from Fig. 26.6B, the AVF response value for the left gray patch is higher than the AVF response value for the right gray patch, i.e. contrast is getting reduced, or in other words, brightness is getting assimilated from the surroundings. Thus this illusion is opposite of that in Fig. 26.5 (SBC), and the AVF filter can explain both of them.
26.8.3 SHIFTED WHITE’S ILLUSION Fig. 26.7 shows the Shifted-White stimulus along with its AVF response profile. Just as in the previous illusion, here brightness-assimilation occurs and it is correctly reflected in the response profile of Fig. 26.7B.
26.8.4 SINUSOIDAL GRATING Fig. 26.8 shows the sinusoidal grating stimulus along with the AVF response profile. The horizontal gray strip contains only equal intensity pixels. It can be seen by comparing Figs. 26.8A–B that the output profile is 180 degrees out of phase from the stimulus background. For example, at the abscissa value of 25 degrees, while the sinusoidal grating in Fig. 26.8A shows an Intensity peak, the AVF
FIGURE 26.5 (A) Simultaneous Brightness Contrast (SBC) stimulus of patch size 1.53 degrees. (B) AVF predicted output intensity profile for the SBC stimulus in (A) at alpha = 0.90.
response at that 25-degree horizontal distance value shows a trough. This matches with our perceptual experience, in which the uniform gray strip appears to have undulating brightness which is opposite to that of the background grating intensity. Like in the case of SBC (Fig. 26.5), this is another example where the AVF filter can account for brightness-contrast illusions.
26.9 NECESSITY OF NON-LINEARITY IN EXPLAINING BRIGHTNESS ILLUSIONS We next describe an additional important property of brightness illusions. This is their scaling behavior, i.e. how the brightness illusion changes when the length scale of the stimulus is changed. From this scaling behavior we try to answer the question whether linear filtering models can explain brightness illusions. For this purpose we choose the Mach band illusion. From psychophysical studies on the Mach band illusion it can be shown that the length scale of the illusory effect scales in proportion to the length scale of the stimulus [2]. This reveals an important property of the human perceptual system – the illusory effect must be rendered in the same way at every length scale within the brain. This means that there must be neurons that are computing the illusory effect at various length scales from
FIGURE 26.6 (A) White’s illusion with frequency of 0.53 cycles/degree and patch height of 3.12 degrees. (B) Intensity profile given by AVF at alpha value of 0.45.
very small scale to very large scale (and also at intermediate scales). We conclude that linear filtering cannot produce this scale invariant behavior.
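The scale argument can be made concrete for the simplest case of a single fixed linear filter. The sketch below filters Mach-type ramps of different widths with one zero-sum DOG kernel and measures how wide the predicted bright band is at the upper knee of the ramp. The kernel spreads (2 and 6 samples), the plateau length, and the half-maximum width measure are all illustrative assumptions; the sketch does not cover multi-scale linear models, which need the fuller argument of this section.

```python
import numpy as np

def balanced_dog_kernel(sigma1=2.0, sigma2=6.0, half_width=40):
    """Zero-sum DOG kernel with a fixed length scale (illustrative values)."""
    r = np.arange(-half_width, half_width + 1, dtype=float)
    center = np.exp(-r**2 / (2 * sigma1**2))
    surround = np.exp(-r**2 / (2 * sigma2**2))
    return center / center.sum() - surround / surround.sum()

def mach_ramp(ramp_width, plateau=300):
    """Low plateau, linear ramp of the given width, high plateau."""
    return np.concatenate([np.zeros(plateau),
                           np.linspace(0.0, 1.0, ramp_width),
                           np.ones(plateau)])

def bright_band_width(ramp_width, plateau=300, window=30):
    """Number of samples above half of the peak response near the upper knee."""
    response = np.convolve(mach_ramp(ramp_width, plateau),
                           balanced_dog_kernel(), mode="same")
    knee = plateau + ramp_width - 1          # last sample of the ramp
    local = response[knee - window: knee + window + 1]
    return int(np.sum(local > 0.5 * local.max()))

ramp_widths = [100, 200, 400]
band_widths = [bright_band_width(w) for w in ramp_widths]
print(dict(zip(ramp_widths, band_widths)))
```

In this sketch the predicted band width is set by the spreads of the kernel and stays essentially the same when the ramp is made four times wider, whereas the psychophysical observation cited above [2] is that the band scales in proportion to the stimulus.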
26.9.1 THE MACH BAND ILLUSION Mach bands are light and dark bands that are visible near points of high spatial rate of change of intensity gradients (i.e. high magnitude of the second derivative), as shown in Fig. 26.2F and Fig. 26.9. If any two neighbouring intensity plateaus are interpolated by a region of uniform brightness gradient, then a bright band can be seen at the boundary where the gradient meets the higher intensity plateau and a dark band can be seen where the gradient meets the lower intensity plateau. Mach bands are named after the Austrian physicist Ernst Mach, who first observed them in 1865 [42]. Mach bands are not only present in laboratory or artificial situations; they may easily be observed at the edge of practically all shadows, where light or dark lines surround the penumbra. Fomm’s striae [16], seen while determining the wavelength of X-rays from diffraction experiments, turned out to be nothing but the result of Mach band illusions and a serious mistake in experimental physics [55]. This brightness perception illusion was also found to be the culprit in the well-known discrepancy in determination of
FIGURE 26.7 (A) The shifted white illusion with frequency of 0.53 cycles/degree having patch height 3.12 degrees. (B) Intensity profile given by AVF at alpha value of 0.45.
Earth’s radius from its shadow during lunar eclipse, and the correct explanation was finally provided by physiological/perceptual optics rather than by physical or geometrical optics. Even to this day, Mach bands remain an excellent subject of study in linking perception with the underlying neural mechanisms. In fact, Mach himself provided an affirmative answer to the question of linearity by proposing a linear spatial filter based retinal model of visual perception to explain his own observation of the fictitious bright and dark bands. Mach bands have proved to be an excellent paradigm to probe several vision mechanisms, such as the role of edges in early vision [34], the nature of lateral inhibition [42], the importance of phase information [39], multi-channel information processing [14], and linearity in the visual system [31], leading to various theories for explaining the Mach bands. Of these, the last three approaches mentioned are practically all multi-scale models of vision. Some of these provide, in their own way, explicit though qualitative explanations for the absence of Mach bands at luminance steps [41]. On the other hand, the Grossberg–Todorovic model [23], despite attempting a quantitative explanation based on a filling-in mechanism, fails to account for why Mach bands are strong at ramps but weak or nonexistent at steps. It can be shown why linear models will always be inappropriate in explaining the Mach bands, and furthermore, by demonstrating the scaling properties of the width of Mach bands, it can also be shown why the bands practically vanish at
FIGURE 26.8 (A) Sinusoidal grating stimulus with spatial frequency of 0.18 cycles/degree and gray strip width of 0.56 degrees. (B) AVF predicted output intensity profile for the gray strip as in Fig. 26.8A at alpha =0.90.
luminance steps [2]. In this section we argue, from length-scale considerations, why any linear model in vision will always remain inadequate in explaining the appearance of the Mach bands. In the process we shall also be able to provide clues towards the solution of a related long-standing problem in visual perception. The most significant contribution of the study of Mach bands from its outset lies in its establishing an intimate link between perception, its underlying neural mechanisms, and computational theories. This started with Mach himself. To explain his own observations, E. Mach proposed a mathematical model of visual perception from retinal images [42]. He stated: “Let us call the intensity of illumination $u$ on a uniform mat plane where $u = f(x, y)$. Thus, the brightness sensation $v$ of the corresponding retinal point is given by: $v = u - m\left(d^2u/dx^2 + d^2u/dy^2\right)$.” This equation also happens to be the first computational model for Mach bands, clearly representing a linear spatial filter. The assumption behind this model, in the words of Mach himself, was: “The illumination of a retinal point will, in proportion to the difference between this illumination and the average of the illumination on neighboring points, appear brighter or darker, respectively depending on whether the illumination of it is above or below that average. The weight of the retinal points in this average is to be thought of as rapidly decreasing
FIGURE 26.9 (A) The Mach band stimulus (above) along with its corresponding intensity profile. The Mach bands are the dark and bright lines seen along the line where the gradient abruptly flattens into a plateau. (B) Multiple Mach band image with intensity profile superposed on top.
with distance from the particular point considered [42].” So what Mach actually speculated from his observations is as follows:
• Perceived brightness = f (real intensity, neighboring intensities).
• Neighboring intensities inhibit real intensity.
• As neighboring intensities go up, perceived brightness goes down and vice versa.
Ratliff and Hartline [43] later studied the neural responses of Limulus ommatidia to a luminance ramp of the type shown in Fig. 26.9A, which is known to elicit Mach bands, and observed that the responses displayed undershoots and overshoots at the inflection points. If we apply a discrete version of Mach’s equation stated above as a spatial filter to any intensity ramp, then the horizontal line profile drawn through the corresponding output is found to be as shown in Fig. 26.10B. Thus the light and dark bands appear to be explained by Mach’s linear filter based model.
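The discrete version of Mach's equation mentioned above can be sketched in a few lines of Python. This is only a minimal one-dimensional illustration, not the implementation used to produce Fig. 26.10: the function names, the weight `m = 4.0`, the plateau levels and the ramp length are all assumptions chosen here for demonstration.

```python
def ramp_profile(n_low, n_ramp, n_high, low=0.2, high=0.8):
    """Low plateau, linear ramp, high plateau (illustrative values, assumed)."""
    step = (high - low) / n_ramp
    rising = [low + step * i for i in range(1, n_ramp)]
    return [low] * n_low + rising + [high] * n_high


def mach_filter(u, m=4.0):
    """Discrete 1-D form of v = u - m * d2u/dx2; end samples are left unchanged."""
    v = list(u)
    for i in range(1, len(u) - 1):
        second_difference = u[i - 1] - 2.0 * u[i] + u[i + 1]
        v[i] = u[i] - m * second_difference
    return v


stimulus = ramp_profile(n_low=20, n_ramp=30, n_high=20)
response = mach_filter(stimulus)

dark_band = response.index(min(response))    # foot of the ramp
bright_band = response.index(max(response))  # top of the ramp
print("undershoot at sample", dark_band, "value", round(min(response), 3))
print("overshoot at sample", bright_band, "value", round(max(response), 3))
```

In this sketch the second difference is non-zero only at the two knees of the ramp, so the output dips below the low plateau at the foot of the ramp and rises above the high plateau at its top, which is the qualitative behavior described for Fig. 26.10B.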
FIGURE 26.10 The horizontal line profile shown in (B) is drawn through the output of applying Mach’s model to an intensity ramp as in (A), used by Ratliff and Hartline [43] to study the neural responses of limulus eye, which seems to explain the visible bright and dark Mach bands.
It is also highly interesting that the last line in the second quotation that we have extracted from Mach’s paper [42] clearly points towards the existence of a center-surround smoothing mechanism in the retina, and it was exactly a century later that the probing microelectrodes of two physiologists, Rodieck and Stone [47], discovered the truth of this scientific prophecy. Their discovery of lateral inhibition in retinal ganglion cells led to a second linear model, viz. the Difference of Gaussians (DOG) model, that could also provide a similar explanation for the occurrence of Mach bands. The well-known lateral inhibition based DOG model is

$$\mathrm{DOG}(r; \sigma_1, \sigma_2) = A_1 e^{-r^2/(2\sigma_1^2)} - A_2 e^{-r^2/(2\sigma_2^2)}$$

where $A_1$ and $A_2$ represent the weights, while $\sigma_1$ and $\sigma_2$ represent in one dimension the scales of the classical center and the antagonistic surround respectively. It can easily be extended to two dimensions, as in the case of Mach’s model, by using 2-D Gaussians. In Fig. 26.11 we demonstrate the result of applying this 2-D DOG model, as proposed by the neurophysiologists, to a trapezoidal staircase-like stimulus like the one shown in Fig. 26.9B. The results confirm the efficacy of this linear model of lateral inhibition too. The third linear model, an algorithm based computational one, was established by David Marr [34], who actually combined the concepts of the previous two models (those of Mach and Rodieck–Stone) by proposing a new Laplacian of Gaussian (LOG) model. The LOG, which was shown by Marr to be a sort of equivalent of the DOG, is given by

$$\nabla^2 G(r) = -\frac{1}{\pi\sigma^4}\left(1 - \frac{r^2}{2\sigma^2}\right) e^{-r^2/(2\sigma^2)}$$

where $r^2 = x^2 + y^2$. It produces results similar to those shown in Fig. 26.11 for the DOG. One however encounters a problem with all the linear models stated above when examining the Chevreul illusion (Fig. 26.12A). This stimulus, which represents a step staircase luminance distribution, is
FIGURE 26.11 (A) A 2D trapezoidal waveform in staircase form as in Fig. 26.1. (B) The linear DOG model reproduces overshoots and undershoots (the thicker dotted line) indicating the bright and dark Mach bands visible at the top and the bottom of each ramp (the thinner solid line) as experimentally reported.
sometimes loosely referred to as the Mach band illusion, especially in many classical image processing books (e.g. [22]). However, it should not be referred to in this way, for the following reason. If we look at the result of applying any of the previously mentioned linear models to this illusion, as shown in Fig. 26.12, we find that the results predict the strong existence of Mach bands at the step transitions, possibly explaining why image processing books, which are more concerned with edge-detecting filters like LOG or DOG, refer to this stimulus as the Mach band illusion. However, psychophysically, it is now well established that such overshoots and undershoots at step transitions of intensity contradict the visual percept. For example, the experiments performed by Ratliff et al. [44] showed that steps inhibit Mach bands. Hence, despite the success of the lateral inhibition based theories, a grave shadow of doubt is cast on these linear models as explanations of the underlying mechanism of Mach band formation.
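The difficulty described above can be reproduced with a small one-dimensional sketch. The Python fragment below applies a DOG kernel to a single luminance step; it is an illustration under stated assumptions, not the simulation behind Fig. 26.12. The values of `sigma1`, `sigma2`, the weights `a1` and `a2`, the unit-sum normalization of each Gaussian, the truncation at three surround widths and the edge-replicating boundary handling are all choices made here for the example.

```python
import math


def unit_gaussian(sigma, radius):
    """Sampled Gaussian on [-radius, radius], normalized to unit sum."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def dog_kernel(sigma1, sigma2, a1=1.0, a2=0.5):
    """1-D DOG: a1 * center Gaussian minus a2 * surround Gaussian."""
    radius = int(3 * sigma2)
    center = unit_gaussian(sigma1, radius)
    surround = unit_gaussian(sigma2, radius)
    return [a1 * c - a2 * s for c, s in zip(center, surround)]


def filter_signal(signal, kernel):
    """Linear filtering with a symmetric kernel; edges are replicated."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [sum(w * signal[min(max(i + j - radius, 0), last)]
                for j, w in enumerate(kernel))
            for i in range(len(signal))]


step_stimulus = [0.2] * 60 + [0.8] * 60   # a single luminance step
response = filter_signal(step_stimulus, dog_kernel(sigma1=1.5, sigma2=6.0))

low_plateau, high_plateau = response[0], response[-1]
print("overshoot above high plateau:", round(max(response) - high_plateau, 3))
print("undershoot below low plateau:", round(low_plateau - min(response), 3))
```

With these assumed parameters the filtered step shows a clear overshoot on the bright side and an undershoot on the dark side, i.e. the linear model predicts bands at a step, which is the prediction that the psychophysical results contradict.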
26.9.2 NECESSITY OF DISCONTINUITY We now show that a sharp variation of intensity gradients is essential for the generation of Mach bands. If the intensity plateaus are interpolated by a cubic equation, without any discontinuities in the slope, then no bands can be seen as unambiguously as before (Figs. 26.2F, 26.8A). This has been explicitly demonstrated in Fig. 26.13A and more robustly with the help of Fig. 26.13B. In the latter, the upper half of the stimulus shows clearly visible Mach bands, with the intensity plateaus being linearly interpolated, resulting in a slope discontinuity at each transition, while the lower half is devoid of any such discontinuity because of cubic interpolation and consequently of the Mach bands as well. Based on this approach, one may conclude that clearly determinable horizontal gradients on either side of the discontinuities, which occur only for non-smooth linear interpolations, may be a crucial factor for the formation of the Mach bands. By studying the effect of varying this horizontal gradient region, we present below some interesting results on the scaling properties of the width of Mach bands.
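The difference between the two interpolations can be made concrete by comparing how sharply the slope changes in each stimulus. The sketch below is only an illustration of the stimuli, not a model of perception, and it rests on an assumption: the cubic is taken to be $3t^2 - 2t^3$, a standard cubic with zero slope at both ends, since the exact cubic used for Fig. 26.13 is not specified in the text. The plateau levels and sample counts are likewise arbitrary.

```python
def interpolated_profile(n_plateau, n_ramp, shape, low=0.2, high=0.8):
    """Two plateaus joined over n_ramp samples by the given shape function."""
    profile = [low] * n_plateau
    for i in range(1, n_ramp):
        t = i / n_ramp
        profile.append(low + (high - low) * shape(t))
    return profile + [high] * n_plateau


def linear(t):
    return t


def smooth_cubic(t):
    # Assumed cubic: slope is zero at t = 0 and t = 1, so the slope is continuous.
    return 3.0 * t * t - 2.0 * t ** 3


def largest_slope_change(profile):
    """Largest magnitude of the discrete second difference."""
    return max(abs(profile[i - 1] - 2.0 * profile[i] + profile[i + 1])
               for i in range(1, len(profile) - 1))


linear_change = largest_slope_change(interpolated_profile(30, 40, linear))
cubic_change = largest_slope_change(interpolated_profile(30, 40, smooth_cubic))
print("linear interpolation:", round(linear_change, 5))
print("cubic interpolation: ", round(cubic_change, 5))
```

For the same plateau levels and transition width, the linearly interpolated stimulus concentrates its entire change of slope in a single sample at each knee, whereas the cubic spreads it over the whole transition, so its largest slope change is several times smaller.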
FIGURE 26.12 (A) The Chevreul illusion representing step staircase luminance distribution as shown by (B). When this stimulus is simulated by either Mach’s model or the DOG model, both predict the existence of Mach bands at the step edges, as shown in (C) and (D). Such formation of Mach bands at step edges contradicts the experimental results in psychophysics (e.g. [44]). The predicted strong existence of Mach bands at the step transitions is possibly the reason why image processing books, which are more concerned with edge-detecting filters like LOG or DOG, refer to this stimulus as the Mach band illusion. However, psychophysically, it is now well established that such overshoots and undershoots at step transitions of intensity contradict the visual percept. For example, the experiments performed by Ratliff et al. [44] showed that steps inhibit Mach bands. Hence, despite the success of the lateral inhibition based theories, a grave shadow of doubt is cast on these linear models as explanations of the underlying mechanism of Mach band formation.
26.9.3 SCALING PROPERTIES OF MACH BANDS The images in Fig. 26.14, from top to bottom, are horizontally scaled-up versions of a single image. Therefore the widths of the gradient regions of the sequence of Mach band images increase from
FIGURE 26.13 (A) The higher and lower intensity plateaus have been interpolated by a cubic equation such that there are no discontinuities of slope. No intensity peaks or troughs can now be seen as unambiguously as in Fig. 26.8A. (B) A stimulus having two distinct halves, an upper and a lower one, where the upper half shows clearly visible Mach bands, the intensity plateaus being linearly interpolated as in Fig. 26.9A. The lower half on the other hand has been interpolated by a cubic equation as in (A) and fails to show any formation of Mach band. The red (short lines in print version) marks have been used to clearly demonstrate the observations.
top to bottom. The widths of the corresponding Mach bands also appear to increase. It would be interesting to know whether the Mach band width changes in proportion to the width of the gradient region. If so, that would imply that the perceptual processes generating this illusion work similarly at all length scales, from very small to very large. It can be easily seen from the consecutive images, and can also be verified by readers on their own monitors by creating these simple graphics, that as we move from the top image to the bottom image, the widths of the Mach bands seem to be scaled up proportionally in size, i.e. at smaller scales the bands are sharp and thin, while at larger scales the bands are wide and less prominent. This establishes that, under such conditions of varying gradient region, Mach bands of all length scales, small to big, can be produced. It is also to be noted in particular that at the smallest scale the Mach band is hardly visible since, with
FIGURE 26.14 The five figures from top to bottom have been horizontally scaled up or in other words the sizes of the horizontal gradient region are increasing from top to bottom. The red (short lines in print version) marks around the bright Mach bands in each appear to suggest that the widths of the perceived bands also increase from top to bottom.
the decrease in the width of the gradient region towards the top of Fig. 26.14, the band itself (for example the clearly visible and marked bright one) has been compressed into a very thin line. This demonstration therefore provides an approach towards explaining the long-standing paradox in visual perception as to why the Mach bands are weak or nonexistent at step changes of intensity [2]. To further elucidate this scaling property of Mach band widths, we consider another stimulus (Fig. 26.15), in which the region of gradient increases linearly in size as we move downwards within the same figure. Here one can clearly see that the band itself diverges outwards as we move from top to bottom. This divergence must be in proportion to the distance from the top, since the band flares out like rays of light emanating from the top. If the band did not expand in proportion to the scaling factor, we would have observed a curvature in the width of the Mach band as it flares outwards from the top. Thus it appears that the width of the Mach bands in this stimulus is proportional to the vertical distance from the top. Simple psychophysical experiments can be performed to demonstrate this scaling effect. A group of volunteers was shown a set of Mach band stimuli and asked to mark out the boundaries of the Mach band using a pair of cursors. When the average separation between the cursors is plotted against the width of the gradient region, we obtain a straight line, as shown in Fig. 26.16.
FIGURE 26.15 The region of gradient increases linearly in size from zero at the top to the widest at the bottom. The bright Mach band can be clearly seen as a slanting band radiating from the center at the top towards the south-east direction. The band is clearly thinner at the top and wider at the bottom, as has been marked by red (short lines in print version) in our experiments, as can also be tried out by the reader.
FIGURE 26.16 Graph showing the variation of Mach band width with respect to Gradient size width. The linear graph suggests a proportional relation.
These results straightaway expose the problems of all linear filters that may be used to explain brightness illusions. These problems may be stated in brief as follows:
FIGURE 26.17 By changing the region of horizontal intensity gradient, Mach bands of both very narrow (practically zero), as well as much broader types, can be formed, a phenomenon that cannot be explained by any linear model.
• Any physically realizable filter function must have a finite, non-zero width with typical length scales, e.g. the length scales of the DOG filter are characterized by σ1 and σ2.
• The response signals would therefore also have these same length scales.
• Consequently, very short (almost tending to zero) or very long responses, as we obtain here, cannot be obtained by linear filtering.
To demonstrate more vividly what we mean by this last statement, we take recourse to a separate figure (Fig. 26.17) derived from Fig. 26.14. Based on these observations, we can say that Mach bands can only be explained by scale-adaptive filters consisting of a very wide range of length scales. The effective length scale of the filter response would therefore be a function of the length scale of the input stimulus itself. The underlying neural circuitry should be compatible with that. This phenomenon therefore is unlikely to be explained by a purely bottom-up linear spatial filtering approach. It is easy to see from this discussion on Mach bands that not only the Mach band illusion but all the brightness perception illusions mentioned in this chapter are perceivable at various scales. Hence the concept of integrating scaling with the AVF model becomes imperative. It can therefore be an interesting task for the future to implement the AVF model at various scales, coarse to fine, by introducing a computational feedback (top–down) mechanism into the traditional Deep Convolutional Neural Networks (ConvNets) that are successfully used in Computer Vision [9]. The structure of the thalamus, especially the LGN, bears testimony to this. On one hand, it shows a number of anatomically distinct channels within each of the Parvo and Magno pathways [28], which may take care of different scales.
On the other hand, it is well known that the cortico-thalamic inputs, as a feedback to the thalamus from the cortex, far outnumber the thalamo-cortical output from the LGN to the brain [50], which makes the thalamus an ideal ground for selective information extraction.
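The length-scale argument made in this section, that a fixed linear filter imposes its own scale on the response, can be checked numerically. The sketch below filters luminance ramps of increasing width with one and the same DOG kernel and measures the width of the predicted bright band at half its height. It is a one-dimensional illustration only; the kernel parameters, the ramp widths, the edge-replicating boundary handling and the half-height width measure are assumptions introduced here.

```python
import math


def unit_gaussian(sigma, radius):
    """Sampled Gaussian on [-radius, radius], normalized to unit sum."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def dog_kernel(sigma1, sigma2, a1=1.0, a2=0.5):
    """1-D DOG with fixed length scales sigma1 (center) and sigma2 (surround)."""
    radius = int(3 * sigma2)
    center = unit_gaussian(sigma1, radius)
    surround = unit_gaussian(sigma2, radius)
    return [a1 * c - a2 * s for c, s in zip(center, surround)]


def filter_signal(signal, kernel):
    """Linear filtering with a symmetric kernel; edges are replicated."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [sum(w * signal[min(max(i + j - radius, 0), last)]
                for j, w in enumerate(kernel))
            for i in range(len(signal))]


def ramp_profile(n_plateau, n_ramp, low=0.2, high=0.8):
    """Low plateau, linear ramp of n_ramp samples, high plateau."""
    step = (high - low) / n_ramp
    rising = [low + step * i for i in range(1, n_ramp)]
    return [low] * n_plateau + rising + [high] * n_plateau


def bright_band_width(response):
    """Samples where the overshoot exceeds half its peak height."""
    plateau = response[-1]
    half_height = plateau + 0.5 * (max(response) - plateau)
    return sum(1 for value in response if value > half_height)


kernel = dog_kernel(sigma1=1.0, sigma2=3.0)
ramp_widths = [40, 80, 160]
band_widths = [bright_band_width(filter_signal(ramp_profile(30, w), kernel))
               for w in ramp_widths]
print(list(zip(ramp_widths, band_widths)))
```

Under these assumptions the predicted band width is the same for all three ramps: making the gradient region four times wider lowers the height of the overshoot but leaves its width fixed by σ1 and σ2. This is in contrast to the proportional relation between Mach band width and gradient width reported in Fig. 26.16.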
26.10 CONCLUSION The observations of the White effect, checkerboard illusion and other brightness-assimilation phenomena led many researchers to reject the very idea that brightness induction occurs as a result of a
mechanism of spatial filtering of the visual signals starting from low-level, and instead offer various other high-level qualitative explanations of brightness perception. However, on the basis of the comparative study between point by point brightness matching and their corresponding ODOG model output, Blakeslee & McCourt [4] persuasively argued against the necessity of invoking such higher-order processes to explain the brightness effect in these visual stimuli. Robinson et al. [45,46] later extended the ODOG model and proposed the Locally Normalized ODOG (LODOG) and the Frequency-specific LODOG (FLODOG) to further strengthen the basic spirit of this argument, though many problems still remained unaddressed. In this chapter we have described an Attentive Vision Filter (AVF) based model of brightness perception, a model of far lower complexity compared to the well-known ODOG model. This approach is based on the Extended Classical Receptive Field (ECRF) hypothesis previously proposed by Ghosh (2012), which consists of an added positive Gaussian term over the pre-existing Difference of Gaussians (DOG) model. The AVF model is further inspired by the fact that the human visual system combines the incoming signals arriving via the Parvocellular and Magnocellular pathways. In a similar fashion we also mix the outputs of two separate M and P spatial filters distinguished by different values of their defining parameters. Although the filters may be combined in numerous possible ways, we chose the simplest possible strategy, i.e., a linear combination of the two filters. The weight factor, alpha (α), used for the linear combination is termed the Factor of Attention (FOA) in this model. By analyzing various brightness-contrast as well as brightness-assimilation illusions we arrive at certain values of alpha, in order to successfully explain the brightness-contrast as well as the brightness-assimilation illusions. 
This adaptive variation of alpha is what introduces non-linearity into the AVF model. Therefore the AVF model is a non-linear model. In this chapter our purpose was only to show that AVF has the potential to explain all varieties of brightness illusions. So we used the very simple strategy of one global value of alpha for the whole stimulus. There can be far more complex strategies for choosing alpha. For example, one strategy could be to automatically compute the value of alpha depending upon the input stimulus. We could also have used locally adaptive values of alpha depending upon the local neighborhood. The AVF model, unlike the ODOG model, can also successfully explain the checkerboard illusion. Furthermore, in order to determine the relative weights of the M channel to the P channel in the AVF model, electrophysiological or fMRI based experiments may be designed. Last but not least, the model provides us with insight into the possible mechanism of attention in play during the course of brightness perception through the two major complementary channels in the central visual pathway. Through our considerations on the scaling properties of Mach bands we have shown the necessity of non-linearity in a brightness model, preferably an adaptive non-linearity achieved through feedback, i.e. the filter coefficients themselves being a function of the input stimulus. This, as discussed before, is required to produce the scale-invariant illusory effect. To produce such an effect, it is necessary to first analyze the stimulus at various length scales to detect its inherent length scales. This information then has to be fed back into the neural circuitry generating the illusory effect, so that illusions may be generated at the correct length scale. The above process of analyzing the stimulus at various length scales, from very small to very large, can only be done by neural circuits very high up in the visual pathway, i.e. in the visual cortex. 
This information must then be fed back into the LGN, which combines the M and P channels. Both these effects can be achieved by feedback axons from the cortex into the LGN, which have actually been observed to exist within the brain. In the future we could adopt more complex strategies for the choice of alpha, which could also be spatially varying,
in order to find which feedback signals can produce scale-invariant illusions. The integration of the AVF model with scaling may also be implemented in Computer Vision in the coming days by incorporating a feedback mechanism into the traditional Deep Convolutional Neural Networks.
REFERENCES
[1] M. Bar, A cortical mechanism for triggering top–down facilitation in visual object recognition, J. Cogn. Neurosci. 15 (4) (2003) 600–609.
[2] A. Bakshi, K. Ghosh, Some insights into why the perception of Mach bands is strong for luminance ramps and weak or vanishing for luminance steps, Perception 41 (11) (2012) 1403–1408.
[3] B. Blakeslee, M.E. McCourt, Similar mechanisms underlie simultaneous brightness contrast and grating induction, Vis. Res. 37 (20) (1997) 2849–2869.
[4] B. Blakeslee, M.E. McCourt, A multiscale spatial filtering account of the White effect, simultaneous brightness contrast and grating induction, Vis. Res. 39 (1999) 4361–4377.
[5] B. Blakeslee, W. Pasieka, M.E. McCourt, Oriented multiscale spatial filtering and contrast normalization: a parsimonious model of brightness induction in a continuum of stimuli including White, Howe and simultaneous brightness contrast, Vis. Res. 45 (5) (2005) 607–615.
[6] B. Blakeslee, M.E. McCourt, A unified theory of brightness contrast and assimilation incorporating oriented multiscale spatial filtering and contrast normalization, Vis. Res. 44 (2004) 2483–2503.
[7] J. Bullier, Integrated model of visual processing, Brain Res. Rev. 36 (2) (2001) 96–107.
[8] M. Carrasco, S. Ling, S. Read, Attention alters appearance, Nat. Neurosci. 7 (3) (2004) 308–313.
[9] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, T.S. Huang, Look and think twice: capturing top–down visual attention with feedback convolutional neural networks, in: 2015 IEEE International Conference on Computer Vision, ICCV, Dec. 7–13, 2015, pp. 2956–2964.
[10] L.J. Croner, E. Kaplan, Receptive fields of P and M ganglion cells across the primate retina, Vis. Res. 35 (1) (1995) 7–24.
[11] F.M. De Monasterio, P. Gouras, Functional properties of ganglion cells of the rhesus monkey retina, J. Physiol. 251 (1) (1975) 167.
[12] R.L. De Valois, K.K. De Valois, Spatial Vision, Oxford Psychology Series, vol. 14, Oxford University Press, New York, 1988.
[13] V.P. Ferrera, T.A. Nealey, J.R.H. Maunsell, Mixed parvocellular and magnocellular geniculate signals in visual area V4, Nature 358 (1992) 756–758.
[14] A. Fiorentini, G. Baumgartner, S. Magnussen, P. Schiller, G. Thomas, The perception of brightness and darkness: relation to neuronal receptive fields, in: L. Spillmann, J. Werner (Eds.), Visual Perception: The Neurophysiological Foundations, Academic Press, 1990, pp. 129–161.
[15] J.M. Foley, M.E. McCourt, Visual grating induction, J. Opt. Soc. Amer. A 2 (7) (1985) 1220–1230.
[16] L. Fomm, The wavelength of Roentgen-rays, Ann. Phys. 59 (1896) 350–353.
[17] K. Ghosh, S. Sarkar, K. Bhaumik, A possible mechanism of zero-crossing detection using the concept of the extended classical receptive field of retinal ganglion cells, Biol. Cybernet. 93 (1) (2005) 1–5.
[18] K. Ghosh, S. Sarkar, K. Bhaumik, Low-level brightness-contrast illusions and non classical receptive field of mammalian retina, in: Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing, IEEE, 2005, pp. 529–534.
[19] K. Ghosh, S. Sarkar, K. Bhaumik, A possible explanation of the low-level brightness–contrast illusions in the light of an extended classical receptive field model of retinal ganglion cells, Biol. Cybernet. 94 (2) (2006) 89–96.
[20] K. Ghosh, S.K. Pal, Some insights into brightness perception of images in the light of a new computational model of figure–ground segregation, IEEE Trans. Sys. Man Cybern. A 40 (2010) 758–766.
[21] K. Ghosh, A possible role and basis of visual pathway selection in brightness induction, Seeing Perceiving 25 (2012) 179–212.
[22] R.C. Gonzalez, R.E. Woods, Digital Image Processing, second edition, Pearson Education, 2003, Third Indian Reprint.
[23] S. Grossberg, D. Todorovic, Neural dynamics of 1-D and 2-D brightness perception, Percept. Psychophys. 43 (1988) 241–277.
[24] S. Grossberg, E. Mingolla, W.D. Ross, Visual brain and visual perception: how does the cortex do perceptual grouping?, Trends Neurosci. 20 (3) (1997) 106–111.
[25] E.G. Heinemann, Simultaneous brightness induction as a function of inducing-and test-field luminances, J. Exp. Psychol. 50 (2) (1955) 89.
[26] L. Hermann, Eine Erscheinung des simultanen Contrastes, Pflügers Archiv. Gesamte Physiol. 3 (1870) 13–15.
[27] J.M. Hupe, A.C. James, B.R. Payne, S.G. Lomber, P. Girard, J. Bullier, Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons, Nature 394 (6695) (1998) 784–787.
[28] E.R. Kandel, J.H. Schwartz, T.M. Jessell (Eds.), Principles of Neural Science, McGraw–Hill, New York, 2000.
[29] E. Kaplan, R.M. Shapley, X and Y cells in the lateral geniculate nucleus of macaque monkeys, J. Physiol. 330 (1) (1982) 125–143.
[30] K. Kveraga, J. Boshyan, M. Bar, Magnocellular projections as the trigger of top–down facilitation in recognition, J. Neurosci. 27 (48) (2007) 13232–13240.
[31] F. Kingdom, B. Moulden, A multi-channel approach to brightness coding, Vis. Res. 32 (8) (1992) 1565–1582.
[32] F.A. Kingdom, Lightness, brightness and transparency: a quarter century of new ideas, captivating demonstrations and unrelenting controversy, Vis. Res. 51 (7) (2011) 652–673.
[33] Q. Lv, B. Wang, L. Zhang, Saliency computation via whitened frequency band selection, Cognit. Neurodyn. 10 (3) (2016) 255–267.
[34] D. Marr, Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information, W. H. Freeman and Company, New York, 1982.
[35] J.H. Maunsell, T.A. Nealey, D.D. DePriest, Magnocellular and parvocellular contributions to responses in the middle temporal visual area (MT) of the macaque monkey, J. Neurosci. 10 (10) (1990) 3323–3334.
[36] K. McAlonan, J. Cavanaugh, R.H. Wurtz, Guarding the gateway to cortex with attention in visual thalamus, Nature 456 (7220) (2008) 391–394.
[37] M.E. McCourt, A spatial frequency dependent grating-induction effect, Vis. Res. 22 (1982) 119–134.
[38] W.H. Merigan, J.H. Maunsell, How parallel are the primate visual pathways?, Annu. Rev. Neurosci. 16 (1) (1993) 369–402.
[39] M.C. Morrone, J. Ross, D.C. Burr, R. Owens, Mach bands are phase dependent, Nature 324 (6094) (1986) 250–253.
[40] L.G. Nowak, M.H.J. Munk, P. Girard, J. Bullier, Visual latencies in areas V1 and V2 of the macaque monkey, Vis. Neurosci. 12 (02) (1995) 371–384.
[41] L. Pessoa, Mach-band attenuation by adjacent stimuli: experiment and filling-in simulations, Perception 25 (4) (1996) 425–442.
[42] F. Ratliff, Mach Bands: Quantitative Studies on Neural Networks, Holden-Day, San Francisco, London, Amsterdam, 1965.
[43] F. Ratliff, H.K. Hartline, The responses of limulus optic nerve fibers to patterns of illumination on the receptor mosaic, J. Gen. Physiol. 42 (6) (1959) 1241–1255.
[44] F. Ratliff, N. Milkman, N. Rennert, Attenuation of Mach bands by adjacent stimuli, Proc. Natl. Acad. Sci. 80 (14) (1983) 4554–4558.
[45] A.E. Robinson, P.S. Hammon, V.R. de Sa, Explaining brightness illusions using spatial filtering and local response normalization, Vis. Res. 47 (12) (2007) 1631–1644.
[46] A.E. Robinson, P.S. Hammon, V.R. de Sa, A filtering model of brightness perception using Frequency-specific Locally-normalized Oriented Difference-of-Gaussians (FLODOG), J. Vis. 7 (9) (2007) 237.
[47] R.W. Rodieck, J. Stone, Analysis of receptive fields of cat retinal ganglion cells, J. Neurophysiol. 28 (5) (1965) 833–849.
[48] P.H. Schiller, J.G. Malpeli, Functional specificity of lateral geniculate nucleus laminae of the rhesus monkey, J. Neurophysiol. 41 (3) (1978) 788–797.
[49] R. Shapley, V.H. Perry, Cat and monkey retinal ganglion cells and their visual functional roles, Trends Neurosci. 9 (1986) 229–235.
[50] S.M. Sherman, R.W. Guillery, Exploring the Thalamus and Its Role in Cortical Function, 2nd edition, The MIT Press, 2006.
[51] T. Shou, W. Wang, H. Yu, Orientation biased extended surround of the receptive field of cat retinal ganglion cells, Neuroscience 98 (2000) 207–212.
[52] L.C.L. Silveira, V.H. Perry, The topography of magnocellular projecting ganglion cells (M-ganglion cells) in the primate retina, Neuroscience 40 (1) (1991) 217–237.
[53] S.G. Solomon, A.J. White, P.R. Martin, Extraclassical receptive field properties of parvocellular, magnocellular, and koniocellular cells in the primate lateral geniculate nucleus, J. Neurosci. 22 (1) (2002) 338–349.
[54] M. White, A new effect of pattern on perceived lightness, Perception 8 (4) (1979) 413–416.
[55] C.H. Wind, Zur Demonstration einer von E. Mach entdeckten optischen Täuschung, Phys. Z. 1 (1899) 112–113.
REFERENCES
513
[56] X. Xu, J.M. Ichida, J.D. Allison, J.D. Boyd, A.B. Bonds, V.A. Casagrande, A comparison of koniocellular, magnocellular and parvocellular receptive field properties in the lateral geniculate nucleus of the owl monkey (Aotus trivirgatus), J. Physiol. 531 (1) (2001) 203–218. [57] R. Young, The Gaussian derivative model for spatial vision: I. retinal mechanisms, Spat. Vis. 2 (4) (1987) 273–293.
This page intentionally left blank
CHAPTER 27
SUPPORT VECTOR MACHINE: PRINCIPLES, PARAMETERS, AND APPLICATIONS
Raoof Gholami∗, Nikoo Fakhari†
∗Curtin University, Miri, Malaysia
†Shahrood University of Technology, Shahrood, Iran
27.1 INTRODUCTION
Developing machines capable of learning and gaining experience from what they are exposed to was a topic of discussion for many years. These machines finally became feasible with the rise of powerful computers able to process large amounts of data. Later, it was shown that these machines can overcome many of the shortcomings of classic mathematical and statistical approaches [14]. The Artificial Neural Network (ANN) was one of the first learning machines, developed in the 1940s based on the biological neuron system of the human brain. It found its applications later, in the 1980s, and has been used for many engineering applications ever since, mainly because of its capability to extract complex and non-linear relationships between the features of different systems [9,23,1]. However, it was later shown that the ANN can only give reliable results when a very large number of data are available for training. It had a very poor generalization ability on many occasions, and a locally optimal solution was often returned rather than the global best answer [21]. Because of these shortcomings, a new machine learning technique, the so-called Support Vector Machine (SVM), was developed in the early 1990s as a non-linear solution for classification and regression tasks [35]. There have been at least three reasons behind the success of the SVM in providing reliable results: its ability to learn well with only a very small number of features, its robustness against the error of models, and its computational efficiency compared to other machine learning methods such as neural networks [30,36]. The SVM is generally divided into two categories, Support Vector Classification (SVC) and Support Vector Regression (SVR), of which the SVR is the one that has gained attention in many petroleum and mining engineering projects.
In this chapter, after a short introduction to what is known as generalization, the principles and concepts behind the development of SVMs are presented in detail. This is followed by case studies related to petroleum and mining engineering tasks where these machines were successfully applied.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00027-2 Copyright © 2017 Elsevier Inc. All rights reserved.
FIGURE 27.1 A regression analysis in which the complexity of the model fitted to the data is controlled (dashed blue (dark grey in print versions) curve) and is not controlled (dashed green (light grey in print versions) line).
27.2 GENERALIZATION
During the training of machine learning techniques, it is often observed that the error function can be decreased significantly by fitting an increasingly complicated function to the training data. This, however, can result in what is known as overfitting [21]. Overfitting refers to the situation where ANNs or any other machines are trained with a large number of data and a very complicated function is selected to reduce the empirical risk, i.e., the mean of the losses between the estimated and desired outputs computed over all the training pairs {x, y} (e.g., the green (light grey in print version) model in Fig. 27.1). As a result, a very promising result is often obtained at the training stage, but a poor estimation is achieved by the machine at the testing stage [30]. Overfitting is one of the problems most commonly observed and reported in applications of ANNs [21]. One of the simplest approaches for resolving the issue of overfitting is to reduce the complexity of the model used to explain the data [2]. Under these circumstances, the simplest possible function (e.g., the dashed blue (dark grey in print versions) line in Fig. 27.1) which can satisfactorily explain the data must be selected. Although this simple function may not reduce the empirical error at the training stage as much as a complicated function does, it performs much better on unseen data at the testing stage [34]. In problems of this kind, in which a reduction in the complexity of the chosen model is required, regularization of the selected function is very helpful. This idea follows from the Vapnik–Chervonenkis theory, which was the basis for the development of SVMs [19].
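The difference between the empirical risk and the error on unseen data can be made concrete with a small numerical experiment. The following is a minimal sketch on synthetic data invented for this illustration (it is not taken from Fig. 27.1): a very flexible model that simply memorizes the training pairs is compared with a least-squares straight line. The function names and the data are assumptions of this sketch.

```python
# Toy illustration of overfitting: a model that memorizes the training pairs
# versus a least-squares straight line. All data here are synthetic.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def memorizer(xs, ys):
    """Very flexible model: return the target of the nearest training input."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def empirical_risk(predict, xs, ys):
    """Mean squared loss over the pairs {x, y}."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: the trend y = 2x plus an alternating disturbance of +/-0.5.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
# Unseen data: undisturbed points located between the training inputs.
test_x = [i + 0.5 for i in range(9)]
test_y = [2.0 * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)
line = lambda x: slope * x + intercept
memo = memorizer(train_x, train_y)

risks = {
    "line": (empirical_risk(line, train_x, train_y),
             empirical_risk(line, test_x, test_y)),
    "memorizer": (empirical_risk(memo, train_x, train_y),
                  empirical_risk(memo, test_x, test_y)),
}
for name, (train_risk, test_risk) in risks.items():
    print(f"{name:10s} train risk = {train_risk:.4f}   test risk = {test_risk:.4f}")
```

In this toy setting the memorizing model reaches zero empirical risk on the training pairs but a larger error on the unseen points than the straight line, which is the behaviour described above as overfitting.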
FIGURE 27.2 Margins of the classes and the hyperplane used to classify data of two classes. The support vectors (bold data points) are used to obtain the maximum margin from each class of data.
27.3 SUPPORT VECTOR CLASSIFICATION
27.3.1 LINEARLY SEPARABLE CASE (HARD MARGIN)
A Support Vector Machine used for classification is called an SVC and has been successfully used in many applications concerning the separation of data into two or several classes (e.g., [5,1]). The aim of the SVC is to find a classification criterion (i.e., a decision function) which can properly separate unseen data with a good generalization ability at the testing stage. For a two-class problem, this criterion can be a straight line with a maximum distance (margin) from the data of each class. This linear classifier is also known as the optimal hyperplane in SVC related discussions [2]. For a set of training data $x_i$ ($i = 1, 2, 3, \ldots, N$), this linear hyperplane (see Fig. 27.2) is defined as:
\[ w^T x + b = 0, \tag{27.1} \]
where $w$ is an n-dimensional vector and $b$ is a bias term. This hyperplane must have two specific properties: (1) it must have the least possible error in the separation of the data, and (2) its distance from the closest data of each class must be maximal [34]. Under these circumstances, the data of each class can only lie on the left ($y = 1$) or on the right ($y = -1$) side of the hyperplane. Two margins can, therefore, be defined (see Fig. 27.2) to control the separability of the data as:
\[ w^T x_i + b \; \begin{cases} \ge 1 & \text{for } y_i = 1 \\ \le -1 & \text{for } y_i = -1 \end{cases} \tag{27.2} \]
However, the generalization region for the hyperplane can be anywhere between 1 and −1, and there are many margins which can be considered as the boundary of each class. Hence, to find the best hyperplane, the distance (d) between the margins should be measured and maximized using the following equation:
\[ d(w, b; x) = \frac{\left| (w^T x + b - 1) - (w^T x + b + 1) \right|}{\|w\|} = \frac{2}{\|w\|} \tag{27.3} \]
Thus, maximizing the margin is equivalent to minimizing the norm of the weight vector $w$, which can also be written as minimizing $\frac{1}{2} w^T w$ [19]. The general convex problem to determine the optimal hyperplane is then stated as:
\[ \min_{w,b} \; \frac{1}{2} w^T w \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \tag{27.4} \]
which is subject to the constraint of the margins of the two classes. Lagrange multipliers ($\alpha_i \ge 0$) can be used to enforce the constraint as below:
\[ L_p(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 \right] \tag{27.5} \]
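To find the stationary (saddle) point of the above equation, the following Karush–Kuhn–Tucker (KKT) conditions must be satisfied:
\[ \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w_0 = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{27.6} \]
\[ \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{27.7} \]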
It should be noted that $\alpha_i$ is nonzero if and only if the corresponding input $x_i$ is a support vector [2]. Support vectors are the data chosen as the boundary of each class, through which the margin of the class can be found (i.e., the bold data points in Fig. 27.2). Finally, substituting Eq. (27.6) and Eq. (27.7) into Eq. (27.5) gives the general (dual) equation of the SVC for a linearly separable case, which is subject to two constraints as below [36]:
\[ \max_{\alpha} \; L_d(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j x_i^T x_j \quad \text{s.t.} \quad \begin{cases} \alpha_i \ge 0 \\ \sum_{i=1}^{N} \alpha_i y_i = 0 \end{cases} \tag{27.8} \]
FIGURE 27.3 Linearly non-separable data with the slack variables defined to minimize the error of misclassification.

The above equation and its constraints are used by the SVC to find the nonzero Lagrange multipliers and their corresponding input data, i.e., the support vectors. The parameter $w$ of the hyperplane (decision function) can then be obtained from Eq. (27.6), and the bias parameter $b$ can be calculated from Eq. (27.9), written in the average form as:
\[ b_0 = \frac{1}{N_{SV}} \sum_{S=1}^{N_{SV}} \left( y_S - w^T x_S \right), \tag{27.9} \]
where the sum runs over the $N_{SV}$ support vectors $x_S$.
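How Eqs. (27.3), (27.6), (27.7), and (27.9) fit together can be checked on a problem small enough to solve by hand. The sketch below uses two toy points, one per class; the data and the multiplier values are assumptions of this illustration (the multipliers were obtained analytically for this toy set, whereas a real application would obtain them from a quadratic programming solver).

```python
import math

# Two training points, one per class (toy data chosen for illustration).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
# Multipliers maximizing the dual (27.8) for this toy set: with alpha_1 =
# alpha_2 = a the dual reads 2a - 4a^2, which peaks at a = 0.25.
alpha = [0.25, 0.25]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Eq. (27.7): the multipliers must balance across the two classes.
balance = sum(a * yi for a, yi in zip(alpha, y))

# Eq. (27.6): w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# Support vectors are the points with nonzero multipliers.
sv = [i for i, a in enumerate(alpha) if a > 0]

# Eq. (27.9): bias averaged over the support vectors.
b = sum(y[i] - dot(w, X[i]) for i in sv) / len(sv)

# Eq. (27.3): distance between the two margins.
margin = 2.0 / math.sqrt(dot(w, w))

# The primal objective (27.4) and the dual objective (27.8) coincide at the optimum.
primal = 0.5 * dot(w, w)
dual = sum(alpha) - 0.5 * sum(
    y[i] * y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
    for i in range(2) for j in range(2)
)

print("w =", w, " b =", b)
print("margin =", round(margin, 4))
print("primal =", primal, " dual =", dual)
```

For this toy set the margin equals the distance between the two points, as expected when both of them are support vectors.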
27.3.2 LINEARLY NON-SEPARABLE CASE (SOFT MARGIN)
There are always cases where the data are not linearly separable because of the similarity of a few features in the database. However, a linear SVM may still be able to provide a good solution, provided that a penalty function is defined such that the distance ($\xi_i$) between the misclassified data of each class and the margin of that class is measured and minimized (see Fig. 27.3). The penalty function in these cases can then be defined as [2]:
\[ F(\xi) = \sum_{i=1}^{N} \xi_i \tag{27.10} \]
As a result, the convex optimization problem presented earlier as Eq. (27.4) gains a new term based on the penalty function for the linearly non-separable case:
\[ \min_{w,b,\xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{27.11} \]
The parameter C in the above equation is known as the "trade-off" parameter, which balances maximizing the margin against minimizing the classification error [19].
To solve the above optimization problem, which is subject to the constraint of the margins, Lagrange multipliers $\alpha$ ($\alpha_i \ge 0$) and $\beta$ ($\beta_i \ge 0$) are used. The unconstrained form of Eq. (27.11) can then be obtained as [30]:
\[ L_p(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^T x_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{N} \beta_i \xi_i \tag{27.12} \]
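The optimal solution for Eq. (27.12) can be obtained by satisfying the following KKT conditions:
\[ \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{27.13} \]
\[ \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{27.14} \]
\[ \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \beta_i = C \tag{27.15} \]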
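Substituting Eqs. (27.13)–(27.15) into Eq. (27.12), the following dual problem for a soft margin SVC is formulated [30]:
\[ \max_{\alpha} \; L_d(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j x_i^T x_j \quad \text{s.t.} \quad \begin{cases} 0 \le \alpha_i \le C \\ \sum_{i=1}^{N} \alpha_i y_i = 0 \end{cases} \tag{27.16} \]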
The difference between Eq. (27.16) and the one presented earlier as Eq. (27.8) is the constraint posed on the Lagrange multiplier α, forcing it to be either equal to or less than the trade-off parameter C.
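The role of the slack variables and of the trade-off parameter C in Eq. (27.11) can be illustrated numerically. The sketch below evaluates the slacks and the soft margin objective for a fixed candidate hyperplane; the one-dimensional data and the candidate (w, b) are hypothetical values chosen for this illustration, and no optimization is performed.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slack_variables(w, b, X, y):
    """xi_i = max(0, 1 - y_i (w^T x_i + b)); zero for points on or outside their margin."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """Value of the objective in Eq. (27.11) for a candidate hyperplane (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slack_variables(w, b, X, y))

# Toy one-dimensional data; the last point of class -1 lies on the wrong side.
X = [(2.0,), (3.0,), (-2.0,), (-3.0,), (0.5,)]
y = [1, 1, -1, -1, -1]
w, b = [1.0], 0.0  # hypothetical candidate hyperplane x = 0

xi = slack_variables(w, b, X, y)
print("slacks:", xi)
for C in (0.1, 1.0, 10.0):
    print(f"C = {C:5.1f}  objective = {soft_margin_objective(w, b, X, y, C):.2f}")
```

Only the point that violates its margin receives a nonzero slack. A larger C makes that single violation dominate the objective, whereas a smaller C lets the margin term dominate.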
27.3.3 NON-LINEAR CASE (KERNEL MACHINE)
27.3.3.1 Feature Space
The aim of using an optimal hyperplane is to enhance the generalization ability of the machine. However, if the data are not linearly separable, the machine will not have a good generalization ability, even though the optimal hyperplane might be found. To resolve this issue, the input data are mapped onto a higher dimensional dot product space, also known as the feature or Hilbert space [31]. With this mapping, the data remain nonlinear in the input space, while a linear SVC can be created in the feature space to separate them. Fig. 27.4 shows how the feature space can be used to separate data in a higher dimension.

FIGURE 27.4 Mapping input data onto the feature space for a better generalization.

According to Mercer, the input data $x$ is represented by $\varphi(x)$ in the feature space, while the functional form of this mapping is unknown. Fortunately, it is not necessary to represent the input data explicitly in the feature space; only the calculation of their inner product is required [34]. This inner product can be calculated by selecting a suitable kernel function as written below:
\[ \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j) \tag{27.17} \]
This makes it possible to apply the SVC to nonlinear engineering problems. It should, however, be noted that Eq. (27.17) is valid if and only if, for any square-integrable function $g(x)$, the following condition is satisfied [30]:
\[ \iint K(x_i, x_j) \, g(x_i) \, g(x_j) \, dx_i \, dx_j \ge 0 \tag{27.18} \]
Table 27.1 lists a few of the kernel functions that have been used successfully in nonlinear SVCs.

Table 27.1 Different Kernel Functions Often Used for Nonlinear Data Classification Using an SVC
Kernel Function                                                                  Type of Classifier
$K(x_i, x_j) = (x_i^T x_j)^{\rho}$                                               Linear (for $\rho = 1$)
$K(x_i, x_j) = (x_i^T x_j + 1)^{\rho}$                                           Complete polynomial of degree $\rho$
$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + \mu)$                                    Multilayer perceptron
$K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$                               Gaussian RBF
$K(x_i, x_j) = \dfrac{\sin((n + 1/2)(x_i - x_j))}{2 \sin((x_i - x_j)/2)}$        Dirichlet
$K(x_i, x_j) = \tanh(\alpha (x_i \cdot x_j) + \vartheta)$                        Sigmoid
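The statement that only the inner product in the feature space is needed can be verified for a kernel whose mapping happens to be known. The sketch below implements two of the kernels of Table 27.1 and compares the complete polynomial kernel of degree 2 with the inner product of an explicit six-dimensional mapping for two-dimensional inputs. It also evaluates a discrete analogue of the condition in Eq. (27.18) for the Gaussian RBF kernel. The function names, the explicit mapping, and the sample points are assumptions of this illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(xi, xj, rho=2):
    """Complete polynomial kernel of degree rho: (x_i^T x_j + 1)^rho."""
    return (dot(xi, xj) + 1.0) ** rho

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_poly2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def quadratic_form(kernel, points, c):
    """Discrete analogue of Eq. (27.18): sum_ij c_i c_j K(x_i, x_j)."""
    n = len(points)
    return sum(c[i] * c[j] * kernel(points[i], points[j])
               for i in range(n) for j in range(n))

xa, xb = (1.0, 2.0), (3.0, -1.0)
via_kernel = polynomial_kernel(xa, xb)            # (1*3 + 2*(-1) + 1)^2
via_mapping = dot(phi_poly2(xa), phi_poly2(xb))   # inner product in feature space
print(via_kernel, via_mapping)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
q = quadratic_form(rbf_kernel, points, [1.0, -2.0, 0.5])
print("quadratic form for the RBF kernel:", round(q, 4))
```

The two numbers printed first agree up to rounding, although the kernel evaluation never forms the six-dimensional vectors. For kernels such as the Gaussian RBF the corresponding mapping is not available in such a simple closed form, which is why working with K directly is convenient.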
27.3.3.2 Kernel Trick
Having considered the use of kernel functions to determine the inner product of the input data in the feature space, the general dual equation presented earlier for the classification of linearly non-separable data (Eq. (27.16)) can be rewritten for the nonlinear classification case as:
\[ \max_{\alpha} \; L_d(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} y_i y_j \alpha_i \alpha_j K(x_i, x_j) \quad \text{s.t.} \quad \begin{cases} 0 \le \alpha_i \le C \\ \sum_{i=1}^{N} \alpha_i y_i = 0 \end{cases} \tag{27.19} \]
Finding the optimal hyperplane in this case, however, is not a straightforward task, because the weighting vector $w$, which is written as Eq. (27.20) in the Hilbert space, cannot be evaluated directly when the mapping $\varphi$ is unknown.
\[ w = \sum_{i=1}^{N} y_i \alpha_i \varphi(x_i) \tag{27.20} \]
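Under these circumstances, the kernel trick can be used, through which the direct determination of the parameter $w$ is not required [2]. According to this concept, by substituting Eq. (27.20) into Eq. (27.9), the bias parameter $b$ can be obtained using a kernel function as:
\[ b = y_j - \sum_{i=1}^{N_{SV}} \alpha_i y_i K(x_i, x_j), \tag{27.21} \]
where $x_j$ is a support vector and $N_{SV}$ is the number of support vectors. The value can be averaged over all support vectors, as in Eq. (27.9).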
Knowing that the hyperplane is defined as
\[ d(x) = w^T \varphi(x) + b, \tag{27.22} \]
the optimal decision function is obtained by substituting Eq. (27.20) into Eq. (27.22) and considering a suitable kernel function [30]. The general equation of the hyperplane can, therefore, be stated as:
\[ d(x) = \sum_{i=1}^{N} y_i \alpha_i K(x, x_i) + b \tag{27.23} \]
Hence, determination of the weighting parameter w in the Hilbert space is no longer required and SVMs can efficiently be used for solving nonlinear problems by selecting an appropriate kernel function.
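The decision function of Eq. (27.23) can be sketched in a few lines once the multipliers are known. In the illustration below the two support vectors, the Gaussian RBF kernel width, and the multipliers are assumptions: for this symmetric two-point problem the dual of Eq. (27.19) can be maximized by hand, which gives equal multipliers and a zero bias (assuming C is large enough not to bind).

```python
import math

def rbf_kernel(xi, xj, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_function(x, support_vectors, labels, alphas, b, kernel):
    """Eq. (27.23): d(x) = sum_i y_i * alpha_i * K(x, x_i) + b."""
    return sum(yi * ai * kernel(x, xi)
               for xi, yi, ai in zip(support_vectors, labels, alphas)) + b

# Two support vectors, one per class (toy values for illustration).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]

# With alpha_1 = alpha_2 = a the dual reads 2a - a^2 (1 - K12),
# which is maximal at a = 1 / (1 - K12); the bias is zero by symmetry.
k12 = rbf_kernel(support_vectors[0], support_vectors[1])
alpha = 1.0 / (1.0 - k12)
alphas = [alpha, alpha]
b = 0.0

def classify(x):
    d = decision_function(x, support_vectors, labels, alphas, b, rbf_kernel)
    return 1 if d >= 0 else -1

for point in [(1.0, 1.0), (-1.0, -1.0), (2.0, 0.5), (-0.5, -2.0)]:
    d = decision_function(point, support_vectors, labels, alphas, b, rbf_kernel)
    print(point, round(d, 4), classify(point))
```

The weighting vector never appears in this sketch: a new point is classified from kernel evaluations against the support vectors alone.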
27.4 SUPPORT VECTOR REGRESSION
27.4.1 CLASSIC REGRESSION ANALYSIS
Regression analysis is one of the most widely used statistical tools for assessing the relationship between a dependent variable (Y) and the independent variables ($x_1, x_2, \ldots, x_n$) of a system. In this analysis, it is often attempted to find the best decision function which can satisfactorily explain the variation of the target parameter based on the input variables. This function, however, should have the minimum possible error of prediction. To minimize the empirical risk (error), the residual $\varepsilon_i$ is defined to measure the discrepancy between the real and estimated values. The sum of the squared $\varepsilon_i$ can then be minimized with, for instance, the classic least squares method to find the best function [14]. However, classic statistical approaches often fail to provide a very accurate estimation because they include all data points in the analysis, even those that are already very well explained by the model [34]. This can easily reduce the flexibility of the model when a few outliers are included in the input space.

FIGURE 27.5 Concept of ε-insensitivity in the linear data analysis. Only the samples out of the ±ε margin will have a nonzero slack variable.
27.4.2 LINEAR SUPPORT VECTOR REGRESSION
In Support Vector Regression (SVR), a linear decision function is defined as $f(x) = w^T x + b$, in which the input vector $x$ is used to estimate the scalar target Y using the n-dimensional weighting vector $w$ and the bias parameter $b$. However, the big difference between an SVR and a classic regression analysis is that, when the SVR is employed, the decision function is chosen such that only deviations larger than the insensitivity parameter ε are penalized (see Fig. 27.5). This means that the SVR ignores the error of the data confined within the ε margins and considers only the rest for finding the optimal hyperplane, with the help of slack variables ($\xi_i$, $\xi_i^*$) [14,26]. Hence, the regression analysis using the SVR has an objective function $L_p$ (Eq. (27.24)), whose aim is to find the optimal value of the weighting vector $w$ such that the empirical risk is minimized [26].
\[ L_p = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) \quad \text{s.t.} \quad \begin{cases} y_i - w^T x_i - b \le \xi_i + \varepsilon \\ -y_i + w^T x_i + b \le \xi_i^* + \varepsilon \\ \xi_i, \xi_i^* \ge 0 \end{cases} \tag{27.24} \]
The constraints in the above equation ensure that any error smaller than ε does not enter the objective function. This concept is known as the insensitivity theory of Vapnik [36]. As in the classification case, the optimization problem expressed as Eq. (27.24) can be solved by introducing Lagrange multipliers ($\alpha_i$, $\alpha_i^*$) and satisfying the KKT conditions:
\[ \frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) x_i \tag{27.25} \]
\[ \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) = 0 \tag{27.26} \]
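Substituting Eq. (27.25) and Eq. (27.26) into the Lagrangian of Eq. (27.24), the general (dual) equation of the linear SVR is formulated as [34]:
\[ L_d = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \alpha_i - \alpha_i^* \right) x_i^T x_j \left( \alpha_j - \alpha_j^* \right) + \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) y_i - \sum_{i=1}^{N} \left( \alpha_i + \alpha_i^* \right) \varepsilon \quad \text{s.t.} \quad 0 \le \alpha_i, \alpha_i^* \le C \tag{27.27} \]
Having found the nonzero Lagrange multipliers, the weighting parameter $w$ of the decision function is obtained from Eq. (27.25), while the bias parameter $b$ is determined from one of the equations below:
\[ \begin{aligned} -y_i + w^T x_i + b + \varepsilon &= 0 \quad \text{for } 0 < \alpha_i < C \\ y_i - w^T x_i - b + \varepsilon &= 0 \quad \text{for } 0 < \alpha_i^* < C \end{aligned} \tag{27.28} \]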
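The effect of the insensitivity parameter in Eq. (27.24) can be seen by computing the two slack variables for a fixed linear function. The sketch below is an illustration under stated assumptions: the function f(x) = 2x + 1, the four data points, and the values of C and ε are made up, and no optimization is carried out.

```python
def eps_insensitive_slacks(y_true, y_pred, eps):
    """Slack variables of Eq. (27.24): nonzero only outside the +/-eps tube."""
    xi = [max(0.0, yt - yp - eps) for yt, yp in zip(y_true, y_pred)]       # above the tube
    xi_star = [max(0.0, yp - yt - eps) for yt, yp in zip(y_true, y_pred)]  # below the tube
    return xi, xi_star

def svr_objective(w, y_true, y_pred, C, eps):
    """Value of L_p in Eq. (27.24) for a candidate weighting vector w."""
    xi, xi_star = eps_insensitive_slacks(y_true, y_pred, eps)
    return 0.5 * sum(wk * wk for wk in w) + C * (sum(xi) + sum(xi_star))

# Hypothetical linear decision function f(x) = 2x + 1 and toy targets.
w, b = [2.0], 1.0
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.25, 2.5, 6.0, 7.0]
preds = [w[0] * x + b for x in xs]

xi, xi_star = eps_insensitive_slacks(ys, preds, eps=0.5)
print("xi      =", xi)
print("xi_star =", xi_star)
print("L_p     =", svr_objective(w, ys, preds, C=10.0, eps=0.5))
```

Three of the four residuals fall within the ±ε tube and contribute nothing to the objective; only the third sample, whose residual is 1.0, receives a nonzero slack.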
27.4.3 NON-LINEAR SUPPORT VECTOR REGRESSION
The general idea behind the development of the SVR for nonlinear data analysis is the same as the one presented earlier for the linear case, in which an insensitive margin is defined to reduce the risk of predictions (see Fig. 27.6). However, the problem of mapping the data into the high-dimensional Hilbert space has to be resolved.
FIGURE 27.6 Concept of ε-insensitivity in the nonlinear data analysis. Only the samples out of the ±ε margin will have a nonzero slack variable.
As discussed earlier, the optimal value of the weighting vector $w$ for the regression task can be obtained from Eq. (27.25), which is written as Eq. (27.29) in the Hilbert space.
\[ w = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) \varphi(x_i) \tag{27.29} \]
However, the mapping $\varphi$ into the Hilbert space is unknown, as mentioned earlier, which makes it difficult to estimate the weighting vector. To resolve this issue, the kernel trick can be used once again, through which Eq. (27.29) is substituted into the decision function (i.e., $f(x) = w^T \varphi(x) + b$) to formulate the general equation of the SVR for nonlinear data analysis as [14]:
\[ f(x) = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) \varphi(x_i)^T \varphi(x) + b = \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) K(x_i, x) + b \tag{27.30} \]
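By using the kernel trick, the determination of the weighting vector is no longer required, and the bias parameter $b$ can be calculated, in line with Eq. (27.28), from a support vector $x_j$ with $0 < \alpha_j < C$ by using the following equation:
\[ b = y_j - \sum_{i=1}^{N_{SV}} \left( \alpha_i - \alpha_i^* \right) K(x_i, x_j) - \varepsilon \tag{27.31} \]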
27.5 STEP BY STEP WITH SVMS FOR CLASSIFICATION AND REGRESSION DATA ANALYSIS
Knowing the principles and equations of the SVM, the following steps can be followed to carry out a successful classification or regression task:
(i) Preparation of a pattern matrix: The pattern (feature) matrices required for classification and regression analysis are different. The one used for classification should contain a set of data divided into two or multiple classes of extracted features, with a label such as 1 or −1 indicating the class to which each sample belongs. For regression analysis, on the other hand, independent data (input parameters) and dependent data (target parameters) are usually presented in separate columns of the matrix. The data can then be further partitioned into training, testing, and validation portions.
(ii) Selection of a kernel function: This is perhaps the most important step. There are many kernel functions which might be applied, but their suitability depends, to a great extent, on the nature of the data (i.e., the degree of nonlinearity). The RBF (Gaussian) kernel function has been the most successful kernel according to the applications reported in the literature [5,23] and can be considered as the first choice.
(iii) Parameter selection: When SVMs are used, a number of parameters must be selected to obtain the best performance, including: (1) the parameters of the kernel function, (2) the trade-off parameter C, and (3) the ε-insensitivity parameter. Selecting these parameters, however, is not a straightforward task, as there are no mathematical equations or correlations that give an initial guess of their values. Values suggested by well-known software packages (e.g., Weka) can be used or modified under these circumstances, or a validation step can be set apart to determine the values of the parameters.
(iv) Execution of training algorithms: When the input and output data are defined at the training step, an SVM uses the general formulations presented earlier as Eq. (27.8), Eq. (27.16), Eq. (27.19), Eq. (27.27), or Eq. (27.30) to determine the Lagrange multipliers. The multipliers with nonzero values indicate which of the input data ($x_i$) are support vectors. The support vectors determine the margin of each class, through which the optimum hyperplane (decision function) can be chosen.
(v) Classification/prediction of unseen data: Once the Lagrange multipliers and their corresponding support vectors are determined, unseen data can be classified or estimated. A failure to obtain a reliable classification or prediction at this stage might be due to a poor feature extraction/selection, kernel selection, or parameter estimation at any of the steps explained above. In that case, the above steps may be repeated to reduce the error and enhance the accuracy of the results.
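The data preparation and parameter selection steps above can be sketched without any external library. The following is a minimal illustration, assuming synthetic data: it applies min-max normalization to an input column, builds k-fold training/validation splits, and selects a kernel parameter from a small grid by its cross-validated error. To keep the sketch dependency-free, a kernel-weighted average of the training targets is used as a lightweight stand-in for the trained machine; it is not an SVM, and in practice the stand-in would be replaced by a call to an SVM training routine. All function names and values are hypothetical.

```python
import math

def min_max_scale(values):
    """Min-max normalization of one column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; fold f holds samples f, f+k, f+2k, ..."""
    for fold in range(k):
        val = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, val

def stand_in_model(train_x, train_y, gamma):
    """Kernel-weighted average of the training targets (a stand-in, NOT an SVM)."""
    def predict(x):
        weights = [math.exp(-gamma * (x - xi) ** 2) for xi in train_x]
        return sum(w * yi for w, yi in zip(weights, train_y)) / sum(weights)
    return predict

def cv_rmse(x, y, gamma, k):
    """Root mean square validation error averaged over the k folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(x), k):
        model = stand_in_model([x[i] for i in train], [y[i] for i in train], gamma)
        mse = sum((model(x[i]) - y[i]) ** 2 for i in val) / len(val)
        fold_errors.append(math.sqrt(mse))
    return sum(fold_errors) / k

# Step (i): toy pattern matrix with one input column and one target column.
raw_x = [float(i) for i in range(20)]
y = [math.sin(0.3 * v) for v in raw_x]
x = min_max_scale(raw_x)

# Steps (ii)-(iii): fix the kernel type and search its parameter on a grid.
grid = {"gamma": [1.0, 10.0, 100.0]}
results = {g: cv_rmse(x, y, g, k=5) for g in grid["gamma"]}
best_gamma = min(results, key=results.get)
print(results)
print("selected gamma:", best_gamma)
```

With a real SVM the same loop would run over combinations of the kernel parameter, C, and ε (for instance, generated with `itertools.product`), and the selected combination would then be used to retrain the machine on the full training portion before the unseen data are classified or predicted.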
27.6 STRENGTH AND WEAKNESS OF SVMS
There have been many discussions on the strengths and weaknesses of the SVM, among which the following are the most frequently highlighted.
27.6.1 STRENGTHS
The strengths of the SVM reported so far are:
(i) Generalization and training efficiency: Unlike neural networks, SVMs are highly unlikely to become trapped in a local optimum during the training step, because their training is formulated as a quadratic programming problem. As a result, a good and efficient training stage is rewarded by a good performance at the testing stage, when unseen data are faced. In addition, unlike ANNs, SVMs can be used in scenarios where only a limited number of data are available. This is, in fact, one of the main advantages of SVMs over ANNs.
(ii) Mapping data: Perhaps one of the main differences between SVMs and ANNs is the mapping of the input data onto the Hilbert space, which enhances the efficiency and accuracy of the analysis.
(iii) Error-complexity trade-off: Unlike neural networks, where the sum of squares error approach is used to reduce the errors posed by outliers, the SVM uses the trade-off parameter C and the insensitivity parameter ε to control the error of classification/regression tasks. Thus, one can suppress outliers by properly selecting the values of the parameters C and ε.
27.6.2 WEAKNESSES
There are two main concerns when SVMs are used:
(i) Parameter selection: Perhaps one of the main concerns in using SVMs is finding a suitable kernel function which can adequately represent the input data in the Hilbert space. Once a kernel is selected, there are a few other parameters, including the kernel's parameters, the parameter C, and the parameter ε, which need to be identified. Finding the optimum values of these parameters, however, is not an easy task.
(ii) Running time: SVMs generally perform well when a limited number of data are available. However, when a very large number of data are used for training, the SVM may require a long time to solve the dual optimization problem because of the number of input data and Lagrange multipliers involved in finding the support vectors.
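The running-time concern can be made tangible with a short calculation. The dual problems of Eq. (27.16) and Eq. (27.19) contain one Lagrange multiplier per training sample and one kernel value per pair of samples. The sketch below counts both and, assuming that the full kernel matrix is stored densely in 8-byte floating point numbers (an assumption of this illustration; practical solvers often avoid storing the whole matrix), estimates the memory it would occupy.

```python
def dual_problem_size(n_samples, bytes_per_entry=8):
    """Return (number of multipliers, kernel-matrix entries, dense storage in GB)."""
    entries = n_samples * n_samples
    return n_samples, entries, entries * bytes_per_entry / 1e9

for n in (1_000, 10_000, 100_000):
    multipliers, entries, gigabytes = dual_problem_size(n)
    print(f"N = {n:>7,d}  multipliers = {multipliers:>7,d}  "
          f"kernel entries = {entries:>15,d}  dense storage = {gigabytes:g} GB")
```

The number of kernel entries grows with the square of the number of training samples, which is one reason why training becomes slow for very large data sets.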
27.7 APPLICATIONS
This section presents successful applications of SVMs in petroleum and mining related projects. A few studies carried out in recent years, together with their findings, are highlighted to indicate how the parameters included in the structure of SVMs should be selected to obtain a reliable result.
27.7.1 PERMEABILITY
Permeability is one of the most important petrophysical parameters of reservoir rocks, defined as the ability of a rock to conduct fluid flow through its pore spaces. It plays an important role in reservoir characterization [27], flow unit identification [8], the selection of completion and production types, etc., but it is not an easy parameter to estimate. The common approach to estimating permeability relies on core sample data, which are often not available for the whole interval of interest. Well test data are another source of information which can be used for such an estimation, but they cannot provide a continuous estimation [9]. Wireline log data have been considered as another source for estimating the variation of permeability, but there are complex nonlinear relationships between permeability and petrophysical logs which make multilinear and nonlinear statistical techniques unable to provide realistic results [12]. ANNs have, therefore, been used for permeability estimations, but they often suffer from poor generalization ability and overfitting issues [9,32]. This indicates the need for a better machine learning technique which can overcome the issues faced by ANNs.
Al-Anazi and Gates [5] evaluated the application of Multi-Layer Perceptron (MLP) Neural Networks and SVRs with sigmoid and RBF kernel functions in the prediction of permeability. They used 10-fold cross-validation to tune the parameters included in the structure of the SVMs and the MLP. The entire set of wireline logs was used for training, including gamma ray (GR), neutron porosity (NPHI), sonic porosity (DT), bulk density (RHOB), and formation resistivity (ILD), without any input data selection approach. Four different criteria, including the correlation coefficient (R), Root Mean Square Error (RMSE), Average Absolute Error (AAE), and Maximum Absolute Error (MAE), were applied to evaluate the different machine learning techniques. The results of their study indicated that, in general, the MLP is the least appropriate approach for the prediction of permeability, while the SVR with the RBF kernel function is probably the best technique for such predictions.
In a similar study, Gholami et al. [23] attempted to predict the permeability of a reservoir located in the southern part of Iran using the SVR and the Relevance Vector Regression (RVR) methods. In this study, the air permeability of core data taken from the wells was used as the target, while the wireline log data of those wells were considered as the input parameters. Fig. 27.7 shows the variation of permeability (target variable) versus porosity of the core samples in two of the wells. They carried out a regression analysis to evaluate the relationship between different logs and horizontal core permeability, but the analysis was inconclusive. Given the good performance of the Genetic Algorithm (GA) in input data selection, this algorithm was integrated with the SVR and RVR to find the best wireline log input parameters for training.

FIGURE 27.7 Permeability–Porosity data of two different wells.
FIGURE 27.8 Comparing the performance of the RVR and SVR methods in prediction of permeability [23].
After dividing the data into training, testing, and validation portions, the min–max normalization method was used to improve the training by scaling the data. The Gaussian kernel function was chosen because of its success and reputation in similar studies. The input wireline logs were selected by the GA, and the other parameters, including the parameter of the RBF kernel function, were tuned by the leave-one-out cross-validation approach. Gholami et al. [23] concluded that the RVR is a better machine than the SVR in the prediction of permeability, even though the results were very close. Fig. 27.8 shows the results obtained from the machines in the testing stage.
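The four evaluation criteria mentioned above for comparing the machines (correlation coefficient, root mean square error, average absolute error, and maximum absolute error) can be computed as in the following sketch. The measured and predicted values are made-up numbers used only to exercise the function; they are not data from the cited studies, and the function and key names are assumptions of this illustration.

```python
import math

def evaluation_criteria(measured, predicted):
    """Correlation coefficient, RMSE, average and maximum absolute error."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "R": cov / math.sqrt(var_m * var_p),
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "average_abs_error": sum(abs(e) for e in errors) / n,
        "max_abs_error": max(abs(e) for e in errors),  # maximum, not mean
    }

measured = [1.0, 2.0, 3.0, 4.0]    # made-up target values
predicted = [1.5, 2.0, 2.5, 4.0]   # made-up model outputs

scores = evaluation_criteria(measured, predicted)
for name, value in scores.items():
    print(f"{name:18s} {value:.4f}")
```

Reporting the maximum absolute error next to the averaged criteria is useful because a model with a low RMSE can still produce a few large individual errors.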
27.7.2 ROCK MASS CLASSIFICATION (RMR) SYSTEM
Rock mass characterization of complex structures is crucial for recognizing their vulnerable regions. However, when it comes to large structures such as tunnels, obtaining information about rock mass properties is not an easy task. There are generally two kinds of methods, known as destructive and non-destructive, which can be used on these occasions, considering different aspects of the structures [13]. Destructive is a common term for methods which can accurately determine the mechanical properties of rocks using direct mechanical tests in the laboratory. These methods are, however, time consuming and expensive [39]. As a result, non-destructive methods such as the Ground Penetrating Radar (GPR) system, X-ray Radiography, and Impact Echo (IE) have been developed [28,20]. Although non-destructive methods are cost-effective and faster than destructive ones, they often cannot yield meaningful results because they do not measure the rock mass properties directly. Alimoradi et al. [6] carried out a study on the estimation of the Rock Mass Classification (RMR) system using the Tunnel Seismic Prediction (TSP-203) method, which could provide compressional and shear wave related data including velocity, orientation, and polarity. They trained a conventional Back-Propagation Neural Network (BPNN) for such an estimation but did not obtain a correlation coefficient better than 0.88 at the testing stage.
Gholami et al. [24] compared the application of an empirical correlation, the SVR, and the RVR in the prediction of the RMR using the data of two tunnels located in the north of Iran. Linking the velocity of the P-wave and the Rock Quality Control (Q) to obtain an empirical correlation for the RMR, it was found that the empirical correlation either overestimates or underestimates the value of the RMR along the route of the tunnels. Since the earlier study had shown that ANNs are not able to provide satisfactory results, the SVR and RVR were integrated with the GA for the prediction of the RMR. The results provided by the GA highlighted that the P- and S-wave velocities, together with their magnitudes and reflection depths, are the best input parameters to train the machines. The available data were then normalized using the min–max approach, and the Gaussian kernel function was chosen for both machines. The parameters of each machine were then tuned by the K-fold cross-validation technique. Fig. 27.9 highlights the efficiency of the machines in the testing stage. Having obtained a promising result from the machines based on the data of one tunnel, the same machines were used to estimate the RMR of the second tunnel, where the real RMR was not available but two crushing sites were observed between 3100 and 3300 m of the tunnel route. Fig. 27.9 displays the performance of the SVR and RVR in recognizing the two crushing sites and predicting the RMR of the second tunnel. Looking at Fig. 27.9, one may conclude that the SVR and RVR are a better option than ANNs in the prediction of the RMR. Although the RVR seems to be the better machine, the results obtained from the SVR were still very promising.

FIGURE 27.9 Comparing the application of the empirical correlation with the SVR and RVR in prediction of RMR [24].
27.7.3 SHEAR WAVE VELOCITY

In petroleum-related rock mechanics, methodologies based on acoustic velocity have gained a lot of attention lately, mainly because they provide good estimates of the elastic and strength properties of rocks [33]. However, a complete rock characterization requires both compressional (P) and shear (S) wave velocity data, which may not always be available [7]. The compressional wave velocity is available for most of the wells drilled in a field, whereas the shear wave velocity is often not measured because of the complexity of the measurement or for cost-saving reasons. Therefore, many studies have tried to formulate an approach that estimates the shear wave velocity accurately. The conventional approaches are empirical correlations [16,15], which often fail to provide satisfactory results. This led to the idea of using ANNs for such predictions. Several studies attempted to estimate the shear wave velocity from conventional well logs using ANNs [11,10], but a better approach was still needed to overcome the shortcomings of ANNs.

Maleki et al. [29] carried out a study on the estimation of the shear wave velocity by comparing empirical correlations, BPNNs, and SVRs. They used the wireline log data of a field in the southern part of Iran as the input parameters. Many logs could potentially be used for the prediction, but only a few of them had a good relationship with the shear wave velocity. Because the S-wave and P-wave velocity data were well related, they first compared different empirical correlations developed exclusively from the compressional wave velocity. They concluded that the correlation proposed by Castagna et al. [16] is the best of these, with a correlation coefficient of 0.93. However, a better approach was still required to obtain a more accurate result. They integrated a GA with a BPNN and an SVR for input parameter selection and used cross-validation to tune the parameters of the networks. The parameters of the GA were selected based on values reported in the literature. The GA results indicated that the gamma ray, density (RHOB), and P-wave slowness (DTCO) logs are the best choices as input parameters. The optimal values of gamma (σ) and epsilon (ε) were found to be 0.55 and 0.19, respectively, while the optimal learning rate and momentum rate of the BPNN were 0.15 and 0.5, respectively. The trade-off parameter C was set to 100 based on recommendations in the literature. From the values of R and RMSE obtained, they concluded that the SVR performs better than the BPNN and the empirical correlations in predicting the shear wave velocity.

In a similar study, Bagheripour et al. [11] used conventional wireline log data to predict the variation of the shear wave velocity using different empirical correlations and an SVR. They had seven wireline logs as input parameters, with unknown or very poor relationships with the shear wave velocity. They built a BPNN to select the best input parameters for training the SVR and concluded that the compressional wave slowness, neutron porosity, bulk density, and true resistivity logs are the best parameters for a good prediction. The RBF kernel function was used in the SVR, and the optimal values of the parameters C, gamma, and epsilon were found to be 125814.37412, 0.198179, and 0.014732, respectively, based on the grid and pattern search method. Four measures were used to evaluate the performance of the different approaches: R, ARE, average absolute relative error (AARE), and RMSE. They concluded that the SVR is the best approach for shear wave velocity prediction, while the BPNN and the Pickett correlation are the second and third alternatives.
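The evaluation measures named above can be computed as in the following minimal sketch. The definitions used here are the common textbook ones and are an assumption, since the exact formulas of [11] and [29] are not reproduced in this chapter; the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def aare(actual, predicted):
    """Average absolute relative error as a fraction.

    Assumes every actual value is non-zero.
    """
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


def correlation(actual, predicted):
    """Pearson correlation coefficient R."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)
```

A value of R close to 1 together with small RMSE and AARE values indicates that the predicted shear wave velocities follow the measured ones closely.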
27.7.4 INTERFACIAL TENSION

Interfacial Tension (IFT) is generally defined as the accumulation of energy and the imbalance of forces at the interface of two different phases, such as liquid–solid [4]. This tension is particularly important for Enhanced Oil Recovery (EOR) and CO2 sequestration, where gas or water is injected to improve recovery or to reduce the amount of greenhouse gas released into the atmosphere. The IFT is, in fact, the main parameter controlling the capillary trapping force, which is responsible for trapping fluids in the pore structure of reservoirs. As a result, an accurate measurement of the IFT is required for a good estimate of capillary trapping forces under reservoir conditions. This helps to design a strategy through which a better recovery or storage practice can be implemented safely. There have been numerous attempts to determine the IFT through laboratory experiments (e.g., [22,3]), among which the pendant drop and capillary rise methods are the techniques commonly used at high pressure and high temperature [22]. However, laboratory measurements are often time consuming and expensive because of the equipment and expertise required. As a result, empirical correlations were developed to estimate the IFT under certain conditions [4], but their accuracy was uncertain.

Zhang et al. [38] proposed applying ANNs to estimate the IFT from a few related parameters: the pressure, temperature, monovalent cation molality, bivalent cation molality, molar fraction of CH4, and molar fraction of N2. They developed a feedforward Multilayer Perceptron (MLP) network trained with the back-propagation algorithm and used early stopping to prevent overfitting. A good prediction performance was achieved during the testing stage, with a correlation coefficient of 0.970 and a Mean Squared Error (MSE) of 4.18. They concluded that ANNs are a better option than the empirical correlation for prediction of the IFT, but that more studies were required to confirm their findings.

Pointing to the weaknesses of ANNs, Ahmadi and Mahmoudi [4] proposed a Least Square SVM (LSSVM) integrated with a GA for estimation of the IFT. The carbon chain compositions of oil and gas fluids obtained from a series of experiments were used as the input parameters. They partitioned the data into training (80%) and testing (20%) portions without a validation set and used the GA to tune the kernel parameter. The RBF kernel was used, and its parameter was found to be 157.9. They presented a flow chart of the method and concluded that the SVM is a promising approach that can perform much better than ANNs for IFT estimation.
27.7.5 COMPRESSIVE STRENGTH

Uniaxial compressive strength (UCS) is one of the most important mechanical properties of rocks and is widely used in engineering projects to evaluate the stability of structures under load. Determining the UCS requires high-quality core samples, which cannot always be obtained from weak, fractured, or foliated rocks. In addition, the uniaxial or triaxial compressive tests conducted to determine the UCS are destructive, time consuming, and expensive [25]. As a result, different models based on mineralogical–petrographic characteristics and physical properties have been developed [18,37]. These models are easy to use but need to be calibrated with basic mechanical tests [37]. ANNs have, therefore, been applied to the prediction of the UCS (e.g., [18]). These studies indicated that ANNs are effective when compared with analytical predictive models, but a better approach was still required.
Emphasizing the superior performance of SVMs and RVMs over ANNs, Ceryan [17] evaluated different machine learning methods for estimating the UCS of volcanic rock samples collected from different sites in Turkey. To create a database, a series of UCS tests was conducted on 10 samples cored from 47 blocks. The P-wave velocity and durability of the samples were taken as the input parameters and normalized using the min–max approach. The RBF kernel was used for both the SVR and RVR machines, while the ANN was trained with the Levenberg–Marquardt algorithm. Cross-validation was used to tune the parameters of the three models. Ceryan [17] used several criteria to evaluate the different approaches, including the RMSE, Variance Account Factor (VAF), maximum determination coefficient (R²), adjusted determination coefficient (Adj. R²), Performance Index (PI), Nash–Sutcliffe coefficient (NS), and the Weighted Mean Absolute Percentage Error (WMAPE). Based on these criteria, the study concluded that the RVR is the best approach for estimating the UCS when only a limited number of data are available. However, the SVR still provided results very close to those of the RVR.
27.8 SUMMARY

Support Vector Machines (SVMs) have been among the most successful machine learning techniques in recent years and have been applied to many engineering problems, including those of the petroleum and mining industries. In this chapter, attempts were made to show how an SVM works and how it can be structured to provide reliable results. A few issues were raised, including the selection of kernel functions and other SVM parameters. Based on examples from different applications of SVMs, it was concluded that the RBF (Gaussian) kernel function is perhaps the best kernel for an efficient SVM. For selecting the parameters of SVMs, cross-validation appears to be the best choice based on the studies carried out so far. One should always remember that although SVMs handle the problems raised by a limited number of training data very well, they may be time consuming when applied to a huge database.
REFERENCES

[1] M. Abbaszadeh, A. Hezarkhani, S. Soltani-Mohammadi, Proposing drilling locations based on the 3D modeling results of fluid inclusion data using the support vector regression method, J. Geochem. Explor. 165 (2016) 23–34.
[2] S. Abe, Support Vector Machines for Pattern Classification, Springer-Verlag London Limited, 2008, 350 pp.
[3] C.A. Aggelopoulos, M. Robin, E. Perfetti, O. Vizika, Interfacial tension between CO2 and brine (NaCl + CaCl2) at elevated pressures and temperatures: the additive effects of different salts, Adv. Water Resour. 34 (4) (2011) 505–511.
[4] M.A. Ahmadi, B. Mahmoudi, Development of robust model to estimate gas–oil interfacial tension using least square support vector machine: experimental and modeling study, J. Supercrit. Fluids 107 (2016) 122–128.
[5] A.F. Al-Anazi, I.D. Gates, Support vector regression for porosity prediction in a heterogeneous reservoir: a comparative study, Comput. Geosci. 36 (2010) 1494–1503.
[6] A. Alimoradi, A. Moradzadeh, R. Naderi, M. Zad Salehi, A. Etemadi, Prediction of geological hazardous zones in front of a tunnel face using TSP-203 and artificial neural networks, Tunn. Undergr. Sp. Technol. 23 (2008) 711–717.
[7] M.S. Ameen, B.G.D. Smart, J.M. Somerville, S. Hammilton, N.A. Naji, Predicting rock mechanical properties of carbonates from wireline logs, Mar. Pet. Geol. 26 (2009) 430–444.
[8] K. Aminian, S. Ameri, A. Oyerokun, B. Thomas, Prediction of flow units and permeability using artificial neural network, in: SPE Western Regional Meeting, California, U.S.A., 2003.
[9] E. Artun, S. Mohaghegh, J. Toro, T. Wilson, A. Sanchez, Reservoir characterization using intelligent seismic inversion, in: SPE Eastern Regional Meeting Held in Morgantown, 2005, SPE 98012.
[10] M. Asoodeh, P. Bagheripour, Prediction of compressional, shear, and stoneley wave velocities from conventional well log data using a committee machine with intelligent systems, Rock Mech. Rock Eng. 45 (1) (2012) 45–63.
[11] P. Bagheripour, A. Gholami, M. Asoodeh, M. Vaezzadeh-Asadi, Support vector regression based determination of shear wave velocity, J. Pet. Sci. Eng. 125 (2015) 95–99.
[12] B. Balan, S. Mohaghegh, S. Ameri, State-of-the-art in permeability determination from well log data, Part 1, A comparative study, model development, in: SPE Eastern Regional Conference and Exhibition, Morgantown, West Virginia, 1995, SPE 30978.
[13] N. Barton, Some new Q-value correlations to assist site characteristics and tunnel design, Int. J. Rock Mech. Min. Sci. 39 (2002) 185–216.
[14] C.M. Bishop, Pattern Recognition and Machine Learning, Springer Science and Business Media, 2006, 743 pp.
[15] T.M. Brocher, Empirical relations between elastic wave speeds and density in the Earth’s crust, Bull. Seism. Soc. Am. 95 (2005) 2081–2092.
[16] J.P. Castagna, M.L. Batzle, T.K. Kan, Rock Physics – The Link Between Rock Properties and Avo Response, Society of Engineering Geology, 1993, pp. 124–157.
[17] N. Ceryan, Application of support vector machines and relevance vector machines in predicting uniaxial compressive strength of volcanic rocks, J. Afr. Earth Sci. 100 (2014) 634–644.
[18] N. Ceryan, U. Okkan, A. Kesimal, Application of generalized regression neural networks in predicting the unconfined compressive strength of carbonate rocks, Rock Mech. Rock Eng. 45 (2012) 1055–1072.
[19] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods), Cambridge University Press, UK, 2000, 190 pp.
[20] N. Delatte, S. Chen, N. Maini, The application of nondestructive evaluation to subway tunnel systems, in: TRB 2003 Annual Meeting, 2002, pp. 2–8.
[21] R.A. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition, Wiley & Son, 2002, 680 pp.
[22] A. Georgiadis, G. Maitland, J.P.M. Trusler, A. Bismarck, Interfacial tension measurement of the (H2O + CO2) system at elevated pressure and temperatures, J. Chem. Eng. Data 55 (10) (2010) 4168–4175.
[23] R. Gholami, M. Moradzadeh, Sh. Maleki, S. Amiri, J. Hanachi, Applications of artificial intelligence methods in prediction of permeability in hydrocarbon reservoirs, J. Pet. Sci. Eng. 122 (2014) 643–656.
[24] R. Gholami, V. Rasouli, A. Alimoradi, Improved RMR rock mass classification using artificial intelligence algorithms, Rock Mech. Rock Eng. 46 (2013) 1199–1209.
[25] C. Gokceoglu, K. Zorlu, A fuzzy model to predict the uniaxial compressive strength and the modulus of elasticity of a problematic rock, Eng. Appl. Artif. Intell. 17 (2004) 61–72.
[26] L. Hwei-Jen, Y. Jih Pin, Optimal reduction of solutions for support vector machines, Appl. Math. Comput. 214 (2009) 329–335.
[27] J. Lim, Reservoir properties determination using fuzzy logic and neural network from well log data in offshore Korea, J. Pet. Sci. Eng. 49 (2005) 182–192.
[28] L. Locatelli, G. Di Marco, C. Zanichelli, P. Jarre, Rehabilitation of highway tunnels-techniques and procedures, in: A.I.T.E.S-ITA 2001 World Tunnel Congress, 2001, pp. 1–3.
[29] Sh. Maleki, R. Gholami, V. Rasouli, A. Moradzadeh, R. Ghavami Riabi, F. Sadaghzadeh, Comparison of different failure criteria in prediction of safe mud weigh window in drilling practice, Earth-Sci. Rev. 136 (2014) 36–58.
[30] M. Martinez-Ramon, C. Cristodoulou, Support Vector Machines for Antenna Array Processing and Electromagnetic, Morgan & Claypool, 2006, 126 pp.
[31] J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. R. Soc. Lond. 209 (1909) 415–446.
[32] S.O. Olatunji, A. Selamat, A. Abdulraheem, Modeling the permeability of carbonate reservoir using type-2 fuzzy logic systems, Comput. Ind. 62 (2) (2011) 147–163.
[33] R. Singh, A. Kainthola, T.N. Singh, Estimation of elastic constant of rocks using an ANFIS approach, Appl. Soft Comput. 12 (2012) 40–45.
[34] I. Steinwart, Support Vector Machines, Springer Science and Business Media, 2008, 466 pp.
[35] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998, 768 pp.
[36] L. Wang, Support Vector Machines: Theory and Applications, Springer, Berlin, Heidelberg, 2005, 430 pp.
[37] N. Yesiloglu-Gultekin, C. Gokceoglu, E.A. Sezer, Prediction of uniaxial compressive strength of granitic rocks by various nonlinear tools and comparison of their performances, Int. J. Rock Mech. Min. Sci. 62 (2013) 113–122.
[38] J. Zhang, Q. Fenga, S. Wang, X. Zhang, S. Wang, Estimation of CO2–brine interfacial tension using an artificial neural network, J. Supercrit. Fluids 107 (2016) 31–37.
[39] M.D. Zoback, Reservoir Geomechanics, Cambridge University Press, US, 2007, 450 p.
CHAPTER 28
EVOLVING RADIAL BASIS FUNCTION NETWORKS USING MOTH–FLAME OPTIMIZER

Hossam Faris∗, Ibrahim Aljarah∗, Seyedali Mirjalili†
∗King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
†School of Information and Communication Technology, Griffith University, Brisbane, QLD, Australia
28.1 INTRODUCTION

Artificial Neural Networks (ANNs) are powerful information processing models that have been widely investigated and applied by researchers and practitioners. This interest is due to the many advantages of ANNs, such as their robustness, accuracy, and parallelism. Radial Basis Function Networks (RBFNs) are universal approximators and a special type of feedforward neural network in which radial basis functions are used as activation functions. RBFNs are commonly applied to regression, classification, pattern recognition, and time series forecasting problems [1–6]. Besides their strong global approximation capability, RBFNs benefit from other powerful characteristics such as a compact structure, the ability to approximate any continuous function, and tolerance to noise [7–9].

As with any other neural network, a key element in the performance of an RBFN is the learning process. The goal of this process is to tune the parameters of the network in order to minimize some error criterion. An RBFN with the common architecture of a single hidden layer has three main sets of parameters: the connection weights, the widths, and the centers. The conventional approach for training an RBFN is a sequential two-stage process. In the first stage, the centers and widths of the hidden layer are found using an unsupervised clustering algorithm such as k-means [10], vector quantization [11], or decision trees [12]. In the second stage, the connection weights between the hidden layer and the output layer are learned. Usually, the weights are determined linearly using simple linear least squares (LS), the orthogonal least squares (OLS) algorithm [13,14], or a gradient descent algorithm [15]. Despite the advantages of RBFNs, training the network with the conventional approaches has common limitations in convergence speed and prediction accuracy.
For example, the training process is highly likely to become trapped in a local minimum when a classical gradient descent method is used. Moreover, most clustering algorithms and gradient descent methods are very sensitive to their initial parameter settings [16]. For these reasons, many researchers were motivated to investigate nature-inspired and evolutionary algorithms as an alternative approach for training RBFNs. The advantage of this family of global search algorithms is that they are gradient-free and have proved to be more efficient in searching for the global solution when the search space is highly multi-modal and challenging [17–19]. Some examples of such algorithms used in the literature for training RBFNs as part of different training schemes are Genetic Algorithms (GA) [20,15,14,21], Particle Swarm Optimization (PSO) [22–24], Differential Evolution (DE) [25], Ant Colony Optimization (ACO) [26], Biogeography-Based Optimizer (BBO) [27], and Firefly Algorithm [28].

One interesting approach for training RBFNs is to search for all the required parameters simultaneously in a single stage. Although this training scheme looks simple, it poses some challenges. The main one is that the search space becomes very large, so the problem becomes a complex, non-linear optimization task [22]. This makes some metaheuristic methods such as GA slow, requiring more iterations to converge [29].

In this chapter we propose the application of a new nature-inspired metaheuristic algorithm called the Moth–Flame Optimizer (MFO) for training RBFNs. MFO is a population-based algorithm inspired mainly by the special navigation paths of moths in nature. In recent studies, MFO has shown a strong ability to avoid convergence to local minima and a low dependency on the initial solutions when applied to complex and challenging problems [30–32]. MFO also showed promising results when training MLP networks [33]. This motivated us to investigate the efficiency of this metaheuristic in training RBFNs. The approach developed in this work optimizes all the parameters of the network, including the centers, widths, and connection weights, simultaneously. To the best of our knowledge, this is the first time this optimizer has been employed for training RBFNs. To assess the performance of the proposed training approach, the experiments are carried out in two stages: first, we compare MFO with other well-regarded metaheuristic algorithms, and then we compare it with a well-known classical training method commonly used in the literature for training RBFNs. Seven popular data sets are used to benchmark and compare the training algorithms.

This chapter is organized as follows: Section 28.2 gives a brief overview of RBF networks. Section 28.3 describes the MFO algorithm and its main characteristics. The proposed RBFN training method is described in Section 28.4. The details of the experiments and the discussion of the obtained results are given in Section 28.5. Finally, Section 28.6 summarizes the conclusions and findings of this work.

Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00028-4 Copyright © 2017 Elsevier Inc. All rights reserved.
28.2 RADIAL BASIS FUNCTION (RBF) NEURAL NETWORKS

An RBF network in its simplest form is a three-layer feedforward neural network. The first layer corresponds to the inputs of the network, the second is a hidden layer consisting of a number of non-linear RBF activation units, and the last one corresponds to the final output of the network. Activation functions in RBFNs are conventionally implemented as Gaussian functions. Fig. 28.1 shows an example of the RBFN structure. To illustrate the working flow of the RBFN, suppose we have a data set D of N patterns $(x_p, y_p)$, where $x_p$ is the input and $y_p$ is the actual output. The output of the jth activation function $\phi_j$ in the hidden layer of the network can be calculated using Eq. (28.1), based on the distance between the input pattern $x$ and the center $c_j$:

\[ \phi_j(\lVert x - c_j \rVert) = \exp\left( -\frac{\lVert x - c_j \rVert^2}{2\sigma_j^2} \right) \tag{28.1} \]

Here, $\lVert \cdot \rVert$ is the Euclidean norm, and $c_j$ and $\sigma_j$ are the center and width of hidden neuron j, respectively.
FIGURE 28.1 Illustrative example of RBF network.
Then, the output of node k of the output layer of the network can be calculated using Eq. (28.2):

\[ y_k = \sum_{j=1}^{n} \omega_{jk}\, \phi_j(x) \tag{28.2} \]
Most classical approaches in the literature train RBFNs in two stages. In the first stage, the centers and widths are determined using, for example, an unsupervised clustering algorithm. In the second stage, the connection weights between the hidden layer and the output layer are found such that an error criterion, commonly the Mean Squared Error (MSE), is minimized over the whole data set.
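Eqs. (28.1) and (28.2) together define the forward pass of the network. The following is a minimal sketch in plain Python, assuming list-based vectors; the function names `gaussian_rbf` and `rbfn_output` are hypothetical and not taken from the chapter.

```python
import math


def gaussian_rbf(x, center, width):
    """Eq. (28.1): exp(-||x - c||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbfn_output(x, centers, widths, weights):
    """Eq. (28.2): y_k = sum_j weights[j][k] * phi_j(x) for each output k.

    centers: n lists of length I (one per hidden neuron)
    widths:  n positive floats
    weights: n rows of m floats; weights[j][k] links hidden j to output k
    """
    phi = [gaussian_rbf(x, c, s) for c, s in zip(centers, widths)]
    n_outputs = len(weights[0])
    return [sum(weights[j][k] * phi[j] for j in range(len(phi)))
            for k in range(n_outputs)]
```

An input that coincides with a center gives an activation of exactly 1 for that neuron, and the activation decays toward 0 as the input moves away, at a rate controlled by the width.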
28.3 MOTH–FLAME OPTIMIZER

The Moth–Flame Optimization (MFO) algorithm is a recent population-based metaheuristic proposed by Seyedali Mirjalili [34]. MFO is mainly inspired by the navigation method used by moths at night. Moths are insects that are very similar to butterflies. Their special navigation method is based on maintaining a fixed angle with respect to the moonlight while traveling, which guarantees flight in a straight line. This mechanism is known as transverse orientation. The MFO algorithm is mathematically modeled on a special observed case of moth movement in nature, namely when moths are tricked by artificial light made by humans. In this case the artificial light is very close to the moth compared with the moon, which makes the moth converge toward the light along a shrinking spiral. Fig. 28.2 shows an illustrative example of this observation.
FIGURE 28.2 Illustrative example of moths spiral movement around close light source.
MFO has two main components, namely moths and flames. Moths represent the search agents, called individuals in other population-based metaheuristics. The position of a moth in the search space represents the problem variables, that is, a candidate solution. Therefore, the length of the position vector is the dimension of the targeted optimization problem. A population of moths P can be represented as given in Eq. (28.3):

\[ P = \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1d} \\ m_{21} & m_{22} & \cdots & m_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ m_{n1} & m_{n2} & \cdots & m_{nd} \end{bmatrix} \tag{28.3} \]

where n is the number of moths in the population (i.e., the size of the population) and d is the number of variables in the problem. The algorithm evaluates the fitness of each moth using a fitness function and stores the obtained values in a matrix OP, as shown in Eq. (28.4):

\[ OP = \begin{bmatrix} Om_1 \\ Om_2 \\ \vdots \\ Om_n \end{bmatrix} \tag{28.4} \]

where $Om_x$ is the fitness value of moth number x.

The second main component of the MFO algorithm is the flames. MFO maintains a number of flames equal to the number of moths in the population. Therefore, they are represented as a matrix with the same dimensions as the population P in Eq. (28.3). The main purpose of the flames is to represent the best position obtained so far for each moth. Flames also have fitness values and they are stored
in the same way as for the population P in Eq. (28.4). Flames function as a memory for the moths, so each moth memorizes the best position it has reached. Based on these definitions, MFO starts its iterative search process by randomly generating a population of moths. The moths are then evaluated using a fitness function, and their fitness values are stored as flames. In every iteration, MFO updates the positions of the moths and flames according to a navigation mechanism. After each update, the lower and upper bounds of the variables that make up the solutions are checked. The process keeps iterating until the maximum number of iterations is reached. Since the MFO algorithm is based on transverse orientation, the new position of each moth is calculated along a logarithmic spiral, using the following equation:

\[ S(M_i, F_j) = D_{ij} \cdot e^{bt} \cdot \cos(2\pi t) + F_j \tag{28.5} \]

where $D_{ij}$ represents the distance between moth $M_i$ and flame $F_j$, b is a constant that defines the shape of the logarithmic spiral, and t is a random number in [r, 1], where r decreases linearly from −1 to −2 over the iterations. The distance between moth $M_i$ and the corresponding flame $F_j$ is measured as given in Eq. (28.6):

\[ D_{ij} = \lvert F_j - M_i \rvert \tag{28.6} \]

In order to maintain the exploitation capability of the algorithm, an adaptive mechanism decreases the number of flames as the iterations advance, as given in Eq. (28.7):

\[ \text{Flames\_NO} = \operatorname{round}\left( n - l \cdot \frac{n - 1}{T} \right) \tag{28.7} \]

where l is the number of the current iteration, n is the maximum number of flames, and T is the maximum number of iterations.
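The update rules in Eqs. (28.5) to (28.7) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the spiral constant `b = 1.0` is an assumed default, and the equations are applied independently to each dimension of the position vector.

```python
import math
import random


def flame_count(iteration, n_flames, max_iterations):
    """Eq. (28.7): number of flames kept at the given iteration."""
    return round(n_flames - iteration * (n_flames - 1) / max_iterations)


def spiral_move(moth, flame, b, t):
    """Eqs. (28.5)-(28.6), applied to each dimension of the position."""
    return [abs(f - m) * math.exp(b * t) * math.cos(2.0 * math.pi * t) + f
            for m, f in zip(moth, flame)]


def update_moth(moth, flame, iteration, max_iterations, b=1.0, rng=random):
    """Draw t from [r, 1], with r going linearly from -1 to -2, then move."""
    r = -1.0 - iteration / max_iterations
    t = rng.uniform(r, 1.0)
    return spiral_move(moth, flame, b, t)


def clip(position, lower, upper):
    """Bring every variable back inside its bounds after an update."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(position, lower, upper)]
```

With 50 flames and 250 iterations, `flame_count` starts at 50 and shrinks to 1 at the final iteration, so late iterations concentrate the moths around the best solution found.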
28.4 MFO FOR OPTIMIZING RBFN

Unlike most classical approaches to training RBFNs, which rely on a two-stage procedure, the proposed MFO-based training method tunes all the parameters of the network at once. MFO is used to search for the parameters that minimize a predetermined error criterion. The training procedure based on MFO can be described in the following steps:

• Initialization: The MFO algorithm starts by randomly generating a predetermined number of moths (candidate solutions). Each moth represents a solution consisting of the RBFN parameters: the centers, widths, and connection weights. Each individual is defined as a one-dimensional array of real numbers of the form $[c_1, c_2, \ldots, c_n, \sigma_1, \sigma_2, \ldots, \sigma_n, \omega_{11}, \ldots, \omega_{nm}]$. Therefore, the dimension of the problem, i.e., the total length of each individual, can be calculated as given in Eq. (28.8), where I is the number of input features, n is the number of centers, and m is the number of output nodes.

\[ D = (n \times I) + n + (n \times m) + m \tag{28.8} \]
FIGURE 28.3 MFO based training approach for RBF networks.
At this step, the size of the population and the maximum number of iterations are also determined.

• Fitness evaluation: To assess the quality of the generated individuals, each individual is split into its component parts (i.e., centers, widths, and weights). Then, each part is scaled to an appropriate range and the parts are mapped to an RBF network. To evaluate the performance of the generated network, a prediction error criterion is calculated for the network over all the training samples. In this work, the Mean Squared Error (MSE) is used. The MSE can be calculated as given in Eq. (28.9), where $y_i$ is the actual training output for sample i, $\hat{y}_i$ is the corresponding output of the evaluated network, and k is the number of training samples. The goal is to find the set of parameters that minimizes the MSE value.

\[ \mathrm{MSE} = \frac{1}{k} \sum_{i=1}^{k} (y_i - \hat{y}_i)^2 \tag{28.9} \]

• Update and navigate: Based on the fitness evaluation, the positions of the moths and flames are updated using the navigation mechanism described in Section 28.3.

• Stopping criterion: The training process keeps iterating until a stopping criterion is met. Two criteria are used in the literature: the first stops when the fitness value reaches a certain value, and the second stops when the training process reaches a predetermined number of iterations. In our experiments the second criterion is applied in order to control the computation time of the training.

After the training process terminates, the best set of parameters found by MFO is assigned to the RBFN, which is then evaluated on an unseen testing data set. The flow chart of the proposed MFO-based training approach is summarized in Fig. 28.3.
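The encoding and fitness evaluation steps above can be sketched as follows, in plain Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the trailing m values counted by Eq. (28.8) are assumed to be output biases, and the scaling of each part to an appropriate range is omitted, so the widths passed in must already be non-zero.

```python
import math


def problem_dimension(n_inputs, n_centers, n_outputs):
    """Eq. (28.8): D = (n x I) + n + (n x m) + m."""
    return n_centers * n_inputs + n_centers + n_centers * n_outputs + n_outputs


def decode(individual, n_inputs, n_centers, n_outputs):
    """Split a flat moth position into centers, widths, weights and biases.

    Assumption: the last m entries are output biases.
    """
    pos = 0
    centers = []
    for _ in range(n_centers):
        centers.append(individual[pos:pos + n_inputs])
        pos += n_inputs
    widths = individual[pos:pos + n_centers]
    pos += n_centers
    weights = []
    for _ in range(n_centers):
        weights.append(individual[pos:pos + n_outputs])
        pos += n_outputs
    biases = individual[pos:pos + n_outputs]
    return centers, widths, weights, biases


def predict(x, centers, widths, weights, biases):
    """Gaussian hidden layer followed by a linear output layer."""
    phi = [math.exp(-sum((a - c) ** 2 for a, c in zip(x, ctr)) / (2.0 * w ** 2))
           for ctr, w in zip(centers, widths)]
    return [b + sum(weights[j][k] * phi[j] for j in range(len(phi)))
            for k, b in enumerate(biases)]


def mse_fitness(individual, inputs, targets, n_inputs, n_centers, n_outputs):
    """Eq. (28.9), averaged over all training samples and output nodes."""
    params = decode(individual, n_inputs, n_centers, n_outputs)
    total, count = 0.0, 0
    for x, y in zip(inputs, targets):
        y_hat = predict(x, *params)
        total += sum((yk - yhk) ** 2 for yk, yhk in zip(y, y_hat))
        count += len(y)
    return total / count
```

For a network with 4 input features, 6 hidden neurons, and 1 output node, `problem_dimension(4, 6, 1)` gives 37 variables per moth. With a single output node, `mse_fitness` reduces exactly to Eq. (28.9).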
Table 28.1 Summary of the classification data sets

No.  Data Set       #Features  #Train Samples  #Test Samples
1    Blood          4          493             255
2    Breast         8          461             238
3    Diagnosis I    6          79              41
4    Diagnosis II   6          79              41
5    Parkinson      22         128             67
6    Heart          10         102             53
7    PlanningRelax  12         120             62
It is worth mentioning here that the stochastic nature of MFO will improve local optima avoidance of RBFN. However, it requires more function evaluations and is computationally more expensive than conventional gradient-based training algorithms.
28.5 EXPERIMENTS AND RESULTS
To evaluate the effectiveness and performance of the developed MFO based training approach, seven data sets drawn from the UCI Repository [35] have been utilized. All data sets are binary classification problems from the biomedical domain. The chosen data sets vary in the number of features and instances, and so represent different levels of difficulty for the training methods. These data sets are described in Table 28.1 in terms of the number of features and training/testing instances. Each data set is divided into 2/3 for training and 1/3 for testing. The training and testing parts are drawn using stratified sampling in order to preserve the original percentage of each class in the samples. All features in the data sets are normalized to the range [0, 1] using Eq. (28.10) so that the features lie on the same scale. In Eq. (28.10), $x_i$ is the original value of sample i of feature x, $y_i$ is the value of this sample after normalization, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the original feature x, and $y_{\min}$ and $y_{\max}$ are the minimum and maximum values of feature x after normalization.
$$y_i = (y_{\max} - y_{\min})\,\frac{x_i - x_{\min}}{x_{\max} - x_{\min}} + y_{\min} \tag{28.10}$$
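As a minimal sketch of the normalization in Eq. (28.10), the following function rescales one feature column. The function name and the handling of a constant feature (mapped to the lower bound) are assumptions made for illustration; the chapter does not say how that case is treated.

```python
def min_max_scale(values, y_min=0.0, y_max=1.0):
    """Rescale one feature column to [y_min, y_max] as in Eq. (28.10)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature is mapped to the lower bound.
        return [y_min for _ in values]
    scale = (y_max - y_min) / (x_max - x_min)
    return [scale * (x - x_min) + y_min for x in values]


print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum would usually be taken from the training part only and reused for the testing part, so that no information from the test samples leaks into training.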
The experiments are conducted in two stages: in the first, we compare the performance of the MFO based trainer with three other metaheuristic optimizers, and in the second we compare it with the powerful training routine newrb in Matlab.
28.5.1 COMPARISON WITH OTHER METAHEURISTICS
In this part of the experiments, we train the RBFN with the MFO algorithm and compare its performance with other metaheuristic trainers, including two well-regarded algorithms, GA and PSO, and the recent BA algorithm. All the optimizers are tuned as per the values in Table 28.2. The maximum number of iterations is set to 250 for all the algorithms. The metaheuristics are used to train RBFNs with different numbers of neurons in the hidden layer (i.e., 4, 6, 8, 10, 12, and 14). In addition, each experiment is executed 10 times, and then the average accuracy rate, the best accuracy rate, and the standard deviation are calculated. The accuracy rate is measured as the number of correctly classified instances over the total number of instances.

Table 28.2 Parameter settings of the metaheuristic algorithms
| Algorithm | Parameter | Value |
|---|---|---|
| GA | Crossover probability | 0.9 |
| GA | Mutation probability | 0.1 |
| GA | Selection mechanism | Roulette wheel |
| GA | Population size | 50 |
| PSO | Acceleration constants | [2.1, 2.1] |
| PSO | Inertia weights | [0.9, 0.6] |
| PSO | Number of particles | 50 |
| BA | Loudness | 0.5 |
| BA | Pulse rate | 0.5 |
| BA | Frequency minimum | 0 |
| BA | Frequency maximum | 1 |
| MFO | Number of moths | 50 |

The average accuracy rates of all trained RBFNs are shown in Fig. 28.4 for each data set. The MFO algorithm shows higher average accuracy rates in all data sets and for all sizes of RBFNs, except for the Blood data set, in which the BA algorithm was a better trainer at 12 neurons. It is also interesting to see that MFO is the only optimizer that reached an average classification rate of 100% for the small data sets Diagnosis I and Diagnosis II for some network sizes. In other words, it achieved a perfect classification rate in all 10 runs for some structures, which is an indicator of the robustness of the trainer. Inspecting the results obtained for the Parkinson’s and PlanningRelax data sets, it may be seen that the GA, PSO, and BA algorithms achieved low and almost equal accuracy rates, which may be due to the high difficulty of these data sets. However, MFO showed higher rates in the Parkinson’s data set for all numbers of hidden units, and remarkably higher rates in the PlanningRelax data set for 6 and 12 hidden units. For the Heart data set, MFO shows a significant improvement in accuracy over the other metaheuristics.

In Fig. 28.5, the best accuracy rates achieved by the trained RBFNs over the 10 runs using all structures are shown. This figure reveals that RBFNs trained using MFO achieved the best accuracy in most of the cases. The convergence curves in Fig. 28.6 show that MFO benefits from a fast convergence speed as well.
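The reported statistics can be reproduced from per-run accuracy rates with a short helper. This is a sketch only: the function names are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the chapter does not state which was used.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Correctly classified instances over the total number of instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def summarize(rates):
    """Average, best, and standard deviation of accuracy over repeated runs.

    Assumption: the sample standard deviation is used.
    """
    return mean(rates), max(rates), stdev(rates)


# Toy usage with made-up labels and run results (not data from the chapter).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(summarize([0.5, 0.7, 0.9]))
```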
FIGURE 28.4 Average accuracy rates for MFO, GA, PSO, and BA obtained using different numbers of hidden neurons. (A) Blood, (B) Breast cancer, (C) Heart, (D) Parkinsons, (E) PlanningRelax, (F) Diagnosis I, (G) Diagnosis II.
FIGURE 28.5 Best accuracy rates for MFO, GA, PSO, and BA obtained using different numbers of hidden neurons. (A) Blood, (B) Breast cancer, (C) Heart, (D) Parkinsons, (E) PlanningRelax, (F) Diagnosis I, (G) Diagnosis II.
FIGURE 28.6 Convergence curves for MFO, GA, PSO, and BA based on 12 hidden neurons. (A) Blood, (B) Breast cancer, (C) Heart, (D) Parkinsons, (E) PlanningRelax, (F) Diagnosis I, (G) Diagnosis II.

28.5.2 COMPARISON WITH newrb
In this part of the experiment we compare the best results obtained by the metaheuristics in the previous section with the newrb routine in Matlab. newrb is an advanced training method for RBFNs included in the Matlab Neural Network Toolbox as a standard training algorithm [36]. It trains the RBFN by iteratively adding new hidden units one at a time until the MSE drops below a specified level or the maximum number of hidden units allowed is reached. It is important to note that newrb is deterministic; that is, the function always produces the same output when trained on the same data set.

Table 28.3 compares the metaheuristic algorithms with the newrb routine. It lists the average, standard deviation, and best accuracy rates obtained on the seven data sets mentioned previously. Looking at the average rates, the RBFN trained by MFO achieved the best results in six data sets out of seven, while newrb ranked second in four data sets. Although MFO is a non-deterministic training method, it achieved very promising results compared with the deterministic newrb. For example, in the Diagnosis I and Diagnosis II data sets, MFO achieved an average accuracy rate of 100%, while newrb achieved 97.56% for Diagnosis I and 100% for Diagnosis II. The difference is larger in the Heart, Parkinson’s, and PlanningRelax data sets, where MFO exceeds newrb by about 4.6, 4.3, and 11.8 percentage points, respectively. It can also be noticed that MFO has the lowest standard deviation among the metaheuristic trainers on four of the seven data sets (Breast, Heart, Diagnosis I, and Diagnosis II), which indicates the stability of the algorithm on these problems.

Table 28.3 Classification rates for all data sets, reported as AVE ± STD [Best]
| Algorithm | Blood | Breast | Heart | Parkinson’s | PlanningRelax | Diagnosis I | Diagnosis II |
|---|---|---|---|---|---|---|---|
| MFO | 0.7769±0.0063 [0.7882] | 0.9718±0.0028 [0.9748] | 0.8391±0.0234 [0.8696] | 0.7746±0.0227 [0.8209] | 0.6500±0.0078 [0.6613] | 1.0000±0.0000 [1.0000] | 1.0000±0.0000 [1.0000] |
| GA | 0.7698±0.0056 [0.7804] | 0.9538±0.0198 [0.9748] | 0.6120±0.0757 [0.7935] | 0.7642±0.0094 [0.7910] | 0.6452±0.0000 [0.6452] | 0.7098±0.1456 [1.0000] | 0.8463±0.1858 [1.0000] |
| PSO | 0.7663±0.0027 [0.7725] | 0.8718±0.0878 [0.9664] | 0.7228±0.0759 [0.8043] | 0.7612±0.0000 [0.7612] | 0.6452±0.0000 [0.6452] | 0.7610±0.1143 [0.9268] | 0.7878±0.1459 [1.0000] |
| BA | 0.7773±0.0076 [0.7882] | 0.9655±0.0103 [0.9748] | 0.7076±0.1085 [0.8370] | 0.7612±0.0000 [0.7612] | 0.6452±0.0000 [0.6452] | 0.8366±0.0789 [1.0000] | 0.9488±0.1382 [1.0000] |
| newrb | 0.7686±0.0000 [0.7686] | 0.9664±0.0000 [0.9664] | 0.7935±0.0000 [0.7935] | 0.7313±0.0000 [0.7313] | 0.5323±0.0000 [0.5323] | 0.9756±0.0000 [0.9756] | 1.0000±0.0000 [1.0000] |

In summary, the results of this section show that the MFO algorithm is able to train RBF networks efficiently, which resulted in very high classification accuracy. This is due to the exploitative behavior of the algorithm, which is strengthened in proportion to the number of iterations. The solutions update their positions with respect to fewer flames as the iteration counter increases, which leads to more local search and eventually to an accurate approximation of the global optimum. On the other hand, the adaptive mechanism that determines the number of best solutions (flames) in each iteration first emphasizes exploration and then exploitation. The problem of training RBF networks is multi-modal, and the results show that this exploratory mechanism of MFO is able to handle it effectively.
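The shift from exploration to exploitation can be illustrated with the flame-count schedule. The sketch below assumes the linearly decreasing schedule given in the original MFO formulation [34], in which the number of flames falls from the number of moths to one over the run; the function and argument names are hypothetical.

```python
def flame_count(iteration, max_iter, n_moths):
    """Number of flames kept at a given iteration (1-based).

    Assumes the linear schedule of the original MFO formulation [34]:
    round(N - l * (N - 1) / T), with N moths, iteration l, and T iterations.
    """
    return round(n_moths - iteration * (n_moths - 1) / max_iter)


# With the settings of Table 28.2 (50 moths) and 250 iterations:
schedule = [flame_count(l, 250, 50) for l in range(1, 251)]
print(schedule[0], schedule[-1])  # 50 1
```

Early iterations keep many flames, so moths are spread over many promising regions; by the last iteration all moths update their positions around a single flame.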
28.6 CONCLUSION
This chapter put forward the use of the recently proposed MFO for training RBF networks. The problem of training RBF networks was formulated by identifying the parameters and the objective. MFO was then employed as a global optimizer to find the parameter values that minimize the objective (MSE). The results indicate that MFO has merit in training RBF networks and is able to outperform the other algorithms considered. Considering the results and findings of this work, we conclude that the MFO algorithm flexibly handles the problem of local optima stagnation with a reasonable convergence speed when training RBF networks.
REFERENCES
[1] E. Kovač-Andrić, A. Sheta, H. Faris, M.Š. Gajdošik, Forecasting ozone concentrations in the East of Croatia using nonparametric neural network models, J. Earth Syst. Sci. 125 (5) (2016) 997–1006, http://dx.doi.org/10.1007/s12040-016-0705-y.
[2] W. Shen, X. Guo, C. Wu, D. Wu, Forecasting stock indices using radial basis function neural networks optimized by artificial fish swarm algorithm, Knowledge-Based Syst. 24 (3) (2011) 378–385.
[3] A.F. Sheta, K. De Jong, Time-series forecasting using GA-tuned radial basis functions, Inform. Sci. 133 (3) (2001) 221–228.
[4] S.-K. Oh, S.-H. Yoo, W. Pedrycz, Design of face recognition algorithm using PCA-LDA combined for hybrid data preprocessing and polynomial-based RBF neural networks: design and its application, Expert Syst. Appl. 40 (5) (2013) 1451–1466.
[5] W. Jia, D. Zhao, L. Ding, An optimized RBF neural network algorithm based on partial least squares and genetic algorithm for classification of small sample, Appl. Soft Comput. 48 (2016) 373–384.
[6] K. Meng, Z.Y. Dong, D.H. Wang, K.P. Wong, A self-adaptive RBF neural network classifier for transformer fault analysis, IEEE Trans. Power Syst. 25 (3) (2010) 1350–1360.
[7] J. Park, I.W. Sandberg, Universal approximation using radial-basis-function networks, Neural Comput. 3 (2) (1991) 246–257.
[8] K.-L. Du, M. Swamy, Radial basis function networks, in: Neural Networks and Statistical Learning, Springer, 2014, pp. 299–335.
[9] H. Yu, T. Xie, S. Paszczyński, B.M. Wilamowski, Advantages of radial basis function networks for dynamic system design, IEEE Trans. Ind. Electron. 58 (12) (2011) 5438–5450.
[10] J. Sing, D. Basu, M. Nasipuri, M. Kundu, Improved k-means algorithm in the design of RBF neural networks, in: TENCON 2003. Conference on Convergent Technologies for the Asia-Pacific Region, Vol. 2, IEEE, 2003, pp. 841–845.
[11] M. Vogt, Combination of radial basis function neural networks with optimized learning vector quantization, in: IEEE International Conference on Neural Networks, 1993, IEEE, 1993, pp. 1841–1846.
[12] M. Kubat, Decision trees can initialize radial-basis function networks, IEEE Trans. Neural Netw. 9 (5) (1998) 813–821.
[13] C.-L. Lin, J. Wang, C.-Y. Chen, C.-W. Chen, C. Yen, Improving the generalization performance of RBF neural networks using a linear regression technique, Expert Syst. Appl. 36 (10) (2009) 12049–12053.
[14] S. Chen, Y. Wu, B. Luk, Combined genetic algorithm optimization and regularized orthogonal least squares learning for radial basis function networks, IEEE Trans. Neural Netw. 10 (5) (1999) 1239–1243.
[15] R. Neruda, P. Kudová, Learning methods for radial basis function networks, Futur. Gener. Comput. Syst. 21 (7) (2005) 1131–1142.
[16] M.Y. Mashor, Improving the performance of k-means clustering algorithm to position the centers of RBF network, Int. J. Comput., Internet Manag. 6 (2) (1998) 121–124.
[17] X.-S. Yang, Nature-Inspired Optimization Algorithms, Elsevier, 2014.
[18] S. Mirjalili, S.M. Mirjalili, A. Lewis, Let a biogeography-based optimizer train your multi-layer perceptron, Inform. Sci. 269 (2014) 188–209.
[19] H. Faris, I. Aljarah, S. Mirjalili, Training feedforward neural networks using multi-verse optimizer for binary classification problems, Appl. Intell. (2016) 1–11.
[20] S.A. Billings, G.L. Zheng, Radial basis function network configuration using genetic algorithms, Neural Netw. 8 (6) (1995) 877–890.
[21] W. Jia, D. Zhao, T. Shen, C. Su, C. Hu, Y. Zhao, A new optimized GA-RBF neural network algorithm, Comput. Intell. Neurosci. 2014 (2014) 44.
[22] S. Chen, X. Hong, B.L. Luk, C.J. Harris, Non-linear system identification using particle swarm optimisation tuned radial basis function models, Int. J. Bio-Inspir. Comput. 1 (4) (2009) 246–258.
[23] D. Wu, K. Warwick, Z. Ma, M.N. Gasson, J.G. Burgess, S. Pan, T.Z. Aziz, Prediction of Parkinson’s disease tremor onset using a radial basis function neural network based on particle swarm optimization, Int. J. Neural Syst. 20 (02) (2010) 109–116.
[24] Y. Zhong, X. Huang, P. Meng, F. Li, PSO-RBF neural network PID control algorithm of electric gas pressure regulator, in: Abstract and Applied Analysis, vol. 2014, Hindawi Publishing Corporation, 2014.
[25] B. Yu, X. He, Training radial basis function networks with differential evolution, in: Proceedings of IEEE International Conference on Granular Computing, 2006.
[26] M. Chun-tao, L. Xiao-xia, Z. Li-yong, Radial basis function neural network based on ant colony optimization, in: International Conference on Computational Intelligence and Security Workshops, 2007, CISW 2007, IEEE, 2007, pp. 59–62.
[27] I. Aljarah, H. Faris, S. Mirjalili, N. Al-Madi, Training radial basis function networks using biogeography-based optimizer, Neural Comput. Appl. (2016) 1–25.
[28] M.-H. Horng, Y.-X. Lee, M.-C. Lee, R.-J. Liou, Firefly metaheuristic algorithm for training the radial basis function network for data classification and disease diagnosis, Theory New Appl. Swarm Intell. 4 (7) (2012) 115–132.
[29] M. Gan, H. Peng, X.-p. Dong, A hybrid algorithm to optimize RBF network architecture and parameters for nonlinear time series prediction, Appl. Math. Model. 36 (7) (2012) 2911–2919.
[30] C. Li, S. Li, Y. Liu, A least squares support vector machine model optimized by moth–flame optimization algorithm for annual power load forecasting, Appl. Intell. (2016) 1–13.
[31] B. Bentouati, L. Chaib, S. Chettih, Optimal power flow using the moth flam optimizer: a case study of the Algerian power system, Indones. J. Electr. Eng. Comput. Sci. 1 (3) (2016) 431–445.
[32] L. Zhang, K. Mistry, S.C. Neoh, C.P. Lim, Intelligent facial emotion recognition using moth-firefly optimization, Knowledge-Based Syst. 111 (2016) 248–267.
[33] W. Yamany, M. Fawzy, A. Tharwat, A.E. Hassanien, Moth–flame optimization for training multi-layer perceptrons, in: 2015 11th International Computer Engineering Conference, ICENCO, IEEE, 2015, pp. 267–272.
[34] S. Mirjalili, Moth–flame optimization algorithm: a novel nature-inspired heuristic paradigm, Knowledge-Based Syst. 89 (2015) 228–249.
[35] M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml, 2013.
[36] M.H. Beale, M.T. Hagan, H.B. Demuth, Neural network toolbox™ user’s guide, in: R2012a, The MathWorks, Inc., 3 Apple Hill Drive Natick, MA 01760-2098, 2012, www.mathworks.com, Citeseer.
CHAPTER 29
APPLICATION OF FUZZY METHODS IN POWER SYSTEM PROBLEMS
Sajad Madadi, Morteza Nazari-Heris, Behnam Mohammadi-Ivatloo
Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00029-6 Copyright © 2017 Elsevier Inc. All rights reserved.

29.1 INTRODUCTION
The goal of classification is to recognize phenomena such as islanding, which occurs when a micro-grid is disconnected from the main network. During this event, the voltage and frequency of the distributed generator (DG) located in the islanded grid deviate. The fuzzy method has high accuracy in islanding detection, whereas other methods can misclassify islanding and non-islanding conditions. Classification is commonly implemented for a wide range of power system problems. In [1], an online multicriteria fuzzy-logic-based technique is used to classify faults occurring in the transmission system. Another application of the fuzzy method is in building expert decision systems, which use classification. An expert system to define transient disturbance waveforms in a power system is presented in [2]. Transient stability is the capability of the power system to remain synchronized under credible disturbances, and a fuzzy classifier can be used to assess it in a multimachine system.

Clustering is a powerful means for fast data processing, which is used to divide the processed data sets into smaller groups. The similarity of the data and the distances between them form the basis of data allocation. A significant application of clustering is in probabilistic power scheduling, which has a high computational load. Total transfer capacity (TTC) determines the maximum power transaction permitted between areas, taking into account the contingencies that are probable in the power system. Calculating the TTC requires the solution of an optimal power flow. With the increase in renewable energy, such as from wind farms and photovoltaic cells, providing a deterministic value for TTC is not practical due to the stochastic nature of these energy sources. The Monte Carlo simulation (MCS) method is used for probabilistic TTC calculation, which requires significant computational effort. Clustering is a practical way to reduce this effort.

Forecasting methods are generally used to detect a relationship between past and future data and to estimate the future data based on this relationship. Forecasting of quantities such as load, wind, and solar radiation is essential for planning and scheduling of power systems.

A fuzzy inference system (FIS) is a common fuzzy method used in power system applications. The steps of FIS include fuzzification, decision, and defuzzification. In the fuzzification step, a decision variable is proposed and this proposal is used by the decision maker. Considering the proposal, the decision maker sets a degree for each fuzzy rule. In the defuzzification step, the degree of each rule is employed for obtaining the final result.

Classical logic (CL) assigns distinct (crisp) values. Consider, for instance, the set of electrical energy consumers in an area. If a consumer belongs to the area, a value of 1 is assigned to that consumer; otherwise, the indicator of this consumer is set to 0. In contrast with CL, fuzzy logic (FL) does not assign a crisp value to set membership. Instead, a degree of membership is used in FL to define the sets. Different functions have been introduced for modeling the membership of sets. The most important ones are the following:

1. Triangular-shaped membership function
This membership function is based on a vector x and three control parameters a, b, and c. The triangular membership function is given by
$$f(x; a, b, c) = \max\left(\min\left(\frac{x-a}{b-a}, \frac{c-x}{c-b}\right), 0\right). \tag{29.1}$$
2. S-shaped membership function
An S-shaped function is a spline-based curve for obtaining the membership degree of the sets; it is named for the resemblance of its shape to the letter S. It is formulated as follows, where a and b determine the extremes of the sloped portion:
$$f(x; a, b) = \begin{cases} 0, & x \le a \\ 2\left(\dfrac{x-a}{b-a}\right)^2, & a \le x \le \dfrac{a+b}{2} \\ 1 - 2\left(\dfrac{x-b}{b-a}\right)^2, & \dfrac{a+b}{2} \le x \le b \\ 1, & x \ge b \end{cases} \tag{29.2}$$
3. Sigmoidally shaped membership function
This membership function is based on a vector x and two parameters a and c. It can be formulated as follows:
$$f(x; a, c) = \frac{1}{1 + e^{-a(x-c)}}. \tag{29.3}$$
This function is open to the right or to the left depending on the sign of a, and it is generally used to represent concepts such as “very large” or “very negative”. A similar function can be obtained as the product or difference of two different sigmoidal functions.
4. Gaussian curve membership function
The Gaussian curve membership function is based on a vector x and the parameters σ and c. It is given by
$$f(x; \sigma, c) = e^{-\frac{(x-c)^2}{2\sigma^2}}. \tag{29.4}$$
FIGURE 29.1 Daily demand curve.
5. Trapezoidal-shaped membership function
The trapezoidal-shaped membership function is based on the parameters a, b, c, and d, and can be formulated as follows:
$$f(x; a, b, c, d) = \begin{cases} 0, & x \le a \\ \dfrac{x-a}{b-a}, & a \le x \le b \\ 1, & b \le x \le c \\ \dfrac{d-x}{d-c}, & c \le x \le d \\ 0, & d \le x \end{cases} \tag{29.5}$$
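The five membership functions listed above can be sketched as plain Python functions. The function names are hypothetical (chosen for brevity), the parameters are assumed to be strictly ordered (a < b < c < d where applicable), and the upper branch of the S-shaped function is written with (x − b) so that the curve is continuous at the midpoint and reaches 1 at x = b.

```python
import math


def trimf(x, a, b, c):
    """Triangular membership: feet at a and c, peak at b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


def smf(x, a, b):
    """S-shaped membership rising from 0 at a to 1 at b (a < b)."""
    if x <= a:
        return 0.0
    if x <= (a + b) / 2:
        return 2 * ((x - a) / (b - a)) ** 2
    if x <= b:
        return 1 - 2 * ((x - b) / (b - a)) ** 2
    return 1.0


def sigmf(x, a, c):
    """Sigmoidal membership; the sign of a sets the open side."""
    return 1 / (1 + math.exp(-a * (x - c)))


def gaussmf(x, sigma, c):
    """Gaussian membership centred at c with width sigma."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))


def trapmf(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


print(trimf(2.5, 0, 5, 10), smf(5, 0, 10), sigmf(3, 2, 3))  # 0.5 0.5 0.5
```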
29.1.1 DEFINITION OF FL MEMBERSHIP
Consider, for instance, the daily demand curve of a region given in Fig. 29.1. The load demand is divided into three classes: low, medium, and high demand. X is the parameter used to represent load demand. Assume $X_1 = 50$ MW and $X_2 = 100$ MW, and let the membership value be defined up to $X_3$. Then
$X < X_1$ introduces low demand,
$X_1 < X < X_2$ introduces medium demand,
$X_2 < X < X_3$ introduces high demand.
Accordingly, a load demand of 75 MW belongs to the medium-demand class with grade 0.75, and a load demand of 65 MW belongs to it with grade 0.25.
29.1.2 PROPOSITIONS OF FL
A simple proposition generally includes a premise and a conclusion. An example of such a statement is: If the current rises above a set value, then the protection devices should operate. Consider the following FL proposition: If the air temperature is high, the generation units should increase their generation. In this case there are various grades of air temperature between hot and cold, and there are various grades of increase in power generation. A 0–1 conclusion on the basis of air temperature is not possible. In the following text, mathematical formulations are explained to obtain an acceptable conclusion. Compound propositions can also be defined in FL. The “and” and “or” connectives are commonly used for modeling a compound proposition. The “and” connective is modeled by a “T-norm”, and the “or” connective by an “S-norm”.
FIGURE 29.2 Results of minimum implication method.
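As a small illustration of compound propositions, the sketch below shows two commonly used T-norms and S-norms. The text does not say which operators the authors use, so the choice of minimum/product and maximum/probabilistic sum, as well as the function names and the membership grades, are assumptions.

```python
def t_norm_min(a, b):
    """Fuzzy "and" as the minimum of two membership grades."""
    return min(a, b)


def t_norm_prod(a, b):
    """Fuzzy "and" as the algebraic product."""
    return a * b


def s_norm_max(a, b):
    """Fuzzy "or" as the maximum of two membership grades."""
    return max(a, b)


def s_norm_probsum(a, b):
    """Fuzzy "or" as the probabilistic sum a + b - a*b."""
    return a + b - a * b


# Hypothetical grades: "temperature is high" = 0.7, "demand is high" = 0.4.
hot, high_demand = 0.7, 0.4
print(t_norm_min(hot, high_demand), s_norm_max(hot, high_demand))  # 0.4 0.7
```

Each T-norm never exceeds either input grade, and each S-norm is never below either one, which matches the intuition that "and" is stricter and "or" is more permissive.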
29.1.3 IMPLICATION
Consider the FL proposition “if x is $\tilde{A}$ then y is $\tilde{B}$”. The implication is defined to account for the accuracy of the conclusion of an FL proposition: the grade of accuracy of y (the conclusion) is determined by the grade of accuracy of x (the premise). Minimum and product are two implication operators in FL. The following example illustrates their application. Consider the conditional proposition: If the water-to-cement ratio of the concrete of a tower is medium, then the resistance of the tower is high. For a water-to-cement ratio of 0.5, the resistance of the tower is calculated by the minimum implication method, and the results are illustrated in Fig. 29.2. The left panel (a) of this figure shows the water-to-cement ratio of the concrete of the tower, and the right panel (b) shows the resistance of the tower. The resistance of the tower obtained by the product implication method is shown in Fig. 29.3.
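The difference between the two implication operators can be seen on a sampled conclusion set. In this sketch the sampled membership values and the premise grade of 0.6 are hypothetical and are not taken from Figs. 29.2 and 29.3; the function names are also illustrative.

```python
def imply_min(strength, consequent):
    """Minimum implication: clip the conclusion set at the premise grade."""
    return [min(strength, m) for m in consequent]


def imply_prod(strength, consequent):
    """Product implication: scale the conclusion set by the premise grade."""
    return [strength * m for m in consequent]


# Hypothetical sampled membership values of the conclusion set ("high").
high = [0.0, 0.5, 1.0, 0.5, 0.0]
grade = 0.6  # hypothetical grade of the premise

print(imply_min(grade, high))   # flat top at 0.6
print(imply_prod(grade, high))  # same shape, scaled to peak 0.6
```

Minimum implication truncates the conclusion set, giving it a flat top, whereas product implication keeps its shape and lowers its height.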
FIGURE 29.3 Result of product implication method.

29.1.4 FUZZY INFERENCE SYSTEM (FIS)
FIS can be classified into two types. In the first type, called Mamdani, both sections of each rule (premise and conclusion) use fuzzy logic. The general form of a Mamdani rule is shown in Fig. 29.4. The other type of FIS is Takagi–Sugeno–Kang (TSK). In TSK, the premise sections of the rules use fuzzy logic and the conclusion sections use algebraic expressions. A TSK rule is shown in Fig. 29.5.

$$R_i:\ \text{If } x_1 \text{ is } \tilde{A}_{i1} \text{ and (or) } x_2 \text{ is } \tilde{A}_{i2} \text{ and (or) } \ldots\ x_m \text{ is } \tilde{A}_{im}, \text{ then } y_i = \tilde{B}_i \quad (i = 1, 2, \ldots, c)$$
FIGURE 29.4 Mamdani Rule.

$$R_i:\ \text{If } x_1 \text{ is } \tilde{A}_{i1} \text{ and (or) } x_2 \text{ is } \tilde{A}_{i2} \text{ and (or) } \ldots\ x_m \text{ is } \tilde{A}_{im}, \text{ then } y_i = a_{i1} x_1 + a_{i2} x_2 + \ldots + a_{im} x_m + a_{i0} \quad (i = 1, 2, \ldots, c)$$
FIGURE 29.5 TSK Rule.

Determining the fuzzy rules is the basic step in designing an FIS. This step includes defining the general form of a rule and assigning the parameters of each rule, such as the types of membership functions and the parameters of each membership function. There are two design approaches: expert opinion and the use of process data. With expert opinion, the rules and the parameters of each rule are determined by a user; in the data-based approach, the rules and their parameters are determined by solving an optimization problem that is defined to reduce the estimation or prediction error.

The steps of FIS are fuzzification, degree of fulfillment (or firing strength), implication, aggregation, and defuzzification. In the fuzzification step, the value of each membership function for the input signal is calculated. In the fulfillment step, the weight of each rule is computed by evaluating the premise section of the rule. In the implication step, the result of each rule is obtained according to the implication method. In the aggregation step, the results of the rules are combined to form an output fuzzy set. Three aggregation methods have been introduced. The first selects the maximum value at each point. The other two sum or multiply the rule outputs at each point. In the last step, called defuzzification, the fuzzy output is converted to a numeric value. The numeric value is calculated by the centroid, bisector, middle of maximum (MOM), largest of maximum (LOM), or smallest of maximum (SOM) method.
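The five steps can be sketched end to end for a Mamdani-type system with one input. Everything specific in this example is an assumption made for illustration: the three rules, the triangular membership functions on a normalized [0, 1] range, minimum implication, maximum aggregation, and centroid defuzzification on a sampled output universe.

```python
def trimf(x, a, b, c):
    """Triangular membership: feet at a and c, peak at b (a < b < c)."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)


# Hypothetical rule base: (premise membership, conclusion membership).
RULES = [
    (lambda x: trimf(x, -0.5, 0.0, 0.5), lambda y: trimf(y, -0.5, 0.0, 0.5)),  # low -> small
    (lambda x: trimf(x, 0.0, 0.5, 1.0), lambda y: trimf(y, 0.0, 0.5, 1.0)),    # medium -> medium
    (lambda x: trimf(x, 0.5, 1.0, 1.5), lambda y: trimf(y, 0.5, 1.0, 1.5)),    # high -> large
]


def mamdani(x, rules=RULES, n_points=101):
    """Single-input Mamdani inference on the output universe [0, 1]."""
    ys = [i / (n_points - 1) for i in range(n_points)]
    # Steps 1-2: fuzzification and firing strength of each rule.
    strengths = [premise(x) for premise, _ in rules]
    # Step 3 (minimum implication) and step 4 (maximum aggregation).
    aggregated = [
        max(min(w, conclusion(y)) for w, (_, conclusion) in zip(strengths, rules))
        for y in ys
    ]
    # Step 5: centroid defuzzification of the aggregated output set.
    total = sum(aggregated)
    if total == 0:
        return None  # no rule fired
    return sum(y * m for y, m in zip(ys, aggregated)) / total


print(mamdani(0.5))   # close to 0.5, since only the "medium" rule fires
print(mamdani(0.25))  # below 0.5: "low" and "medium" fire equally
```

Swapping `min(w, ...)` for `w * ...` gives product implication, and replacing the outer `max` with a sum gives sum aggregation, so the same skeleton covers the variants named in the text.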
29.2 MATHEMATICAL CLASSIFICATION
The main goal of classification in islanding detection is to recognize patterns that distinguish the islanding condition from other conditions that change the system parameters in a similar way. Passive approaches generally cannot tell these conditions apart; consequently, either the protection operates falsely or an islanding condition is not recognized during these events. The mathematical model for islanding detection based on the classification method, for a typical distributed generation unit, can be represented as follows:
$$\begin{aligned} x &= \{x_1, x_2, \ldots, x_n\}^T, \\ X_i &= \{x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{im}\}, \\ Y &= \{y_1, y_2, \ldots, y_n\}^T, \\ E &= \{(X_k, y_k),\ k = 1, 2, \ldots, N\}, \end{aligned} \tag{29.6}$$
where x is an n-dimensional vector denoting the classification input, and $X_i$ is the ith member of the classification vector, which contains the features of that pattern member. In this model, m is the number of independent variables, and Y is the classification vector, whose entries are equal to zero under the non-islanding condition and equal to one when the distributed generation unit operates under the islanding condition. E is the set of labeled credible events, with a total of N events.
29.2.1 ISLANDING DETECTION
Nowadays distributed power generation (DG) is expanding to meet environmental, technical, and economic constraints. These generation units are generally installed in a distribution network to reduce power transmission loss and to make use of MW-scale units such as wind turbines, photovoltaic (PV) cells, and micro thermal generation. The impacts of adding DG units to a distribution system can be classified into positive and negative factors. Positive factors include reduced power transmission loss, deferred expansion planning for conventional generating units, improved power quality (frequency deviation, voltage fluctuation, and harmonics), and enhanced reliability of distribution systems. However, protection problems such as islanding detection generally occur in distribution networks in the presence of DG units. Such a network is referred to as an active distribution network. In an active distribution network, protection schemes are used to eliminate protection problems, and many researchers have proposed studies to improve these schemes, including for the islanding detection problem.

An islanding condition arises when DG units are disconnected from the main network, and it is classified as intentional or unintentional. The goal of intentional disconnection is to protect the microgrid during system disturbances, whereas unintentional islanding is caused by an upstream fault in the grid system. In the islanding condition, abnormal operation and protection issues are faced. Therefore, islanding detection is necessary for the safe use of DG units. To meet this requirement, many schemes have been presented by researchers, which can be classified into three types. The first type is the remote technique, which is usually applied to detect the unintentional islanding condition and is based on supervisory control and data acquisition (SCADA) systems. In this technique, parameter monitoring of the whole distribution network is deployed for islanding detection.
Accordingly, the incidence of islanding is recognized when the parameters of a separated area can be detected by SCADA systems [3]. Another technique to recognize islanding is based on local techniques; these techniques are classified into active and passive schemes. In active schemes, a small disturbance is injected to the grid and the response of disturbance is used to detect islanding condition. Impedance measurement, slipmode frequency shift, and active frequency drift belong to this category of islanding detection. In the impedance measurement method, the system impedance is measured, whereas under islanding such impedance is deviated, and this deviation is deployed to detect islanding [4]. In this method power system impedance is computed by using a shut inductor and supply voltage, where such inductor is momentarily connected to supply voltage and short circuit current is calculated. After calculation of short circuit current, such current and supply voltage are used to calculate impedance of the power system, provided the impedance value is deviated under islanding condition [5,6]. Other approaches to islanding detection, which belong to the active islanding detection, are slip-mode frequency shift [7] and active frequency drift [8]. Passive islanding detection methods are presented to deal with power quality issues which are occurring by injection of a small disturbance in active islanding detection methods. In passive techniques, islanding conditions are detected by monitoring the system parameters such as voltage, frequency, harmonic distortion, and current at the point of common coupling of DG and distributed networks. Under/over voltage and under/over frequency, voltage phase jump detection, and harmonics measurement can be listed for traditional islanding detection by using passive islanding detection techniques. The oldest approach to passive islanding detection is under/over voltage and under/over frequency. 
In this method, protection relays are located at the points of common connections of DG and networks [9]. An advantage of this approach can be mentioned and is its ease of implementation; however, the nondetection zone is large and results in false performance of the method; hence it was improved by some researchers [10]. An approach is deployed for a rapid change of phase voltage angle for islanding detection. This method is called voltage phase jump detection. Islanding condition can be recognized from the change of the total harmonic distortion [11]. Passive approaches are more prevalent to recognize islanding conditions due to using system parameters and not an injection disturbance to the power grid. However, false performance and large nondetection zones of these techniques are a motivation to present new methods based on passive characteristics. In [12], an intelligent islanding detection method is presented. This paper applied an artificial neural network (ANN), in which parameters of ANN were set by a particle swarm optimization (PSO) algorithm for minimizing nondetection zone of the passive method. In [13], a new method based on wavelet packet transformer and propagation neural network was presented for passive islanding detection in grid-connection photovoltaic inverter. In [14], a passive islanding detection approach was presented. Such a method combines various system parameters to classify islanding condition and nonislanding conditions. Generally, the features which can be deployed for classification of an islanding detection problem are selected through local parameters. 
Such parameters include frequency deviation (Hz), voltage deviation, rate of change of frequency, rate of change of voltage, rate of change of power, rate of change of frequency over power, total harmonic distortion of the current, total harmonic distortion of the voltage, power factor deviation, absolute value of the phase voltage times power factor, and gradient of the voltage times power factor. These parameters are measured at the point of common coupling of the DG unit and the distribution network.
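As an illustration of how a few of these local features could be derived from sampled measurements, the following is a minimal sketch rather than the procedure used in [14]. The function name, the two-sample finite differences, and the 50 Hz nominal frequency are assumptions made for the example.

```python
def passive_features(freq_hz, power_w, dt_s, f_nominal_hz=50.0):
    """Compute a few passive islanding features from the two latest samples.

    freq_hz, power_w: sequences of frequency (Hz) and active power (W)
    sampled at the point of common coupling; dt_s: sampling interval (s).
    f_nominal_hz is an assumed nominal frequency, not a value from the text.
    """
    if len(freq_hz) < 2 or len(power_w) < 2:
        raise ValueError("need at least two samples")
    df = freq_hz[-1] - freq_hz[-2]
    dp = power_w[-1] - power_w[-2]
    return {
        "delta_f": freq_hz[-1] - f_nominal_hz,  # frequency deviation (Hz)
        "rocof": df / dt_s,                     # rate of change of frequency (Hz/s)
        "dp_dt": dp / dt_s,                     # rate of change of power (W/s)
        "df_dp": df / dp if dp != 0 else 0.0,   # frequency change over power change (Hz/W)
    }
```

A classifier would normally receive these values together with the voltage and harmonic features listed above; they are omitted here only to keep the sketch short.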
CHAPTER 29 APPLICATION OF FUZZY METHODS
FIGURE 29.6 Schematic of case study 1.
FIGURE 29.7 Schematic of case study 2.
Two case studies are presented to investigate the proposed islanding detection method. Diagrams of these case studies are shown in Figs. 29.6 and 29.7, and their data are given in [14] and [15]. Detailed construction and testing of the classification model are covered in the following subsections.
29.2.1.1 Construction of the Classification Model
Six sets of prescribed events are used for constructing the classification model of the target islanding relay. They are defined as follows:
• Set 1: Tripping of the circuit breaker cb1 to simulate the condition of islanding of the DG with the PCC-LV bus loads.
• Set 2: Tripping of the circuit breaker cb2 (isolating the PCC-LV bus loads) to simulate disturbances on the DG.
• Set 3: Tripping of the circuit breaker cb3 to simulate the islanding of the DG without the PCC-LV bus loads.
• Set 4: Three-phase fault on the PCC-HV bus with instantaneous (1 cycle) fault-clearing time by cb1, which in turn causes islanding of the DG.
• Set 5: Sudden decrease of the loading on the target distributed resource by 40%.
• Set 6: Tripping of the largest distributed resource within the DG other than the target one.
Each set of these events is simulated under different EPS and DG operating states. The EPS operating states are normal, minimum, and maximum system loading. Similarly, the DG operating states include normal, minimum, and maximum PCC-bus loading. A part of the training data is given in Table 29.1, where class 1 denotes an islanding condition and class 0 a nonislanding condition. These parameters are plotted for the different conditions in Figs. 29.8–29.11. Table 29.2 lists the results of accuracy of the proposed technique. Fig. 29.14 illustrates the accuracy of the fuzzy method for test and training data.
29.2.1.2 Fuzzy Rules Generation
In this section, the data are divided into two types: training data and test data. Decision trees are generated from the training data, and fuzzy rules are written from these trees. Two decision trees and the corresponding fuzzy rules are illustrated in Figs. 29.12 and 29.13.
29.3 MATHEMATICAL FORECASTING
The main goal of forecasting in dynamic line rating (DLR) prediction is to recognize patterns in order to estimate the future condition of the DLR. The mathematical model for forecasting can be represented as follows:
\[ I(t) = \sum_{i} I(t - T_i) \tag{29.7} \]
where I (t) denotes the value of the variable at a future time t , and Ti are delay points.
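To make the role of the delay points concrete, the sketch below builds training pairs in which each target value is paired with its delayed values. It is an illustration only: the function name and the representation of the delays as integer sample offsets are assumptions, and the chapter does not state how its inputs were assembled.

```python
def make_lagged_samples(series, delays):
    """Pair each value series[t] with its delayed values series[t - d].

    series: list of past observations ordered in time.
    delays: positive integer sample offsets (hypothetical stand-ins for T_i).
    Returns (inputs, targets) suitable for fitting any forecasting model.
    """
    max_delay = max(delays)
    inputs, targets = [], []
    for t in range(max_delay, len(series)):
        inputs.append([series[t - d] for d in delays])
        targets.append(series[t])
    return inputs, targets
```

For example, with delays of one and two samples, the value at each step is explained by the two values immediately preceding it.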
29.3.1 DYNAMIC LINE RATING
Nowadays renewable power generators such as wind power plants are increasingly common in electrical power networks, and they are deployed widely by different means [15]. Often overhead lines in a distribution system do not have enough capacity to transmit the full output
Table 29.1 Training Data for Fuzzy Classification
THDt   THDv   ΔPf    Δf/ΔP    ΔP/ΔT   ΔV      Δf      Class
0.01   0.01   0.51   −2E−9    1E7     −0.03   0.00    0
0.03   0.01   0.09   7E−7     5E6     −0.01   −0.43   1
0.01   0.01   0.11   1E−6     5E6     −0.01   −0.69   1
0.01   0.01   0.03   2E−7     3E5     0.01    0.01    0
0.01   0.01   0.11   −1E−6    1E5     0.03    −0.19   1
0.01   0.01   0.02   −2E−6    1E5     0.01    −0.02   1
0.01   0.01   0.01   1E−7     3E5     −0.01   0.00    0
0.01   0.01   0.03   2E−7     4E5     0.01    0.01    0
0.01   0.01   0.05   1E−7     6E5     0.01    0.01    0
0.01   0.01   0.03   −2E−8    −7E5    −0.02   0.00    0
0.01   0.01   0.50   −5E−9    1E7     −0.03   0.01    0
0.02   0.01   0.00   −3E−6    1E6     −0.02   −0.42   1
0.01   0.00   0.02   1E−6     4E4     −0.01   0.01    1
0.01   0.00   0.03   2E−7     3E5     0.01    0.01    0
0.01   0.01   0.11   −1E−6    −4E6    0.03    0.52    1
0.01   0.00   0.04   2E−6     2E5     0.01    −0.05   1
0.01   0.00   0.03   1E−6     −3E5    −0.01   −0.01   0
0.01   0.00   0.03   2E−7     4E5     0.01    0.01    0
0.01   0.00   0.05   1E−7     6E5     0.01    0.01    0
0.01   0.00   0.03   5E−7     −3E5    −0.02   −0.02   0
0.01   0.01   0.52   2E−10    −1E7    −0.03   0.00    0
0.03   0.01   0.12   −5E−7    6E6     0.00    −0.43   1
0.02   0.02   0.14   −1E−6    8E6     −0.01   −1.05   1
0.01   0.01   0.03   2E−7     3E5     0.01    0.01    0
0.01   0.01   0.15   −1E−6    4E6     0.03    −0.56   1
0.01   0.01   0.01   −2E−6    5E4     0.00    −0.01   1
0.01   0.01   0.01   1E−7     6E5     −0.01   0.01    0
0.01   0.01   0.03   2E−7     4E5     0.01    0.01    0
power generated by wind power plants. Distribution system operators (DSOs) can reinforce overhead lines to solve this problem. However, transmission expansion planning is scheduled on a long time scale, and this solution is expensive and cannot keep pace with the increasing installation of wind farms. The use of dynamic line rating (DLR) in operation planning is a way to improve utilization of latent transmission capacity [16] and to make fuller use of wind power stations. In DLR, the capacity of overhead lines is determined from real-time data [17]. The output power of wind power plants generally depends on the same weather data; in other words, when wind power generation increases, the capacity of the lines also increases. Another impact of DLR is the deferment of network reinforcements. Integrating DLR into operation planning allows distribution system operators to connect more renewable power plants to the grid and reduces electricity market costs, which increases social welfare. Traditionally, the capacity of overhead lines is determined at the planning stage and is considered constant (static); the line ratings are determined assuming conservative weather conditions. The use of nonconstant overhead line ratings to increase line capacity was first studied in [18–20].
FIGURE 29.8 Frequency deviation under nonislanding condition.
FIGURE 29.9 Rate-of-change of frequency under nonislanding condition.
FIGURE 29.10 Frequency deviation under islanding condition.
FIGURE 29.11 Rate-of-change of frequency under islanding condition.
R1: If x4 is A1 or x4 is A2 then Class-1
R2: If x4 is B1 or x4 is B2 then Class-1
FIGURE 29.12 Decision tree for variable 4.
R1: If x1 is A1 or x1 is A2 then Class-1
R2: If x1 is B1 or x1 is B2 then Class-1
FIGURE 29.13 Decision tree for variable 1.
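Rules of the form "If x is A1 or x is A2 then Class-1" can be evaluated by taking the maximum of the memberships in the two fuzzy sets. The sketch below illustrates this with triangular membership functions. The shapes and breakpoints of the sets are hypothetical, since the chapter does not list the parameters of A1, A2, B1, and B2, and using the maximum as the OR operator is a common choice rather than one stated in the text.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def rule_or(x, fuzzy_sets):
    """Firing strength of 'x is S1 or x is S2 ...' using max as the OR operator."""
    return max(tri(x, *params) for params in fuzzy_sets)
```

A sample would then be assigned to the class whose rule fires most strongly, for instance by comparing `rule_or(x, class1_sets)` with the strength of the competing rule.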
The DSO uses the thermal rating to limit the capacity of overhead lines. A program for calculating dynamic thermal ratings is presented in [21]; it computes the dynamic thermal rating of substation equipment, such as transmission lines and transformers, from thermal measurements. A system for estimating the thermal ratings of transmission lines is proposed in [22]; it uses information from the tension points of the conductors and receives this information over cellular telecommunication links. Such methods require several measurements. DLR was improved
FIGURE 29.14 Accuracy of the fuzzy method for training data and test.
Table 29.2 Characteristics of the Studied Overhead Line
Characteristic                    Value
Conductor type                    ACSR
Conductor diameter                31.5 mm
Conductor cross-section           585.3 mm²
Aluminum mass per unit length     1.401 kg/m
Steel mass per unit length        0.522 kg/m
Pylons                            n.a.
Thermo-resistivity coefficient    19.4 × 10^6
Nominal voltage                   400 kV
Positive sequence resistance      0.647
Positive sequence reactance       10.47
Positive sequence susceptance     132.74 × 10^6
Length                            32.85 km
by reducing the number of required measurements in [22]. The minimum number of measurements is used in [23,24] where the temperature of a conductor at sag points of overhead lines is measured for DLR calculation. Such data are used for DLR forecasting, and satisfactory performance is obtained. In [25], a new general DLR calculation model is presented, which is based on IEEE standards and SLR calculation. This model is simplified, and DLR calculation is formulated by using wind speed, air temperature, and SLR. Impact of DLR on wind power expansion planning is investigated in [25]. It proposes a new economic model for integration of wind power in a distribution network.
29.3.1.1 Formulation of DLR
This section studies a model for calculating the dynamic capacity of overhead lines. The model is based on IEEE Standard 738 [20], which uses a steady-state heat balance equation to calculate the current of overhead conductors. The balance is given by (29.8), where $R(T_c)$ is the conductor resistance at temperature $T_c$, $I$ is the conductor current, $Q_s$ is the solar heating calculated by (29.9), $Q_r$ is the radiative cooling computed by (29.10), and $Q_c$ is the convective cooling calculated by (29.11).
\[ R(T_c) I^2 + Q_s = Q_r + Q_c, \tag{29.8} \]
\[ Q_s = \gamma D S_i, \tag{29.9} \]
\[ Q_r = S_B \pi D K_r \left(T_c^4 - T_1^4\right), \tag{29.10} \]
\[ Q_c = \lambda Nu \,(T_c - T_1)\,\pi, \tag{29.11} \]
where Nu is the Nusselt number and Re is the Reynolds number, which captures the impact of wind speed on the capacity of overhead lines. These parameters are obtained from (29.12) and (29.13), where V is the wind speed.
\[ Nu = 0.65\,Re^{0.2} + 0.23\,Re^{0.61}, \tag{29.12} \]
\[ Re = 1.644 \times 10^{9}\, V D \left[T_1 + 0.5\,(T_c - T_1)\right]^{-1.78}. \tag{29.13} \]
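The heat balance (29.8)–(29.13) can be turned into a current calculation by solving (29.8) for I. The following is a minimal sketch under stated assumptions: the text does not define γ, S_i, S_B, K_r, or λ, so the function arguments simply mirror the symbols in the equations, and the temperatures passed to the Reynolds number expression are assumed to be absolute temperatures. All function names are illustrative.

```python
import math


def reynolds(v, d, t_c, t_1):
    """Eq. (29.13): Reynolds number from wind speed v and diameter d.

    t_c and t_1 are assumed to be absolute temperatures (K).
    """
    return 1.644e9 * v * d * (t_1 + 0.5 * (t_c - t_1)) ** -1.78


def nusselt(re):
    """Eq. (29.12)."""
    return 0.65 * re ** 0.2 + 0.23 * re ** 0.61


def q_solar(gamma, d, s_i):
    """Eq. (29.9)."""
    return gamma * d * s_i


def q_radiative(s_b, d, k_r, t_c, t_1):
    """Eq. (29.10)."""
    return s_b * math.pi * d * k_r * (t_c ** 4 - t_1 ** 4)


def q_convective(lam, nu, t_c, t_1):
    """Eq. (29.11)."""
    return lam * nu * (t_c - t_1) * math.pi


def conductor_current(r_tc, q_s, q_r, q_c):
    """Solve Eq. (29.8) for I: I = sqrt((Q_r + Q_c - Q_s) / R(T_c))."""
    net_cooling = q_r + q_c - q_s
    if net_cooling < 0:
        raise ValueError("solar heating exceeds cooling; no admissible current")
    return math.sqrt(net_cooling / r_tc)
```

Because the cooling terms grow with wind speed through Re and Nu, the admissible current rises on windy days, which is the effect that DLR exploits.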
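Considering the above equations, the maximum current can be obtained using the following equation [25]:
\[
I_{\max} = \max
\begin{cases}
\sqrt{\dfrac{\left[1.01 + 0.0371\left(\dfrac{D \rho_f V}{\mu_f}\right)^{0.52}\right] K_f K_{angle}\, \Delta T}{R(T_c)}} \\[3ex]
\sqrt{\dfrac{\left[0.0119\left(\dfrac{D \rho_f V}{\mu_f}\right)^{0.6}\right] K_f K_{angle}\, \Delta T}{R(T_c)}}
\end{cases}
\tag{29.14}
\]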
where $\Delta T$ is $T_c - T_2$, $D$ is the conductor diameter, $\rho_f$ is the density of air at temperature $(T_c + T_2)/2$ (where $T_c$ and $T_2$ are the conductor temperature and ambient air temperature, respectively), $V$ is the speed of the air stream at the conductor, $\mu_f$ is the dynamic viscosity of air at temperature $(T_c + T_2)/2$, $K_f$ is the thermal conductivity of air at temperature $(T_c + T_2)/2$, and $K_{angle}$ is a parameter that represents the angle between the wind direction and the conductor axis. The capacity at each sag point of an overhead line is calculated, and the minimum of these capacities is taken as the capacity of the overhead line. Calculating the capacity of overhead lines dynamically with this model is difficult, and the model is often used only for determining the static capacity of overhead lines. In determining the static capacity of overhead lines, the worst case scenario,
such as minimum wind speed and highest ambient temperature, is used. The value obtained using the worst case scenario is called the static line rating (SLR). The SLR can be computed by using the following equation:
\[
I_{\max}^{SLR} \approx \sqrt{\dfrac{\left[1.01 + 0.0371\left(\dfrac{D \rho_f V^{SLR}}{\mu_f}\right)^{0.52}\right] K_f K_{angle}\, \Delta T^{SLR}}{R(T_c)}}
\tag{29.15}
\]
where $V^{SLR}$ is the wind speed under static line rating conditions and $\Delta T^{SLR}$ is the corresponding temperature difference. A high ambient temperature and a high conductor temperature are selected for calculating the temperature difference. These values may change with season, but they are assumed constant for all seasons in this work. Other parameters can be found from design standards of overhead lines [25]. In this work the wind speed is set to 0.5 m/s and the ambient temperature is set to 50 °C for the SLR calculation. DLR can be estimated under real-time weather conditions. This method is based on the impact of weather conditions on the capacity of overhead lines. Real-time weather conditions consist of the wind speed and air temperature at each sag point of the overhead line. In [25] a simplified method for DLR calculation using the SLR is presented. This method neglects the correlation between air temperature and wind speed and investigates their effect on the capacity of overhead lines. The ratio η is defined to model the impact of wind speed and air temperature on the capacity of overhead lines.
\[
I_{\max}^{DLR} = \max
\begin{cases}
I_{\max}^{SLR} \\[1ex]
\left(\dfrac{v}{v_{SLR}}\right)^{0.26} \sqrt{\dfrac{T_c - T}{T_c - T_{SLR}}}\; I_{\max}^{SLR} \\[2ex]
0.566 \left(\dfrac{\rho_f}{\mu_f}\right)^{0.04} D^{0.04}\, \dfrac{v^{0.3}}{v_{SLR}^{0.26}} \sqrt{\dfrac{T_c - T}{T_c - T_{SLR}}}\; I_{\max}^{SLR}
\end{cases}
\tag{29.16}
\]
The capacity at each sag point of the overhead line is estimated by (29.16), and the minimum capacity is taken as the dynamic line rating for the study period. This method uses separate expressions for the low wind speed and high wind speed scenarios. The DLR should not be lower than the SLR.
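The simplified rating in (29.16) is easy to evaluate once the SLR is known. The sketch below follows one reading of that equation, in which the temperature ratio enters under a square root (consistent with the exponent 0.26 being half of 0.52); this reading, the function names, and the use of SI units are assumptions rather than details confirmed by the text.

```python
import math


def dlr_from_slr(i_slr, v, v_slr, t_c, t_amb, t_amb_slr, d, rho_f, mu_f):
    """Dynamic line rating at one sag point, following Eq. (29.16).

    i_slr: static line rating (A); v, v_slr: actual and SLR wind speed (m/s);
    t_c: conductor temperature; t_amb, t_amb_slr: actual and SLR ambient
    temperature (same unit as t_c); d: conductor diameter (m);
    rho_f, mu_f: air density and dynamic viscosity.
    """
    temp_factor = math.sqrt((t_c - t_amb) / (t_c - t_amb_slr))
    low_wind = (v / v_slr) ** 0.26 * temp_factor * i_slr
    high_wind = (0.566 * (rho_f / mu_f) ** 0.04 * d ** 0.04
                 * v ** 0.3 / v_slr ** 0.26 * temp_factor * i_slr)
    # The DLR is never allowed to fall below the SLR.
    return max(i_slr, low_wind, high_wind)


def line_rating(i_slr, sag_points, **line_params):
    """Line rating = minimum DLR over all sag points.

    sag_points: iterable of (wind speed, ambient temperature) pairs.
    """
    return min(dlr_from_slr(i_slr, v, t_amb=t, **line_params)
               for v, t in sag_points)
```

Under SLR weather the function returns the SLR itself, and the rating grows as the wind speed rises or the ambient temperature falls.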
29.3.1.2 Forecasting Method
Fuzzy systems and neural networks are two complementary means of building intelligent systems. Neural networks deal with low-level computational structures and perform well on raw data, whereas fuzzy logic deals with high-level structures. A weakness of fuzzy systems is their inability to learn and adjust. Combining a neural network with a fuzzy system is therefore an effective way to build very short-term wind prediction models. In Jang's adaptive neuro-fuzzy inference system (ANFIS), Layer 1 is the input layer, whose neurons simply pass external crisp signals to the next layer. Layer 2 is the fuzzification layer, whose neurons perform fuzzification; in Jang's model the fuzzification neurons use a bell activation function. Layer 3 is the rule layer, in which each neuron corresponds to a single fuzzy rule. A rule neuron receives inputs from the fuzzification neurons and calculates the firing strength of its rule.
Table 29.3 Characteristics of the Studied Overhead Line
Data Type        Mean Error   STD       RMSE
Training data    0.0257       64.523    64.5153
Length           0.09414      82.646    82.83
Layer 4 is the normalization layer. Each neuron in this layer receives inputs from all neurons in the rule layer and calculates the normalized firing strength of a given rule, that is, the ratio of the rule's firing strength to the sum of the firing strengths of all rules. Layer 5 is the defuzzification layer. Each neuron in this layer is connected to the respective normalization neuron and also receives the initial inputs, x1 and x2. A defuzzification neuron calculates the weighted consequent value of a given rule. Layer 6 is represented by a single summation neuron, which calculates the sum of the outputs of all defuzzification neurons and produces the overall ANFIS output.
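The six layers described above can be traced in a short forward pass. The sketch below is an illustration, not the model trained in this chapter: it assumes the generalized bell function 1 / (1 + |(x − c)/a|^(2b)), one rule per combination of membership functions, the product as the firing-strength operator, and first-order consequents of the form k0 + k1·x1 + k2·x2. All parameter values in a call would be hypothetical.

```python
from itertools import product


def bell(x, a, b, c):
    """Generalized bell membership function (Layer 2)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def anfis_forward(x, mf_params, consequents):
    """Forward pass through the six ANFIS layers for one input vector.

    x: crisp inputs (Layer 1 passes them through unchanged).
    mf_params: for each input, a list of (a, b, c) bell parameters.
    consequents: one [k0, k1, ..., kn] list per rule, ordered like
    itertools.product over the membership functions of each input.
    """
    # Layer 2: fuzzification of every input by each of its membership functions.
    memberships = [[bell(xi, *p) for p in params]
                   for xi, params in zip(x, mf_params)]
    # Layer 3: firing strength of each rule (product operator assumed).
    firing = []
    for combo in product(*memberships):
        strength = 1.0
        for m in combo:
            strength *= m
        firing.append(strength)
    # Layer 4: normalized firing strengths.
    total = sum(firing)
    normalized = [w / total for w in firing]
    # Layer 5: weighted consequent value of each rule.
    weighted = [n * (k[0] + sum(ki * xi for ki, xi in zip(k[1:], x)))
                for n, k in zip(normalized, consequents)]
    # Layer 6: single summation neuron.
    return sum(weighted)
```

Training would adjust the bell parameters and the consequent coefficients; only inference is shown here.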
29.3.1.3 Simulation Results
The case study is a transmission overhead line in Khaf, Iran. Historical wind speed and air temperature data were recorded for a period of 12 months at a resolution of 10 minutes, which is suitable for training and testing the models. A random day of this period is chosen as test data, and older data are used to train the neural networks. The overhead line has three single-phase aluminum conductor steel reinforced (ACSR) conductors, each with a diameter of 31.5 mm. The maximum conductor temperature set by the standard is 75 °C. Other characteristics of the overhead line are listed in Table 29.2. The forecasting result for the training data is illustrated in Fig. 29.15, and Fig. 29.16 shows the histogram of error for the training data. The DLR predictions for the test data and the corresponding histogram of error are shown in Figs. 29.17 and 29.18, respectively. More information about DLR forecasting is presented in Table 29.3.
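The quantities reported for the forecasts (mean error, standard deviation, and RMSE) can be computed from the measured and predicted series as sketched below. The chapter does not say whether the population or the sample standard deviation was used; the population form is assumed here, and the function name is illustrative.

```python
import math


def error_stats(actual, predicted):
    """Return (mean error, standard deviation of error, RMSE).

    The standard deviation uses the population form (divide by n),
    which is an assumption about how the reported values were obtained.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean = sum(errors) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean, std, rmse
```

With this definition the three values are linked by RMSE² = STD² + mean², so a mean error close to zero makes the RMSE and the standard deviation nearly equal.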
29.4 CONCLUSION
The fuzzy approach is a practical method for processing power system data and an effective technique for solving different power system problems, including classification, clustering, and forecasting, with high accuracy. Moreover, this method can be useful for solving probabilistic power problems with lexical criteria. In this chapter, applications of the fuzzy method to these problems were discussed, and the different steps of the fuzzy method were introduced and analyzed. Two kinds of case studies were considered: islanding detection and forecasting. The simulation results were presented and discussed, and they confirm the effectiveness and high performance of the fuzzy method in solving power system problems.
FIGURE 29.15 Training forecasting result.
FIGURE 29.16 Histogram of error for training data.
FIGURE 29.17 Test forecasting result.
FIGURE 29.18 Histogram of error for test data.
REFERENCES [1] Omar A.S. Youssef, Combined fuzzy-logic wavelet-based fault classification technique for power system relaying, IEEE Trans. Power Deliv. 19 (2) (2004) 582–589. [2] Dahai You, et al., Transient stability assessment of power system using support vector machine with generator combinatorial trajectories inputs, Int. J. Electr. Power Energy Syst. 44 (1) (2013) 318–325. [3] Irvin J. Balaguer-Alvarez, Eduardo I. Ortiz-Rivera, Survey of distributed generation islanding detection methods, IEEE Latin Am. Trans. 8 (5) (2010) 565–570. [4] Aziah Khamis, et al., A review of islanding detection techniques for renewable distributed generation systems, Renew. Sustain. Energy Rev. 28 (2013) 483–493. [5] Pukar Mahat, Zhe Chen, Birgitte Bak-Jensen, Review on islanding operation of distribution system with distributed generation, in: 2011 IEEE Power and Energy Society General Meeting, IEEE, 2011. [6] H.H. Zeineldin, Ehab F. El-Saadany, M.M.A. Salama, Impact of DG interface control on islanding detection and nondetection zones, IEEE Trans. Power Deliv. 21 (3) (2006) 1515–1523. [7] Mohamed Moin Hanif, Malabika Basu, Kevin Gaughan, A Discussion of Anti-Islanding Protection Schemes Incorporated in an Inverter Based DG, 2011. [8] R. Kunte, W. Gao, Comparison and review of islanding detection techniques for distributed energy resources, in: 2008 40th North American Power Symposium, 2008, pp. 1–8. [9] Adrian Timbus, Alexandre Oudalov, Carl N.M. Ho, Islanding detection in smart grids, in: 2010 IEEE Energy Conversion Congress and Exposition, IEEE, 2010. [10] Wen Hu, Yun-Lian Sun, A compound scheme of islanding detection according to inverter, in: 2009 Asia-Pacific Power and Energy Engineering Conference, IEEE, 2009. [11] Wei Yee Teoh, Chee Wei Tan, An overview of islanding detection methods in photovoltaic systems, World Acad. Sci., Eng. Technol. 58 (2011) 674–682. 
[12] Haidar Samet, Farid Hashemi, Teymoor Ghanbari, Minimum non detection zone for islanding detection using an optimal artificial neural network algorithm based on PSO, Renew. Sustain. Energy Rev. 52 (2015) 1–18. [13] Hieu Thanh Do, et al., Passive-islanding detection method using the wavelet packet transform in grid-connected photovoltaic systems, IEEE Trans. Power Electron. 31 (10) (2016) 6955–6967. [14] Khalil El-Arroudi, et al., Intelligent-based approach to islanding detection in distributed generation, IEEE Trans. Power Deliv. 22 (2) (2007) 828–835. [15] J.W. Nowak, S. Sarkani, T.A. Mazzuchi, Risk assessment for a national renewable energy target part II: employing the model, IEEE Syst. J. 10 (2) (2016) 459–470. [16] A. Michiorri, H.-M. Nguyen, S. Alessandrini, J.B. Bremnes, S. Dierer, E. Ferrero, B.-E. Nygaard, P. Pinson, N. Thomaidis, S. Uski, Forecasting for dynamic line rating, Renew. Sustain. Energy Rev. 52 (2015) 1713–1730. [17] M. Jabarnejad, J. Valenzuela, Optimal investment plan for dynamic thermal rating using benders decomposition, Eur. J. Oper. Res. 248 (3) (2016) 917–929. [18] M.W. Davis, A new thermal rating approach: the real time thermal rating system for strategic overhead conductor transmission lines–part I: general description and justification of the real time thermal rating system, IEEE Trans. Power Appar. Syst. 96 (3) (1977) 803–809. [19] M.W. Davis, A new thermal rating approach: the real time thermal rating system for strategic overhead conductor transmission lines part III steady state thermal rating program continued-solar radiation considerations, IEEE Trans. Power Appar. Syst. 2 (1978) 444–455. [20] IEEE, IEEE Standard for Calculating the Current Temperature of Bare Overhead Conductors, 2006. [21] T.O. Seppa, Accurate ampacity determination: temperature-sag model for operational real time ratings, IEEE Trans. Power Deliv. 10 (3) (1995) 1460–1470. [22] T. Seppa, M. Clements, R. Payne, S. Damsgaard-Mikkelsen, N. 
Coad, Application of Real Time Thermal Ratings for Optimizing Transmission Line Investment and Operating Decisions, CIGRE Paper, 2000, pp. 22–301. [23] T.O. Seppa, Power Transmission Line Tension Monitoring System, US Patent 5,517,864, May 21 1996. [24] A.K. Deb, Computer System, US Patent 5,933,355, Aug. 3 1999. [25] C.J. Wallnerstrom, Y. Huang, L. Soder, Impact from dynamic line rating on wind power integration, IEEE Trans. Smart Grid 6 (1) (2015) 343–350.
CHAPTER 30
APPLICATION OF PARTICLE SWARM OPTIMIZATION ALGORITHM IN POWER SYSTEM PROBLEMS
Milad Zamani-Gargari, Morteza Nazari-Heris, Behnam Mohammadi-Ivatloo
Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran
30.1 INTRODUCTION
Different fields of science and technology face optimization problems of varying complexity, depending on the nature of the objective function and on the equality and inequality constraints. Nature-inspired optimization methods have been proposed to handle non-linear, non-convex optimization problems that are highly complex and computationally demanding. Heuristic optimization methods, which are defined as experience-based methods, are fast-growing tools that can handle complex optimization problems effectively and efficiently [1]. Well-known heuristic optimization algorithms are the genetic algorithm, particle swarm optimization (PSO), simulated annealing, differential evolution, ant colony optimization, imperialistic competition algorithm, biogeography based optimization, bacterial foraging algorithm, artificial bee colony, and harmony search [2]. The PSO method, which is counted among the recently developed heuristic optimization methods, has been successfully employed in power system optimization problems. Eberhart and Kennedy first proposed the PSO algorithm in 1995 as a population-based search procedure. The basic idea of PSO is the simulation of social behavior, in which operators are applied to update a population of individuals. According to the fitness information provided by the environment, the updated population is expected to move toward better solution regions. Individuals are called agents or particles in the PSO algorithm, and their positions change over time. Each particle is assigned a velocity with which it flies through a multidimensional search space in search of the optimal solution. The velocity is adjusted dynamically according to the particle's own flying experience and the flying experience of its companions [3].
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00030-2 Copyright © 2017 Elsevier Inc. All rights reserved.
After each iteration, a new velocity is allocated to each particle. The new position of each particle is then updated based on a set of parameters including the present velocity of the particle, the distance of the particle from the best performance obtained by this particle during the search process,
and the distance of the particle from the particle that achieved the best performance. The position and velocity of the ith particle can be represented as the vectors $X_i^k = (x_{i1}^k, x_{i2}^k, \ldots, x_{iD}^k)$ and $V_i^k = (v_{i1}^k, v_{i2}^k, \ldots, v_{iD}^k)$, respectively. The algorithm starts with a random selection of the position and velocity vectors. The minimum and maximum limitations of the velocity of the ith particle should be considered as:
\[ V_i^{\max} = N * \left(X_i^{\max} - X_i^{\min}\right); \]

\[ F_{it}\left(P_{si}^{t}\right) = a_{si} + b_{si} P_{si}^{t} + c_{si} \left(P_{si}^{t}\right)^{2} + \left| d_{si} \sin\left( e_{si} \left( P_{si}^{\min} - P_{si}^{t} \right) \right) \right| \tag{30.2} \]
30.3 HYDROTHERMAL SCHEDULING
where $d_{si}$ and $e_{si}$ are coefficients that represent the valve-point loading effect of thermal unit i, and $P_{si}^{\min}$ is the minimum capacity limit of the ith thermal unit. The objective function of STHS, which aims to obtain the daily power generation of the hydrothermal system with the minimum operation cost over 24 hours, is:
\[ \sum_{t=1}^{24} \sum_{i=1}^{N_s} F_{it}\left(P_{si}^{t}\right) = \sum_{t=1}^{24} \sum_{i=1}^{N_s} \left\{ a_{si} + b_{si} P_{si}^{t} + c_{si} \left(P_{si}^{t}\right)^{2} + \left| d_{si} \sin\left( e_{si} \left( P_{si}^{\min} - P_{si}^{t} \right) \right) \right| \right\} \tag{30.3} \]
in which the number of thermal plants is denoted by Ns .
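The fuel cost with valve-point loading in (30.2) and its sum over all hours and units in (30.3) can be evaluated as in the sketch below. The function names and the dictionary layout of the unit coefficients are illustrative assumptions; any coefficient values used with it would come from the test system data, which are not reproduced in this chapter.

```python
import math


def thermal_cost(p, a, b, c, d, e, p_min):
    """Eq. (30.2): quadratic fuel cost plus the rectified valve-point term."""
    return a + b * p + c * p ** 2 + abs(d * math.sin(e * (p_min - p)))


def total_cost(schedule, units):
    """Eq. (30.3): sum of unit costs over all hours.

    schedule: one list of thermal outputs per hour, ordered like `units`.
    units: list of dicts with keys a, b, c, d, e, p_min (hypothetical layout).
    """
    return sum(thermal_cost(p, **unit)
               for hour in schedule
               for p, unit in zip(hour, units))
```

The absolute sine term adds ripples to the otherwise smooth quadratic curve, which is what makes the objective non-convex and motivates heuristic solvers such as PSO.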
30.3.2 CONSTRAINTS
Several equality and inequality constraints of the hydrothermal system and of the hydro and thermal power generation plants should be taken into account in the solution of the STHS problem. The generation schedule of the hydro and thermal units should satisfy the following constraints:
30.3.2.1 System Power Balance
The power generation of the hydro and thermal units should satisfy the load demand of the system. The system power balance constraint therefore needs to be considered for each time interval, and can be formulated as follows:
\[ P_{Load}^{t} = \sum_{i=1}^{N_s} P_{si}^{t} + \sum_{j=1}^{N_h} P_{hj}^{t}; \quad t \in 24 \tag{30.4} \]
in which $P_{Load}^{t}$ defines the load demand of the system at time interval t, and $P_{hj}^{t}$ indicates the power generation of hydro plant j at time interval t. Moreover, the number of hydro units is denoted by $N_h$. The power generation of hydro plants depends on water release and reservoir volume in each time interval, and is formulated as a quadratic function:
\[ P_{hj}^{t} = C_{1j} \left(V_{hj}^{t}\right)^{2} + C_{2j} \left(Q_{hj}^{t}\right)^{2} + C_{3j} V_{hj}^{t} Q_{hj}^{t} + C_{4j} V_{hj}^{t} + C_{5j} Q_{hj}^{t} + C_{6j}; \quad j \in N_h;\ t \in T \tag{30.5} \]
in which $P_{hj}^{t}$ is the power generation of hydro plant j at time t, and the coefficients of hydro plant j are denoted by $C_{1j}$, $C_{2j}$, $C_{3j}$, $C_{4j}$, $C_{5j}$, and $C_{6j}$. $V_{hj}^{t}$ and $Q_{hj}^{t}$ define the reservoir volume and the discharge of the jth hydro unit at time interval t.
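The hydro generation function (30.5) and the balance (30.4) translate directly into code. The sketch below is illustrative: the function names are assumptions, and the coefficients in any call are placeholders rather than data quoted from the chapter.

```python
def hydro_power(v, q, coeffs):
    """Eq. (30.5): hydro output from reservoir volume v and discharge q.

    coeffs: (C1, C2, C3, C4, C5, C6) of the plant.
    """
    c1, c2, c3, c4, c5, c6 = coeffs
    return c1 * v ** 2 + c2 * q ** 2 + c3 * v * q + c4 * v + c5 * q + c6


def balance_mismatch(load, thermal_outputs, hydro_outputs):
    """Eq. (30.4) rearranged: load minus total generation for one interval.

    A feasible schedule gives a mismatch of zero (within tolerance).
    """
    return load - sum(thermal_outputs) - sum(hydro_outputs)
```

In a heuristic solver the absolute mismatch is typically either repaired by adjusting one dependent unit or added to the cost as a penalty; which of the two this chapter uses is not stated.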
30.3.2.2 Output Capacity Limitations
The minimum and maximum power generation of the hydro and thermal units should be considered as inequality constraints, which can be stated as follows:
\[ P_{hj}^{\min} \le P_{hj}^{t} \le P_{hj}^{\max}; \quad j \in N_h,\ t \in 24 \tag{30.6} \]
\[ P_{si}^{\min} \le P_{si}^{t} \le P_{si}^{\max}; \quad i \in N_s,\ t \in 24 \tag{30.7} \]
Table 30.1 Hydrothermal Generation Schedules (MW)
Hour  Ph1    Ph2    Ph3     Ph4     Ps1     Ps2     Ps3
1     97.58  62.06  156.94  22.04   126.89  211.90  71.88
2     58.62  41.59  226.88  28.14   128.11  216.93  77.96
3     7.06   1.65   223.97  109.28  18.95   149.63  38.21
4     78.89  34.26  131.95  27.64   137.94  155.72  101.93
5     78.62  9.89   43.96   179.96  148.78  131.63  75.66
6     59.11  31.17  217.35  43.02   48.98   323.92  74.81
7     84.41  43.25  214.62  23.69   296.66  230.98  54.68
8     42.14  71.56  223.66  104.75  118.72  420.25  68.51
9     52.24  45.27  113.96  105.05  212.02  499.25  60.87
10    55.66  33.52  202.54  105.54  129.03  495.05  55.04
11    64.49  37.96  226.53  99.29   120.96  494.05  53.63
12    66.41  21.66  223.91  116.66  297.31  344.63  76.71
13    42.16  49.04  191.54  104.52  104.93  498.75  71.73
14    60.55  43.54  255.14  176.99  120.74  321.81  55.96
15    66.64  94.03  256.55  49.11   129.54  326.66  50.53
16    74.76  49.15  223.65  169.63  212.06  279.54  50.66
17    46.62  50.73  253.08  99.14   217.84  310.74  64.66
18    51.31  49.33  267.96  167.96  153.86  326.64  93.62
19    46.65  46.43  296.81  81.58   206.56  314.41  71.79
20    44.79  28.02  323.86  127.46  230.09  216.31  103.16
21    40.75  55.11  265.62  24.88   215.55  226.63  85.60
22    53.67  57.97  218.14  105.41  118.95  233.69  68.46
23    41.80  50.66  227.52  105.58  127.70  292.08  71.48
24    71.89  58.72  286.83  21.55   38.27   233.67  91.51
$P_{hj}^{\min}$ and $P_{hj}^{\max}$ are the lower and upper limits of the power generation of hydro unit j, respectively. Likewise, the lower and upper limits of the power generation of the ith thermal unit are $P_{si}^{\min}$ and $P_{si}^{\max}$.
30.3.2.3 Hydraulic Network Constraints
The limits on the discharge and reservoir storage volume of the hydro power generation units are further constraints of the STHS problem, which can be written as follows:
\[ V_{hj}^{\min} \le V_{hj}^{t} \le V_{hj}^{\max}; \quad j \in N_h,\ t \in 24 \tag{30.8} \]
\[ Q_{hj}^{\min} \le Q_{hj}^{t} \le Q_{hj}^{\max}; \quad j \in N_h,\ t \in 24 \tag{30.9} \]
in which $V_{hj}^{\min}$ and $V_{hj}^{\max}$ are the lower and upper reservoir storage volumes of hydro power generation plant j, and the minimum and maximum discharge of the jth hydro plant are denoted by $Q_{hj}^{\min}$ and $Q_{hj}^{\max}$, respectively. Water dynamic balance in reservoirs
Table 30.2 Hourly Discharge (×10³ m³) by Using PSO Algorithm
Hour  Q1     Q2     Q3     Q4
1     10.78  14.66  6.32   5.85
2     7.02   17.66  6.39   13.15
3     7.84   30     6      7.87
4     6      28.50  6.16   5
5     7.63   16.79  7.31   8.52
6     6      19.07  11.49  5.38
7     8.39   10.06  7.08   5.29
8     11.96  18.08  19.11  5
9     8.93   29.14  14.94  5
10    9.44   10.59  15.82  5
11    12.60  14.10  20     6.05
12    10.61  10     19.87  12.25
13    10.40  12.39  20     15
14    6      10     16.27  9.44
15    7.29   19.01  14.45  11
16    6.21   23.94  9.42   7.33
17    9.17   10     15.80  11.46
18    8.69   10     16.32  11
19    6      17.97  17.32  9.54
20    11.10  25.11  20     5.22
21    6      23.19  12.27  5
22    6.50   18.41  20     5
23    6      10     20     10.69
24    11.36  16.07  11.54  9.89
is formulated as:
\[ V_{hj}^{t} = V_{hj}^{t-1} + I_{hj}^{t} - Q_{hj}^{t} - S_{hj}^{t} + \sum_{k \in R_{up}^{j}} \left( Q_{hk}^{t - T_{dk}} + S_{hk}^{t - T_{dk}} \right); \quad j \in N_h,\ t \in 24 \tag{30.10} \]
in which the reservoir storage volumes of hydro plant j at time t and time t − 1 are denoted by $V_{hj}^{t}$ and $V_{hj}^{t-1}$, respectively. The inflow rate and discharge of hydro plant j at time t are denoted by $I_{hj}^{t}$ and $Q_{hj}^{t}$, respectively. Moreover, $S_{hj}^{t}$ defines the spillage of the jth hydro plant at time t.
The initial and final reservoir storage volumes of the hydro units are known. Accordingly, the following constraints are considered:
\[ V_{j}^{h0} = V_{j,init}^{h} \tag{30.11} \]
\[ V_{j}^{h24} = V_{j,end}^{h} \tag{30.12} \]
FIGURE 30.2 Cost convergence curve from PSO method.
where $V_{j}^{h0}$ and $V_{j}^{h24}$ are the reservoir storage of the jth hydro plant at times 0 and 24, respectively. The initial and final reservoir storage of the jth hydro unit are denoted by $V_{j,init}^{h}$ and $V_{j,end}^{h}$.
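The water balance (30.10) and the storage limits (30.8) can be checked by stepping a reservoir through the scheduling horizon. The sketch below is illustrative: the function names are assumptions, and the delayed releases of upstream plants are passed in as a precomputed list because the chapter does not define the delay times or the upstream sets explicitly.

```python
def next_volume(v_prev, inflow, discharge, spill, upstream_arrivals=0.0):
    """Eq. (30.10) for one plant and one interval.

    upstream_arrivals: summed delayed discharge and spillage reaching this
    reservoir from upstream plants (assumed to be precomputed by the caller).
    """
    return v_prev + inflow - discharge - spill + upstream_arrivals


def simulate_reservoir(v_init, inflows, discharges, spills, arrivals, v_min, v_max):
    """Return the storage trajectory, raising ValueError if (30.8) is violated."""
    volumes = [v_init]
    for inflow, q, s, up in zip(inflows, discharges, spills, arrivals):
        v = next_volume(volumes[-1], inflow, q, s, up)
        if not v_min <= v <= v_max:
            raise ValueError("reservoir volume limit violated")
        volumes.append(v)
    return volumes
```

The end-of-horizon condition (30.12) can then be tested by comparing the last element of the returned trajectory with the required final volume.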
30.4 SIMULATION RESULTS
The proposed PSO algorithm for solving the short-term hydrothermal scheduling problem is applied to a test system that contains four hydro units and three thermal units [15,16]. The time interval considered in this study is 1 hour and the scheduling period is 24 hours. The population size and the maximum number of iterations are set to 50 and 200, respectively. The simulation is executed on a PC with an Intel Core i7 2.2-GHz processor and 4 GB of RAM under the 64-bit Windows 7 operating system. The optimal hourly hydrothermal generation schedules obtained by the PSO algorithm are shown in Table 30.1, and the water discharge rates of the hydro units are presented in Table 30.2. The cost convergence curve of the proposed PSO algorithm is shown in Fig. 30.2. According to Fig. 30.2, the PSO algorithm converges after approximately 150 iterations, and the total cost of the hydrothermal generation schedule is 43580 USD.
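For orientation, the sketch below shows a generic PSO loop with the population size (50) and iteration limit (200) reported above. It is not the authors' implementation: the inertia-weight velocity update, the values of w, c1, and c2, the velocity limit taken as a fraction of the search range, and the clamping of positions to their bounds are all common choices assumed for the example.

```python
import random


def pso(cost, bounds, n_particles=50, n_iter=200,
        w=0.7, c1=1.5, c2=1.5, v_frac=0.2, seed=0):
    """Minimize cost(x) over box bounds with a basic global-best PSO.

    bounds: list of (low, high) pairs, one per dimension.
    w, c1, c2, v_frac: assumed inertia, acceleration, and velocity-limit settings.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    v_max = [v_frac * (hi - lo) for lo, hi in bounds]
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])   # pull toward own best
                     + c2 * r2 * (gbest[d] - pos[i][d]))     # pull toward swarm best
                v = max(-v_max[d], min(v_max[d], v))         # velocity limits
                lo, hi = bounds[d]
                vel[i][d] = v
                pos[i][d] = max(lo, min(hi, pos[i][d] + v))  # keep inside bounds
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost
```

For the scheduling problem, each particle would encode the hourly discharges and thermal outputs, and `cost` would combine the fuel cost with penalties for any violated constraints.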
30.5 CONCLUSION
This chapter introduced the PSO method as an effective tool for solving optimization problems. In this study, the particle swarm optimization (PSO) algorithm is adopted to solve the short-term hydrothermal
scheduling problem. According to the results, the PSO algorithm achieves near-optimal solutions in reasonable computational time. Nature-inspired algorithms such as PSO are therefore effective, in terms of both numerical results and computational time, for optimization problems with complicated constraints.
REFERENCES [1] M.R. AlRashidi, M.E. El-Hawary, A survey of particle swarm optimization applications in electric power systems, IEEE Trans. Evol. Comput. 13 (2009) 913–918. [2] M. Nazari-Heris, B. Mohammadi-Ivatloo, Application of heuristic algorithms to optimal PMU placement in electric power systems: an updated review, Renew. Sustain. Energy Rev. 50 (2015) 214–228. [3] B. Mohammadi-Ivatloo, M. Moradi-Dalvand, A. Rabiee, Combined heat and power economic dispatch problem solution using particle swarm optimization with time varying acceleration coefficients, Electr. Power Syst. Res. 95 (2013) 9–18. [4] M. Mehdinejad, B. Mohammadi-Ivatloo, R. Dadashzadeh-Bonab, K. Zare, Solution of optimal reactive power dispatch of power systems using hybrid particle swarm optimization and imperialist competitive algorithms, Int. J. Electr. Power Energy Syst. 83 (2016) 104–116. [5] Y. Zhang, D.-w. Gong, N. Geng, X.-y. Sun, Hybrid bare-bones PSO for dynamic economic dispatch with valve-point effects, Appl. Soft Comput. 18 (2014) 248–260. [6] A. Moradi, M. Fotuhi-Firuzabad, Optimal switch placement in distribution systems using trinary particle swarm optimization algorithm, IEEE Trans. Power Deliv. 23 (2008) 271–279. [7] R.K. Sahu, S. Panda, G.C. Sekhar, A novel hybrid PSO-PS optimized fuzzy PI controller for AGC in multi area interconnected power systems, Int. J. Electr. Power Energy Syst. 64 (2015) 880–893. [8] V. Girish, T. Ananthapadmanabha, A cluster objective PSO algorithm for optimal PMU placement in IEEE bus systems and in KPTCL grid, Int. J. Power Energy Convers. 7 (2016) 121–138. [9] R.P. Singh, V. Mukherjee, S. Ghoshal, Particle swarm optimization with an aging leader and challengers algorithm for the solution of optimal power flow problem, Appl. Soft Comput. 40 (2016) 161–177. [10] M.H. Moradi, M. Abedini, A combination of genetic algorithm and particle swarm optimization for optimal DG location and sizing in distribution systems, Int. J. Electr. Power Energy Syst. 
34 (2012) 66–74. [11] S. Pookpunt, W. Ongsakul, Optimal placement of wind turbines within wind farm using binary particle swarm optimization with time-varying acceleration coefficients, Renew. Energy 55 (2013) 266–276. [12] M. Saravanan, S.M.R. Slochanal, P. Venkatesh, J.P.S. Abraham, Application of particle swarm optimization technique for optimal location of FACTS devices considering cost of installation and system loadability, Electr. Power Syst. Res. 77 (2007) 276–283. [13] M. Basu, Improved differential evolution for short-term hydrothermal scheduling, Int. J. Electr. Power Energy Syst. 58 (2014) 91–100. [14] M. Nazari-Heris, B. Mohammadi-Ivatloo, G.B. Gharehpetian, Short-term scheduling of hydro-based power plants considering application of heuristic algorithms: a comprehensive review, Renew. Sustain. Energy Rev. 74 (2017) 116–129. [15] M. Basu, An interactive fuzzy satisfying method based on evolutionary programming technique for multiobjective shortterm hydrothermal scheduling, Electr. Power Syst. Res. 69 (2004) 277–285. [16] M. Nazari-Heris, A. Hagrah, B. Mohammadi-Ivatloo, Optimal short-term generation scheduling of hydrothermal systems by implementation of real-coded genetic algorithm based on improved Mülenbein mutation, Energy 128 (2017) 77–85.
CHAPTER 31
OPTIMUM DESIGN OF COMPOSITE CONCRETE FLOORS USING A HYBRID GENETIC ALGORITHM
Mohamed G. Sahab∗ , Vassili V. Toropov† , Amir H. Gandomi‡,§ ∗ School of Civil Engineering, Tafresh University, Tafresh, Iran † School of Engineering and Materials Science, Queen Mary University of London, London, United Kingdom ‡ School of Business, Stevens Institute of Technology, Hoboken, NJ, United States § BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, United States
NOMENCLATURE
bE: effective flange width
b0: steel joist spacing
C(X): total cost function
Cb(X): steel beam cost function
Cc(X): concrete cost function
Cs(X): reinforcing bars cost function
E: modulus of elasticity of steel
I: moment of inertia of the transformed composite section
l: the length of joists
ts: slab thickness
ω: the uniform service live load per unit length of beam
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00031-4 Copyright © 2017 Elsevier Inc. All rights reserved.

31.1 INTRODUCTION
Among the structural elements of a building, floor construction is the most time-consuming and costly activity, particularly for a framed building, accounting for some 60% to 80% of the structural total in terms of both cost and time [1,2]. One of the most economic structural flooring systems is the steel–concrete composite floor. Steel beams encased in concrete to protect against fire were used from the early 1900s, and some of them were designed as steel–concrete composite beams. In the early 1930s, bridge construction began to use composite sections, and in the early 1960s composite steel–concrete sections were employed in building construction [3]. A typical steel–concrete composite floor is formed by steel beams connected by shear connectors to a concrete slab on top of them, in which the concrete slab mainly acts as the compression flange of the composite beam. Steel beams may be partially or completely encased in concrete. In a different type of composite steel–concrete slab, called profiled steel sheeting, the floor slab is cast on permanent steel formwork that acts as formwork during construction and then, in service conditions, as bottom reinforcement for the concrete slab.

To design a composite floor in practice, engineers manipulate design parameters such as compressive concrete strength, slab thickness, beam spacing, and steel section size to obtain a proper design. This design procedure can be treated as a design optimization problem whose goal is the minimum cost design of a composite steel–concrete floor. Over the last three decades, the cost optimization of composite structures was mainly considered from the viewpoint of the development and application of different optimization techniques. Zahn [4] discusses the economies of the AISC Load and Resistance Factor Design (LRFD) code versus the AISC Allowable Stress Design (ASD) code in the design of composite floor beams, through a weight comparison of some 2500 composite designs using A36 steel. He concludes that for short-span beams in the range of 10 ft (3.05 m) to 20 ft (6.1 m) the vibration serviceability constraint is the controlling design constraint under either code. The author’s ‘preliminary results’ indicate that the LRFD code yields a saving of ‘6% to 15% for span lengths ranging from 10 to 45 ft (3.05 to 13.7 m)’. The minimum cost design of composite beams based on the AISC-LRFD code has been developed by Lorenz [5]. Bhatti [6] presents the minimum cost design of simply supported, partially or fully composite I-shaped steel beams with concrete slabs subjected to a uniformly distributed load and to the strength, deflection, and vibration constraints of the AISC-LRFD specifications, using the Lagrange multiplier approach. Adeli and Kim [7,8] optimized the cost of composite floors using a neural dynamics model and a floating-point genetic algorithm. A different study on the optimization of composite floors using a genetic algorithm was carried out by Shock [9]. Cost optimization of a composite I-beam floor system has been developed by Klanšek and Kravanja [10]. Senouci and Al-Ansari [11] present a genetic algorithm model for the cost optimization of composite beams based on the AISC-LRFD specifications.

This chapter presents the optimal design of composite floors consisting of steel joists and a covering slab, based on the AISC-ASD, AISC-LRFD [12], and Eurocode 4 [13] code provisions. Both shored and unshored constructions are considered and compared from an economic point of view. An optimality criterion based on the position of the neutral axis is introduced, and it is shown that this makes the optimization problem much easier and more practical to solve.
31.2 PROBLEM FORMULATION OF COST OPTIMIZATION OF COMPOSITE FLOORS
31.2.1 DESIGN VARIABLES
Fig. 31.1 shows the cross section and the design variables considered for the design optimization of a steel–concrete floor in this chapter. These design variables are the slab thickness, $t_s$, the steel joist spacing, $b_0$, and the steel joist section size (e.g., IPE 240).
31.2.2 COST FUNCTION
The cost function is the total cost of a unit area of the composite floor. Eq. (31.1) describes this cost function, which includes the cost of the steel joists and the costs of the reinforcing bars and concrete of the covering slab:
$$C(X) = C_s(X) + C_c(X) + C_b(X) \tag{31.1}$$
FIGURE 31.1 Design variables and effective width in a composite steel–concrete floor.
where $C_s$, $C_c$, and $C_b$ are, respectively, the costs of the reinforcing bars, the concrete of the covering slab, and the steel beams for a unit area of the floor. These cost functions depend on the design variable vector, X, which is defined as follows:
$$X = (b_0,\ t_s,\ \text{profile No.}) \tag{31.2}$$
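The structure of Eq. (31.1) can be illustrated with a minimal Python sketch. It is not the authors' cost model: the function name, the per-unit-area formulas, the rebar quantity, and the joist mass are assumptions made for illustration, and the cost of shear studs is left out. The default unit prices are those quoted in Section 31.5.

```python
def floor_cost_per_m2(b0_m, ts_m, joist_mass_kg_per_m, rebar_kg_per_m2,
                      steel_price=1.125, rebar_price=1.0, concrete_price=62.5):
    """Return (C_s, C_c, C_b, C) in currency per square metre of floor.

    Assumed simplifications: one joist every b0 metres, concrete volume per
    square metre equal to the slab thickness, shear studs ignored.
    """
    c_s = rebar_price * rebar_kg_per_m2               # reinforcing bars
    c_c = concrete_price * ts_m                       # concrete of covering slab
    c_b = steel_price * joist_mass_kg_per_m / b0_m    # steel joists
    return c_s, c_c, c_b, c_s + c_c + c_b


# Hypothetical design point X = (b0, ts, profile): 1.6 m, 0.09 m, a joist of
# 49.1 kg/m (assumed catalogue mass) and an assumed 2 kg/m2 of reinforcement.
c_s, c_c, c_b, total = floor_cost_per_m2(1.6, 0.09, 49.1, 2.0)
print(round(c_s, 3), round(c_c, 3), round(c_b, 3), round(total, 3))
```

Because the joist term is divided by the spacing, widening $b_0$ lowers $C_b$ per unit area, which is why spacing competes with slab thickness and section size in the optimization.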
31.2.3 DESIGN CONSTRAINTS
Until the early 1990s, the design of steel–concrete composite floors was based on allowable stress design (ASD). More recently, a method known as load and resistance factor design (LRFD) has come into use because it permits a more rational design. It takes into account the probability of loading conditions and statistical variations in the strength, or resistance capability, of members and connection materials. The use of LRFD design procedures generally results in material savings in the range of 15 to 20%, and on major structures some elements may show savings of up to 25%. Such weight savings generally mean a lower cost for the structural steel. However, except for major structures, when serviceability factors such as deflection and vibration are considered in the proportioning of the individual members, the nominal savings of LRFD procedures versus ASD procedures are more likely to be approximately 5%. The AISC design code presents both LRFD and ASD procedures for the design of steel–concrete composite floors. Eurocode 4 (EC4), by contrast, is an ultimate limit state design code in which, similarly to AISC-LRFD, the maximum stresses of steel and concrete in the composite section are allowed to reach the yield strength of steel and the compressive strength of concrete, respectively. In this chapter, all three of the above-mentioned design codes are considered individually to define the design constraints.

An early stage in the design of a composite beam is to determine the effective breadth of the concrete flange, $b_E$, as shown in Fig. 31.1. The assumed composite cross section is the same for ASD and LRFD procedures. The effective width of the slab is governed by the beam span and by the beam spacing or edge distance. The effective width of the concrete slab is the sum of the effective widths for each side of the beam centerline, each of which is taken as the smallest of:
1. one-eighth of the beam span, center-to-center of supports. According to the AISC Code provisions, this width is used for both simply supported and continuous beams. The EC4 Code gives the same provision as the AISC Code for simply supported beams, but prescribes one-sixteenth of the beam span for the effective width of continuous beams;
2. one-half the distance to the centerline of the adjacent beam; or
3. the distance to the edge of the slab.
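The three-way minimum above can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the numerical spans and spacings are illustrative values, not data from the chapter.

```python
def side_width(span, to_adjacent_beam=None, to_slab_edge=None, span_fraction=1 / 8):
    """Effective slab width on ONE side of the beam centerline.

    span_fraction is 1/8 by default; the text gives 1/16 for continuous
    beams under EC4. Pass whichever of the two distances applies to the side.
    """
    limits = [span * span_fraction]
    if to_adjacent_beam is not None:
        limits.append(to_adjacent_beam / 2)   # half the distance to the next beam
    if to_slab_edge is not None:
        limits.append(to_slab_edge)           # distance to the slab edge
    return min(limits)


# Illustrative 8 m span with joists at 1.6 m centers (assumed values).
interior = 2 * side_width(8.0, to_adjacent_beam=1.6)
edge = side_width(8.0, to_adjacent_beam=1.6) + side_width(8.0, to_slab_edge=0.3)
continuous_ec4 = 2 * side_width(8.0, to_adjacent_beam=1.6, span_fraction=1 / 16)
print(interior, edge, continuous_ec4)
```

For closely spaced joists the spacing limit usually governs, so the effective width of an interior beam simply equals the joist spacing $b_0$.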
The design procedures differ for shored and unshored steel–concrete composite floors. If shoring is not used (that is, temporary supports are not provided to the floor beams), the steel beam must carry all dead loads applied until the concrete hardens, even if full plastic capacity is permitted for the composite section afterward. The ultimate load is obtained from the dead load, DL, and the live load, LL, as follows:
$$\text{Ultimate load} = 1.35\,DL + 1.5\,LL \quad \text{(EC4)} \tag{31.3}$$
$$\text{Ultimate load} = 1.2\,DL + 1.6\,LL \quad \text{(AISC)} \tag{31.4}$$
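As a small worked illustration of Eqs. (31.3) and (31.4), the sketch below applies both load combinations to the same service loads. The function name and dictionary are hypothetical; the load values are the 500 kg/m2 figures quoted in Section 31.5.

```python
# Load factors (dead, live) taken from Eqs. (31.3) and (31.4).
LOAD_FACTORS = {"EC4": (1.35, 1.5), "AISC": (1.2, 1.6)}


def ultimate_load(dead, live, code):
    """Factored load for the given design code; units follow the inputs."""
    gamma_dead, gamma_live = LOAD_FACTORS[code]
    return gamma_dead * dead + gamma_live * live


ec4 = ultimate_load(500.0, 500.0, "EC4")
aisc = ultimate_load(500.0, 500.0, "AISC")
print(ec4, aisc)
```

With equal dead and live loads, the EC4 combination is slightly heavier than the AISC one.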
31.2.3.1 Flexural Strength Constraints
According to AISC-LRFD, the ultimate bending moment must be less than or equal to the nominal flexural strength multiplied by the strength reduction factor (φ = 0.9). Two cases must be considered. First, for unshored beams, the ultimate bending moment capacity of the noncomposite steel section (excluding the concrete strength) must be checked to make sure that the steel beam can support a dead load consisting of its estimated self-weight, the weight of wet concrete, and the weight of the formwork, together with a construction live load. This constraint is expressed as
$$M_{uu,\text{noncomposite}} \le 0.90\,M_{nn,\text{noncomposite}} \quad \text{(according to AISC)} \tag{31.5}$$
$$M_{uu,\text{noncomposite}} \le M_{nn,\text{noncomposite}} \quad \text{(according to EC4)} \tag{31.6}$$
where $M_{uu,\text{noncomposite}}$ and $M_{nn,\text{noncomposite}}$ are the required ultimate moment capacity and the nominal moment capacity of the noncomposite steel section, respectively. Second, for the composite section, the required ultimate moment, $M_{uu}$, should be equal to or less than the moment capacity of the composite section, $M_{nc}$:
$$M_{uu} \le M_{nc} \tag{31.7}$$
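The noncomposite checks of Eqs. (31.5) and (31.6) differ only in the reduction factor, which the following sketch makes explicit. The function name is hypothetical and the moments are illustrative numbers in arbitrary consistent units.

```python
def noncomposite_flexure_ok(m_required, m_nominal, code):
    """Check Eq. (31.5) for AISC (phi = 0.90) or Eq. (31.6) for EC4 (factor 1.0)."""
    if code == "AISC":
        phi = 0.90
    elif code == "EC4":
        phi = 1.0
    else:
        raise ValueError("code must be 'AISC' or 'EC4'")
    return m_required <= phi * m_nominal


# A required moment of 95 against a nominal capacity of 100 (assumed values)
# passes the check as written for EC4 but fails it for AISC.
print(noncomposite_flexure_ok(95.0, 100.0, "AISC"))
print(noncomposite_flexure_ok(95.0, 100.0, "EC4"))
```

This comparison holds only when the same required moment is used for both codes; in practice the load factors of Eqs. (31.3) and (31.4) produce different required moments.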
31.2.3.2 Deflection Constraints
The AISC LRFD code does not include any explicit requirement on deflections. However, the deflection constraint is included for the generality of the optimization formulation. The deflection of a composite beam depends on whether it is shored during construction. Shoring provides temporary support during the hardening of the concrete slab and consequently reduces the deflection of the composite beam. However, the unshored construction method is less labor-intensive, faster, and more convenient than the shored construction method, and consequently it is often preferred. For unshored composite beams, the deflection of the composite beam due to the live load, LL, is limited to a certain value defined as a fraction of the span length in the following form:
$$\Delta_{ll} = \frac{5\omega l^4}{384EI} \le \text{a predefined limit ranging from } \frac{l}{300} \text{ to } \frac{l}{360} \text{ for buildings and } \frac{l}{500} \text{ to } \frac{l}{900} \text{ for bridges} \tag{31.8}$$
FIGURE 31.2 Stress distribution along the depth of a steel–concrete composite section: (A) elastic analysis, and (B) plastic analysis.
where ω is the uniform service live load per unit length of the beam, l is the length of joists, E and I are the modulus of elasticity of steel and the moment of inertia of the transformed composite section, respectively.
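The deflection check of Eq. (31.8) can be evaluated directly once consistent units are chosen. The sketch below uses kg and cm; the span, the moment of inertia, and the modulus of elasticity are assumed illustrative values, not data from the chapter, and the function names are hypothetical.

```python
def live_load_deflection(w, l, e, i):
    """Midspan deflection 5*w*l^4 / (384*E*I) of a simply supported beam."""
    return 5.0 * w * l ** 4 / (384.0 * e * i)


def deflection_ok(w, l, e, i, limit_ratio=360):
    """True when the live-load deflection does not exceed l / limit_ratio."""
    return live_load_deflection(w, l, e, i) <= l / limit_ratio


# Assumed inputs: 500 kg/m2 live load on joists 1.6 m apart gives 8 kg/cm,
# span 800 cm, E = 2.1e6 kg/cm2, transformed I = 30000 cm4.
delta = live_load_deflection(w=8.0, l=800.0, e=2.1e6, i=30000.0)
print(round(delta, 3), round(800.0 / 360, 3), deflection_ok(8.0, 800.0, 2.1e6, 30000.0))
```

Because the deflection grows with the fourth power of the span, this constraint tends to govern for long joists even when the strength checks are easily satisfied.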
31.2.4 SOLUTION METHOD OF OPTIMIZATION PROBLEM
A hybrid genetic algorithm (GA) is used to solve the optimization problem [14]. This algorithm consists of two search stages. In the first stage, a global search is carried out over the design search space using a modified GA, in which the population size varies between a maximum and a predefined minimum size, depending on the uniformity of the individuals in each generation. In the second stage, a local search is carried out in the vicinity of the solution obtained by the GA, using a discretized form of the Hooke and Jeeves method, to find a better solution.
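The two-stage idea can be sketched as follows. This is a simplified illustration, not the algorithm of [14]: the population-resizing rule, the crossover and mutation operators, and the toy cost function are all assumptions, and the local stage uses only the exploratory moves of a Hooke and Jeeves style search on the index grid (the pattern move is omitted for brevity).

```python
import random


def pattern_search(cost, start, bounds):
    """Discrete local search: try +/-1 index steps until nothing improves."""
    x = tuple(start)
    improved = True
    while improved:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for step in (-1, 1):
                y = x[:i] + (x[i] + step,) + x[i + 1:]
                if lo <= y[i] <= hi and cost(y) < cost(x):
                    x, improved = y, True
    return x


def hybrid_ga(cost, bounds, pop_max=30, pop_min=10, generations=40, seed=0):
    """Stage 1: GA with a diversity-dependent population size (assumed rule).
    Stage 2: discrete local search started from the best GA individual."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(lo, hi) for lo, hi in bounds) for _ in range(pop_max)]
    for _ in range(generations):
        pop.sort(key=cost)
        diversity = len(set(pop)) / len(pop)           # 1.0 means all distinct
        size = max(pop_min, int(pop_max * diversity))  # shrink when uniform
        parents = pop[:size // 2]                      # keep the better half
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(a, b)]  # uniform crossover
            if rng.random() < 0.2:                              # mutation
                i = rng.randrange(len(bounds))
                child[i] = rng.randint(*bounds[i])
            children.append(tuple(child))
        pop = parents + children
    return pattern_search(cost, min(pop, key=cost), bounds)


def toy_cost(x):
    """Hypothetical separable cost over three discrete design indices."""
    return (x[0] - 7) ** 2 + (x[1] - 3) ** 2 + (x[2] - 5) ** 2


best = hybrid_ga(toy_cost, bounds=[(0, 9), (0, 9), (0, 9)])
print(best)
```

In a real design problem the three indices would map to discrete values of $b_0$ and $t_s$ and to a catalogue of joist sections, and infeasible designs would have to be penalized or rejected inside the cost function.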
31.3 INTRODUCING OPTIMALITY CRITERIA
Optimality criteria methods are techniques that use the optimality conditions or heuristic rules to develop efficient iterative procedures for finding the optimum solution. Figs. 31.2A and B show the stress distribution along the depth of a steel–concrete composite section for an elastic and a plastic analysis, respectively. As these figures show, the neutral axis can lie within the concrete slab, the flange of the steel beam, or the web of the steel beam section. As the external moment imposed on the composite beam increases, the neutral axis moves away from the bottom of the section toward the top of the concrete slab. When it lies within the concrete slab, the portion of the slab below the neutral axis does not contribute to the resisting moment of the section, but it adds dead weight to the floor. For this reason, it can be considered that in the optimal composite section the neutral axis should be located at the interface between the top of the steel beam section and the bottom of the concrete slab.
FIGURE 31.3 Floor plan.
The distance between the bottom of the concrete slab and the neutral axis, $y_b$, can be found from the equilibrium between the tension force and the total compression force. Considering Fig. 31.3, the depth of the neutral axis can be obtained from Eq. (31.9):
$$y_b = \frac{\sum_{i=1}^{2} A_i y_i}{\sum_{i=1}^{2} A_i} \tag{31.9}$$
The above formula is valid in the range of elastic behavior of the materials. In the case of plastic design, the position of the plastic neutral axis (PNA) can be identified by equating the tension and compression forces on the section. Since the optimum solution obtained using this optimality criterion is independent of the relative cost of materials and labor, it is also independent of time and place.
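Eq. (31.9) is an area-weighted average of the centroid positions of the two parts of the section. The sketch below evaluates it for made-up numbers; the function name is hypothetical, and the areas are assumed to be already transformed to a common material, with positions measured from a single reference line.

```python
def elastic_neutral_axis(areas, centroids):
    """Area-weighted centroid: sum(A_i * y_i) / sum(A_i), as in Eq. (31.9)."""
    if len(areas) != len(centroids):
        raise ValueError("areas and centroids must have the same length")
    return sum(a * y for a, y in zip(areas, centroids)) / sum(areas)


# Illustrative values only: part 1 has area 10 at y = 2, part 2 area 30 at y = 6.
y_b = elastic_neutral_axis([10.0, 30.0], [2.0, 6.0])
print(y_b)
```

The result always lies between the two centroids and is pulled toward the part with the larger transformed area, which is why thickening the slab shifts the neutral axis upward.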
31.4 MULTIOBJECTIVE DESIGN OPTIMIZATION OF STEEL–CONCRETE COMPOSITE FLOOR
As mentioned in Section 31.1, the cost of floors usually comprises about 60% of the total structural cost of a building. On the other hand, the major part of the dead load of a building is imposed by the self-weight of the floors. A heavier floor therefore leads to more costly girders, columns, foundations, and other floor-supporting structural elements. Furthermore, the earthquake lateral load is proportional to the self-weight of the structure. Decreasing the construction cost of the floors may therefore not provide the minimum total structural cost. Since the concrete covering slab accounts for the main part of the self-weight of the composite floor, decreasing the slab thickness decreases the self-weight of the floor. On this basis, the following objective function, which includes both objectives, can be defined:
$$C_{t-m} = 0.6\,C(X) + 0.4\,t_s \tag{31.10}$$
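A direct evaluation of Eq. (31.10) is sketched below. The function name is hypothetical, and using the cost in currency/m2 together with the thickness in cm without any normalization is an assumption, since the text does not state how the two terms are scaled. The two input pairs are taken from Table 31.3 purely as sample numbers.

```python
def combined_objective(cost, slab_thickness, w_cost=0.6, w_thickness=0.4):
    """Weighted sum of floor cost and slab thickness, following Eq. (31.10)."""
    return w_cost * cost + w_thickness * slab_thickness


thick_slab = combined_objective(50.48, 9.0)   # ts = 9 cm
thin_slab = combined_objective(65.06, 7.0)    # ts = 7 cm
print(round(thick_slab, 3), round(thin_slab, 3))
```

Because the two terms have different units and magnitudes, the weights 0.6 and 0.4 only express a meaningful preference once both terms are brought to comparable scales.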
Table 31.1 Variation of the Location of the Neutral Axis With Respect to Steel Beam Section Size
IPE | Cost of Floor per Unit Area (currency/m2) | Distance of Neutral Axis From the Top of Steel Beam (cm) | Feasibility
300 | 45.21 | 0.83 | Infeasible
330 | 50.48 | 1.14 | Feasible
360 | 57.32 | 3.05 | Feasible
400 | 64.88 | 4.57 | Feasible
Table 31.2 Variation of the Location of the Neutral Axis With Respect to Thickness of Covering Slab
Thickness of Covering Slab (cm) | Cost of Floor per Unit Area (currency/m2) | Distance of Neutral Axis From the Top of Steel Beam (cm) | Feasibility
8 | 50.08 | 3.22 | Infeasible
9 | 50.48 | 1.14 | Feasible
10 | 50.97 | 1.05 | Infeasible
11 | 51.52 | 0.27 | Infeasible
12 | 52.11 | −0.47 | Infeasible
31.5 COMPARATIVE DESIGN EXAMPLE
This design example has been taken from Salmon and Johnson [3], with some changes to make it more practical. Fig. 31.3 shows the dimensions of the floor. The floor is constructed without shores. The other input design values are as follows: Fy = 2400 kg/cm2, fc = 200 kg/cm2, live load = 500 kg/m2, and dead load of partitions and flooring imposed after the concrete has cured = 500 kg/m2. An appropriate design has been presented by Salmon and Johnson [3]. They chose a concrete slab thickness of ts = 100 mm and a steel joist W 16 × 36, which is equivalent to the European beam section IPE 400. Here an optimized design is presented and compared with this floor. The unit prices of materials and labor for reinforcing bars, steel beams, shear studs, and concrete are assumed to be 1 currency/kg, 1.125 currency/kg, 2.5 currency per stud, and 62.5 currency/m3, respectively. Deflection is limited to l/360. The lower bound of the concrete slab thickness is taken as 70 mm. Three different sets of initial design values were chosen. Tables 31.1 to 31.3 summarize the results of the design optimization. Table 31.3 shows the optimum solution obtained for different assumed concrete costs. As can be seen, the optimum solution does not change over a range of concrete costs. When the concrete cost increases considerably, the optimum solution changes by decreasing the slab thickness to reduce the consumption of expensive concrete. The optimal cost solution obtained in this research is about 9% lower than that of the conventional design presented by Salmon and Johnson.
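The link between the cost objective and the neutral-axis criterion can be checked against the values reported in Table 31.1. The short sketch below only re-reads those tabulated numbers; the variable names are hypothetical.

```python
# Rows of Table 31.1: (IPE size, cost per m2, neutral-axis distance from the
# top of the steel beam in cm, feasible?)
table_31_1 = [
    (300, 45.21, 0.83, False),
    (330, 50.48, 1.14, True),
    (360, 57.32, 3.05, True),
    (400, 64.88, 4.57, True),
]

feasible = [row for row in table_31_1 if row[3]]
cheapest = min(feasible, key=lambda row: row[1])           # minimum-cost design
closest_axis = min(feasible, key=lambda row: abs(row[2]))  # neutral-axis criterion
print(cheapest[0], closest_axis[0])
```

Among the feasible rows, the cheapest section and the section whose neutral axis lies closest to the top of the steel beam are the same, which is the behavior the optimality criterion of Section 31.3 relies on.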
Table 31.3 Variation of the Optimum Design With Respect to Concrete Price
Concrete Price (currency/m3) | IPE | b0 (cm) | ts (cm) | Cost of Floor per Unit Area (currency/m2)
62.5 | 330 | 160 | 9 | 50.48
100 | 330 | 160 | 9 | 53.85
125 | 330 | 160 | 9 | 56.11
250 | 330 | 150 | 7 | 65.06
31.6 CONCLUSIONS
The optimal design of steel–concrete composite floors according to AISC-ASD, AISC-LRFD, and Eurocode 4, based on a hybrid and modified GA, was presented. Considering the position of the neutral axis, it was shown that the distance between the neutral axis and the bottom of the concrete slab can be used as an alternative objective function for finding the optimal design of a composite steel–concrete section, and that minimizing this function can lead to the optimum design. This optimality criterion makes the optimization problem much easier and more practical to solve. A comparative example was presented, and with the aid of this numerical example the following conclusions were drawn:
1. In the optimal steel–concrete composite section, the neutral axis is as close to the bottom of the concrete slab as the design constraints allow while a feasible solution is still obtained.
2. Because of the discrete nature of the design variables, the optimal composite section does not change continuously with the variation of the relative costs of concrete and steel.
3. The solution obtained with this optimality criterion is independent of place and time.
REFERENCES
[1] P.W. Matthew, D.H. Bennett, Economic Long Span Concrete Floors, British Cement Association, Crowthorne, UK, 1990.
[2] C.H. Goodchild, Economic Concrete Frame Elements, British Cement Association, Crowthorne, UK, 1997.
[3] C.G. Salmon, J.E. Johnson, Steel Structures Design and Behavior, Harper Collins College Publishers, New York, 1996.
[4] M.C. Zahn, The economies of LRFD in composite floor beams, AISC Eng. J. 24 (2) (1987) 87–92.
[5] R.F. Lorenz, Understanding composite beam design methods using LRFD, AISC Eng. J. (First Quarter) (1988) 35–38.
[6] M.A. Bhatti, Optimum cost design of partially composite steel beams using LRFD, AISC Eng. J. (First Quarter) (1996) 18–29.
[7] H. Adeli, H. Kim, Cost optimization of composite floors using neural dynamics model, Commun. Numer. Methods Eng. 17 (2001) 771–787.
[8] H. Kim, H. Adeli, Discrete cost optimization of composite floors using a floating-point genetic algorithm, Eng. Optim. 33 (4) (2001) 485–501.
[9] B.T. Shock, Automated Design of Steel Wide Flange Beam Floor Framing Systems Using Genetic Algorithms, M.S. thesis, Marquette University, Milwaukee, WI, 2003.
[10] U. Klanšek, S. Kravanja, Cost optimization of composite I beam floor system, Am. J. Appl. Sci. 5 (1) (2007) 7–17.
[11] A.B. Senouci, M.S. Al-Ansari, Cost optimization of composite beams using genetic algorithms, Adv. Eng. Softw. 40 (2009) 1112–1118.
[12] AISC, Manual of Steel Construction – LRFD/ASD, 13th ed., American Institute of Steel Construction, Chicago, Illinois, 1995.
[13] British Standards Institution, Design of Composite Steel and Concrete Structures, British Standards Institution, London, 1994.
[14] M.G. Sahab, V. Toropov, A.F. Ashour, A hybrid genetic algorithm for reinforced concrete flat slab buildings, Comput. Struct. 83 (2005) 551–559.
CHAPTER 32
A COMPARATIVE STUDY OF IMAGE SEGMENTATION ALGORITHMS AND DESCRIPTORS FOR BUILDING DETECTION
Fadi Dornaika∗ , Abdelmalik Moujahid† , Youssef El Merabet‡ , Yassine Ruichek§ ∗ IKERBASQUE, Basque Foundation for Science, Bilbao, Spain † Carlos III University of Madrid (UC3M), Madrid, Spain ‡ Faculté des Sciences, Université Ibn Tofail, Kénitra, Morocco § IRTES-SET, UTBM, Belfort, France
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00032-6 Copyright © 2017 Elsevier Inc. All rights reserved.

32.1 INTRODUCTION
32.1.1 MOTIVATION
Nowadays, automatic object recognition is a topic of growing interest for the machine vision community. In particular, automatic building detection from monocular satellite and aerial images has been an important tool for many applications, such as the creation and update of maps and Geographical Information Systems databases, land use analysis, change detection, and urban monitoring [1–4]. Due to rapidly growing urbanization, detecting buildings from images is a hot topic and an active field of research. Recently, vision and photogrammetry tools have been increasingly used in the processing of Geographical Information Systems, cultural heritage modeling, risk management, and the monitoring of urban regions. More specifically, extracting objects such as roads and buildings has gained significant attention over the last decade. Aerial data are very useful for the coverage of large areas such as cities, and several aerial-based approaches have been proposed for the extraction of buildings. More precisely, the data employed as input to these approaches are either optical aerial images and derived Digital Surface Models (e.g., [5]) or aerial LiDAR 3D point clouds (e.g., [6]).

It is well known that segmenting buildings in aerial images is a challenging task. This problem is generally considered part of high-level image processing, whose purpose is to produce numerical or symbolic information. In this context, many techniques have been proposed in the literature. Among the most frequently used are semi-automatic methods that need user interaction in order to extract desired targets or objects of interest from images [7]. Generally, this category of methods has been introduced to overcome the problems associated with fully automatic segmentation, which is usually not perfect.
32.1.2 CONTRIBUTION
From the point of view of machine learning paradigms, it is desirable to keep the user interaction at the training phase only and to fully automate the detection and recognition at the test phase. In this chapter, we propose an image-based approach for object detection and classification, namely, detecting building roofs in orthophotos. We use image segmentation algorithms to obtain an over-segmented orthophoto (e.g., [8,9]). The obtained regions are then described by holistic and hybrid descriptors for the detection of building roofs in orthophotos. First, an over-segmentation is applied to the orthophoto, for both the training and the test images. Second, holistic descriptors including color and texture are fused in order to obtain the feature descriptor of a given region. Third, the segmented regions in a test image are classified using machine learning tools. We investigate which combination (segmentation, descriptors) leads to the best detection results via a case study over a set of aerial images.

The main contributions of the chapter are as follows. Firstly, we apply the covariance matrix descriptor to the building detection problem. To the best of our knowledge, this recent descriptor has not been used in the context of building detection. This descriptor has proved to be very informative and compact. Secondly, we introduce a principled evaluation that studies the performance of the two main modules used in the detection chain, namely the image over-segmentation algorithm and the descriptor extractor. This study can identify the best pair (segmentation algorithm, region descriptor) in the context of building detection. Thirdly, we provide a performance study of classifiers whose role is to decide whether an arbitrary region is a building or not. We provide performance evaluations over 200 buildings using different segmentation algorithms and descriptors.

In this work, we are interested in studying the performance of a machine learning approach and its processing pipeline, which combines several modules: image segmentation, image descriptor extraction, and classification. Thus, the work studies the influence of the different modules on the final performance of building detection in orthophotos. Based on this study, we can identify the configurations that should be adopted for the task at hand. The rest of the chapter is organized as follows. Section 32.2 presents related state-of-the-art work. Section 32.3 describes the proposed machine learning approach as well as its main differences from existing work. It also presents the studied image descriptors together with their implementation details. Section 32.4 presents the performance evaluation through the use of image segmentation and descriptor classification. It presents a performance study of several combinations of segmentation algorithm and image descriptor, as well as of several classifiers. Finally, Section 32.5 concludes the chapter.
32.2 RELATED WORK
This section is split into two subsections. The first subsection enumerates the image segmentation algorithms used. The second subsection provides an overview of the state of the art in building detection.
32.2.1 GENERAL PURPOSE IMAGE SEGMENTATION
We use three popular segmentation techniques from the literature: Statistical Region Merging (SRM) [9], Mean shift-based segmentation (MS) [10], and Superpixels [11]. These segmentation methods are well known and often used for building segmentation purposes. Most of these methods have several control parameters. Some parameters specify the image size or the output format. Other parameters are essential for the segmentation process.
32.2.2 BUILDING DETECTION
In [12], an efficient approach is proposed for automatic rectangular building detection from monocular aerial images. The image is first decomposed into small homogeneous regions using superpixel segmentation of a masked image. Regions are then grouped into clusters by a region-level MRF segmentation method. In [13], the authors detect buildings in satellite images. The images are processed by applying a clustering technique using color features to eliminate vegetation areas and shadows that may adversely affect the performance of the algorithm. Subsequently, the Hue Saturation Value (HSV) representation of the image is used, and a new active contour model is developed and applied for building extraction. Two recent works started to exploit deep learning paradigms [14]. Some attempted to apply Convolutional Neural Networks (CNNs) to aerial images in order to retrieve either features or classes. In [15], the authors use CNNs that produce local label maps from rectangular aerial image patches. In that work, three categories (buildings, roads, and others) were considered. In [16], the authors propose an automated building detection framework for very high resolution remote sensing data based on deep CNNs. The core of their method is a supervised classification procedure employing a very large training data set. An MRF model is then responsible for obtaining the optimal labels regarding the detection of scene buildings.
32.3 PROPOSED MACHINE LEARNING APPROACH
32.3.1 OVERVIEW OF THE PROPOSED FRAMEWORK AND MAIN DIFFERENCES WITH RELATED WORK
The general flow chart of the proposed building-detection method is illustrated in Fig. 32.1. It should be noted that the training set is formed by a set of labeled regions together with their image descriptors. Many existing works rely on local features to assign an object label to each pixel, so these approaches need the class-conditional distribution of pixel values. For instance, the work in [16] derives pixel-based descriptors using CNNs; a binary classifier is then used at the pixel level, with an extra regularization step in order to obtain the detected object of interest. This scheme requires specifying the ideal local image support as well as the parameters of the final regularization. In [15], again, the choice of the sizes of both the rectangular input patch and the output label map seems to be ad hoc. Our proposed framework considers semantic patches that are provided by well-studied image segmentation algorithms. Each local patch is represented by a rich descriptor that globally describes the entire region and is used to infer its category via a supervised scheme. Our proposed method also differs from existing works in that it does not impose any constraint on building shapes, whereas many works assume a quadrilateral shape for buildings.
FIGURE 32.1 General flow chart of the proposed machine learning building-detection method.
32.3.2 STUDIED IMAGE DESCRIPTORS
Image texture and color can characterize intensity variations of object surfaces. Texture representation and analysis are a main focus of machine vision. Texture classification subject to changes and perturbations in the image acquisition process, such as illumination, noise, or scale, is very challenging, and often leads to a large intra-class variability. In this section, we present the image descriptors used in our work as well as some implementation details.
32.3.2.1 Color
Color is among the main descriptors used to characterize image regions. Indeed, existing methods have exploited color invariant descriptors in order to detect objects in aerial images (e.g., [17]). Color invariant descriptors are object properties that are not affected by external conditions [18]. For image regions, color histograms can be considered a simple and fast descriptor. These histograms, computed in any color space, quantify the color distribution in a given region and hence can be used as a discriminant signature. Fig. 32.2 illustrates the color descriptor associated with two different segmented regions, each belonging to a different class. In our work, we use color histograms in RGB space. In order to compute color histograms, we uniformly quantize each color channel into 16 bins, and the color histogram of each region is then computed in the feature space of 16 × 16 × 16 = 4096 bins. Obviously, quantization reduces the information regarding the content of regions; it is used as a trade-off to reduce processing time.
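The quantization described above can be sketched in pure Python as follows. The function name, the bin ordering, and the normalization to unit sum are assumptions made for illustration; the text specifies only the 16 bins per channel and the resulting 4096-bin feature space. Pixels are assumed to be 8-bit RGB triples belonging to one segmented region.

```python
def rgb_histogram(pixels, bins_per_channel=16):
    """Normalized joint RGB histogram of a region given as (r, g, b) tuples."""
    width = 256 // bins_per_channel            # 16 intensity levels per bin
    hist = [0.0] * bins_per_channel ** 3       # 16 * 16 * 16 = 4096 bins
    for r, g, b in pixels:
        index = ((r // width) * bins_per_channel + g // width) * bins_per_channel + b // width
        hist[index] += 1.0
    count = len(pixels)
    return [h / count for h in hist] if count else hist


# Tiny hypothetical region: two reddish pixels and one blue pixel.
region = [(255, 0, 0), (250, 5, 3), (0, 0, 255)]
hist = rgb_histogram(region)
print(len(hist), hist[3840], hist[15])
```

The two reddish pixels fall into the same bin even though their values differ, which illustrates the information loss that the quantization trades for speed.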
FIGURE 32.2 (A) A segmented roof region and its color histogram. (B) A segmented background region and its color histogram.
32.3.2.2 Local Binary Patterns
Local Binary Patterns (LBP) are among the recent texture descriptors. The original LBP operator replaces the pixel values of an image with decimal numbers, called LBPs or LBP codes, that encode the local structure around each pixel [19–21]. Each central pixel is compared with its eight neighbors; neighbors with a smaller value than the central pixel are assigned the bit 0, and neighbors with a value equal to or greater than that of the central pixel are assigned the bit 1. For each central pixel, a binary number is obtained by concatenating these bits clockwise, starting from the top-left neighbor. The decimal value of this binary number replaces the central pixel value. The histogram of LBP labels (the frequency of occurrence of each code) calculated over a region or an image can be used as a texture descriptor of that image. The size of the histogram is 2^P since the operator LBP(P, r) can generate 2^P different binary codes, formed by the P neighboring pixels. Recently, several LBP variants have been developed in order to improve the texture description [22,23]. To describe a segmented region with LBP descriptors, we use eight neighboring points (P = 8) with three radii (r = 1, r = 2, r = 3), each with three modes (uniform, rotation invariant, uniform and rotation invariant). Thus, there are nine LBP descriptors. The final descriptor is given by the concatenation of all nine. It is worth noting that despite the use of nine LBP descriptors, the final one is described by only 3 × (59 + 36 + 10) = 315 variables.
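As an illustration of the original LBP operator described above, the following sketch computes the basic 8-neighbor code on a 3 × 3 neighborhood (P = 8, r = 1) and its 256-bin histogram. It covers only the basic operator, not the uniform or rotation-invariant modes used in the chapter. Treating the top-left neighbor as the most significant bit and skipping border pixels are assumptions; the names `lbp_codes` and `lbp_histogram` are hypothetical.

```python
import numpy as np

# Clockwise neighbor offsets (row, col), starting at the top-left neighbor.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(gray):
    """Basic LBP codes for the interior pixels of a 2D gray-level image."""
    gray = np.asarray(gray, dtype=np.int64)
    h, w = gray.shape
    center = gray[1:h - 1, 1:w - 1]
    codes = np.zeros_like(center)
    for dr, dc in OFFSETS:
        neighbor = gray[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        # Bit is 1 when the neighbor is >= the central pixel, 0 otherwise.
        # Assumption: the first (top-left) neighbor is the most significant bit.
        codes = (codes << 1) | (neighbor >= center)
    return codes

def lbp_histogram(codes, n_codes=256):
    """Frequency of occurrence of each LBP code (2**8 = 256 bins for P = 8)."""
    return np.bincount(np.asarray(codes).ravel(), minlength=n_codes)
```

For the 3 × 3 patch `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, the clockwise bits are 0, 0, 0, 1, 1, 1, 1, 0, which gives the code 30 under this bit-order assumption.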
32.3.2.3 Covariance Descriptors
Tuzel and Meer introduced the covariance descriptor [24]. This descriptor represents an image or an image region using a sample covariance matrix. Let J denote an M × N intensity or color image, and V be the M × N × d-dimensional feature image extracted from J. Thus, V can be understood as a set of d 2D arrays (channels), where every array corresponds to a given image feature such as horizontal coordinate, vertical coordinate, color, image derivatives, filter responses, etc. This 3D array can be written as V(x; y) = φ(J; x; y), where φ is a function that extracts image features. For a given image region R ⊂ J containing n pixels, let {v_i}, i = 1, ..., n, denote the d-dimensional feature vectors obtained by φ within R. According to [24], the region R can be described by a d × d covariance matrix:
$$S_R = \frac{1}{n-1} \sum_{i=1}^{n} (v_i - m)(v_i - m)^T$$
where m is the mean vector of {v_i}, i = 1, ..., n. Since covariance matrices do not live in a Euclidean space, the difference between two matrices does not quantify the similarity or dissimilarity between the corresponding regions. Under the Log-Euclidean Riemannian metric, it is possible to measure the distance between covariance matrices. Given two covariance matrices S_1 and S_2, their distance is given by d(S_1, S_2) = ||log(S_1) − log(S_2)||_2, where ||.||_2 is the ℓ2 vector norm and log(S) is the matrix logarithm of the square matrix S. Thus, every image region R can be characterized by log(S_R). Since this is a symmetric matrix, the feature vector has d × (d + 1)/2 components, where d is the number of channels.
In our work, the image covariance descriptor is computed as follows. We consider 23 channels (x, y, R, G, B, H, S, V, Ix, Iy, Ixx, Iyy, Ixy, Iyx, LBPu2(r = 1), LBPri(r = 1), LBPriu2(r = 1), LBPu2(r = 2), LBPri(r = 2), LBPriu2(r = 2), LBPu2(r = 3), LBPri(r = 3), LBPriu2(r = 3)). Here, x denotes the channel that contains the horizontal coordinate of pixels, y denotes the channel that contains the vertical coordinate of pixels, R, G, B denote the three color components, H, S, V denote the color channels in HSV space, Ix, Iy, Ixx, Iyy, Ixy, Iyx denote the first-order and second-order image partial derivatives, and LBPmode(r) denotes the LBP image obtained for a given mode and a given radius r. We use nine LBP images associated with three different modes, mode ∈ {uniform, rotation invariant, uniform & rotation invariant}, and three different radii, r ∈ {1, 2, 3}. For all LBP images, the number of neighboring points is fixed to 8. Since the number of channels used is 23, the descriptor of each region has 23 × 24/2 = 276 features. We stress that even though the covariance matrix descriptor uses color channels and LBP images, it is still different from the color histogram and LBP histogram descriptors.
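The covariance descriptor and its Log-Euclidean mapping can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the matrix logarithm is computed through an eigendecomposition, and a small ridge term is added so that the logarithm exists even when the sample covariance is rank deficient.

```python
import numpy as np

def log_covariance_descriptor(features, ridge=1e-6):
    """Vectorized log-covariance descriptor of one region.

    features: (n, d) array holding one d-dimensional feature vector per pixel.
    Returns the d * (d + 1) / 2 upper-triangular entries of log(S_R).
    """
    v = np.asarray(features, dtype=float)
    s = np.cov(v, rowvar=False)               # sample covariance, 1/(n-1) factor
    # Assumption: a small ridge keeps all eigenvalues strictly positive.
    s = s + ridge * np.eye(s.shape[0])
    w, u = np.linalg.eigh(s)                  # S = U diag(w) U^T for symmetric S
    log_s = (u * np.log(w)) @ u.T             # matrix logarithm of S
    rows, cols = np.triu_indices(s.shape[0])  # keep each symmetric entry once
    return log_s[rows, cols]
```

With d = 23 channels the returned vector has 276 entries, matching the count given above. Note that the Euclidean distance between two such vectors counts each off-diagonal entry once, so it is not identical to the norm of the full matrix difference.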
32.3.2.4 Hybrid Descriptors
Hybrid descriptors can be obtained by concatenating the feature vectors provided by different descriptors. While this can enrich the discrimination capacity of the resulting descriptor, it has the disadvantage that the dimensionality of the resulting feature vector can be very high.
32.3.3 TRAINING AND TESTING
As in any training process, we need a set of segmented regions with known labels. In other words, each segmented region that will be used as a training example should have been identified as background or building. In our case, this process is semi-automatic. In order to get a training set that contains regions belonging to the two classes (background and building) with ground-truth labels, we proceed as follows. The building footprints are first manually delineated in each training orthophoto. Each such ground-truth map is then overlaid on the corresponding automatically over-segmented orthophoto. The label of any segmented region can be inferred from the size of its intersection with the ground-truth building pixels. Any segmented region whose overlap with a building footprint exceeds 90% of its size is labeled as building. Any segmented region whose overlap with the building footprints is below 3% of its size receives the non-building label. The segmented regions that meet neither condition are discarded and are not included in the set of training samples. This selection scheme ensures that the descriptors used are associated with their own classes. The reason for using these thresholds is that an automatically segmented region may be shared by a building region and a background region. Fig. 32.3 illustrates the semi-automatic training process: (A) depicts a part of an original orthophoto, (B) illustrates the output of the SRM image segmentation algorithm, (C) illustrates the ground-truth building footprints obtained manually, and (D) depicts the regions labeled as building and those labeled as background. In total, there are 19 regions that are labeled as positive examples (building samples).
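The labeling rule above (more than 90% overlap gives a building label, less than 3% gives a non-building label, anything else is discarded) can be sketched as follows. The function name, the integer region-id map, and the dictionary output are assumptions chosen for illustration.

```python
import numpy as np

def label_regions(segmentation, building_mask, hi=0.90, lo=0.03):
    """Assign training labels to segmented regions from a ground-truth mask.

    segmentation: integer array giving the region id of every pixel.
    building_mask: boolean array of the same shape (True = building pixel).
    Returns {region_id: 1 for building, 0 for background}; regions whose
    overlap falls between the two thresholds are left out (discarded).
    """
    seg = np.asarray(segmentation).ravel()
    mask = np.asarray(building_mask, dtype=bool).ravel()
    labels = {}
    for rid in np.unique(seg):
        # Fraction of the region's pixels covered by building footprints.
        overlap = mask[seg == rid].mean()
        if overlap > hi:
            labels[int(rid)] = 1
        elif overlap < lo:
            labels[int(rid)] = 0
    return labels
```

Discarding the ambiguous regions trades training-set size for cleaner labels, which is the motivation given in the text for the two thresholds.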
In brief, the training process consists of (i) obtaining a set of training descriptors together with their labels, and (ii) learning a classifier that can separate building regions from non-building ones. In the testing phase, the same processing pipeline is adopted. First, the orthophoto is automatically segmented using the same segmentation algorithm used in training. Second, the descriptor associated with each segmented region in the test orthophoto is computed. Third, a classifier decides the class of every segmented region in the test orthophoto using the set of training descriptors. Based on this decision, the region pixels are labeled as building or non-building. The building pixels are grouped into connected components that represent the detected buildings in the test orthophoto.
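The last two steps of the testing phase, turning region-level decisions into a pixel mask and grouping building pixels into connected components, can be sketched as below. The helper names and the use of 4-connectivity are assumptions; the chapter does not state which connectivity was used.

```python
import numpy as np
from collections import deque

def building_mask(segmentation, region_labels):
    """Pixel mask that is True wherever the region was classified as building."""
    seg = np.asarray(segmentation)
    building_ids = [rid for rid, lab in region_labels.items() if lab == 1]
    return np.isin(seg, building_ids)

def connected_components(mask):
    """Label 4-connected components of a boolean mask with 1, 2, ..."""
    mask = np.asarray(mask, dtype=bool)
    comp = np.zeros(mask.shape, dtype=int)
    count = 0
    for r, c in zip(*np.nonzero(mask)):
        if comp[r, c]:
            continue                      # pixel already belongs to a component
        count += 1
        comp[r, c] = count
        queue = deque([(r, c)])
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not comp[ny, nx]):
                    comp[ny, nx] = count
                    queue.append((ny, nx))
    return comp, count
```

Each nonzero label in the returned map would then correspond to one detected building.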
32.4 PERFORMANCE EVALUATION
In this section, we present two groups of experiments. The first group evaluates segmentation algorithm-descriptor pairs. The second group studies the performance of different classifiers.
FIGURE 32.3 Semi-automatic generation of training examples. (A) Original orthophoto portion. (B) Associated automatic over-segmentation. (C) Associated ground-truth building delineation (manual delineation). (D) Labeled regions by overlapping the ground-truth footprints with the segmented regions.
32.4.1 DATA SET
The data set used in this research to evaluate the accuracy of the proposed framework consists of 12 large orthophotos depicting several zones in the region of the city of Belfort, situated in northeastern France. The spatial resolution of these orthophotos, provided by Communauté de l’Agglomération Belfortaine (CAB 2008), is 16 cm/pixel. These orthophotos contain about 200 buildings, whose roofs have different colors and textures. Furthermore, the background contains highly varying appearances corresponding to vegetation, cars, roads, and other objects.
32.4.2 EVALUATION METRICS AND PROTOCOL
In order to get a quantitative evaluation, we use the ground-truth building maps. The manually delineated buildings (ground-truth buildings) were used as a reference building set to evaluate the whole automated building-extraction accuracy. The automatically detected buildings and the ground-truth buildings are compared pixel-by-pixel. All pixels in the test orthophoto are grouped into four categories.
(1) True positive (TP). The automated and manual techniques classify the given pixel as belonging to the buildings.
Table 32.1 Segmentation Algorithms: Acronyms and Parameters
Acronym | Algorithm | Parameters
SRM-1 | Statistical Region Merging | Q = 2000
SRM-2 | Statistical Region Merging | Q = 10000
SRM-3 | Statistical Region Merging | Q = 15000
SRM-4 | Statistical Region Merging | Q = 20000
SRM-5 | Statistical Region Merging | Q = 30000
MS-1 | Mean Shift | hs = 9, hr = 1
MS-2 | Mean Shift | hs = 9, hr = 3
Turbo | Turbopixel | N = 7000
(2) True negative (TN). The automated and manual techniques classify the given pixel as belonging to the background.
(3) False positive (FP). The automated technique misclassifies the given pixel as belonging to a building.
(4) False negative (FN). The automated technique incorrectly classifies the given pixel as belonging to the background.
From these measures it is straightforward to compute the following scores associated with the building regions in the test image: recall, precision, F1 measure, accuracy, and Matthews correlation coefficient (MCC). The MCC returns a score between −1 and +1.
The two groups of experiments adopt the following evaluation protocol. The whole set of orthophotos is divided evenly into two subsets: training subset and test subset. The segmented regions in the training subset are used to learn image descriptors and a given classifier. The segmented regions in the test orthophotos are used to evaluate the automatic detection using the recall, precision, F1 measure, accuracy, and MCC. This process is repeated 20 times and the statistical scores are averaged over these 20 splits.
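The five scores can be computed from the four pixel counts as in the following sketch, which uses the standard definitions of these measures (the chapter does not write the formulas out). The function name and the handling of a zero MCC denominator are assumptions.

```python
import math

def pixel_scores(tp, tn, fp, fn):
    """Recall, precision, F1, accuracy, and MCC from pixel counts.

    Assumes at least one ground-truth building pixel and one detected
    building pixel, so that recall and precision are defined.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0 when the MCC denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}
```

In the protocol above, these scores would be computed once per random split and then averaged over the 20 splits.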
32.4.3 SEGMENTATION ALGORITHMS AND DESCRIPTORS
In this section, we study the performance of different segmentation algorithms and different descriptors. As segmentation methods, we consider (1) the SRM algorithm adopting five levels of segmentation, (2) the Mean Shift algorithm adopting two levels of segmentation, and (3) Turbopixel [25] adopting one level of segmentation. Table 32.1 summarizes the acronyms of these algorithms as well as their parameters. Figs. 32.4 and 32.5 illustrate the segmentation of an orthophoto obtained by the eight segmentation algorithms and levels. As can be seen, the quality and the number of regions depend on the segmentation technique and its parameters. We use four descriptors: (i) color histogram (RGB), (ii) color histogram and LBP (hybrid descriptor = RGB + LBP), (iii) covariance matrix (COV), and (iv) color histogram and covariance matrix (hybrid descriptor = RGB + COV). Tables 32.2, 32.3, 32.4, and 32.5 report the performance of the eight segmentation algorithms obtained with RGB, RGB + LBP, COV, and RGB + COV, respectively. The classifier used is a Support Vector Machine (SVM). In each table, the results correspond to 20 random training/test splits. These tables report the average and confidence interval of the recall, precision, F1 measure, accuracy, and MCC. The hybrid descriptors are obtained by simple concatenation of the feature vectors. The
FIGURE 32.4 SRM segmentation algorithm with different parameters. (A) Original orthophoto; (B) SRM-1; (C) SRM-2; (D) SRM-3; (E) SRM-4; (F) SRM-5.
FIGURE 32.5 Mean shift and turbopixel segmentation results. (A) Mean Shift-1; (B) Mean Shift-2; (C) Turbopixel.
Table 32.2 Average and 95% Confidence Interval of Recall, Precision, F1, Accuracy, and Matthews Correlation Coefficient (MCC) Corresponding to a Binary Classification (Pixel Level) Using Color Histograms. The Averages Correspond to 20 Random Splits Training/Test. The Results Are Obtained with SVM Classifier
Color Histogram (RGB)
Seg. | Recall (%) | Precision (%) | F1 (%) | Acc. (%) | MCC
SRM-1 | 82.7 ± 2.9 | 84.6 ± 2.1 | 83.5 ± 2.2 | 93.6 ± 1.4 | 0.80 ± 0.03
SRM-2 | 90.1 ± 1.9 | 79.9 ± 2.3 | 84.4 ± 1.1 | 93.9 ± 0.6 | 0.81 ± 0.01
SRM-3 | 89.2 ± 1.8 | 81.2 ± 2.8 | 84.8 ± 1.3 | 93.9 ± 0.9 | 0.81 ± 0.02
SRM-4 | 88.7 ± 1.6 | 81.8 ± 1.8 | 84.9 ± 1.4 | 94.3 ± 0.6 | 0.82 ± 0.02
SRM-5 | 90.2 ± 2.2 | 80.4 ± 2.0 | 84.8 ± 1.0 | 94.0 ± 0.7 | 0.81 ± 0.01
MS-1 | 90.5 ± 1.2 | 78.2 ± 2.9 | 83.6 ± 1.6 | 93.7 ± 0.9 | 0.80 ± 0.02
MS-2 | 90.8 ± 2.1 | 79.5 ± 3.6 | 84.3 ± 1.5 | 93.7 ± 0.8 | 0.81 ± 0.02
TURBO | 89.5 ± 2.2 | 72.2 ± 3.1 | 79.6 ± 1.7 | 90.8 ± 1.6 | 0.75 ± 0.02
Table 32.3 Average and 95% Confidence Interval of Recall, Precision, F1, Accuracy, and Matthews Correlation Coefficient (MCC) Corresponding to a Binary Classification (Pixel Level) Using Both Color and LBP Descriptors (Color Histograms with LBPs). The Averages Correspond to 20 Random Splits Training/Test. The Results Are Obtained with the SVM Classifier
Hybrid Descriptor (RGB + LBP)
Seg. | Recall (%) | Precision (%) | F1 (%) | Acc. (%) | MCC
SRM-1 | 87.1 ± 3.0 | 90.7 ± 1.5 | 88.6 ± 1.8 | 95.8 ± 0.7 | 0.86 ± 0.02
SRM-2 | 86.8 ± 1.3 | 89.2 ± 1.2 | 87.8 ± 0.5 | 95.6 ± 0.2 | 0.85 ± 0.01
SRM-3 | 86.4 ± 1.1 | 88.0 ± 1.2 | 87.0 ± 0.4 | 95.4 ± 0.3 | 0.84 ± 0.01
SRM-4 | 85.6 ± 1.2 | 88.7 ± 0.8 | 87.0 ± 0.6 | 95.4 ± 0.2 | 0.84 ± 0.01
SRM-5 | 87.0 ± 1.3 | 87.9 ± 1.2 | 87.2 ± 0.3 | 95.5 ± 0.2 | 0.85 ± 0.00
MS-1 | 86.0 ± 1.0 | 86.3 ± 1.0 | 86.0 ± 0.5 | 94.8 ± 0.2 | 0.83 ± 0.01
MS-2 | 87.8 ± 1.1 | 86.6 ± 1.8 | 86.8 ± 0.9 | 95.4 ± 0.4 | 0.84 ± 0.01
TURBO | 84.2 ± 3.4 | 85.3 ± 2.0 | 84.4 ± 1.6 | 94.4 ± 0.3 | 0.81 ± 0.02
Table 32.4 Average and 95% Confidence Interval of Recall, Precision, F1, Accuracy, and Matthews Correlation Coefficient (MCC) Corresponding to a Binary Classification (Pixel Level) Using Covariance Descriptors. The Averages Correspond to 20 Random Splits Training/Test. The Results Were Obtained with the SVM Classifier
Covariance Matrix Descriptor (COV)
Seg. | Recall (%) | Precision (%) | F1 (%) | Acc. (%) | MCC
SRM-1 | 87.3 ± 1.6 | 87.2 ± 1.5 | 87.1 ± 0.8 | 94.79 ± 0.6 | 0.84 ± 0.01
SRM-2 | 85.6 ± 1.1 | 86.1 ± 1.1 | 85.7 ± 0.5 | 94.92 ± 0.2 | 0.83 ± 0.01
SRM-3 | 86.4 ± 0.7 | 86.1 ± 0.9 | 86.1 ± 0.4 | 94.93 ± 0.2 | 0.83 ± 0.01
SRM-4 | 86.5 ± 1.0 | 85.6 ± 1.3 | 85.9 ± 0.5 | 94.71 ± 0.2 | 0.83 ± 0.01
SRM-5 | 85.3 ± 1.4 | 86.8 ± 1.3 | 85.9 ± 0.5 | 94.80 ± 0.3 | 0.83 ± 0.01
MS-1 | 85.4 ± 1.3 | 85.7 ± 1.5 | 85.3 ± 0.4 | 94.61 ± 0.2 | 0.82 ± 0.00
MS-2 | 86.7 ± 0.9 | 88.2 ± 1.2 | 87.3 ± 0.6 | 95.28 ± 0.4 | 0.84 ± 0.01
TURBO | 88.1 ± 1.6 | 83.4 ± 1.4 | 85.3 ± 0.6 | 94.27 ± 0.5 | 0.82 ± 0.01
Table 32.5 Average and 95% Confidence Interval of Recall, Precision, F1, Accuracy, and Matthews Correlation Coefficient (MCC) Corresponding to a Binary Classification (Pixel Level) Using Hybrid Descriptors (Color Histograms and Covariance Descriptors). The Averages Correspond to 20 Random Splits Training/Test. The Results Were Obtained with the SVM Classifier
Hybrid Descriptor (RGB + COV)
Seg. | Recall (%) | Precision (%) | F1 (%) | Acc. (%) | MCC
SRM-1 | 87.9 ± 2.1 | 91.6 ± 1.1 | 89.6 ± 1.4 | 96.0 ± 0.4 | 0.87 ± 0.01
SRM-2 | 88.1 ± 0.9 | 89.3 ± 0.9 | 88.6 ± 0.5 | 96.0 ± 0.2 | 0.86 ± 0.01
SRM-3 | 88.4 ± 1.1 | 88.6 ± 0.8 | 88.4 ± 0.5 | 96.0 ± 0.1 | 0.86 ± 0.01
SRM-4 | 88.4 ± 0.9 | 89.0 ± 0.9 | 88.6 ± 0.6 | 95.8 ± 0.2 | 0.86 ± 0.01
SRM-5 | 88.4 ± 1.2 | 88.6 ± 1.0 | 88.4 ± 0.7 | 95.8 ± 0.2 | 0.86 ± 0.01
MS-1 | 88.3 ± 1.0 | 87.6 ± 1.3 | 87.8 ± 0.6 | 95.5 ± 0.2 | 0.85 ± 0.01
MS-2 | 88.6 ± 0.7 | 88.7 ± 1.8 | 88.4 ± 1.0 | 95.8 ± 0.5 | 0.86 ± 0.01
TURBO | 87.3 ± 4.0 | 86.2 ± 1.4 | 86.4 ± 1.8 | 94.8 ± 0.3 | 0.83 ± 0.02
feature vector lengths of RGB, RGB + LBP, COV, and RGB + COV are, respectively, 4096, 4411, 276, and 4372. It is worth noting that our evaluation is performed without any post-processing of the detection results. In other words, we evaluate the classification results at pixel level without adding ad hoc post-processing schemes that could reduce the false positive rate. As can be seen, based on the five statistical scores, the best performance over the 32 segmentation algorithm-descriptor combinations is obtained with SRM-1 and the hybrid descriptor RGB + COV. Regarding the segmentation algorithms, the best ones are SRM-1, SRM-2, and MS-2. We can also observe that the best performances are obtained with the hybrid descriptors. The worst performance is obtained with the color descriptor alone. For example, the 93.16% accuracy obtained with the color descriptor becomes 95.85% (color + LBP), 95.10% (covariance), and 96.03% (color + covariance), which corresponds to an absolute improvement of about 2–3%. Three percent of an orthophoto may correspond to several tens of square meters.
Table 32.6 Average and 95% Confidence Interval of Recall, Precision, F1, Accuracy, and Matthews Correlation Coefficient (MCC) Corresponding to a Binary Classification (Pixel Level) Based on Covariance Descriptor. The Segmentation Method Is the SRM-1
Covariance descriptor
Classification | Recall (%) | Precision (%) | F1 (%) | Acc. (%) | MCC
1-NN | 81.4 ± 1.8 | 84.2 ± 1.3 | 82.6 ± 0.8 | 93.0 ± 0.7 | 0.78 ± 0.01
3-NN | 85.1 ± 1.4 | 84.8 ± 1.9 | 84.8 ± 0.7 | 93.5 ± 0.5 | 0.81 ± 0.01
PLS | 82.5 ± 1.7 | 87.3 ± 1.1 | 84.8 ± 0.9 | 93.5 ± 0.6 | 0.81 ± 0.01
SVM | 87.3 ± 1.6 | 87.2 ± 1.5 | 87.1 ± 0.8 | 94.8 ± 0.6 | 0.84 ± 0.01
NL PLS | 87.4 ± 2.1 | 91.7 ± 0.9 | 89.3 ± 1.1 | 96.3 ± 0.3 | 0.87 ± 0.01
We can observe that the covariance descriptor produces very good results despite its low number of features (276), which is about 15 times smaller than that of the color descriptor (4096). For classifier training, compact predictive variables are advantageous. Thus, in practice the covariance descriptor can be a good trade-off between accuracy and training efficiency.
32.4.4 CLASSIFIERS PERFORMANCE
In this section, we study the performance of the classifiers used to classify the segmented regions. We have used five classifiers: K-Nearest Neighbor (K-NN) with K = 1 and K = 3, SVM [26], linear Partial Least Squares (PLS), and non-linear Partial Least Squares (NL PLS) [27,28]. For SVM, we used Gaussian kernels. For PLS, the number of latent components is set to 50. Table 32.6 summarizes the average and confidence interval of the recall, precision, F1 measure, accuracy, and MCC obtained by the five classifiers. The segmentation algorithm is SRM-1 and the descriptor is the covariance matrix. For each classifier, the results correspond to 20 random training/test splits.
Fig. 32.6 illustrates the automatic building detection obtained by applying the proposed framework to two orthophotos. The image segmentation is obtained with the SRM-1 algorithm. The region descriptor is RGB + COV. The classifier used is the SVM. The right part illustrates the automatic building delineation (delimited by closed yellow contours) overlaid with the ground-truth delineation (shown in dark red). One can also see the false positive and false negative regions. The former are blue pixels within a detected region, and the latter are red pixels outside of any detected region. As can be seen, the developed framework demonstrates excellent accuracy in terms of building boundary extraction, i.e., the majority of the building roofs present in the image are detected with good boundary delineation. Indeed, the proposed framework gives reliable results for complex environments containing buildings with red and non-red rooftops and/or buildings with very complex shapes.
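Of the five classifiers compared above, K-NN is the simplest to sketch. The following illustration classifies region descriptors by majority vote among the K nearest training descriptors; the function name, the Euclidean distance, and the 0/1 label encoding are assumptions, since the chapter does not detail its K-NN implementation.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Binary K-NN: label 1 (building) or 0 (background) per test descriptor.

    train_x: (n_train, d) descriptors, train_y: (n_train,) labels in {0, 1},
    test_x: (n_test, d) descriptors. k is assumed odd so votes cannot tie.
    """
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=int)
    test_x = np.asarray(test_x, dtype=float)
    # Squared Euclidean distances between every test and training descriptor.
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest samples
    votes = train_y[nearest]                  # (n_test, k) neighbor labels
    return (2 * votes.sum(axis=1) > k).astype(int)
```

For the covariance descriptor, the Euclidean distance between vectorized log-covariance matrices is in the spirit of the Log-Euclidean distance of Section 32.3.2.3, which makes it a natural, though assumed, choice here.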
32.5 CONCLUSION
In this chapter, we have addressed building detection in orthophotos. The detection is achieved through the use of image segmentation and descriptor classification. The proposed framework has several strengths that cannot be found jointly in other frameworks. These are as follows. First, the proposed
FIGURE 32.6 Building detection results with SRM-1 segmentation algorithm and RGB + COV descriptor. The classifier is SVM.
framework does not require active sensors. Second, the framework does not need a priori assumptions on the shape of building footprints. Third, at run time, there is no user interaction. Fourth, the proposed framework (detection chain and performance study) is generic in the sense that it can be easily extended to the automatic detection of other categories of objects such as roads, vehicles, and vegetation. The main limitation of the proposed method is its reliance on the image segmentation algorithm. Indeed, the framework assumes that each segmented region contains either a building part or a non-building part. Where a building and its surroundings have very similar appearances, the segmented patches/regions may contain pixels belonging to both categories. Therefore, since the classifier assigns a single label to the whole patch, some pixels will be misclassified. It should be noted that this phenomenon also affects any other image-based building detection framework.
ACKNOWLEDGMENTS
We thank Elsevier for allowing us to reuse material for this chapter from our following paper: F. Dornaika, A. Moujahid, Y. EL Merabet, and Y. Ruichek “Building Detection from Orthophotos using a Machine Learning Approach: An Empirical Study on Image Segmentation and Descriptors” Expert Systems with Applications. Volume 58, Issue C, pp. 130–142, October 2016.
REFERENCES
[1] B. Sirmacek, C. Unsalan, Urban area and building detection using sift keypoints and graph theory, Comput. Vis. Image Underst. 47 (4) (2009) 1156–1167.
[2] C. Unsalan, K. Boyer, A system to detect houses and residential street networks in multispectral satellite images, Comput. Vis. Image Underst. 98 (3) (2005) 423–461.
[3] N.T. Quang, N.T. Thuy, D.V. Sang, H.T.T. Binh, An efficient framework for pixel-wise building segmentation from aerial images, in: Proceedings of the Sixth International Symposium on Information and Communication Technology, SoICT 2015, 2015, pp. 282–287.
[4] M. Sun, L. Pang, H. Liu, X. Zhang, L. Ai, S. He, Urban extraction based on multi-scale building information extrasegmentation and SRA coherence image, in: Geo-Informatics in Resource Management and Sustainable Ecosystem, Springer, 2016.
[5] O. Tournaire, M. Brédif, M. Boldo, D. Durupt, An efficient stochastic approach for buildings footprint extraction from digital elevation models, ISPRS J. Photogramm. Remote Sens. 65 (2010) 317–327.
[6] O. Wang, S.K. Lodha, D.P. Helmbold, A Bayesian approach to building footprint extraction from aerial LIDAR data, in: International Symposium on 3D Data Processing, Visualization, and Transmission, 2006.
[7] K. McGuinness, N.E. O’Connor, A comparative evaluation of interactive segmentation algorithms, Pattern Recognit. 43 (2010) 434–444.
[8] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 898–916.
[9] R. Nock, F. Nielsen, Statistical region merging, IEEE Trans. Pattern Anal. Mach. Intell. 26 (11) (2004) 1452–1458.
[10] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 603–619.
[11] X. Ren, J. Malik, Learning a classification model for segmentation, IEEE Int. Conf. Comput. Vis. (2003) 10–17.
[12] T.-T. Ngo, C. Collet, V. Mazet, Automatic rectangular building detection from VHR aerial imagery using shadow and image segmentation, in: 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 1483–1487.
[13] G. Liasis, S. Stavrou, Building extraction in satellite images using active contours and colour features, Int. J. Remote Sens. 37 (5) (2016) 1127–1153.
[14] J. Wu, J. Xu, J. Zhao, N. Li, S. Xiang, Comparison of several features of building detection in remote sensing image, in: International Conference on Mechatronics and Industrial Informatics, 2015.
[15] S. Saito, T. Yamashita, Y. Aoki, Multiple object extraction from aerial imagery with convolutional neural networks, J. Imag. Sci. Technol. 60 (1) (2016) 010402.
[16] M. Vakalopoulou, K. Karantzalos, N. Komodakis, N. Paragios, Building Detection in Very High Resolution Multispectral Data with Deep Learning Features, Tech. Rep. hal 01264084, 2016, HAL hal.archives-ouvertes.fr.
[17] B. Sirmacek, C. Unsalan, Building detection from aerial images using invariant color features and shadow information, in: 23rd International Symposium on Computer and Information Sciences, 2008, ISCIS ’08, 2008, pp. 1–5.
[18] T. Gevers, A. Smeulder, Computing color and shape invariant features for image retrieval, IEEE Trans. Image Process. (2000) 102–119.
[19] T. Ojala, M. Pietikäinen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 971–987.
[20] V. Takala, T. Ahonen, M. Pietikäinen, Block-based methods for image retrieval using local binary patterns, in: Image Analysis, SCIA, in: LNCS, vol. 3540, 2005.
[21] T. Ahonen, A. Hadid, M. Pietikäinen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 2037–2041.
[22] M. Bereta, P. Karczmarek, W. Pedrycz, M. Reformat, Local descriptors in application to the aging problem in face recognition, Pattern Recognit. 46 (2013) 2634–2646.
[23] L. Wolf, T. Hassner, Y. Taigman, Descriptor based methods in the wild, in: Faces in Real-Life Images Workshop in ECCV, 2008.
[24] F.P. Tuzel, P. Meer, A fast descriptor for detection and classification, in: European Conf. on Computer Vision, 2006, pp. 589–600.
[25] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, K. Siddiqi, Turbopixels: fast superpixels using geometric flows, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2009) 2290–2297.
[26] D. Meyer, F. Leisch, K. Hornik, The support vector machine under test, Neurocomputing 55 (2003) 169–186.
[27] R. Rosipal, N. Kramer, Overview and recent advances in partial least squares, in: Subspace, Latent Structure and Feature Selection Techniques, Springer, 2006, pp. 34–51.
[28] N. Kramer, M. Braun, Kernelizing PLS, degrees of freedom, and efficient model selection, in: International Conference on Machine Learning, 2007, pp. 441–448.
CHAPTER 33
OBJECT-ORIENTED RANDOM FOREST FOR HIGH RESOLUTION LAND COVER MAPPING USING QUICKBIRD-2 IMAGERY
Taskin Kavzoglu
Gebze Technical University, Gebze, Turkey
33.1 INTRODUCTION
Land cover information about the Earth’s surface features in terms of their quantity, diversity, and spatial distribution has been identified as one of the crucial data components for many aspects of global change studies and environmental applications [40]. Such information is typically extracted from remotely sensed images through classification techniques. Image classification is a process of grouping pixels into several classes of land use/land cover (LULC) based on the application of statistical decision rules in the multispectral domain or logical decision rules in the spatial domain [15]. While classification in the multispectral domain considers solely the spectral information provided by image bands or channels, classification in the spatial domain uses the geometric shape, size, texture, and patterns of pixels or objects derived from some neighborhood analysis (e.g. segmentation). Extraction of LULC information from remotely sensed imagery, which results in a thematic map of the study area, has traditionally been carried out through pixel-based image classification. However, extracting such information from very high resolution (VHR) imagery is a complicated process due to the high degree of within-class spectral variability and between-class spectral similarity of the LULC types [9]. Moreover, pixel-based classification produces unnatural images with a so-called salt-and-pepper look that makes them inappropriate for many studies. This problem becomes more serious as the spatial resolution of the image increases.
Handbook of Neural Computation. DOI: 10.1016/B978-0-12-811318-9.00033-8 Copyright © 2017 Elsevier Inc. All rights reserved.
A new and evolving paradigm, namely Object-based Image Analysis (OBIA) or Geospatial Object-based Image Analysis (GEOBIA), has been introduced to overcome the above-mentioned difficulties and improve the quality of information extraction from imagery by considering not only the spectral information but also the spatial, textural, and contextual information of extracted image objects, which are produced in the so-called segmentation stage. OBIA has the potential to handle more complex image analysis tasks and produce more realistic thematic maps. It has become popular and attracted wide interest due to its efficiency in providing enhanced and reliable geospatial intelligence. Due to its unique advantages, it has been utilized by a significant portion of the remote sensing community for extracting information and producing thematic maps from remotely sensed imagery. It has been successfully employed in applications ranging from LULC classification to building extraction, and from change detection to forest management (e.g. [25,31,23]). Detailed comparisons of pixel-based and object-based classification have been provided in several studies, including [5,28,6].
Remotely sensed image data are known for their high degree of complexity and irregularity. Pixels in the imagery often include more than one feature or class (i.e. mixtures of man-made and natural objects within a scene) depending on the spatial resolution of the sensor and the structure or texture of the area under investigation [21]. Traditional image classification methods have become less effective given the magnitude of heterogeneity existing particularly in VHR images. Also, conventional classifiers rely on statistical estimates to classify pixels or image objects into LULC classes from the radiance values recorded by a sensor. Because of this, these classifiers make some statistical assumptions. For instance, the maximum likelihood classifier, which has been the most popular classification technique, assumes that the samples introduced to the classifier for each class in the training stage are normally distributed. Due to these restrictions and limitations, it is reported that the process of extracting land cover information from remotely sensed data is still far from being standardized or optimized [17]. Therefore, advanced techniques are required to handle large volumes of image data with high local variance without any constraint on the statistical distribution.
In the last three decades, machine learning algorithms including artificial neural networks, decision trees, and support vector machines have been intensively used for image classification since they learn the underlying relationship among LULC classes from the introduced training samples and provide an opportunity to incorporate ancillary data into the classification process. Recently, ensemble methods have been introduced to use multiple learning algorithms to obtain better predictive performance. The main idea behind the ensemble methodology, which is similar to the human decision-making process, is to weigh several individual classifiers and combine them in order to obtain a classifier that outperforms every one of them [38]. In ensemble learning, the results of individually trained classifiers, called base classifiers, are combined when classifying a new data set. In other words, the individual classifiers are trained on different subsets of the data set so that a set of diverse classifiers is built [27]. It was reported that ensemble classifiers give higher prediction accuracy and outperform individual classifiers (e.g. [33,16,22]). Due to their high potential, many ensemble methods have been proposed in the literature. The most popular ones are bagging, boosting, random subspace, and random forest methods. The random forest (RF) classifier is based on multiple decision trees: each tree is trained using a bootstrap (i.e. random) sample of the data set, and a simple majority vote is taken for the final prediction [7]. In recent years, the RF classifier employed in this study has received increasing attention from the remote sensing community due to the excellent classification results obtained, the opportunity to avoid overfitting, and the speed of processing [34,4]. In this study, the RF classifier is employed for LULC classification using an object-based image classification approach, which has lately been investigated by a limited number of studies (e.g.
[13,36]). Primary motivation of this study is to show the ease-of-use of the random forest ensemble method (referred to as ‘off the shelf’ method) and its effectiveness when combined with the object-based image classification. The ultimate aim of this study is to produce the highest classification accuracy by putting three crucial factors of image classification into operation: high quality image data acquired by a recent sensor (Quickbird-2 imagery), a mathematically robust classifier
FIGURE 33.1 Location of the study area, Yomra district of Trabzon province in Turkey.
(random forest), and a powerful image analysis approach (object-based image analysis). The classification performance of the random forest classifier is compared with that of the well-known k-nearest neighbor (k-NN) classifier for a study site in Trabzon province of Turkey. In addition to the standard accuracy measures of overall accuracy and Kappa coefficient, McNemar’s test of statistical significance was applied to make an objective comparison between the performances of the classifiers.
33.2 STUDY AREA AND DATA
The study area chosen for this research covers approximately 75 ha and is located in Trabzon province of Turkey. A four-band, radiometrically corrected, pan-sharpened Quickbird-2 satellite image of the study area, acquired on 5 May 2008, was used to pursue the research objectives of this study (Fig. 33.1). The selected study area is a semi-urban site including seaside, residential areas, main roads, and industrial sites. The area is bounded on the north by the Black Sea and extends south up to forested land. This particular image was selected to assess the effectiveness of the random forest ensemble method since it includes LULC classes with diverse spectral and spatial characteristics. Matlab (R2013a) was used in the evaluation of the ensemble learning method, and the Definiens eCognition Developer software package (v.8.7) was employed for the segmentation and nearest neighbor (k-NN) classification processes.
CHAPTER 33 OBJECT-ORIENTED RANDOM FOREST FOR HIGH RESOLUTION
FIGURE 33.2 Ground reference data used for (A) training, and (B) test samples.
Existing maps and aerial photographs were used to create a ground reference map that was later utilized in the formation of training and test samples (Fig. 33.2). After analyzing the existing and collected data and assessing inter-class separability, it was decided that ten LULC classes cover the bulk of the study site: forest, bare ground, pasture, water, gravel-concrete surface, shadow, red roof, metal roof, white roof, and gray roof.
33.3 METHODOLOGY
Classification of remotely sensed images is a common practice to acquire valuable information about the Earth’s surface, which is required for many local to global scale studies. In this study, classification of a VHR image is conducted with the recently popular object-based image analysis (OBIA), in which image objects (instead of pixels), derived from the image by considering spectral, textural, and contextual information, are the units assigned to LULC classes. In the experiments, an object-based RF classification was applied to classify the derived image objects. Image segmentation was performed using the multi-resolution segmentation algorithm. In order to show the robustness of the proposed approach, its performance was compared with that of the nearest neighbor (k-NN) classifier, which was taken as a benchmark method. The theory and application of OBIA and the random forest classifier are discussed in the following subsections.
33.3.1 OBJECT-BASED IMAGE ANALYSIS (OBIA)
Object-based image analysis (OBIA) is a systematic framework for geographic object identification, which combines pixels with the same semantic information into an object, thereby generating an integrated geographic object [30]. OBIA is a new approach (including theory, methods, and tools) to partition remote sensing imagery into meaningful image objects and assess their characteristics through scale. It is not limited to the remote sensing community but also embraces GIS, landscape ecology and GIScience concepts and principles, among others [6]. The ability to add various spatial, contextual, and textural features is a unique advantage of OBIA, but each additional object feature requires more training samples to learn the underlying characteristics of the LULC class types. This is known as the ‘curse of dimensionality’, to which the RF classifier is less sensitive than most parametric and non-parametric classifiers because it uses different portions of the samples in the training stage. OBIA has two major steps, namely image segmentation and classification. Image segmentation is a primary and crucial step to extract relevant image objects. It divides an image into meaningful parts (i.e. segments or objects) that have a strong correlation with objects or areas of the real world pictured in the image. The quality of segmentation, which is largely determined by the segmentation parameters set by the analyst, has a direct influence on the performance of the subsequent classification. Segmentation quality is also related to image quality, the number of image bands, the image resolution, and the complexity of the scene [3]. Most of the available segmentation algorithms need to be fine-tuned by the analyst to extract specific objects of interest, which is a subjective task usually carried out through manual trial and error.
Currently, there is no universally accepted method or algorithm for estimating optimal parameters for a particular problem. However, some supervised and unsupervised methods and strategies have been suggested in the literature [14,32,20]. Supervised methods utilize a manually produced segmentation map, called a reference map, whilst unsupervised methods estimate quality scores based on the segmented image itself. A tool called Estimation of Scale Parameter (ESP-2), introduced by [11], was employed in this study to estimate the optimal scale value for the Quickbird-2 image under consideration. The tool helps to estimate the optimal segmentation scale by producing a rate of change (RoC) graph of local variance (LV) considering all image bands (up to a maximum of 30). The previous version of the tool could only consider a single band for the analysis. Variation in heterogeneity is analyzed by evaluating the LV against the corresponding scale. The thresholds in rates of change of local variance (RoC-LV) indicate the scale levels at which the image can be segmented in the most appropriate manner, relative to the data properties at the scene level [12]. Multi-resolution segmentation, a bottom-up region merging method introduced by [2], has been the most popular method, particularly for segmenting remotely sensed imagery. It has three main user-defined parameters, namely scale, shape, and compactness, that define within-object homogeneity. The scale parameter, which defines the average size of image objects, is considered the parameter with the greatest effect on segmentation quality [29,26]. The higher the value selected for the scale parameter, the larger the resulting objects. Some studies have also revealed that the shape and compactness parameters, both of which take values between 0 and 1, have some degree of effect on the resulting classification accuracy; therefore, they should also be set to optimal or near-optimal values [42,24].
However, these parameters have frequently been kept at constant values in previous studies (e.g. [28,29]). Two problematic cases are usually distinguished in segmentation: over-segmentation and under-segmentation. In over-segmentation, the number of regions is larger than the number of objects in the image. Under-segmentation is the opposite case, in which individual objects include multiple land cover types.
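The rate-of-change analysis of local variance described above can be illustrated with a short sketch. It assumes the RoC definition commonly used with the ESP tool, RoC = (LV_l - LV_(l-1)) / LV_(l-1) x 100, where LV_l is the mean local variance at scale level l. The function names and all LV values below are hypothetical and were made up for illustration; they are not taken from the study.

```python
def rate_of_change(local_variance):
    """Percentage change of LV between each scale level and the one below it."""
    return [
        (current - previous) / previous * 100.0
        for previous, current in zip(local_variance, local_variance[1:])
    ]


def first_peak(values):
    """Return the index of the first local maximum, or None if there is none."""
    for k in range(1, len(values) - 1):
        if values[k - 1] < values[k] > values[k + 1]:
            return k
    return None


# Hypothetical scale levels and mean local variance values (illustration only).
scales = [20, 21, 22, 23, 24, 25, 26]
lv = [100.0, 102.0, 103.5, 104.5, 107.0, 108.0, 108.8]

roc = rate_of_change(lv)  # roc[k] belongs to scales[k + 1]
peak = first_peak(roc)
candidate_scale = scales[peak + 1] if peak is not None else None
print(candidate_scale)
```

In practice the candidate scale suggested by such a peak would still be checked visually against the segmentation result, since the peak only reflects scene-level statistics.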
33.3.2 RANDOM FOREST CLASSIFIER
A random forest classifier (RF) is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [7]. Although the classifier was originally developed for the machine learning community, it has become popular in the remote sensing community, particularly for the classification of remotely sensed imagery, due to the accuracy level achieved, its speed, and its easy parameterization. The RF is a classifier that includes a large number of decision tree classifiers. Each tree is trained with a randomly selected (i.e. bootstrapped) subset of the training samples and variables to tackle the same problem, and the final classification is then based on a majority vote of the trees in the forest. It should be highlighted that the training samples for each tree are drawn from the original data set with replacement. The RF is in fact an advanced version of the bagging ensemble method. According to Joelsson et al. [19], the individuality of the trees is assured by three factors:
1. Each tree is trained using a random subset of the training samples.
2. During the growing process of a tree, the best split on each node in the tree is found by searching through m randomly selected features. For a data set with M features, m is selected by the user and kept much smaller than M.
3. Every tree is grown to its fullest to diversify the trees, so there is no pruning.
The RF classifier is superior to most image classification methods since it is non-parametric, capable of using continuous and categorical data, easy to parameterize, robust against overfitting, and not sensitive to noise in the data set (i.e. good at dealing with outliers in training data). Breiman and Cutler [8] underline that RF is unexcelled in accuracy among current algorithms. It has been reported that it can outperform conventional classifiers and some machine learning algorithms (e.g. [34,16,37]).
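The sampling and voting mechanics described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the Matlab implementation used in the study; it does not grow any trees, and the helper names, the sample count, and the example votes are hypothetical. It shows one bootstrap draw with replacement, the samples left out of that draw, a random feature subset of size m out of M, and a majority vote over per-tree predictions.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag(n_samples, in_bag):
    """Indices never drawn into the bootstrap sample."""
    chosen = set(in_bag)
    return [i for i in range(n_samples) if i not in chosen]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices to be searched at a node."""
    return rng.sample(range(n_features), m)


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
n = 1000                          # hypothetical number of training samples
in_bag = bootstrap_indices(n, rng)
oob = out_of_bag(n, in_bag)
oob_fraction = len(oob) / n       # expected to be near (1 - 1/n)**n, about 0.37

features = random_feature_subset(18, 4, rng)   # M = 18 features, m = 4

# Hypothetical predictions of five trees for a single image object.
votes = ["forest", "pasture", "forest", "forest", "shadow"]
winner = majority_vote(votes)
print(round(oob_fraction, 2), sorted(features), winner)
```

Because sampling is done with replacement, roughly a third of the samples are left out of each draw, which is what makes them available for internal validation of the forest.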
It is capable of handling large numbers of variables relative to the number of observations, which is a common problem known as ‘curse of dimensionality’ [41]. It can also give estimates of which variables are important in the classification, thus it is used as an embedded feature selection method in the literature [39,35]. In other words, the samples left-out in the training of each classifier (referred to as out-of-bag samples) are used for feature selection by determining the importance of different features during classification process. The success of random forest lies in the creation of decision trees that form the forest. This is conducted in two steps. In the first step, each tree is built using randomly selected samples (with replacement). Each tree in the forest is trained with different samples but of the same size. Two thirds of the training samples (i.e. in-bag samples) are used to train the trees, and the remaining one third (i.e. out-of-bag samples) is used in cross-validation to estimate how well the resulting RF model performs. The underlying philosophy in this approach is that the ‘strength’ of the trees is maintained while reducing the correlation between the trees. In the second step, split conditions for each node in the tree are decided considering predictor variables. It is important to choose the number of variables that provides sufficiently low correlation with adequate predictive power. Fortunately, the optimal range for the subset of predictor variables is quite wide and there are simple tests to set an optimal subset size [34,18]. There are two user-defined parameters in the application of the RF classifier: the number of features used at each node to create a tree and the number of trees to be grown in the forest. On the other hand, the classifier uses GINI index as an attribute selection measure to measure the impurity of an attribute
with respect to the classes. GINI index can be calculated from the following equation:
$$\sum\sum_{j \neq i} \left( f(C_i, T)/|T| \right) \left( f(C_j, T)/|T| \right) \qquad (33.1)$$
where $T$ is a given training data set, $C_i$ is the class that a randomly selected case (pixel or image object) belongs to, and $f(C_i, T)/|T|$ is the probability that the selected case belongs to class $C_i$ [34]. As the estimated GINI index increases, class heterogeneity increases; conversely, as the GINI index decreases, class homogeneity increases. If the GINI index of a child node is less than that of its parent node, then the split is successful. Tree splitting is terminated when the GINI index reaches zero, meaning that only one class is present at each terminal node in the tree [1]. Once all trees in the forest have been grown in this way, classification is conducted on the new data set.
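As a small worked illustration of Eq. (33.1), the sketch below computes the GINI index of a node from its class labels by summing the products of class probabilities over all pairs of different classes. The function names and the example labels are hypothetical. Comparing the parent with the size-weighted average of its children is an assumption borrowed from common decision-tree practice; the chapter itself only states that a child should have a lower index than its parent.

```python
from collections import Counter


def gini_index(labels):
    """Sum of p_i * p_j over all ordered pairs of different classes (i != j)."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(
        p_i * p_j
        for i, p_i in enumerate(probs)
        for j, p_j in enumerate(probs)
        if i != j
    )


def weighted_child_gini(left, right):
    """Size-weighted average GINI index of two child nodes (assumed criterion)."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)


# Hypothetical node containing image objects of two classes.
parent = ["forest"] * 3 + ["pasture"] * 3
left, right = parent[:3], parent[3:]   # a split that separates the classes

parent_gini = gini_index(parent)                 # mixed node
children_gini = weighted_child_gini(left, right) # both children are pure
print(parent_gini, children_gini)
```

A pure node yields zero because no pair of different classes exists, which corresponds to the stopping condition described in the text.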
33.3.3 MCNEMAR’S TEST FOR COMPARISON OF CLASSIFIER PERFORMANCES
McNemar’s test is a well-known statistical test to analyze the statistical significance of differences in classifier performances [10]. The test is a chi-square (χ²) test for goodness of fit comparing the distribution of counts expected under the null hypothesis to the observed counts [22]. It is applied to a 2 × 2 contingency table, the cells of which contain the number of samples correctly identified by both methods, the number incorrectly identified by both methods, and the numbers of samples classified correctly by only one of the two methods. The test statistic with continuity correction, which has 1 degree of freedom, is estimated from the following equation:
$$\chi^2 = \frac{\left( |n_{ij} - n_{ji}| - 1 \right)^2}{n_{ij} + n_{ji}} \qquad (33.2)$$
where $n_{ij}$ indicates the number of pixels misclassified by method i but classified correctly by method j, and $n_{ji}$ indicates the number of pixels misclassified by method j but classified correctly by method i. If the estimated test value is greater than the tabulated χ² value of 3.84 (1 degree of freedom, 95% confidence level), then it can be stated that the two methods differ in their performances. In other words, the difference in accuracy between methods i and j is said to be statistically significant.
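The test statistic of Eq. (33.2) is simple enough to compute by hand, as the following sketch shows. The function name and the two disagreement counts are hypothetical and chosen only to illustrate the arithmetic; they are not the counts obtained in this study.

```python
CRITICAL_VALUE_95 = 3.84  # chi-square table value for 1 degree of freedom


def mcnemar_chi2(n_ij, n_ji):
    """McNemar statistic with continuity correction, as in Eq. (33.2)."""
    discordant = n_ij + n_ji
    if discordant == 0:
        return 0.0  # the two methods never disagree, so there is nothing to test
    return (abs(n_ij - n_ji) - 1) ** 2 / discordant


# Hypothetical counts: 40 samples wrong only under method i,
# 15 samples wrong only under method j.
statistic = mcnemar_chi2(40, 15)
is_significant = statistic > CRITICAL_VALUE_95
print(round(statistic, 2), is_significant)
```

Only the samples on which the two classifiers disagree enter the statistic; samples that both methods classify correctly, or both misclassify, do not affect it.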
33.4 RESULTS
The performance of the RF classifier used in conjunction with the object-based approach was compared to that of the k-nearest neighbor classifier in the classification of ten LULC classes using a Quickbird-2 image. Before implementing the object-based RF classifier, optimal parameters were defined for the RF classifier and for object-based image analysis, which is of crucial importance for obtaining a highly accurate thematic map of the study area. The RF classifier is known as an ‘off the shelf’ classifier as only two user-defined parameters, namely the number of trees (M) and the number of variables at each split (n), are required (note that M here denotes the number of trees, not the number of features as in Section 33.3.2). To find the optimal values for these parameters, a number of experiments were conducted on the data set; the number of trees (M) was set to 250 and the number of variables at each node (n) was set to 4, which is approximately the square root of the number of input variables (18 in total; √18 ≈ 4.24). Parameter setting of OBIA is a relatively more important and challenging task since there is no method or algorithm universally
FIGURE 33.3 LV-RoC graph of the image. Segmentation scale of 24 was selected as the optimal one.
accepted in the literature. Multi-resolution segmentation applied in this study has three user-defined parameters: scale, shape, and compactness. As discussed in Section 33.3.1, an unsupervised image analysis tool called ESP-2 was employed in this study to estimate the optimal scale value. It can be seen from Fig. 33.3 that the first peak on the rate of change graph, which theoretically indicates the optimal scale value, is at a scale of 24. This scale level was used in all OBIA experiments. Since they have a limited effect on the performance of OBIA, the shape and compactness parameters were kept constant at 0.1 and 0.5, respectively, as suggested by many studies (e.g. [29,13]). It should also be noted that equal weights were assigned to the multispectral bands of the Quickbird-2 image. In order to evaluate the performance of the object-based RF classifier, image objects were first formed using the optimal segmentation parameters and then classified using the RF classifier, which was trained using the training samples selected from the regions shown in Fig. 33.2A. The k-NN classifier was also applied to the image objects derived from the image. A total of 18 spectral object features, namely the mean, minimum and maximum values of objects in all four bands, NDVI, brightness, and band ratios for all spectral bands, were considered in all classification processes to improve classification accuracy, which is a major benefit of using OBIA in classification. Thematic maps showing the classification results of the RF and k-NN classifiers are given in Figs. 33.4 and 33.5, respectively. Visual interpretation of the figures in reference to Fig. 33.2 shows that the RF classifier identifies the shadow, bare ground, and gravel-concrete classes more accurately. Shadow regions on the seashore were mapped accurately, and bare ground fields were generally classified as homogeneous regions without the intrusion of other cover types. However, Fig.
33.5 reveals that the k-NN method eliminated most coastal shadows and identified some red roof objects inside the bare soil areas in the upper left corner of the study site. Similarly, asphalt road objects were identified inside gravel-concrete fields in the upper left corner, near the seashore. It was observed that the asphalt road and red roof classes were over-represented in the thematic map of the k-NN classifier (Fig. 33.5). Ground reference data (Fig. 33.2B) were also used to analyze the accuracy of the thematic maps and to perform detailed comparisons of the classification performances of the RF and k-NN methods. In order to avoid a bias towards a certain land cover type and to assign equal weight to each type in the estimation of accuracy metrics, 2000 randomly selected points per cover type were utilized in accuracy
FIGURE 33.4 Classification map produced using the RF classifier.
FIGURE 33.5 Classification map produced using the k-NN classifier.
Table 33.1 Accuracy Assessment for Object-Based Classification Using (A) Random Forest (RF), and (B) k-Nearest Neighbor (k-NN) Classifiers. Please note that UA Indicates User’s Accuracy and PA Indicates Producer’s Accuracy

(A) | A | B | C | D | E | F | G | H | I | J | UA
A | 1770 | 169 | 0 | 0 | 0 | 36 | 25 | 0 | 0 | 0 | 88.50
B | 3 | 1969 | 0 | 21 | 0 | 0 | 4 | 0 | 3 | 0 | 98.45
C | 0 | 0 | 1922 | 0 | 0 | 78 | 0 | 0 | 0 | 0 | 96.10
D | 0 | 357 | 0 | 1643 | 0 | 0 | 0 | 0 | 0 | 0 | 82.15
E | 0 | 46 | 0 | 3 | 1951 | 0 | 0 | 0 | 0 | 0 | 97.55
F | 0 | 1 | 47 | 0 | 0 | 1948 | 0 | 4 | 0 | 0 | 97.40
G | 152 | 6 | 0 | 0 | 0 | 4 | 1830 | 0 | 0 | 8 | 91.50
H | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 | 0 | 0 | 100.0
I | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1996 | 0 | 99.80
J | 0 | 7 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 1990 | 99.50
PA | 91.95 | 76.94 | 97.61 | 98.56 | 99.85 | 94.29 | 98.44 | 99.80 | 99.85 | 99.60 |
Overall accuracy = 95.10, Kappa coefficient = 0.946

(B) | A | B | C | D | E | F | G | H | I | J | UA
A | 1222 | 88 | 0 | 0 | 0 | 125 | 565 | 0 | 0 | 0 | 61.10
B | 1 | 1589 | 0 | 299 | 0 | 0 | 111 | 0 | 0 | 0 | 79.45
C | 0 | 0 | 1917 | 0 | 0 | 83 | 0 | 0 | 0 | 0 | 95.85
D | 0 | 101 | 0 | 1899 | 0 | 0 | 0 | 0 | 0 | 0 | 94.95
E | 0 | 4 | 0 | 0 | 1996 | 0 | 0 | 0 | 0 | 0 | 99.80
F | 0 | 0 | 558 | 0 | 0 | 1442 | 0 | 0 | 0 | 0 | 72.10
G | 77 | 5 | 0 | 0 | 0 | 1 | 1916 | 1 | 0 | 0 | 95.80
H | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1709 | 289 | 0 | 85.45
I | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 | 0 | 100.0
J | 1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1949 | 97.45
PA | 93.93 | 86.45 | 77.42 | 86.40 | 100.0 | 87.34 | 73.92 | 99.94 | 87.37 | 100.0 |
Overall accuracy = 88.20, Kappa coefficient = 0.869

Class key: A: bare ground; B: gravel-concrete; C: forest; D: asphalt road; E: blue roof; F: pasture; G: red roof; H: shadow; I: water; J: white roof
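The accuracy measures reported in Table 33.1 can be recomputed from the error matrix itself, as in the sketch below. It uses the counts of Table 33.1A and follows the layout of that table, in which UA is listed per row and PA per column; the function and variable names are hypothetical, and the Kappa coefficient is computed with the standard Cohen's kappa formula, which is assumed to be the one used in the study.

```python
def accuracy_metrics(matrix):
    """Overall accuracy, Kappa, and per-class UA (rows) and PA (columns)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    diagonal = [matrix[i][i] for i in range(size)]
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]

    overall = sum(diagonal) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (overall - expected) / (1 - expected)
    ua = [d / r for d, r in zip(diagonal, row_totals)]
    pa = [d / c for d, c in zip(diagonal, col_totals)]
    return overall, kappa, ua, pa


# Error matrix of the RF classifier (Table 33.1A), classes A to J.
rf_matrix = [
    [1770, 169, 0, 0, 0, 36, 25, 0, 0, 0],
    [3, 1969, 0, 21, 0, 0, 4, 0, 3, 0],
    [0, 0, 1922, 0, 0, 78, 0, 0, 0, 0],
    [0, 357, 0, 1643, 0, 0, 0, 0, 0, 0],
    [0, 46, 0, 3, 1951, 0, 0, 0, 0, 0],
    [0, 1, 47, 0, 0, 1948, 0, 4, 0, 0],
    [152, 6, 0, 0, 0, 4, 1830, 0, 0, 8],
    [0, 0, 0, 0, 0, 0, 0, 2000, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 1996, 0],
    [0, 7, 0, 0, 3, 0, 0, 0, 0, 1990],
]

overall, kappa, ua, pa = accuracy_metrics(rf_matrix)
print(f"overall = {overall:.4f}, kappa = {kappa:.4f}")
```

With these counts the overall accuracy is 19019/20000 (about 95.10%) and the Kappa value is about 0.9455, in line with the figures given under Table 33.1A.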
assessment. Confusion (i.e. error) matrices showing the classifiers’ performances on each LULC class are given in Table 33.1. When the error matrices of both classifiers are analyzed, it can be seen that the RF classifier clearly outperformed the k-NN classifier in the classification of the data set. The difference in overall accuracy reached about 7 percentage points (95.10% versus 88.20%), which can be described as a significant improvement. User’s accuracy values show that the RF classifier distinguished most LULC types more accurately, the most notable exception being the asphalt road class. The difference in individual class accuracies for the bare ground and pasture classes was over 25 percentage points. In other words, about a quarter more pixels of these particular classes were correctly identified on the thematic map produced by the RF classifier. Gravel-concrete and shadow
classes were also classified with significantly higher accuracies, by about 19 and 15 percentage points, respectively. The only class that the RF method classified markedly less accurately than the k-NN method was the asphalt road class. It can be seen from the error matrix (Table 33.1A) that 357 asphalt road pixels were misclassified as gravel-concrete. This indicates a difficulty in distinguishing asphalt and gravel-concrete pixels, which could result from their spectral similarity in the selected satellite image. Another reason might be outliers in the training samples, an issue that could be reduced by refining the data set. On the other hand, a significant number of bare ground pixels (565 pixels) were unexpectedly classified as red roof buildings by the k-NN classifier, which is a serious classification error that makes the produced thematic map unreliable and questionable for a GIS analysis. In order to analyze whether the differences in the classification accuracies produced by the RF and k-NN classifiers were statistically significant, a two-tailed McNemar’s test was applied. The test confirmed that the object-based RF classification was significantly better than the object-based k-NN classification (χ² = 809.21, p-value