Remote Sensing Big Data (Springer Remote Sensing/Photogrammetry) 3031339312, 9783031339318

This monograph provides comprehensive coverage of the collection, management, and use of big data obtained from remote s

English Pages 306 [298] Year 2023

Table of contents :
Contents
About the Authors
Chapter 1: Introduction
1.1 Concepts of Big Data
1.2 Features of Big Data
1.2.1 Big Data Volume
1.2.2 Big Data Velocity
1.2.3 Big Data Variety
1.2.4 Big Data Veracity
1.2.5 Big Data Value
1.3 Big Data Method and Technology
1.4 Remote Sensing Big Data
References
Chapter 2: Remote Sensing
2.1 Concepts
2.2 Sensors
2.2.1 Sensors by Radiometric Spectrums
2.2.1.1 Multi- and Hyperspectral Remote Sensing
2.2.1.2 Active Microwave Remote Sensing
2.2.1.3 Passive Microwave Remote Sensing
2.2.1.4 Active Optical Remote Sensing
2.2.1.5 GPS Remote Sensing
2.2.1.6 Imaging Sonar
2.2.2 Sensors by Work Mode
2.2.2.1 Frame
2.2.2.2 Whiskbroom
2.2.2.3 Pushbroom
2.2.2.4 Side Scanning
2.2.2.5 Conical Scanning
2.3 Platforms
2.3.1 Satellites
2.3.2 Airborne
2.3.3 In Situ
2.3.4 Shipborne
References
Chapter 3: Special Features of Remote Sensing Big Data
3.1 Volume of Remote Sensing Big Data
3.2 Variety of Remote Sensing Big Data
3.3 Velocity of Remote Sensing Big Data
3.4 Veracity of Remote Sensing Big Data
3.5 Value of Remote Sensing Big Data
References
Chapter 4: Remote Sensing Big Data Collection Challenges and Cyberinfrastructure and Sensor Web Solutions
4.1 Remote Sensing Big Data Collection Challenges
4.2 Remote Sensing Big Data Collection Cyberinfrastructure
4.2.1 Global Earth Observation System of Systems (GEOSS)
4.2.2 NASA Earth Observing System (EOS) Data and Information System (EOSDIS)
4.2.3 ESA Federated Earth Observation (FedEO)
4.3 Sensor Web
4.4 Applications
4.4.1 Climate
4.4.2 Weather
4.4.3 Disasters
4.4.4 Agriculture
References
Chapter 5: Remote Sensing Big Data Computing
5.1 Computing Power to Handle Big Data: Distributed and Parallel Computing
5.2 Evolution of Geospatial Computing Platform
5.2.1 Stand-Alone Software System Architecture
5.2.2 Client-Server Software System Architecture
5.2.3 Distributed Computing
5.3 Service-Oriented Architecture (SOA)
5.3.1 Service Roles
5.3.2 Service Operations
5.3.3 Service Chaining
5.3.4 Web Services
5.3.5 Common Technology Stack for Web Services
5.3.5.1 Web Services Description Language (WSDL)
5.3.5.2 Universal Description, Discovery, and Integration (UDDI)
5.3.5.3 The Simple Object Access Protocol (SOAP)
5.3.5.4 Business Process Execution Language (BPEL)
5.3.6 Web Service Applications
5.3.7 Web Service Standards
5.3.8 OGC Web Services
5.3.8.1 Operation Components
5.3.8.1.1 Client Services
5.3.8.1.2 Catalog and Registry Services
5.3.8.1.3 Data Services
5.3.8.1.4 Application Services
5.3.8.2 Data Components
5.3.8.2.1 Geospatial Data
5.3.8.2.2 Geospatial Metadata
5.3.8.2.3 Names
5.3.8.2.4 Relationship
5.3.8.2.5 Containers
5.4 High-Throughput Computing Infrastructure
5.4.1 Super Computing
5.4.2 Cluster Computer
5.4.3 Grid Computing
5.4.4 Cloud Computing
5.4.4.1 What Does the Cloud Provide?
5.4.4.2 What Make Cloud Possible?
5.4.4.3 Characteristics of Cloud Computing
5.4.4.4 Comparing Cloud Computing with Grid Computing
5.4.4.5 Software Platforms for Distributed Processing of Big Data in Cloud Computing
5.4.4.5.1 MapReduce with Hadoop
5.4.4.5.2 Spark
5.4.4.5.3 SCALE
5.4.4.5.4 Other Platforms
References
Chapter 6: Remote Sensing Big Data Management
6.1 Remote Sensing Big Data Governance
6.1.1 Strategy
6.1.2 Organizational Structure/Communications
6.1.3 Data Policy
6.1.4 Measurements
6.1.5 Technology
6.2 Remote Sensing Big Data Curation
6.2.1 Remote Sensing Big Data Organization
6.2.1.1 Data Format
6.2.1.2 Metadata
6.2.1.3 Map Projection
6.2.2 Remote Sensing Big Data Archiving
6.2.3 Remote Sensing Big Data Cataloging
6.2.4 Remote Sensing Big Data Quality Assessment
6.2.5 Remote Sensing Big Data Usability
6.2.6 Remote Sensing Big Data Version Control
6.3 Remote Sensing Big Data Dissemination Services
6.3.1 Data Discovery
6.3.2 Data Access
References
Chapter 7: Standards for Big Data Management
7.1 Standards for Remote Sensing Data Archiving
7.2 Standards for Remote Sensing Big Data Metadata
7.2.1 What Is Metadata?
7.2.2 The FGDC Content Standard for Digital Geospatial Metadata
7.2.3 The FGDC Remote Sensing Metadata Extensions
7.2.4 ISO 19115 Geographic Information—Metadata
7.2.4.1 ISO 19115-2
7.2.4.2 ISO 19115-1
7.2.5 ISO Standards for Data Quality
7.3 Standards for Remote Sensing Big Data Format
7.4 Standards for Remote Sensing Big Data Discovery
7.4.1 OGC Catalog Service for Web (CSW)
7.4.2 OpenSearch
7.5 Standards for Remote Sensing Big Data Access
7.5.1 OGC Web Coverage Service (WCS)
7.5.2 OGC Web Feature Service (WFS)
7.5.3 OGC Web Map Service (WMS)
7.5.4 OGC Sensor Observation Service (SOS)
7.5.5 OpenDAP
References
Chapter 8: Implementation Examples of Big Data Management Systems for Remote Sensing
8.1 CWIC
8.1.1 Introduction
8.1.2 CEOS WGISS
8.1.3 CWIC Architecture Design
8.1.4 CWIC System Implementation
8.1.5 Results and Conclusion
8.1.6 Future Work
8.2 The Registry in GEOSS GCI
8.2.1 Background
8.2.1.1 GEO
8.2.1.2 The Role of the Registry
8.2.2 The GEOSS Component and Service Registry
8.2.2.1 Functionalities
8.2.2.2 Concept
8.2.2.3 System Design
8.2.3 System Implementation
8.2.3.1 Logical Design and Main Functionalities
8.2.3.2 Registry Pages
8.2.3.3 The Registry
References
Chapter 9: Big Data Analytics for Remote Sensing: Concepts and Standards
9.1 Big Data Analytics Concepts
9.1.1 What Is Big Data Analytics?
9.1.2 Categories of Big Data Analytics
9.1.3 Big Data Analytics Use Cases
9.2 Remote Sensing Big Data Analytics Concepts
9.2.1 Remote Sensing Big Data Challenges
9.2.2 Categories of Remote Sensing Big Data Analytics
9.2.3 Processes of Remote Sensing Big Data Analytics
9.2.4 Objectives of Remote Sensing Big Data Analytics
9.3 Big Data Analytics Standards
9.3.1 IEEE Big Data Analytics Standards
9.3.2 ISO Big Data Working Group: ISO/IEC JTC 1/SC 42/WG 2
References
Chapter 10: Big Data Analytic Platforms
10.1 Big Data Analytic Platforms
10.2 Data Storage Strategy in Big Data Analytic Platforms
10.3 Data-Processing Strategy in Big Data Analytic Platforms
10.4 Tools in Big Data Analytic Platforms
10.5 Data Visualization in Big Data Analytic Platforms
10.6 Remote Sensing Big Data Analytic Platforms
10.6.1 GeoMesa
10.6.2 GeoTrellis
10.6.3 RasterFrames
10.7 Remote Sensing Big Data Analytic Services
10.7.1 Google Earth Engine
10.7.2 EarthServer—an Open Data Cube
10.7.3 NASA Earth Exchange
10.7.4 NASA Giovanni
10.7.5 Others
References
Chapter 11: Algorithmic Design Considerations of Big Data Analytics
11.1 Complexity of Remote Sensing Big Data Analytic Algorithms
11.2 Challenges and Algorithm Design Considerations from Volume
11.3 Challenges and Algorithm Design Considerations from Velocity
11.4 Challenges and Algorithm Design Considerations from Variety
11.5 Challenges and Algorithm Design Considerations from Veracity
11.6 Challenges and Algorithm Design Considerations from Value
References
Chapter 12: Machine Learning and Data Mining Algorithms for Geospatial Big Data
12.1 Distributed and Parallel Learning
12.2 Data Reduction and Approximate Computing
12.2.1 Sampling
12.2.2 Approximate Computing
12.3 Feature Selection and Feature Extraction
12.4 Incremental Learning
12.5 Deep Learning
12.6 Ensemble Analysis
12.7 Granular Computing
12.8 Stochastic Algorithms
12.9 Transfer Learning
12.10 Active Learning
References
Chapter 13: Modeling, Prediction, and Decision Making Based on Remote Sensing Big Data
13.1 A General Framework
13.2 Modeling
13.2.1 Data Models and Structures
13.2.2 Modeling with Remote Sensing Big Data
13.2.3 Validation with Remote Sensing Big Data
13.3 Decision Making
References
Chapter 14: Examples of Remote Sensing Applications of Big Data Analytics—Fusion of Diverse Earth Observation Data
14.1 The Concept of Data Fusion
14.1.1 Definitions
14.1.2 Classification of Data Fusion
14.2 Data Fusion Architectures
14.3 Fusion of MODIS and Landsat with Deep Learning
14.3.1 The Problem
14.3.2 Data Fusion Methods
References
Chapter 15: Examples of Remote Sensing Applications of Big Data Analytics—Agricultural Drought Monitoring and Forecasting
15.1 Agricultural Drought
15.2 Remote Sensing Big Data for Agricultural Drought
15.3 Geospatial Data Analysis Infrastructure GeoBrain
15.4 The Global Agricultural Drought Monitoring and Forecasting System Portal
References
Chapter 16: Examples of Remote Sensing Applications of Big Data Analytics—Land Cover Time Series Creation
16.1 Remote Sensing Big Data for Land Cover Classification
16.2 Land Cover Classification Methodology
16.3 Results and Discussions
References
Chapter 17: Geospatial Big Data Initiatives in the World
17.1 US Federal Government Big Data Initiative
17.1.1 Big Earth Data Initiative
17.1.2 NSF EarthCube
17.2 Big Data Initiative in China
17.3 Big Data Initiatives in Europe
17.4 Big Data Initiatives in Australia
17.5 Other Big Data Initiatives
References
Chapter 18: Challenges and Opportunities in the Remote Sensing Big Data
18.1 Challenges
18.2 Opportunities
References
Index

Author / Uploaded
Liping Di
Eugene Yu

Springer Remote Sensing/Photogrammetry

The Springer Remote Sensing/Photogrammetry series seeks to publish a broad portfolio of scientific books, aiming at researchers, students, and everyone interested in the broad field of geospatial science and technologies. The series includes peer-reviewed monographs, edited volumes, textbooks, and conference proceedings. It covers the entire area of Remote Sensing, including, but not limited to, land, ocean, atmospheric science and meteorology, geophysics and tectonics, hydrology and water resources management, earth resources, geography and land information, image processing and analysis, satellite imagery, global positioning systems, archaeological investigations, and geomorphological surveying.

Series Advisory Board:
Marco Chini, Luxembourg Institute of Science and Technology (LIST), Belvaux, Luxembourg
Manfred Ehlers, University of Osnabrueck
Venkat Lakshmi, The University of South Carolina, USA
Norman Mueller, Geoscience Australia, Symonston, Australia
Alberto Refice, CNR-ISSIA, Bari, Italy
Fabio Rocca, Politecnico di Milano, Italy
Andrew Skidmore, The University of Twente, Enschede, The Netherlands
Krishna Vadrevu, The University of Maryland, College Park, USA

Liping Di, Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA, USA

Eugene Yu, Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA, USA

ISSN 2198-0721 ISSN 2198-073X (electronic) Springer Remote Sensing/Photogrammetry

ISBN 978-3-031-33931-8
ISBN 978-3-031-33932-5 (eBook)
https://doi.org/10.1007/978-3-031-33932-5

© Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

About the Authors

Liping Di serves as Professor and Director of the Center for Spatial Information Science and Systems at George Mason University in Virginia, USA. He is internationally known for his extraordinary contributions to geospatial information science/geoinformatics, especially to the development of geospatial interoperability technology and of federal, national, and international geographic information and remote sensing standards. He was one of the core members in the development of the NASA EOSDIS data standards and is a pioneer in the development of web-based, advanced, distributed geospatial systems and tools. Dr. Di has engaged in geoinformatics and Earth system research for more than 25 years and has published over 500 publications.

Eugene Yu is a Research Professor and the Associate Director of the Center for Spatial Information Science and Systems, George Mason University, Fairfax, Virginia, USA. Dr. Yu received the B.Sc. degree in physical geography from Peking University, Beijing, China; the M.Sc. in environmental remote sensing from the University of Aberdeen, Aberdeen, UK; the M.S. in information systems and the M.S. in computer science from George Mason University, Fairfax, Virginia, USA; and the Ph.D. in geography, with a focus on remote sensing and geographic information systems, from Indiana State University, Terre Haute, Indiana, USA. His research interests include geographic information systems, remote sensing, intelligent image understanding, Sensor Web, semantic Web, computational vision, agro-informatics, and robotics.

Chapter 1

Introduction

Abstract This chapter introduces the concept of big data and its major features. Major dimensions of big data are discussed. The history of big data is introduced. The primary 5 Vs (Volume, Variety, Velocity, Veracity, and Value) are briefly described. The concept of remote sensing big data is briefly discussed.

Keywords Big data · Remote sensing big data · Big data features · Big data management · Big data analytics

Remote sensing has made massive contributions to the development of big data. Remote sensing big data has developed rapidly in academia, government, and industry (Liu et al. 2018).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_1

1.1 Concepts of Big Data

There are numerous definitions of big data. Wikipedia defines it as “a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software” (Wikipedia 2020). The Oxford English Dictionary has added “big data” and defines it as “computing data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data” (GilPress 2013). In Webopedia, big data is defined as “a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity” (Beal 2020). According to Vinck (2016), “Big-Data can be seen as the collection/generation, storage/communication, processing and interpretation of big volumes of data.” Gartner states that “‘Big data’ is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (Gartner http://www.gartner.com/it-glossary/big-data/).

“Big data” was first mentioned in conference papers; NASA scientists used “big data” in a conference presentation in 1997 (Cox and Ellsworth 1997a). However, the origin of the term is still debated, as discussed in Lohr (2013). According to a review paper by the economist Diebold (Diebold 2012), the origin of the term may be linked to a talk on big data in computer processing given by John R. Mashey, then Chief Scientist at SGI, titled “Big Data and the Next Wave of InfraStress” (URL: http://usenix.org/publications/library/proceedings/usenix99/invited_talks/mashey.pdf) (Mashey 1999). The first appearance of “big data” as a term in an academic book may be in Weiss and Indurkhya (Weiss and Indurkhya 1998). In 2000, a conference paper discussed big data in dynamic factor models in economics (Diebold 2000). In 1997, big data was described in Cox and Ellsworth (1997a) as follows:

Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.
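The situation described in the quotation above, a data set that does not fit in main memory, is commonly handled by processing the data in chunks rather than acquiring more resources. The following is a minimal Python sketch of that idea, not a method taken from Cox and Ellsworth; the function names `read_in_chunks` and `streaming_mean` and the chunk size are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks so that at most `chunk_size` values are held at once."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk


def streaming_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute the mean from a running sum and count, one chunk at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data set")
    return total / count


if __name__ == "__main__":
    # A generator stands in for a file or remote source too large to load at once
    # (an assumption made for this illustration).
    data = (float(i) for i in range(1, 1_000_001))
    print(streaming_mean(data, chunk_size=10_000))
```

Only one chunk is resident in memory at any time, so the memory footprint is bounded by the chunk size rather than by the size of the data set.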

By 2008, the term “big data” had been popularized by a number of prominent American computer scientists (Bryant et al. 2008). Although there is no standard definition of big data, the attempted definitions all include the concepts of a large amount of data and of data that is difficult to handle with traditional technologies. With such broad definitions, big data can be seen as a classic issue that traces back to the early days of computing, as listed in Table 1.1. A brief timeline of big data as a concept is shown in Fig. 1.1.

The fundamental problem is that data and information grow far faster than the technology for handling them advances. Another relevant term is “information explosion,” which was first used in 1941, according to the Oxford English Dictionary. In modern times we have always had to deal with large volumes of data, but what counts as a “large” volume keeps changing. A data set considered “big data” today may no longer be considered big data in 5 years’ time because of technological advances.

The origin of the big data problem can be traced back to the 1880s and the US census. The United States conducts a census every 10 years. In 1880, the US population was about 50 million, and the census for that year took 8 years to tabulate. It was estimated that the 1890 census would take more than 10 years using the then-available methods. Without any advancement in methodology, tabulation would not have been complete before the 1900 census had to be taken. The problem was then known as information overload. The Hollerith tabulating machine, which used punch cards, “tamed” this big data problem in the late 1800s: it was used to tabulate the 1890 census in about a year. Hollerith's tabulating machine business was a forerunner of IBM. Major milestones in the progress of big data are briefly summarized in Table 1.2.
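The census argument above is simple arithmetic: a tabulation keeps up only if it finishes before the next census begins. A minimal Python sketch of that check follows; the function name is hypothetical, and the value of 11 years is an assumed stand-in for the text's “more than 10 years.”

```python
def tabulation_done_before_next_census(census_year: int,
                                       tabulation_years: float,
                                       interval_years: int = 10) -> bool:
    """Return True if tabulation finishes before the next census is taken."""
    finish_year = census_year + tabulation_years
    next_census_year = census_year + interval_years
    return finish_year < next_census_year


if __name__ == "__main__":
    # 1880 census: 8 years to tabulate (figure from the text).
    print(tabulation_done_before_next_census(1880, 8))
    # 1890 census with the older methods: "more than 10 years";
    # 11 is an assumed value used only for illustration.
    print(tabulation_done_before_next_census(1890, 11))
    # 1890 census with the Hollerith machine: about 1 year (from the text).
    print(tabulation_done_before_next_census(1890, 1))
```

With the older methods the 1890 tabulation would still have been running when the 1900 census began, which is the backlog the paragraph describes.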

Table 1.1 Development of big data

Period | Description | References
1970s | Realize the limitation of computers to handle big data; Stored on several dozens of magnetic tapes, require sophisticated management systems; TREBIG for dealing with big data using FORTRAN program; Big data collection; Creating big data banks; Big data database; Expanding support of microchips to support big data; Collecting, processing, and storage of big data files | Campbell et al. (1970), Robinson (1971), Blosser et al. (1972), Hubaux (1973), Wainer et al. (1974), Zhelesov (1975), Bisiani and Greer (1978), Surden (1978), Mochmann and Müller (1979), Sugarman (1979), Woolsto (1979)
1980s | Big data processing; Large terminological data bank; Megabyte data (e.g., 176 MB); Big data set; Big data database (many billion bytes) | Mahey (1980), Liebl et al. (1982), Martin (1982), Asher (1983), Chen and Jacquemin (1988)
1990s | Big data collections and big data objects; Big data sets; Big data structure; Terabyte of data; “Big data” problems: manipulating and organizing large sets of data; Data quality is a big data legacy problem (Adriaans 1997); In theory, “big data” can lead to much stronger conclusions for data-mining applications, but in practice many difficulties arise (Weiss and Indurkhya 1998) | Adriaans (1997), Cox and Ellsworth (1997a, b), Killian (1998), Kenwright (1999), Mashey (1999), Van Roy and Haridi (1999), Bengston (1994), Knutson and McCusker (1997), Tremblay et al. (1998), Weiss and Indurkhya (1998), Lynch (2008)
2000s | Statistical data analyses often require access to big data files of different structure; Defines 3 “Vs” in 2001 (Volume, Velocity, and Variety); Data sharing (Lynch 2008); Big data curation and professional development (Howe et al. 2008) | Victor and Sund (1977), Laney (2001), Diebold (2003), Nelson (2008), Chisholm (2009), Nielsen (2009), Ranganathan et al. (2011), Howe et al. (2008)
2010s | More on veracity; Big data special issues and meeting in remote sensing big data; Image big data; Big data science; Big data analytics; Specialized big data | Preimesberger (2011), Liu (2015), Chi et al. (2015), Batty (2016), Al-Jepoori and Al-Khanjari (2018), Woo et al. (2018), Liu et al. (2018), Huang et al. (2018), Zhu (2019), Albornoz et al. (2020)


1 Introduction

Fig. 1.1 Timeline of Big Data

What makes big data a hot topic now, when big data has already existed for so many years? Three main reasons for the rapid development in dealing with big data in the 2010s are as follows:
• Data explosion: proliferation of digital mobile devices, popularity of social media, low cost of sensors, and advancement of sensor technologies.
• Breakthroughs in computing technologies: web services, cloud computing, massive storage, infrastructure as a service (IaaS), etc.
• Breakthroughs in data analytics that fully explore the value buried in big data: data-driven discovery becomes possible with data mining, deep learning, and artificial intelligence (AI).
Together, these three developments offer capabilities that were not available before. It is now possible to handle massive amounts of information in all sorts of formats, with near-instantaneous processing on ordinary, low-cost hardware.

1.2 Features of Big Data

Features of big data have been examined and described along many dimensions since Laney (2001) first looked at data in terms of dimensions. The original 3 Vs are Volume, Velocity, and Variety (Laney 2001). In the literature, one may find 4 Vs (Camacho et al. 2014), 5 Vs (Ishwarappa and Anuradha 2015), 6 Vs (Rahman et al. 2016), 7 Vs (Khan et al. 2014), 8 Vs (Traverso et al. 2019), 9 Vs (Mircea et al. 2017), 10 Vs (Khan et al. 2018), 11 Vs (Venkatraman and Venkatraman 2019), 12 Vs (Self 2014), 42 Vs (Shafer 2017), and even up to 51 Vs (Khan et al. 2019). In this book, five of the most featured Vs in big data are discussed: Volume, Velocity, Variety, Veracity, and Value, as shown in Fig. 1.2.

Table 1.2 Major events in the history of big data

Year Milestone and brief notes
1944 The Scholar and the Future of the Research Library: Fremont Rider, a Wesleyan University librarian, estimated that American university libraries were doubling in size every 16 years. Given this growth rate, Rider speculated that the Yale Library in 2040 would have “approximately 200,000,000 volumes, which will occupy over 6000 miles of shelves… [requiring] a cataloging staff of over six thousand persons.” (Rider 1944)
1956 Virtual memory: Developed by German physicist Fritz-Rudolf Güntsch as an idea that treated finite storage as infinite. Storage, managed by integrated hardware and software to hide the details from the user, permitted us to process data without the hardware memory constraints that previously forced the problem to be partitioned (making the solution a reflection of the hardware architecture, a most unnatural act) (Gordon 2015).
1961 Science since Babylon: Derek John de Solla Price charts the growth of scientific knowledge by looking at the growth in the number of scientific journals and papers. He concludes that the number of new journals has grown exponentially rather than linearly, doubling every 15 years and increasing by a factor of ten during every half century (de Solla Price 1961, 1978).
1975 Information flow census: The Ministry of Posts and Telecommunications in Japan starts conducting the information flow census, tracking the volume of information circulating in Japan. The census introduces “amount of words” as the unifying unit of measurement across all media. The 1975 census found that information supply was increasing much faster than information consumption (Press 2013).
1983 “Tracking the Flow of Information” in Science: Looking at growth trends in 17 major communications media from 1960 to 1977, Ithiel de Sola Pool concludes that “words made available to Americans (over the age of 10) through these media grew at a rate of 8.9 percent per year”… “words actually attended to from those media grew at just 2.9 percent per year” (de Sola Pool 1983).
1986 “Can users really absorb data at today’s rates? Tomorrow’s?” in Data Communications: Hal B. Becker estimates that “the recording density achieved by Gutenberg was approximately 500 symbols (characters) per cubic inch—500 times the density of [4000 B.C. Sumerian] clay tablets. By the year 2000, semiconductor random access memory should be storing 1.25 × 10^11 bytes per cubic inch” (Becker 1986).
1990 “Saving All the Bits” in American Scientist: “The imperative for scientists to save all the bits forces us into an impossible situation: The rate and volume of information flow overwhelm our networks, storage devices and retrieval systems, as well as the human capacity for comprehension.” (Denning 1990)
1995 The World Wide Web populated (Berners-Lee et al. 1994).
1997 “Big data” was coined in Cox and Ellsworth (1997a).
1999 Internet of Things (IoT): The term “Internet of Things” was coined by British entrepreneur Kevin Ashton, cofounder of the Auto-ID Center at MIT, during a presentation linking the idea of RFID in the supply chain to the Internet world (Aston 2009).
2001 Software as a service (SaaS): Web services and cloud computing have evolved through a number of phases, which include grid and utility computing, application service provision (ASP), and software as a service (SaaS) (GCN 2013).
2001 The 3 Vs: Gartner analyst Doug Laney published a report, 3D Data Management: Controlling Data Volume, Velocity, and Variety (Laney 2001). The “3 Vs” are the generally accepted dimensions of big data.
2006 Hadoop: An open source solution to the big data explosion. Hadoop was created in 2006 out of the necessity for new systems to handle the explosion of data from the web (Borthakur 2007).
2007 “The Expanding Digital Universe” by IDC: The white paper is the first study to estimate and forecast the amount of data growth (Press 2013). In 2006 alone, 161 exabytes were created (Gantz et al. 2007). It was forecast to double every 18 months (Gantz et al. 2007), but the amount of digital data created each year exceeded the original forecasts. The actual amount reached 1227 EB in 2010 rather than the forecasted 988 EB (Gantz and Reinsel 2010) and 2837+ EB in 2012 rather than the forecasted 1976 EB (Gantz and Reinsel 2012).
2008 Big data gained popularity: Several famous computer scientists promote the big data concept.
2008 Wired magazine claimed that the data deluge makes [existing] scientific methods obsolete (Anderson 2008).
2009 Linked data (Bizer et al. 2009).
2010 “Data, data everywhere” in The Economist (Cukier 2010).
2011 The Open Compute Project was initiated for open-source data center hardware design (Li et al. 2016).
2012 The US Big Data Research and Development Initiative was declared by the Obama administration (Weiss and Zgorski 2012).
2018 IDC estimated that worldwide spending on big data and analytics reached $169 billion (Vesset et al. 2019).
2020 IDC forecast that the big data and analytics market would reach $203 billion (Vesset and George 2020).
2021 IDC forecast that the big data and analytics market would reach $215.7 billion, a 10.1% increase over 2020 (Vesset and George 2021; Shirer and Goepfert 2021).
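Several entries in Table 1.2 express growth as a doubling time. As a minimal sketch (the function name `growth_factor` is ours and the model is plain exponential growth, an assumption rather than the method of the cited sources), a doubling time T implies growth by a factor of 2^(t/T) over a period t, which allows the quoted figures to be cross-checked:

```python
def growth_factor(elapsed_years: float, doubling_years: float) -> float:
    """Multiplicative growth over `elapsed_years` for a quantity that
    doubles every `doubling_years` (simple exponential model)."""
    return 2.0 ** (elapsed_years / doubling_years)


# de Solla Price: doubling every 15 years, compared with
# "a factor of ten during every half century".
per_half_century = growth_factor(50, 15)

# IDC: 161 EB created in 2006, doubling every 18 months, projected to 2010.
projected_2010_eb = 161 * growth_factor(2010 - 2006, 1.5)

print(f"Growth over 50 years at a 15-year doubling time: x{per_half_century:.1f}")
print(f"Projected 2010 volume: {projected_2010_eb:.0f} EB")
```

Under this simple model, a 15-year doubling time gives roughly a tenfold increase per half century, in line with de Solla Price's statement, and the 18-month rule lands near 1000 EB for 2010, close to the 988 EB forecast. The reported 1227 EB therefore corresponds to a somewhat shorter doubling time.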

1.2.1 Big Data Volume

Volume refers to the huge size of data generated or accumulated.
• More sensors acquiring data in different domains: An example is the millions of cameras that have been installed worldwide to monitor the environment, traffic conditions, public safety, etc., year-round.
• Volume of sensor data increases as detail and resolution increase: The volume of data generated by those cameras at different scales and resolutions is enormous. Figure 1.3 shows the increase of data volume over the years.
• Accumulated data are expanding: NASA EOSDIS data reached 10 PB in the early 2010s.
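The link between resolution and volume in the second point can be made concrete with a rough calculation. The sketch below is illustrative only: the scene size, pixel sizes, and the helper name `scene_size_bytes` are hypothetical and do not describe any particular sensor. It simply counts samples in an uncompressed raster, so halving the pixel size quadruples the volume.

```python
import math


def scene_size_bytes(width_m: float, height_m: float, pixel_size_m: float,
                     bands: int = 1, bytes_per_sample: int = 1) -> int:
    """Uncompressed size of a raster covering width_m x height_m on the
    ground at the given pixel size (hypothetical helper for illustration)."""
    cols = math.ceil(width_m / pixel_size_m)
    rows = math.ceil(height_m / pixel_size_m)
    return cols * rows * bands * bytes_per_sample


# Hypothetical 100 km x 100 km scene, single band, one byte per sample.
coarse = scene_size_bytes(100_000, 100_000, pixel_size_m=20)
fine = scene_size_bytes(100_000, 100_000, pixel_size_m=10)

print(f"20 m pixels: {coarse / 1e6:.0f} MB")
print(f"10 m pixels: {fine / 1e6:.0f} MB ({fine // coarse}x larger)")
```

Adding spectral bands or a greater bit depth multiplies the total further, which is why the `bands` and `bytes_per_sample` parameters are included.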


Fig. 1.2 Vs of Big Data

Fig. 1.3 Increase of Big Data volume


Fig. 1.4 Increase of Big Data velocity

1.2.2 Big Data Velocity

Velocity refers to the speed at which data are generated, processed, and/or moved around. Every minute the world generates petabytes of data that need to be managed and analyzed, and near-real-time decisions may be made based on the analysis results. For example, in 2012 alone, 2834 EB of data were generated, which is equivalent to about 5.4 PB per minute. The fastest network reached 255 Tbps in 2014. Computing speed has also increased over time: Tianhe-2, developed by China and ranked the world’s fastest computer in 2014, has a performance of 33.86 petaflops. The speed of data accumulation has likewise increased over the years. Figure 1.4 shows the increase of data generation between 2014 and 2019.
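The per-minute rate quoted above follows from dividing the annual total by the number of minutes in a year. The short sketch below reproduces that arithmetic, assuming decimal (SI) prefixes, i.e., 1 EB = 1000 PB; the function name `pb_per_minute` is ours.

```python
PB_PER_EB = 1000  # decimal (SI) prefixes assumed: 1 EB = 1000 PB


def pb_per_minute(annual_eb: float, days_in_year: int = 365) -> float:
    """Average data generation rate in PB per minute for an annual total in EB."""
    minutes_in_year = days_in_year * 24 * 60
    return annual_eb * PB_PER_EB / minutes_in_year


# 2012 was a leap year, so it had 366 days.
rate_2012 = pb_per_minute(2834, days_in_year=366)
print(f"{rate_2012:.1f} PB per minute")
```

Whether the year is counted as 365 or 366 days, the result rounds to 5.4 PB per minute.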

1.2.3 Big Data Variety

Variety refers to the different types of data that the world generates and uses.
• Different platforms/sensors: For example, in the geospatial field, we now need to deal with data from in situ, airborne, and satellite platforms and from citizen scientists’ mobile devices.
• Different data types: Data types include hyperspectral images, videos, and model outputs, and the data can be structured or unstructured (Fig. 1.5).

Fig. 1.5 Big Data variety

1.2.4 Big Data Veracity

Veracity refers to the trustworthiness or quality of the data.
• Quality of the data: In the scientific world, understanding the quality and accuracy of the data is one of the biggest concerns in every scientific experiment.
• Different levels of accuracy and quality due to diverse sources: In the big data era, because the sources of the data are numerous and the qualifications of the organizations or individuals who collect the data are not equal, the quality and accuracy of the data are less controllable.
• Provenance: Users need to be aware of the unevenness of data quality, so quality information should be provided with the data. In a workflow, understanding how error propagates from step to step is of great importance to understanding the quality of the data.

1.2.5 Big Data Value

Value refers to the usefulness of the information and knowledge that can be derived from data. Value is the most important aspect of big data, since the data matter only if they are deemed useful. In any application of big data, the first question that comes to mind is what value the data offer.


1.3 Big Data Method and Technology

Two of the most important tasks in big data are as follows:
• Big data management: Big data management deals with data capture, curation, archiving, storage, cataloging, discovery, search, access, sharing, quality control, and privacy. The ultimate purpose of big data management is to make big data from various sources easily accessible and usable by big data analytics. One of the biggest challenges for NASA’s EOS program is how to keep its data for at least 50 years.
• Big data analytics: The core technology supports the extraction of useful information and knowledge from big data. In contrast to traditional data analysis, big data analytics focuses more on seeking correlations or probabilistic relationships than on causal relationships. This shift in focus is due to the increase in the number of variables, the data volume, and the quality uncertainty in big data.
Unlike traditional data management and analysis technologies, big data management and analysis must consider and properly deal with the five “V” characteristics of big data.

1.4 Remote Sensing Big Data

This book studies the theory, technology, and methodology of big data in the field of remote sensing. Remote sensing is one of the major driving sources behind the study of big data. Remote sensing is the science and technology of acquiring information about an object or phenomenon without making physical contact with it. In the geospatial disciplines, remote sensing has become the most important method for collecting data. The data have been widely used in studying the Earth and its components, benefiting human society. With the advancement of both sensors and sensing technologies, huge volumes of time-series remote sensing data have been collected at unprecedented speed by a rapidly increasing number of organizations worldwide. For example, the NASA Earth Observing System has collected more than 10 PB of remote sensing data. Because of the volume of data, the diverse data forms, the widely distributed nature of the data, the multidimensionality, and the wide variety of applications, remote sensing big data are a special kind of big data with many unique characteristics that need to be handled specially. The author of this book has been working at the forefront of remote sensing big data and its applications for over 30 years, and this book is the result of his study and lessons learned. This advanced book is designed for readers who are interested in the concepts, theory, standards, implementation, and applications of remote sensing big data.


The book has 18 chapters. Here are brief introductions for each chapter:
• Chapter 1: General introduction, concept of big data, features of big data, and brief introduction of chapters in the book.
• Chapter 2: Remote sensing concept, sensors, and platforms, including satellite, airborne (e.g., airplane, UAV, and balloon), and in situ.
• Chapter 3: Special features of remote sensing big data.
• Chapter 4: Remote sensing big data collection challenges and cyberinfrastructure and sensor web solutions.
• Chapter 5: Computing power to handle big data: distributed and parallel computing.
• Chapter 6: Big data management: remote sensing data archiving, cataloging, and dissemination; big data management services: data discovery (collection and granule level) and access.
• Chapter 7: Standards for big data management: archiving, metadata, data format, data discovery, and data access.
• Chapter 8: Implementation examples of big data management systems for remote sensing. The Committee on Earth Observation Satellites (CEOS) Working Group on Information Systems and Services (WGISS) Integrated Catalog (CWIC) and the Global Earth Observation System of Systems (GEOSS) Common Infrastructure (GCI) are two examples.
• Chapter 9: Big data analytics for remote sensing: concepts and standards.
• Chapter 10: Big data analytic platforms, computer languages, tools, and services.
• Chapter 11: Algorithmic design considerations of big data analytics.
• Chapter 12: Machine learning and data mining algorithms for geospatial big data.
• Chapter 13: Modeling, prediction, and decision making based on remote sensing big data.
• Chapter 14: Examples of remote sensing applications of big data analytics: fusion of diverse Earth observation data.
• Chapter 15: Examples of remote sensing applications of big data analytics: agricultural drought monitoring and forecasting.
• Chapter 16: Examples of remote sensing applications of big data analytics: land cover time series creation.
• Chapter 17: Geospatial big data initiatives in the world.
• Chapter 18: Challenges and opportunities in remote sensing big data.

References

Adriaans PW (1997) Industrial requirements for ML application technology. In: Proc. of the ICML’97 workshop on machine learning application in the real world: methodological aspects and implications, pp 6–10
Albornoz VM, Cancela H, AFM C et al (2020) Special issue on “OR and big data in agriculture”. Int Trans Oper Res 27:699–700. https://doi.org/10.1111/itor.12696


Al-Jepoori M, Al-Khanjari Z (2018) Framework for handling data veracity in big data. Int J Comput Sci Softw Eng 7:138–141
Anderson C (2008) The end of theory: the data deluge makes the scientific method obsolete. Wired
Asher J (1983) New software is linking small firms to big data banks. Phila. Inq
Aston K (2009) That ‘internet of things’ thing. RFID J 22:97–114
Batty M (2016) Big data and the city. Built Environ 42:321–337. https://doi.org/10.2148/benv.42.3.321
Beal V (2020) What is big data? Webopedia Definition. https://www.webopedia.com/TERM/B/big_data.html. Accessed 20 May 2020
Becker HB (1986) Can users really absorb data at today’s rates? Tomorrow’s? Data Commun 15:177–193
Bengston C (1994) Handling big data sets. GIS User 8:60–62
Berners-Lee T, Cailliau R, Luotonen A et al (1994) The World-Wide Web. Commun ACM 37:76–82. https://doi.org/10.1145/179606.179671
Bisiani R, Greer K (1978) Recent improvements to the harpy connected speech recognition system. In: 1978 IEEE conference on decision and control including the 17th symposium on adaptive processes. IEEE, San Diego, CA, USA, pp 1429–1434
Bizer C, Heath T, Berners-Lee T (2009) Linked data - the story so far. Int J Semantic Web Inf Syst 5:1–22. https://doi.org/10.4018/jswis.2009081901
Blosser HG, Bardin BM, Schutte F et al (1972) Computer control panel discussion. Vancouver (CANADA), pp 538–561
Borthakur D (2007) The Hadoop distributed file system: architecture and design. The Apache Software Foundation
Bryant RE, Katz RH, Lazowska ED (2008) Big-data computing: creating revolutionary breakthroughs in commerce, science, and society computing. Comput Res Initiat 21st Century Comput Res Assoc Available Httpwww Cra OrgcccfilesdocsinitBigData Pdf
Camacho J, Macia-Fernandez G, Diaz-Verdejo J, Garcia-Teodoro P (2014) Tackling the Big Data 4 vs for anomaly detection. In: 2014 IEEE conference on computer communications workshops (INFOCOM WKSHPS). IEEE, Toronto, ON, Canada, pp 500–505
Campbell DES, Ekstedt J, Hedberg Å, Oldberg B (1970) Automatic data collection for computer calculation of instrument measurements in research laboratories. Comput Programs Biomed 1:171–178. https://doi.org/10.1016/0010-468X(70)90005-X
Chen MC, Jacquemin M (1988) Massively parallel architectures. Department of Computer Science, Yale University, New Haven
Chi M, Plaza AJ, Benediktsson JA et al (2015) Foreword to the special issue on big data in remote sensing. IEEE J Sel Top Appl Earth Obs Remote Sens 8:4607–4609. https://doi.org/10.1109/JSTARS.2015.2513662
Chisholm M (2009) The dawn of big data: are we on the cusp of a new paradigm that goes beyond what we can do with traditional data stores? Inf Manag 19:45
Cox M, Ellsworth D (1997a) Application-controlled demand paging for out-of-core visualization. In: Proceedings. Visualization’97 (Cat. No. 97CB36155). IEEE, pp 235–244
Cox M, Ellsworth D (1997b) Managing big data for scientific visualization. In: ACM Siggraph, pp 21–38
Cukier K (2010) Data, data everywhere. The Economist
de Sola Pool I (1983) Tracking the flow of information. Science 221:609–613. https://doi.org/10.1126/science.221.4611.609
de Solla Price DJ (1961) Science since Babylon. Yale University Press, New Haven
de Solla Price DJ (1978) Science since Babylon, Enlarged ed., 3. pr edn. Yale University Press, New Haven
Denning PJ (1990) The science of computing: saving all the bits. Am Sci 78:402–405
Diebold FX (2000) Big data dynamic factor models for macroeconomic measurement and forecasting. Discussion read to the Eighth World Congress of the Econometric Society, Seattle


Diebold FX (2012) On the origin(s) and development of the term “Big Data”. SSRN Electron J. https://doi.org/10.2139/ssrn.2152421
Diebold FX (2003) Big data dynamic factor models for macroeconomic measurement and forecasting. In: Dewatripont M, Hansen LP, Turnovsky S (eds) Advances in economics and econometrics: theory and applications, eighth World Congress of the Econometric Society, pp 115–122
Gantz JF, Reinsel D (2010) The digital universe decade – are you ready? IDC, Framingham
Gantz JF, Reinsel D (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC, Framingham
Gantz JF, Reinsel D, Chute C et al (2007) The expanding digital universe: a forecast of worldwide information growth through 2010. IDC, Framingham
GCN (2013) 30 years of accumulation: a timeline of cloud computing. In: GCN. https://gcn.com/articles/2013/05/30/gcn30-timeline-cloud.aspx. Accessed 20 May 2020
GilPress (2013) Big data arrives at the Oxford English Dictionary. In: Whats big data. https://whatsthebigdata.com/2013/06/15/big-data-arrives-at-the-oxford-english-dictionary/. Accessed 20 May 2020
Gordon A (2015) Official (ISC)2 guide to the CISSP CBK, 4th edn. Auerbach Publications, Boca Raton
Howe D, Costanzo M, Fey P et al (2008) The future of biocuration. Nature 455:47–50. https://doi.org/10.1038/455047a
Huang Y, Chen Z, Yu T et al (2018) Agricultural remote sensing big data: management and applications. J Integr Agric 17:1915–1931. https://doi.org/10.1016/S2095-3119(17)61859-8
Hubaux A (1973) A new geological tool-the data. Earth-Sci Rev 9:159–196. https://doi.org/10.1016/0012-8252(73)90089-5
Ishwarappa, Anuradha J (2015) A brief introduction on big data 5Vs characteristics and Hadoop technology. Proc Comput Sci 48:319–324. https://doi.org/10.1016/j.procs.2015.04.188
Kenwright D (1999) Automation or interaction: what’s best for big data? In: Proceedings visualization ‘99 (Cat. No.99CB37067). IEEE, San Francisco, CA, USA, pp 491–495
Khan MA, Uddin MF, Gupta N (2014) Seven V’s of big data understanding big data to extract value. In: Proceedings of the 2014 Zone 1 conference of the American Society for Engineering Education. IEEE, Bridgeport, CT, USA, pp 1–5
Khan N, Alsaqer M, Shah H et al (2018) The 10 Vs, issues and challenges of big data. In: Proceedings of the 2018 international conference on big data and education - ICBDE ‘18. ACM Press, Honolulu, HI, USA, pp 52–56
Khan N, Naim A, Hussain MR et al (2019) The 51 V’s of big data: survey, technologies, characteristics, opportunities, issues and challenges. In: Proceedings of the international conference on Omni-layer intelligent systems - COINS ‘19. ACM Press, Crete, Greece, pp 19–24
Killian E (1998) Challenges, not roadblocks. Computer 31:44–45
Knutson BJ, McCusker B (1997) An alternative method for computer entering large data sets. J Hosp Tour Res 21:120–128
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6:1
Li D, Wang S, Yuan H, Li D (2016) Software and applications of spatial data mining: software and applications of SDM. Wiley Interdiscip Rev Data Min Knowl Discov 6:84–114. https://doi.org/10.1002/widm.1180
Liebl W, Franz N, Ziegler G et al (1982) A fast ADC interface with data reduction facilities for multi-parameter experiments in nuclear physics. Nucl Instrum Methods Phys Res 193:521–527. https://doi.org/10.1016/0029-554X(82)90245-2
Liu P (2015) A survey of remote-sensing big data. Front Environ Sci 3. https://doi.org/10.3389/fenvs.2015.00045
Liu P, Di L, Du Q, Wang L (2018) Remote sensing big data: theory, methods and applications. Remote Sens 10:711. https://doi.org/10.3390/rs10050711


Lohr S (2013) The origins of “Big Data”: an etymological detective story. In: Bits blog. https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/. Accessed 20 May 2020
Lynch C (2008) How do your data grow? Nature 455:28–29. https://doi.org/10.1038/455028a
Mahey P (1980) On the drawbacks of a linear model in the decentralization of a decision process. IFAC Proc 13:1–7. https://doi.org/10.1016/S1474-6670(17)64411-2
Martin RL (1982) Automated Repair Service Bureau: system architecture. Bell Syst Tech J 61:1115–1130. https://doi.org/10.1002/j.1538-7305.1982.tb04333.x
Mashey JR (1999) Big data and the next wave of InfraStress problems, solutions, opportunities
Mircea M, Stoica M, Ghilic-Micu B (2017) Using cloud computing to address challenges raised by the Internet of Things. In: Mahmood Z (ed) Connected environments for the Internet of Things. Springer International Publishing, Cham, pp 63–82
Mochmann E, Müller PJ (eds) (1979) Data protection and social science research: perspectives from 10 countries. Campus Verlag, Frankfurt [Main], New York
Nelson S (2008) Big data: the Harvard computers. Nature 455:36–37
Nielsen M (2009) A guide to the day of big data. Nature 462:722–723
Preimesberger C (2011) “Big-data” analytics as a service. eWeek 28:28
Press G (2013) A very short history of big data. https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/#1e8bcbf865a1. Accessed 20 May 2020
Rahman H, Begum S, Ahmed MU (2016) Ins and outs of big data: a review. In: Ahmed MU, Begum S, Raad W (eds) Internet of Things technologies for healthcare. Springer International Publishing, Cham, pp 44–51
Ranganathan S, Schonbach C, Kelso J et al (2011) Towards big data science in the decade ahead from ten years of InCoB and the 1st ISCB-Asia Joint Conference. BMC Bioinform 12:1
Rider F (1944) The scholar and the future of the research library. Hadham Press, New York
Robinson F (1971) Problems of using external services for retrospective search. ASLIB Proc 23:523–526. https://doi.org/10.1108/eb050304
Self RJ (2014) Governance strategies for the cloud, big data, and other technologies in education. In: 2014 IEEE/ACM 7th international conference on utility and cloud computing. IEEE, London, pp 630–635
Shafer T (2017) The 42 V’s of big data and data science. https://www.elderresearch.com/blog/42-v-of-big-data. Accessed 23 May 2020
Shirer M, Goepfert J (2021) Global spending on big data and analytics solutions will reach $215.7 billion in 2021, according to a new IDC spending guide. https://www.businesswire.com/news/home/20210817005182/en/Global-Spending-on-Big-Data-and-Analytics-Solutions-Will-Reach-215.7-Billion-in-2021-According-to-a-New-IDC-Spending-Guide. Accessed 29 Oct 2022
Sugarman R (1979) Computers: our ‘microuniverse’ expands. IEEE Spectr 16:32–37
Surden E (1978) Parallel processors seen big data bases’ solution. Computerworld 12:48
Traverso A, Dankers FJWM, Wee L, van Kuijk SMJ (2019) Data at scale. In: Kubben P, Dumontier M, Dekker A (eds) Fundamentals of clinical data science. Springer International Publishing, Cham, pp 11–17
Tremblay M, Grohoski G, Burgess B et al (1998) Challenges and trends in processor design. Computer 31:39–48
Van Roy P, Haridi S (1999) Mozart: a programming system for agent applications. AgentLink News 3
Venkatraman S, Venkatraman R (2019) Big data security challenges and strategies. AIMS Math 4:860–879. https://doi.org/10.3934/math.2019.3.860
Vesset D, George J (2020) Worldwide big data and analytics spending guide
Vesset D, George J (2021) Worldwide big data and analytics spending guide. IDC ULR Httpswww Idc Comgetdoc Jsp 2
Vesset D, Olofson CW, Bond S et al (2019) IDC FutureScape: worldwide data, integration, and analytics 2020 predictions. International Data Corporation (IDC), Framingham


Victor N, Sund M (1977) The importance of standardized interfaces for portable statistical software. In: Cowell W (ed) Portability of numerical software. Springer, Berlin, Heidelberg, pp 484–503
Vinck AJH (2016) Information theory and big data: typical or not-typical, that is the question. In: Vinck AJH, Harutyunyan AN (eds) Proceedings of international workshop on information theory and data science: from information age to big data era - a Claude Shannon centenary event. VMware Armenia Training Center, Yerevan, pp 5–8
Wainer H, Gruvaeus G, Blair M (1974) TREBIG: A 360/75 FORTRAN program for three-mode factor analysis designed for big data sets. Behav Res Methods Instrum 6:53–54
Weiss R, Zgorski L-J (2012) Obama Administration Unveils “Big Data” initiative: announces $200 million in new R&D Investments
Weiss SM, Indurkhya N (1998) Predictive data mining: a practical guide. Morgan Kaufmann Publishers, San Francisco
Wikipedia (2020) Big data. Wikipedia
Woo J, Seung-Jun S, Seo W, Meilanitasari P (2018) Developing a big data analytics platform for manufacturing systems: architecture, method, and implementation. Int J Adv Manuf Technol 99:2193–2217
Woolsto JE (1979) International cooperative information systems. In: International cooperative information systems: proceedings of a seminar. International Development Research Centre, Vienna, pp 13–19
Zhelesov Z (1975) Some problems of multimachine systems. COMpZJ K
Zhu Q (2019) Research on road traffic situation awareness system based on image big data. IEEE Intell Syst 1–1. https://doi.org/10.1109/MIS.2019.2942836

Chapter 2

Remote Sensing

Abstract The chapter introduces basic concepts of remote sensing. Major remote sensors are reviewed by radiometric spectrum and by work mode. One focus of the review is data generation rate and the contribution of different sensors to remote sensing big data. By radiometric spectrum, there are multi- and hyperspectral remote sensing, active microwave remote sensing, passive microwave remote sensing, active optical remote sensing, GPS remote sensing, and imaging sonar. By work mode, there are frame camera, whiskbroom scanner, pushbroom scanner, side-scanning sensor, and conical scanning sensor. Major platforms for remote sensing are also reviewed, including satellites, airborne, in situ, and shipborne.

Keywords Remote sensing · Radiometric spectrum · Optical remote sensing · Microwave remote sensing · Sonar · Sensor work mode

2.1 Concepts

Remote sensing has had various definitions, as shown in Table 2.1, depending on the perspective, time, and application. In a broad sense, remote sensing is the technology for acquiring information about an object or phenomenon without making physical contact with it. With this broad definition, the photos taken by your mobile phone are remote sensing images. Based on the definition of big data, the remote sensing community has been dealing with big data issues since the launch of Landsat-1 in 1972. In the context of remote sensing big data, the focus will be on data acquired using different sensors. The following definition will be used:

Remote sensing is the digital recording of energy responses from objects/phenomena through a sensor at some distance.

This definition consists of a sensor, a platform, an object/phenomenon, and a digital record. The target of sensing is the object or phenomenon from which measurements can be taken. The result is expected to be in digital form, which narrows the range of data and recording media. Further details on major sensors and platforms are covered later in this chapter. Most remotely sensed data accrue from repetitive recordings over a long time, especially from Earth orbit. The major types of remote sensing are elaborated in the sections that follow.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_2

Table 2.1 Definitions of remote sensing

Time: 1962
Definition: “the measurement of some property of an object without having the measuring device physically in contact with the object”
References: Parker (1962), Robinove (1963)

Time: 1968
Definition: “the recording of energy responses from objects by means of a sensing device operated at some unprescribed distance; the energy responses are recorded in a form which can be analyzed for some specific purpose”
References: Carneggie (1968)

Time: 1976
Definition: “the art or science of telling something about an object without touching it”
References: Fischer et al. (1976)

Time: 1979
Definition: “the use of reflected and emitted energy to measure the physical properties of distant objects and their surroundings”
References: Moore (1979)

Time: 1982
Definition: “Remote sensing is the acquisition of data and derivative information about objects or materials (targets) located at the Earth’s surface or in its atmosphere by using sensors mounted on platforms located at a distance from the targets to make measurements (usually multispectral) of interactions between the targets and electromagnetic radiation”
References: Short (1982)

Time: 1999
Definition: “the set of instruments (sensors), platforms, and data-processing techniques that are used to derive information about the physical, chemical, and biological properties of the Earth’s surface (i.e. the land, atmosphere, and oceans) without recourse to direct physical contact”
References: Barnsley (1999)

Time: 2001
Definition: “provide data from energy emitted, reflected, and/or transmitted from all parts of the electromagnetic spectrum”
References: Estes et al. (2001)

Time: 2001
Definition: “capturing images of the extent of seagrass from various airborne or satellite platforms in forms such as photos or digital data”
References: McKenzie et al. (2001)

Time: 2007
Definition: “the measurement of object properties on the earth’s surface using data acquired from aircraft and satellites”
References: Schowengerdt (2007)

Time: 2008
Definition: “collection and interpretation of information about an object without being in physical contact with the object”
References: ISO (2008)

Time: 2008
Definition: “a technology for sampling reflected and emitted electromagnetic (EM) radiation from the Earth’s terrestrial and aquatic ecosystems and atmosphere”
References: Horning (2008)

Time: 2011
Definition: “the practice of deriving information about the Earth’s land and water surfaces using images acquired from an overhead perspective, using electro-magnetic radiation in one or more regions of the electromagnetic spectrum, reflected or emitted from the Earth’s surface”
References: Schowengerdt (2007)

Time: 2019
Definition: “Remote sensing is the science and technology of capturing, processing and analysing imagery, in conjunction with other physical data of the Earth and the planets, from sensors in space, in the air and on the ground.”
References: ISRPS (2019)


2.2 Sensors

Remote sensing big data has grown rapidly over the years because data flow in every second and new sensors keep being launched into space. Sensors have been developed to cover the complete electromagnetic spectrum, and they can be designed to work in different operational modes. In this section, sensors are reviewed by two criteria: the radiometric spectrum covered and the operational mode used.

2.2.1 Sensors by Radiometric Spectrums

2.2.1.1 Multi- and Hyperspectral Remote Sensing

Passive optical remote sensing uses visible, near-infrared, and shortwave-infrared sensors to capture solar energy reflected from the Earth’s surface. Figure 2.1 shows the relationship between the sun, the Earth, and optical remote sensors. Depending on how many bands cover a continuous spectral range, optical remote sensing can be divided into multispectral remote sensing and hyperspectral remote sensing. Multispectral remote sensing generally has 3–20 bands, where the bands

Fig. 2.1 Multi- and Hyper-Spectral Remote Sensing in VIS/IR


are somewhat discrete. Hyperspectral remote sensing has a large number of narrow, contiguous bands across a spectral range. For example, a sensor that covers the range of 400–1100 nm in contiguous steps of 1 nm is hyperspectral, whereas a sensor that covers the same range of 400–1100 nm with only a few discrete, selected bands is multispectral. The Landsat Thematic Mapper is a good example of a multispectral sensor. Table 2.2 lists common multispectral sensors aboard satellites. Table 2.3 lists common hyperspectral remote sensors aboard satellites. Most hyperspectral sensors have more than 100 narrow bands over a spectral range. The first successful hyperspectral remote sensor onboard a satellite is Hyperion on the EO-1 satellite, launched on November 21, 2000. It has 220 bands covering the spectral range of 400–2500 nm with 10 nm spectral resolution.

Optical remote sensing has wide applications in many different fields. Products derived from multispectral or hyperspectral visible and near-infrared remote sensing include soil organic carbon (Gomez et al. 2008), land use (Rogan and Chen 2004), biomethane potential (Udelhoven et al. 2013), water pond classification (Lacaux et al. 2007), and sediment concentration (Volpe et al. 2011). Multi- and hyperspectral thermal infrared remote sensing has been applied in chemical agent detection and identification (Farley et al. 2007), anomaly detection (Sekertekin and Arslan 2019), hard target detection (Manolakis et al. 2003), soil and surface contaminants (Nascimento and Dias 2007; Slonecker et al. 2010), combustion characterization (Wang et al. 2015), and plumes and flares (Steffke et al. 2010).

2.2.1.2 Active Microwave Remote Sensing

Microwave radiation is electromagnetic radiation with a wavelength between 1 mm (300 GHz) and 100 m (3 MHz). Table 2.4 lists common letter designations, frequency ranges, and wavelength ranges (IEEE-AESS 2019).
Microwave remote sensing uses microwave radiation as a measurement tool. Active microwave remote sensing sends microwave radiation to a target and receives the reflected radiation: the sensor transmits a radio signal in the microwave range and records the signal backscattered from targets. Two measurements form the basis of the data received: the intensity of the backscattered signal (the amplitude of the reflected signal) and the distance between sensor and target in terms of signal travel time (the phase of the reflected signal). Such a system is a RADAR, which is short for radio detection and ranging. The most common active microwave remote sensor is the synthetic aperture radar (SAR). Figure 2.2 shows the principle of SAR. The sensor transmits microwave signals and receives the backscattered signals at a later time. The synthetic aperture shown in Fig. 2.2 is larger than the actual aperture of the sensor, which gives SAR its name. Interferometric synthetic aperture radar (InSAR) is a technology that uses two or more SAR images of an area to measure displacements of the Earth’s surface with a precision down to millimeters (Krieger et al. 2010). The phase difference can determine

30

LS-6 LS-7 LS-8

LS-9 Terra, Aqua

NOAA-20

Terra

Terra

IKONOS

ETMg ETM+h OLIi

OLI-2j MODIS

VIIRSk

ASTER SWIRo

ASTER TIRp

OSAq

ASTERm VNIRn Terra

30 250 (bands 1–2), 500 (bands 3–7), 1000 (bands 8–36) 375 (I-bands), 750 (M-bands) 15

LS-4,5

TMe

0.8, 3.2

90

30 30 30

30

Platform Spatial (m) d LS -1,2,3,4,5 57 × 79

Sensor MSSc

Table 2.2 Selected multispectral remote sensors

B1:0.52–0.60, B2:0.63–0.69, B3:0.76–0.86 B4:1.600–1.700, B5:2.145–2.185, B6:2.185–2.225, B7:2.235–2.285, B8: 2.295–2.365, B9:2.360–2.430 B10:8.125–8.475, B11:8.475–8.825, B12:8.925–9.275, B13:10.25–10.95, B14:10.95–11.65 1 visible, 4 multispectral

5 I-bands, 16 M-bands, 1 DNBl

Bands (μm) B4: 0.5–0.6; B5: 0.6–0.7; B6: 0.6–0.8; B7: 0.8–1.1 B1: 0.45–0.52; B2: 0.52–0.62; B3: 0.63–0.69; B4: 0.76–0.9; B5: 1.55–1.75; B6: 10.40–12.50f; B7: 2.08–2.35 " " B1: 0.44–0.45; B2: 0.45–0.52; B3:0.53– 0.59; B4: 0.64–0.67; B5:0.85–0.88; B6: 1.57–1.65; B7: 2.11–2.29; B9: 1.36–1.38 " 36 bands, spread from blue (0.4) to infrared (21.6).

Spectral

9/1999

12/1999

12/1999

8

12

12/1999

11/2017

9/2021 12/1999

10/1993 4/1999 2/2013

7/1982

8

12

14 12

8 8 8

8

3/2015 320

4.2

23

62

440 6.1

12/2012 85 5/2003 150 440

(continued)

16

90

236

1734 24

335 591 1734

335

Data acquisition Rate Annually (Mbps) (TB)b 15 59

11/2011 85

Temporal Quan. (bit)a Start End 6 7/1972 10/1992


PlanetScope

8 multispectral

panchromatic 1 panchromatic, 8 multispectral 1 panchromatic, 8 multispectral, 8 SWIR, 12 CAVISw 1 panchromatic, 4 multispectral 1 panchromatic, 8 multispectral

16 bands (1 visible, 3 NIR, 10 IR) 1 NIR 1 panchromatic, 4 multispectral 13 bands from VNIR to SWIR

Bands (μm) 1 visible, 3 NIR

Spectral

11/2016 11/2016 9/2008 6/2015

11

7/2014

11/2016 9/2022

11 9/2007 11 10/2009 11/14 8/2014

10 14 11 12

800 800 800

75 7.7 740 450

Data acquisition Rate Annually (Mbps) (TB)b 2.6

1/2019 800

Temporal Quan. (bit)a Start End 10 5/1994

a

Quantization b The annual potential data accumulation of raw data is an estimate by assuming the ideal data rate. Actual data acquisition may vary depending on operation and degradation of instrument and communication component c Multispectral Scanner System d Landsat e Thematic Mapper f 6 Band 6 (thermal infrared) spatial resolution is 120 m for Landsat 4, 5, and 6 and 60 m for Landsat 7. Data product may be provided as resampled to 30 m g Enhanced Thematic Mapper h Enhanced Thematic Mapper Plus i Operational Land Imager

0.5 0.46, 1.85 0.31, 1.24, 3.7, 30

8000–14,000 0.41, 1.65 60, 10, 20

WorldView-4 0.31, 1.24 WorldView 0.29, 1.16 Legion Planet Dove 3.0–4.1

WV60 WV-110 WV110

ABIs GLMt GISu MSIv

WV110 WV Legion

Spatial (m) 1000, 4000

Platform GOES- 9,10,11, 12,13,14,15 GOES-16,17 GOES-16,17 GeoEye-1 Sentinel- 2A,2B WorldView-1 WorldView-2 WorldView-3

Sensor GOESr Imager

Table 2.2 (continued)


Operational Land Imager-2 Visible Infrared Imaging Radiometer Suite l Day Night Band m Advanced Spaceborne Thermal Emission and Reflection Radiometer n Visible and Near Infrared Radiometer o Shortwave Infrared Radiometer p Thermal Infrared Radiometer q Optical Sensor Assembly r Geostationary Operational Environmental Satellites s Advanced Baseline Imager t Geostationary Lightning Mapper u GeoEye Imaging System v MultiSpectral Instrument w Clouds, Aerosols, Vapors, Ice, & Snow

k

j


Table 2.3 Selected hyperspectral remote sensors Spectral Spatial Sensor Platform (m) Bands (μm) c Hyperion EO-1 30 220 bands between 0.4–2.5 LACd EO-1 250 256 bands between 0.89–1.6 PRISMAe PRISMA 30 66 VNIR bands and 171 SWIR bands HISUIf ISSg 20x30 185 bands of VNIR and SWIR HSIh Tiangong-1 64 VNIR bands and 64 SWIR bands HySISi HySIS 30 70 VNIR bands and 256 SWIR bands

Quan. (bit)a 12

Temporal

Data acquisition Rate Annually Start End (Mbps) (TB)b 11/2000 3/2017 105

12

11/2000 3/2017 95

12

3/2019

155

12

12/2019

400

12

2011

2015

12/2018

Quantization The annual potential data accumulation of raw data is an estimate by assuming the ideal data rate. Actual data acquisition may vary depending on operation and degradation of instrument and communication component c Earth Observing One d LEISA (Linear Etalon Imaging Spectral Array) Atmospheric Corrector e PRecursore IperSpettrale della Missione Applicativa f Hyperspectral Imager Suite g International Space Station h Hyperspectral Imager i Hyperspectral Imaging Satellite a

b

surface deformation with millimeter-level precision. However, the measurement of absolute distance is less accurate, and a reference point is needed to accurately determine the absolute distance of the deformed surface. Table 2.5 lists active SAR sensors onboard satellite missions. Sensors may operate in different modes. For example, the C-SAR onboard Sentinel-1 has Stripmap (SM), Interferometric Wide Swath (IW), Extra Wide Swath (EW), and Wave (WV) modes. The AMI (Active Microwave Instrument) onboard ERS-1 has Imaging Mode, Wave Mode, and Wind Scatterometer Mode (SCAT). Different modes may differ in polarization, incidence angle, spatial resolution, swath width, and raw data rate. For each sensor, Table 2.5 gives only one mode, the one closest to imaging or stripmap mode.
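The two radar measurements described above, travel time and phase, can be turned into distances with two short formulas. The following is a minimal sketch, not the processing chain of any particular mission. It assumes the usual two-way relations: range R = c·t/2 for a pulse with round-trip time t, and line-of-sight displacement d = λ·Δφ/(4π) for a repeat-pass phase difference Δφ. The function names are hypothetical, and the 5.405 GHz center frequency is the Sentinel-1 C-SAR value listed in Table 2.5.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def slant_range_m(two_way_travel_time_s):
    """Sensor-to-target distance from the round-trip travel time of a pulse."""
    return C * two_way_travel_time_s / 2.0


def wavelength_m(center_frequency_ghz):
    """Radar wavelength in meters for a center frequency given in GHz."""
    return C / (center_frequency_ghz * 1e9)


def los_displacement_mm(phase_difference_rad, center_frequency_ghz):
    """Line-of-sight displacement (mm) for a repeat-pass phase difference.

    Uses d = lambda * dphi / (4 * pi); the factor 4*pi (not 2*pi) accounts
    for the two-way path. The sign convention is an assumption.
    """
    lam = wavelength_m(center_frequency_ghz)
    return 1000.0 * lam * phase_difference_rad / (4.0 * math.pi)


# One full interferometric fringe (2*pi) at C-band corresponds to half a wavelength.
fringe_mm = los_displacement_mm(2.0 * math.pi, 5.405)
print(f"One C-band fringe ~ {fringe_mm:.1f} mm of line-of-sight motion")
print(f"A 5 ms round trip ~ {slant_range_m(0.005) / 1000.0:.1f} km slant range")
```

Because one fringe at C-band spans under 3 cm, a phase measured to a fraction of a cycle resolves millimeter-scale motion. The phase is only known modulo 2π, however, which is one reason a reference point is needed for absolute distance.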


Table 2.4 Microwave Bands
Band designation | Frequency range (unit: GHz) | Wavelength range (unit: mm)
HF | 0.003–0.03 | 10,000–100,000
VHF | 0.03–0.3 | 1000–10,000
UHF | 0.3–1 | 300–1000
L | 1–2 | 150–300
S | 2–4 | 75–150
C | 4–8 | 37.5–75
X | 8–12 | 25–37.5
Ku | 12–18 | 16.7–25
K | 18–27 | 11.1–16.7
Ka | 27–40 | 7.5–11.1
V | 40–75 | 4–7.5
W | 75–110 | 2.7–4
mm | 110–300 | 1–2.7
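The wavelength column of Table 2.4 follows from the frequency column through λ = c/f. The sketch below is only an illustration of that conversion; the dictionary holds a few of the band limits from the table, and the function names are hypothetical. It uses the exact speed of light, so results differ slightly from table values that were rounded with c ≈ 3 × 10^8 m/s.

```python
C_MM_GHZ = 299.792458  # speed of light expressed in mm * GHz (c = 299,792,458 m/s)

# Frequency limits in GHz for a few bands, copied from Table 2.4.
BANDS_GHZ = {
    "L": (1.0, 2.0),
    "C": (4.0, 8.0),
    "X": (8.0, 12.0),
    "Ku": (12.0, 18.0),
    "Ka": (27.0, 40.0),
}


def wavelength_mm(frequency_ghz):
    """Free-space wavelength in millimeters for a frequency in GHz."""
    return C_MM_GHZ / frequency_ghz


def wavelength_range_mm(band):
    """(shortest, longest) wavelength in mm for a band in BANDS_GHZ."""
    f_low, f_high = BANDS_GHZ[band]
    # The highest frequency gives the shortest wavelength.
    return wavelength_mm(f_high), wavelength_mm(f_low)


for name in BANDS_GHZ:
    shortest, longest = wavelength_range_mm(name)
    print(f"{name:>2}: {shortest:6.1f} to {longest:6.1f} mm")
```

The conversion also serves as a quick consistency check on a band table: adjacent bands share a frequency limit, so their wavelength ranges should meet at the same value.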

Fig. 2.2 System of SAR

Example applications of active microwave remote sensing are 3D model creation (Toutin and Gray 2000), soil moisture extraction (Walker et al. 2004), and surface deformation monitoring (Hooper et al. 2004). Doppler radar uses the Doppler effect to produce velocity data about objects at a distance. In weather applications, Doppler radar systems have been deployed to track weather events. The Next Generation Weather Radar (NEXRAD) system has 160 sites throughout the United States and selected overseas locations. Doppler radar has

Table 2.5 Active SAR sensors onboard satellites
Sensor | Platform | Spatial (m) | Spectral C. Freq. [a] (GHz) | Pol. [b] | Quan. (bit) [c] | Temporal Start | Temporal End | Data acquisition Rate (Mbps) | Data acquisition Annually (TB) [d]
AMI [e] | ERS-1 [f] | 10–30 | 5.3 | LV | 5 | 7/1991 | 3/2000 | 105 |
AMI | ERS-2 | 10–30 | 5.3 | LV | 5 | 4/1995 | 9/2011 | 105 |
SAR | RADARSAT-1 | 25 × 28 | 5.3 | HH | 4 | 11/1995 | 5/2013 | 105 |
ASAR [g] | EnviSat | 28 | 5.331 | VV, HH | 5 | 3/2002 | 4/2012 | 100 |
SAR | RADARSAT-2 | 25 × 28 | 5.405 | HH, VV, HV, VH | 8 | 12/2007 | | 105 |
RISAT [h]-SAR | RISAT-1 | 3 × 2 | 5.350 | HH, HV, VV, VH | 6 | 4/2012 | | 1112 |
C-SAR | Sentinel-1a | 5 | 5.405 | HH-HV, VV-VH | 10 | 4/2014 | | 520 |
C-SAR | Sentinel-1b | 5 | 5.405 | HH-HV, VV-VH | 10 | 4/2016 | | 520 |
C-SAR | Gaofen-3 | 1 | 5.4 | HH, HV, VV, VH | 8 | 8/2016 | | 1280 |
SAR | RCM [i] × 3 | 1.9 × 1.8 | 5.405 | HH, VV, HV, VH | 5 | 6/2019 | | 105 |
SAR | JERS-1 [j] | 18 | 1.275 | HH | 3 | 2/1992 | 10/1998 | 60 |
PALSAR [k] | ALOS [l] | 7–44 | 1.270 | HH, VV | 5 | 1/2006 | 4/2011 | 240 |
PALSAR-2 | ALOS-2 | 3 | 1.257 | SP, DP | 8 | 5/2014 | | 800 |
SAR | SAOCOM-1A [m] | 10 | 1.275 | HH, VV | 8 | 10/2018 | | 310 |
TSX-SAR | TerraSAR-X | 3 | 9.65 | HH, VV, HV, VH | 8 | 6/2007 | | 580 |
SAR-2000 | COSMO-SkyMed-1 [n] | 3 | 9.6 | HH, VV, HV, VH | 8 | 6/2007 | | 310 |
SAR-2000 | COSMO-SkyMed-2 | 3 | 9.6 | HH, VV, HV, VH | 8 | 12/2007 | | 310 |
SAR-2000 | COSMO-SkyMed-3 | 3 | 9.6 | HH, VV, HV, VH | 8 | 10/2008 | | 310 |
SAR-2000 | COSMO-SkyMed-4 | 3 | 9.6 | HH, VV, HV, VH | 8 | 11/2010 | | 310 |
TDX-SAR | TanDEM-X | 3 | 9.65 | HH, VV, HV, VH | 8 | 6/2010 | | 300 |
COSI [o] | KOMPSAT-5 [p] | 3 | 9.66 | HH, VV, HV, VH | 8 | 8/2013 | | 310 |
Paz-SAR | PAZ [q] | 3 | 9.65 | HH, VV, HV, VH | 8 | 2/2018 | | 300 |
CSG-SAR [r] | CSG-1 [s] | 3 | 9.6 | HH, VV, HV, VH | 10 | 12/2019 | | 2400 |

[a] Center frequency
[b] Polarization
[c] Quantization
[d] The annual potential data accumulation of raw data is an estimate by assuming the ideal data rate. Actual data acquisition may vary depending on operation and degradation of instrument and communication component
[e] C-band Active Microwave Instrument
[f] European Remote Sensing Satellite
[g] Advanced Synthetic Aperture Radar
[h] Radar Imaging Satellite
[i] RADARSAT Constellation Mission
[j] Japanese Earth Resources Satellite
[k] Phased Array L-band Synthetic Aperture Radar
[l] Advanced Land Observing Satellite
[m] SAtélite Argentino de Observación COn Microondas
[n] COnstellation of Satellites for the Mediterranean basin Observation
[o] COrea SAR Instrument
[p] Korean Multi-Purpose Satellite
[q] “peace” in Spanish, formerly known as SEOSAR/PAZ (Satélite Español de Observación SAR, SAR Observation Spanish Satellite)
[r] COSMO-SkyMed Second Generation Synthetic Aperture Radar
[s] Constellation of Small Satellites for Mediterranean basin observation


been used widely in monitoring weather and tracking severe storms (Doviak and Zrnić 2006).

2.2.1.3 Passive Microwave Remote Sensing

Passive microwave remote sensing detects microwave radiation naturally emitted by the Earth. Table 2.6 lists passive microwave sensors onboard satellite missions.

Table 2.6 Passive microwave sensors onboard satellites
Sensor | Platform | Spatial (km) | Spectral Freq. [a] (GHz) | Pol. [b] | Quan. (bit) [c] | Temporal Start | Temporal End | Data acquisition Rate (Mbps) | Data acquisition Annually (TB) [d]
SMMR [e] | Nimbus-7 | 60 | 37, 21, 18, 10.69, 6.6 | H, V | 8 | 7/1978 | 7/1988 | 0.025 |
SMMR | SeaSat | 60 | 37, 21, 18, 10.69, 6.6 | H, V | 8 | 6/1978 | 7/1978 | 0.025 |
NSCAT [f] | ADEOS [g] | 50 | 62 | V, H | | 8/1996 | 6/1997 | 0.0029 |
AMSR [h] | ADEOS-II | 3–70 | 6.925–89.0 (12 channels) | V, H | 12/10 | 12/2002 | 10/2003 | 0.13 |
AMSR-E [i] | Aqua | 4–75 | 6.9–89 (12 channels) | V, H | 12/10 | 5/2002 | 10/2011 | 0.0874 |
SSMIS [j] | DMSP [k] | 13–70 | 19.35–183.31 (24 channels) | H, V | 8 | 10/2003 | | 0.0142 |
L-band radiometer | SMAP [l] | 40 | 1.4–1.427 | V, H | 16 | 1/2015 | | 4.3 |
AMR [m] | Jason-3 | 15–20 | 18.7, 23.8, 34 | | | 1/2016 | | 0.838 |

[a] Frequency
[b] Polarization
[c] Quantization
[d] The annual potential data accumulation of raw data is an estimate by assuming the ideal data rate. Actual data acquisition may vary depending on operation and degradation of instrument and communication component
[e] Scanning Multichannel Microwave Radiometer
[f] NASA Scatterometer
[g] Advanced Earth Observing Satellite
[h] Advanced Microwave Scanning Radiometer
[i] Advanced Microwave Scanning Radiometer for EOS
[j] Special Sensor Microwave Imager/Sounder
[k] Defense Meteorological Satellite Program
[l] Soil Moisture Active Passive
[m] Advanced Microwave Radiometer
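The table footnotes describe the annual potential data accumulation as an estimate that assumes the ideal data rate. One plausible reading of that estimate, sketched below, is a sensor transmitting continuously at its nominal rate for a full year. The assumptions are a 365-day year and decimal units (1 Mbps = 10^6 bit/s, 1 TB = 10^12 bytes), and the function name is hypothetical. Under these assumptions, the rate and annual-volume pairs given in Table 2.2 for the Landsat sensors (15 Mbps and 59 TB, 85 Mbps and 335 TB, 440 Mbps and 1734 TB) are reproduced.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year of continuous acquisition


def annual_volume_tb(rate_mbps):
    """Potential raw data volume per year, in terabytes, at a constant data rate.

    Assumes decimal units: 1 Mbps = 1e6 bit/s and 1 TB = 1e12 bytes.
    """
    bits_per_year = rate_mbps * 1e6 * SECONDS_PER_YEAR
    bytes_per_year = bits_per_year / 8.0
    return bytes_per_year / 1e12


for rate in (15, 85, 150, 440):
    print(f"{rate:>4} Mbps -> {annual_volume_tb(rate):7.1f} TB per year")
```

As the footnotes caution, this is an upper bound; duty cycles, downlink windows, and instrument degradation make actual acquisition smaller.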


Passive microwave remote sensing has been applied in large-scale soil moisture retrieval (Njoku and Entekhabi 1996; Drusch et al. 2001) and atmospheric water vapor measurements (Bobylev et al. 2010).

2.2.1.4 Active Optical Remote Sensing

LIDAR is short for light detection and ranging. It uses pulsed laser beams to measure ranges to the Earth. Airborne LIDAR has been widely used to produce detailed digital elevation models (DEMs). Two satellite missions have carried LIDAR systems. One is the Geoscience Laser Altimeter System (GLAS) onboard the Ice, Cloud and land Elevation Satellite (ICESat), which operated between January 2003 and February 2010. The other is the CALIOP (Cloud-Aerosol Lidar with Orthogonal Polarization) onboard CALIPSO (Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations), which was launched in April 2006. LIDAR has been applied in vegetation profiling (Bergen et al. 2009), 3-D urban modeling (Wang et al. 2018), and DEM generation (Ma 2005).

2.2.1.5 GPS Remote Sensing

The global positioning system (GPS) satellite constellation has 32 satellites in orbit to help in navigation. Similarly, there are other navigation satellite systems in orbit. The Global’naya Navigatsionnaya Sputnikovaya Sistema (GLObal NAvigation Satellite System or GLONASS) has 24 satellites in orbit. Galileo, the global navigation satellite system (GNSS) operated by the European Union through the European GNSS Agency (GSA), has 30 satellites. The BeiDou Navigation Satellite System (BDS) had 35 satellites in orbit by mid-2020. The applications of these global positioning satellite systems have been extended beyond navigation and special military uses. Their global availability and signal characteristics support a variety of remote sensing applications in the atmosphere (Davis et al. 1996; Ladreiter and Kirchengast 1996; Li et al. 2018; Bai et al. 2020), oceans (Garrison and Katzberg 2000; Yun et al. 2016; Chen et al. 2019), and Earth science (Yunck and Melbourne 1996; Qin et al. 2018; Hao et al. 2018). Among these applications, two broad categories of GNSS remote sensing techniques reflect the different uses of signals: reflectometry and refractometry (Jin et al. 2014a; Yu et al. 2014). Reflectometry analyzes signals reflected or scattered from the Earth’s surface to infer surface properties, with applications such as altimetry (Martin-Neira 1993), ocean winds (Zavorotny and Voronovich 2000), sea-ice coverage (Yan and Huang 2019), vegetation (Wu et al. 2021), wetlands (Rodriguez-Alvarez et al. 2019), and soil moisture (Wu et al. 2020). Refractometry analyzes signals refracted by the atmosphere to profile the Earth’s atmosphere and ionosphere in applications such as weather forecasting (Jin et al. 2014b; Bai et al. 2020), climate monitoring (Gleisner et al. 2022), space weather, and ionospheric research (Wu et al. 2022).


2.2.1.6 Imaging Sonar

Sonar (sound navigation and ranging) is a technique that uses sound to detect targets and to navigate. Sonar imaging applies sonar to form an image of the underwater ground surface. It can detect reflecting objects in the dark and around corners. Imaging sonar has been applied in DEM generation (Abu Nokra 2004), underwater object detection (Cho et al. 2015), and navigation (Johannsson et al. 2010).

2.2.2 Sensors by Work Mode

Remote sensors have different work modes. The major work modes are frame camera, whiskbroom scanner, pushbroom scanner, side scanning scanner, and conical scanning scanner.

2.2.2.1 Frame

The frame work mode of remote sensors has a geometry similar to that of a digital camera. It often uses a two-dimensional array of light-sensing pixels (e.g., a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) sensor) mounted at the focal plane of the camera. It is also known as staring, staring plane, or focal plane. Figure 2.3 illustrates the frame work mode. The image is a grid of pixels, or picture elements. The frame sensing mode is popular among microsatellites because it reduces cost through readily available two-dimensional CCD array sensors, commercial off-the-shelf lenses, and gravity-gradient stabilization (Fouquet and Ward 1998; Junichi Kurihara et al. 2018). TMSAT, or Thai-Phutt, launched on July 10, 1998, has a wide angle camera (WAC) with an imaging frame of 568 × 560 pixels and a narrow angle camera (NAC) with a 1020 × 1020 pixel CCD sensor (Kramer 2002). The British small satellite UoSAT-12, launched on April 21, 1999 and operated until September 21, 2003, has a WAC sensor, a Surrey high-resolution panchromatic camera (SHC) with an imaging frame of 1024 × 1024 pixels, and two multispectral imagers (MSI) with imaging frames of 1024 × 1024 pixels. Mid-wavelength and long-wavelength infrared sensors working in frame mode have also been developed (Gunapala et al. 2005). Focal-plane array sensors are adopted in the design of laser detection and ranging (LADAR) and laser radar imaging systems (Marino et al. 2003; Itzler et al. 2010).


Fig. 2.3 Remote Sensing Work Mode - Frame

2.2.2.2 Whiskbroom The whiskbroom work mode, or across track scanning, has a rotating mirror that scans across the satellite’s path and reflects light into a single detector which collects data one pixel at a time (Nummedal 1980). Figure 2.4 illustrates the configuration and the work mode. A whiskbroom scanner is heavier and more expensive than other work mode sensors due to the moving scanning mechanisms. The moving parts in a whiskbroom sensor also make it prone to wearing out. The Multispectral Scanner System (MSS), the Thematic Mapper (TM), the Enhanced Thematic


Fig. 2.4 Remote Sensing Work Mode - Whiskbroom


Fig. 2.5 Remote Sensing Work Mode - Pushbroom

Mapper (ETM), and the Enhanced Thematic Mapper Plus (ETM+) onboard the Landsat series (from Landsat 1 to 7) are all whiskbroom multispectral scanners. The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) also works in whiskbroom mode. The mature whiskbroom mode, with its consistent sensitivity, has been adopted in hyperspectral UAV design (Uto et al. 2016).


2.2.2.3 Pushbroom

The pushbroom work mode, or along track scanning, has a line of sensors arranged perpendicular to the flight direction of the spacecraft (Nummedal 1980). Figure 2.5 illustrates the pushbroom work mode. There is no moving mechanism in a pushbroom sensor system, which makes it lighter and less expensive than a whiskbroom system. Compared with whiskbroom sensors, however, the image quality of a pushbroom sensor may be affected by the varying sensitivity of its individual detectors (Graña and Duro 2008). Examples of pushbroom sensors are the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS) onboard Landsat-8 (Reuter et al. 2011), the Advanced Land Imager (ALI) onboard the Earth Observing-1 (EO-1) (Hearn et al. 2001), HRV (Haute Resolution Visible) onboard SPOT satellites (Herve et al. 1995; Gupta and Hartley 1997), CMT (China Mapping Telescope) onboard Beijing-1 (Poli and Toutin 2012), WV60 (WorldView-60 camera) onboard WorldView-1 (Johansen et al. 2008), WV110 (WorldView-110 camera) onboard WorldView-2 (Poli and Toutin 2012), HiRI (High-Resolution Imager) onboard Pleiades-1A/1B (Poli et al. 2015), and the GeoEye Imaging System (GIS) onboard GeoEye-1 (Aguilar et al. 2012).

2.2.2.4 Side Scanning

The side scanning mode has a side-looking geometry in which the sensor scans the ground orthogonal to the flight track on one side. There are three main types of acquisition modes with side scanning: stripmap mode, spotlight mode, and scan mode. In the stripmap mode, the antenna stays in a fixed position. In the spotlight mode, the antenna is steered continually to illuminate the same patch over a long period of time, which results in very high azimuth resolution. In the scan mode, the antenna sweeps periodically to achieve a balance between the azimuth resolution and the scan area. Different sensors may have slight variants of these acquisition modes. Figure 2.6 illustrates different side-scanning acquisition modes available for Sentinel-1.
Sentinel-1 does not provide a spotlight acquisition mode. In the stripmap mode, the beam points to the side at the same angle without moving (Callow 2003). The ground swath is illuminated by a continuous sequence of pulses, while the antenna beam points to a fixed azimuth angle and an approximately fixed off-nadir angle. The stripmap mode of Sentinel-1 has a swath of 80 km. The Interferometric Wide Swath (IW) mode adopts a variant of the ScanSAR mode, Terrain Observation with Progressive Scan (TOPS), in which the beam is steered back and forth in range as well as in the azimuth direction. This variant aims to achieve the same coverage and resolution as the scan mode while providing a nearly uniform signal-to-noise ratio and distributed target ambiguity ratio. The Extra Wide Swath (EW) mode covers a wider swath with lower spatial resolution by adopting the same technique as IW. The Wave (WV) mode takes a ‘leap frog’ acquisition pattern which


Fig. 2.6 Remote Sensing Work Mode – Side Scanning

is suitable for open ocean. The Phased Array type L-band Synthetic Aperture Radar-2 (PALSAR-2) onboard the Advanced Land Observing Satellite 2 (ALOS-2) has three acquisition modes: spotlight mode, stripmap mode, and ScanSAR mode. In the spotlight mode, a narrow beam is steered mechanically or electronically, perpendicular to the flight track, onto a small patch of the target area as the sensor passes (Jakowatz 1996). Figure 2.7 illustrates the side-look spotlight acquisition mode. The spotlight illuminates the same patch on the ground, which results in data on a polar grid (Yocky et al. 2006).

2.2.2.5 Conical Scanning

Conical scanning is a configuration in which the sensor beam is scanned conically around the z-axis, which points in the nadir direction (Okamoto et al. 2009). The scan angle is the same as the cone angle. Figure 2.8 illustrates the configuration of a conical scanning sensor. The Advanced Microwave Scanning Radiometer for Earth Observing System (EOS) (AMSR-E) onboard Aqua is a conical scanning system (Kawanishi et al. 2003). It scans across the Earth along a conical surface and maintains a constant Earth incidence angle of 55°.
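The constant Earth incidence angle of a conical scanner follows from simple geometry. The sketch below is an illustration only. It assumes a spherical Earth of radius 6371 km and applies the law of sines, sin(θ_inc) = ((R + h)/R)·sin(θ_nadir), where θ_nadir is the off-nadir (cone) angle and h is the orbit altitude. The AMSR-E inputs used here, an off-nadir angle of 47.4° and an altitude of 705 km, are not given in this chapter and are assumed values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption


def earth_incidence_angle_deg(off_nadir_deg, altitude_km):
    """Incidence angle at the surface for a beam tilted off nadir.

    Law of sines in the triangle (Earth center, satellite, ground point):
    sin(incidence) = (R + h) / R * sin(off_nadir).
    """
    ratio = (EARTH_RADIUS_KM + altitude_km) / EARTH_RADIUS_KM
    sin_incidence = ratio * math.sin(math.radians(off_nadir_deg))
    if sin_incidence > 1.0:
        raise ValueError("beam misses the Earth for this angle and altitude")
    return math.degrees(math.asin(sin_incidence))


# Assumed AMSR-E geometry (not from this chapter): 47.4 deg off nadir, 705 km altitude.
angle = earth_incidence_angle_deg(47.4, 705.0)
print(f"Earth incidence angle ~ {angle:.1f} deg")
```

Because the off-nadir angle stays fixed while the beam rotates around the nadir axis, every footprint along the scan shares this incidence angle. That constancy is the practical appeal of the conical configuration for radiometers.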


Fig. 2.7 Remote Sensing Work Mode – Side Look Spotlight acquisition mode

2.3 Platforms

2.3.1 Satellites

Satellites are by far the most common platforms in civilian remote sensing. The orbit modes of satellites are mainly sun-synchronous orbit, geostationary orbit, and in-between orbits (between polar orbiting and geostationary). A sun-synchronous orbit covers a given area at the same local overpass time. Most remote sensing satellites take a polar or near-polar orbit, where the path passes above or nearly above both poles. Sun-synchronous polar orbits are normally low Earth orbits between 200 and 1000 km. This is the most common orbit mode. Example satellites are Landsat, Terra, and Aqua. Satellites


Fig. 2.8 Remote Sensing Work Mode – Conical Scanning

with a geostationary orbit have a fixed nadir point. A circular geosynchronous orbit has a constant altitude of 35,786 km (22,236 mi), and all geosynchronous orbits share the same semi-major axis so that the orbital period matches the period of the Earth’s rotation on its axis, which is 23 h, 56 min, and 4 s. Most geostationary satellites are weather satellites that require coverage of the same area with very high temporal resolution. Example satellites are the Geostationary Operational Environmental Satellite (GOES) series, FengYun (FY) series, Elektro-L, Meteosat series, and Indian National Satellite System (INSAT) series. Other orbits lie between the sun-synchronous polar orbit and the geostationary orbit. They may be medium Earth orbits (MEO) or intermediate circular orbits (ICO) between 2000 km and 35,786 km. Many navigation, communication, and geodetic satellites fall in this category. Non-sun-synchronous satellites are less commonly used in remote sensing.
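The geosynchronous altitude quoted above can be checked from the rotation period with Kepler's third law, a = (GM·T²/(4π²))^(1/3). The sketch below is a back-of-the-envelope check rather than an orbit model. The gravitational parameter and equatorial radius are standard published constants that do not appear in this chapter, and the function name is hypothetical.

```python
import math

GM_EARTH = 3.986004418e14        # Earth's gravitational parameter, m^3/s^2
EQUATORIAL_RADIUS_M = 6_378_137.0  # Earth's equatorial radius, m
SIDEREAL_DAY_S = 23 * 3600 + 56 * 60 + 4  # 23 h 56 min 4 s, as given in the text


def circular_orbit_altitude_km(period_s):
    """Altitude above the equator of a circular orbit with the given period."""
    # Kepler's third law solved for the semi-major axis.
    semi_major_axis_m = (GM_EARTH * (period_s / (2.0 * math.pi)) ** 2) ** (1.0 / 3.0)
    return (semi_major_axis_m - EQUATORIAL_RADIUS_M) / 1000.0


geo_altitude_km = circular_orbit_altitude_km(SIDEREAL_DAY_S)
print(f"Geosynchronous altitude ~ {geo_altitude_km:,.0f} km")
```

The same function gives the period-to-altitude relation for any circular orbit, so it can be used to place low and medium Earth orbits on the same scale.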


Fig. 2.9 Remote Sensing Satellites

There have been many satellite missions for remote sensing since the launch of Landsat-1 in 1972. Figure 2.9 shows the timeline of major satellite missions. The manned International Space Station (ISS) has also been used as a platform in space to carry out a series of remote sensing missions. The Global Ecosystem Dynamics Investigation (GEDI) is a LIDAR instrument system onboard ISS to measure surface topography, canopy height metrics, canopy cover metrics, and vertical structure metrics (Patterson et al. 2019; Dubayah et al. 2022). The DESIS (DLR Earth Sensing Imaging Spectrometer) is a hyperspectral Earth observation instrument to measure changes in the ecosystems on the Earth’s surface (Reulke et al. 2018). The Hyperspectral Imager Suite is another hyperspectral instrument onboard ISS to experiment with a full-scale application development of hyperspectral remote sensing for oil/gas/mineral resource exploration, agriculture, forestry, and coastal studies (Matsunaga et al. 2022). The Prototype HyspIRI Thermal Infrared Radiometer (PHyTIR) of the ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS) program measures land surface temperature for ecosystem stress studies (Gorokhovich et al. 2022).

2.3.2 Airborne

Airborne remote sensing platforms include both airplanes and Unmanned Aerial Vehicles (UAVs). Airborne remote sensing is widely used for collecting high spatial resolution data over small areas. It is also commonly used in validation and verification of satellite remote sensing. Airborne sensors are used in aerial photography for high-resolution mapping and accurate Digital Elevation Model (DEM) generation. UAVs have become popular due to their low cost and ease of control. NASA Ames operates a UAV fleet for remote sensing. UAVs can be used effectively in real-time surveying and monitoring because of their quick deployment and very high spatial resolution.

2.3.3 In Situ

There are different types of in-situ sensors for different spectral bands. They can be fixed, mounted on a tower or on the ground. They can also be mobile, such as those mounted on cars or carried by a person. In-situ sensors can be used to collect data on both the Earth’s surface and atmospheric properties.

2.3.4 Shipborne

Shipborne remote sensing instruments are mostly sonars, which are used to measure underwater conditions or the underwater surface. Side-scanning sonar is one of the most used sensors in shipborne remote sensing.

References

Abu Nokra NM (2004) Generation an ideal DEM by fusion shape from shading and interferometry bathymetries for seafloor remote sensing. In: Proceedings of SPIE. SPIE, Barcelona, pp 204–215
Aguilar MA, Aguilar FJ, del Mar Saldaña M, Fernández I (2012) Geopositioning accuracy assessment of GeoEye-1 panchromatic and multispectral imagery. Photogramm Eng Remote Sens 78:247–257. https://doi.org/10.14358/PERS.78.3.247
Bai W, Deng N, Sun Y et al (2020) Applications of GNSS-RO to numerical weather prediction and tropical cyclone forecast. Atmos 11:1204. https://doi.org/10.3390/atmos11111204
Barnsley M (1999) Digital remotely-sensed data and their characteristics. In: Longley PA, Goodchild MF, Maguire DJ, Rhind DW (eds) Geographical information systems, 2nd edn. Wiley, New York, pp 451–466
Bergen KM, Goetz SJ, Dubayah RO et al (2009) Remote sensing of vegetation 3-D structure for biodiversity and habitat: review and implications for lidar and radar spaceborne missions. J Geophys Res Biogeosci 114, G00E06. https://doi.org/10.1029/2008JG000883
Bobylev LP, Zabolotskikh EV, Mitnik LM, Mitnik ML (2010) Atmospheric water vapor and cloud liquid water retrieval over the arctic ocean using satellite passive microwave sensing. IEEE Trans Geosci Remote Sens 48:283–294. https://doi.org/10.1109/TGRS.2009.2028018
Callow HJ (2003) Signal processing for synthetic aperture sonar image enhancement. Ph.D. Dissertation, University of Canterbury
Carneggie DM (1968) Remote sensing: review of principles and research in range and wildlife management. In: Paulsen HA, Reid EH (eds) Range and wildlife habitat evaluation - a research symposium. U.S. Dept. of Agriculture, Forest Service, Flagstaff and Tempe, Arizona, pp 165–178


Chen F, Liu L, Guo F (2019) Sea surface height estimation with multi-GNSS and wavelet De-noising. Sci Rep 9:15181. https://doi.org/10.1038/s41598-019-51802-9
Cho H, Gu J, Joe H et al (2015) Acoustic beam profile-based rapid underwater object detection for an imaging sonar. J Mar Sci Technol 20:180–197. https://doi.org/10.1007/s00773-014-0294-x
Davis JL, Cosmo ML, Elgered G (1996) Using the global positioning system to study the atmosphere of the earth: overview and prospects. In: Beutler G, Melbourne WG, Hein GW, Seeber G (eds) GPS trends in precise terrestrial, airborne, and spaceborne applications. Springer, Berlin, Heidelberg, pp 233–242
Doviak RJ, Zrnić DS (2006) Doppler radar and weather observations, 2nd edn. Dover ed., Dover Publications, Mineola, N.Y
Drusch M, Wood EF, Jackson TJ (2001) Vegetative and atmospheric corrections for the soil moisture retrieval from passive microwave remote sensing data: results from the Southern Great Plains Hydrology Experiment 1997. J Hydrometeorol 2:181–192. https://doi.org/10.1175/1525-7541(2001)0022.0.CO;2
Dubayah R, Armston J, Healey SP et al (2022) GEDI launches a new era of biomass inference from space. Environ Res Lett 17:095001. https://doi.org/10.1088/1748-9326/ac8694
Estes J, Kline K, Collins E (2001) Remote sensing. In: Smelser NJ, Baltes PB (eds), International Encyclopedia of the social & behavioral sciences. Pergamon, Oxford, UK. pp 13144–13150
Farley V, Vallières A, Villemaire A et al (2007) Chemical agent detection and identification with a hyperspectral imaging infrared sensor. In: Kamerman GW, Steinvall OK, Lewis KL et al (eds), Proc. SPIE 6661, Imaging Spectrometry XII, 66610L, Florence, Italy, p 673918, https://doi.org/10.1117/12.736731
Fischer WA, Hemphill WR, Kover A (1976) Progress in remote sensing (1972–1976). Photogrammetria 32:33–72. https://doi.org/10.1016/0031-8663(76)90013-2
Fouquet M, Ward J (1998) Cost-driven design of small satellite remote sensing systems. J Reduc Space Mission Cost 1:159–175. https://doi.org/10.1023/A:1009917804847
Garrison JL, Katzberg SJ (2000) The application of reflected GPS signals to ocean remote sensing. Remote Sens Environ 73:175–187. https://doi.org/10.1016/S0034-4257(00)00092-4
Gleisner H, Ringer MA, Healy SB (2022) Monitoring global climate change using GNSS radio occultation. Npj Clim Atmos Sci 5:6. https://doi.org/10.1038/s41612-022-00229-7
Gomez C, Viscarra Rossel RA, McBratney AB (2008) Soil organic carbon prediction by hyperspectral remote sensing and field vis-NIR spectroscopy: an Australian case study. Geoderma 146:403–411. https://doi.org/10.1016/j.geoderma.2008.06.011
Gorokhovich Y, Cawse-Nicholson K, Papadopoulos N, Oikonomou D (2022) Use of ECOSTRESS data for measurements of the surface water temperature: significance of data filtering in accuracy assessment. Remote Sens Appl Soc Environ 26:100739. https://doi.org/10.1016/j.rsase.2022.100739
Graña M, Duro RJ (eds) (2008) Computational intelligence for remote sensing. Springer, Berlin
Gunapala SD, Bandara SV, Liu JK et al (2005) 1024 × 1024 pixel mid-wavelength and long-wavelength infrared QWIP focal plane arrays for imaging applications. Semicond Sci Technol 20:473–480. https://doi.org/10.1088/0268-1242/20/5/026
Gupta R, Hartley RI (1997) Linear pushbroom cameras. IEEE Trans Pattern Anal Mach Intell 19:963–975. https://doi.org/10.1109/34.615446
Hao M, Zhang J, Niu R et al (2018) Application of BeiDou navigation satellite system in emergency rescue of natural hazards: a case study for field geological survey of Qinghai−Tibet plateau. Geo-Spat Inf Sci 21:294–301. https://doi.org/10.1080/10095020.2018.1522085
Hearn DR, Digenis CJ, Lencioni DE et al (2001) EO-1 advanced land imager overview and spatial performance. In: IGARSS 2001. Scanning the present and resolving the future. Proceedings. IEEE 2001 international geoscience and remote sensing symposium (Cat. No.01CH37217). IEEE, Sydney, pp 897–900
Herve D, Coste G, Corlay G et al (1995) SPOT 4’s HRVIR and vegetation SWIR cameras. In: Andresen BF, Strojnik M (eds), Proc. SPIE 2552, Infrared Technology XXI, SPIE, San Diego, CA, USA. pp 833–842. https://doi.org/10.1117/12.218284

References

41

Hooper A, Zebker H, Segall P, Kampes B (2004) A new method for measuring deformation on volcanoes and other natural terrains using InSAR persistent scatterers: a new persistent scatterers method. Geophys Res Lett 31. https://doi.org/10.1029/2004GL021737 Horning N (2008) Remote sensing. In: Encyclopedia of ecology. Elsevier, pp 2986–2994 IEEE-AESS (2019) IEEE standard letter designations for radar-frequency bands. IEEE, New York ISO (2008) ISO/TS 19101-2:2008(en): Geographic information — Reference model — Part 2: Imagery. ISO ISRPS (2019) Statutes International Society for photogrammetry and remote sensing Itzler MA, Entwistle M, Owens M et al (2010) Design and performance of single photon APD focal plane arrays for 3-D LADAR imaging. In: Dereniak EL, Hartke JP, LeVan PD et al (eds), Proceedings Volume 7780, Detectors and Imaging Devices: Infrared, Focal Plane, Single Photon, 77801M, San Diego, CA, USA, p 77801M-1 - 77801M-15. Jakowatz CV (ed) (1996) Spotlight-mode synthetic aperture radar: a signal processing approach. Kluwer Academic Publishers, Boston Jin S, Cardellach E, Xie F (2014a) GNSS remote sensing: theory, methods and applications, 1st edn. Springer Netherlands: Imprint: Springer, Dordrecht Jin S, Cardellach E, Xie F (2014b) Atmospheric sensing using GNSS RO. In: GNSS remote sensing. Springer, Dordrecht, pp 121–157 Johannsson H, Kaess M, Englot B et al (2010) Imaging sonar-aided navigation for autonomous underwater harbor surveillance. In: 2010 IEEE/RSJ international conference on intelligent robots and systems. IEEE, Taipei, pp 4396–4403 Johansen K, Roelfsema C, Phinn S (2008) High spatial resolution remote sensing for environmental monitoring and management preface. J Spat Sci 53:43–47. https://doi.org/10.1080/1449859 6.2008.9635134 Kawanishi T, Sezai T, Ito Y et al (2003) The advanced microwave scanning radiometer for the earth observing system (AMSR-E), NASDA’s contribution to the EOS for global energy and water cycle studies. 
IEEE Trans Geosci Remote Sens 41:184–194. https://doi.org/10.1109/ TGRS.2002.808331 Kramer HJ (2002) Observation of the earth and its environment: survey of missions and sensors. Springer, Berlin/New York Krieger G, Hajnsek I, Papathanassiou KP et al (2010) Interferometric Synthetic Aperture Radar (SAR) missions employing formation flying. Proc IEEE 98:816–843. https://doi.org/10.1109/ JPROC.2009.2038948 Kurihara J, Takahashi Y, Sakamoto Y et al (2018) HPT: a high spatial resolution multispectral sensor for microsatellite remote sensing. Sensors 18:619. https://doi.org/10.3390/s18020619 Lacaux JP, Tourre YM, Vignolles C et al (2007) Classification of ponds from high-spatial resolution remote sensing: application to Rift Valley Fever epidemics in Senegal. Remote Sens Environ 106:66–74. https://doi.org/10.1016/j.rse.2006.07.012 Ladreiter HP, Kirchengast G (1996) GPS/GLONASS sensing of the neutral atmosphere: model-independent correction of ionospheric influences. Radio Sci 31:877–891. https://doi. org/10.1029/96RS01094 Li X, Tan H, Li X et al (2018) Real-time sensing of precipitable water vapor from BeiDou observations: Hong Kong and CMONOC networks. J Geophys Res Atmos. https://doi. org/10.1029/2018JD028320 Ma R (2005) DEM generation and building detection from Lidar data. Photogramm Eng Remote Sens 71:847–854. https://doi.org/10.14358/PERS.71.7.847 Manolakis D, Marden D, Shaw G (2003) Hyperspectral image processing for automatic target detection applications. Linc Lab J 14:79–116 Marino RM, Stephens T, Hatch RE et al (2003) A compact 3D imaging laser radar system using Geiger-mode APD arrays: system and measurements. In: Kamerman GW (ed), Proceedings Volume 5086, Laser Radar Technology and Applications VIII, SPIE, Orlando, FL, USA p 1–15. https://doi.org/10.1117/12.501581 Martin-Neira M (1993) A passive reflectometry and interferometry system (PARIS): application to ocean altimetry. ESA J 17:331–355

42

2 Remote Sensing

Matsunaga T, Iwasaki A, Tachikawa T et al (2022) The status and early results of Hyperspectral Imager Suite (HISUI). In: IGARSS 2022 - 2022 IEEE international geoscience and remote sensing symposium. IEEE, Kuala Lumpur, Malaysia, pp 5399–5400 McKenzie LJ, Finkbeiner MA, Kirkman H (2001) Methods for mapping seagrass distribution. In: Global Seagrass research methods. Elsevier, pp 101–121 Moore GK (1979) What is a picture worth? A history of remote sensing / Quelle est la valeur d’une image? Un tour d’horizon de télédétection. Hydrol Sci Bull 24:477–485. https://doi. org/10.1080/02626667909491887 Nascimento JMP, Dias JMB (2007) Unmixing hyperspectral data: independent and dependent component analysis. In: Chang C-I (ed) Hyperspectral data exploitation. Wiley, Hoboken, pp 149–177 Njoku EG, Entekhabi D (1996) Passive microwave remote sensing of soil moisture. J Hydrol 184:101–129. https://doi.org/10.1016/0022-1694(95)02970-2 Nummedal K (1980) Wide-field imagers-Pushbroom or Whiskbroom scanners. In: Wolfe WL, Zimmerman J (eds), Proc. SPIE 0226, Infrared Imaging Systems Technology, SPIE, Washington, D.C., pp 38–52 Okamoto K, Shige S, Manabe T (2009) A conical scan type spaceborne precipitation radar. In: línea. 34th conference on radar meteorology, Virginia, US, pp 155–272 Parker DC (1962) Some basic considerations related to the problem of remote sensing. In: Proceedings of the first symposium on remote sensing of environment, University of Michigan, 19Ö2, p 7–18 Patterson PL, Healey SP, Ståhl G et al (2019) Statistical properties of hybrid estimators proposed for GEDI—NASA’s global ecosystem dynamics investigation. Environ Res Lett 14:065007. https://doi.org/10.1088/1748-9326/ab18df Poli D, Toutin T (2012) Review of developments in geometric modelling for high resolution satellite pushbroom sensors: geometric modelling for high resolution satellite pushbroom sensors. Photogramm Rec 27:58–73. 
https://doi.org/10.1111/j.1477-9730.2011.00665.x Poli D, Remondino F, Angiuli E, Agugiaro G (2015) Radiometric and geometric evaluation of GeoEye-1, WorldView-2 and Pléiades-1A stereo images for 3D information extraction. ISPRS J Photogramm Remote Sens 100:35–47. https://doi.org/10.1016/j.isprsjprs.2014.04.007 Qin S, Wang W, Song S (2018) Comparative study on vertical deformation based on GPS and leveling data. Geod Geodyn 9:115–120. https://doi.org/10.1016/j.geog.2017.07.005 Reulke R, Sebastian I, Krutz D et al (2018) DESIS - DLR earth sensing imaging spectrometer for the International Space Station ISS. In: Neeck SP, Kimura T, Martimort P (eds) Sensors, systems, and next-generation satellites XXII. SPIE, Berlin, p 18 Reuter D, Irons J, Lunsford A et al (2011) The Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS) on the Landsat Data Continuity Mission (LDCM). In: Shen SS, Lewis PE (eds), Proc. SPIE 8048, Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XVII, Orlando, FL, USA. p 804812 Robinove CJ (1963) Photography and imagery: a clarification of terms. Photogramm Eng 29:880–881 Rodriguez-Alvarez N, Podest E, Jensen K, McDonald KC (2019) Classifying inundation in a tropical wetlands complex with GNSS-R. Remote Sens 11:1053. https://doi.org/10.3390/ rs11091053 Rogan J, Chen D (2004) Remote sensing technology for mapping and monitoring land-cover and land-use change. Prog Plan 61:301–325. https://doi.org/10.1016/S0305-9006(03)00066-7 Schowengerdt RA (2007) The nature of remote sensing. In: Remote sensing. Elsevier, p 1–X Sekertekin A, Arslan N (2019) Monitoring thermal anomaly and radiative heat flux using thermal infrared satellite imagery – A case study at Tuzla geothermal region. Geothermics 78:243–254. https://doi.org/10.1016/j.geothermics.2018.12.014 Short NM (1982) The Landsat tutorial workbook: basics of satellite remote sensing. National Aeronautics and Space Administration, Scientific and Technical …

References

43

Slonecker T, Fisher GB, Aiello DP, Haack B (2010) Visible and infrared remote imaging of hazardous waste: a review. Remote Sens 2:2474–2508. https://doi.org/10.3390/rs2112474 Steffke AM, Fee D, Garces M, Harris A (2010) Eruption chronologies, plume heights and eruption styles at Tungurahua Volcano: integrating remote sensing techniques and infrasound. J Volcanol Geotherm Res 193:143–160. https://doi.org/10.1016/j.jvolgeores.2010.03.004 Toutin T, Gray L (2000) State-of-the-art of elevation extraction from satellite SAR data. ISPRS J Photogramm Remote Sens 55:13–33. https://doi.org/10.1016/S0924-2716(99)00039-8 Udelhoven T, Delfosse P, Bossung C et al (2013) Retrieving the bioenergy potential from maize crops using hyperspectral remote sensing. Remote Sens 5:254–273. https://doi.org/10.3390/ rs5010254 Uto K, Seki H, Saito G et al (2016) Development of a low-cost hyperspectral Whiskbroom imager using an optical fiber bundle, a swing mirror, and compact spectrometers. IEEE J Sel Top Appl Earth Obs Remote Sens 9:3909–3925. https://doi.org/10.1109/JSTARS.2016.2592987 Volpe V, Silvestri S, Marani M (2011) Remote sensing retrieval of suspended sediment concentration in shallow waters. Remote Sens Environ 115:44–54. https://doi.org/10.1016/j. rse.2010.07.013 Walker JP, Houser PR, Willgoose GR (2004) Active microwave remote sensing for soil moisture measurement: a field evaluation using ERS-2. Hydrol Process 18:1975–1997. https://doi. org/10.1002/hyp.1343 Wang Y, Tian F, Huang Y et al (2015) Monitoring coal fires in Datong coalfield using multi-source remote sensing data. Trans Nonferrous Met Soc China 25:3421–3428. https://doi.org/10.1016/ S1003-6326(15)63977-2 Wang R, Peethambaran J, Chen D (2018) LiDAR point clouds to 3-D Urban Models$:$ a review. IEEE J Sel Top Appl Earth Obs Remote Sens 11:606–627. https://doi.org/10.1109/ JSTARS.2017.2781132 Wu X, Ma W, Xia J et al (2020) Spaceborne GNSS-R soil moisture retrieval: status, development opportunities, and challenges. 
Remote Sens 13:45. https://doi.org/10.3390/rs13010045 Wu X, Guo P, Sun Y et al (2021) Recent progress on vegetation remote sensing using spaceborne GNSS-reflectometry. Remote Sens 13:4244. https://doi.org/10.3390/rs13214244 Wu DL, Emmons DJ, Swarnalingam N (2022) Global GNSS-RO electron density in the lower ionosphere. Remote Sens 14:1577. https://doi.org/10.3390/rs14071577 Yan Q, Huang W (2019) Sea ice remote sensing using GNSS-R: a review. Remote Sens 11:2565. https://doi.org/10.3390/rs11212565 Yocky D, Wahl D, Jakowatz C Jr (2006) Spotlight-Mode SAR image formation utilizing the Chirp Z-transform in two dimensions. In: 2006 IEEE international symposium on geoscience and remote sensing. IEEE, Denver, pp 4180–4182 Yu K, Rizos C, Burrage D et al (2014) An overview of GNSS remote sensing. EURASIP J Adv Signal Process 2014:134. https://doi.org/10.1186/1687-6180-2014-134 Yun Z, Binbin L, Luman T et al (2016) Phase altimetry using reflected signals from BeiDou GEO satellites. IEEE Geosci Remote Sens Lett 13:1410–1414. https://doi.org/10.1109/ LGRS.2016.2578361 Yunck TP, Melbourne WG (1996) Spaceborne GPS for earth science. In: Beutler G, Melbourne WG, Hein GW, Seeber G (eds) GPS trends in precise terrestrial, airborne, and spaceborne applications. Springer, Berlin, Heidelberg, pp 113–122 Zavorotny VU, Voronovich AG (2000) Scattering of GPS signals from the ocean with wind remote sensing application. IEEE Trans Geosci Remote Sens 38:951–964. https://doi. org/10.1109/36.841977

Chapter 3

Special Features of Remote Sensing Big Data

Abstract This chapter briefly covers the five core dimensions of remote sensing big data: volume, variety, velocity, veracity, and value. Other Vs remain to be explored, such as Visualization for effective display and exploration of high-dimensional data (Huang et al. J Integr Agric 17:1915–1931, 2018), Volatility for the time sensitivity of data (Antunes et al. GIScience Remote Sens 56:536–553, 2019), Validity for the exploration of hidden relationships among elements (Shelestov et al. Front Earth Sci 5, 2017), and Viscosity for complexity (Manogaran and Lopez Int J Biomed Eng Technol 25:182, 2017). Remote sensing big data may cover as many Vs as other big data (Khan et al. Proceedings of the International Conference on Omni-Layer Intelligent Systems - COINS ’19. ACM Press, Crete, Greece, 2019).

Keywords Remote sensing big data · Volume · Variety · Velocity · Veracity · Value

Remote sensing has been one of the major sources of continuously accumulating data and growing big data. The data accumulating in databases have become too complicated and voluminous to be efficiently confined, formed, stored, managed, shared, processed, analyzed, and visualized using conventional database software tools (Rathore et al. 2015). Remote sensing big data has dimensions similar to those of other big data, but with special features specific to the remote sensing domain. This chapter focuses on five dimensions of remote sensing big data, the 5Vs: Volume, Variety, Velocity, Veracity, and Value. These features are summarized in several review papers (Liu 2015; Ma et al. 2015; Liu et al. 2018; Huang et al. 2018). The following sections discuss each of these special features in turn.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_3

3.1 Volume of Remote Sensing Big Data

The volume of remote sensing big data is large, and it is growing exponentially as more and more remote sensors are launched. ESA collects 1.6 Tb/day from Sentinel-2A and 2B and publishes 10 TB/day of Sentinel data; the Sentinel archive, dating back to 2014 (Sentinel-1), now exceeds 5 petabytes. The USGS (United States Geological Survey) currently publishes 1.5 TB/day of Landsat 7 and 8 data. NASA’s Earth Observing System Data and Information System (EOSDIS) archive held more than 7.5 petabytes by 2012, with the Earth observation archive growing by 4 TB/day on average in 2012 (Ramapriyan et al. 2013). Collective public archives of remote sensing data exceed 1 EB (Ramapriyan et al. 2013). At the China Centre for Resources Satellite Data and Application (CRESDA), the total archive exceeded 7 PB by 2014, of which 4 PB was acquired in 2014 alone, at a daily growth rate of 10 TB from only 13 satellites, 10 of them in operation (Shao et al. 2015). The Gaia satellite sends 100 GB of data daily to the New Norcia (Australia) and Cebreros (Spain) ground stations (Brunet et al. 2012).
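To put the daily rates quoted in this section on a common footing, the following minimal sketch converts them to approximate yearly volumes. It assumes decimal units (1 PB = 1000 TB), a 365-day year, and a constant daily rate; the function name and the dictionary labels are illustrative and do not come from the cited sources.

```python
# Decimal (SI) units are assumed throughout: 1 PB = 1000 TB.
TB_PER_PB = 1000.0
DAYS_PER_YEAR = 365


def annual_volume_pb(daily_tb: float, days: int = DAYS_PER_YEAR) -> float:
    """Volume in PB accumulated over `days` at a constant rate of `daily_tb` TB/day."""
    return daily_tb * days / TB_PER_PB


# Daily rates quoted in this section, in TB/day (labels are illustrative).
daily_rates_tb = {
    "ESA Sentinel data published": 10.0,
    "USGS Landsat 7 and 8 data published": 1.5,
    "NASA EOSDIS archive growth (2012)": 4.0,
    "CRESDA archive growth (2014)": 10.0,
}

for name, rate in daily_rates_tb.items():
    print(f"{name}: {rate} TB/day is about {annual_volume_pb(rate):.2f} PB/year")
```

Under these assumptions, 10 TB/day accumulates to about 3.65 PB in a year, which is of the same order as the 4 PB reported as acquired at CRESDA in 2014.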

3.2 Variety of Remote Sensing Big Data

The variety of remote sensing big data is rooted in the special features of remote sensors: multisource, multitemporal, multi-format, and multi-scale (Liu 2015; Chi et al. 2016). Sensors can be designed with different spectral bands. The Sentinel series, Landsat series, SPOT (Satellite Pour l’Observation de la Terre, or Satellite for Observation of Earth) series, GOES (Geostationary Operational Environmental Satellites) series, and many other major satellite series continuously acquire data. The more than 200 satellite sensors in operation span a wide variety of spatial, temporal, and radiometric resolutions (Ma et al. 2015; Salazar Loor and Fdez-Arroyabe 2019). In addition to these large Earth observing satellites, a growing number of small satellites or minisatellites (100–500 kg in weight), microsatellites (10–100 kg in weight), nanosatellites (1–10 kg in weight), picosatellites (0.1–1 kg in weight), and even femtosatellites (10–100 g in weight) have been launched into space to collect data (Shiroma et al. 2011). Annual reports on the nano-/microsatellite market, published by SpaceWorks since 2011, show that the market is growing in size and variety year by year (Depasquale et al. 2010; DePasquale and Charania 2011; DePasquale and Bradford 2013; Buchen 2014, 2015; Doncaster et al. 2016, 2017a, b; Williams et al. 2018; DelPozzo et al. 2019; DelPozzo and Williams 2020). These satellites acquire a diverse array of remote sensing data types from space at different spatial, temporal, and radiometric resolutions. Example small satellites for Earth observation are PARASOL (Polarization & Anisotropy of Reflectances for Atmospheric Sciences coupled with Observations from a Lidar) with the POLDER (POLarization and Directionality of the Earth’s Reflectances) sensor (Li et al. 2018) and Sistema Satelital para Observación de la Tierra (SSOT), or FASat-Charlie, with the multispectral sensor NAOMI-1 (New AstroSat Optical Modular Instrument) (Mattar et al. 2014). Example microsatellites are the SkySat series with CMOS (Complementary Metal-Oxide Semiconductor) frame detectors (Murthy et al. 2014). Example nanosatellites for Earth observation are the Flock series with a telescope and a frame CCD (Charge-Coupled Device) camera (Safyan 2020) and RapidEye with a multispectral pushbroom imager (Tyc et al. 2005).

Different sensors are used onboard different platforms (e.g., satellite, airborne, and shipborne). Data may come in different forms and scales, including hyperspectral data, GEDI (Global Ecosystem Dynamics Investigation) LiDAR (Light Detection and Ranging), airborne LiDAR (discrete-return, full-waveform, multispectral, and single-photon), aerial photos and point clouds, terrestrial laser scanning, harvester data, field-based reference data, crowd-sourced data, personally collected data from UAVs (Unmanned Aerial Vehicles), and other sensors.

The variety of remote sensing big data has been reviewed and examined in Liu (2015). The leading causes of this variety are summarized as follows (Liu 2015):

• Multisource: There are multiple sources for acquiring remote sensing data. For example, there are laser, radar, optical, and microwave remote sensing.
• High-dimensional: There are spectral dimensions. With hyperspectral remote sensing, the number of dimensions to be considered is high. There are also temporal dimensions.
• Multitemporal/dynamic-state: Data can be acquired on different dates. The captured state of a target can vary.
• Multi-scale or multi-resolution: Data can be acquired at different scales, which results in different spatial resolutions.
• Isomer: Different representation structures can be used for targets at the same location or geographic coordinates. Either vector or raster may be used to represent the same data.
• Nonlinearity: The relationship between variables is nonlinear and cannot be analyzed accurately using linear methods.

Different data centers have to deal with a growing variety of remote sensing big data. At CRESDA, 13 land-observing satellites acquired different types of observations with their onboard sensors, and the products derived from them are even more numerous, with diverse applications (Shao et al. 2015). Direct applications of remote sensing big data have produced thousands of products. By 2012, NASA EOSDIS had archived more than 7000 unique dataset types (Ramapriyan et al. 2013). They cover all major natural science disciplines, including atmospheric science, land processes, oceanography, hydrology, and cryospheric science. More than 1.5 million individuals had been served by 2012 (Ramapriyan et al. 2013). These metrics have increased every year. By 2019, there were more than 11,000 unique collections archived and 3.5 million distinct users a year (EOSDIS 2020).
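The satellite mass classes listed earlier in this section can be expressed as a small lookup, sketched below. Because the quoted ranges share endpoints (for example, 10 kg ends one class and starts the next), the sketch assumes each range includes its lower bound and excludes its upper bound. That convention, the function name, and the sample masses are assumptions made for illustration, not part of the cited classification.

```python
from typing import Optional

# (class name, lower bound in kg, upper bound in kg), taken from the ranges in the text.
# Assumption: the lower bound is inclusive and the upper bound is exclusive.
MASS_CLASSES = [
    ("femtosatellite", 0.01, 0.1),   # 10-100 g
    ("picosatellite", 0.1, 1.0),
    ("nanosatellite", 1.0, 10.0),
    ("microsatellite", 10.0, 100.0),
    ("minisatellite", 100.0, 500.0),
]


def classify_by_mass(mass_kg: float) -> Optional[str]:
    """Return the small-satellite class for a mass in kg, or None if outside all ranges."""
    for name, low, high in MASS_CLASSES:
        if low <= mass_kg < high:
            return name
    return None


# Hypothetical masses, chosen only to exercise each branch.
for mass in (0.05, 0.5, 5.0, 50.0, 250.0, 2000.0):
    print(f"{mass} kg -> {classify_by_mass(mass)}")
```

With this convention, a mass of 500 kg or more, or of less than 10 g, falls outside the listed classes and returns `None`.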

3.3 Velocity of Remote Sensing Big Data

The velocity of remote sensing big data can be viewed from two perspectives: the growth rate (data generation and acquisition) and the processing rate (data processing and delivery to end users).

The growth rate of remote sensing big data is very high. At NASA EOSDIS, the total data archived was 33.6 petabytes in the reporting period between October 1, 2018 and September 30, 2019 (EOSDIS 2020). During this period, 11,929 unique data products were distributed. EOSDIS data and services had more than 3.5 million distinct users in the year, and the website had more than 3.4 million visits per minute. Average archive growth was 20 terabytes per day. At CRESDA, the data archive grew more than 2000-fold between 2000 and 2014 (Shao et al. 2015). Counting the downlink bandwidth of all 13 satellites, the aggregate rate at which the CRESDA archive grows can be as high as 3.7 GBps, or about 2 terabytes per day (Ma et al. 2015). The growth rate of remote sensing big data can also be observed in the staggering growth of metadata in major remote sensing catalogs. For example, the number of metadata records in the EOSDIS catalog database grew to over 129 million in 2012 and was increasing at an average rate of 66,000 records daily (Ramapriyan et al. 2013).

To make use of such a large volume of incoming data in real-time and near-real-time applications such as hazard monitoring, big remote sensing data must be processed and analyzed within seconds. It was estimated that fully functional systems would need more than 110 petaflops to handle the more than 7 petabytes of remotely sensed data in 2014, which is more than twice the theoretical peak capability (about 54 petaflops) of Tianhe-2A, the fastest machine in the world in 2014 (Shao et al. 2015). Data from multiple sensors or sensor constellations allow targets on Earth to be measured and monitored as often as every second. Revisit intervals are becoming shorter and shorter. At NASA EOSDIS, close to 2 billion products were distributed to end users during the 1-year period between October 1, 2018 and September 30, 2019 (EOSDIS 2020), which is much higher than the roughly 630 million data files distributed in 2012 (Ramapriyan et al. 2013). The average volume distributed to end users was 102.8 terabytes per day, which is much higher than the rate of 20 terabytes daily in 2012 (Ramapriyan et al. 2013).
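The rates quoted in this section mix peak bandwidth with daily averages, so a little arithmetic helps relate them. The sketch below computes how long a link running at the quoted aggregate rate would need to deliver the quoted daily volume, and the ratios between the 2012 and 2019 EOSDIS distribution figures. It assumes decimal units (1 TB = 1000 GB) and treats "GBps" under two readings, gigabytes per second and gigabits per second, since the text does not say which is meant; the function name is illustrative.

```python
def downlink_seconds(volume_tb: float, rate_gb_per_s: float) -> float:
    """Seconds of continuous transfer needed to move `volume_tb` TB at `rate_gb_per_s` GB/s.

    Assumes decimal units: 1 TB = 1000 GB.
    """
    return volume_tb * 1000.0 / rate_gb_per_s


# CRESDA figures from the text: 3.7 "GBps" aggregate rate, about 2 TB per day.
t_if_bytes = downlink_seconds(2.0, 3.7)        # reading GBps as gigabytes per second
t_if_bits = downlink_seconds(2.0, 3.7 / 8.0)   # reading it as gigabits per second

print(f"gigabytes/s reading: {t_if_bytes / 60:.1f} min of downlink per day")
print(f"gigabits/s reading:  {t_if_bits / 60:.1f} min of downlink per day")

# EOSDIS distribution, 2019 reporting year versus 2012 (figures from the text).
print(f"products distributed: {2e9 / 630e6:.1f} times the 2012 count")
print(f"daily distribution volume: {102.8 / 20.0:.1f} times the 2012 rate")

# Estimated compute demand versus the quoted theoretical peak.
print(f"compute demand: {110 / 54:.2f} times the quoted peak")
```

Under the gigabyte reading, 2 TB corresponds to roughly 9 minutes of transfer at the full aggregate rate, and under the gigabit reading to roughly 72 minutes. Either way, the two quoted figures are compatible only if the peak rate is sustained for a small part of each day; that is an interpretation of the numbers, not something the source states.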

3.4 Veracity of Remote Sensing Big Data

The veracity of remote sensing big data concerns data quality and data provenance. Uncertainty and errors in remote sensing arise from sensor inconsistency, incompleteness, ambiguities, latency, deception, and physical model approximation. The accuracy of remotely sensed data depends on many factors during acquisition, preprocessing, and communication (Li et al. 2016; NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup 2019a, b). Remotely sensed data may be fed into Earth science models to produce information products. The model may exhibit different levels of uncertainty in the final products due to errors in the initial data and their propagation along the workflow (Yang et al. 2017). Noise levels may differ depending on the source of error, such as sensor error, cloud, and atmospheric distortion. Sensor failure may cause missing data. Map projections are commonly involved in analytics, because they underlie geographic location and cross-referencing between different data sources. A projection introduces a certain level of uncertainty, which affects different measurements, such as distance and area, differently. Understanding the sources of errors and their propagation along the data processing workflow is important to users when they carry out quantitative analytics (NIST Big Data Public Working Group 2019).
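As one concrete illustration of how a projection affects distance and area differently, the sketch below evaluates the scale distortion of the spherical Mercator projection. The text does not name a projection, so Mercator is an assumption chosen only because its distortion has a simple closed form: the linear scale factor at latitude φ is 1/cos φ, and the area factor is its square.

```python
import math


def mercator_scale_factor(lat_deg: float) -> float:
    """Linear scale factor of the spherical Mercator projection at latitude `lat_deg`."""
    return 1.0 / math.cos(math.radians(lat_deg))


def mercator_area_factor(lat_deg: float) -> float:
    """Area scale factor; Mercator is conformal, so it is the linear factor squared."""
    return mercator_scale_factor(lat_deg) ** 2


for lat in (0, 30, 45, 60):
    k = mercator_scale_factor(lat)
    print(f"latitude {lat:>2} deg: distances x{k:.3f}, areas x{mercator_area_factor(lat):.3f}")
```

At 60 degrees latitude, map distances are stretched by a factor of 2 and areas by a factor of 4, so an area statistic computed directly on projected pixels would carry a larger error than a length measured on the same map.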

3.5 Value of Remote Sensing Big Data

The value of remote sensing big data is rooted in its diverse applications. These applications cover all natural science disciplines, including atmospheric science, land processes, oceanography, hydrology, and cryospheric science (Ramapriyan et al. 2013). Many use cases make effective use of remote sensing big data. NASA’s Earth Observing System Data and Information System (EOSDIS) remote sensing data repository has millions of users and downloads (Ramapriyan et al. 2013; NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup 2019a, b). The volume distributed is extremely high: in a given year it reaches the overall cumulative archive volume (NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup 2019a, b). Other cases include the EISCAT (European Incoherent Scatter Scientific Association) 3D Radar System, which provides continuous 3D monitoring of the atmosphere and ionosphere (Lindgren et al. 2010); the ENVRI (Common Operations of Environmental Research Infrastructures), which forms distributed, long-term, remote-controlled observation networks to help in understanding processes (Nieva de la Hidalga et al. 2017); the UAVSAR (Unmanned Air Vehicle Synthetic Aperture Radar), which effectively identifies landscape changes due to seismic activity, landslides, deforestation, vegetation changes, and flooding (Koo et al. 2012); the iRODS, which provides federated data for climate researchers, weather forecasters, and technologists in instrument development (Hedges et al. 2007); the MERRA (Modern-Era Retrospective Analysis for Research and Applications) Analytic Services (MERRA/AS), which provide pivotal support for global climate research and wildfire monitoring (Schnase et al. 2017); atmospheric turbulence event discovery and predictive analytics with MERRA and NARR (North American Regional Reanalysis) through teleconnections and association data mining (Scarsoglio et al. 2016; Zappala et al. 2020); the Web-Enabled Landsat Data (WELD), which supports global terrestrial monitoring at high spatial resolution (Roy et al. 2010); and the integration of in situ sensor networks and satellite remote sensing to cover Earth study at multiple scales (NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup 2019a, b).

References

Antunes RR, Blaschke T, Tiede D et al (2019) Proof of concept of a novel cloud computing approach for object-based remote sensing data analysis and classification. GIScience Remote Sens 56:536–553. https://doi.org/10.1080/15481603.2018.1538621
Brunet P-M, Montmorry A, Frezouls B (2012) Big data challenges, an insight into the Gaia Hadoop solution. In: SpaceOps 2012 conference, Stockholm, Sweden, 11–15 June 2012, pp 1263–1274
Buchen E (2014) SpaceWorks’ 2014 nano/microsatellite market assessment. In: Proceedings of the small satellite conference, technical session I: private endeavors. Utah State University, Logan, Utah, USA, pp 1–5
Buchen E (2015) Small satellite market observations. In: Proceedings of the small satellite conference, technical session VII: opportunities, trends and initiatives. SpaceWorks, Atlanta, pp 1–5
Chi M, Plaza A, Benediktsson JA et al (2016) Big data for remote sensing: challenges and opportunities. Proc IEEE 104:2207–2219. https://doi.org/10.1109/JPROC.2016.2598228
Nieva de la Hidalga A, Magagna B, Stocker M et al (2017) The ENVRI Reference Model (ENVRI RM) version 2.2, 30th October 2017. Zenodo
DelPozzo S, Williams C (2020) SpaceWorks’ 2020 nano/microsatellite market forecast. SpaceWorks Enterprises, Inc. (SEI), Atlanta
DelPozzo S, Williams C, Doncaster B (2019) SpaceWorks’ 2019 nano/microsatellite market forecast. SpaceWorks Enterprises, Inc. (SEI), Atlanta
DePasquale D, Bradford J (2013) Nano/microsatellite market assessment 2013. Public Release, Revision A, SpaceWorks
DePasquale D, Charania A (2011) Nano/microsatellite launch demand assessment 2011. SpaceWorks Commercial, November
Depasquale J, Charania AC, Kanamaya H, Matsuda S (2010) Analysis of the earth-to-orbit launch market for nano and microsatellites. In: AIAA SPACE 2010 Conference & Exposition. American Institute of Aeronautics and Astronautics, Anaheim
Doncaster B, Shulman J, Bradford J, Olds J (2016) SpaceWorks’ 2016 nano/microsatellite market assessment. In: Proceedings of the small satellite conference, technical session II: launch. Utah State University, Logan, Utah, United States, pp 1–6
Doncaster B, Williams C, Shulman J (2017a) SpaceWorks’ 2017 nano/microsatellite market forecast. SpaceWorks Enterprises, Inc. (SEI), Atlanta
Doncaster B, Williams C, Shulman J, Olds J (2017b) SpaceWorks’ 2017 nano/microsatellite market assessment. In: Proceedings of the small satellite conference, Swifty session 2. Utah State University, Logan, Utah, United States
EOSDIS (2020) System performance and metrics | Earthdata. In: System performance and metrics. https://earthdata.nasa.gov/eosdis/system-performance/. Accessed 16 Jun 2020
Hedges M, Hasan A, Blanke T (2007) Curation and preservation of research data in an iRODS data grid. In: Third IEEE international conference on e-science and grid computing (e-Science 2007). IEEE, Bangalore, India, pp 457–464
Huang Y, Chen Z, Yu T et al (2018) Agricultural remote sensing big data: management and applications. J Integr Agric 17:1915–1931. https://doi.org/10.1016/S2095-3119(17)61859-8
Khan N, Naim A, Hussain MR et al (2019) The 51 V’s of big data: survey, technologies, characteristics, opportunities, issues and challenges. In: Proceedings of the international conference on Omni-layer intelligent systems - COINS ’19. ACM Press, Crete, Greece, pp 19–24
Koo VC, Chan YK, Gobi V et al (2012) A new unmanned aerial vehicle synthetic aperture radar for environmental monitoring. Prog Electromagn Res 122:245–268. https://doi.org/10.2528/PIER11092604
Li S, Dragicevic S, Castro FA et al (2016) Geospatial big data handling theory and methods: a review and research challenges. ISPRS J Photogramm Remote Sens 115:119–133. https://doi.org/10.1016/j.isprsjprs.2015.10.012
Li Y, Li L, Zha Y (2018) Improved retrieval of aerosol optical depth from POLDER/PARASOL polarization data based on a self-defined aerosol model. Adv Space Res 62:874–883. https://doi.org/10.1016/j.asr.2018.05.034
Lindgren T, Ekman J, Backen S (2010) A measurement system for the complex far-field of physically large antenna arrays under noisy conditions utilizing the equivalent electric current method. IEEE Trans Antennas Propag 58:3205–3211. https://doi.org/10.1109/TAP.2010.2055780
Liu P (2015) A survey of remote-sensing big data. Front Environ Sci 3. https://doi.org/10.3389/fenvs.2015.00045
Liu P, Di L, Du Q, Wang L (2018) Remote sensing big data: theory, methods and applications. Remote Sens 10:711. https://doi.org/10.3390/rs10050711
Ma Y, Wu H, Wang L et al (2015) Remote sensing big data computing: challenges and opportunities. Futur Gener Comput Syst 51:47–60. https://doi.org/10.1016/j.future.2014.10.029
Manogaran G, Lopez D (2017) A survey of big data architectures and machine learning algorithms in healthcare. Int J Biomed Eng Technol 25:182. https://doi.org/10.1504/IJBET.2017.087722
Mattar C, Hernández J, Santamaría-Artigas A et al (2014) A first in-flight absolute calibration of the Chilean Earth Observation Satellite. ISPRS J Photogramm Remote Sens 92:16–25. https://doi.org/10.1016/j.isprsjprs.2014.02.017
Murthy K, Shearn M, Smiley BD et al (2014) SkySat-1: very high-resolution imagery from a small satellite. In: Meynart R, Neeck SP, Shimoda H (eds) Proceedings Volume 9241, Sensors, Systems, and Next-Generation Satellites XVIII. SPIE, Amsterdam, p 92411E
NIST Big Data Public Working Group (2019) NIST Big Data Interoperability Framework: volume 2, big data taxonomies version 3. National Institute of Standards and Technology, Gaithersburg
NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup (2019a) NIST Big Data Interoperability Framework: volume 1, definitions version 3. National Institute of Standards and Technology, Gaithersburg
NIST Big Data Public Working Group, Definitions and Taxonomies Subgroup (2019b) NIST Big Data Interoperability Framework: volume 3, use cases and general requirements version 3. National Institute of Standards and Technology, Gaithersburg
Ramapriyan H, Brennan J, Walter J, Behnke J (2013) Managing big data: NASA tackles complex data challenges. Earth Imaging J, 2013-10-18. https://eijournal.com/print/articles/managing-big-data
Rathore MMU, Paul A, Ahmad A et al (2015) Real-time big data analytical architecture for remote sensing application. IEEE J Select Top Appl Earth Observ Remote Sens 8:4610–4621. https://doi.org/10.1109/JSTARS.2015.2424683
Roy DP, Ju J, Kline K et al (2010) Web-enabled Landsat Data (WELD): Landsat ETM+ composited mosaics of the conterminous United States. Remote Sens Environ 114:35–49. https://doi.org/10.1016/j.rse.2009.08.011
Safyan M (2020) Planet’s Dove satellite constellation. In: Pelton JN (ed) Handbook of small satellites. Springer International Publishing, Cham, pp 1–17
Salazar Loor J, Fdez-Arroyabe P (2019) Aerial and satellite imagery and big data: blending old technologies with new trends. In: Dey N, Bhatt C, Ashour AS (eds) Big data for remote sensing: visualization, analysis and interpretation. Springer International Publishing, Cham, pp 39–59
Scarsoglio S, Iacobello G, Ridolfi L (2016) Complex networks unveiling spatial patterns in turbulence. Int J Bifurc Chaos 26:1650223. https://doi.org/10.1142/S0218127416502230
Schnase JL, Duffy DQ, Tamkin GS et al (2017) MERRA analytic services: meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Comput Environ Urban Syst 61:198–211. https://doi.org/10.1016/j.compenvurbsys.2013.12.003
Shao J, Xu D, Feng C, Chi M (2015) Big data challenges in China Centre for resources satellite data and application. In: 2015 7th workshop on hyperspectral image and signal processing: evolution in remote sensing (WHISPERS). IEEE, Tokyo, Japan, pp 1–4
Shelestov A, Lavreniuk M, Kussul N et al (2017) Exploring Google Earth Engine Platform for big data processing: classification of multi-temporal satellite imagery for crop mapping. Front Earth Sci 5. https://doi.org/10.3389/feart.2017.00017

52

3 Special Features of Remote Sensing Big Data

Shiroma W, Martin L, Akagi J et al (2011) CubeSats: a bright future for nanosatellites. Open Eng 1. https://doi.org/10.2478/s13531-011-0007-8 Tyc G, Tulip J, Schulten D et al (2005) The RapidEye mission design. Acta Astronaut 56:213–219. https://doi.org/10.1016/j.actaastro.2004.09.029 Williams C, Doncaster B, Shulman J (2018) SpaceWorks’ 2018 nano/microsatellite market forecast. SpaceWorks Enterprises, Inc. (SEI), Atlanta Yang C, Yu M, Hu F et al (2017) Utilizing cloud computing to address big geospatial data challenges. Comput Environ Urban Syst 61:120–128. https://doi.org/10.1016/j. compenvurbsys.2016.10.010 Zappala DA, Barreiro M, Masoller C (2020) Mapping atmospheric waves and unveiling phase coherent structures in a global surface air temperature reanalysis dataset. Chaos: an interdisciplinary. J Nonlin Sci 30:011103. https://doi.org/10.1063/1.5140620

Chapter 4

Remote Sensing Big Data Collection Challenges and Cyberinfrastructure and Sensor Web Solutions

Abstract This chapter introduces the major challenges of remote sensing big data collection: the 5Vs (Volume, Variety, Velocity, Veracity, and Value) and other challenges specific to remote sensing data collection, namely identification, storage, distribution, representation, fusion, and visualization. Cyberinfrastructure is introduced as one of the platforms that enable the collection and management of remote sensing big data. Three major cyberinfrastructures for remote sensing big data are briefly described: the Global Earth Observation System of Systems (GEOSS), the NASA Earth Observing System (EOS) Data and Information System (EOSDIS), and ESA Federated Earth Observation (FedEO). The sensor web is introduced as a solution for collecting live data from sensors. A framework for sensor web cyberinfrastructure, the Self-adaptive Earth Predictive System (SEPS), is outlined. Example applications of SEPS as a cyberinfrastructure-building framework are discussed in the areas of climate, weather, disasters, and agriculture.

Keywords Remote sensing big data · Cyberinfrastructure · Earth Science Modeling Framework · Earth Observing System · Federated catalog · Earth System · Sensor web

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_4

4.1 Remote Sensing Big Data Collection Challenges

Remote sensing big data started with the launch of Landsat-1 in 1972, after which digital data have been received continuously in large volumes. Remote sensing is generally defined as a technology for acquiring information about an object or a phenomenon without making physical contact with it. Big data refers to large or complex datasets that are beyond the capacity of traditional data-processing software. The launch of Landsat-1 thus catalyzed both remote sensing concepts and big data challenges. Since then, a variety of remote sensors have been launched to collect data.

The challenges associated with remote sensing big data collection are primarily related to the five dimensions of big data discussed in the previous chapter, that is, Variety, Volume, Velocity, Veracity, and Value. Variety refers to the range of working spectra, from visible light to sound waves; of working modes, from frame to conical scanning; of platforms, from ships to satellites; of organizations and individuals engaged in data collection; and of data formats and projections. Volume refers to the exponential growth of data volumes. Velocity refers to on-demand, real-time sensing for responding to emergencies. Veracity refers to the quality of data collected by all the different methods and organizations. Value refers to the various applications and uses of remote sensing big data.

Remote sensing data collection is supported by different agencies. NASA is the largest agency that builds, launches, and operates spaceborne remote sensors to collect various data. As of October 2019, the NASA Earth Observing System (EOS) had 20 historical missions (NASA 2019a), 48 completed missions (NASA 2019b), 28 current missions (NASA 2019c), and 15 future missions (NASA 2019d). Figure 4.1 shows the current missions as of October 10, 2019 (NASA 2019e). All of these missions acquire a large variety of remotely sensed data. These form the original data products available through the NASA Earth Observing System Data and Information System (EOSDIS), which also produces and serves further derived products.

As of June 27, 2020, the European Space Agency (ESA) had 23 shuttle/historical missions, 41 completed missions, 18 current missions, and 15 future missions (ESA 2020a, b). ESA distributes Earth observation data from ESA Earth Observation (EO) Missions, Third Party Missions (TPMs), ESA Campaigns, and the ESA Global Monitoring for Environment and Security (GMES) Space Component (GSC), as well as sample and auxiliary data from several missions and instruments. Data distributed by ESA are available under different data policies and through various access mechanisms.

Fig. 4.1 NASA EOS operating missions. (Source: https://eospso.nasa.gov/sites/default/files/u15/CURRENT-Earth-Missions10_2019%5B1%5D.png)


Current ESA missions include XMM (the X-ray Multi-Mirror Mission or the High Throughput X-ray Spectroscopy Mission), two Cluster II missions, INTEGRAL (the INTErnational Gamma-Ray Astrophysics Laboratory), Mars Express, CryoSat, SWARM, Gaia, Sentinel-1A, Sentinel-2A, ExoMars/TGO (the ExoMars Trace Gas Orbiter), Sentinel-1B, Sentinel-2B, Sentinel-5P, Aeolus (the Atmospheric Dynamics Mission Aeolus), BepiColombo, OPS-SAT, and Solar Orbiter.

The National Remote Sensing Center of China (NRSCC) distributes major Earth-observing remote sensing products, including the FengYun series, the Gaofen series, and the CBERS (China-Brazil Earth Resources Satellite) series. Other Earth missions of China include Huanjing, HaiYang, Pujiang, Tian Hui, the Ziyuan series, and CSES (China Seismo-Electromagnetic Satellite) (NRSCC 2020). The Indian Space Research Organisation (ISRO) manages the space program in India. Major Earth-observing satellites of India include the IRS (Indian Remote Sensing) series, the INSAT (Indian National Satellite System) geostationary series, and the RISAT (Radar Imaging Satellite) series (ISRO 2020). The Japan Aerospace Exploration Agency (JAXA) has operated the Marine Observation Satellite-1 (MOS-1), the Japanese Earth Resources Satellite 1 (JERS-1), the Advanced Earth Observing Satellite (ADEOS), the Advanced Earth Observing Satellite II (ADEOS II), the Greenhouse Gases Observing Satellite (GOSAT), the Advanced Land Observing Satellite (ALOS), and the Advanced Land Observing Satellite 2 (ALOS-2) (JAXA 2020).

Remote sensing data collection shares the general challenges of big data, that is, big data computing, big data sharing, and big data analytics (Chi et al. 2016). Big data computing must handle a large volume of data with reasonable performance. Big data sharing calls for exchanging data of many kinds and enabling the discovery of data. Big data analytics requires data to be analytic-ready, with proper technologies for preprocessing and transforming data into a model-ready form.

In addition to these general big data challenges, remote sensing big data collection has special challenges: identification, storage, distribution, representation, fusion, and visualization of remote sensing big data (Chi et al. 2016). Identifying the proper remote sensing big data for an application is the first step in application development; it includes the proper selection of spectral bands, spatial resolution, and temporal resolution. Data storage for remote sensing big data needs to support fast discovery and retrieval, especially where traditional relational database management may not be a good fit. Producing analytic-ready data requires selecting proper data preparation, data management technologies, and data-processing methods. Fusion of data from different scales, resolutions (spatial and temporal), and radiometric spectra is necessary for fully utilizing the variety of remote sensing big data in real applications. Proper representation of remote sensing data helps to fill the semantic gap, in which similar signatures may denote different objects or phenomena while different signatures may denote the same object or phenomenon (Liu et al. 2007, 2012). Effective high-dimensional data visualization is required for exploratory analysis of remote sensing big data.
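The identification step described above (choosing data by spectral band, spatial resolution, and temporal resolution) can be illustrated with a small filter over a product catalog. This is only a sketch: the `SensorProduct` record, its field names, and the catalog values are hypothetical and are not real sensor specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorProduct:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    bands_nm: tuple        # (min, max) wavelength range covered, in nanometres
    resolution_m: float    # spatial resolution (ground sample distance)
    revisit_days: float    # temporal resolution (nominal revisit interval)


def identify(catalog, band_nm, max_resolution_m, max_revisit_days):
    """Return products that cover the wavelength and meet both resolution limits."""
    return [
        p for p in catalog
        if p.bands_nm[0] <= band_nm <= p.bands_nm[1]
        and p.resolution_m <= max_resolution_m
        and p.revisit_days <= max_revisit_days
    ]


# Illustrative values only; they do not describe any real mission.
CATALOG = [
    SensorProduct("coarse-daily", (400.0, 2500.0), 500.0, 1.0),
    SensorProduct("medium-16day", (430.0, 2300.0), 30.0, 16.0),
    SensorProduct("fine-5day", (440.0, 2200.0), 10.0, 5.0),
]

# An application needing a near-infrared band at 30 m or better, every 10 days or less.
matches = identify(CATALOG, band_nm=865.0, max_resolution_m=30.0, max_revisit_days=10.0)
print([p.name for p in matches])
```

In practice the three criteria often conflict (fine spatial resolution usually comes with a longer revisit interval), which is one reason the fusion of data from several sensors is listed among the challenges.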


4.2 Remote Sensing Big Data Collection Cyberinfrastructure

The term cyberinfrastructure was coined in the United States. In the science domain, it refers to the environment that consists of computing systems, data, information resources, networking, digitally enabled sensors, instruments, virtual organizations, and observatories, along with an interoperable suite of software services and tools (Myers and Dunning 2006; National Science Foundation, Cyberinfrastructure Council 2007). Cyberinfrastructure describes research environments that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization, and other computing and information-processing services distributed over the Internet beyond the scope of a single institution. In scientific usage, cyberinfrastructure is a technological and sociological solution to the problem of efficiently connecting laboratories, data, computers, and people to enable the derivation of novel scientific theories and knowledge.

Remote sensing cyberinfrastructure supports remote sensing missions and helps to meet the 5V challenges. Several major cyberinfrastructures have been established for Earth sciences and remote sensing big data. This section discusses three of them: the Global Earth Observation System of Systems (GEOSS), the NASA Earth Observing System (EOS) Data and Information System (EOSDIS), and ESA Federated Earth Observation (FedEO).

4.2.1 Global Earth Observation System of Systems (GEOSS)

The intergovernmental Group on Earth Observations (GEO) is an international collaboration started in 2005 to promote the sharing of Earth Observation (EO) data. GEO is coordinating efforts to build a Global Earth Observation System of Systems (GEOSS) based on a 10-year implementation plan for the period 2005–2015 and a renewed 10-year implementation plan. The plan defines a vision statement for GEOSS, its purpose and scope, its expected benefits, and the nine “Societal Benefit Areas” of disasters, health, energy, climate, water, weather, ecosystems, agriculture, and biodiversity, as shown in Fig. 4.2.

GEOSS started with the design of the GEOSS Common Infrastructure (GCI). GCI follows a three-tier architecture model that consists of registries (i.e., the Component and Service Registry (CSR), the Standards Registry, and the Best Practice Wiki), the GEOSS Clearinghouse, and the GEO Web Portal. As GEOSS evolved, the GEOSS registries and the GEOSS Clearinghouse were merged into the GEO Discovery and Access Broker (GEO DAB), which supports data discovery and access. Figure 4.3 shows the current architecture of GEOSS.


Fig. 4.2 GEOSS and Nine Societal Benefit Areas. (Source: http://www.earthobservations.org/documents/200904_geo_info_sheets.pdf)

Fig. 4.3 GEOSS Architecture. (Source: https://www.earthobservations.org/geoss.php)


4.2.2 NASA Earth Observing System (EOS) Data and Information System (EOSDIS)

The NASA Earth Observing System (EOS) Data and Information System (EOSDIS) was the earliest operational remote sensing cyberinfrastructure. It has a distributed architecture that started with eight data centers in the early 1990s and now comprises 12 major data centers, the Distributed Active Archive Centers (DAACs). Figure 4.4 shows the distribution of the 12 DAACs and their themes. EOSDIS manages petabytes of data, the largest remote sensing data holding in the world. By the end of 2019, it managed over 34 petabytes of data across 11,000 unique research products (Nagaraja 2020), and the archive is estimated to grow to 247 petabytes by 2025 (Nagaraja 2020). Figure 4.5 shows the total archive and the archives of the individual DAACs since 2008 (EARTHDATA 2020a); both the total archive volume and its rate of growth have increased significantly.

Figure 4.6 shows the common services and components managed by EOSDIS (EARTHDATA 2020b). The DAACs are the primary data archive and distribution service providers, and each DAAC has some focused, specific services. For example, the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) provides science data, information, and services. At GES DISC, tools and service functionalities include (1) Automated Subscriptions to any data in the archive, (2) Subsetting Services (including On-demand Channel and Parameter Subset and On-the-Fly subsetting), (3) MOVAS (MODIS L3 atmospheric data Online Visualization

Fig. 4.4 NASA DAACs. (Source: https://www.earthdata.nasa.gov/s3fs-public/imported/DAAC_map_without_ECS.jpg)


[Chart for Fig. 4.5: archive volume (axis labeled "Volume (gigabyte)", 0 to 40,000) by fiscal year, FY08 to FY19, with series for ASDC, ASF, CDDIS, GESDISC, GHRC, LP DAAC, LAADS DAAC, NSIDC, OB.DAAC, ORNL, PO.DAAC, SEDAC, and the total]

Fig. 4.5 NASA Data Volume Increase over last decade (2008–2019). (Source: https://cdn.earthdata.nasa.gov/conduit/upload/13076/FY19_Annual_DAAC_Profile_V1.xlsx)

[Diagram labels for Fig. 4.6: Data Providers; DAAC (data, browse images, metadata); Global Imagery Browse Services; Common Metadata Repository; Worldview; Earthdata Search Client; DAAC Search tool; Earthdata Login; Other clients; logs; EOSDIS Metrics System]

Fig. 4.6 NASA EOSDIS Common Services


and Analysis System) for quick exploration, analysis, and visualization of the MODIS Atmosphere monthly Level-3 product, (4) NADM (Near-line Archive Data Mining), which lets users upload and test their algorithms and mine data from the GDAAC online cache, and (5) WebGIS, an online web application that supports Open Geospatial Consortium (OGC) standards for map requests and rendering.
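The two archive sizes quoted in this section (over 34 petabytes at the end of 2019 and an estimated 247 petabytes by 2025) imply a growth rate that is easy to work out. The sketch below assumes exactly 34 PB as the starting point, a six-year span, and constant compound growth; none of these assumptions comes from the source, so the result is only indicative.

```python
import math


def compound_annual_growth(start, end, years):
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


ARCHIVE_2019_PB = 34.0    # "over 34 petabytes" at the end of 2019 (rounded down)
ARCHIVE_2025_PB = 247.0   # estimate for 2025
YEARS = 2025 - 2019       # assumed span between the two figures

rate = compound_annual_growth(ARCHIVE_2019_PB, ARCHIVE_2025_PB, YEARS)
doubling_years = math.log(2.0) / math.log(1.0 + rate)

print(f"implied growth: {rate:.1%} per year")
print(f"implied doubling time: {doubling_years:.1f} years")
```

Under these assumptions the archive would grow by roughly 39% per year, doubling about every two years.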

4.2.3 ESA Federated Earth Observation (FedEO)

FedEO is a federated remote sensing infrastructure led by the European Space Agency (ESA). It started in 2005 as Heterogeneous Missions Accessibility (HMA), a collaborative project in Europe and Canada run by the Ground Segment Coordination Body (GSCB). The overall objectives are to (1) guarantee seamless and harmonized access to heterogeneous EO datasets from multiple mission ground segments, including national missions and ESA missions; (2) standardize the ground segment interfaces of satellite missions for easier access to EO data; and (3) provide interoperability for coordinated data access, enabling interactions with services or Value Adders and EO Contributing Missions. It follows a two-track approach: operational implementations and parallel “standardization and support activities” (e.g., software development and conformance testing).

HMA standards were defined through the work of 25 companies across 10 countries, with contributions from HMA project partners (agencies and users). All have been released as OGC standards. They are as follows:

1. OGC’s Cataloguing of ISO Metadata using the ebRIM profile of CS-W (OGC 07-038, OGC 13-084) (Lesage 2007; Voges et al. 2013) for collection and service discovery.
2. OGC’s GML 3.1.1 Application Schema for EO Products (OGC 06-080) and EOP O&M (OGC 10-157) (Gasperi et al. 2016) for EO product metadata, and ISO 19115 Geographic Information - Metadata (OGC 11-035) (Houbie and Smolders 2013) for EO collection metadata.
3. OGC’s Catalogue Service Specification 2.0 Extension Package for ebRIM Application Profile: Earth Observation Products (OGC 06-131) (Primavera 2008) for catalog service.
4. OGC’s Ordering Services for Earth Observation Products (OGC 06-141, OGC 13-042) (Marchionni and Pappagallo 2012; Marchionni 2014) for ordering services from catalogs.
5. OGC’s Sensor Planning Service Application Profile for EO Sensors (OGC 10-135, OGC 13-039, OGC 14-012) (Robin and Mérigot 2011; Fanjeau and Ulrich 2013, 2014) for programming feasibility analysis.
6. OGC’s WMS EO Extension (OGC 07-063) (Lankester 2007) for Web Map Service.
7. OGC WCS 2.0 extension for EO (OGC 10-140) (Baumann et al. 2018) for online data access as a Web Coverage Service, and Download Service for EO products (OGC 13-043) (Marchionni and Cafini 2013) for online data access through a download service.
8. OGC’s User Management Interfaces for Earth Observation Services (OGC 07-118) (Denis and Jacques 2013) for identity (user) management.

FedEO is a system providing a brokered discovery (and access) capability to European (and Canadian) EO mission data based on HMA (and other) standard interfaces. Figure 4.7 shows the overall architecture. The FedEO services support the OpenSearch interface as specified in the following specifications:

• CEOS OpenSearch Best Practice v1.2 (CEOS 2017)
• WGISS CDA OpenSearch Client Guide v1.0 (WGISS CDA System-Level Team 2019)
• WGISS FedEO Data Partner Guide-OpenSearch v1.0 (Coene and Della Vecchia 2019)
• OGC 10-157r4, Earth Observation Metadata profile of Observations & Measurements, Version 1.1 (Gasperi et al. 2016)
• OGC 10-032r8, OpenSearch Geo and Time Extensions (Gonçalves 2014)
• OGC 13-026r8, OpenSearch Extension for Earth Observation (Gonçalves and Voges 2014)
• Relevant OASIS searchRetrieve specifications (Denenberg et al. 2013).
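To make the OpenSearch interface more concrete, the sketch below fills an OpenSearch URL template with a search term, a bounding box (`geo:box`, from the Geo extension), and a start time (`time:start`, from the Time extension). In OpenSearch, a client reads such a template from the service's description document; parameters marked with `?` are optional. The endpoint and the query parameter names (`query`, `bbox`, `startDate`, `endDate`) used here are hypothetical, not taken from FedEO.

```python
import re
from urllib.parse import quote

# Hypothetical template of the kind an OpenSearch description document advertises.
TEMPLATE = (
    "https://example.org/opensearch/request?query={searchTerms}"
    "&bbox={geo:box?}&startDate={time:start?}&endDate={time:end?}"
)

# Matches {name} and {prefix:name}, with an optional trailing "?" marker.
_PARAM = re.compile(r"\{([A-Za-z0-9_:]+)(\?)?\}")


def fill_template(template, values):
    """Substitute template parameters; optional ones left unset become empty."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Keep "," and ":" readable; percent-encode everything else unsafe.
            return quote(str(values[name]), safe=",:")
        if optional:
            return ""
        raise KeyError(f"missing required parameter: {name}")
    return _PARAM.sub(substitute, template)


url = fill_template(TEMPLATE, {
    "searchTerms": "sea ice",
    "geo:box": "-10.0,35.0,5.0,45.0",      # west,south,east,north
    "time:start": "2019-01-01T00:00:00Z",
})
print(url)
```

Because the client depends only on the template, the same code can query any catalog that publishes a description document, which is what allows a broker to federate many partner catalogs behind one interface.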

[Diagram labels for Fig. 4.7: FedEO Clients; FedEO Gateway; FedEO OpenSearch Interface; Mediator Core; Open Search I/F (repeated for each connector); connectors: CWIC Connector, ISO Connector, OGC 06-131 Connector, Open Search Connector, ASF Connector; FedEO Collections: ESA/RSS Reference data, ESA M2CS, CDS EO-DAIL, ESA G-P00, ASF, VITO, DLR, EUMETSAT, SuperSites VA4, CWIC; interfaces: Open Search I/F, CSW ISO AP (OGC 07-045), CSW CWIC, CSW 115 EP (OGC 13-084), CSW EOP EP (OGC 06-131), OpenSearch (Geo and Time Extension [OGC 10-032], Extension for Earth Observation [OGC 13-026])]

Fig. 4.7 FedEO architecture. (Source: Coene, Y., 2015. FedEO FEDEO Client Partner Guide.)


4.3 Sensor Web

Sensor web is “a coordinated observation infrastructure composed of a distributed collection of resources (e.g., sensors, platforms, models, computing facilities, communications infrastructure) that can collectively behave as a single, autonomous, task-able, dynamically adaptive and reconfigurable observing system that provides raw and processed data, along with associated meta-data, via a set of standards-based service-oriented interfaces” (Di et al. 2010). It emerged from the challenge of dealing with “so many diverse sensors and so many data archives” and from the need to use task-specific sensors and integrate data from multiple sensors for emergency response, on-demand sensing, and event-driven sensing.

For on-demand sensing to support modeling, Earth observation (EO) through scientific sensors is the most important method for measuring the current state of the Earth system (ES). In recent years, one of the most significant technological developments in EO has been the EO sensor web, which makes it possible to feed live sensor information into Earth science models. The EO sensor web is a web of interconnected, heterogeneous, geospatial sensors that are interoperable, intelligent, dynamic, flexible, and scalable. The sensor web approach employs new data acquisition strategies and systems for the integration of data from multiple sensors.

ES models (ESMs) simulate the ES and/or its components in order to understand the functioning of the system and predict its future state. An ESM requires initial states to run as well as ground truth to validate the model. This information can be provided through sensor web observation. Although both areas have received heavy investment, traditionally there has been little integration and coupling between EO and ESMs.
In response to these requirements, a service-oriented general framework was developed to facilitate the dynamic connection and interoperation between the EO sensor web and Earth system models (the model web), with example applications in the GEOSS societal benefit areas. The framework is named the Self-adaptive Earth Predictive System (SEPS) (Di 2007).

SEPS adopts and implements the self-adaptation concept, a central piece of control theory that is widely used in engineering and military systems. Such a system contains a predictor and a measurer: (1) the predictor makes a prediction based on initial conditions, (2) the measurer then measures the state of a real-world phenomenon, (3) a built-in feedback mechanism automatically feeds the measurement back to the predictor, and (4) the predictor uses the measurement to calculate the prediction error and adjusts its internal state based on the error. Thus, the predictor learns from its error and makes a more accurate prediction in the next step. With this prediction-feedback mechanism, a system can quickly learn the pattern of the real-world phenomenon and make accurate predictions.

In the Earth science domain, we can consider the sensor web (SW) as the measurer and the ESM as the predictor. The concept of SEPS is the result of applying the self-adaptation concept to Earth system prediction. A SEPS consists of an EO component (SW) and an ESM component (the model web, MW), coupled by a SEPS framework component (the Connector). The EO SW measures the ES state, while the ESM predicts the evolution of that state. A feedback mechanism processes EO measurements and feeds them into the ESM, both as initial conditions and during model runs. A feed-forward mechanism compares ESM predictions with scientific goals in order to schedule or select further optimized or targeted observations. The SEPS framework automates the feedback and feed-forward mechanisms, together called the FF loop. The SEPS framework is implemented with open standard components under a service-oriented architecture (SOA).

The importance of improving model prediction and emergency response through rapid interaction between sensor systems and Earth system models has been recognized in recent years. The multiagency (NSF, NOAA, and NIH) research program on Dynamic, Data Driven Application System (DDDAS) is an example (Darema 2004). The current problems with the integration of the sensor web and Earth science models include the following: (1) Implementations do not recognize the sensors and the model as an integrated system. (2) Integration ignores the feed-forward from model to sensor for dynamic sensor planning and refined observation. (3) The coupling between the sensors and the Earth system models is done case by case in an ad hoc fashion, so implementations cannot be reused with different models and different sensor systems. To deal efficiently with the repetitive core FF loop in integrating sensor webs and Earth science models, a standards-based general SEPS framework is needed. The framework should work with various ESMs and EO sensors and data systems, fully support the automation of the FF loop, and provide generalization through standard interfaces and services under a service-oriented architecture.

Figure 4.8 shows the overall architecture of SEPS. The framework consists of five subcomponents interconnected by standard interfaces: (1) Data Discovery and Retrieval Services (DDRS), (2) Data Pre-processing, Integration, and Assimilation Services (PIAS), (3) Science Goal Monitoring Services (SGMS), (4) Data and Sensor Planning Services (DSPS), and (5) Coordination and Event Notification Services (CENS).

Fig. 4.8 Overall Architecture of SEPS

Open standards are the glue that ties all the components together to form a framework. There are three major groups of standard interfaces in the framework. The first group interfaces to the sensor world (the sensor web and data archives). The second group interfaces to the Earth system models (model webs and legacy models). The third group consists of internal data and service interfaces between components of the framework.

ISO 19130, the base standard on sensor data models for imagery and gridded data, provides the conceptual framework for enabling the sensor web (ISO 2014, 2018) and is the fundamental standard adopted in SEPS. The OGC Sensor Web Enablement (SWE) initiatives have developed six implementation specifications for the sensor web, which are the core open geospatial specifications adopted in the implementation of SEPS. First, SensorML is a general model and XML encoding of the geometric, dynamic, and observational characteristics of a sensor (Botts and Robin 2007). Second, Observations and Measurements (O&M) is a framework and GML encoding for measurements and observations (Cox 2010, 2011; Gasperi et al. 2016). Third, Sensor Planning Service (SPS) is a planning service by which a client can determine the feasibility of a desired set of collection requests for one or more mobile sensors or platforms (Simonis and Dibner 2007). Fourth, Web Notification Service (WNS) is an OGC best practice by which a client may conduct an asynchronous dialog with one or more other services (Simonis and Echterhoff 2006). Fifth, Sensor Alert Service (SAS) is an alert service by which a client can register for and receive sensor alert messages; it is part of the OGC Sensor Alert Service Interoperability Experiment (Simonis 2006).
Sixth, Sensor Observation Service (SOS) provides an API for managing deployed sensors and retrieving sensor data (Na and Priest 2007).

Internally, standard interfaces are adopted by the individual components within the SEPS framework. Data and metadata interoperation is primarily supported with OGC data interoperability protocols, which include Web Map Service (WMS), Web Feature Service (WFS), Web Coverage Service (WCS), and Catalogue Service for the Web (CSW). Data access also adopts the OPeNDAP data access protocols. Geo-processing interoperation is enabled with W3C, OASIS, and OGC web geo-processing protocols, which include Web Services Description Language (WSDL), Web Services Business Process Execution Language (WS-BPEL), Universal Description, Discovery, and Integration (UDDI), Simple Object Access Protocol (SOAP), and OGC Web Processing Service (WPS).

Data Discovery and Retrieval Services (DDRS) Component: DDRS is one of the two subcomponents in the framework that connect to the EO sensor and data world. The major functions of DDRS are to discover and obtain data. It uses open standards for interfacing with data sources, allowing a web sensor or an EO data system equipped with standard interfaces to be easily plugged into the framework. For data discovery, OGC CSW is primarily used for all data sources. For data access to traditional sensor data systems, the OGC WCS, OGC WFS, and OPeNDAP protocols are used. For connection to web-enabled sensors, OGC SWE protocols (e.g., SOS and SensorML) are used. The interfaces between DDRS and PIAS are CSW for catalog search and WCS and WFS for data communication. DDRS hides the complexity of the EO world behind its standard interfaces. To EO systems, DDRS is their client; to PIAS, DDRS is the single point of entry for all required data from the EO world.

Data Preprocessing, Integration, and Assimilation Services (PIAS) Component: ESMs need high-level geophysical products in a model-specific form as input. Such products are normally not available directly from sensor measurements but are derived from multi-sensor measurements and assimilated with data from other sources. The major function of PIAS is to generate ESM-specific products from data obtained through DDRS. For the SEPS framework to support easy plug-and-play of ESMs, a general approach to product generation is needed; SEPS uses the customizable virtual geospatial product approach. ESMs use products generated by PIAS in two ways: (1) as boundary conditions and (2) as run-time ground truth.

Science Goal Monitoring Services (SGMS) Component: SGMS takes ESM outputs and, in many cases, the geophysical products generated by PIAS as inputs and analyzes them against science goals to determine whether additional or refined geophysical products are needed by the model to meet these goals. Using geophysical products in SGMS rather than sensor measurements hides the complexity of the EO world from science-goal analysis and enables sensors to be plugged in and removed dynamically without affecting SGMS. Like PIAS, SGMS needs a general method for evaluating whether diverse science goals are met. Objective metrics such as information content and system uncertainty are used to ensure that SGMS can provide an accurate description of the additional or refined geospatial products needed. The interfaces between SGMS and PIAS are WCS and WFS. SGMS takes ESM outputs in either netCDF or the ESMF (Earth System Modeling Framework) state through WCS, OPeNDAP, or FTP.
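Several of the component interfaces above rely on WCS for moving gridded data. As a rough illustration of what such an exchange looks like on the wire, the sketch below builds a WCS 2.0 `GetCoverage` request in key-value-pair form with a spatial subset. The endpoint, the coverage identifier, and the axis labels (`Lat`, `Long`) are hypothetical; a real client would take them from the server's capabilities and coverage description, and the output format must be one the server advertises.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def wcs_get_coverage_url(endpoint, coverage_id, subsets, fmt="application/netcdf"):
    """Build a WCS 2.0 GetCoverage URL (KVP encoding) with one subset per axis."""
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("format", fmt),
    ]
    # Each trimmed axis becomes its own repeated "subset" parameter.
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    return f"{endpoint}?{urlencode(params)}"


url = wcs_get_coverage_url(
    "https://example.org/wcs",          # hypothetical endpoint
    "soil_moisture_daily",              # hypothetical coverage identifier
    {"Lat": (35.0, 45.0), "Long": (-10.0, 5.0)},
)

# Decode the query string again to check what the server would receive.
query = parse_qs(urlsplit(url).query)
print(query["subset"])
```

Because the request is just a URL, any component that can issue HTTP requests can act as a WCS client, which is what makes the plug-and-play arrangement described above practical.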
Data and Sensor Planning Services (DSPS) Component: The major function of DSPS is to translate the requirements for geophysical products to EO data requirements and to schedule the acquisition of such data. The translation service uses geospatial processing models to trace back and identify all raw datasets required for generating the products. From the specifications on geophysical products (e.g., time, spatial location, accuracy), the service determines the specifications for raw data. Then DSPS finds the best source for such data and schedules data acquisition by working with the CSW portal in DDRS. If the source is a data system, DSPS provides information to CENS for scheduling the ordering and retrieval. If the source is a schedulable Web sensor, DSPS uses OGC SPS to schedule observations and registers the observation event in CENS. Because a geophysical product can be traced back to multiple raw datasets, a request for the refined product may result in scheduling or discovering simultaneous multiple-sensor observations. A coincident search capability is provided in the CSW portal to help DSPS discover simultaneous observations.

Coordination and Event Notification Services (CENS) Component: CENS works as a rhythm controller for the FF loop. Since everything in the framework is implemented as a Web service, the loop can be implemented as a large workflow (FF loop workflow). Two mechanisms, dynamic data retrieval and asynchronous data notification, enable the dynamic data flow in the FF loop. These two mechanisms share the same goal of using dynamic data but approach it from the ESM and EO standpoints, respectively. For dynamic data retrieval, a request is initiated from an ESM. CENS then acts upon the request, searches, prepares, and downloads data, and then generates and feeds the specific products to the model. For asynchronous data notification, a sensor or data event initiates the FF loop. A notification service manages the event subscription and registry and monitors the state change of the registered events. Event-based communication is established between the server (the notification service) and the client (the ESM). The notification service keeps a register of events supported by the sensor web. OGC WNS and SAS are the key service protocols for the asynchronous mechanism.

SEPS aims at bridging existing Earth science models and the sensor web. Major interoperation between ESM and SEPS takes place in the interfaces between PIAS and ESM and between SGMS and ESM. Modern sensor web technology is based on a service-oriented architecture that utilizes Web service standards and technology, while most Earth science models still use conventional server/client architecture or even a single system under the Earth System Modeling Framework (ESMF). The FF loop mechanism bridges the two worlds.

ESMF is the standard that the modeling community has developed for interoperability between models. It is a high-performance, flexible software infrastructure that increases ease of use, performance portability, interoperability, and reuse in climate, numerical weather prediction, data assimilation, and other Earth science applications. It defines an architecture for composing multicomponent applications and includes data structures and utilities for developing model components. It defines standard function methods for organizing a model component. It also defines an ESMF data structure (ESMF state) for communication between components. There are two states in ESMF, the import state and the export state, which are used in SEPS to exchange states between SEPS service components and ESMF models.
Interoperation between ESMF models and SEPS services is achieved by exchanging and modifying the import and export states of the outermost component of a model. Some intermediate results or export states may also be accessed and exposed through standard SEPS Web service interfaces. The exchange of states between the SEPS framework and ESMF has been implemented as an OGC WPS process. Figure 4.9 illustrates the state exchange between SEPS and ESMF that enables the interoperation between SEPS and ESM.
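The import/export state exchange can be pictured with a small, library-free sketch. It does not use the ESMF API; the `State` alias, the `toy_model` function, and the field name are hypothetical stand-ins. The point is only the pattern: a service builds an import state, hands it to the outermost model component, and reads back the requested part of the export state.

```python
from typing import Callable, Dict, List

# A "state" here is just a mapping from field name to a list of values.
# This is a stand-in for an ESMF state, not the ESMF data structure itself.
State = Dict[str, List[float]]


def toy_model(import_state: State) -> State:
    """Hypothetical model component: warms every temperature value by 1 K."""
    temps = import_state["surface_temperature"]
    return {"surface_temperature": [t + 1.0 for t in temps]}


def exchange_states(model: Callable[[State], State],
                    requested_fields: List[str],
                    import_state: State) -> State:
    """Run the model on an import state and expose only the requested fields."""
    export_state = model(import_state)
    missing = [name for name in requested_fields if name not in export_state]
    if missing:
        raise KeyError(f"model did not export: {missing}")
    return {name: export_state[name] for name in requested_fields}


if __name__ == "__main__":
    result = exchange_states(
        toy_model,
        requested_fields=["surface_temperature"],
        import_state={"surface_temperature": [280.0, 285.5]},
    )
    print(result)
```

In a service setting, a function playing the role of `exchange_states` would sit behind a processing interface such as WPS, so that the client never touches the model internals.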

Fig. 4.9 State exchange between SEPS and ESMF. [Figure labels: PIAS (Data Preprocessing, Integration, and Assimilation Services) retrieves and checks input data from requests, processes and creates an input state, and serves it through standard WPS/WCS/WFS interfaces to the ESM to trigger model execution; ESMF superstructure (component classes GridComp, CplComp, State) containing Model 1 to Model n; ESMF infrastructure (data classes Bundle, Field, Grid, Array; utility classes Clock, Log, Prof, DELayout, Machine); the output state generated by the ESM is retrieved and processed through WCS/WFS to meet the user's requirements.]

4.4 Applications

4.4.1 Climate

The SEPS framework and cyberinfrastructure have been applied in interoperating sensor web and climate models. Experiments have been carried out on interoperating with selected Earth science models. The Community Atmosphere Model (CAM) was one of the selected models. Further information about the model can be found at http://www.ccsm.ucar.edu/models/atm-cam/. A service prototype was developed to establish a data link between CAM and SEPS through standard geospatial interfaces by providing/retrieving Import/Export states to/from the ESMF model. This demonstration shows that it is possible to change the initial state of the ESMF models and retrieve the resulting states of the running ESMF models. Experiment results confirm that the architecture can support interoperation between SEPS and ESMF-enabled Earth system models.

4.4.2 Weather

The SEPS framework was applied in the Detection and Tracking of Satellite Image Features Associated with Extreme Physical Events for Sensor Web Targeting Observing. It generalizes detection and tracking concepts into a workflow architecture to enable parallel development of exchangeable and interacting components. The results are fed into SEPS to update WCS and WFS capabilities with the developed detection and tracking data models and the Web Processing Services (WPS) implemented for the processing algorithms. Operational steps and verification methods are formulated to enable automated detection of the enhanced-V severe weather signature for future public watches and warnings. The completed SEPS implementation demonstrated the technical feasibility of supporting sensor web targeting applications with cloud-top detection and tracking components utilizing GOES and MODIS IR radiance observations. Figure 4.10 shows one of the outputs produced by the dynamically linked ESM when new MODIS sensor data became available.

Fig. 4.10 MODIS Couplet Detector and Correlation Matrix for the Enhanced-V Severe Storm Signature

4.4.3 Disasters

Flood and drought are the two most significant types of disasters that happen every year, causing significant damage to agriculture. Two systems have been developed using the SEPS framework to provide near-real-time disaster information and post-disaster damage estimation by integrating sensors with models. The first is the Global Agricultural Drought Monitoring and Forecasting System (https://gis.csiss.gmu.edu/GADMFS/), which links remote sensors to agricultural drought models to produce agricultural drought indices. The system enables interoperation among drought measure algorithms and prediction models and produces on-demand monitoring and forecasting. The second is the Remote-sensing-based Flood Crop Loss Assessment Service System (RF-CLASS) (http://dss.csiss.gmu.edu/RFCLASS), which enables interoperation between crop flood assessment models and the sensor web. The flood loss model is driven by sensors to provide information for flood insurance and mitigation. Multi-sensor near-real-time and historic data at different spatial and temporal resolutions have been used to drive the models and algorithms.

4.4.4 Agriculture

Near-real-time crop condition and progress information for large geographic areas is very important for agricultural decision making. The technology described in this chapter has been integrated into the National Crop Condition and Progress Monitoring System (NCCPMS). Near-real-time remote sensing data are automatically fed into models to generate the crop condition and progress information. Figure 4.11 shows the overall architecture of NCCPMS and its interactions under SEPS.

Fig. 4.11 National Crop Condition and Progress Monitoring System (NCCPMS). [Figure labels: satellites and sensors; SWE (sensor web technology); workflow and virtual product generator; CDL workflow; Cropland Data Layer; WEKA data mining knowledge flow; WOFOST; crop progress; crop conditions; virtual product; standard geospatial interfaces (e.g., WFS, WCS); reports and bulletin; users.]

References

Baumann P, Meissl S, Yu J (2018) OGC® Web Coverage Service 2.0 interface standard - earth observation application profile. Open Geospatial Consortium Inc., Wayland
Botts M, Robin A (2007) OpenGIS® Sensor Model Language (SensorML) implementation specification. Open Geospatial Consortium Inc., Wayland
CEOS (2017) CEOS OpenSearch best practice document version 1.2. CEOS
Chi M, Plaza A, Benediktsson JA et al (2016) Big data for remote sensing: challenges and opportunities. Proc IEEE 104:2207–2219. https://doi.org/10.1109/JPROC.2016.2598228
Coene Y, Della Vecchia A (2019) CEOS Working Group on Information Systems and Services, WGISS connected data assets, FedEO data partner guide (OpenSearch). CEOS Working Group on Information Systems and Services
Cox S (2010) Geographic information: observations and measurements OGC abstract specification topic 20. Open Geospatial Consortium Inc., Wayland
Cox S (2011) Observations and measurements - XML implementation. Open Geospatial Consortium Inc., Wayland
Darema F (2004) Dynamic data driven applications systems: a new paradigm for application simulations and measurements. In: Bubak M, van Albada GD, Sloot PMA, Dongarra J (eds) Computational science - ICCS 2004. Springer, Berlin, Heidelberg, pp 662–669
Denenberg R, Dixson L, Levan R et al (2013) searchRetrieve: part 0. Overview version 1.0. OASIS
Denis P, Jacques P (2013) OGC user management interfaces for earth observation services. Open Geospatial Consortium Inc., Wayland
Di L (2007) A general framework and system prototypes for the Self-Adaptive Earth Predictive Systems (SEPS) - dynamically coupling sensor web with earth system models (AIST-05-0064). In: ESTO-AIST sensor web PI meeting. NASA, San Diego, California, USA
Di L, Moe K, van Zyl TL (2010) Earth observation sensor web: an overview. IEEE J Select Top Appl Earth Observ Remote Sens 3:415–417. https://doi.org/10.1109/JSTARS.2010.2089575
EARTHDATA (2020a) EOSDIS annual metrics reports | Earthdata. https://earthdata.nasa.gov/eosdis/system-performance/eosdis-annual-metrics-reports/. Accessed 26 Jul 2020
EARTHDATA (2020b) EOSDIS components | Earthdata. https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/. Accessed 26 Jul 2020
ESA (2020a) Mission history. https://www.esa.int/About_Us/ESOC/Mission_history. Accessed 25 Jul 2020
ESA (2020b) Our missions. https://www.esa.int/ESA/Our_Missions. Accessed 25 Jul 2020
Fanjeau N, Ulrich S (2013) OGC® sensor planning service interface standard 2.0 earth observation satellite tasking extension. Open Geospatial Consortium Inc., Wayland
Fanjeau N, Ulrich S (2014) OGC RESTful encoding of OGC sensor planning service for earth observation satellite tasking. Open Geospatial Consortium Inc., Wayland
Gasperi J, Houbie F, Woolf A, Smolders S (2016) OGC® earth observation metadata profile of observations & measurements. Open Geospatial Consortium Inc., Wayland
Gonçalves P (2014) OGC® OpenSearch geo and time extensions. Open Geospatial Consortium Inc., Wayland
Gonçalves P, Voges U (2014) OGC® OpenSearch geo and time extensions. Open Geospatial Consortium Inc., Wayland
Houbie F, Smolders S (2013) EO product collection, service and sensor discovery using the CS-W ebRIM catalogue. Open Geospatial Consortium Inc., Wayland
ISO (2014) Geographic information - imagery sensor models for geopositioning - part 2: SAR, InSAR, lidar and sonar. ISO, Geneva
ISO (2018) Geographic information - imagery sensor models for geopositioning - part 1: fundamentals. ISO, Geneva
ISRO (2020) All missions - ISRO. https://www.isro.gov.in/all-missions-0. Accessed 26 Jul 2020
JAXA (2020) JAXA | Missions. In: JAXA | Japan Aerospace Exploration Agency. https://global.jaxa.jp/projects/. Accessed 26 Jul 2020
Lankester THG (2007) EO product collection, service and sensor discovery using the CS-W ebRIM catalogue. Open Geospatial Consortium Inc., Wayland
Lesage N (2007) OGC® Cataloguing of ISO Metadata (CIM) using the ebRIM profile of CS-W. Open Geospatial Consortium Inc., Wayland
Liu T, Zhang L, Li P, Lin H (2012) Remotely sensed image retrieval based on region-level semantic mining. EURASIP J Image Video Process 2012. https://doi.org/10.1186/1687-5281-2012-4
Liu Y, Zhang D, Lu G, Ma W-Y (2007) A survey of content-based image retrieval with high-level semantics. Pattern Recogn 40:262–282. https://doi.org/10.1016/j.patcog.2006.04.045
Marchionni D (2014) OGC RESTful encoding of ordering services framework for earth observation products. Open Geospatial Consortium Inc., Wayland
Marchionni D, Cafini R (2013) OGC Download Service for earth observation products best practice. Open Geospatial Consortium Inc., Wayland
Marchionni D, Pappagallo S (2012) Ordering services framework for earth observation products interface standard. Open Geospatial Consortium Inc., Wayland
Myers JD, Dunning TH (2006) Cyberenvironments and cyberinfrastructure: powering cyber-research in the 21st century. In: Proceedings of the foundations of molecular modeling and simulation, Blaine
Na A, Priest M (2007) Sensor observation service. Open Geospatial Consortium Inc., Wayland
Nagaraja MP (2020) Earth data | Science Mission Directorate. https://science.nasa.gov/earth-science/earth-data#eodis. Accessed 26 Jul 2020
NASA (2019a) Missions: historical missions | NASA’s Earth Observing System. https://eospso.nasa.gov/mission-category/2. Accessed 25 Jul 2020
NASA (2019b) Completed missions | NASA’s Earth Observing System. https://eospso.nasa.gov/completed-missions. Accessed 25 Jul 2020
NASA (2019c) Current missions | NASA’s Earth Observing System. https://eospso.nasa.gov/current-missions. Accessed 25 Jul 2020
NASA (2019d) Future missions | NASA’s Earth Observing System. https://eospso.nasa.gov/future-missions. Accessed 25 Jul 2020
NASA (2019e) NASA’s Earth Observing System. https://eospso.nasa.gov/. Accessed 25 Jul 2020
National Science Foundation, Cyberinfrastructure Council (2007) Cyberinfrastructure vision for 21st century discovery. National Science Foundation, Arlington
NRSCC (2020) 中华人民共和国科学技术部国家遥感中心 [National Remote Sensing Center of China, Ministry of Science and Technology of the People's Republic of China]. http://www.nrscc.gov.cn/nrscc/cyyfw/geosjgx2/zgygwxsjml/. Accessed 26 Jul 2020
Primavera R (2008) EO product collection, service and sensor discovery using the CS-W ebRIM catalogue. Open Geospatial Consortium Inc., Wayland
Robin A, Mérigot P (2011) OGC® sensor planning service interface standard 2.0 earth observation satellite tasking extension. Open Geospatial Consortium Inc., Wayland
Simonis I (2006) OGC® sensor alert service candidate implementation specification. Open Geospatial Consortium Inc., Wayland
Simonis I, Dibner PC (2007) OpenGIS® sensor planning service implementation specification. Open Geospatial Consortium Inc., Wayland
Simonis I, Echterhoff J (2006) Draft OpenGIS® web notification service implementation specification. Open Geospatial Consortium Inc., Wayland
Voges U, Houbie F, Lesage N, Vautier M-L (2013) I15 (ISO19115 metadata) extension package of CS-W ebRIM profile. Open Geospatial Consortium Inc., Wayland
WGISS CDA System-Level Team (2019) WGISS connected data assets client partner guide (OpenSearch). CEOS Working Group on Information Systems and Services

Chapter 5

Remote Sensing Big Data Computing

Abstract This chapter covers computing platforms that are fit for remote sensing big data computing. The evolution of the geospatial computing environment is briefly reviewed, from stand-alone through centralized to distributed. High-performance computing environments are also briefly reviewed, namely supercomputers, grid computing, and cloud computing. Service-oriented architecture is discussed in detail. Specialized big data processing frameworks are reviewed, including MapReduce/Hadoop, SPARK, and SCALE.

Keywords Distributed computing · Parallel computing · Geospatial computing platform · Service-oriented architecture · Super computing · Cluster computing · Grid computing · Cloud computing · Web service chaining · SPARK · MapReduce · Hadoop

Computing technologies have evolved. Different technologies are in use for the storage, management, analytics, processing, and presentation of remote sensing big data. There are different computing system architectures as well, such as distributed computing, cluster computing, grid computing, utility computing, cloud computing, fog computing, edge computing, and jungle computing (Kumar et al. 2019).

5.1 Computing Power to Handle Big Data: Distributed and Parallel Computing

Among the five Vs of big data, three are most closely related to computing power (including storage): velocity, volume, and variety. Velocity requires a quick turnaround time for processing. Volume requires the capacity to deal with the exponential increase of data volume. Variety requires data and processing capabilities that can handle geographically distributed and heterogeneous data. Geospatial platforms and system architectures have evolved with the overall evolution of computer and information technology and of the capabilities to deal with these three Vs.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_5

5.2 Evolution of Geospatial Computing Platform

The architecture of computer software systems has evolved through the following sequence: stand-alone software system architecture, client-server software system architecture, and distributed computing software architecture.

5.2.1 Stand-Alone Software System Architecture

In a stand-alone system, all data storage, data management, and processing happen on a single computer. Figure 5.1 shows the stand-alone software architecture. Data can be stored as files or in a relational database. This architecture is not suitable for handling remote sensing big data because the data and the processing load cannot all fit on a single stand-alone computer.

Fig. 5.1 Standalone software architecture

5.2.2 Client-Server Software System Architecture

A client-server software system relies on a centralized server to provide all capabilities for data management and processing. Figure 5.2 shows the general architecture of the client-server system. A server hosts all data and provides all computing capabilities. Clients access the server through the network to get storage and time-shares of the computing capabilities. This architecture is not fit for handling remote sensing big data because it is difficult for a single server, however powerful, to host remote sensing big data and sustain the constant input and output data flows from remote sensors.

Fig. 5.2 Client-server software architecture

5.2.3 Distributed Computing

The distributed computing software architecture consists of multiple service providers that clients access through a network, most commonly the Internet. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other to achieve a common goal. Data and computing power are distributed over multiple service providers in the network. Figure 5.3 shows the distributed computing architecture. This architecture is fit for handling remote sensing big data because data can be stored and managed by different services, and computing is handled by different services as requested.

Fig. 5.3 Distributed computing architecture

Three significant characteristics of distributed systems are the concurrency of components, lack of a global clock, and independent failure of components. There are different implementations of distributed systems. Examples are service-oriented architecture (SOA), massively multiplayer online games, and peer-to-peer applications.
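A minimal sketch of coordination by message passing follows. It runs two threads in one process, with standard-library queues standing in for the network, so it only imitates a distributed system; the component names and the message format are made up for illustration.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Hypothetical processing component: squares each number it receives."""
    while True:
        message = inbox.get()
        if message is None:          # sentinel message: no more work
            break
        outbox.put({"input": message, "result": message * message})


def run_demo(values):
    """Send values to the worker as messages and collect the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for value in values:
        inbox.put(value)             # the only coordination is via messages
    inbox.put(None)
    thread.join()                    # all replies are queued once this returns
    replies = []
    while not outbox.empty():
        replies.append(outbox.get())
    return replies


if __name__ == "__main__":
    print(run_demo([1, 2, 3]))
```

The two sides share no variables and no clock; each sees only the messages it receives, which is the property that lets real components run concurrently on separate machines and fail independently.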

5.3 Service-Oriented Architecture (SOA)

The service-oriented architecture (SOA) is the most popular computer software system architecture in distributed computing. SOA is an architectural pattern in computer software design in which application components provide services to other components via a communications protocol, typically over a network. The principles of service orientation are independent of any vendor, product, or technology. Under SOA, every component is a service, that is, a self-contained unit of functionality that can be discretely invoked over the network, in most cases the Internet.

Fig. 5.4 Roles in a SOA system

5.3.1 Service Roles

In an SOA system, the roles of services may be grouped into three major groups: service provider, service requestor, and service broker. Figure 5.4 shows the roles and their interactions in an SOA system. The service provider publishes services to a broker (registry) and delivers services to service requestors. The service requestor performs discovery operations on the service broker to find the service providers it needs and then accesses those service providers for the provision of the desired service. The service broker helps service providers and service requestors find each other by acting as a registry or clearinghouse of services and content. Catalog and registry services can be used as brokers.

5.3.2 Service Operations

As shown in Fig. 5.4, there are operations associated with the different roles. The operation Publish advertises (or removes) data and services to a broker (e.g., a registry, catalog, or clearinghouse). The operation Find is performed by service requestors and service brokers: service requestors describe to the broker the kinds of services they are looking for, and the broker delivers the results that match the request. The operation Bind is performed by a service requestor and a service provider, which negotiate as appropriate so that the requestor can access and invoke the services of the provider. The operation Chain binds a sequence of services to reach a goal.
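The Publish, Find, and Bind operations can be sketched in a few lines. The `ServiceBroker` class, the `reproject_service` function, and the "reprojection" description are hypothetical; a real broker would be a catalog or registry service reached over the network rather than an in-memory dictionary.

```python
from typing import Callable, Dict, List


class ServiceBroker:
    """Toy registry: maps a service description to provider callables."""

    def __init__(self) -> None:
        self._registry: Dict[str, List[Callable[[str], str]]] = {}

    def publish(self, kind: str, service: Callable[[str], str]) -> None:
        """Publish: a provider advertises a service under a description."""
        self._registry.setdefault(kind, []).append(service)

    def unpublish(self, kind: str, service: Callable[[str], str]) -> None:
        """Publish also covers removing an advertisement."""
        self._registry.get(kind, []).remove(service)

    def find(self, kind: str) -> List[Callable[[str], str]]:
        """Find: a requestor asks the broker for matching services."""
        return list(self._registry.get(kind, []))


def reproject_service(dataset: str) -> str:
    """Hypothetical provider operation."""
    return f"{dataset} (reprojected)"


if __name__ == "__main__":
    broker = ServiceBroker()
    broker.publish("reprojection", reproject_service)
    matches = broker.find("reprojection")
    bound = matches[0]                 # Bind: requestor selects a provider...
    print(bound("scene-001"))          # ...and invokes it directly
```

Note that the broker only stores and returns references; the actual invocation goes from requestor to provider, matching the division of roles in Fig. 5.4.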

5.3.3 Service Chaining

A service chain is defined as a sequence of services where, for each adjacent pair of services, the occurrence of the first action is necessary for the occurrence of the second action. When services are chained, they are combined in a dependent series to achieve larger tasks. Three types of chaining are defined in ISO 19119 and OGC (ISO 2016):

• User-defined (transparent): the human user defines and manages the chain.
• Workflow-managed (translucent): the human user invokes a service that manages and controls the chain, where the user is aware of the individual services in the chain.
• Aggregate (opaque): the human user invokes a service that carries out the chain, where the user has no awareness of the individual services in the chain.
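The dependency between adjacent services, and the difference between a transparent and an opaque chain, can be sketched with plain functions. The three services and the `steps` bookkeeping field are hypothetical and stand in for remote service calls.

```python
from functools import reduce
from typing import Callable, List

Service = Callable[[dict], dict]


def subset(data: dict) -> dict:
    """Hypothetical service: cut the scene to an area of interest."""
    return {**data, "steps": data["steps"] + ["subset"]}


def calibrate(data: dict) -> dict:
    """Hypothetical service: radiometric calibration."""
    return {**data, "steps": data["steps"] + ["calibrate"]}


def classify(data: dict) -> dict:
    """Hypothetical service: land cover classification."""
    return {**data, "steps": data["steps"] + ["classify"]}


def run_chain(services: List[Service], data: dict) -> dict:
    """Invoke services in order; each one consumes the previous output."""
    return reduce(lambda current, service: service(current), services, data)


def land_cover_service(data: dict) -> dict:
    """Aggregate (opaque) style: the caller sees one service, not the chain."""
    return run_chain([subset, calibrate, classify], data)


if __name__ == "__main__":
    # Transparent style: the user assembles and manages the chain.
    print(run_chain([subset, calibrate], {"steps": []}))
    # Opaque style: the user calls a single service.
    print(land_cover_service({"steps": []}))
```

A workflow-managed (translucent) chain would sit between the two: the user calls something like `run_chain` through a workflow service but can still see which services it contains.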

5.3.4 Web Services

A Web service is the implementation of a service on the Web, and it is the most common form of SOA implementation. A Web service consists of building blocks for the interface, operation, message, and binding protocols. Figure 5.5 shows the building blocks of a service. Web services are self-contained, self-describing, modular applications that can be published, located, and dynamically invoked across the Web. Web services perform functions, which can be anything from simple requests to complicated business processes. Once a Web service is deployed, other applications (and other Web services) can discover and invoke the deployed service.

5.3.5 Common Technology Stack for Web Services

Web services can be implemented following different specifications. Figure 5.6 shows common technology stacks for implementing an SOA system. The security stack may include transport-level security and authentication as well as message-level security (e.g., WS-Security if the SOAP binding is used). WSDL is the description language for Web services. UDDI is the registry for the discovery of Web services. The service binding can be plain HTTP or HTTP wrapped with a SOAP envelope. Workflow languages may use WS-BPEL, which is specifically designed to chain Web services described by WSDL.

Fig. 5.5 Building blocks of a service in a SOA system

5.3.5.1 Web Services Description Language (WSDL)

WSDL is a specification from W3C to describe networked services. It describes what a Web service can do, where it resides, and how to invoke it. W3C states that “WSDL is an XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. The operations and messages are described abstractly and then bound to a concrete network protocol and message format to define an endpoint. Related concrete endpoints are combined into abstract endpoints (services)” (Christensen et al. 2001; Booth and Liu 2007; Chinnici et al. 2007a, b).

Fig. 5.6 Common technology stacks for Web service

5.3.5.2 Universal Description, Discovery, and Integration (UDDI)

UDDI provides a mechanism for clients to dynamically find Web services. UDDI consists of two main parts: (1) the UDDI cloud of operator nodes, an Internet-wide repository (made up of white, yellow, and green pages) for Web services metadata, and (2) an API and data model standard for a Web services metadata repository. The former hosts the data, while the latter provides a means to access it. There are several public UDDI registries. UDDI is an OASIS standard.

5.3.5.3 The Simple Object Access Protocol (SOAP)

SOAP is a protocol specification that defines a uniform way of passing XML-encoded data. It also defines a way to perform remote procedure calls (RPCs) using HTTP as the underlying communication protocol. SOAP has three major parts for defining a message exchanged between two components: (1) The envelope defines a framework for describing message content and how to process it. (2) The encoding rules define a serialization mechanism used to exchange application-defined data types. (3) The remote procedure call (RPC) convention enables basic request/response interactions. SOAP is a W3C standard.
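The envelope part of a SOAP message can be sketched with the Python standard library alone. The example below builds a SOAP 1.1 envelope around an RPC-style call; the operation name `GetNDVI`, the application namespace `urn:example:ndvi`, and the parameter are hypothetical, and the sketch does not cover the encoding rules or sending the message over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace


def build_soap_request(operation: str, app_namespace: str, params: dict) -> str:
    """Wrap an RPC-style call in a SOAP 1.1 Envelope/Body."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    # The call element and its parameters live in the application namespace.
    call = ET.SubElement(body, f"{{{app_namespace}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(call, f"{{{app_namespace}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical operation and namespace, for illustration only.
    print(build_soap_request("GetNDVI", "urn:example:ndvi",
                             {"sceneId": "scene-001"}))
```

In practice the operation names, namespaces, and parameter types would come from the WSDL description of the service instead of being written by hand.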

5.3.5.4 Business Process Execution Language (BPEL)

BPEL was formerly called the “Web Service Flow Language” (WSFL). It is officially published as an OASIS standard for Web service chaining under the name WS-BPEL. Originally developed by IBM, it “is an XML language for the description of Web Services compositions. BPEL considers two types of Web Services compositions: The first type (flow models) specifies the appropriate usage pattern of a collection of Web Services, in such a way that the resulting composition describes how to achieve a particular business goal; typically, the result is a description of a business process. The second type (global models) specifies the interaction pattern of a collection of Web Services; in this case, the result is a description of the overall partner interactions.”

5.3.6 Web Service Applications

Web services are considered one of the key Internet technologies that can be widely used in E-business (both B2B and B2C), E-government, and E-science. A considerable amount of business opportunity exists in Web services. XML is the fundamental language of Web services. W3C is one of the primary organizations standardizing the interfaces, messages, and implementation recommendations.

5.3.7 Web Service Standards

The World Wide Web Consortium (W3C) is the major organization setting Web service standards. The core technologies XML and WSDL are defined and standardized by W3C; HTTP was developed together with the Internet Engineering Task Force (IETF), which publishes its specification. Another organization setting Web service standards is the Organization for the Advancement of Structured Information Standards (OASIS). Standards on ebXML, WS-BPEL, and UDDI are recommended by OASIS.

5.3.8 OGC Web Services

The Open Geospatial Consortium (OGC) is an international organization that sets geospatial information standards, particularly implementation standards. Geospatial Web services are Web services that perform geospatial functions. OGC Web services (OWS) are those services that reflect the OGC vision for geospatial data and application interoperability. OGC Web services follow the mainstream Web service standards and add extensions for the geospatial domain.

Fig. 5.7 Abstract Model of Fundamental Services and Data Building Blocks for OGC Web Service (OWS)

Figure 5.7 shows the abstract model adopted by OGC in designing and specifying implementation standards for geospatial Web services. OGC Web services define two major groups of components. The first group includes those that provide operations through which content is transformed in some way. The second group includes those that provide operations through which content is accessed or described. The former are “operation” components; the latter are “data” components.

5.3.8.1 Operation Components

OGC “operation” components operate on or with “data” components. Operation components fall into four categories: (1) client services (e.g., viewers and editors), (2) catalog and registry services, (3) data services, and (4) application services.

5.3.8.1.1 Client Services

Client services provide services directly to users. They bridge users and application services through their client-side and server-side interfaces. The client side interacts with users. The server side interacts with Server-side Client Applications, Application Servers, and Data Servers. Client services are primarily user-interface application components that provide views of the underlying data and operations and allow the user to exert control over them.

5.3.8.1.2 Catalog and Registry Services

A catalog and registry service supports access to catalogs and registries, which are composed of collections of metadata and types. Catalogs and registries are essentially repositories for metadata. Catalogs contain information about instances of datasets and services. Catalog services provide a search operation that can return metadata or the names of instances of datasets and services. Registries contain information about types. Types are defined by well-known vocabularies. Registry services implement a search operation that can return metadata or the names of types.

5.3.8.1.3 Data Services

Data services are the foundational service building blocks that serve data, specifically geospatial data in OWS. Data services provide access to collections of data (repositories). Resources inserted by data services are generally given a name. Data services usually maintain indexes to help speed up the process of finding items by name or by other attributes of the item. It is essential that multiple, distributed data services are accessed and their contents “exposed” in a consistent manner to other major components. OGC’s Web Map Service (WMS), Web Feature Service (WFS), Web Coverage Service (WCS), and Sensor Observation Service (SOS) are examples of data services.

5.3.8.1.4 Application Services

Application services operate on geospatial data and provide “value-add” services. Application services are components that, given one or more inputs, perform value-add processing on data and produce outputs. They can transform, combine, or create data. Application services can be tightly or loosely coupled with data services. OGC Web Processing Service (WPS) is the standard for application services.
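One thing the OGC data and application services named above have in common is a GetCapabilities operation, through which a service describes itself to clients. The sketch below builds such a request in key-value-pair form for each service type; the endpoint URL is a hypothetical placeholder, and details such as version negotiation are left out.

```python
from urllib.parse import urlencode

# Service types mentioned in the text above.
OGC_SERVICE_TYPES = ("WMS", "WFS", "WCS", "SOS", "WPS")


def capabilities_url(endpoint: str, service_type: str) -> str:
    """Build a GetCapabilities request for one of the OGC service types."""
    if service_type not in OGC_SERVICE_TYPES:
        raise ValueError(f"unsupported service type: {service_type}")
    query = urlencode({"service": service_type, "request": "GetCapabilities"})
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    for service_type in OGC_SERVICE_TYPES:
        print(capabilities_url("https://example.org/ows", service_type))
```

A client typically issues this request first and then uses the returned description to decide which layers, feature types, coverages, or processes to ask for.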
5.3.8.2 Data Components

OGC data components include (1) geospatial data, (2) geospatial metadata, (3) names, (4) relationships, and (5) containers.


5.3.8.2.1 Geospatial Data

Data are information about things or just plain information. Data can be created, stored, operated on, deleted, viewed, etc. Taken as a group or individually, data can have metadata, which is itself a kind of data. In OGC, GML and O&M are standards for encoding feature data.

5.3.8.2.2 Geospatial Metadata

Geospatial metadata is data about geospatial data. Metadata can have different meanings depending on the context, but it is generally considered to be data about data. Metadata about collections of resources and resource types can be stored in catalogs or registries. If a catalog/registry holds metadata records about many different resources/resource types, these resources/resource types can be found and used based on their metadata. ISO 19115 and 19115-2 are metadata standards for OGC.

5.3.8.2.3 Names

Names are identifiers. There are many different naming schemes in use today. The best known are WWW addresses and other familiar URLs, geographic names, etc. Names are only meaningful if the context in which they are valid (the namespace) is known. When a data item is stored in a repository (accessible by a data service), it can be given a name that is valid within the repository, and if the repository itself has a name, then the two names together help find the original item. Names can refer to data or operators (and by extension to metadata, relationships, other names, application services, catalog/registry services, data services, and client services).

5.3.8.2.4 Relationship

Links between any two information elements form relationships. These can be simple links such as WWW hyperlinks, or complex, n-way relationships among many elements. Relationships tend to link named elements. Sometimes they can point into named elements at a finer level of granularity. OGC refers to geospatial-oriented relationships as “geolinks.”


5.3.8.2.5 Containers

A container is an encoded, transportable form of a collection of data or content on the Web. Containers have well-known namespaces, schemas, and protocols. OGC has developed two related specifications, Location Organization Folders (LOF) and XML for Imagery and Map Annotations (XIMA), both of which are based on Geography Markup Language (GML).

5.4 High-Throughput Computing Infrastructure

There are roughly four types of high-throughput computing infrastructure. First, a traditional supercomputer has many processors connected by a local high-speed computer bus. The supercomputer is normally in a centralized location, expensive, and parallelized at the CPU level. Second, a cluster is a group of linked computers that work together so closely that in many respects they form a single computer. Within the cluster, computers are connected by a fast Local Area Network. A cluster is also normally in a centralized location, but it is cheaper than a single supercomputer. Third, distributed computing infrastructure can support high-throughput computing. Its component computing units are connected through the Internet. Grid computing is one typical distributed computing infrastructure. Fourth, modern cloud computing infrastructure can support high-throughput computing.

5.4.1 Super Computing

A supercomputer is formed by putting multiple processors in one big system. Figure 5.8 shows Titan, one of the most powerful supercomputers in the world. Titan is a Cray system built on Opteron 6274 16-core 2.2 GHz processors. The processors are boosted by NVIDIA GPUs; together, the system’s 561,000 cores deliver 17.6 petaflops of performance. Figure 5.9 shows another supercomputer, Tianhe-2. It has a performance of 33.9 petaflops and runs on a mixture of Intel Xeon E5 processors, custom processors, and Intel Xeon Phi coprocessors. Behind this performance are 3,120,000 cores.

5.4.2 Cluster Computer

Computer clusters can be formed using different component computers. Figure 5.10 shows a cluster formed by connecting multiple tower servers.


Fig. 5.8 No. 2 Supercomputer in the world: Titan at ORNL

Fig. 5.9 No. 1 Supercomputer in the world: Tianhe-2

Fig. 5.10 Cluster computer

5.4.3 Grid Computing

Grid computing is an example of a distributed computing system. It is the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with noninteractive workloads that involve a large number of files. Grid computing differs from conventional high-performance computing systems such as cluster computing in several respects. Each node of a grid can be set to perform a different task or application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers. A cluster computer system could be a node of a grid system. Grid computing combines computers from multiple administrative domains to reach a common goal or solve a single task, and the grid may then disappear just as quickly. Grid computing is the infrastructure to run applications, whereas SOA is a way to build applications. Figure 5.11 shows one overall architecture for grid computing.

Fig. 5.11 Grid computing

Grids also differ from traditional supercomputers and clusters in that they are a form of distributed computing, whereby a “super virtual computer” is composed of many networked, loosely coupled computers acting together to perform large tasks. For certain applications, “distributed” or “grid” computing can be seen as a special type of parallel computing that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a computer network (private or public) by a conventional network interface, such as Ethernet. This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus.

5.4.4 Cloud Computing

Cloud computing, also known as “on-demand computing,” is a kind of Internet-based computing in which shared resources, data, and information are provided to computers and other devices on demand. It enables ubiquitous, on-demand access to a shared pool of configurable computing resources. Cloud computing and storage solutions provide users and enterprises with various capabilities to store and process their data in third-party data centers. Cloud computing relies on sharing of resources to achieve coherence and economies of scale, similar to a utility (like the electricity grid) over a network. At the foundation of cloud computing is the broader concept of converged infrastructure and shared services.

5.4.4.1 What Does the Cloud Provide?

Cloud computing enables ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort. Cloud computing may allow companies to avoid upfront infrastructure costs and to focus on projects that differentiate their businesses instead of on infrastructure. It allows enterprises to get their applications up and running faster, with improved manageability and less maintenance. It enables IT to adjust resources more rapidly to meet fluctuating and unpredictable business demands. Cloud providers typically use a “pay as you go” model. This can lead to unexpectedly high charges if administrators do not adapt to the cloud pricing model. Figure 5.12 shows the operation environment of cloud computing.

5.4.4.2 What Makes the Cloud Possible?

The following technologies are the driving forces for the cloud:

• availability of high-capacity networks,
• low-cost computers and storage devices,
• the widespread adoption of hardware virtualization,
• service-oriented architecture,
• autonomic and utility computing.
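The “pay as you go” pricing model can be made concrete with a break-even calculation. The sketch below compares a one-time hardware purchase plus upkeep with hourly leasing. All prices, the five-year horizon, and the function names are hypothetical illustration values, not figures from any provider.

```python
def on_premises_cost(capital_cost, yearly_upkeep, years):
    """Total cost of buying hardware once and maintaining it."""
    return capital_cost + yearly_upkeep * years


def cloud_cost(hourly_rate, hours_used_per_year, years):
    """Total cost of leasing equivalent capacity by the hour."""
    return hourly_rate * hours_used_per_year * years


def break_even_hours_per_year(capital_cost, yearly_upkeep, hourly_rate, years):
    """Yearly usage at which both options cost the same over `years`."""
    return on_premises_cost(capital_cost, yearly_upkeep, years) / (hourly_rate * years)


# Hypothetical numbers: 50,000 purchase, 5,000 per year upkeep, 2.0 per leased hour.
owned = on_premises_cost(50_000, 5_000, years=5)
occasional = cloud_cost(2.0, hours_used_per_year=1_000, years=5)
always_on = cloud_cost(2.0, hours_used_per_year=8_760, years=5)
threshold = break_even_hours_per_year(50_000, 5_000, 2.0, years=5)
```

Under these assumed numbers, occasional use is far cheaper in the cloud, while a workload that runs around the clock exceeds the cost of owning the hardware. This is the kind of unexpectedly high charge that arises when usage patterns are not adapted to the pricing model.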


Fig. 5.12 Cloud computing

5.4.4.3 Characteristics of Cloud Computing

Agility: The agility of cloud computing deployment improves with users’ ability to re-provision technological infrastructure resources.

Cost reductions: A public-cloud delivery model converts capital expenditure to operational expenditure. This purportedly lowers barriers to entry, as infrastructure is typically provided by a third party and does not need to be purchased for one-time or infrequent intensive computing tasks. Pricing on a utility computing basis is fine-grained, with usage-based options, and fewer in-house IT skills are required for implementation. The European e-FISCAL project’s repository contains several articles looking into cost aspects in more detail, most of them concluding that cost savings depend on the type of activities supported and the type of infrastructure available in-house.

Device and location independence: Users can access systems using a web browser regardless of their location or what device they use (e.g., PC and mobile phone). As infrastructure is off-site (typically provided by a third party) and accessed via the Internet, users can connect from anywhere.

Maintenance: Cloud computing applications are easier to maintain because they do not need to be installed on each user’s computer and can be accessed from different places.

Multi-tenancy: Resources and costs are shared across a large pool of users, which allows for centralization of infrastructure in locations with lower costs (real estate, electricity, etc.), peak-load capacity increases (users need not engineer for the highest possible load levels), and utilization and efficiency improvements for systems that are often only 10–20% utilized.

Performance: Performance is monitored. Consistent and loosely coupled architectures are constructed using Web services as the system interface.

Productivity: Productivity may be increased when multiple users can work on the same data simultaneously, rather than waiting for it to be saved and emailed. Time may be saved because information does not need to be reentered when fields are matched, nor do users need to install application software upgrades on their computers.

Reliability: Reliability improves with the use of multiple redundant sites, which makes well-designed cloud computing suitable for business continuity and disaster recovery.

Scalability and elasticity: Resources are provisioned dynamically (“on demand”) on a fine-grained, self-service basis in near real-time, without users having to engineer for peak loads. This gives the ability to scale up when usage increases or down when resources are not being used.

Security: Security can improve due to centralization of data, increased security-focused resources, etc. Security is often as good as or better than that of traditional systems, in part because providers can devote resources to solving security issues that many customers cannot afford to tackle. However, concerns can persist about loss of control over certain sensitive data and the lack of security for stored kernels.

5.4.4.4 Comparing Cloud Computing with Grid Computing

Table 5.1 compares grid computing with cloud computing on different aspects, including definition, users, providers, costs, and maintenance.

5.4.4.5 Software Platforms for Distributed Processing of Big Data in Cloud Computing

5.4.4.5.1 MapReduce with Hadoop

MapReduce is a parallelized processing technique and a programming model for distributed computing. It contains two major tasks: Map converts a set of data into a parallel-processing-ready dataset, normally in tuples (key/value pairs), and Reduce combines the outputs of Map to form a reduced dataset. Hadoop is the software framework that clusters computers and enables the processing of big data with MapReduce. Many applications have used Hadoop/MapReduce to process remote sensing big data (Rathore et al. 2015; Kishore Raju 2017; Zou et al. 2018).
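The Map and Reduce tasks can be sketched in plain Python, without Hadoop, to show the flow of key/value pairs. This is a single-process illustration under stated assumptions: the input records, the tile identifiers, and the choice of computing a maximum vegetation index (NDVI) per tile are hypothetical, and a real Hadoop job would distribute the map, shuffle, and reduce phases across a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: convert one input record into (key, value) tuples."""
    tile_id, red, nir = record
    ndvi = (nir - red) / (nir + red)  # normalized difference vegetation index
    yield tile_id, ndvi


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, max(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical pixel records: (tile id, red reflectance, near-infrared reflectance).
pixels = [("tile-a", 0.1, 0.5), ("tile-a", 0.2, 0.6), ("tile-b", 0.3, 0.3)]
result = run_job(pixels)
```

Because each record is mapped independently and each key is reduced independently, both phases can run in parallel on different machines. That independence is what the framework exploits.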

Table 5.1 Comparing cloud computing and grid computing

Aspect | Grids | Clouds
What? | Grids enable access to shared computing power and storage capacity from a client. | Clouds enable access to leased computing power and storage capacity from a client.
Who provides the service? | Research institutes and universities federate the services around the world. | Large individual companies, for example, Amazon.
Who uses the service? | Research collaborations. “Virtual organization” comprising researchers located around the world. | Small to medium commercial businesses. Researchers with generic IT needs. Government.
Who pays for the service? | Governments: providers and users are usually publicly funded research organizations. | Cloud providers pay for the computing resources. Users pay to lease them.
Where are computing resources located? | In computing centers distributed across different sites, countries, and continents. | In the cloud provider’s private data centers, which are often centralized.
Why use? | Free from buying or maintaining computer hardware. Deal with a difficult problem that cannot be handled with personal computers. Sharing data with a distributed team. | Free from buying or maintaining computer hardware. Deal with a difficult problem that cannot be handled with personal computers. Access extra resources when needed.
What are they useful for? | Grids were designed to handle large sets of limited-duration jobs that produce or use huge quantities of data. | Clouds best support long-term services and longer-running jobs.
How do they work? | Grids are normally open source technology. Resource users and providers may contribute to the management of grids. | Clouds are normally proprietary technology. Clouds are managed by different companies with different technologies.

5.4.4.5.2 Spark

Spark is a cluster-computing framework that achieves high performance by pooling multiple individual instances. It can be used together with cloud computing to create manageable resources that adapt to the computing demands and resource requirements of remote sensing big data. Spark has been used in processing remote sensing big data (Sun et al. 2015, 2019; Chebbi et al. 2018).

5.4.4.5.3 SCALE

SCALE (NGA 2020) is an open-source system that enables on-demand, near-real-time, automated processing of large datasets (satellite, medical, audio, etc.) using a dynamic bank of algorithms. It manages Docker instances and can seamlessly distribute algorithm execution across thousands of CPU cores. Integration with SEED (Meyer et al. 2019), a general standard to aid in the discovery and consumption of a discrete unit of work contained within a Docker image, allows algorithms to be found and used dynamically in Docker-managed resources.

5.4.4.5.4 Other Platforms

Specialized platforms for working with remote sensing big data have been developed and studied over the years. ScienceEarth is a Spark-based, cloud-based cluster-computing environment that is specially designed and developed for processing remote sensing big data (Xu et al. 2020). It has three main components: ScienceGeoData stores and manages remote sensing data, ScienceGeoIndex provides a spatial index, and ScienceGeoSpark is the execution engine for cluster computing in the cloud. Spark Sensing is a cloud-based framework to process remote sensing big data (Lan et al. 2018).

References

Booth D, Liu CK (2007) Web Services Description Language (WSDL) Version 2.0 Part 0: Primer. W3C
Chebbi I, Boulila W, Mellouli N et al (2018) A comparison of big remote sensing data processing with Hadoop MapReduce and Spark. In: 2018 4th international conference on advanced technologies for signal and image processing (ATSIP). IEEE, Sousse, pp 1–4
Chinnici R, Haas H, Lewis AA et al (2007a) Web Services Description Language (WSDL) Version 2.0 Part 2: Adjuncts. W3C
Chinnici R, Moreau J-J, Ryman A, Weerawarana S (2007b) Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language. W3C
Christensen E, Curbera F, Meredith G, Weerawarana S (2001) Web Services Description Language (WSDL) 1.1. W3C
ISO (2016) ISO 19119:2016 – geographic information – services, 2nd edn. International Organization for Standardization, Geneva
Kishore Raju KYKG (2017) Realbda: a real time big data analytics for remote sensing data by using Mapreduce paradigm. Zenodo. https://doi.org/10.5281/zenodo.260070
Kumar Y, Kaul S, Sood K (2019) A comprehensive view of different computing techniques - a systematic detailed literature review. SSRN Electron J. https://doi.org/10.2139/ssrn.3382724
Lan H, Zheng X, Torrens PM (2018) Spark sensing: a cloud computing framework to unfold processing efficiencies for large and multiscale remotely sensed data, with examples on Landsat 8 and MODIS data. J Sens 2018:1–12. https://doi.org/10.1155/2018/2075057
Meyer J, Holt M, Tobe J, Smith E (2019) Seed. https://ngageoint.github.io/seed/. Accessed 27 Jul 2020
NGA (2020) Scale. http://ngageoint.github.io/scale/. Accessed 27 Jul 2020
Rathore MMU, Paul A, Ahmad A et al (2015) Real-time big data analytical architecture for remote sensing application. IEEE J Sel Top Appl Earth Obs Remote Sens 8:4610–4621. https://doi.org/10.1109/JSTARS.2015.2424683
Sun J, Zhang Y, Wu Z et al (2019) An efficient and scalable framework for processing remotely sensed big data in cloud computing environments. IEEE Trans Geosci Remote Sens 57:4294–4308. https://doi.org/10.1109/TGRS.2018.2890513
Sun Z, Chen F, Chi M, Zhu Y (2015) A Spark-based big data platform for massive remote sensing data processing. In: Zhang C, Huang W, Shi Y et al (eds) Data Science. Springer International Publishing, Cham, pp 120–126
Xu C, Du X, Yan Z, Fan X (2020) ScienceEarth: a big data platform for remote sensing data processing. Remote Sens 12:607. https://doi.org/10.3390/rs12040607
Zou Q, Li G, Yu W (2018) MapReduce functions to remote sensing distributed data processing-global vegetation drought monitoring as example. Softw Pract Exp 48:1352–1367. https://doi.org/10.1002/spe.2578

Chapter 6

Remote Sensing Big Data Management

Abstract Remote sensing big data management covers governance, curation, organization, administration, and dissemination, the last of which comprises data discovery (at both collection and granule levels) and access. This chapter first discusses different aspects of remote sensing big data governance, which provides the blueprint for standards and policy. Second, remote sensing data curation covers metadata, archiving, cataloging, quality assessment, and provenance/lineage. Finally, dissemination services for remote sensing big data are discussed briefly. Data discovery covers catalog services and federation at both the collection level and the granule level. Data access covers different types of data, including features, images, maps, and sensor observations.

Keywords Remote sensing · Big data management · Data governance · Version control · Data discovery · Data access · Metadata · Curation · Organization · Administration · Dissemination · Data collection · Data granule

Big data management covers governance, curation, organization, administration, and dissemination of big data, where dissemination comprises data discovery (at both collection and granule levels) and access (Siddiqa et al. 2016). “The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications” (Huang 2015). For remote sensing big data, the goal is to ensure a high level of data quality and accessibility for scientific research and applications. Remote sensing big data management deals primarily with three major tasks: (1) governance (organizational structure/communication, data policy, strategy, and standards), (2) administration/curation (organization, archiving, cataloging, quality assessment, provenance/lineage, updating, and version control), and (3) dissemination services (data discovery and access).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/ Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_6


6.1 Remote Sensing Big Data Governance

Data governance is “a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods” (Thomas 2014). It ensures the quality, availability, usability, integrity, consistency, auditability, and security of data. Big data governance is defined as “part of a broader information governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives of multiple functions” (Soares 2013). Remote sensing big data governance is the framework for establishing the strategy, objectives, and policy for effectively managing remote sensing data.

Why is big data governance required? Remote sensing data governance is important for six reasons: (1) the systems involved are complex, (2) projects lack a view of the overall picture, (3) traditional projects lack a data management focus, (4) data quality issues are hidden and persistent, (5) data is a problem that goes beyond information technology alone, and (6) data is a valuable enterprise asset. The objective of data governance is to ensure that the right people are involved in setting data standards, usage, and integration across projects, subject areas, and lines of enterprise. Data governance enables improved decision making, protects data stakeholders’ needs, adopts common approaches to resolve big data issues, builds standard, repeatable processes, reduces cost and increases efficiency, ensures transparency, and standardizes data definitions, vocabulary, and semantics. In general, data governance makes a large volume of remote sensing data easy to use and valuable to data stakeholders.

Common challenges to big data governance are organizational commitment, data ownership, data sharing policy, intellectual property rights, response to a changing funding and technology environment, and uneven skills among the personnel at different organizations. Despite these challenges, big data governance is gaining more and more attention in information governance. Big data governance brings increased productivity, improved compliance, increased data auditability and transparency, reduced overall data management cost across organizations, improved user satisfaction, improved confidence and trust in data, simplified data management across organizations, and enhanced interorganizational collaboration on big data.

Big data governance is an extension of existing data governance. Several big data governance frameworks have been proposed in recent years. A comprehensive review and list of 12 big data frameworks can be found in Al-Badi et al. (2018). The components or domains may differ depending on the definitions and foci. The major domains of big data governance are data principles, data quality, metadata, data access, and data life cycle (Khatri and Brown 2010; Alhassan et al. 2016). The domains can be expanded to include organization, metadata, privacy, data quality, business process integration, master data integration, and information life cycle management (Soares 2013). Data governance may have a different focus, such as (1) policy, standards, and strategy, (2) data quality, (3) privacy/compliance/security, (4) architecture/integration, (5) data warehouses and business intelligence (BI), or (6) management support (Thomas 2014). With such considerations, the components of big data governance may include (1) mission and vision; (2) goals, governance metrics/success measures, and funding strategies; (3) data rules and definitions; (4) decision rights; (5) accountabilities; (6) controls; (7) data stakeholders; (8) data governance office; (9) data stewards; and (10) proactive, reactive, and ongoing data governance processes (Thomas 2014). Figure 6.1 shows one data governance framework as formed by the First San Francisco Partners (O’Neal 2012a, 2013). Five of its components mainly need to be adopted and extended to meet the requirements of big data governance: strategy, organization, policies, measurements, and technology. These are also applicable to remote sensing big data governance (O’Neal 2012a).

Fig. 6.1 Components of remote sensing big data governance

6.1.1 Strategy

It is important to align data governance with the overall strategy of the enterprise (Malik 2013). The concept and types of big data need to be incorporated into the existing data governance vision and mission statements, objectives, and guiding principles (O’Neal 2012a). The enterprise implementation plan should include a big data governance strategy. These revisions and adaptations to existing data governance are applicable when remote sensing big data is considered.


6.1.2 Organizational Structure/Communications

People and their organizations are essential parts of governance. The operational models and interactions need to be extended to include dedicated organizations and functions for big data management. The organization needs to include big data stakeholders, that is, business steward leads, data stewards, a steering committee, and a science committee. Big data governance should also be reflected in the charter, roles, responsibilities, ownership, and accountability (O’Neal 2012a).

6.1.3 Data Policy

The policy should be extended to include policies, processes, and standards (PPS) for big data. Big data privacy, security, risk, retention, archiving, regulatory compliance, and data classification requirements fall into the scope of big data PPS (O’Neal 2012a).

6.1.4 Measurements

Data governance needs to identify and define metrics for quality and performance. These data quality metrics and key performance indicators should be extended to cover big data completeness, timeliness, accuracy, consistency, etc. (O’Neal 2012a).
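Two of these metrics, completeness and timeliness, can be computed directly from metadata records. The sketch below is only an illustration: the list of required fields, the record layout, and the definitions used (share of populated required fields, and mean delay between acquisition and publication) are assumptions rather than definitions from a governance standard.

```python
from datetime import datetime

# Hypothetical set of fields every granule record is expected to carry.
REQUIRED_FIELDS = ("sensor", "acquired", "published", "bbox")


def completeness(records):
    """Share of required metadata fields that are populated (0.0 to 1.0)."""
    filled = sum(1 for record in records
                 for name in REQUIRED_FIELDS if record.get(name) is not None)
    return filled / (len(records) * len(REQUIRED_FIELDS))


def mean_latency_hours(records):
    """Timeliness: mean delay between acquisition and publication, in hours."""
    delays = [(r["published"] - r["acquired"]).total_seconds() / 3600
              for r in records if r.get("published") and r.get("acquired")]
    return sum(delays) / len(delays)


records = [
    {"sensor": "A", "acquired": datetime(2020, 1, 1, 0), "published": datetime(2020, 1, 1, 6),
     "bbox": (-10.0, 40.0, -9.0, 41.0)},
    {"sensor": "B", "acquired": datetime(2020, 1, 2, 0), "published": datetime(2020, 1, 2, 18),
     "bbox": None},  # missing spatial extent lowers completeness
]

completeness_score = completeness(records)
latency = mean_latency_hours(records)
```

Tracked over time, such indicators give governance a measurable basis for judging whether data policies are being met.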

6.1.5 Technology

Data governance often defines the technologies that underlie data management. Big data technology should be included in big data governance to properly manage big data. Examples are NoSQL distributed processing engines, distributed file systems, advanced analytics and modeling tools, information life cycle management (ILM) tools, and geospatial high-performance cloud-based clustering platforms (e.g., GeoMesa and GeoSpark) (O’Neal 2012a; Yu et al. 2015; Hughes et al. 2015). The interaction between data governance and technology is illustrated in Fig. 6.2 (O’Neal 2012b). Technology provides the means to track progress and gives feedback to data governance overall, while data governance provides guidance and creates and enforces policies for using technology. Upon implementation and adoption of technology, remote sensing big data governance provides overall guidance and policies on standardized methods, data definitions, personnel roles and responsibilities, decision arbiters and escalation, and data statistics/analysis/monitoring. In turn, remote sensing big data technology tracks progress and provides feedback to remote sensing big data governance, with means for data discovery and profiling, data cleaning, duplicate detection, data maintenance and management, data performance measurement and monitoring, and data sharing (O’Neal 2012b).

Fig. 6.2 Interaction between governance and technology

6.2 Remote Sensing Big Data Curation

Data curation is the active and ongoing management of data through its life cycle of interest and usefulness to scholarship, science, and education. Data curation has a long history that goes back to when digital data were first created. Remote sensing big data curation has tasks similar to those of traditional data curation, but it also has specific requirements related to the five Vs of big data, plus the domain-specific characteristics of remote sensing. It curates data characterized by Volume (large volumes of data), Variety (data from multiple sources, organizations, and sensors), Velocity (high speed of information generation and delivery to meet data requirements in case of emergency), Veracity (data must be trustworthy, and their quality and lineage/provenance information must be traceable), and Value (information derived from data is valuable in problem solving). Remote sensing big data administration and curation deal with data organization, data archiving, data cataloging, data quality, data usability, and data version control.

6.2.1 Remote Sensing Big Data Organization

Remote sensing data are very diverse. If no proper approach is adopted for organizing the data, they quickly become unmanageable as their volume grows. Three aspects need to be considered carefully in organizing remote sensing big data: data format, metadata, and projection.


6.2.1.1 Data Format

Remote sensing big data comes in different formats, which require specialized handling. Encoding big remote sensing data in a few formats can significantly reduce the cost and simplify the management task. A format suitable for handling big remote sensing data has to be flexible, extensible, and source/sensor independent. Widely used data formats currently include Hierarchical Data Format (HDF) (Mahammad et al. 2002; Koranne 2011), Hierarchical Data Format - Earth Observing System (HDF-EOS) (Demcsak 1997; McGrath and Yang 2002; Andersson 2005), netCDF (Rew and Davis 1990; Rew et al. 1997; Lee et al. 2008; Eaton et al. 2014), GeoTIFF (Ritter and Ruth 1997; Ritter et al. 2000; Mahammad and Ramakrishnan 2003), JPEG2000 (Marcellin et al. 2000; Skodras et al. 2001; Rabbani 2002), and GML-Cov (Baumann et al. 2017; Hirschorn 2017, 2019). The analytics of remote sensing big data needs a data format that is capable of managing data in large volumes.

6.2.1.2 Metadata

Metadata is the description of a resource or data. It used to be defined as “data about data.” Metadata is the core of any data management activity. Data discovery and access need metadata. A sufficient metadata description is important for managing big remote sensing data. Remote sensing big data governance should support the publication of well-defined metadata to enable the integration of resources and data discovery and access.

6.2.1.3 Map Projection

Geospatial data have to be referenced to a spatial coordinate system, which is typically defined by a map projection (Bugayevskiy and Snyder 2013). Many map projections are available, each with properties that suit specific applications. Data from multiple sources have to be co-registered on a single map projection before they can be integrated. A common set of map projection definitions with defined projection parameters has to be used across the entire community to facilitate interoperability with remote sensing big data. Reprojection services should be defined for the conversion of data from one projection to another. A standard for describing map projections should be selected. Commonly available spatial reference definition systems include the United States Geological Survey (USGS) PROJ 4 format (Urbanek 2011), European Petroleum Survey Group (EPSG) codes (Surveying and Committee 2005; Nicolai and Simensen 2008), and OGC WKT (Lott 2018).
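As a minimal sketch of what a reprojection operation computes, the following converts geographic coordinates (longitude and latitude in degrees) to the spherical Mercator projection commonly identified as EPSG:3857, and back. The choice of this projection and the function names are assumptions made for illustration. An operational reprojection service would rely on an established projection library and support many coordinate reference systems.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Forward projection: degrees to projected meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def web_mercator_to_lonlat(x, y):
    """Inverse projection: projected meters to degrees."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon_deg, lat_deg


# Hypothetical point; a round trip should return the original coordinates.
x, y = lonlat_to_web_mercator(-77.3, 38.8)
lon, lat = web_mercator_to_lonlat(x, y)
```

Datasets can be co-registered only when both sides agree on such definitions, including the radius constant, which is why a shared, standardized description of each projection matters.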


6.2.2 Remote Sensing Big Data Archiving

Data archiving means putting data in a secure place for later use (Whitlock et al. 2010). Much scientific research requires time-series data to be kept for a long time in order to diagnose long-term trends and changes. For example, the Landsat Continuity program has kept and accumulated data for more than 50 years (Wulder et al. 2019). To maintain long-term archives of remote sensing data, data managers face the following challenges (Bleakly 2002):

1. Storage media deterioration: Storage media deteriorate over time. Digital data have to be periodically refreshed and backed up on new media (Bleakly 2002).
2. Technology advances: Technology obsolescence may require archives to move to newer technology so that they remain sustainable, with reasonable cost-effectiveness and technical support (Bleakly 2002).
3. What should be kept in the archive? Data archiving is not just about storing data. The necessary description of the data, or metadata, needs to be kept as well. Information about retrieval, processing, and analytics needs to be stored along with the data. This information may include what software is needed to read and process the data and what hardware that software requires.
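One common way to detect media deterioration between refresh cycles is a fixity check: record a checksum for every archived file and compare it again later. The sketch below is one possible approach under stated assumptions, not a method prescribed by the sources cited above. The function names and the choice of SHA-256 are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so that large granules do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir):
    """Map each file path (relative to the archive root) to its checksum."""
    root = Path(archive_dir)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_damaged(archive_dir, manifest):
    """Return files whose current checksum differs from the stored one or that are missing."""
    current = build_manifest(archive_dir)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

The manifest itself is metadata that would be stored alongside the data, and files reported by `find_damaged` would be restored from a backup copy.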

6.2.3 Remote Sensing Big Data Cataloging

Cataloging is one of the essential data curation activities to support the discovery of remote sensing data (Fayyad and Smyth 1996). It facilitates big data discovery, access, and utilization (Xie et al. 2010). Catalogs contain metadata that describes the data, and the core goal of a catalog is to maintain a quality set of metadata that sufficiently describes the data. There are several challenges (Duerr et al. 2004). The first is to reduce the laborious effort required of data producers to generate discoverable metadata. Automatic generation of metadata to populate the data catalog is an issue for big data, where manual processing is not feasible (Campbell et al. 1988). Another is how to keep metadata up to date. Keeping the metadata in the catalog current is tedious and labor-intensive, which becomes a serious issue when dealing with big data.
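As a small illustration of automatic metadata generation, the sketch below derives catalog fields from a product identifier instead of asking a data producer to type them. It assumes an identifier layout in the style of Landsat Collection product identifiers (sensor, satellite, processing level, WRS path/row, acquisition date, processing date, collection number, category). The layout, field names, and function name are assumptions for illustration and should be checked against the data provider's documentation.

```python
import re
from datetime import datetime

# Assumed layout: LXSS_LLLL_PPPRRR_YYYYMMDD_yyyymmdd_CC_TX
PRODUCT_ID = re.compile(
    r"^L(?P<sensor>[A-Z])(?P<satellite>\d{2})_(?P<level>[A-Z0-9]{4})_"
    r"(?P<path>\d{3})(?P<row>\d{3})_(?P<acquired>\d{8})_(?P<processed>\d{8})_"
    r"(?P<collection>\d{2})_(?P<category>[A-Z0-9]{2})$"
)


def _iso_date(yyyymmdd):
    """Convert a compact date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()


def metadata_from_product_id(product_id):
    """Build a catalog record from the identifier alone."""
    match = PRODUCT_ID.match(product_id)
    if match is None:
        raise ValueError(f"unrecognized product identifier: {product_id}")
    fields = match.groupdict()
    return {
        "identifier": product_id,
        "platform": f"Landsat {int(fields['satellite'])}",
        "sensor_code": fields["sensor"],
        "processing_level": fields["level"],
        "wrs_path": int(fields["path"]),
        "wrs_row": int(fields["row"]),
        "acquisition_date": _iso_date(fields["acquired"]),
        "processing_date": _iso_date(fields["processed"]),
        "collection": fields["collection"],
        "category": fields["category"],
    }
```

Richer metadata, such as spatial extent or cloud cover, would have to be read from the file contents, but the same principle of harvesting rather than typing applies.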

6.2.4 Remote Sensing Big Data Quality Assessment

Quality assessment is a data curation activity that measures the quality of data. Quality is relative to a specific use. Data quality is an important aspect of remote sensing big data analysis. Data quality measures include accuracy, completeness, consistency, validity, uniqueness, and timeliness. How to measure the quality of a large volume of data is a major issue for remote sensing big data.
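Two of the measures listed above, completeness and validity, can be computed with very little code. The sketch below is a simplified illustration under assumed definitions: completeness is the fraction of pixels that are not the no-data value, and validity is the fraction of the remaining pixels that fall within an expected physical range. The function name and both definitions are assumptions, not a standard.

```python
def quality_summary(pixels, nodata, valid_range):
    """Compute simple completeness and validity ratios for a flat list of pixel values."""
    low, high = valid_range
    total = len(pixels)
    present = [value for value in pixels if value != nodata]
    in_range = [value for value in present if low <= value <= high]
    return {
        # Share of pixels that carry a measurement at all.
        "completeness": len(present) / total if total else 0.0,
        # Share of measured pixels that are physically plausible.
        "validity": len(in_range) / len(present) if present else 0.0,
    }
```

For big data the same ratios would be accumulated tile by tile and combined, rather than computed over one in-memory list.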


6.2.5 Remote Sensing Big Data Usability

Documenting the usability of data from the users’ point of view is very important for improving their experience with the data. Usability of scientific big data is currently tracked in two ways. One is mining scientific papers to find usages of specific datasets. The other is through social media, by letting users comment on the usage as well as the quality of specific datasets.

6.2.6 Remote Sensing Big Data Version Control

Keeping the metadata and data current is an issue, particularly when multiple versions of data are distributed across organizations. Currently, unique identifiers in the form of digital object identifiers (DOIs) are used to track the versions of data and metadata. This requires data producers to issue a DOI for each digital object.

6.3 Remote Sensing Big Data Dissemination Services

Traditional data access services have data managers in the middle. Data access is offline, and the media are primarily tapes or disks. The process is slow and, once labor is considered, expensive. With the rapid increase of Internet communication speed, online web-based data access services have become popular. The File Transfer Protocol (FTP) used to be one of the most popular batch data access protocols, but it does not support access with spatial or spatiotemporal subsetting. Recently, Web service protocols have become popular for disseminating big geospatial data, including remote sensing data. They cover both data discovery and data access. Data discovery can be done at two levels: collection level and instance level.

6.3.1 Data Discovery

The purpose of data discovery is to let users discover where the needed data are located and how to access them. Users include both human users (through a Web client) and machine users. Machine users may be Web services and/or Web clients. Many Earth Observing (EO) data agencies have catalog services. For example, the NASA Common Metadata Repository (CMR) manages metadata from the different Distributed Active Archive Centers (DAACs). FedEO is a gateway for discovering data available at ESA (European Space Agency). EO-CAT (Earth Observation Catalogue), replacing EOLI-SA (Earth Observation Catalogue and Ordering Services), is the Earth Observation catalog tool for searching ESA mission collections. There are many catalogs in which to look for specific data. If users do not know where the needed data are located, the data are difficult to find. To enable search across all catalog services, two approaches may be taken:

• Centralized: A centralized catalog is developed to harvest all catalogs. The advantage of the centralized approach is that it gives maximum control over the data and allows performance tuning on the harvested metadata. The drawbacks are that (1) the centralized catalog can grow too big to be manageable and (2) its metadata may fall out of synchronization with data providers’ specialized catalogs. It may become impractical in many cases for remote sensing big data.
• Distributed: This is a federated approach. The search request is forwarded to each participating catalog, and the responses are integrated and sent back to users. This approach avoids the manageability and out-of-synchronization problems of the centralized one. Performance depends on the network and on each catalog.

The CSISS Catalog Service Federation (CSF) is one of the first operational federated catalogs implemented for Earth Observations. The CSISS CSF Service explores the idea of federating different catalog services into one logical catalog service with OGC CSW interfaces to serve end-user needs. The participating catalog services may have different interfaces, protocols, or even different metadata schemas and definitions. A CSF can expose the catalog services to which it connects at the backend in three ways: opaque, translucent, and transparent. CSISS CSF supports both opaque and transparent. The main work performed by the CSF gateway includes metadata mapping, protocol translation, and results merging. Figure 6.3 shows the context diagram for the CSISS CSF.

Fig. 6.3 Context diagram for CSISS Catalog Service Federation
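The distributed approach and the three gateway tasks named above (metadata mapping, protocol translation, and results merging) can be sketched in a few lines. The example below is a toy illustration with in-memory catalogs, not the CSISS CSF implementation. The class `MemberCatalog`, its field names, and the de-duplication by link are all hypothetical choices made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


class MemberCatalog:
    """Hypothetical stand-in for a remote catalog that uses its own field names."""

    def __init__(self, name, records, title_field, link_field):
        self.name = name
        self.records = records
        self.title_field = title_field
        self.link_field = link_field

    def search(self, keyword):
        # A real gateway would translate the request into the member's protocol here.
        needle = keyword.lower()
        return [r for r in self.records if needle in r[self.title_field].lower()]

    def to_common(self, record):
        # Metadata mapping: native field names are mapped onto one shared schema.
        return {
            "title": record[self.title_field],
            "link": record[self.link_field],
            "source": self.name,
        }


def federated_search(catalogs, keyword):
    """Forward one query to every member catalog and merge the mapped results."""

    def query(catalog):
        return [catalog.to_common(r) for r in catalog.search(keyword)]

    with ThreadPoolExecutor(max_workers=max(1, len(catalogs))) as pool:
        per_catalog = list(pool.map(query, catalogs))  # order follows `catalogs`

    merged, seen = [], set()
    for hits in per_catalog:
        for hit in hits:
            if hit["link"] not in seen:  # results merging: drop duplicates by link
                seen.add(hit["link"])
                merged.append(hit)
    return merged
```

Because every member is queried in parallel, the response time is governed by the slowest catalog, which matches the observation that performance depends on the network and on each catalog.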


The CSISS CSF provides a single standard access point to multiple catalog services. It binds data discovery with data access: what you find is what you can get, with immediate and customized data access and services through OGC WCS interfaces. It provides federated search for geospatial datasets and services among three geospatial catalog services: CSISS CSW, NASA ECHO, and DOE LLNL. The CSISS CSW hosts a data catalog of 17 TB of data in the repository, including global Landsat data. It has a service instance catalog, a service type catalog, and a data type catalog. The NASA ECHO supports efficient discovery of and access to Earth science data. It is a metadata clearinghouse and order broker being built by NASA’s Earth Science Data and Information System (ESDIS). The DOE LLNL hosts a simulation dataset that covers a global area. The CSISS CSF also has services to support service chaining. The service chaining enables the catalog to

• Find data instances by giving search criteria.
• Find service instances by giving search criteria.
• Find data instances by giving data type.
• Find service instances by giving data type.
• Find data-service association.

6.3.2 Data Access

Once the needed data are found, the next step is to obtain them, either for Web service consumption or for downloading to the local machine. Data access services may differ depending on the data sources and access methods. Data types include raster-based Earth Observation data, vector-based geographical features, tabular attributes, streaming sensor observations, and modeling data. Access restrictions are another factor to be considered in data access, such as protected downloading, direct access, and ordering services.
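For raster data served through an OGC WCS interface, access with spatial subsetting can be expressed as a key-value GetCoverage request. The sketch below builds such a request URL in the WCS 2.0 style. The endpoint, the coverage identifier, and the axis labels `Lat` and `Long` are assumptions for illustration, since the actual axis labels depend on the coverage's coordinate reference system and should be read from the server's DescribeCoverage response.

```python
from urllib.parse import urlencode


def wcs_getcoverage_url(endpoint, coverage_id, lat_range, lon_range, fmt="image/tiff"):
    """Build a WCS 2.0-style GetCoverage URL that requests only a bounding box."""
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        # One subset parameter per axis; the axis labels are assumed here.
        ("subset", f"Lat({lat_range[0]},{lat_range[1]})"),
        ("subset", f"Long({lon_range[0]},{lon_range[1]})"),
        ("format", fmt),
    ]
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical endpoint and coverage name, used only to show the call.
example_url = wcs_getcoverage_url(
    "https://example.org/wcs", "demo_coverage", (30, 40), (-100, -90)
)
```

Only the requested window is transferred, which is the capability that batch file transfer lacks.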

References Al-Badi A, Tarhini A, Khan AI (2018) Exploring big data governance frameworks. Proc Comput Sci 141:271–277. https://doi.org/10.1016/j.procs.2018.10.181 Alhassan I, Sammon D, Daly M (2016) Data governance activities: an analysis of the literature. J Decis Syst 25:64–75. https://doi.org/10.1080/12460125.2016.1187397 Andersson K (2005) Software for processing HDF-EOS data Baumann P, Hirschorn PE, Masó J (2017) OGC coverage implementation schema, version 1.1 Bleakly DR (2002) Long-term spatial data preservation and archiving: what are the issues. Sand Rep SAND 2002:107 Bugayevskiy LM, Snyder J (2013) Map projections: a reference manual. CRC Press


Campbell WJ, Roelofs L, Goldberg M (1988) Automated cataloging and characterization of space- derived data. Telematics Inform 5:279–288 Demcsak MF (1997) HDF-EOS library users guide for the ECS project volume 2: function reference guide. Document 170-TP-006-003, Hughes Applied Information Systems Upper Marlboro, Maryland Duerr RE, Parsons MA, Marquis M et al (2004) Challenges in long-term data stewardship. In: MSST, pp 47–67 Eaton B, Gregory J, Drach B et al (2014) NetCDF Climate and Forecast (CF) metadata conventions. CF Conventions Fayyad U, Smyth P (1996) From massive data sets to science catalogs: applications and challenges. In: Proceedings of a workshop on massive data sets Hirschorn E (2017) OGC coverage implementation Schema-ReferenceableGridCoverage Extension Hirschorn E (2019) OGC coverage implementation Schema-ReferenceableGridCoverage Extension with Corrigendum. Schema-ReferenceableGridCoverage extension version 1.0. 1 Huang Q (2015) Innovative testing and measurement solutions for smart grid. IEEE, Wiley, Singapore Hughes JN, Annex A, Eichelberger CN et al (2015) GeoMesa: a distributed architecture for spatiotemporal fusion. In: Pellechia MF, Palaniappan K, Doucette PJ et al (eds) Geospatial informatics, fusion, and motion video analytics V. Baltimore, p 94730F Khatri V, Brown CV (2010) Designing data governance. Commun ACM 53:148–152. https://doi. org/10.1145/1629175.1629210 Koranne S (2011) Hierarchical data format 5 : HDF5. In: Handbook of open source tools. Springer US, Boston, pp 191–200 Lee C, Yang M, Aydt R (2008) NetCDF-4 performance report. Juni Lott R (2018) Geographic information — well-known text representation of coordinate reference systems, version 2.0.6. Open Geospatial Consortium Inc., Wayland Mahammad SS, Ramakrishnan R (2003) GeoTIFF-A standard image file format for GIS applications. Map India:28–31 Mahammad SS, Dhar D, Ramakrishnan R (2002) HDF-A suitable scientific data format for satellite data products. 
In: XXII INCA international congress Malik P (2013) Governing big data: principles and practices. IBM J Res Dev 57:1:1–1:13. https:// doi.org/10.1147/JRD.2013.2241359 Marcellin MW, Gormish MJ, Bilgin A, Boliek MP (2000) An overview of JPEG-2000. In: Proceedings DCC 2000. Data Compression Conference. IEEE, pp 523–541 McGrath RE, Yang M (2002) Conversion of from HDF4 to HDF5:‘Hybrid’HDFEOS files. Earth Obs 14:19–23 Nicolai R, Simensen G (2008) The new EPSG geodetic parameter registry. In: 70th EAGE conference and exhibition incorporating SPE EUROPEC 2008. European Association of Geoscientists & Engineers, p cp-40-00115 O’Neal K (2012a) Big data: governance is the critical starting point O’Neal K (2012b) The first step in master data Management sustainable data governance tutorial, Toronto O’Neal K (2013) Top 5 artifacts every data governance program must have. Little Rock Rabbani M (2002) JPEG2000: image compression fundamentals, standards and practice. J Electron Imaging 11:286 Rew R, Davis G (1990) NetCDF: an interface for scientific data access. IEEE Comput Graph Appl 10:76–82 Rew R, Davis G, Emmerson S et al (1997) NetCDF user’s guide. Unidata Program Cent, June 1:997 Ritter N, Ruth M (1997) The GeoTiff data interchange standard for raster geographic images. Int J Remote Sens 18:1637–1647 Ritter N, Ruth M, Grissom BB et al (2000) Geotiff format specification geotiff revision 1.0. SPOT Image Corp 1


Siddiqa A, Hashem IAT, Yaqoob I et al (2016) A survey of big data management: taxonomy and state-of-the-art. J Netw Comput Appl 71:151–166. https://doi.org/10.1016/j.jnca.2016.04.008 Skodras A, Christopoulos C, Ebrahimi T (2001) The jpeg 2000 still image compression standard. IEEE Signal Process Mag 18:36–58 Soares S (2013) IBM InfoSphere: a platform for Big Data governance and process data governance. Mc Press Surveying OGP, Committee P (2005) EPSG geodetic parameter dataset. http://www.epsg.org/. [Online 2006-12-08] Thomas G (2014) The DGI data governance framework, vol 20. Data Governance Institute, Orlando Urbanek S (2011) proj4: a simple Interface to the PROJ. 4 cartographic projections library (R package version 1.0-4) Whitlock MC, McPeek MA, Rausher MD et al (2010) Data archiving. Am Nat 175:145–146. https://doi.org/10.1086/650340 Wulder MA, Loveland TR, Roy DP et al (2019) Current status of Landsat program, science, and applications. Remote Sens Environ 225:127–147. https://doi.org/10.1016/j.rse.2019.02.015 Xie B-C, Chen L, Zhao L, Li S-S (2010) Mechanism of cataloging and retrieval over distributed massive remote sensing image. Comput Eng 20 Yu J, Wu J, Sarwat M (2015) GeoSpark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems – GIS '15. ACM Press, Bellevue, Washington, pp 1–4

Chapter 7

Standards for Big Data Management

Abstract This chapter reviews major standards that are fit for remote sensing big data. Two groups of standards are reviewed: metadata standards and data standards. The International Organization for Standardization (ISO) standards are discussed as the major standard efforts in enabling the storing, managing, and transporting of remote sensing data. The Open Geospatial Consortium (OGC) and other major community specifications are reviewed for data discovery and access. Data formats include HDF (Hierarchical Data Format) and HDF-EOS (Hierarchical Data Format—Earth Observing System), netCDF (Network Common Data Form), GeoTIFF (Geographic Tagged Image File Format), JPEG2000 (Joint Photographic Experts Group format 2000), OGC GMLCov (Geography Markup Language Coverage), and NITF (National Imagery Transmission Format). Discovery services include OGC CSW (Catalog Service for the Web), OpenSearch, and the Z39.50 Geo-profile. Data access services include OGC Web Coverage Service (WCS), Web Map Service (WMS), Web Feature Service (WFS), Sensor Observation Service (SOS), and Open-source Project for a Network Data Access Protocol (OPeNDAP).

Keywords Standard · ISO · OGC · Web service · Metadata · Catalog · OpenSearch · Data format · Data access

Standards for big data management cover archiving, metadata, data format, data discovery, and data access. Among these, the standard for metadata is the key for big data management and is the focus of this chapter.

7.1 Standards for Remote Sensing Data Archiving

ISO has encouraged the development of standards in support of the long-term preservation of digital information obtained from observations of the terrestrial and space environments. ISO requested that the Consultative Committee for Space Data Systems (CCSDS) Panel 2 coordinate the development of those standards. CCSDS has subsequently reorganized and the work is now situated in the Data Archive Ingest (DAI) Working Group.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_7


The initial effort was the development of a Reference Model for an Open Archival Information System (OAIS) (Dryden 2009; Lee 2010). The OAIS Reference Model has been reviewed and is pending some editorial updates. It has been approved as an ISO Standard and as a CCSDS Recommendation (CCSDS 2019). The development history of this effort can be seen by surveying the many past US, French, British, and international workshops. With the growing acceptance of the OAIS Reference Model, attention is turning to identifying and starting additional archival standardization efforts. This is reflected in the Digital Archive Directions (DADs) Workshop and the Archival Workshop on Ingest, Identification, and Certification Standards (AWIICS). ISO 19165 (Geographic information—Preservation of digital data and metadata) is an ISO TC 211 standard project started in 2015. The first part (fundamentals) was published in 2018 (ISO/TC 211 2003).

7.2 Standards for Remote Sensing Big Data Metadata

7.2.1 What Is Metadata?

There are many definitions of metadata. It was defined as “Data About Data” in ISO 19115:2003 (ISO/TC 211 2003), which defines both Label and Catalogue. The definition was revised to “Information about a resource” in the new version of ISO 19115, the ISO 19115-1 (ISO/TC 211 2014). Basic functions that metadata should support include identification (Label, Catalogue), access (Catalogue), history (Label, Catalogue), uses (Label, Catalogue), and instructions (Label).

The development of metadata standards began in the 1990s. Two FGDC metadata standards have been developed. The first is the FGDC Content Standard for Digital Geospatial Metadata, which is the base metadata standard. The first version was released in 1994. A revision was released in 1998 (Federal Geographic Data Committee 2002). It has two profiles: one for biological data (Federal Geographic Data Committee 1999) and another for shoreline data (Federal Geographic Data Committee 2001). The second is the FGDC Content Standard for Digital Geospatial Metadata, an extension for remote sensing metadata (Di et al. 2000). It was released in 2002 (Federal Geographic Data Committee 2002).

ISO 19115:2003 (ISO/TC 211 2003) is based on the FGDC base standard with extensions and modifications. ISO 19115-2:2009 extends ISO 19115:2003 and is based on the FGDC remote sensing standard. ISO 19115-1:2014, released in 2014, is a revision of ISO 19115:2003 that is not backward compatible with it. NAP is the North American Profile of ISO 19115:2003. ISO 19139 is the XML implementation of ISO 19115:2003. ISO 19139-2 is the XML implementation of ISO 19115-2:2009. ISO 19115-3 is the XML implementation of ISO 19115-1:2014.
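To give a feel for what an XML implementation of the metadata model looks like, the sketch below writes a tiny ISO 19139-style fragment with Python's standard library. It uses the `gmd` and `gco` namespaces and only two elements (`fileIdentifier` and `dateStamp`). The record is deliberately incomplete and would not validate against the schema, which requires further mandatory content such as contact and identification information. The function name and the sample values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"


def minimal_iso19139(file_identifier, date_stamp):
    """Serialize an intentionally incomplete ISO 19139-style metadata fragment."""
    ET.register_namespace("gmd", GMD)
    ET.register_namespace("gco", GCO)
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    # Each property wraps a typed value element from the gco namespace.
    identifier = ET.SubElement(root, f"{{{GMD}}}fileIdentifier")
    ET.SubElement(identifier, f"{{{GCO}}}CharacterString").text = file_identifier
    stamp = ET.SubElement(root, f"{{{GMD}}}dateStamp")
    ET.SubElement(stamp, f"{{{GCO}}}Date").text = date_stamp
    return ET.tostring(root, encoding="unicode")
```

The pattern of a property element wrapping a typed value element repeats throughout the encoding, which is why such records are verbose but regular enough to generate automatically.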


7.2.2 The FGDC Content Standard for Digital Geospatial Metadata

Under Executive Order 12906 (1994) (Clinton 1994), the content standard for digital geospatial metadata was developed as part of the National Spatial Data Infrastructure. The order calls for documenting all new geospatial data using an FGDC standard and fosters the development of FGDC standards. The standard was released in June 1998 (Federal Geographic Data Committee 1998). The FGDC remote sensing extensions draw heavily on the EOSDIS Core System (ECS) Core Metadata (Di et al. 2000). Some ECS metadata are not supported by the FGDC base standard, which therefore had to be extended to support remote sensing data. These are data hierarchies, algorithm/processing, georeferencing, mission and platform, and instrument.

The standardization path of remote sensing metadata started with an FGDC metadata standard, which then evolved into ISO standards. ISO 19115 is based on FGDC standards and was revised through the international process. Work on the remote sensing extensions started while ISO 19115 was in development. Elements from ISO 19115 are adapted to support remote sensing big data.

The FGDC base standard is “FGDC Standard for Digital Geospatial Metadata” (Federal Geographic Data Committee 1998). The remote sensing extensions are released in “Content Standard for Digital Geospatial Metadata: Extensions for Remote Sensing Metadata” (Federal Geographic Data Committee 2002).

The element production rule is as follows, where parentheses mark optional elements, m{...}n marks an element repeated between m and n times, and [a|b] marks alternatives:

Element = Mandatory + (Optional) + 1{Multiple}n + 0{One mandatory-if-applicable}1 + 0{Many mandatory-if-applicable}n + (1{Optional and Multiple}n) + [Alternative|Other Alternative]

The top level of FGDC metadata is as follows:

Metadata = Identification_Information +
0{Data_Quality_Information}1 +
0{Spatial_Data_Organization_Information}1 +
0{Spatial_Reference_Information}1 +
0{Entity_and_Attribute_Information}1 +
0{Distribution_Information}n +
Metadata_Reference_Information
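The top-level production rule can be read as a table of minimum and maximum occurrence counts, which makes it straightforward to check mechanically. The sketch below encodes the rule that way and reports violations. It treats mandatory-if-applicable sections simply as 0 to 1 (or 0 to many) occurrences, which is a simplification, and the record layout (a dictionary from section name to a list of occurrences) is an assumption made for the example.

```python
# (minimum, maximum) occurrences per top-level section; None means unbounded.
TOP_LEVEL_RULES = {
    "Identification_Information": (1, 1),
    "Data_Quality_Information": (0, 1),
    "Spatial_Data_Organization_Information": (0, 1),
    "Spatial_Reference_Information": (0, 1),
    "Entity_and_Attribute_Information": (0, 1),
    "Distribution_Information": (0, None),
    "Metadata_Reference_Information": (1, 1),
}


def check_cardinality(record, rules=TOP_LEVEL_RULES):
    """Return a list of human-readable cardinality problems (empty if none)."""
    problems = []
    for section, (low, high) in rules.items():
        count = len(record.get(section, []))
        if count < low:
            problems.append(f"{section}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"{section}: expected at most {high}, found {count}")
    for section in record:
        if section not in rules:
            problems.append(f"{section}: not a top-level section")
    return problems
```

The same table-driven check could be applied recursively to the lower-level rules listed in the remainder of this section.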


The mandatory identification information includes the following:

• Citation: of the data set
• Abstract and purpose
• Time: measurement or production
• State of progress: interim or final
• Spatial extent: geographic
• Subject, time, place
• Limitations on use and access

The optional identification information is

(Point of Contact)
(1{Browse Graphic}n)
(Identification of contributors)
(Security Information)
(Native Data Set Environment): software, OS, size, filename
(1{Citations for related data sets}n)

The data quality information has

0{Attribute Accuracy}1
Logical Consistency Report
Completeness Report
0{Positional Accuracy}1
Lineage
(Cloud Cover)

The Lineage is further expanded with

• 0{Source Information}n
• 1{Process Step}n
  –– Process Description
  –– 0{Input data set citation}n
  –– Process Date
  –– (Process Time)
  –– 0{Intermediate data set citation}n
  –– (Process Contact)

The Spatial Data Organization Information has

• Indirect Spatial Reference: identifiers
• Direct Spatial Reference Method
• Point and Vector Object Information
• Raster Object Information
  –– Raster Object Type (Row Count + Column Count + 0{Vertical Count}1)

The Spatial Reference Information has

• Horizontal Coordinate System
  [Geographic: latlon | 1{Planar: projected}n | Local: not associated with Earth]
  0{Earth shape parameters}1
• Vertical Coordinate System

The Planar Coordinate System is defined with

[Map_Projection: name, description | Grid_Coordinate_System: based on a projection | Local_Planar] + Planar_Coordinate_Information

• rectangular, distance and bearing
• resolution, units

The Entity and Attribute Information defines details about the information content of the data set, including the entity types, their attributes, and the domains from which attribute values may be assigned. The rule is as follows:

[1{Detailed_Description}n | 1{Overview_Description}n | 1{Detailed_Description}n + 1{Overview_Description}n]

The Distribution Information is defined as follows:

Distributor
0{Identifier of data set}1 +
Distribution Liability +
0{Order Process}n +
(Technical_prerequisites) +
(When available)

The Order Process is defined as follows:

[Non-Digital Form | Digital Form]
Format
(Transfer Size)
[Offline option | Online option]
• Access (URL)
• Options for earlier technology


The Metadata Reference Information defines

• Date of creation, last and next review
• Metadata Contact
• Standard used
• Access, use, security constraints
• Extension information

The Citation Information (domain) is defined as follows:

1{Data developer}n
Publication Date (time), 0{place, publisher}1
Title, 0{Edition}1
Format (kind of hardcopy, audio, video)
0{Series Information}1 + 0{Publication Information}1 + 0{URL}1
0{Larger Work Citation}1
etc

The Time Period Information (domain) is defined as follows:

[Single Date/Time | Multiple Dates/Times | Range of Dates/Times]

The Contact Information (domain) is defined as follows:

Contact person/organization (or both)
(Contact Position) +
1{Contact_Address}n +
1{Contact_Telephone: voice, TTY}n +
(1{fax}n) +
(1{Electronic mail}n) +
(Hours of Service) +
(etc)

The FGDC metadata standard is oriented toward maps. It considers in situ measurement and older technology.

7.2.3 The FGDC Remote Sensing Metadata Extensions

The remote sensing extensions are based on the FGDC Content Standard for Digital Geospatial Metadata (Di et al. 2000). Their production rules and data dictionaries take a form similar to those in the base standard.


The production rule for the Remote Sensing Extensions is as follows:

Identification_Information
0{Data_Quality_Information}1
0{Spatial_Data_Organization_Information}1
0{Spatial_Reference_Information}1
0{Entity_and_Attribute_Information}1
0{Platform_and_Mission_Information}1
0{Instrument_Information}n

The Identification Information is defined as follows:

Dataset Identifier (unique)
Description
Documentation: 1{Data Dictionary}n, 1{User’s Guide}n, 1{Science Paper}n
Spatial Domain: frame area, overlap, reference
0{Processing Level}n
0{Measuring system IDs}n
0{Aggregation info}n: data hierarchies

The Identification defines Mission, platform, instrument (including names, (short names), and identifiers), Band (Number and Individual IDs), and Multiple thematic layers.

The Data Quality Information defines the following:

Lineage
• Algorithm Information
• Processing Information
(Image Quality): qualitative
(Acquisition Information)
• sun and satellite position

The Algorithm and Processing define

Text or reference
Description
History
(1{Algorithm peer review}n)
Processing environment
Processing data sets
• Input
• Ancillary

The Spatial Data Organization Information defines

Cell Value Type
Dimension Description
Number of Dimensions n
1{Name of Dimension + Length along dimension}n

The Spatial Referencing Information is defined as follows:

0{Vertical coordinate system}1
• Units other than length (pressure, layers)
[Georectified Raster | Georeferenceable Raster]

The Georectified (EOS Grid) is defined with the following:

Map location of one grid point
Map dimensions of single grid cell
Location of a grid point in pixel
Storage order
Distances corresponding to pixel sides

The Georeferenceable (EOS Swath) is defined as follows:

{Georeferencing Description}n
[Ground Control Point Information | Referencing polynomial | Position and attitude information | Text description]
(1{Aerotriangulation Reference}n)
0{Swath track ground properties}1

The Entity and Attribute Information defines Content with scaling functions (including Polynomial or ratio and Non-polynomial) and Usage. Usage defines detector quantity to physical quantity and a scaled value stored for convenience.

The Platform and Mission Information is defined with

(Mission Information)
  (Description)
  (History)
1{Platform Information}n
  Start Date
  (Description)
  (Platform Orbit)
  (Flight Protocol)

The Orbit is defined as follows:

Ephemeris (the date at which orbit valid)
[Keplerian Orbit | Nominal Geostationary position]

The Flight Path is defined as follows:

Height
(Availability of GPS and/or INS)

The Instrument Information is defined as follows:

Type (imager, sounder, etc.)
0{What it’s doing}1 (scanning, calibration..)
How it collects (frame, whiskbroom, laser...)
Sensor Orientation
Information for individual instruments
References in addition to or instead of all or some of information

The Frame Camera is defined as follows:

(Frame Hardware)
Optical system
Geometry of the image
(Operation): mount and motion
(Radiometric Calibration)
Spectral Properties

The Scan is defined as follows:

Process in each direction
Cross-track
Elevation
Profile
1{Pixel description}n, units
Radiometric calibration
Spectral properties

7.2.4 ISO 19115 Geographic Information—Metadata

The ISO TC 211 is the technical committee for geographic information/geomatics in the International Organization for Standardization (ISO). It grew out of European standardization, was established around 1995, and run out of Norway. The standards governed by TC 211 form a suite whose numbers start with 191 (written 191**), and the metadata standard is part of that suite. ISO 19115 is the metadata standard for geographic information and the base for the remote sensing data extension. It started from the FGDC metadata standard and has been modified by the international review process. The standard adapted some terms from the Remote Sensing extensions. Development of ISO 19115 started in 1998, and the standard was published in 2003.

ISO 19115 is defined using the Unified Modeling Language (UML). The data dictionary contains information in UML, term definitions, and short names. It prefers code lists to free text. It supports multiple languages and character sets. The standard is organized around classes and their hierarchy. A class may have attributes and operations. A subclass inherits the properties of its parent class and may add others. An aggregated class consists of components of another class. Figure 7.1 shows the core UML of ISO 19115 metadata.

Fig. 7.1 ISO 19115 Core Class UML

The core class MD_Metadata is a subclass of Metadata_Reference_Information. It adds scope and hierarchy level, language, and character set. It is used in the aggregated classes of Maintenance and Constraints.

The Identification is defined as class MD_Identification. The subclass MD_DataIdentification adds topic code list, spatial representation (grid, vector), and resolution (scale, sample). Service information is added. It is used in the constraint aggregate class.

The data quality is defined in class DQ_DataQuality. It has a scope attribute (geographic extent and applicability: level of aggregation, features, attributes, software, services, and models) and an element component. The element component, DQ_Element, contains test name, criterion identifier/description, test method type/description, citation for procedure, and result. The data quality tests cover completeness, consistency (value domain, format, and time sequence), and accuracy (spatial position, time position, and sequence). The result class, DQ_Result, has a conformance subclass (requirement, the meaning of conformance, and status: pass or fail) and a quantitative subclass (type and units of value, statistical method, and value). The Lineage class has a source component (Extent, Reference System, Citation, and Scale) and a Process Step component (Description and explicit Rationale).

The class MD_Maintenance has frequency code, next update, regular schedule, and scope code and description.

The class MD_SpatialRepresentation has a Vector subclass (curves, polygons, and dimension/description) and a Raster subclass (described in the extended FGDC standard). The remote sensing extension includes spatial reference information that has a Georectified subclass (defined by four corners) and a Georeferenceable subclass (orientation parameters and control point availability).

The reference system defines the reference system name, namespace, domain of validity, and geodetic model. It does not specify the parameters of projections, and it does not handle vertical dimensions.

The class MD_ContentInformation has an MD_CoverageDescription subclass (records describing attributes and physical measurement, image, thematic) and an MD_FeatureCatalogueDescription subclass (data set, citation, and list of feature types from catalog). The coverage description has a subclass for range dimension description (name/dimension description, band properties, and bits per value) and one for image description (calibration availability, distortion availability, image conditions, and processing level).

The class MD_Distribution defines the access method, primarily online access. It provides digital transfer options, distributor, and distribution format (media type).
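The relationship between a quality element and its two kinds of result, conformance and quantitative, may be easier to see in code than in prose. The sketch below mirrors that structure with plain Python data classes. The class and attribute names are simplified stand-ins chosen for illustration and are not the identifiers used in the standard's UML or XML encoding.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class QuantitativeResult:
    """A measured value with its type and unit, e.g. an RMSE in metres."""
    value_type: str
    value_unit: str
    value: float
    statistical_method: Optional[str] = None


@dataclass
class ConformanceResult:
    """Pass/fail status against a stated requirement."""
    requirement: str
    explanation: str
    passed: bool


@dataclass
class QualityElement:
    """One quality test together with the results it produced."""
    test_name: str
    method_description: str
    results: List[Union[QuantitativeResult, ConformanceResult]] = field(default_factory=list)


def conformance_from(quantitative, threshold, requirement):
    """Derive a conformance result by comparing a measured value with a threshold."""
    passed = quantitative.value <= threshold
    explanation = (
        f"{quantitative.value_type} of {quantitative.value} {quantitative.value_unit} "
        f"compared with threshold {threshold} {quantitative.value_unit}"
    )
    return ConformanceResult(requirement, explanation, passed)
```

A positional accuracy test could, for example, store both the measured RMSE and the pass/fail verdict derived from it in the same `QualityElement`.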
The standard supports extension with the class MD_MetadataExtensionInformation. Information needed in new extensions includes Names, Definition, Data type, Permitted value, Multiplicity, Place in the data structure, and Source of the element. Portrayal catalog reference and application schema are defined by the classes MD_PortrayalCatalogueReference and MD_ApplicationSchemaInformation, respectively. The class EX_Extent data type is a subclass to Extent that defines both spatial and temporal extent. The CI_Citation and Responsible Party data types include identifiers, ISBN, presentation form (digital or hardcopy), and contact information.

7.2.4.1 ISO 19115-2

The ISO 19115-2 is an extension for imagery and remote sensing metadata. It was extended from ISO 19115:2003 and developed based on the FGDC Remote Sensing Metadata extension. It was published in 2009 as ISO 19115-2:2009.


The ISO 19115-2 adds an additional class, MI_AcquisitionInformation, for adding information on the acquisition of remote sensing data. On data quality, it extends the result to support coverage and extended processing classes in Lineage. Quality tests are extended to include completeness (logical, commission, and omission), consistency (logical, conceptual, domain, and format), accuracy (positional, absolute external position, and gridded data position), and usability. Georectified and georeferenceable classes are extended to support remote sensing data. Ground control point quality is defined using data quality classes (lineage and data quality element). The content information class is extended to provide band properties and image descriptions. Acquisition information includes platform, plan, operation, requirement, environment record, objective, and instrument. Classes are added to describe details of acquisition information.

7.2.4.2 ISO 19115-1

ISO 19115-1 is a revision of ISO 19115 (ISO 19115:2003). ISO standards are reviewed every 5 years (every 3 years for TS) to determine whether ISO should keep the standard as is for another 5 years, revise it, or withdraw it. This process is called a systematic review. The systematic review decided to revise ISO 19115:2003, which resulted in ISO 19115-1. The revised standard was published in 2014 as ISO 19115-1:2014: Geographic Information - Metadata - Part 1: Fundamentals. In ISO 19115-1, the concept of “Core metadata” was removed. Metadata for services were added, derived from ISO 19119:2005 and ISO 19119:2005/Amd 1:2008. Data quality was moved to ISO 19157. Annex F was added to describe metadata for the discovery of service and nonservice resources. Many extra code lists were added. The use of “Short name” and “Domain code” was dropped for metadata elements and codes, respectively. As a result of the revision, ISO 19115-1 is not fully backward compatible with ISO 19115:2003.
The ISO 19115-2 still links to ISO 19115:2003. ISO 19115, ISO 19115-1, and ISO 19115-2 have been adopted as the US national and federal metadata standards. They are the most important metadata standards for remote sensing big data.

7.2.5 ISO Standards for Data Quality

Some standards support geospatial data quality. ISO 19115 and ISO 19158 define the type of qualities that should be measured. The series of ISO 19115 and related ISO 19157 specifies the standards for data quality assessment (ISO 2009, 2013, 2014, 2016a, b, c). The ISO 19158 specifies quality assurance of data supply (ISO 2012).

7.3 Standards for Remote Sensing Big Data Format

The adoption of a standard data format should be considered to reduce the complexity of remote sensing big data management. The commonly used standard remote sensing data formats include Hierarchical Data Format (HDF) and HDF-EOS (Hierarchical Data Format - Earth Observing System), netCDF (Network Common Data Form), GeoTIFF, JPEG2000, OGC GMLCov (OGC GML Application Schema - Coverages), and NITF (the National Imagery Transmission Format).

HDF is a collection of open-source software for storing and managing data and metadata in a hierarchical structure. HDF was initially developed at the National Center for Supercomputing Applications (NCSA) as a portable scientific data format. It has since evolved and been adopted by Earth science communities. NASA has adopted it as the base format to store, manage, and distribute data of the Earth Observing System (EOS) projects (Ullman 1999). HDF-EOS is the version adapted for remote sensing data and is the data format standard for the Earth Observing System Data and Information System (EOSDIS) data production, archive, and distribution (Ullman 1999). HDF-EOS regulates the structure of storing remotely sensed data in three major types: point, swath, and grid. It is recognized that the mapping between ISO 19115 and HDF-EOS can be well achieved (Wei et al. 2007; Di and McDonald 2006). The remote sensing extension of the FGDC digital data standard has been influenced by the standard of EOSDIS, while HDF-EOS is the standard data format adopted within EOSDIS.

There are two versions of HDF, HDF4 and HDF5, and correspondingly two versions of HDF-EOS. HDF4 limits a single file to 2 GB. HDF5 and HDF-EOS5 support larger files: they use 64-bit offsets, which put the practical limit at about 16 exabytes and place them in a good position to manage remote sensing big data.
The netCDF is based on HDF and profiled for the climate community for data creation, access, and sharing (Rew and Davis 1990; Rew et al. 2009, 2006; Eaton et al. 2003). There are conventions to keep the structure standardized in netCDF, primarily the Climate and Forecast (CF) Metadata Conventions (Eaton et al. 2003). The current version of CF is 1.9, which standardizes encoding and structuring for storing metadata in netCDF, especially coordinate types, coordinate systems, data representative of cells, and discrete sampling geometries (Eaton et al. 2003, 2020; Hassell et al. 2017). The new netCDF library also supports 64-bit offsets, which eliminates the 4 GB file size limit of traditional netCDF.

The GeoTIFF is a public domain data format for storing grid data with metadata (Ritter et al. 2000; Mahammad and Ramakrishnan 2003). It is based on TIFF (Tagged Image File Format), a format for storing raster graphics. The maximum size of a single TIFF file is 4 GB. GeoTIFF can be implemented on BigTIFF, a variant of the TIFF format that uses 64-bit offsets. A GeoTIFF based on BigTIFF can store up to 18,000 petabytes in a single file, which makes it suitable for managing remote sensing big data (Pennefather and Suhanic 2009; Gumelar et al. 2020).
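The file size limits quoted above can be checked with simple arithmetic. The sketch below assumes that each limit follows directly from the width of the byte offset (a signed 32-bit offset for the 2 GB case, an unsigned 32-bit offset for the 4 GB case, and an unsigned 64-bit offset for HDF5 and BigTIFF); the variable names are illustrative only.

```python
GIB = 1024 ** 3   # binary gigabyte (gibibyte)
EIB = 1024 ** 6   # binary exabyte (exbibyte)
PB = 10 ** 15     # decimal petabyte


def addressable_bytes(offset_bits: int) -> int:
    """Number of byte positions reachable with an offset of the given width."""
    return 2 ** offset_bits


# Assumption: a signed 32-bit offset leaves 31 usable bits.
signed32_limit = addressable_bytes(31)
unsigned32_limit = addressable_bytes(32)
offset64_limit = addressable_bytes(64)

print(f"signed 32-bit offset:   {signed32_limit / GIB:.0f} GiB")
print(f"unsigned 32-bit offset: {unsigned32_limit / GIB:.0f} GiB")
print(f"64-bit offset:          {offset64_limit / EIB:.0f} EiB "
      f"(about {offset64_limit / PB:,.0f} decimal petabytes)")
```

Under these assumptions, the same 64-bit limit appears as about 16 exabytes when counted in binary units and as roughly 18,000 petabytes when counted in decimal units, which is why the two figures in the text describe the same bound.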


JPEG 2000 is a wavelet-based image compression format developed by the Joint Photographic Experts Group (JPEG) (Christopoulos et al. 2000; Marcellin et al. 2000; Rabbani 2002; Schelkens et al. 2009). It provides improved compression and multiresolution streaming capabilities. The corresponding ISO standards are the ISO/IEC 15444 series, including ISO/IEC 15444-1 for core coding system (ISO/IEC JTC 1/SC 29 2019a), ISO/IEC 15444-2 for extensions (ISO/IEC JTC 1/SC 29 2004), ISO/IEC 15444-3 for motion JPEG 2000 (ISO/IEC JTC 1/SC 29 2007a), ISO/IEC 15444-4 for conformance testing (ISO/IEC JTC 1/SC 29 2004), ISO/IEC 15444-5 for reference software (ISO/IEC JTC 1/SC 29 2015), ISO/IEC 15444-6 for compound image file format (ISO/IEC JTC 1/SC 29 2013a), ISO/IEC 15444-8 for security (ISO/IEC JTC 1/SC 29 2007b), ISO/IEC 15444-9 for interactivity (ISO/IEC JTC 1/SC 29 2005), ISO/IEC 15444-10 for 3-D data extension (ISO/IEC JTC 1/SC 29 2011), ISO/IEC 15444-11 for wireless (ISO/IEC JTC 1/SC 29 2007c), ISO/IEC 15444-13 for an entry-level JPEG encoder (ISO/IEC JTC 1/SC 29 2008), ISO/IEC 15444-14 for XML representation (ISO/IEC JTC 1/SC 29 2013b), ISO/IEC 15444-15 for high-throughput encoding (ISO/IEC JTC 1/SC 29 2019b), and ISO/IEC 15444-16 for ISO image format (ISO/IEC JTC 1/SC 29 2019c). The high compression rate and multiscale pyramids find their applications in storing and serving remote sensing data in the Web environment (González-Conejero et al. 2009). Geospatial data can be encoded and efficiently compressed in JPEG 2000 format. OGC has released a standard for encoding GML (Geography Markup Language) in JPEG 2000 (Colaiacomo et al. 2017).

The OGC GML Application Schema - Coverages (GMLCOV) is a standard for representing coverages on a referenceable grid (Hirschorn 2017). The implementation specification specifies how to represent grids and coverages in GML (Baumann et al. 2017). The format provides a mechanism for exchanging remotely sensed data in XML (Campalani et al. 2013; Baumann et al. 2016). It is possible to exchange remote sensing big data with proper encapsulating compression approaches.

The National Imagery Transmission Format Standard (NITFS) is a suite of standards for encoding imagery in the intelligence community. NITF is used for the exchange, storage, and transmission of digital imagery and image-related products and metadata (Whitson 1996; NITFS Technical Board 2006). It has been used in encoding and distributing remote sensing data (Di et al. 2002; Dial et al. 2003).

7.4 Standards for Remote Sensing Big Data Discovery

Common cataloging and discovery mechanisms for remote sensing big data include widely adopted OGC standard catalog interfaces, community OpenSearch, and the Z39.50 Geo-profile.

The OGC Catalogue Service for Web (CSW) is widely adopted as the catalog service to manage geospatial metadata on the Web (Nebert 2007). It is an interface standard that specifies what operations are available and how clients interoperate with the catalog. It has several profiles to support different encodings and bindings (Houbie and Bigagli 2010; Gasperi 2010). The core is based on Dublin Core metadata (Martell and Parr-Pearson 2009; Nebert 2007; Martell 2009a, b). ISO 19115 is one of the most popular profiles in encoding and modeling metadata (Voges and Senkler 2007; Voges et al. 2013). It has been extensively used in cataloging remote sensing data. The latest version adds support for OpenSearch (Bigagli et al. 2016; Nebert et al. 2016a, b).

The OpenSearch specifications are community-driven open standards (Clinton 2019). They are highly flexible and adaptable with extensions. There are spatial and temporal extensions that allow the discovery of data with spatiotemporal constraints (Gonçalves and Voges 2014). The Committee on Earth Observation Satellites (CEOS) has developed a best practice for implementing and enabling search services (CEOS 2017). It specifies searchable terms and response encodings for managing metadata for remote sensing big data.

The Z39.50 Geo-profile is an Application Profile for Geospatial Metadata (or Geo-profile) of a library catalog developed by the Federal Geographic Data Committee (FGDC) (Nebert 1999). It was included as a profile for OGC CSW (Neal et al. 2006).

Popular protocols for remote sensing data discovery are mainly of two types: OGC CSW (Dublin Core or ISO 19115 profile) and OpenSearch (OGC extension).

7.4.1 OGC Catalog Service for Web (CSW)

The OGC specification has a core part and extension parts (profiles). The core part defines the fundamental structure and client-server interactions. The extension parts define the specific information models and binding mechanisms. Major ones include ebRIM (based on Dublin core), ISO 19115, and Z39.50 (formerly popular). The Catalog service document number is OGC 07-006r1, and the version is 2.0.2. CSW is a protocol binding defined in 07-006r1 (others include Z39.50 and CORBA). CSW uses HTTP as the distributed computing platform (DCP). The basic interaction model is request/response. Asynchronous requests are also supported. The CSW API is patterned after the Web Feature Service (WFS).

CSW supports the Filter Encoding Specification for specifying query predicates (with extensions). The primary query language is the Common Query Language (CQL), which is similar to the SQL WHERE clause. CSW can be implemented with profiles that may define other languages (e.g., XPath). OGC CSW defines a standard API for creating, updating, deleting, and querying catalog records, which can also be implemented on top of existing servers (e.g., Z39.50). Service requests may be encoded in XML or as keyword-value pairs. The specification supports the HTTP methods POST and GET and describes how to use SOAP (basically message literal). A specific information model is not defined (i.e., the specification is agnostic). It is expected that profiles will be defined to support specific catalog information models. Current application profiles include FGDC, ISO 19119/ISO 19115, and ebRIM.
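As an illustration of the keyword-value pair encoding over HTTP GET, the following minimal sketch assembles a CSW 2.0.2 GetRecords request with a CQL text constraint. The endpoint URL and the search term are hypothetical, and whether a given server accepts this particular CQL predicate on `dc:title` is an assumption; only the request is built, nothing is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical catalog endpoint, used only for illustration.
CSW_ENDPOINT = "https://catalog.example.org/csw"


def build_getrecords_url(endpoint: str, cql_text: str, max_records: int = 10) -> str:
    """Build a CSW 2.0.2 GetRecords request encoded as keyword-value pairs."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",          # core (Dublin Core based) record type
        "elementSetName": "summary",        # brief, summary, or full
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",   # the alternative is FILTER (XML)
        "constraint_language_version": "1.1.0",
        "constraint": cql_text,
        "maxRecords": str(max_records),
    }
    return f"{endpoint}?{urlencode(params)}"


# Assumed predicate: records whose title mentions "Landsat".
url = build_getrecords_url(CSW_ENDPOINT, "dc:title LIKE '%Landsat%'")
query = parse_qs(urlsplit(url).query)
print(url)
```

The same request could be expressed as an XML document sent with POST; the keyword-value form is shown here only because it is the shortest to read.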


The OGC® Catalogue Service Specification defines a set of common queryables and returnables. The intent is to support cross-profile query interoperability. All CSW implementations must support the core queryables/returnables, typically referred to as csw:Record, which follow the Dublin Core metadata specification. The following operations are defined for the CSW:

• GetCapabilities: It provides service metadata.
• DescribeRecord: It allows clients to get a schema description of the catalog’s information model.
• GetDomain: It allows clients to discover the runtime value space for API parameters as well as other elements within the information model.
• GetRecords: It is the primary method for querying the catalog. It also supports distributed queries.
• GetRecordById: This is a convenience request for getting records using their identifiers.
• Transaction: This is the primary method for creating, updating, and deleting catalog records (PUSH).
• Harvest: This operation allows the catalog service to retrieve web-accessible metadata and register it in the catalog. It is analogous to Transaction, but performs a PULL rather than a PUSH. It also supports periodic re-harvesting of the resource.

In CSW version 2.0.2, ebRIM is the preferred catalog information model. ebRIM stands for ebXML (Electronic Business XML) Registry Information Model. The following are reasons for ebRIM to be adopted as the preferred information model.

• The flexibility of ebRIM satisfies the requirements to manage many kinds of artifacts: data set descriptions, service offers (service interface descriptions), coordinate reference systems, Units of Measure, application schemas, feature types, map styles and symbol libraries, access control policies, sensor descriptions, ontological descriptions, digital rights, organizations, and projects.
• The information model with ebRIM allows associating artifacts with one another (associations). Through associations, data and services may be associated with each other. Data discovery may then go beyond the data themselves to the services that offer those data, and vice versa.
• The ebRIM supports classifications of artifacts in many ways (classification schemes). Different classification schemes may be provided and adopted in ebRIM.
• The ebRIM allows collections of artifacts in logical groups (packages).
• The ebRIM provides a common interface for a wide range of metadata management.

The ebRIM was designed to meet different requirements as a catalog in dealing with multiple artifacts, associations, multiple classification schemes, and packages through a uniform interface. OGC adapted the ebRIM to be the basis for a profile of the Catalogue 2.0 CSW specification.


The ebRIM information model can be seen in Fig. 7.2. It is a general information model that is expressed in XML and can be loaded into an ebRIM Registry. It is an ISO standard, ISO 15000-3. It supports the notion of Registry-Repository as shown in Fig. 7.3. The ExtrinsicObject is a representative or proxy for repository items. It is extensible by slots and object type attributes. For example, a CRS dictionary entry can be an ExtrinsicObject extended by a slot. A FeatureType can be an ExtrinsicObject extended by an object type attribute. A Classification Scheme is a basis for the taxonomy of RegistryObjects. It is user-defined and is a collection of Classification Nodes. An Association is a named relationship between RegistryObjects. It is also user-defined. A Package is a logical collection of RegistryObjects (ExtrinsicObjects, Classification Schemes, and Associations). Packages in ebRIM encapsulate all the object types, association types, classification schemes, slots, stored queries, etc. An ebRIM Package is created or loaded using XML and the WRS interfaces (Insert/GetRegistryObject). A catalog can support multiple packages at the same time. Associations and ExtrinsicObjects can be reused across packages. OGC standardization focuses on Extension Packages for Geography. For remote sensing data, there is an EO Products Extension Package for ebRIM (06-131r2). The basic extension package for the CSW ebRIM profile has the following Extrinsic Object Type Definitions:

Fig. 7.2 UML of ebRIM


Fig. 7.3 Notion of Registry-Repository

• ServiceProfile: It defines service features/capabilities (e.g., OGC Capabilities document).
• ServiceModel: It defines service computational characteristics and behavior (e.g., WSDL interface).
• ServiceGrounding: It defines how to access the service (e.g., WSDL service description).
• Dataset: It describes a geographic data set (e.g., ISO 19115/19139).
• Schema: It provides a formal description of a conceptual model (e.g., GML application schema).
• StyleSheet: It defines a set of rules for styling an information resource (e.g., XSLT style sheet).
• Document: It is a document of any kind.
• Annotation: It is commentary that is intended to interpret or explain some resource.
• Image: It provides a symbolic visual resource other than text.
• Rights: It provides information about rights held in and over a resource.

The basic extension package for the CSW ebRIM profile includes the following classification schemes:

• Services Taxonomy (source: ISO 19119, Subclause 8.3)
• Slots (source: DCMI metadata terms)
• Country Codes (ISO 3166-1)
• Geographical Regions (UN Statistics Division)
• Feature Codes (Digital Geographic Information Exchange Standard (DIGEST))

Further classification schemes may be introduced to meet other requirements. For example, the Defence Geospatial Information Working Group (DGIWG) Feature Data Dictionary (DFDD) and the Feature and Attribute Coding Catalogue (FACC) dictionary are for feature, attribute, and data types. Sensors may be classified into different types, such as Passive (Visible/Reflective Infrared, Thermal Infrared, and Microwave) and Active (Synthetic Aperture Radar and LiDAR). Semantics may be introduced in classification. There is a draft recommendation paper at OASIS regarding an “ebXML registry profile for OWL.” The paper is entitled “ebXML Registry Profile for Web Ontology Language (OWL), version 1.5” by Asuman Dogac (http://ebxml.xml.org/node/162). It has been approved as a specification since April 2007. The paper details how to model ontologies within an ebXML registry and defines stored procedures at the database level to perform semantic operations. It defines a ClassifiedAs operator.

The basic extension package for the CSW ebRIM profile includes the following Association Objects:

• OperatesOn: It associates a service offer with a description of the data that the service operates on as input or output (from ISO 19119).
• Presents: It associates a service offer with a description of what the service does.
• Supports: It associates a service offer with how an agent can access the service.
• DescribedBy: It associates a service offer with a description of its computational characteristics and the semantic content of supported requests.
• Annotates: It associates an annotation resource with the registry object that it explains or evaluates.
• RepositoryItemFor: It associates a source ExternalLink that refers to an item in an external repository with a target ExtrinsicObject.
• GraphicOverview: It associates a source Dataset with an Image that illustrates or summarizes the data content.

In a remote sensing data catalog, associations may be used to relate different artifacts. Figure 7.4 shows associations for Earth Observations or remote sensing data.

The basic extension package for the CSW ebRIM profile includes the following Stored Queries:

• findServices: It returns a list of rim:Service elements.
• listExtensionPackages: It returns a list of all deployed extension packages.
• showStoredQueries: It returns a list of available stored query definitions.
• getVersionHistory: It returns the version history of a specified registry object.
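The registry concepts described above (extrinsic objects extended by slots, typed associations, and packages) can be pictured with a small in-memory model. This is a simplified sketch, not the ebRIM schema: the class and method names, the identifiers, and the choice to model a service as a generic object are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExtrinsicObject:
    """Proxy for a repository item, extensible through slots."""
    id: str
    object_type: str                      # e.g., "Dataset", "Image", "Service"
    slots: Dict[str, str] = field(default_factory=dict)


@dataclass
class Association:
    """Named, directed relationship between two registry objects."""
    association_type: str                 # e.g., "OperatesOn", "GraphicOverview"
    source_id: str
    target_id: str


@dataclass
class Package:
    """Logical collection of registry objects and their associations."""
    name: str
    objects: Dict[str, ExtrinsicObject] = field(default_factory=dict)
    associations: List[Association] = field(default_factory=list)

    def add(self, obj: ExtrinsicObject) -> None:
        self.objects[obj.id] = obj

    def associate(self, association_type: str, source_id: str, target_id: str) -> None:
        for object_id in (source_id, target_id):
            if object_id not in self.objects:
                raise KeyError(f"unknown registry object: {object_id}")
        self.associations.append(Association(association_type, source_id, target_id))

    def sources(self, target_id: str, association_type: str) -> List[ExtrinsicObject]:
        """Objects that point at target_id through the given association type."""
        return [self.objects[a.source_id] for a in self.associations
                if a.target_id == target_id and a.association_type == association_type]


# Hypothetical identifiers; the service is modeled as a generic object for brevity.
pkg = Package("EO example")
pkg.add(ExtrinsicObject("urn:example:service:wcs", "Service"))
pkg.add(ExtrinsicObject("urn:example:dataset:scene-001", "Dataset", {"platform": "Landsat 8"}))
pkg.add(ExtrinsicObject("urn:example:image:scene-001-browse", "Image"))
pkg.associate("OperatesOn", "urn:example:service:wcs", "urn:example:dataset:scene-001")
pkg.associate("GraphicOverview", "urn:example:dataset:scene-001",
              "urn:example:image:scene-001-browse")

# Discovery that starts from a data set and arrives at the services offering it.
serving = pkg.sources("urn:example:dataset:scene-001", "OperatesOn")
print([obj.id for obj in serving])
```

Following an association in the reverse direction is what lets a catalog answer "which services operate on this data set", the kind of data-to-service discovery that motivated the choice of ebRIM.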

7.4.2 OpenSearch

OpenSearch has an OGC extension. It has recently been incorporated into the OGC Catalogue Service 3.0 standard. It uses GeoRSS as the base to provide a geographic feature definition and has an extension to define a temporal range. The CEOS OpenSearch Best Practice supports a two-step search: collection level and instance level. It provides guidelines and hierarchical levels of compliance in providing an essential description of the data. A GeoJSON encoding specification for Earth Observation has been released by OGC. It also supports responses in JSON-LD, which is readily usable for semantic applications.
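An OpenSearch service advertises a URL template whose placeholders the client fills in; the geo and time extensions add placeholders such as `{geo:box}`, `{time:start}`, and `{time:end}`. The sketch below fills such a template, assuming a hypothetical endpoint and template. A placeholder ending in `?` is treated as optional and is replaced by an empty string when no value is supplied.

```python
import re
from typing import Mapping
from urllib.parse import quote

# Matches {name} and {name?}, where name may carry an extension prefix (geo:, time:).
_PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_:]+)(\?)?\}")


def fill_template(template: str, values: Mapping[str, str]) -> str:
    """Substitute OpenSearch template placeholders with URL-encoded values."""
    def substitute(match: "re.Match[str]") -> str:
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""                     # optional parameter left empty
        raise KeyError(f"missing required parameter: {name}")
    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical template, as it might appear in a description document.
TEMPLATE = ("https://catalog.example.org/opensearch?q={searchTerms}"
            "&bbox={geo:box?}&start={time:start?}&end={time:end?}&count={count?}")

search_url = fill_template(TEMPLATE, {
    "searchTerms": "Landsat 8",
    "geo:box": "-100,30,-90,40",          # west,south,east,north
    "time:start": "2020-01-01T00:00:00Z",
})
print(search_url)
```

In a two-step search, the same substitution would be applied twice: first against the collection-level template, then against the template returned for the chosen collection.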


Fig. 7.4 Associations for Earth Observation Sensors

7.5 Standards for Remote Sensing Big Data Access

Common standards to enable the access of remote sensing big data include OGC data access services (i.e., WCS, WMS, WFS, and SOS) and OpenDAP. Protocols for data access of big remote sensing data include

• OGC Web Coverage Service (WCS),
• OGC Web Feature Service (WFS),
• OGC Web Map Service (WMS),
• OGC Sensor Observation Service (SOS),
• OpenDAP.

7.5.1 OGC Web Coverage Service (WCS)

Web Coverage Service (WCS) is an OGC implementation specification for delivering coverages. A coverage is geographic data with implicit geometry. A coverage can be raster data, imagery, or grid data. OGC WCS allows a client to obtain a grid coverage in the form the client wants. The current OGC WCS specification is the OGC® WCS 2.0 Interface Standard - Core, version 2.0.1. It allows users to access real data from the remote data server.

The WCS 2.0 core specifies a core set of requirements that a WCS implementation must fulfill. WCS extension standards add further functionality to this core; some of these are required in addition to the core to obtain a complete implementation. The core indicates which extensions, at a minimum, need to be considered in addition to this core to allow for a complete WCS implementation. The core does not prescribe support for any particular coverage encoding format. This also holds for GML as a coverage delivery format: while GML constitutes the canonical format for the definition of WCS, the core does not require that a WCS implement the GML coverage format. WCS extensions specifying the use of data encoding formats in the context of WCS are designed in a way that the GML coverage information contents specified in this core are consistent with the contents of an encoded coverage.

The core data model is shown in Fig. 7.5. The Coverage Service Model defines three core service operations: GetCapabilities, DescribeCoverage, and GetCoverage. The construction of the queries is based on service metadata defined in OWS Common. The required extensions to WCS 2.0 Core are as follows:

• Index-based subsetting,
• Protocol binding: GET/KVP, XML/POST, or XML/SOAP,
• Coverage encoding formats: GeoTIFF, netCDF, JPEG2000, or GML-JP2.

The new revision of WCS is moving toward adopting the OpenAPI specification, which comes with a suite of tools and software to support automation of interactions between service and client.
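A GetCoverage request in the GET/KVP binding can be sketched as follows. The endpoint, the coverage identifier, and the axis labels `Lat` and `Long` are assumptions; the actual axis labels and supported formats depend on the coverage and would normally be read from the service's coverage description first.

```python
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical WCS endpoint and coverage identifier.
WCS_ENDPOINT = "https://data.example.org/wcs"


def build_getcoverage_url(endpoint: str, coverage_id: str,
                          subsets: Dict[str, Tuple[float, float]],
                          fmt: str = "image/tiff") -> str:
    """Build a WCS 2.0.1 GetCoverage request with one trim subset per axis."""
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
    ]
    # The subset key is repeated once per axis: subset=Axis(low,high).
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    params.append(("format", fmt))
    return f"{endpoint}?{urlencode(params)}"


coverage_url = build_getcoverage_url(
    WCS_ENDPOINT,
    "example_ndvi_2020",                              # hypothetical coverage
    {"Lat": (30, 40), "Long": (-100, -90)},           # assumed axis labels
)
coverage_query = parse_qs(urlsplit(coverage_url).query)
print(coverage_url)
```

Requesting only the needed spatial window, rather than the whole coverage, is what makes this interface practical for very large archives.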

Fig. 7.5 OGC WCS Core Data Model


7.5.2 OGC Web Feature Service (WFS)

The OGC Web Feature Service allows a client to retrieve geospatial data encoded in GML from multiple Web Feature Services. The OpenGIS Web Feature Service Implementation Specification defines interfaces for describing data manipulation operations on geographic features using HTTP as the distributed computing platform. Requirements for a Web Feature Service are as follows:

• Interfaces must be defined in XML.
• GML must be used to express features.
• The server must be able to present features using GML. The server must return feature data in GML format.
• The filter language is defined in XML and derived from the Common Query Language (CQL).
• The data store is opaque to the client. Data can only be viewed through the WFS interface.
• The WFS must use a subset of XPath expressions for referencing properties.

The WFS specification document describes the OGC Web Feature Service (WFS) operations, which support the following operations on geographic features using HTTP:

• INSERT
• UPDATE
• DELETE
• QUERY
• DISCOVERY
• ACCESS

Common formats include GML and GeoJSON. The new version of WFS, that is, WFS 3, is based on OpenAPI.
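To make the XML filter language concrete, the sketch below builds a single comparison filter that references a property by a simple path expression. It assumes the Filter Encoding 2.0 namespace and element names used with WFS 2.0 (earlier WFS versions use a different namespace and a `PropertyName` element); the property path and the literal are hypothetical.

```python
import xml.etree.ElementTree as ET

# Filter Encoding 2.0 namespace (assumed here; WFS 1.x uses the older ogc namespace).
FES = "http://www.opengis.net/fes/2.0"
ET.register_namespace("fes", FES)


def property_equals(value_reference: str, literal: str) -> ET.Element:
    """Build <fes:Filter> with one PropertyIsEqualTo comparison."""
    flt = ET.Element(f"{{{FES}}}Filter")
    comparison = ET.SubElement(flt, f"{{{FES}}}PropertyIsEqualTo")
    # The property is referenced with a restricted XPath-like expression.
    ET.SubElement(comparison, f"{{{FES}}}ValueReference").text = value_reference
    ET.SubElement(comparison, f"{{{FES}}}Literal").text = literal
    return flt


# Hypothetical feature property and value.
filter_element = property_equals("platform/shortName", "Landsat-8")
filter_xml = ET.tostring(filter_element, encoding="unicode")
print(filter_xml)
```

The serialized filter would typically be embedded in a query request body, or URL-encoded into a request parameter, so that the server returns only matching features.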

7.5.3 OGC Web Map Service (WMS)

A Web Map Service (WMS) produces maps of spatially referenced data dynamically. The latest version of the OGC WMS specification is version 1.3.0, which is the same as ISO 19128:2005. The first version was released in 1999. It is widely implemented worldwide. The International Standard defines a map to be a portrayal of geographic information as a digital image file suitable for display on a computer screen. The map is rendered in the form of JPEG, GIF, or PNG. The definition is as follows. A web map server (WMS) is a web application that provides a portrayal of geographic data which is stored on the server. This data can be stored in a variety of data formats but is served in a limited number of image formats.


The International Standard defines three operations:

• An operation to return service-level metadata,
• An operation to return a map whose geographic and dimensional parameters are well-defined,
• An operation to return information about particular features shown on a map (optional).

These correspond to three interfaces of WMS, respectively:

• GetCapabilities: Produces information about the service. Provides information about what types of maps the server can deliver.
• GetMap: Generates map images or vector data. Generates a map as a picture or set of features based on the client’s specification and delivers the map to the client.
• GetFeatureInfo: Answers basic queries about the content of the map. Provides information about the content of a map such as the value of a feature at a location.

The example applications for a WMS client are as follows:

• List the contents of a map-based catalog.
• Select map layers, viewing regions, and scales.
• Compose and display maps constructed from data coming from one or more remote servers.
• Query through the Web for attribute information of a map feature selected from a map displayed in a client.
• Support applications based on visualization of map data obtained in real time from disparate data sources.

WMS produces an image that can be directly rendered by browsers (de La Beaujardière 2002; Lankester 2009; Voidrot-Martinez et al. 2012; Strobel et al. 2016). Commonly supported output formats for maps include PNG, JPEG, and TIFF. Styles can be supported to create maps from data services such as WFS and WCS (Lupp 2007).
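The GetMap interface can be illustrated with a request builder for WMS 1.3.0. The endpoint and layer name are hypothetical. One detail worth noting is that, in version 1.3.0, the bounding box for EPSG:4326 follows the axis order of the coordinate reference system, that is, latitude before longitude.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical WMS endpoint and layer name.
WMS_ENDPOINT = "https://maps.example.org/wms"


def build_getmap_url(endpoint: str, layer: str,
                     bbox_lat_lon: Tuple[float, float, float, float],
                     width: int, height: int,
                     image_format: str = "image/png") -> str:
    """Build a WMS 1.3.0 GetMap request for a single layer in EPSG:4326."""
    min_lat, min_lon, max_lat, max_lon = bbox_lat_lon
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                      # empty value requests the default style
        "CRS": "EPSG:4326",
        # WMS 1.3.0 with EPSG:4326: minLat,minLon,maxLat,maxLon
        "BBOX": f"{min_lat},{min_lon},{max_lat},{max_lon}",
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return f"{endpoint}?{urlencode(params)}"


map_url = build_getmap_url(WMS_ENDPOINT, "example:true_color",
                           (30, -100, 40, -90), 512, 512)
map_query = parse_qs(urlsplit(map_url).query, keep_blank_values=True)
print(map_url)
```

Because the response is an ordinary image, the resulting URL can be placed directly in a web page or handed to any client that can display PNG or JPEG files.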

7.5.4 OGC Sensor Observation Service (SOS)

The OGC Sensor Observation Service (SOS) is a data service that is primarily designed for serving data from sensors (Na and Priest 2007). Encodings may be those that come directly from sensors, such as Observations and Measurements (O&M) (Cox 2010, 2011; Gasperi et al. 2016).

7.5.5 OpenDAP

OpenDAP is an acronym for Open-source Project for a Network Data Access Protocol. It is a data access protocol that works for sequence data as well as grids. The standard defines a framework of server and client service system that is specially designed for serving dynamic data flow in climate and weather communities where data flow in and out dynamically (Cornillon et al. 2003; Sgouros 2004). The most commonly supported data format underlying the Data Access Protocol (DAP) is netCDF (Hankin et al. 2010). Other data formats supported are GeoTIFF, JPEG2000, JSON, and plain ASCII (Hankin et al. 2010; Baart et al. 2012; Fulker 2016; Bereta et al. 2018).
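With DAP, a client typically selects a variable and an index range through a constraint expression appended to the data set URL, so that only the requested subset is transferred. The following sketch assumes the DAP2 hyperslab syntax `variable[start:stride:stop]` and the `.ascii` response suffix; the server URL and the variable name `sst` are hypothetical, and some clients percent-encode the brackets before sending.

```python
from typing import Iterable, Sequence, Tuple

Range = Tuple[int, int, int]   # (start, stride, stop), stop inclusive


def hyperslab(variable: str, ranges: Iterable[Range]) -> str:
    """Format a projection such as sst[0:1:0][100:1:110]."""
    return variable + "".join(f"[{start}:{stride}:{stop}]"
                              for start, stride, stop in ranges)


def build_dap_url(dataset_url: str, response: str, projections: Sequence[str]) -> str:
    """Append a response suffix (e.g., dds, das, ascii, dods) and a constraint."""
    return f"{dataset_url}.{response}?{','.join(projections)}"


# Hypothetical data set with a variable sst(time, lat, lon).
DATASET = "https://data.example.org/opendap/sst.nc"
dap_url = build_dap_url(
    DATASET,
    "ascii",
    [hyperslab("sst", [(0, 1, 0), (100, 1, 110), (200, 1, 220)])],
)
print(dap_url)
```

The request above would ask for the first time step and a small latitude and longitude window, which illustrates why the protocol suits large gridded archives: the subset is cut on the server before any data leave it.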

References

Baart F, de Boer G, de Haas W et al (2012) A comparison between WCS and OPeNDAP for making model results and data products available through the internet. Trans GIS 16:249–265
Baumann P, Merticariu V, Dumitru A, Misev D (2016) Standards-based services for big spatio-temporal data. Int Arch Photogramm Remote Sens Spat Inf Sci 41:691
Baumann P, Hirschorn E, Masó J (2017) OGC coverage implementation schema. Open Geospatial Consortium Inc., Wayland
Bereta K, Stamoulis G, Koubarakis M (2018) Ontology-based data access and visualization of big vector and raster data. In: IGARSS 2018–2018 IEEE international geoscience and remote sensing symposium. IEEE, Piscataway, pp 407–410
Bigagli L, Nebert D, Voges U et al (2016) OGC® catalogue services 3.0 specification – HTTP protocol binding – abstract test suite. http://docs.opengeospatial.org/is/14-014r3/14-014r3.html. Accessed 4 Aug 2020
Campalani P, Beccati A, Baumann P (2013) Improving efficiency of grid representation in GML. In: EnviroInfo. IEEE, Hamburg, pp 703–708
CCSDS (2019) Reference model for an open archival information system (OAIS), recommended practice CCSDS 650.0-P-3. Space Operations Mission Directorate, NASA Headquarters, Washington, DC, USA
CEOS (2017) CEOS OpenSearch Best Practice document version 1.2. CEOS
Christopoulos C, Skodras A, Ebrahimi T (2000) The JPEG2000 still image coding system: an overview. IEEE Trans Consum Electron 46:1103–1127
Clinton WJ (1994) Executive order 12906. Coordinating geographic data acquisition and access: the national spatial data infrastructure. Fed Regist 59:17671–17674
Clinton D (2019) OpenSearch 1.1 Draft 6. In: GitHub. https://github.com/dewitt/opensearch. Accessed 4 Aug 2020
Colaiacomo L, Masó J, Devys E, Hirschorn E (2017) OGC GML in JPEG 2000 (GMLJP2) encoding standard. Open Geospatial Consortium Inc., Wayland
Cornillon P, Gallagher J, Sgouros T (2003) OPeNDAP: accessing data in a distributed, heterogeneous environment. Data Sci J 2:164–174
Cox S (2010) Geographic information: observations and measurements OGC abstract specification topic 20. Open Geospatial Consortium Inc., Wayland
Cox S (2011) Observations and measurements – XML implementation. Open Geospatial Consortium Inc., Wayland
de La Beaujardière J (2002) Web map service implementation specification. Open Geospatial Consortium Inc., Wayland
Di L, McDonald KR (2006) The NASA HDF-EOS web GIS software suite. In: Qu JJ, Gao W, Kafatos M et al (eds) Earth science satellite remote sensing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 245–253
Di L, Schlesinger B, Kobler B (2000) US FGDC content standard for digital geospatial metadata: extensions for remote sensing metadata. Int Arch Photogramm Remote Sens 33:78–81


Di L, Yang W, Deng M et al (2002) Interoperable access of remote sensing data through NWGISS. In: IEEE international geoscience and remote sensing symposium. IEEE, Piscataway, pp 255–257
Dial G, Bowen H, Gerlach F et al (2003) IKONOS satellite, imagery, and products. Remote Sens Environ 88:23–36
Dryden J (2009) The open archival information system reference model. J Arch Organ 7:214–217
Eaton B, Gregory J, Drach B et al (2003) NetCDF climate and forecast (CF) metadata conventions. Version
Eaton B, Gregory J, Drach B et al (2020) NetCDF climate and forecast (CF) metadata conventions – version 1.9 draft. Version
Federal Geographic Data Committee (1998) FGDC content standard for digital geospatial metadata. Federal Geographic Data Committee, Washington, DC
Federal Geographic Data Committee (1999) Content standard for digital geospatial metadata part 1: biological data profile. Federal Geographic Data Committee, Washington, DC
Federal Geographic Data Committee (2001) Shoreline metadata profile of the content standards for digital geospatial metadata. Federal Geographic Data Committee, Washington, DC
Federal Geographic Data Committee (2002) Content standard for digital geospatial metadata: extensions for remote sensing metadata. Federal Geographic Data Committee, Washington, DC
Fulker D (2016) Intro and recent advances: remote data access via OPeNDAP web services
Gasperi J (2010) OpenGIS geography markup language (GML) application schema for earth observation products. Open Geospatial Consortium Inc., Wayland
Gasperi J, Houbie F, Woolf A, Smolders S (2016) OGC® earth observation metadata profile of observations & measurements. Open Geospatial Consortium Inc., Wayland
Gonçalves P, Voges U (2014) OGC® OpenSearch geo and time extensions. Open Geospatial Consortium Inc., Wayland
González-Conejero J, Bartrina-Rapesta J, Serra-Sagrista J (2009) JPEG2000 encoding of remote sensing multispectral images with no-data regions. IEEE Geosci Remote Sens Lett 7:251–255
Gumelar O, Saputra RM, Yudha GD et al (2020) Remote sensing image transformation with cosine and wavelet method for SPACeMAP visualization. IOP Conf Ser Earth Environ Sci. IOP Publishing 500:012079
Hankin SC, Blower JD, Carval T et al (2010) NetCDF-CF-OPeNDAP: standards for ocean data interoperability and object lessons for community data standards processes. In: Oceanobs 2009, Venice Convention Centre, Venise, 21–25 Sept 2009
Hassell D, Gregory J, Blower J et al (2017) A data model of the Climate and Forecast metadata conventions (CF-1.6) with a software implementation (cf-python v2.1). Geosci Model Dev 10:4619–4646. https://doi.org/10.5194/gmd-10-4619-2017
Hirschorn E (2017) OGC coverage implementation schema – ReferenceableGridCoverage Extension. Open Geospatial Consortium Inc., Wayland
Houbie F, Bigagli L (2010) OGC® catalogue services standard 2.0 extension package for ebRIM application profile: earth observation products. Open Geospatial Consortium Inc., Wayland
ISO/IEC JTC 1/SC 29 (2004) ISO/IEC 15444-4:2004 Information technology - JPEG 2000 image coding system: conformance testing, 2nd edn. International Organization for Standardization, Geneva
ISO/IEC JTC 1/SC 29 (2005) ISO/IEC 15444-9:2005 Information technology - JPEG 2000 image coding system: interactivity tools, APIs and protocols, 1st edn. International Organization for Standardization, Geneva
ISO/IEC JTC 1/SC 29 (2007a) ISO/IEC 15444-3:2007 Information technology - JPEG 2000 image coding system: motion JPEG 2000, 2nd edn. International Organization for Standardization, Geneva
ISO/IEC JTC 1/SC 29 (2007b) ISO/IEC 15444-8:2007 Information technology - JPEG 2000 image coding system: secure JPEG 2000, 1st edn. International Organization for Standardization, Geneva
ISO/IEC JTC 1/SC 29 (2007c) ISO/IEC 15444-11:2007 Information technology - JPEG 2000 image coding system: wireless, 1st edn. International Organization for Standardization, Geneva


ISO/IEC JTC 1/SC 29 (2008) ISO/IEC 15444-13:2008 Information technology—JPEG 2000 image coding system: an entry level JPEG 2000 encoder, 1st edn. International Organization for Standardization, Geneva ISO (2009) ISO 19115-2:2009 geographic information — metadata — part 2: Extensions for imagery and gridded data. ISO, Geneva, Switzerland. ISO/IEC JTC 1/SC 29 (2011) ISO/IEC 15444-10:2011 Information technology—JPEG 2000 image coding system: extensions for three-dimensional data, 2nd edn. International Organization for Standardization, Geneva ISO (2012) ISO/TS 19158:2012 geographic information — quality assurance of data supply. ISO, Geneva, Switzerland. ISO (2013) ISO 19157:2013 geographic information — data quality. ISO, Geneva, Switzerland. ISO/IEC JTC 1/SC 29 (2013a) ISO/IEC 15444-6:2013 Information technology—JPEG 2000 image coding system—Part 6: compound image file format, 2nd edn. International Organization for Standardization, Geneva ISO/IEC JTC 1/SC 29 (2013b) ISO/IEC 15444-14:2013 Information technology—JPEG 2000 image coding system—Part 14: XML representation and reference, 1st edn. International Organization for Standardization, Geneva ISO (2014) ISO 19115-1:2014 geographic information — metadata — part 1: fundamentals. ISO, Geneva, Switzerland. ISO/IEC JTC 1/SC 29 (2015) ISO/IEC 15444-5:2015 Information technology—JPEG 2000 image coding system: reference software, 2nd edn. International Organization for Standardization, Geneva ISO (2016a) ISO 19119: 2016–geographic information–services, 2nd edn. International Organization for Standardization, Geneva, Switzerland. ISO (2016b) ISO/TS 19115-3:2016 geographic information — metadata — part 3: XML schema implementation for fundamental concepts. ISO, Geneva, Switzerland. ISO (2016c) ISO/TS 19157-2:2016 geographic information — data quality — part 2: XML schema implementation. ISO, Geneva, Switzerland. 
ISO/IEC JTC 1/SC 29 (2019a) ISO/IEC 15444-1:2019 Information technology—JPEG 2000 image coding system—Part 1: core coding system, 4th edn. International Organization for Standardization, Geneva ISO/IEC JTC 1/SC 29 (2019b) ISO/IEC 15444-15:2019 Information technology—JPEG 2000 image coding system—Part 15: high-throughput JPEG 2000, 1st edn. International Organization for Standardization, Geneva ISO/IEC JTC 1/SC 29 (2019c) ISO/IEC 15444-16:2019 Information technology—JPEG 2000 image coding system—Part 16: encapsulation of JPEG 2000 Images into ISO/IEC 23008-12, 1st edn. International Organization for Standardization, Geneva ISO/TC 211 (2003) ISO 19115:2003 Geographic information—Metadata. International Organization for Standardization ISO/TC 211 (2014) Content standard for digital geospatial metadata: extensions for remote sensing metadata. International Organization for Standardization Lankester THG (2009) OpenGIS® web map services – profile for EO products. Open Geospatial Consortium Inc., Wayland Lee CA (2010) Open archival information system (OAIS) reference model. In: Bates, MJ, Maack MN (eds) Encyclopedia of Library and Information Sciences (3rd ed), Taylor & Francis, Boca Raton, FL, USA, pp 4020-4030. Lupp M (2007) Styled layer descriptor profile of the web map service implementation specification. Open Geospatial Consortium Inc., Wayland Mahammad SS, Ramakrishnan R (2003) GeoTIFF-A standard image file format for GIS applications. In: Map India, Hyderabad, pp 28–31 Marcellin MW, Gormish MJ, Bilgin A, Boliek MP (2000) An overview of JPEG-2000. In: Proceedings DCC 2000. Data compression conference. IEEE, Los Alamitos, pp 523–541 Martell R (2009a) CSW-ebRIM Registry Service – part 1: ebRIM profile of CSW. Open Geospatial Consortium Inc., Wayland


Martell R (2009b) CSW-ebRIM Registry Service – part 2: basic extension package. Open Geospatial Consortium Inc., Wayland Martell R, Parr-Pearson J (2009) CSW-ebRIM Registry Service – part 3: abstract test suite. Open Geospatial Consortium Inc., Wayland Na A, Priest M (2007) Sensor observation service. Open Geospatial Consortium Inc., Wayland Neal P, Davidson J, Westcott B (2006) OpenGIS® catalogue service implementation specification 2.0.1 – FGDC CSDGM application profile for CSW 2.0. Open Geospatial Consortium Inc., Wayland Nebert DD (1999) Z39. 50 application profile for geospatial metadata or “GEO”/version 2.2/US Federal Geographic Data Committee, 2.2. US Federal Geographic Data Committee, Reston Nebert D (2007) Corrigendum for OpenGIS implementation specification 07-006: catalogue services, version 2.0.2. Open Geospatial Consortium Inc., Wayland Nebert D, Voges U, Bigagli L (2016a) OGC® catalogue services 3.0 – general model. In: Cat. SWG. http://docs.opengeospatial.org/is/12-168r6/12-168r6.html. Accessed 4 Aug 2020 Nebert D, Voges U, Vretanos P et al (2016b) OGC® catalogue services 3.0 specification – HTTP protocol binding. In: Cat. 30 SWG. http://docs.opengeospatial.org/is/12-176r7/12-176r7.html. Accessed 4 Aug 2020 NITFS Technical Board (2006) National Imagery Transmission Format Version 2.1 for the National Imagery Transmission Format Standard. National Geospatial-Intelligence Agency (NGA) National Center for Geospatial Intelligence Standards (NCGIS), Reston Pennefather PS, Suhanic W (2009) BioTIFF: a new BigTIFF file structure for organizing large image datasets and their associated metadata. Biophys J 96:30a Rabbani M (2002) JPEG2000: image compression fundamentals, standards and practice. J Electron Imaging 11:286 Rew R, Davis G (1990) NetCDF: an interface for scientific data access. IEEE Comput Graph Appl 10:76–82 Rew R, Davis G, Emmerson S et al (2009) NetCDF User’s Guide. 
Unidata Program Center, March 2009 Rew R, Hartnett E, Caron J (2006) NetCDF-4: software implementing an enhanced data model for the geosciences. In: 22nd International conference on interactive information processing systems for meteorology, oceanography, and hydrology, Atlanta Ritter N, Ruth M, Grissom BB et al (2000) Geotiff format specification geotiff revision 1.0. SPOT Image Corp 1 Schelkens P, Skodras A, Ebrahimi T (2009) The JPEG 2000 suite. John Wiley & Sons, Chichester Sgouros T (2004) OPeNDAP user guide, version 1.14. Univ R I ISO/IEC JTC 1/SC 29 (2004) ISO/IEC 15444-2:2004 Information technology—JPEG 2000 image coding system: extensions, 1st edn. International Organization for Standardization, Geneva Strobel S, Sarafinof D, Wesloh D, Lacey P (2016) DGIWG – web map service 1.3 profile – revision. Open Geospatial Consortium Inc., Wayland Ullman RE (1999) HDF-EOS, NASA’s standard data product distribution format for the Earth Observing System data information system. In: IEEE 1999 international geoscience and remote sensing symposium. IGARSS’99 (Cat. No.99CH36293). IEEE, Hamburg, pp 276–278 Voges U, Senkler K (2007) OpenGIS® catalogue services specification 2.0.2 – ISO metadata application profile. Open Geospatial Consortium Inc., Wayland Voges U, Houbie F, Lesage N, Vautier M-L (2013) I15 (ISO19115 metadata) extension package of CS-WebRIM profile. Open Geospatial Consortium Inc., Wayland Voidrot-Martinez M-F, Little C, Seib J et al (2012) OGC best practice for using web map services (WMS) with time-dependent or elevation-dependent data. Open Geospatial Consortium Inc., Wayland Wei Y, Di L, Zhao B et al (2007) Transformation of HDF-EOS metadata from the ECS model to ISO 19115-based XML. Comput Geosci 33:238–247. https://doi.org/10.1016/j.cageo.2006.06.006 Whitson KD (1996) National Imagery Transmission Format (NITF) standard: a government/industry model. In: Standards for electronics imaging technologies, devices, and systems: a critical review. 
International Society for Optics and Photonics, p 1028305

Chapter 8

Implementation Examples of Big Data Management Systems for Remote Sensing

Abstract Two examples of big data management systems for remote sensing are presented in this chapter. They are CWIC (the CEOS (Committee on Earth Observation Satellites) WGISS (Working Group on Information Systems and Services) Integrated Catalog) and GCI (the GEOSS (Global Earth Observation System of Systems) Common Infrastructure). Both are international efforts to manage remote sensing big data worldwide. The discussion covers the architecture, design, implementation, and usage of remote sensing big data management systems. The two systems take slightly different approaches to design and implementation. CWIC focuses on managing metadata and accessing data in a distributed system: data providers manage and maintain their own metadata and data, and searches and queries are carried out in a distributed manner. GEOSS GCI manages metadata and data through registration and metadata harvesting: all metadata is periodically harvested and managed in a central repository for indexing and brokerage construction, and searches and queries are performed against the harvested repository.

Keywords Cyberinfrastructure · Big data management system · Catalog · Earth observation · Metadata · GEOSS · CEOS

There are many implementation examples of big data management systems for remote sensing. In this chapter, two such systems are reviewed. They are the Committee on Earth Observation Satellites (CEOS) Working Group on Information Systems and Services (WGISS) Integrated Catalog (CWIC) and the Global Earth Observation System of Systems (GEOSS) Common Infrastructure (GCI).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/ Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_8


8.1 CWIC

8.1.1 Introduction

Existing Earth Observation (EO) data catalogs are individual portals with different query interfaces built on heterogeneous metadata models. To discover EO data, users need to deal with different web portals, get familiar with different query interfaces, and understand different metadata models. For example, legacy EO catalogs include the NASA EOS Metadata Clearinghouse (ECHO), the NOAA Comprehensive Large Array-data Stewardship System (CLASS), the USGS Earth Resources Observation and Science (EROS), the Brazil Instituto Nacional de Pesquisas Espaciais (INPE), the China Academy of Optic-Electronic (AOE) Satellite Image Catalog, the Indian Space Research Organisation (ISRO) Meteorological & Oceanographic Satellite Data Archival Centre (MOSDAC) catalog, the ISRO National Remote Sensing Centre (NRSC), the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) catalog, the National Remote Sensing Center of China (NRSCC) catalog, and the Canada Centre for Mapping and Earth Observation (CCMEO) catalog. Each catalog implements its own discovery service and models its metadata with its own information model. Users need to search and access individual portals and catalogs following the specifications and guides released at each Web site.

8.1.2 CEOS WGISS

CEOS is the international interagency organization coordinating the satellite EO programs of the world’s governmental space agencies. The Working Group on Information Systems and Services (WGISS) is a working group of CEOS that aims to promote collaboration in the development of systems and services that manage and supply EO data to users worldwide. Initiated by NOAA and NASA, the CEOS WGISS Integrated Catalog (CWIC) was started as a WGISS project in 2010. CWIC aims to provide a single access point for the catalog systems of major CEOS member agencies. CWIC is offered as the CEOS community catalog and as a part of the GEO common infrastructure.

8.1.3 CWIC Architecture Design

The core architecture that unifies the interfaces for discovering and accessing data from diverse data providers is the mediator-wrapper architecture. It has two main types of components. One is the mediator, which faces end users and provides standard interfaces with a unified information model. The mediator receives a query from the data client, dispatches the request to the relevant wrapper, and assembles and returns the query results. The other is the wrapper. The wrapper translates the global query language into the native query language of the backend data provider, executes the data inventory search, converts the native metadata model to the global metadata model, and returns the response to the mediator. Figure 8.1 shows the relationships among users, mediators, and wrappers (Bai et al. 2007).

Fig. 8.1 Mediator-wrapper architecture for the federated CWIC catalog
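The division of labor between mediator and wrapper can be illustrated with a minimal Python sketch. It is not the CWIC implementation: the class names, the in-memory "provider", and the field names (ds, scene, acquired) are all hypothetical and chosen only to show query translation, native search, and mapping back to a common model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class CommonQuery:
    """Provider-neutral query (a stand-in for the unified information model)."""
    dataset_id: str
    start: Optional[str] = None  # RFC 3339 UTC timestamp, inclusive
    end: Optional[str] = None    # RFC 3339 UTC timestamp, inclusive


class Wrapper(Protocol):
    def search(self, query: CommonQuery) -> list[dict]: ...


class InMemoryWrapper:
    """Hypothetical wrapper over a provider with its own native field names."""

    def __init__(self, provider: str, native_records: list[dict],
                 field_map: dict[str, str]) -> None:
        self.provider = provider
        self.native_records = native_records
        self.field_map = field_map  # common field name -> native field name

    def search(self, query: CommonQuery) -> list[dict]:
        # Translate the common query items into native field names.
        ds_key = self.field_map["dataset_id"]
        time_key = self.field_map["time"]
        # Run the "native" inventory search. Timestamps in one fixed UTC
        # format compare correctly as plain strings.
        hits = [
            r for r in self.native_records
            if r[ds_key] == query.dataset_id
            and (query.start is None or r[time_key] >= query.start)
            and (query.end is None or r[time_key] <= query.end)
        ]
        # Convert native metadata back to the common model.
        results = []
        for r in hits:
            record = {common: r[native] for common, native in self.field_map.items()}
            record["provider"] = self.provider
            results.append(record)
        return results


class Mediator:
    """Faces the client and dispatches each query to the relevant wrapper."""

    def __init__(self) -> None:
        self._routes: dict[str, Wrapper] = {}

    def register(self, dataset_id: str, wrapper: Wrapper) -> None:
        self._routes[dataset_id] = wrapper

    def search(self, query: CommonQuery) -> list[dict]:
        wrapper = self._routes.get(query.dataset_id)
        if wrapper is None:
            raise LookupError(f"unknown dataset: {query.dataset_id}")
        return wrapper.search(query)


provider_a = InMemoryWrapper(
    "provider-a",
    [
        {"ds": "DS_A", "scene": "a-001", "acquired": "2020-01-15T10:00:00Z"},
        {"ds": "DS_A", "scene": "a-002", "acquired": "2020-03-01T10:00:00Z"},
    ],
    {"dataset_id": "ds", "granule_id": "scene", "time": "acquired"},
)
mediator = Mediator()
mediator.register("DS_A", provider_a)
results = mediator.search(CommonQuery("DS_A", end="2020-01-31T23:59:59Z"))
print(results)
```

Adding a new provider in this sketch means writing one more wrapper and registering it; the client-facing interface of the mediator does not change.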

8.1.4 CWIC System Implementation

The CWIC system supports two types of interfaces: OGC CSW (Nebert et al. 2007) and OpenSearch (Gonçalves and Voges 2014; CEOS 2017; Clinton 2019). Two tasks are needed to bridge the client and the data provider: translating the user's query into a native request to the data provider, and mapping the native response from the data provider to the common information model to be sent back to the user. Both tasks are completed in a wrapper. CWIC takes queries from the CSW or OpenSearch interfaces. All queries are first mapped into a common query class that manages all the query items. The wrapper translates the items in the common query class into a query in the native language of the data provider and invokes the request. The native response from the data provider is then mapped into the common information model to produce a response in Dublin Core, ISO 19115, or an OpenSearch Atom feed. Figure 8.2 shows wrappers and mediators for CSW interfaces (both Dublin Core and ISO 19115 profiles). Figure 8.3 shows wrappers and mediators for the OpenSearch interface.

Fig. 8.2 CWIC mediator and wrappers for CSW interfaces

Fig. 8.3 CWIC mediator and wrappers for OpenSearch interface

The unified information model requires the support of common core query items. To reduce the loss of information from data providers, the CWIC wrapper also relays all the unmanaged information by putting it in relevant properties or attributes. This is easily done with the flexibility and comprehensiveness of the ISO 19115 profile. With Dublin Core, it may not be possible to pass through all the information from data providers. Extra properties can be added to the OpenSearch Atom feed to pass the extra information.

The CWIC OpenSearch interface follows the CEOS community specification (CEOS 2017). The following lists some of the core query parameters supported by the CWIC OpenSearch interface:

• Dataset Identifier (Mandatory): The dataset identifier (datasetId) parameter specifies the data set identifier, which can be retrieved from IDN as the DIF Entry ID. This parameter is required. The client cannot specify more than one data set in a single request. The error-handling process for a nonexistent dataset identifier is to set the HTTP status code to 400, generate a user-friendly exception message in , and send the error message response to the client.

• Temporal Extension (Optional): The temporal extent query parameter follows the specification of OGC 10-032r8 (Gonçalves and Voges 2014). The required namespace is “http://a9.com/-/opensearch/extensions/time/1.0/.” Parameters include time:start, a string describing the start of the temporal interval to search (greater than or equal to), and time:end, a string describing the end of the temporal interval to search (less than or equal to). The format is a character string according to RFC-3339 (Klyne and Newman 2002). A date only is in the format YYYY-MM-DD, where YYYY is the year, MM is the month, and DD is the day of the month. A date-time is in the format YYYY-MM-DDTHH:MI:SSZ, where HH is the hour, MI is the minute, and SS is the second.

• GEO Extension (Optional): The spatial extent adopts the open specification of OGC 10-032r8 (Gonçalves and Voges 2014). The required namespace is “http://a9.com/-/opensearch/extensions/geo/1.0/.” The query parameter is geo:box, which specifies the bounding box of the area of interest. The format of the box is defined by “west, south, east, north” coordinates of longitude and latitude, in EPSG:4326 decimal degrees.
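As an illustration of how the three parameters above combine into one request, the following Python sketch builds a granule search URL. The endpoint and the URL parameter names (datasetId, timeStart, timeEnd, geoBox) are assumptions made for this example; in practice a client reads the actual names from the URL template in the service's OpenSearch description document, where they are bound to time:start, time:end, and geo:box.

```python
import re
from urllib.parse import urlencode

# Accepts the two formats described above: YYYY-MM-DD or YYYY-MM-DDTHH:MI:SSZ.
RFC3339_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}Z)?$")


def build_granule_query(base_url, dataset_id, start=None, end=None, bbox=None):
    """Build a granule search URL (parameter names are hypothetical)."""
    if not dataset_id:
        raise ValueError("datasetId is mandatory")
    params = {"datasetId": dataset_id}

    for name, value in (("timeStart", start), ("timeEnd", end)):
        if value is not None:
            if not RFC3339_PATTERN.match(value):
                raise ValueError(f"{name} is not in a supported RFC 3339 form: {value}")
            params[name] = value

    if bbox is not None:
        west, south, east, north = bbox
        # West may exceed east for boxes crossing the antimeridian,
        # so only the value ranges and the latitude order are checked.
        if not (-180 <= west <= 180 and -180 <= east <= 180):
            raise ValueError("longitude out of range")
        if not (-90 <= south <= north <= 90):
            raise ValueError("latitude out of range or south > north")
        params["geoBox"] = ",".join(str(v) for v in (west, south, east, north))

    return f"{base_url}?{urlencode(params)}"


url = build_granule_query(
    "https://example.org/opensearch/granules.atom",  # placeholder endpoint
    "EXAMPLE_DATASET",
    start="2020-01-01T00:00:00Z",
    end="2020-01-31",
    bbox=(-10.0, 35.0, 5.0, 45.0),
)
print(url)
```

Validating on the client side mirrors the server behavior described above: a request without a dataset identifier is rejected before it is sent, rather than coming back as an HTTP 400 response.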

8.1.5 Results and Conclusion

Figure 8.4 shows the overall concept of operations and the interactions with resource providers and users. The first interaction is between data providers and the CEOS International Directory Network (IDN). CWIC connectors are components that bridge data providers and CWIC. Data providers register the collections to be searchable through CWIC in the IDN. The IDN is managed and hosted at the NASA Common Metadata Repository (CMR). The second interaction is between the IDN and the CWIC mediators or metadata search agents. The CWIC mediator synchronizes the searchable data collections from IDN/CMR to build a mapping list that maps registered collections to CWIC connectors (data providers). The third interaction is between users and the CEOS IDN. CWIC clients perform the first-step search for collection metadata (granule metadata end point). CWIC allows a two-step search: collection search and granule search. The collection search returns matching collections that are closely tied to data providers. The fourth interaction is between users and the CWIC mediator. This is the granule search, or the second step in the two-step search process. CWIC clients perform the second-step search for granule metadata (data access options). The fifth interaction happens behind the scenes between the CWIC mediator and the CWIC data connectors. The CWIC mediator dispatches the granule search to the respective CWIC connectors through indexed mappings between collections and connectors. CWIC connectors delegate the CWIC client’s requests by searching against backend data provider catalogs. The sixth interaction is between users and the data access services of data providers. The CWIC client accesses (orders or downloads) granules through the end points (native access protocols) returned in CWIC metadata.

Fig. 8.4 Overall concept of operations

CWIC implements both CSW and OpenSearch interfaces. The two interfaces pass different messages through these interactions to accomplish the two-step search. Figure 8.5 shows interactions through OGC CSW interfaces. Messages passed among CWIC, IDN, and clients are encoded in CSW filtering languages for query requests and in Dublin Core (Nebert et al. 2007) or ISO 19115 metadata (Voges and Senkler 2007) for query responses. Figure 8.6 shows interactions through OpenSearch interfaces. Messages passed among CWIC, IDN, and clients are encoded as OpenSearch query parameters in a template for query requests and as Atom feeds for query responses (WGISS CDA System-Level Team 2019). Table 8.1 lists the assets connected through CWIC as of August 1, 2020.
In total, 3176 data collections and approximately 193.9 million granules are searchable through CWIC. Several clients have been developed to interact with the CWIC catalog. Example clients include the CWIC test client, the GMU GeoBrain, the NASA ECHO Reverb, the Canada Centre for Remote Sensing (CCRS), the Italy Earth and Space Science Informatics Laboratory (ESSI-Lab), the USGS Land Surface Imaging (LSI) Explorer, and the EarthData Search engine. The CWIC test client supports both OpenSearch and CSW interfaces to interact with the CWIC catalog. Figure 8.7 shows the process of using the CWIC test client to interact with the catalog through CSW interfaces. Requests and responses follow the specification of the OGC CSW ISO 19115 profile.

Fig. 8.5 Interactions among CWIC, IDN, and clients with OGC CSW interfaces

Fig. 8.6 Interactions among CWIC, IDN, and clients with OpenSearch interfaces
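The two-step search described in this section (collection search against the IDN, then granule search dispatched through the collection-to-connector mapping) can be sketched in Python as follows. All records, connector names, and URLs below are invented for illustration, and the connectors are plain functions standing in for calls to backend provider catalogs.

```python
# Invented collection records standing in for entries registered in the IDN.
IDN_COLLECTIONS = [
    {"dataset_id": "DEMO_LAND_SR", "title": "Demo land surface reflectance",
     "connector": "connector-a"},
    {"dataset_id": "DEMO_SST", "title": "Demo sea surface temperature",
     "connector": "connector-b"},
]


def connector_a(dataset_id):
    # Stand-in for a search against one provider's backend catalog.
    return [{"granule_id": "a-001", "access_url": "https://example.org/a/a-001"}]


def connector_b(dataset_id):
    return [
        {"granule_id": "b-001", "access_url": "https://example.org/b/b-001"},
        {"granule_id": "b-002", "access_url": "https://example.org/b/b-002"},
    ]


CONNECTORS = {"connector-a": connector_a, "connector-b": connector_b}


def build_mapping(collections):
    """Mediator side: map each registered collection to its connector."""
    return {c["dataset_id"]: c["connector"] for c in collections}


def collection_search(collections, keyword):
    """Step 1: find collections whose title contains the keyword."""
    keyword = keyword.lower()
    return [c["dataset_id"] for c in collections if keyword in c["title"].lower()]


def granule_search(mapping, dataset_id):
    """Step 2: dispatch the granule search to the connector for this collection."""
    connector_name = mapping.get(dataset_id)
    if connector_name is None:
        raise LookupError(f"collection not registered: {dataset_id}")
    return CONNECTORS[connector_name](dataset_id)


mapping = build_mapping(IDN_COLLECTIONS)
matches = collection_search(IDN_COLLECTIONS, "temperature")
granules = granule_search(mapping, matches[0])
for granule in granules:
    print(granule["granule_id"], granule["access_url"])
```

The access URLs returned in step 2 correspond to the sixth interaction: the client retrieves the data directly from the provider, not through the mediator.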


Table 8.1 Connected data assets through CWIC (by August 1, 2020)

Connector       Total data collection number    Total granule number
USGSLSI         23                              11803239
INPE            20                              1138153
GHRSST          79                              2259819
NOAA_NCEI       2                               6600
CCMEO           2                               1379
ISRO/MOSDAC     20                              1904652
EUMETSAT        8                               78513
ISRO/NRSC       7                               385988
NRSCC           6                               23678
NASA            3009                            176283923

Fig. 8.7 CWIC Test Client to interact with the catalogue through CSW interface

The GeoBrain online analysis system (GeOnAS) is a Web-based rich client that uses open standard Web services to provide geospatial analysis functions (Di et al. 2007; Han et al. 2008; Zhao et al. 2012). Figure 8.8 shows the process of interacting with the CWIC catalog using the GeoBrain GeOnAS. The GeoBrain GeoDataDownload is a generic data access broker service that allows users to find and access data from standard OGC data access Web services, including WCS, WFS, and SOS (Di and Deng 2010). Figure 8.9 shows the process of finding and accessing data through the GeoBrain GeoDataDownload.

Fig. 8.8 GeoBrain GeOnAS to interact with the catalogue

Fig. 8.9 GeoBrain GeoDataDownload to interact with the catalogue

The CWIC catalog collects and analyzes logs of the CWIC services to create metrics. Each operation of the CWIC catalog Web services is logged along with the access IP address and the type of operation. A log analyzer module runs periodically in the background to analyze the log files and incrementally update the CWIC metric database, which is managed as a relational database (HSQLDB by default). Aggregation and indexing are triggered automatically when new logs are analyzed and ingested. Figure 8.10 shows the data flow and process for collecting and analyzing CWIC logs and building up the CWIC metric database. Figure 8.11 shows the metrics summary for the last month of CWIC catalog use. Figure 8.12 shows the breakdown by operations and the alternative breakdown summary options: by data set, by country, and by data provider.
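The incremental log analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the CWIC log analyzer: the log line layout (timestamp, IP address, operation, separated by spaces) is hypothetical, and Python's built-in sqlite3 module is swapped in for HSQLDB so that the example is self-contained.

```python
import sqlite3


def ingest(conn, log_lines):
    """Fold a batch of new log lines into per-operation hit counts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS op_counts ("
        "operation TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        _timestamp, _ip, operation = parts
        # Update the existing aggregate, or create it on first sight.
        cur = conn.execute(
            "UPDATE op_counts SET hits = hits + 1 WHERE operation = ?",
            (operation,),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO op_counts (operation, hits) VALUES (?, 1)",
                (operation,),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")

# First periodic run.
ingest(conn, [
    "2020-08-01T00:00:01Z 192.0.2.1 GetRecords",
    "2020-08-01T00:00:05Z 192.0.2.2 GetCapabilities",
])
# A later run sees only the new lines and updates the same aggregates.
ingest(conn, ["2020-08-01T01:00:00Z 192.0.2.1 GetRecords"])

counts = dict(conn.execute("SELECT operation, hits FROM op_counts"))
print(counts)
```

Because each run only adds to the stored aggregates, the analyzer never has to re-read old log files, which is the point of updating the metric database incrementally.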


Fig. 8.10 CWIC metrics collection and analysis data flow

Fig. 8.11 CWIC Metrics Portal – Summary and charts for the last month

Fig. 8.12 CWIC Metrics Portal – Summary by operations and options to view different breakdown summaries

8.1.6 Future Work

The CWIC catalog is under further development. As it develops, it will integrate more data catalogs and engage more data providers. Query languages will be expanded, possibly taking semantic catalog capabilities into account. CWIC metrics will evolve to include related metrics. For example, for CWIC access metrics, visits and unique visitors may be summarized and reported. For CWIC resource metrics, data center rank and data set rank may be generated to represent the usage of resources. For CWIC performance metrics, more details on query response time may be summarized. The OpenSearch interface is also under development: more options are being added, and the items that can be queried may be expanded along with additional semantic content.

8.2 The Registry in GEOSS GCI

8.2.1 Background

8.2.1.1 GEO

The Group on Earth Observations (GEO) was launched in response to calls for action by the 2002 World Summit on Sustainable Development and the G8 (the Group of Eight) leading industrialized countries (Plag 2008; Ryan and Cripe 2014). GEO is a voluntary partnership of governments and international organizations. As of August 2020, GEO’s Members include 111 Governments and the European Commission, and 130 intergovernmental, international, and regional organizations.


Fig. 8.13 The architecture of GEOSS – engineering view

8.2.1.2 The Role of the Registry

The overall architecture of the GEOSS Common Infrastructure (GCI) is introduced in Sect. 4.2.1. This section examines the GEOSS Component and Service Registry (CSR), one of the core components that enable the discovery of resources (Fig. 8.13). The Registry provides the cataloguing service for all resources managed by the GEOSS. The engineering view of the GCI architecture in Fig. 8.13 shows the functions of three major components (i.e., Registry, Portal, and Clearinghouse) in the GEOSS (Percivall et al. 2007). Figure 8.14 shows interactions among the GEOSS Registries, Portal, and Clearinghouse (Percivall et al. 2007).

8.2.2 The GEOSS Component and Service Registry

The GEOSS Component and Service Registry (CSR) was implemented and released by the Center for Spatial Information Sciences and Systems (CSISS), George Mason University, which collaborated with officials from the Federal Geographic Data Committee (FGDC) on designing, implementing, maintaining, and upgrading the Registry.


Fig. 8.14 Interactions among components of the GEOSS

8.2.2.1 Functionalities

The CSR provides the following capabilities for data providers: (1) user registration, (2) resource registration, (3) the GEOSS Data Collection of Open Resources for Everyone (GEOSS Data-CORE) self-nomination, and (4) resource approval. CSR clients, for example, the GEOSS Clearinghouse, can discover the resources registered in CSR through the following interfaces: (1) OGC Catalog Service for Web (CSW), (2) Universal Description Discovery and Integration (UDDI), and (3) ebXML Registry Services Specifications (ebRS).

8.2.2.2 Concept

The CSR consists of three interlinked resources and functional groups: Component, Service, and Registry. A Component defines Earth Observation resources contributed by a GEO Member or Participating Organization. The following are typical types of components: (1) Data sets, (2) Monitoring and Observation Systems, (3) Computational Models, (4) Education or Research Initiatives, (5) Websites and Documents, (6) Analysis and Visualization Systems, (7) Alerts, RSS (RDF (Resource Description Framework) Site Summary or Really Simple Syndication), and Information Feeds, (8) Catalogs, Inventories and Metadata Collections, and (9) Software and Applications.

A Service (Interface) defines a set of functionality provided by a component through its system interfaces. Services communicate primarily using structured messages, based on the Service-Oriented Architecture (SOA) view of complex systems. A Registry maintains the descriptive information, or metadata, about the components and their service interfaces. As the CSR evolved, the concept of Resource was introduced to include both Component and Service (interface). Component Types were renamed Resource Categories, and Service (interface) is treated as a resource category.

8.2.2.3 System Design

The active roles of CSR include GEO members, the GEO Secretariat, the GEOSS Clearinghouse, public users, and system administrators. Figure 8.15 shows the roles, the resources, and the interactions among them. Different roles perform different sets of functions. Use cases include the following: (1) GEO members register resources (components and services), (2) the GEO Secretariat approves registrations, (3) public users query the CSR catalog, (4) the GEOSS Clearinghouse harvests resources from the CSR catalog, and (5) system administrators maintain the Registry.

Fig. 8.15 Interactions and Use Cases for CSR (actors: GEO Member, GEO Secretariat, Public User, GEOSS Clearinghouse, System Admin; use cases: register, update, and delete component; register, update, and delete service; approve component; query component; query service; maintain registration info)

The functionality of CSR was implemented with Java classes following the specifications of GEO and OGC catalog services (Voges and Senkler 2007). Figure 8.16 shows the major classes implemented in CSR. The Component class and the ServiceInstance class deal with the two major types of resources, component and service, respectively.

Fig. 8.16 Classes in CSR

The service resource in CSR was initially captured by a five-level taxonomy model: (1) Category, (2) Individual standard, (3) Specific version, (4) Specific binding, and (5) Specific profile (Bai et al. 2009). Figure 8.17 shows the taxonomy model originally used in CSR. A URN (Uniform Resource Name) is used to identify each level explicitly. The taxonomy can be extended by proposing new items at each level, as long as the tree architecture is maintained. The model is maintained internally in the CSR.

Fig. 8.17 Service Taxonomy in CSR (levels: Service Category, Service Standard, Standard Version, Service Binding, Service Profile; each level is identified by a URN)

The model was simplified to two levels as the CSR catalog evolved: (1) service type at level one, for example, Data Access Service or Catalog/Registry Service, and (2) individual service standards at level two, for example, OGC WCS 2.0 under the “Data Access” type. The two-level taxonomy model is maintained in the GEOSS Standards and Special Arrangements Registry. Figure 8.18 illustrates the simplified two-level taxonomy model.
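A small Python sketch can make the two taxonomy models concrete. It assumes a hypothetical URN layout (each level appends one segment to its parent's URN) and invented binding and profile names; neither is taken from the CSR itself. The simplify function shows one way the five-level tree could collapse into the two-level form of service type and individual standard.

```python
from dataclasses import dataclass, field

# The five levels of the original model, from root to leaf.
LEVELS = ("category", "standard", "version", "binding", "profile")


@dataclass
class TaxonomyNode:
    level: str
    name: str
    urn: str  # hypothetical layout: parent URN plus one more segment
    children: list = field(default_factory=list)

    def add(self, name, urn_part):
        """Extend the tree by one item at the next level down."""
        depth = LEVELS.index(self.level)
        if depth + 1 >= len(LEVELS):
            raise ValueError("profile is the deepest level")
        child = TaxonomyNode(LEVELS[depth + 1], name, f"{self.urn}:{urn_part}")
        self.children.append(child)
        return child


def simplify(category):
    """Collapse a category subtree into (service type, standard) pairs."""
    pairs = []
    for standard in category.children:
        for version in standard.children:
            pairs.append((category.name, f"{standard.name} {version.name}"))
    return pairs


# Illustrative entries only; the URN prefix is a placeholder.
data_access = TaxonomyNode("category", "Data Access", "urn:example:csr:data-access")
wcs = data_access.add("OGC WCS", "wcs")
wcs_20 = wcs.add("2.0", "2.0")
kvp = wcs_20.add("KVP binding", "kvp")
eo_profile = kvp.add("EO profile", "eo")

print(eo_profile.urn)
print(simplify(data_access))
```

Keeping the level order in one tuple enforces the tree architecture: an item can only be added directly below its parent's level, and nothing can be added below a profile.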

Fig. 8.18 Simplified two level service taxonomy model in CSR

8.2.3 System Implementation

8.2.3.1 Logical Design and Main Functionalities

The logical architecture of CSR is shown in Fig. 8.19. The main functionalities are as follows: (1) For GEOSS resource providers, CSR supports user registration; registration of Earth Observation resources and/or affiliated service interfaces by referencing GEOSS-endorsed standards; searching, modifying, and deleting registered records; and requesting approval of their records. (2) For GEOSS CSR end users, CSR provides a nonsecure public search interface and the Holdings page. (3) For GEOSS Clearinghouses, CSR provides a dedicated OGC Catalog Service Interface (CSW 2.0.2) for discovery and harvest. (4) For the GEO Secretariat and CSR operators at CSISS/GMU and FGDC, CSR authorizes record approval.

Fig. 8.19 Logical architecture of the CSR

8.2.3.2 Registry Pages

CSR provides pages for users to register and query resources. Figure 8.20 shows the homepage of CSR and highlights major content sections.
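To illustrate how a clearinghouse might page through a CSW 2.0.2 interface when harvesting, the sketch below builds GetRecords requests in key-value-pair form. The endpoint URL is a placeholder, and the sketch assumes the service accepts KVP-encoded GET requests; it only constructs the URLs and does not contact any server. In a real harvest, the total would be read from the numberOfRecordsMatched attribute of the first response rather than passed in.

```python
from urllib.parse import urlencode


def get_records_url(endpoint, start_position=1, max_records=10):
    """Build one CSW 2.0.2 GetRecords request (KVP encoding)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start_position,  # CSW record positions start at 1
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


def harvest_urls(endpoint, total_records, page_size=10):
    """Return the page requests needed to cover total_records records."""
    return [
        get_records_url(endpoint, start, page_size)
        for start in range(1, total_records + 1, page_size)
    ]


urls = harvest_urls("https://example.org/csr/csw", total_records=25, page_size=10)
for u in urls:
    print(u)
```

With 25 matching records and a page size of 10, three requests are produced, starting at positions 1, 11, and 21.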


Fig. 8.20 CSR registry homepage

Registration is required for users, normally GEO members, who register resources into the Registry. Once a GEO member logs in to the Registry, CSR presents a list of resource categories that are agreed upon by the GEOSS community. The current categories are as follows: (1) Data sets, (2) Monitoring and Observation Systems, (3) Computational Models, (4) Initiatives, (5) Websites and Documents, (6) Analysis and Visualization, (7) Alerts, RSS, and Information Feeds, (8) Catalogs, Inventories and Metadata Collections, (9) Software and Applications, and (10) Service Interfaces. Providers need to select the one that is most suitable for their resource. If the “Service Interfaces” category is selected, the registration page changes accordingly to solicit service interface information instead.

On resource-sharing properties, the GEOSS Data Sharing Task Force (DSTF) has proposed the following terms to capture/classify the characteristics of resource sharing: (1) GEOSS User Registration, (2) GEOSS No Monetary Charge, (3) GEOSS Attribution, and (4) GEOSS Data-CORE. Each of these terms identifies a different combination of situations/requirements that need to be met when accessing the resources. In particular, GEOSS Data-CORE, the GEOSS Data Collection of Open Resources for Everyone, is a distributed pool of documented data sets, contributed by the GEO community based on full and open exchange (at no more than the cost of reproduction and distribution) and unrestricted access.

To semantically define each registered resource, the GEOSS Semantic Working Group has collected and proposed the GEOSS Earth Observation Vocabulary, which establishes a solid foundation to semantically describe the critical Earth Observations targeted by the resources. It is a multiple-level taxonomy, for example, Atmosphere → Atmospheric Temperature → Surface Air Temperature. Multiple selections are allowed for a single resource.


Providers may further define the specific standards or special arrangements that apply to the resources. By telling more about the resources, providers make it easier for users to evaluate these valuable resources and therefore promote the use of registered resources.

When selecting “Service Interfaces” from the resource categories, providers are presented with a service interface registration page that allows input of service information. Among the required fields, the Information URL points to a human-readable information page, while the Interface URL refers to the end point used by software to invoke the service itself. If there is no dedicated web page/document for the user to check, it is acceptable, though not preferred, to provide the Interface URL as the Information URL. Spatial and temporal extents are required to describe resources.

Providers may choose to request approval when registering their resources. They may also send this request later, when they are comfortable with their resources and the corresponding records in CSR. Upon receiving a request, the CSR operators review the records and decide either to approve them or to return them to the providers with modification suggestions. Pending records are still maintained by the CSR, and only the providers can modify or delete them.

8.2.3.3 The Registry

In summary, the GEOSS Component and Service Registry enables (1) registration of geospatial resources and affiliated service interfaces by providing key metadata, identifying referenced public/nonpublic standards, or selecting targeted critical Earth Observations; and (2) discovery of geospatial resources by defining key metadata phases and/or implementation service standards of interest, through the web-based search interface and/or dedicated API interfaces.
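The registration fields and the approval workflow described above can be sketched as a small record type. This is an illustrative sketch only: the class name `ServiceInterfaceRecord`, the field names, and the status values are assumptions made for this example and do not reproduce the CSR schema or its API.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical record states; the names are not taken from the CSR."""
    DRAFT = "draft"        # registered, approval not yet requested
    PENDING = "pending"    # approval requested, awaiting operator review
    APPROVED = "approved"
    RETURNED = "returned"  # sent back with modification suggestions


@dataclass
class ServiceInterfaceRecord:
    title: str
    interface_url: str         # end point invoked by software
    bbox: tuple                # (west, south, east, north) in degrees
    time_range: tuple          # (start, end) as ISO 8601 date strings
    information_url: str = ""  # human-readable page; may be left empty
    status: Status = Status.DRAFT

    def __post_init__(self):
        # Fall back to the interface URL when no information page exists.
        if not self.information_url:
            self.information_url = self.interface_url
        west, south, east, north = self.bbox
        if max(abs(west), abs(east)) > 180 or max(abs(south), abs(north)) > 90 or south > north:
            raise ValueError("invalid spatial extent")
        if self.time_range[0] > self.time_range[1]:
            raise ValueError("invalid temporal extent")

    def request_approval(self):
        """Provider asks for review, either at registration time or later."""
        if self.status not in (Status.DRAFT, Status.RETURNED):
            raise ValueError("approval already requested or granted")
        self.status = Status.PENDING

    def review(self, approve):
        """Operator either approves the record or returns it to the provider."""
        if self.status is not Status.PENDING:
            raise ValueError("record is not awaiting review")
        self.status = Status.APPROVED if approve else Status.RETURNED
```

For example, a record created with only an interface URL ends up with the same value as its information URL, and it can cycle through `request_approval()` and `review(approve=False)` until an operator approves it.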

References

Bai Y, Di L, Chen A et al (2007) Towards a geospatial catalog federation service. Photogramm Eng Remote Sens 73:699–708
Bai Y, Di L, Wei Y (2009) A taxonomy of geospatial services for global service discovery and interoperability. Comput Geosci 35:783–790. https://doi.org/10.1016/j.cageo.2007.12.018
CEOS (2017) CEOS OpenSearch best practice document version 1.2. CEOS
Clinton D (2019) OpenSearch 1.1 Draft 6. In: GitHub. https://github.com/dewitt/opensearch. Accessed 4 Aug 2020
Di L, Deng M (2010) Enhancing remote sensing education with GeoBrain cyberinfrastructure. In: 2010 IEEE international geoscience and remote sensing symposium (IGARSS). IEEE, Honolulu, Hawaii, USA, pp 98–101
Di L, Zhao P, Han W et al (2007) Web service-based GeoBrain online analysis system (GeOnAS). In: NASA science technology conference, pp 19–21
Gonçalves P, Voges U (2014) OGC® OpenSearch geo and time extensions. Open Geospatial Consortium Inc., Wayland


Han W, Di L, Zhao P et al (2008) Design and implementation of GeoBrain online analysis system (GeOnAS). In: Proceedings of the 8th international symposium on web and wireless geographical information systems. Springer, Berlin, Heidelberg, pp 27–36
Klyne G, Newman C (2002) Date and time on the internet: timestamps (RFC 3339). Techn Ber Internet Eng Task Force
Nebert D, Whiteside A, Vretanos P (Peter) (2007) OpenGIS® catalog services specification. Open Geospatial Consortium Inc., Wayland
Percivall G, Simonis I, Nebert D (eds) (2007) GEOSS Core architecture implementation report, 1.0. GEO Architecture and Data Committee
Plag H-P (2008) GEO, GEOSS and IGOS-P: The framework of global Earth observations. In: Proceedings of IGS 2006 workshop. GEO (Group on Earth Observations), Darmstadt, Geneva
Ryan B, Cripe D (2014) The Group on Earth Observations (GEO) through 2025, pp A0.1–1-14
Voges U, Senkler K (2007) OpenGIS® catalog services specification 2.0.2 – ISO metadata application profile. Open Geospatial Consortium Inc., Wayland
WGISS CDA System-Level Team (2019) WGISS connected data assets client partner guide (OpenSearch)
Zhao P, Di L, Han W, Li X (2012) Building a web-services based geospatial online analysis system. IEEE J Sel Top Appl Earth Obs Remote Sens 5:1780–1792

Chapter 9

Big Data Analytics for Remote Sensing: Concepts and Standards

Abstract This chapter covers concepts of big data analytics in general, including definitions, categories, and an overview of use cases. Processes and requirements for remote sensing big data analytics are discussed. Two standardization efforts on big data analytics are briefly covered, namely, the ISO (International Organization for Standardization) Big Data Working Group and the IEEE (Institute of Electrical and Electronics Engineers) Big Data Standard group.

Keywords Big data analytics · Standard · Descriptive analytics · Diagnosis analytics · Predictive analytics · Data science · Machine learning · Data mining · Predictive modeling

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_9

9.1 Big Data Analytics Concepts

9.1.1 What Is Big Data Analytics?

Value is the most important V of all the big data Vs. Analytics is how one realizes the value of big data. Data analytics is the process of examining large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information (Sagiroglu and Sinanc 2013; Shang et al. 2013; Zheng et al. 2013; Elgendy and Elragal 2014; Hu et al. 2014; Babu and Sastry 2014). “Value is created only when the data is analyzed and acted on” (Watson 2014). Analytics uses descriptive and predictive models to gain valuable knowledge from data. Thus, analytics is not so much concerned with individual analyses or analysis steps as with the entire methodology.

Table 9.1 lists differences between traditional analytics and big data analytics (Watson 2014; Wang et al. 2018). The main differences fall under three aspects: focus, dataset, and support. Traditional data analytics focuses on descriptive or diagnostic analytics, while big data analytics focuses on predictive analytics and modeling. Traditional data analytics deals primarily with cleaned, managed datasets, while big data analytics deals with large, raw datasets. Traditional data analytics


Table 9.1 Differences between traditional data analytics and big data analytics
Aspect | Traditional analytics | Big data analytics
Focus | Descriptive analytics; diagnosis analytics | Predictive analytics; data science
Datasets | Limited datasets; cleansed data; simple models | Large-scale datasets; more types of data; raw data; complex data models
Supports | Causation: what happened and why?; business intelligence and decision support systems | Correlation: new insight; more accurate answers; evidence-based decision making and action taking

tries to explore a causal relationship, while big data analytics looks into correlation. One illustrative example concerns wireless sensor networks (Baraniuk 2011; Tsai et al. 2015). In a traditional wireless sensor network, data are served through one sensor, and the system of processing, communication, and storage is short of data. In big data sensor networks, multiple heterogeneous sensors participate in data provision, and the system of processing, communication, and storage becomes the bottleneck, overwhelmed with data flowing from the sensors (Baraniuk 2011).

9.1.2 Categories of Big Data Analytics

There are different classifications of big data analytics. The most popular one classifies big data analytics into three major categories based on the stage of analysis: descriptive, predictive, and prescriptive (LaValle et al. 2011; Hu et al. 2014; Watson 2014). Figure 9.1 shows the major categories of big data analytics and their relationships.

Descriptive analytics is about what has happened and tries to find hidden patterns and associations in big data. This is a stage of looking back and sifting through historical data to get a picture of the facts (Hu et al. 2014; Watson 2014; Lepenioti et al. 2019). Questions to be answered are as follows: What happened? What is happening now? Why did it happen? During this stage of big data analytics, the focus is on preparing the data for advanced analysis. Data modeling concerns collecting, storing, and partitioning data for efficient big data mining. Results may be visualized as indicators, charts, tables, and maps. Regression is used to correlate features in big data.

Predictive analytics is about what is going to happen and about predicting future outcomes. This is a stage of looking forward and forecasting according to existing patterns (Hu et al. 2014; Watson 2014). The assumption is that trends and patterns of business remain fairly consistent over time, so that historical patterns can be used to predict an outcome with developed algorithms and models. The focus of this stage is on the prediction of model probabilities and the verification of models. Questions to be answered during this stage are as follows: What will happen? Why will it happen?


Fig. 9.1 Categories of big data analytics by stages

Data mining algorithms are used to extract stable patterns from large datasets to provide insight into, and forecasts of, the future. Predictive models may be developed with statistical methods and techniques such as linear or logistic regression to extrapolate the trend along the temporal dimension and predict future outcomes.

Prescriptive analytics is about what should be done and the actions needed to obtain expected outcomes. This is a stage of creating actionable plans and modeling actions (Hu et al. 2014; Watson 2014; Lepenioti et al. 2019). Recommended actions and strategies are provided based on best practices and optimal models among understood models. Simulation models are in place to monitor outcomes of actions. The focus is on decision making and efficiency. Questions to be answered during


this stage are as follows: What should be put into action? Why should the action be taken? Optimization plays an important role in this stage by providing a plan that is modeled to arrive at the best outcome. Simulation forecasts the outcomes of alternative actions.

Predictive analytics is especially applicable to remote sensing data, since remote sensing provides the opportunity to observe phenomena continuously over time. The time series of observations provides the basis for extrapolation over the time dimension, that is, prediction. Many prediction methods can be adapted or applied to big data. Table 9.2 briefly summarizes major methods that are used in prediction for big data (Nagarajan and LD 2019). Extrapolation predicts a value on the assumption that the underlying trend continues over a longer period, and such a trend is difficult to detect with big data. Regression models can be difficult to build with big data, and parallel processing for regression is also difficult to implement. Probabilistic models have high complexity because they require calculating probabilities. Classification algorithms, such as decision trees, Naïve Bayes, and support vector machines (SVM), have been applied to the prediction of class labels. Clustering algorithms are unsupervised and group data based on attribute similarities. Their application to big data is challenging, as most algorithms require iterative access to the whole dataset. Association algorithms are less applicable to prediction with big data because they focus on mining association rules. Machine learning algorithms, such as neural networks, deep learning, fuzzy logic systems, and ensembles, have been used in big data predictions.
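Extrapolation along the time dimension can be illustrated with an ordinary least-squares line fitted to a short time series and extended beyond the observation range. The sketch below is a toy example under stated assumptions: the function names `fit_line` and `extrapolate` are hypothetical, and the yearly index values are synthetic numbers invented for illustration, not observations from any sensor.

```python
def fit_line(t, y):
    """Ordinary least-squares fit of y = a + b * t; returns (a, b)."""
    n = len(t)
    if len(y) != n or n in (0, 1):
        raise ValueError("need two or more paired observations")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    if sxx == 0:
        raise ValueError("all time stamps are identical")
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def extrapolate(t, y, t_future):
    """Extend the fitted linear trend to time stamps outside the observed range."""
    a, b = fit_line(t, y)
    return [a + b * tf for tf in t_future]


# Synthetic yearly vegetation-index values (made up for illustration).
years = [2015, 2016, 2017, 2018, 2019]
index = [0.50, 0.52, 0.54, 0.56, 0.58]
forecast = extrapolate(years, index, [2020, 2021])
```

The forecast is only as good as the assumption that the linear trend persists, which is exactly the limitation noted for extrapolation in Table 9.2 (model shift over time, not applicable to nonlinear relationships).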
Deep learning algorithms, such as deep belief networks (DBN), convolutional neural networks (CNN), and recurrent neural networks (RNN), have been widely experimented with for big data prediction. They can be applied to prediction by using historical time series as independent parameters because of their universal approximating power up to high accuracy. The main difficulties with deep learning algorithms for big data prediction are their time complexity during the training stage and their possible overfitting.

Prediction modeling with long time series of remote sensing big data has been studied in different applications. In Sayad et al. (2019), time series of normalized difference vegetation index (NDVI), land surface temperature (LST), and thermal anomaly (TA) were used to predict wildfires. All three parameters are derived from remote sensing data. In Sproles et al. (2018), time series of snow cover products from remote sensing data were applied to predict stream flow. In Chen et al. (2019), a deep learning algorithm, the region-based convolutional neural network (R-CNN), was applied to predict strawberry yield using high-resolution aerial orthoimages captured from unmanned aerial vehicles (UAV). In Rhif et al. (2020), a long short-term memory (LSTM) neural network, which is a deep learning algorithm, was used to build and forecast time series of NDVI using remote sensing big data.

The categories of big data analytics may be divided into more groups. For example, big data analytics may be broken down into four groups by including diagnostic analytics between descriptive analytics and predictive analytics. Diagnostic analytics is primarily separated out from descriptive analytics and represents the part


Table 9.2 Prediction methods for big data
Method | Principle | Pro | Con
Extrapolation | Generally linear extension beyond the observation range | Close relationship; linear/simple | Model shift over time; not able to model the past; not applicable to nonlinear relationships
Regression | Linear regression model, multivariate regression model, logistic regression model, time series regression model, survival analysis, ridge regression, loess regression | Simple; different models for different situations | Sensitive to outliers; require independent variables; multicollinearity problem
Bayesian statistics | Use probability | Handle uncertainty; probability | Prior distribution; computationally expensive
Classification | Predicted target as classification goal; Naïve Bayes, decision trees, K-nearest neighbor, support vector machine | Many models for use; robust | Different levels of interpretability; selection of dependent variables; model shift; training data
Clustering | Group data with similar patterns; K-Means, hierarchical clustering, density clustering | Unsupervised | Sensitivity to initials; parameter setting
Association mining | Relationship association | Association | Not widely applicable for prediction
Neural networks | Connectionist method | Nonlinear; no parametric assumption | Black box; selection of mostly architecture; overfitting
Deep learning | Complex of connectionist networks. Restricted Boltzmann machine, DBN, CNN, RNN, autoencoders, recursive neural tensor nodes, generative adversarial networks | Improved nonlinear model; eliminating the vanishing gradient problem; parallelism-ready | Black box; require massive data; computationally expensive during training
Fuzzy system | Fuzzy logic | Interpretability; handle uncertainty; rule-based | Poor generalization
Ensemble | Combination of more than one weak learner. Bagging, boosting, stacking | Stable | Complex; selection of weak learners

to answer the question “why did it happen” (Banerjee et al. 2013; Hardoon and Shmueli 2013; Simpao et al. 2014; Khalifa 2018; Vassakis et al. 2018; Deshpande et al. 2019). Diagnostic analytics emphasizes corrections beyond simple reporting (Deshpande et al. 2019).


Figure 9.2 shows another view of classifying big data analytics based on processes. This view is similar to the stage-based one but extends beyond the core analytic stages to include planning, data preprocessing, and data processing. Big data analytics starts with identifying sources and domains of data. Planning analytics selects techniques and methodologies and lays out blueprints. Processing carries out data preprocessing and processing. Reporting explores hidden patterns and relationships among data. Analytics provides forecasting and actionable plans.

Figure 9.3 shows yet another view of classifying and grouping big data analytics based on techniques. Suitable techniques for big data analytics include A/B testing (Siroker and Koomen 2013), association rule mining (Abdel-Basset et al. 2018), cluster analysis (Jain et al. 1999), data analytics through crowdsourcing (Satish and Yusof 2017), data fusion (Jabbar et al. 2018), data integration (Arputhamary and Arockiam 2015), ensemble learning (Galicia et al. 2019), genetic algorithms (Gill and Buyya 2019; Hajeer and Dasgupta 2019), machine learning (Ma et al. 2014; Athmaja et al. 2017), natural language processing (Subramaniyaswamy et al. 2015), neural networks (Chiroma et al. 2019), pattern recognition (Zerdoumi et al. 2018), anomaly detection (Hayes and Capretz 2015; Ariyaluran Habeeb et al. 2019), predictive modeling (Junqué de Fortuny et al. 2013), regression (Ma and Sun 2015), sentiment analysis (Ragini et al. 2018), signal processing (Giannakis et al. 2014),

Fig. 9.2 Processes of big data analytics


Fig. 9.3 Techniques for big data analytics

supervised learning (Hussain and Cambria 2018), unsupervised learning (Nasraoui and Ben N’Cir 2019), simulation (Garcia-Magarino et al. 2018), time series analysis (Rezaee et al. 2018), and visualization (Keim et al. 2013). All these techniques can be grouped into three big categories based on the core techniques applied in analytics: clustering, classification, and regression (Elgendy and Elragal 2014; Tsai et al. 2015).

The purpose of clustering is to group data into clusters of similar items. Commonly used clustering algorithms can be partition-based, hierarchical, grid-based, density-based, or model-based (Rokach and Maimon 2005).

Partitioning-based algorithms group data into clusters based on similarities. Commonly used partitioning-based clustering methods include K-Means clustering (MacQueen 1967), K-Medoids clustering (Kaufman and Rousseeuw 1990), and Fuzzy c-Means (FCM) (Bezdek 1981). K-Means groups data into clusters based on the center, or mean, of the data, which makes it highly sensitive to anomalies and outliers. K-Medoids, also known as Partitioning Around Medoids (PAM), is based on representative objects and is therefore less sensitive to outliers. Clustering Large Applications (CLARA) is a variation of PAM applied to large datasets through a sampling approach (Kaufman and Rousseeuw 1990). FCM clusters data based on degree of membership, where a data point may belong to multiple clusters, each to a certain degree (Bezdek et al. 1984).

Hierarchical clustering algorithms seek to build a hierarchy of clusters (Johnson 1967). There are two types of strategies to group data into hierarchical clusters: agglomerative (bottom-up) and divisive (top-down) (Rokach and Maimon 2005). Sequential agglomerative hierarchical nonoverlapping (SAHN) is the classical hierarchical clustering algorithm based on pair-group comparison (Day and Edelsbrunner 1984). Clustering Using Representatives (CURE) is another


hierarchical clustering algorithm that integrates with distance-based metrics (Guha et al. 2001). DIANA (DIvisive ANAlysis) is a top-down approach (Kaufman and Rousseeuw 1990). The Divisive hieRArchical maximum likelihOod clusteriNg approach (DRAGON) (Sharma et al. 2017) is a recently developed divisive approach for hierarchical clustering.

Grid-based clustering algorithms convert the data space into a grid-based cell space, compute the density of each cell, and use density to identify clusters (Schikuta 1993, 1996; Ritchie 2015; Reddy and Bindu 2017). The Statistical Information Grid Approach to Spatial Data Mining (STING) is one of the grid-based clustering algorithms (Wang et al. 1997).

Density-based clustering algorithms use density to group clusters, where clusters are separated from each other by contiguous regions of low object density (Kriegel et al. 2011). The density-based spatial clustering of applications with noise (DBSCAN) (Ester et al. 1996) defines clusters as maximal sets of density-connected points.

Model-based clustering algorithms hypothesize a model for each cluster and find the best fit of the models to the data (Gan et al. 2007). The Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation (MCLUST) is the most widely used model-based clustering algorithm (Fraley and Raftery 1999).

The purpose of classification is to group objects into predefined classes by comparing similarities. Classification algorithms have been adapted and evolved to deal with big data analytics. Common algorithms include the perceptron learning algorithm, support vector machines (SVMs) (Cavallaro et al. 2015b), decision tree and ensemble methods (Kadaru and Reddy 2018), artificial neural networks (Chiroma et al. 2019), and Naïve Bayes classifiers (Sun et al. 2018). Table 9.3 lists selected open-source implementations of SVM for big data analytics.
Many of the SVM implementations support parallel processing and leverage cloud computing capabilities. The purpose of regression is to find the correlation among data (Ma and Sun 2015). Regression finds relationships between variables. Machine learning approaches can be applied to build regression models (Pérez-Rave et al. 2019).
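The different outlier sensitivity of K-Means and K-Medoids noted above comes from how each method picks a cluster center. The sketch below shows only that center-selection step for a single cluster, not a full clustering algorithm; the function names `mean_center` and `medoid_center` and the sample points are assumptions made for illustration.

```python
def mean_center(points):
    """K-Means style center: the coordinate-wise mean of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def medoid_center(points):
    """K-Medoids style center: the member with the smallest total distance to all members."""
    def total_distance(candidate):
        return sum(
            ((candidate[0] - p[0]) ** 2 + (candidate[1] - p[1]) ** 2) ** 0.5
            for p in points
        )
    return min(points, key=total_distance)


# Four tightly grouped points plus one distant outlier (synthetic values).
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (50.0, 50.0)]

mean = mean_center(cluster)      # pulled far toward the outlier
medoid = medoid_center(cluster)  # stays on an actual member of the dense group
```

With these values the mean lands at (11.2, 11.2), outside the dense group, while the medoid remains at (2.0, 2.0). The price is that the medoid search compares every candidate with every member, which is one reason sampling variants such as CLARA are used on large datasets.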

Table 9.3 Selected open-source SVM classification platforms for big data analytics
Technology | Platform/language | Analytics
Apache Mahout | Hadoop/Java | Single processing
Apache Spark | Spark/Java | Parallel linear process
Twister/ParallelSVM | Hadoop/Twister/Java | Parallel SVMs
Scikit-learn | Python | Single processing
piSVM | MPI/C | Parallel SVMs
GPU LibSVM | CUDA | Parallel SVMs
pSVM | MPI/C | Parallel SVMs


Fig. 9.4 Use cases for big data analytics

9.1.3 Big Data Analytics Use Cases

Applications of big data analytics cover all domains, including health, Earth science, intelligence, data discovery, and business intelligence (Watson 2014; Vassakis et al. 2018; Nasraoui and Ben N’Cir 2019). Figure 9.4 shows major categories of use cases for big data analytics. Big data analytics involves data and pattern discovery, reporting, predictive forecasting, and prescriptive action planning.

9.2 Remote Sensing Big Data Analytics Concepts

9.2.1 Remote Sensing Big Data Challenges

Challenges for remote sensing big data analytics are rooted in the multiple dimensions of remote sensing big data, primarily the five Vs (Volume, Variety, Velocity, Veracity, and Value). What is new and different in remote sensing big data analytics compared to traditional remote sensing data analytics? In remote sensing, scientists and practitioners have been managing large volumes of heterogeneous datasets


since the launch of remote sensors that continuously send data back to ground systems. Researchers have adopted a series of analytics and methods in analyzing the data. What distinguishes remote sensing big data analytics from traditional approaches is the complexity of the analytics needed to extract knowledge from big data. This complexity makes it necessary to “divide and conquer” the knowledge extraction processes into different types at different abstraction levels. To deal with the sheer complexity of big data, it is necessary to develop and implement the ability to efficiently analyze data and information and extract knowledge from them. The motivation is to glean knowledge from data and information, that is, to examine large amounts of data of a variety of types to uncover hidden patterns, unknown correlations, and other useful information. From the point of view of advancing science, it is important to leverage the ever-evolving data management systems and their analytic capacities so that scientists can understand and develop ways to examine and process varieties of data relationships, expose enhanced knowledge, and facilitate science. Remote sensing big data analytics needs to deal with variety by leveraging advances in information technology to provide means of discovering unobvious scientific relationships that were previously invisible to the scientific eye.

9.2.2 Categories of Remote Sensing Big Data Analytics

Similar to big data analytics in general, remote sensing big data analytics can be grouped into different categories. The core three stages apply to remote sensing big data analytics, that is, descriptive, predictive, and prescriptive analytics (Madhukar and Pooja 2019). In addition, two more stages may be added to emphasize the overall stages of remote sensing analytics, that is, discovery and diagnostic analytics. Figure 9.5 shows all five categories of analytics for remote sensing big data. Each of these categories associates best with sets of analytic tools and techniques. This classification of remote sensing big data analytics is based on stages. The stage of discovery is added beyond the stage of prescriptive analytics to emphasize the information discovery and science exploration brought in with the analytics of remote sensing big data.

9.2.3 Processes of Remote Sensing Big Data Analytics

Earth science data analytics is defined as “the process of examining, preparing, reducing, and analyzing large amounts of spatial (multi-dimensional), temporal, or spectral data using a variety of data types to uncover patterns, correlations, and other information to better understand our Earth” (Kempler 2016). This definition encompasses two additional stages before data analysis: (1) Data Preparation:


Fig. 9.5 Categories of remote sensing big data analytics

Preparing heterogeneous data so that they can be jointly analyzed; (2) Data Reduction: Correcting, ordering, and simplifying data in support of analytic objectives; and (3) Data Analysis: Applying techniques/methods to derive results. These processes apply to remote sensing big data analytics. In addition, remote sensing big data analytics has a demand for parallel and scalable analytics to support the processing and classification of remotely sensed big data (Cavallaro et al. 2015a).
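The three processes, and the demand for parallel analytics, can be sketched as a per-tile pipeline whose partial results are combined at the end. This is a toy illustration under stated assumptions, not a production design: the tile layout (lists of red/near-infrared reflectance pairs), the fill value `FILL`, and the function names are hypothetical. Only the NDVI formula, (NIR - Red) / (NIR + Red), is standard.

```python
from concurrent.futures import ThreadPoolExecutor

FILL = -9999.0  # hypothetical fill value marking invalid pixels


def prepare(tile):
    """Data preparation: drop pixel pairs where either band holds the fill value."""
    return [(red, nir) for red, nir in tile if red != FILL and nir != FILL]


def reduce_tile(tile):
    """Data reduction: collapse one tile to (sum of NDVI values, pixel count)."""
    total, count = 0.0, 0
    for red, nir in prepare(tile):
        if nir + red == 0:
            continue  # NDVI is undefined when both bands are zero
        total += (nir - red) / (nir + red)
        count += 1
    return total, count


def analyze(tiles, workers=4):
    """Data analysis: combine per-tile partial results into a scene-wide mean NDVI."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(reduce_tile, tiles))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Because each tile is reduced to a small `(sum, count)` pair before anything is combined, the same structure carries over to process pools or cluster frameworks. A thread pool is used here only to keep the sketch short and portable; for CPU-bound work in Python it mainly demonstrates the pattern rather than delivering a speedup.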

9.2.4 Objectives of Remote Sensing Big Data Analytics

The survey of Earth science data analytics revealed fundamental requirements and objectives that also apply to remote sensing big data. Earth science is one of the most important application domains for remote sensing big data. Figure 9.6 shows requirements, tools, techniques, and systems for Earth science data analytics (Kempler 2016). Identified requirements of Earth science data analytics include (1) to calibrate data, (2) to validate data, (3) to assess data quality, (4) to perform coarse data preparation, (5) to intercompare datasets, (6) to tease out information from data, (7) to glean knowledge from data and information, (8) to forecast/predict/model phenomena, (9) to derive conclusions, and (10) to derive new analytics tools. These requirements apply to remote sensing big data analytics in general.


Fig. 9.6 Summary of Earth science data analytics. (Excerpted from the presentation of Steve Kempler (2016))

9.3 Big Data Analytics Standards

Standards have emerged to support interoperation among big data analytics. The Institute of Electrical and Electronics Engineers (IEEE) has a big data standard group to develop standards for big data analytics. The International Organization for Standardization has several standards for big data analytics and management. These standards provide the basis of interoperation for remote sensing big data analytics. They are more specialized in standardizing the processes of big data analytics than the data management standards covered in Chap. 7.

9.3.1 IEEE Big Data Analytics Standards

The IEEE big data group has identified areas to be standardized for big data analytics. They are metadata standard for big data management, mobile health platform, curation of electronic health records (EHRs), data representation in big data management, data access interface, big data metadata standards, wireless sensor network big data, smart grid big data, energy-efficient data acquisition, green data centers for big data, wireless and device big data analytics, mobile big data, big data for 5G networks, and mobile cloud computing (MCC) big data. Health is one of the most active domains among IEEE big data standardization efforts. Medical data format, management, simulation, and visualization are among the standards being actively worked on.


9.3.2 ISO Big Data Working Group: ISO/IEC JTC 1/SC 42/WG 2

The big data standard efforts are currently under ISO/IEC JTC 1/SC 42/WG 2. ISO/IEC JTC 1 is a joint technical committee of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). Subcommittee (SC) 42 is for artificial intelligence. Working Group (WG) 2 under ISO/IEC JTC 1/SC 42 coordinates the big data standard effort within ISO, which was initiated by the now-disbanded ISO/IEC JTC 1/WG 9. Current standards for big data include ISO/IEC 20546 for big data overview and vocabulary (ISO 2019), ISO/IEC TR 20547-2 for use cases and derived requirements (ISO 2018a), ISO/IEC 20547-3 for reference architecture (ISO 2020a), ISO/IEC TR 24028 for an overview of trustworthiness in artificial intelligence (ISO 2020b), and ISO/IEC TR 20547-5 for standards roadmap (ISO 2018b).

References

Abdel-Basset M, Mohamed M, Smarandache F, Chang V (2018) Neutrosophic association rule mining algorithm for big data analysis. Symmetry 10:106. https://doi.org/10.3390/sym10040106
Ariyaluran Habeeb RA, Nasaruddin F, Gani A et al (2019) Clustering-based real-time anomaly detection: a breakthrough in big data technologies. Trans Emerg Telecommun Technol:e3647. https://doi.org/10.1002/ett.3647
Arputhamary B, Arockiam L (2015) Data integration in big data environment. Bonfring Int J Data Min 5:01–05. https://doi.org/10.9756/BIJDM.8001
Athmaja S, Hanumanthappa M, Kavitha V (2017) A survey of machine learning algorithms for big data analytics. In: 2017 international conference on innovations in information, embedded and communication systems (ICIIECS). IEEE, Coimbatore, pp 1–4
Babu MSP, Sastry SH (2014) Big data and predictive analytics in ERP systems for automating decision making process. In: 2014 IEEE 5th international conference on software engineering and service science. IEEE, Beijing, pp 259–262
Banerjee A, Bandyopadhyay T, Acharya P (2013) Data analytics: hyped up aspirations or true potential? Vikalpa 38:1–12
Baraniuk RG (2011) More is less: signal processing and the data deluge. Science 331:717–719. https://doi.org/10.1126/science.1197448
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Springer US, Boston
Bezdek JC, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering algorithm. Comput Geosci 10:191–203
Cavallaro G, Riedel M, Bodenstein C et al (2015a) Scalable developments for big data analytics in remote sensing. In: 2015 IEEE international geoscience and remote sensing symposium (IGARSS). IEEE, Milan, Italy, pp 1366–1369
Cavallaro G, Riedel M, Richerzhagen M et al (2015b) On understanding big data impacts in remotely sensed image classification using support vector machine methods. IEEE J Sel Top Appl Earth Obs Remote Sens 8:4634–4646
Chiroma H, Abdullahi UA, Abdulhamid SM et al (2019) Progress on artificial neural networks for big data analytics: a survey. IEEE Access 7:70535–70551. https://doi.org/10.1109/ACCESS.2018.2880694


Chen Y, Lee WS, Gan H et al (2019) Strawberry yield prediction based on a deep neural network using high-resolution aerial orthoimages. Remote Sens 11:1584. https://doi.org/10.3390/rs11131584
Day WH, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods. J Classif 1:7–24
Deshpande PS, Sharma SC, Peddoju SK (2019) Predictive and prescriptive analytics in big-data era. In: Security and data storage aspect in cloud computing. Springer Singapore, Singapore, pp 71–81
Elgendy N, Elragal A (2014) Big data analytics: a literature review paper. In: Perner P (ed) Advances in data mining. Applications and theoretical aspects. Springer International Publishing, Cham, pp 214–227
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM international conference on knowledge discovery and data mining (KDD). AAAI Press, Portland, OR, USA, pp 226–231
Fraley C, Raftery AE (1999) MCLUST: software for model-based cluster analysis. J Classif 16:297–306
Galicia A, Talavera-Llames R, Troncoso A et al (2019) Multi-step forecasting for big data time series based on ensemble learning. Knowl-Based Syst 163:830–841. https://doi.org/10.1016/j.knosys.2018.10.009
Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. Society for Industrial and Applied Mathematics
Garcia-Magarino I, Lacuesta R, Lloret J (2018) Agent-based simulation of smart beds with internet-of-things for exploring big data analytics. IEEE Access 6:366–379. https://doi.org/10.1109/ACCESS.2017.2764467
Giannakis GB, Bach F, Cendrillon R et al (2014) Signal processing for big data. IEEE Signal Process Mag 31:15–16. https://doi.org/10.1109/MSP.2014.2330054
Gill SS, Buyya R (2019) Bio-inspired algorithms for big data analytics: a survey, taxonomy, and open challenges. In: Big data analytics for intelligent healthcare management. Elsevier, pp 1–17
Guha S, Rastogi R, Shim K (2001) Cure: an efficient clustering algorithm for large databases. Inf Syst 26:35–58. https://doi.org/10.1016/S0306-4379(01)00008-4
Hajeer M, Dasgupta D (2019) Handling big data using a data-aware HDFS and evolutionary clustering technique. IEEE Trans Big Data 5:134–147. https://doi.org/10.1109/TBDATA.2017.2782785
Hardoon DR, Shmueli G (2013) Getting started with business analytics: insightful decision-making. CRC Press, Boca Raton
Hayes MA, Capretz MA (2015) Contextual anomaly detection framework for big sensor data. J Big Data 2. https://doi.org/10.1186/s40537-014-0011-y
Hu H, Wen Y, Chua T-S, Li X (2014) Toward scalable systems for big data analytics: a technology tutorial. IEEE Access 2:652–687. https://doi.org/10.1109/ACCESS.2014.2332453
Hussain A, Cambria E (2018) Semi-supervised learning for big social data analysis. Neurocomputing 275:1662–1673. https://doi.org/10.1016/j.neucom.2017.10.010
ISO (2018a) ISO/IEC TR 20547-2:2018 information technology — big data reference architecture — part 2: use cases and derived requirements, 1st edn. International Organization for Standardization, Geneva
ISO (2018b) ISO/IEC TR 20547-5:2018 information technology — big data reference architecture — part 5: standards roadmap, 1st edn. International Organization for Standardization, Geneva
ISO (2019) ISO/IEC 20546:2019 information technology — big data — overview and vocabulary, 1st edn. International Organization for Standardization, Geneva
ISO (2020a) ISO/IEC JTC 1/SC 42 ISO/IEC 20547-3:2020 information technology — big data reference architecture — part 3: reference architecture, 1st edn. International Organization for Standardization, Geneva
ISO (2020b) ISO/IEC TR 24028:2020 information technology — artificial intelligence — overview of trustworthiness in artificial intelligence, 1st edn. International Organization for Standardization, Geneva


Jabbar S, Malik KR, Ahmad M et al (2018) A methodology of real-time data fusion for localized big data analytics. IEEE Access 6:24510–24520. https://doi.org/10.1109/ACCESS.2018.2820176
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv CSUR 31:264–323. https://doi.org/10.1145/331499.331504
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32:241–254
Junqué de Fortuny E, Martens D, Provost F (2013) Predictive modeling with big data: is bigger really better? Big Data 1:215–226. https://doi.org/10.1089/big.2013.0037
Kadaru BB, Reddy BRS (2018) A novel ensemble decision tree classifier using hybrid feature selection measures for Parkinson’s disease prediction. Int J Data Sci 3:289–307
Kaufman L, Rousseeuw PJ (eds) (1990) Finding groups in data. Wiley, Hoboken
Keim D, Qu H, Ma K-L (2013) Big-data visualization. IEEE Comput Graph Appl 33:20–21. https://doi.org/10.1109/MCG.2013.54
Kempler S (2016) Earth science data analytics cluster. In: ESIP federation meeting, Durham, NC, USA
Khalifa M (2018) Health analytics types, functions and levels: a review of literature. In: Hasman A, Gallos P, Liaskos J et al (eds) Data, informatics and technology: an inspiration for improved healthcare. IOS Press, Amsterdam, pp 137–140
Kriegel H, Kröger P, Sander J, Zimek A (2011) Density-based clustering. WIREs Data Min Knowl Discov 1:231–240. https://doi.org/10.1002/widm.30
LaValle S, Lesser E, Shockley R et al (2011) Big data, analytics and the path from insights to value. MIT Sloan Manag Rev 52:21–32
Lepenioti K, Bousdekis A, Apostolou D, Mentzas G (2019) Prescriptive analytics: a survey of approaches and methods. In: Abramowicz W, Paschke A (eds) Business information systems workshops. Springer International Publishing, Cham, pp 449–460
Ma P, Sun X (2015) Leveraging for big data regression. Wiley Interdiscip Rev Comput Stat 7:70–76. https://doi.org/10.1002/wics.1324
Ma C, Zhang HH, Wang X (2014) Machine learning for big data analytics in plants. Trends Plant Sci 19:798–808. https://doi.org/10.1016/j.tplants.2014.08.004
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Oakland, CA, USA, pp 281–297
Madhukar M, Pooja (2019) Earth science [big] data analytics. In: Dey N, Bhatt C, Ashour AS (eds) Big data for remote sensing: visualization, analysis and interpretation. Springer International Publishing, Cham, pp 99–128
Nagarajan G, L.D DB (2019) Predictive analytics on big data - an overview. Informatica 43:425–459. https://doi.org/10.31449/inf.v43i4.2577
Nasraoui O, Ben N’Cir C-E (eds) (2019) Clustering methods for big data analytics: techniques, toolboxes and applications. Springer International Publishing, Cham
Pérez-Rave JI, Correa-Morales JC, González-Echavarría F (2019) A machine learning approach to big data regression analysis of real estate prices for inferential and predictive purposes. J Prop Res 36:59–96. https://doi.org/10.1080/09599916.2019.1587489
Ragini JR, Anand PMR, Bhaskar V (2018) Big data analytics for disaster response and recovery through sentiment analysis. Int J Inf Manag 42:13–24. https://doi.org/10.1016/j.ijinfomgt.2018.05.004
Reddy KSS, Bindu CS (2017) A review on density-based clustering algorithms for big data analysis. In: 2017 international conference on I-SMAC (IoT in social, mobile, analytics and cloud) (I-SMAC). IEEE, Palladam, Tamilnadu, India, pp 123–130
Rezaee Z, Dorestani A, Aliabadi S (2018) Application of time series analyses in big data: practical, research, and education implications. J Emerg Technol Account 15:183–197. https://doi.org/10.2308/jeta-51967
Ritchie NWM (2015) Diluvian clustering: a fast, effective algorithm for clustering compositional and other data. Microsc Microanal 21:1173–1183. https://doi.org/10.1017/S1431927615014701


Rokach L, Maimon O (2005) Clustering methods. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York, pp 321–352
Rhif M, Ben Abbes A, Martinez B, Farah IR (2020) A deep learning approach for forecasting non-stationary big remote sensing time series. Arab J Geosci 13:1174. https://doi.org/10.1007/s12517-020-06140-w
Sagiroglu S, Sinanc D (2013) Big data: a review. In: 2013 international conference on collaboration technologies and systems (CTS). IEEE, San Diego, CA, USA, pp 42–47
Sayad YO, Mousannif H, Al Moatassime H (2019) Predictive modeling of wildfires: a new dataset and machine learning approach. Fire Saf J 104:130–146. https://doi.org/10.1016/j.firesaf.2019.01.006
Satish L, Yusof N (2017) A review: big data analytics for enhanced customer experiences with crowd sourcing. Proc Comput Sci 116:274–283. https://doi.org/10.1016/j.procs.2017.10.058
Schikuta E (1993) Grid-clustering: a fast hierarchical clustering method for very large data sets. Houston, Center for Research on Parallel Computing, Rice University
Schikuta E (1996) Grid-clustering: an efficient hierarchical clustering method for very large data sets. In: Proceedings of 13th international conference on pattern recognition. IEEE, pp 101–105
Shang W, Jiang ZM, Hemmati H et al (2013) Assisting developers of big data analytics applications when deploying on Hadoop clouds. In: 2013 35th international conference on software engineering (ICSE). IEEE, San Francisco, CA, USA, pp 402–411
Sharma A, López Y, Tsunoda T (2017) Divisive hierarchical maximum likelihood clustering. BMC Bioinf 18. https://doi.org/10.1186/s12859-017-1965-5
Simpao AF, Ahumada LM, Gálvez JA, Rehman MA (2014) A review of analytics and clinical informatics in health care. J Med Syst 38. https://doi.org/10.1007/s10916-014-0045-x
Siroker D, Koomen P (2013) A/B testing: the most powerful way to turn clicks into customers. Wiley, Hoboken
Sproles EA, Crumley RL, Nolin AW et al (2018) SnowCloudHydro—a new framework for forecasting streamflow in snowy, data-scarce regions. Remote Sens 10:1276. https://doi.org/10.3390/rs10081276
Subramaniyaswamy V, Vijayakumar V, Logesh R, Indragandhi V (2015) Unstructured data analysis on big data using map reduce. Proc Comput Sci 50:456–465. https://doi.org/10.1016/j.procs.2015.04.015
Sun N, Sun B, Lin JD, Wu MY-C (2018) Lossless pruned Naive Bayes for big data classifications. Big Data Res 14:27–36
Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV (2015) Big data analytics: a survey. J Big Data 2. https://doi.org/10.1186/s40537-015-0030-3
Vassakis K, Petrakis E, Kopanakis I (2018) Big data analytics: applications, prospects and challenges. In: Skourletopoulos G, Mastorakis G, Mavromoustakis CX et al (eds) Mobile big data. Springer International Publishing, Cham, pp 3–20
Wang W, Yang J, Muntz RR (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd international conference on very large data bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 186–195
Wang Y, Kung L, Byrd TA (2018) Big data analytics: understanding its capabilities and potential benefits for healthcare organizations. Technol Forecast Soc Change 126:3–13. https://doi.org/10.1016/j.techfore.2015.12.019
Watson HJ (2014) Tutorial: big data analytics: concepts, technologies, and applications. Commun Assoc Inf Syst 34:1247–1268. https://doi.org/10.17705/1CAIS.03465
Zerdoumi S, Sabri AQM, Kamsin A et al (2018) Image pattern recognition in big data: taxonomy and open challenges: survey. Multimed Tools Appl 77:10091–10121. https://doi.org/10.1007/s11042-017-5045-7
Zheng Z, Zhu J, Lyu MR (2013) Service-generated big data and big data-as-a-service: an overview. In: 2013 IEEE international congress on big data. IEEE, Santa Clara, CA, USA, pp 403–410

Chapter 10

Big Data Analytic Platforms

Abstract This chapter defines the basic concepts of big data analytic platforms, which can be classified by different criteria. A variety of infrastructures support big data analytic platforms, including peer-to-peer systems, clusters, high-performance computers, grid computing, and cloud computing. Options for storage and memory management include file-based systems, distributed file systems, SQL (Structured Query Language) databases, and NoSQL (not only SQL) databases. Data-processing strategies include batch processing, stream processing, hybrid processing, and graph processing. Major big data analytic tools cover exploratory analysis, machine learning, recommender systems, and specialized intelligence. Data visualization is important for exploring data and disseminating information and knowledge; there are visualization tools for data, information, scientific domains, and knowledge and intelligence. Several open-source big data analytic platform software packages are reviewed, including GeoMesa, GeoTrellis, and RasterFrames. Selected remote sensing big data analytic infrastructures are also reviewed, including Google Earth Engine, EarthServer, NEX (NASA Earth Exchange, where NASA is the National Aeronautics and Space Administration), and Giovanni (Goddard Earth Sciences Data and Information Services Center (GES DISC) Online Visualization and Analysis Infrastructure).

Keywords Big data analytic platform · Storage · Processing · Data visualization · SQL · NoSQL · Batch process · Stream process

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_10

10.1 Big Data Analytic Platforms

A big data analytic platform provides a unified solution for data-driven applications (Pawar and Attar 2016). The elements to be unified in such a platform include data collections, processing capabilities, development tools, and graphical interactive interfaces. Figure 10.1 shows the major categories of big data analytic platforms and their relationships to infrastructures and applications. A big data analytic platform sits on top of some computing infrastructure to establish a set of unified interfaces and support for data access, data processing, data analysis, and visualization.

Fig. 10.1 Big data analytics and its categories

Big data analytic platforms may be developed on top of a peer-to-peer (P2P) system through loose collaboration. P2P-MPI is one of the early implementations on top of grids or grid computing (Genaud and Rattanapoka 2007). In P2P systems, MPI (Message Passing Interface) is used as the de facto standard for message passing and coordination (Gropp et al. 1999). P2P systems are loosely coupled through MPI over TCP/IP, which leaves them without fault-management support (Singh and Reddy 2015). This limits the use of P2P in big data analytics. MPI is still used in less fault-prone systems to coordinate services that stream data from edges or sensors (Sievert and Casanova 2004; Hajibaba and Gorgin 2014; Garcia Lopez et al. 2015). The peer-to-peer architecture is also used in key-value NoSQL databases (Kalid et al. 2017), such as Amazon DynamoDB (DeCandia et al. 2007) and Apache Cassandra (Lakshman and Malik 2009).

A cluster computer is a type of distributed processing system consisting of a collection of interconnected stand-alone computers working together as a single integrated resource (Morrison 2003). A cluster can be built with different architectures. Beowulf clusters use a client/server model in which a single node controls all participating nodes, which have no keyboards, mice, video cards, or monitors (Morrison 2003). Each node is dedicated to parallel computing. Beowulf clusters have been used in big data analytics (Reyes-Ortiz et al. 2015). In big data analytics, commonly used cluster technologies include Apache Hadoop (Ghazi and Gangodkar 2015) and HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer) (Furht and Villanustre 2016; Herrera et al. 2019).

High-performance computers (HPC) are mostly built from massively parallel processors (MPP) to achieve high performance (Bo et al. 2012). They are generally built as a single computer. A high-performance computer can be used in big data analytics. Climate simulation and weather forecasting are example application domains of HPC (Michalakes 2020).

Grid computing is a distributed computing technology to “enable resource sharing and coordinated problem-solving in dynamic, multi-institutional virtual organizations” (Foster et al. 2001, 2008). The Open Grid Services Architecture is the standard that enables the integration of services and resources across virtual organizations that are accessible through a network (Foster et al. 2002). Hadoop clusters can use grids to build a big data platform (He et al. 2012; Garlasu et al. 2013).

Cloud computing is “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Mell and Grance 2011). It facilitates the delivery and on-demand availability of computing resources or services, including computing power, storage, network, and software. Common open-source cloud software stacks include OpenStack (Sefraoui et al. 2012), CloudStack (Kumar et al. 2014), OpenShift OKD (Caban 2019), Eucalyptus (Yadav 2013), OpenNebula (Yadav 2013), and AppScale (Krintz 2013).

10.2 Data Storage Strategy in Big Data Analytic Platforms

As shown in Fig. 10.1, various data storage strategies can be selected for big data analytics. Data can be managed in file-based systems, distributed file systems, SQL relational databases, and NoSQL (not only SQL) databases.

File-based systems Big data analytics can be run directly on big data files. Domain-specific file formats can support very large files. For example, both HDF-EOS5 and BigTIFF, common remote sensing data file formats, can hold exabytes of data in a single file since they use 64-bit offsets, which can address up to 16 exabytes (Ullman et al. 2008, p. 5; Pennefather and Suhanic 2009; Folk et al. 2011; Koranne 2011, p. 5). However, for parallel processing, a limit of 2 GB may be imposed on a single tile or file since the count used by MPI parallel routines is a 32-bit integer (Yang et al. 2004; Chilan et al. 2006).

Distributed file systems A distributed file system (DFS) is commonly used to support large datasets for big data analytics (Dutta 2017). When a file reaches terabytes or more, retrieving and modifying it can be extremely slow even if the file fits on a single node, and big data analytics on such a file would take too long to complete. A DFS addresses slow reads and writes of large files by balancing the work across multiple nodes. The throughput of a multi-node distributed file system can be much higher than what a single node can achieve. Table 10.1 lists some open-source distributed file systems. In theory, most distributed file systems can be extended to manage files and volumes of unlimited size. In practice, however, implementation choices impose actual limits on volume and file size. For example, HDFS (Hadoop Distributed File System) uses Java long integers to manage blocks, so once the block size is fixed, there is a maximum file size. The default block size used to be 64 megabytes, which would result in a maximum file size of 512 yottabytes (a yottabyte is a trillion terabytes).

Relational databases Relational databases have dominated traditional database systems. Data integrity is ensured through the ACID principles (Atomicity, Consistency, Isolation, and Durability). Relational databases have been used and supported in cloud computing. In some cases, they are also scaled out using Internet technology to manage big data. For example, Google Cloud SQL is an Internet-scale relational database based on MySQL (Krishnan and Gonzalez 2015). Microsoft Azure SQL Database is a platform as a service (PaaS) based on Microsoft SQL Server (Lee et al. 2009; Campbell et al. 2010). MySQL Cluster in Cloudera is a cluster of relational MySQL database instances to deal with big data (Tummalapalli and Rao Machavarapu 2016). Traditional relational databases are best suited to relatively small databases. Scaling out a relational database to deal with big data is a challenge, so their use in big data analytics is limited.
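The size limits quoted above for 64-bit file formats, MPI counts, and HDFS follow from integer widths, and a few lines of arithmetic make them easy to check. The sketch below is illustrative only: the constant and function names are hypothetical, binary prefixes are assumed (so "exabyte" and "yottabyte" are read as EiB and YiB), and the MPI case assumes a signed 32-bit count of single bytes.

```python
# Illustrative arithmetic for the storage limits discussed in this section.
# All names are hypothetical; binary prefixes (MiB, GiB, ...) are assumed.

MIB = 2**20
GIB = 2**30
EIB = 2**60
YIB = 2**80


def max_addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with an unsigned offset of the given width."""
    return 2**address_bits


def max_signed_count(bits: int) -> int:
    """Largest value of a signed integer of the given width."""
    return 2**(bits - 1) - 1


def hdfs_max_file_bytes(block_size_bytes: int) -> int:
    """Upper bound of 2^63 blocks per file, as quoted in Table 10.1."""
    return 2**63 * block_size_bytes


# 64-bit offsets: 16 EiB in a single file.
print(max_addressable_bytes(64) // EIB)        # 16

# Signed 32-bit byte count: one byte short of 2 GiB.
print((max_signed_count(32) + 1) // GIB)       # 2

# HDFS with the former 64 MiB default block size: 512 YiB.
print(hdfs_max_file_bytes(64 * MIB) // YIB)    # 512

# HDFS with a 128 MiB block size: 1024 YiB.
print(hdfs_max_file_bytes(128 * MIB) // YIB)   # 1024
```

Doubling the block size doubles the bound, which is why the limit is stated as a multiple of the block size rather than as a fixed number.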
NoSQL databases NoSQL (not only SQL) databases do not maintain full relations and store data with schema-less models, such as key-value data stores, column data stores, document data stores, tabular stores, object data stores, and graph databases (Loshin 2013a; Deka et al. 2017; Dutta 2017; Meier and Kaufmann 2019). Scaling NoSQL databases has been found to be easier than scaling relational databases (Borkar et al. 2012; Loshin 2013a, b). The looser constraints on schema models in NoSQL increase its flexibility and scalability in distributed computing and in handling data with heterogeneous models. Table 10.2 lists some examples of data stores for big data analytics under different categories. Some of these data stores implement multiple models. For example, both Microsoft Azure Cosmos DB and Amazon DynamoDB support multiple models, including document, key-value, and wide-column.
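To make the schema-less models more concrete, the sketch below expresses one hypothetical remote sensing scene record in four of the models named above. It uses plain Python containers rather than any real database client, and the record fields, identifiers, and paths are invented for illustration.

```python
import json

# A hypothetical scene record; all field names and values are invented.
scene = {"id": "scene-001", "sensor": "MSI", "cloud_cover": 12.5,
         "bands": ["B04", "B08"]}

# Key-value store: the value is opaque to the store and fetched by key only.
key_value_store = {scene["id"]: json.dumps(scene)}

# Document store: nested fields stay visible, so documents can be filtered
# by content, and each document may carry different fields.
document_store = [scene, {"id": "scene-002", "sensor": "OLI"}]
low_cloud = [d["id"] for d in document_store if d.get("cloud_cover", 100) < 20]

# Wide-column store: row key -> column family -> column -> value;
# rows need not share the same columns.
column_store = {
    "scene-001": {"meta": {"sensor": "MSI", "cloud_cover": 12.5},
                  "bands": {"B04": "path/to/B04", "B08": "path/to/B08"}},
    "scene-002": {"meta": {"sensor": "OLI"}},
}

# Graph store: nodes and edges, each with properties.
nodes = {"scene-001": {"type": "scene"}, "MSI": {"type": "sensor"}}
edges = [("scene-001", "acquired_by", "MSI", {"year": 2020})]
sensors_of_scene = [dst for src, rel, dst, _ in edges
                    if src == "scene-001" and rel == "acquired_by"]

print(low_cloud)          # ['scene-001']
print(sensors_of_scene)   # ['MSI']
```

The second document and the second row omit fields that the first one has, which is the flexibility that a fixed relational schema would not allow without null columns or a schema change.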

10.3 Data-Processing Strategy in Big Data Analytic Platforms

The strategies for data processing in big data analytic platforms include batch processing, stream processing, hybrid processing, and graph processing.

Table 10.1 Open-source distributed file systems

File system | Characteristics | Maximum volume and size
HDFS | Java. NameNode + DataNodes; CheckpointNode (secondary NameNode), Job Tracker, Task Tracker. Reliability: data replicated across multiple hosts (default 3: two on one rack and one on another rack). Availability: automatic failovers. Immutable files (not for concurrent write operations). | Maximum file size: 2^63 × block size (default 128 MB).
HDFS Federation (Awaysheh et al. 2020) | Adds multiple NameNodes. | Unlimited.
QFS (Read 2011; Siebers and Balaji 2013) | C++. Chunk server, metaserver, client API. Fault tolerance: Reed-Solomon error correction on 9 independent machines (3 parity + 6 stripes); 6 of the 9 can recover. Used in SAM-QFS (Storage and Archive Manager Quick File System) or OHSM (Oracle Hierarchical Storage Manager). No longer developed; replaced by ScoutFS under VSM (Versity Storage Manager). | Maximum file size: 8 EiB. Maximum number of files: 4.1 billion per filesystem.
OrangeFS (Bonnie et al. 2011; Wang et al. 2014) | C. Parallel Virtual File System (PVFS). MPI. Part of Linux kernel. |
Alluxio (Li 2018) | Virtual DFS. Alluxio provides APIs (such as Hadoop HDFS API, S3 API, FUSE API) to interact with data from various storage systems at a fast speed. |
BeeGFS (Herold and Breuner 2018) | PVFS. Fault tolerance: replication of storage volumes with automatic failover and self-healing. |
Gfarm file system (Tatebe et al. 2010) | PVFS. Designed for a grid computing system. |
GlusterFS (Selvaganesan and Liazudeen 2016) | Scale-out network-attached storage file system. Now Red Hat Gluster Storage. |
LizardFS (Korenkov et al. 2015) | Fault-tolerant. Scalable. Distributed. Two metadata servers, a set of chunkservers, and clients. A fork of MooseFS. |
Lustre (Wang et al. 2009) | Block distributed file system. MDS (Metadata Service), OSS (Object Storage Servers), client. | Maximum volume size: 300 PB, 16 EB (theoretical). Maximum file size: 3.2 PB (ext4), 16 EB (ZFS).
MooseFS (Piotr Robert Konopelko 2016) | DFS. Fault tolerance: replication. Striping (chunks 64 MB). Highly available. Highly performing: balancing. | Volume: 16 EB. File size: 128 PB. Files: 2^31.
RozoFS (Fizians 2014) | Consists of exports server, storage servers, and clients. |
Tahoe-LAFS (Selimi and Freitag 2014; Williams et al. 2019) | Tahoe Least-Authority File Store. Peer-to-peer. Fault-tolerant. |
XtreemFS (Hupfeld et al. 2008; Martini and Choo 2014) | Object-based, distributed file system. Partial replicas, objects fetched on demand. POSIX compatibility. Plugins for authentication policies, replica selection. RAID0 (striping) with parallel I/O over stripes. |
ScoutFS (Brown et al. 2018; Brown 2019) | POSIX. Scout Archive Manager (ScoutAM) to manage. Parallel file system. Replacement of SAM-QFS or HSM. |

Table 10.2 NoSQL databases

Category | Applicability in big data analytics | Example stores
Document store | Document-based data. Applicable to unstructured and semi-structured data. | MongoDB (Kang et al. 2015), CouchDB (Lomotey and Deters 2015)
Column store | Store data in records or row by row. Columns can be large. Most popularly adopted in open-source big data stores. | Cassandra (Wahid and Kashyap 2019), HBase (Wang et al. 2019; Zhang et al. 2019), Accumulo (Lv et al. 2019), Google Cloud BigTable, Microsoft Azure Cosmos DB, Amazon DynamoDB
Key-value store | The simplest form of database management systems. | Redis (Puangsaijai and Puntheeranurak 2017), Amazon DynamoDB, Microsoft Azure Cosmos DB
Graph store | Store graphs with nodes, edges, and properties. | Neo4j (Perçuku et al. 2017)


Fig. 10.2 Typical technology stack for batch processing big data analytics

Batch processing Data is processed offline in parallel. Distributed file systems can be used efficiently to store and manage data for batch processing. Hadoop (Borthakur 2007) is a popular distributed processing framework. Figure 10.2 shows a technology stack for batch processing on Hadoop. Data ingestion can be done using HDFS commands to load data, Apache Sqoop for bulk data loading (Vohra 2016), Apache Flume for log data (Hoffman 2013), and Apache Gobblin as a framework to ingest data (Qiao et al. 2015). Apache Gobblin is a unified ingestion framework that works for both batch and streaming processes. MapReduce is a parallel data-processing model suitable for batch processing (Goudarzi 2017). Apache Hive provides a SQL-like interface for data analysis (Bansal et al. 2019). Apache Pig is another model for data analysis (Bansal et al. 2019). Cascading is an abstraction layer that simplifies the creation and execution of data-processing workflows with MapReduce jobs (Erraissi et al. 2017). Other analytic tools and libraries include SparkR (an R library) (Venkataraman et al. 2016), Shark (SQL on Spark) (Xin et al. 2013), GraphX (graph-parallel computation) (Gonzalez et al. 2014), and MLlib (machine learning library) (Meng et al. 2016). The choice of data visualization depends on the application and data domain. Maps may be served through the OGC Web Map Service (WMS). A suitable client-side library may be used to display the results, such as OpenLayers for maps and D3 for general graphs and charts. Apache Giraph may be used for general graph processing (Sakr et al. 2016). SocNetV may be used to visualize network-based graphs (Kalamaras 2014).

Stream processing Stream processing handles data as it arrives, without delay, in contrast to batch processing, which handles data in bulk. Processing can be either time-based or event-based. A time-based process runs its streaming computation at given time intervals. An event-based process is triggered when an event occurs. Figure 10.3 shows a technology stack to support stream processing on Storm. Data may be ingested in real time by using Apache Flume for log data (Hoffman 2013), the WebHDFS API (Application Programming Interface) for directly updating data in the background (Weili Kou et al. 2016), or another REST (Representational State Transfer) API for direct ingestion of data into models (Rodrigues and Chiplunkar 2018). Gobblin is a general framework that unifies ingestion using different modules (Qiao et al. 2015). Apache Kafka is a stream-processing platform that unifies the stream ingestion of data (Shaheen 2017). Kestrel is a simple, distributed message queue system without the clustering, failover, client coordination, and multiple queuing features that are available in Kafka (Van-Dai Ta et al. 2016). DistributedLog is a distributed, high-throughput, low-latency log service (Guo et al. 2017). Apache Pulsar is a distributed messaging and streaming platform (Kjerrumgaard 2020). Apache Storm is a real-time processing system (Jain 2017). Trident is an abstraction layer that models real-time computation on Apache Storm (Wu et al. 2017). Apache S4 (Simple Scalable Streaming System) is a platform for processing continuous data streams in real time (Neumeyer et al. 2010). Apache Spark Streaming is an extension to the Spark API for stream processing of live data streams (Zaharia et al. 2016). Shark (SQL on Spark) is a programming model for data sources (Armbrust et al. 2015). Apache Samza is another stream-processing framework that uses Apache YARN (Yet Another Resource Negotiator) for resource allocation and scheduling (Noghabi et al. 2017; Vavilapalli et al. 2013). Apache Druid provides real-time online analytical processing (OLAP) (Yang et al. 2014).

Fig. 10.3 Typical technology stack for stream processing big data analytics

Hybrid processing A hybrid platform supports batch processing for massive data, incrementally or iteratively, and stream processing for streaming data. It is designed to handle big data with low latency. Two architectures are used for hybrid processing. One is the Lambda architecture (Feick et al. 2018), which has three layers: the batch layer, the speed layer, and the serving layer. Figure 10.4 shows an example of a hybrid-processing architecture using the Lambda architecture model. The batch layer consists of a batch-processing big data analytic platform that may combine existing data in the big data store (such as HBase) to process the data incrementally. The speed layer uses a stream-processing framework to process data in streams (e.g., with a Redis stream store). In the serving layer, results from both the batch layer and the speed layer are combined so that they are ready for query and use. The other is the Kappa architecture (Feick et al. 2018), which has only two layers: the stream layer and the serving layer. Figure 10.5 shows an example implementation using the Kappa architecture to form a hybrid process. All data are handled by a single stream-processing engine, which supports both batch processing and stream processing. Apache Flink is designed with the Kappa architecture. Results may be stored in a single big data store, such as Cassandra.
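The three Lambda layers can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the API of Hadoop, Storm, HBase, or Redis: the event format, the names `batch_view`, `SpeedLayer`, and `serve`, and the in-memory counters standing in for the batch and stream stores are all hypothetical. The batch layer recomputes counts over the stored history with a map step and a reduce step, the speed layer updates counts as events arrive, and the serving layer merges both views at query time.

```python
from collections import Counter
from functools import reduce


def batch_view(history):
    """Batch layer: recompute per-sensor counts over all stored events."""
    mapped = (Counter({event["sensor"]: 1}) for event in history)  # map
    return reduce(lambda a, b: a + b, mapped, Counter())           # reduce


class SpeedLayer:
    """Speed layer: incremental counts for events not yet in a batch view."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["sensor"]] += 1

    def reset(self):
        """Called once a new batch view has absorbed the recent events."""
        self.view.clear()


def serve(batch, speed, key):
    """Serving layer: merge the batch and real-time views for a query."""
    return batch[key] + speed.view[key]


# Hypothetical events: three already stored, two arriving in real time.
history = [{"sensor": "MODIS"}, {"sensor": "MODIS"}, {"sensor": "VIIRS"}]
batch = batch_view(history)

speed = SpeedLayer()
for event in [{"sensor": "MODIS"}, {"sensor": "VIIRS"}]:
    speed.on_event(event)

print(serve(batch, speed, "MODIS"))  # 3
print(serve(batch, speed, "VIIRS"))  # 2
```

In a Kappa-style design the separate `batch_view` function would disappear: the stored history would be replayed through the same `on_event` path that handles live events.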

Fig. 10.4 Example lambda architecture for hybrid big data processing


Fig. 10.5 Example kappa architecture for hybrid big data processing

Graph processing Graph data are linked data that are increasingly available from different sources, such as social networks and web sources. A big data graph may contain trillions of edges and vertices. Traditional graph-processing frameworks or platforms may not be able to handle such a large volume of data. One of the popular frameworks for processing big graphs is Pregel, which uses messages to coordinate processing in a distributed manner (Malewicz et al. 2010). Programs are modeled as a sequence of iterations. In each iteration, a vertex receives the messages sent in the previous iteration, sends messages to other vertices, and modifies its state. Graph-processing frameworks can be classified into directed acyclic graph (DAG), matrix, vertex-centric, graph-centric, and visitor models (Bhatia and Kumar 2018). Pregel is vertex-centric. Stratosphere is an example of the DAG model (Warneke and Kao 2009). Apache Hama (Hadoop Matrix Algebra) provides a matrix API based on BSP (bulk synchronous parallel) (Siddique et al. 2016). Apache Giraph works with the Hadoop stack using the vertex-centric model (Sakr et al. 2016).
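The vertex-centric iteration described above can be illustrated with a single-process sketch. The code below is not Pregel or Giraph code; it only imitates the iteration loop (called a superstep in Pregel), assuming a small undirected toy graph held in a dictionary. It propagates the maximum vertex value through each connected component, a common introductory example for this model. The function name `propagate_max` and the sample graph are hypothetical.

```python
def propagate_max(graph, values):
    """Run vertex-centric supersteps until no messages remain.

    graph  -- dict mapping vertex -> list of neighbor vertices
    values -- dict mapping vertex -> initial numeric value
    Returns the final vertex states and the number of supersteps run
    after the initial send.
    """
    state = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(state[v])

    supersteps = 0
    while any(inbox.values()):          # stop when no messages are in flight
        supersteps += 1
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue                # vertex stays inactive this superstep
            best = max(messages)
            if best > state[v]:         # state changed: update and notify
                state[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        inbox = outbox                  # messages are delivered next superstep
    return state, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
values = {"a": 3, "b": 6, "c": 1, "d": 9}
final, steps = propagate_max(graph, values)
print(final)  # {'a': 6, 'b': 6, 'c': 6, 'd': 9}
```

Each vertex only reads its own inbox and writes to its neighbors' outboxes, which is what lets a distributed framework partition the vertices across machines and exchange messages between supersteps.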

10.4 Tools in Big Data Analytic Platforms

Based on tools and application scenarios, big data analytic platforms can be classified into exploratory analysis, machine learning, recommender systems, and specialized intelligence.

Exploratory analysis: Tools for exploratory analysis explore big data to find trends, outliers, patterns, correlations, and statistical summaries. These tools are adapted to work with big data analytic platforms through big data modeling frameworks, such as MapReduce, Spark, and Flink. The functional requirements include time series analysis, part-to-whole and ranking analysis, deviation analysis, distribution analysis, correlation analysis, association analysis, and multivariate analysis. Traditional tools exist for such analyses, but they need to be adapted to work with big data analytic platforms. Statistical and analytic software packages with related capabilities include SAS (Statistical Analysis System), SPSS, Tableau, Spotfire, and MATLAB. To work with big data, tools may be either extended to support selected big data platforms or developed as connectors or extensions in big data analytic platforms. For example, Tableau provides native connectors to link Tableau to Hadoop (Russom 2013) and SAS/ACCESS connects SAS to Hadoop (Brown

2015). An integrated solution may be developed with big data analytic platforms. For example, SparkR is an extension module for Spark that integrates R functions (Venkataraman et al. 2016). Apache Drill exposes SQL-like analysis for big data analytics and interacts with a range of NoSQL databases and storage systems, such as HBase, MongoDB, MapR-DB, Amazon S3, Azure Blob Storage, Google Cloud Storage, and HDFS (Hausenblas and Nadeau 2013).

Machine learning: Machine learning and data mining capabilities are required for analyzing big data. There are implementations and extensions of big data analytic platforms with machine learning tools. Apache Mahout implements scalable machine learning algorithms (clustering, classification, and filtering) on top of Apache Hadoop and Spark (Anil et al. 2020). Apache Spark MLlib is the machine learning library that works on top of Spark (Meng et al. 2016). Commercial big data analytic platforms have machine learning as part of the framework, such as ML for Amazon cloud, Google AppEngine, BigML, and DMCF (Belcastro et al. 2017).

Recommender systems: Big data enables recommender systems. Functions of recommender systems are included in many big data analytic platforms. Collaborative filtering components in Apache Spark are often used in constructing recommender systems (Yang et al. 2018). Oryx 2 is a hybrid big data-processing engine that uses Apache Spark and Apache Kafka for real-time, large-scale machine learning. Recommender functions are enabled through its collaborative filtering modules. Apache PredictionIO is a recommender system built on Apache Spark, HBase, and Spray (Korotaev and Lyadova 2018).

Specialized intelligence: There are domain-specific big data analytic tools or platforms for mining intelligence from massive data. For example, health and business are two very active domains. Modules and tools have emerged to provide intelligence from big data.
For example, Pentaho Business Analytics is a big data analytic platform specialized in business intelligence (Târnăveanu 2012). GeoMesa is a geospatial big data analytic platform built on top of Apache Spark and Apache Kafka for managing and analyzing geospatial data in HBase, Accumulo, or Cassandra (Hughes et al. 2015). It supports standard OGC geospatial services by integrating GeoServer as a component.
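Returning to the exploratory analysis tools discussed earlier in this section, the following Python sketch illustrates what adapting a traditional statistic to a MapReduce-style framework involves. It is a single-process illustration under stated assumptions: the function names and the sample partitions are hypothetical, and an in-memory list of lists stands in for data distributed across nodes.

```python
from functools import reduce

def map_partition(values):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def combine(left, right):
    """Reduce step: partial results merge by element-wise addition."""
    return tuple(a + b for a, b in zip(left, right))

def summarize(partitions):
    """Derive the count, mean, and population variance from the merged partials."""
    count, total, total_sq = reduce(combine, map(map_partition, partitions))
    mean = total / count
    variance = total_sq / count - mean * mean
    return count, mean, variance

# Hypothetical data split across three partitions.
partitions = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
count, mean, variance = summarize(partitions)
```

The key design point is that the partial result of each partition is small and the merge is associative, so partitions can be processed on different nodes and combined in any order. The sum-of-squares formula is used here for brevity; it can lose precision on large or poorly scaled data, where a numerically stable merge would be preferable.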

10.5 Data Visualization in Big Data Analytic Platforms

Data visualization is an important component in big data analytics. Platforms can be grouped into data visualization (exploratory), information visualization (interaction), scientific visualization (explanation), and intelligence visualization (semantics, knowledge, and recommendations) (Caldarola and Rinaldi 2017).

Data visualization: Exploring big data requires examining the data themselves. For effective visualization, preprocessing may be required before exploratory visualization tools are used. Dimension reduction (e.g., principal component analysis)

and aggregation (e.g., summary/selection) may be needed to reduce the dimensions and volume of the data so that they are manageable for data visualization tools. Gephi and Graph-tool (a Python library) are popular tools for exploring network data (Caldarola and Rinaldi 2017). Geographic information systems (e.g., QGIS and GRASS) are tools for viewing data in maps (both features and imagery). GeoMesa includes GeoTools and GeoServer, which support rendering data through the standard Web Map Service (WMS). Previews and simple views of maps are supported by the GeoServer included in GeoMesa.

Information visualization: Information consists of features extracted from data, which may be rendered and displayed as maps, charts, or tables. Working together with back-end on-demand analysis and processing capabilities, information may be summarized on the fly and presented to users for decision making. Mapping libraries include OpenLayers (2D), Leaflet (2D), MapTalks (2D/3D), and CesiumJS (2D/3D). Chart libraries and grid controls can be linked to features to display derived or aggregated information from big data. These libraries include Infogram, JS InfoVis Toolkit, RGraph, and DataTables. A story map combines descriptive documents and visuals (e.g., maps and charts) along a timeline (Gandhi and Pruthi 2020). Tableau is an interactive data visualization software package (Ko and Chang 2017).

Scientific visualization: Scientific visualization is domain-related and typically shows high-dimensional data. Apache Giraph is a big graph processing and visualization framework using the MapReduce model to process and visualize big data (Martella et al. 2015). NASA Worldview is a scientific visualization application developed with OpenLayers to display Earth observations (Cechini et al. 2013). NASA WorldWind is a 3D visualization service on a virtual globe (Bell et al. 2007). NASA Giovanni is a Web-based visualization environment for displaying and analyzing geophysical phenomena (Berrick et al. 2008).
Intelligence visualization: Knowledge graphs from big data can be very large. Effective visualization helps decision makers use the knowledge extracted from big data. Many of the graph libraries may be extended to work with big data to drill in and out of knowledge graphs. SocNetV (Social Network Visualizer) is a specialized graph visualization tool for visualizing social networks in different graph formats (Clemente et al. 2020). Sentinel Visualizer is another tool that supports social network visualization and geospatial mapping (Group 2020). Representation and visualization of a large knowledge graph have been demonstrated using Spark with the GraphX tool (Gómez-Romero et al. 2018). For linked data, there are visualization tools for knowledge exploration, graph-based visualization, domain-specific visualization, and ontology visualization (Ding et al. 2020). Ontology summarization is necessary when these tools are applied to knowledge graphs in order to achieve visual scalability in big data analytics.
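The aggregation step mentioned under data visualization, which reduces data volume before rendering, can be sketched as a simple grid-binning routine. This is an illustrative Python sketch only: the function name, the (lon, lat, value) tuple layout, and the sample points are assumptions, and it is not taken from any of the tools listed in this section.

```python
from collections import defaultdict
from statistics import mean

def bin_points(points, cell_deg):
    """Aggregate (lon, lat, value) points into square cells of cell_deg degrees.

    Returns a dict mapping (column, row) cell indices to (count, mean value),
    so a map client draws one symbol per cell instead of one per point.
    """
    cells = defaultdict(list)
    for lon, lat, value in points:
        key = (int(lon // cell_deg), int(lat // cell_deg))
        cells[key].append(value)
    return {key: (len(values), mean(values)) for key, values in cells.items()}

# Hypothetical observations; the first two fall in the same 1-degree cell.
points = [
    (10.1, 45.2, 1.0),
    (10.4, 45.3, 3.0),
    (11.2, 45.1, 5.0),
    (-0.5, 0.5, 2.0),
]
summary = bin_points(points, 1.0)
```

Choosing a larger cell size trades spatial detail for a smaller payload, which is the same trade-off a visualization client makes when it switches aggregation levels while zooming.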

10.6 Remote Sensing Big Data Analytic Platforms

Some big data analytic platforms are designed specifically to handle geospatial and remotely sensed data. Here, we review a set of open-source big data analytic platforms: GeoMesa for geospatial indexing and queries (Hughes et al. 2015), GeoTrellis for raster processing and map algebra (Azavea 2020), and RasterFrames as an example of an integrated system for Earth observation data processing and machine learning (Locationtech 2020).

10.6.1 GeoMesa

GeoMesa is an open-source¹ geospatial querying and big data analytic toolkit based on distributed computing, especially the Accumulo wide-column database (Hughes et al. 2015). Figure 10.6 shows the overall architecture and technology stack of GeoMesa. The data stores can be managed using distributed, wide-column, NoSQL (not only SQL) databases, including Apache Accumulo, Google BigTable, Apache HBase, and Apache Cassandra. Data can be integrated using Apache Kafka. GeoMesa supports two processing models: batch processing and streaming. Batch processing of big data is enabled with MapReduce or Apache Spark. Spark is a high-performance cluster-computing framework. Cluster computation in Spark is built on the resilient distributed dataset (RDD), which supports parallelism across a distributed computing environment. The streaming process deals with

Fig. 10.6 Technology stack and major components of GeoMesa

¹ https://github.com/locationtech/geomesa

real-time or near-real-time processing of live data streams. Apache Storm is used to support the streaming process in GeoMesa. GeoServer can be integrated into GeoMesa to support standard OGC APIs, protocols, and OGC-compliant interfaces. GeoMesa can be used in Jupyter through its APIs.
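Because the OGC interfaces are standardized, a client can request a rendered map from a GeoServer instance without any GeoMesa-specific code. The Python sketch below builds a WMS 1.3.0 GetMap request. The host, the workspace and layer name (`demo:observations`), and the bounding box are hypothetical placeholders; only the request parameters themselves come from the WMS standard.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def wms_getmap_url(base_url, layer, bbox, width, height):
    """Build a WMS 1.3.0 GetMap URL.

    bbox is (min_lon, min_lat, max_lon, max_lat); CRS:84 is used so that
    the axis order is longitude first, then latitude.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",          # empty value requests the default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url(
    "http://localhost:8080/geoserver/wms",
    "demo:observations",
    (-10.0, 35.0, 5.0, 45.0),
    600,
    400,
)
query = parse_qs(urlparse(url).query, keep_blank_values=True)
```

The same URL can be pasted into a browser or handed to a mapping library as a tile or image source, which is what makes the GeoServer integration useful for quick previews.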

10.6.2 GeoTrellis

GeoTrellis is an open-source² high-performance processing engine based on Apache Spark (Azavea 2020). Figure 10.7 shows the overall architecture and its components. It has connectors and modules for interacting with data files, the Apache Accumulo data store, Apache Cassandra, Apache HBase, or Amazon S3 storage. It also has a module to interact with the geospatially indexed data stores under GeoMesa. Apache Spark is the base of the engine and supports clustered, parallel computation. The core functions of GeoTrellis support coordinate systems and projections, raster and vector data processing, and analytics. Clients may interact with GeoTrellis through its REST (Representational State Transfer) API (Application Programming Interface).
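Map algebra, the raster capability attributed to GeoTrellis above, applies an operation cell by cell across one or more aligned rasters. The Python sketch below shows a local (per-cell) operation on small nested lists. It is not GeoTrellis code (GeoTrellis is a Scala library): the function names and the sample band values are assumptions, and NDVI is used only as a familiar example of a two-band local operation.

```python
def local_op(fn, *rasters):
    """Apply fn cell by cell to rasters that share the same shape."""
    rows, cols = len(rasters[0]), len(rasters[0][0])
    return [
        [fn(*(raster[i][j] for raster in rasters)) for j in range(cols)]
        for i in range(rows)
    ]

def ndvi(nir, red):
    """Normalized difference vegetation index; 0.0 where both bands are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0

# Hypothetical 2 x 2 reflectance tiles for the near-infrared and red bands.
nir_tile = [[0.5, 0.6], [0.4, 0.0]]
red_tile = [[0.1, 0.2], [0.4, 0.0]]
ndvi_tile = local_op(ndvi, nir_tile, red_tile)
```

Because a local operation needs only the co-located cells, a large raster can be cut into tiles and each tile processed independently, which is the property a Spark-based engine exploits for parallelism.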

10.6.3 RasterFrames

RasterFrames is an open-source³ Earth observation analytic framework that leverages GeoTrellis for raster data analytics, JTS (Java Topology Suite) for geometry, GeoMesa for geospatial data queries, and SFCurve (a space-filling curve library) for geospatial indexing (Locationtech 2020). Figure 10.8 shows the overall architecture

Fig. 10.7 Technology stack and major components of GeoTrellis

² https://github.com/locationtech/geotrellis
³ https://github.com/locationtech/rasterframes

Fig. 10.8 Technology stack and major components of RasterFrames

and main components. Apache Spark is the core framework with which RasterFrames achieves high-performance computation on clusters. It has additional functions and modules that provide different APIs for use in Jupyter Notebook. Interaction with the big data analytic framework can be done through Python, SQL (Structured Query Language), or Scala.
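The geospatial indexing role of a space-filling curve, mentioned above for SFCurve, can be illustrated with a Z-order (Morton) curve. The following Python sketch is a generic illustration and does not reproduce SFCurve's API: the function names, the 16-bit resolution, and the way coordinates are scaled are all assumptions.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def z_index(lon, lat, bits=16):
    """Map a longitude/latitude pair to a one-dimensional Z-order key."""
    max_cell = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * max_cell)   # scale longitude to 0..max_cell
    y = int((lat + 90.0) / 180.0 * max_cell)    # scale latitude to 0..max_cell
    return interleave_bits(x, y, bits)

# Two nearby points and one distant point.
key_a = z_index(10.00, 45.00)
key_b = z_index(10.01, 45.01)
key_c = z_index(-120.0, -30.0)
```

Points that are close in space tend to receive keys that are close in value, so a sorted key-value store can answer many spatial range queries with a small number of key-range scans. The locality is not perfect: points on either side of a major cell boundary can have very different keys, which is why practical indexes decompose a query window into several key ranges.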

10.7 Remote Sensing Big Data Analytic Services

Big data analytic platforms can be implemented and deployed as a service or platform as a service (PaaS). For remote sensing big data, the following are examples of deployed big data analytic services:

• Google Earth Engine (GEE) (Gorelick et al. 2017),
• EarthServer (Baumann et al. 2016),
• NASA Earth Exchange (Nemani et al. 2020),
• NASA Giovanni (Goddard Interactive Online Visualization ANd aNalysis Infrastructure) (Berrick et al. 2009).

10.7.1 Google Earth Engine

Google Earth Engine (GEE) is a dedicated remote sensing big data analytic platform service that hosts a large volume of Earth observations. Figure 10.9 shows the simplified architecture of GEE. GEE hosts analysis-ready, remotely sensed data from Landsat, Sentinel, MODIS, and ASTER (Gorelick et al. 2017; Amani et al. 2020). The data catalog also holds a large repository of geospatial data that are useful in analyzing remotely sensed data, including land cover, weather and climate data, social variables, and environmental data. The data are stored and managed in big data stores that are readily usable for big data analytics. GEE has a suite of modules and functions to support big data analytics with machine learning, image, geometry, feature, feature collection, reducer, and image

Fig. 10.9 Technology stack and major components of Google Earth Engine

collection, along with sensor-specialized algorithms (e.g., for Landsat and Sentinel-1). The engine supports both on-the-fly analytics and batch processing. By parallelizing work across many compute nodes, the engine can complete certain computations in a short time. Interactions with GEE are available through its Code Editor, APIs (Python and REST), or specialized apps. The Code Editor is a web-based interactive development environment for JavaScript. It consists of a code editor, map viewer, layer manager, geometry tools, and functional tabs for scripts, docs, assets, the inspector, the console, and tasks.
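The reducer concept listed among GEE's modules can be illustrated without the service itself. The Python sketch below collapses a stack of co-registered images into a single composite by applying a reducer per pixel. It is a conceptual illustration and does not use the Earth Engine API: the function name, the nested-list image layout, and the use of `None` for masked (for example, cloudy) pixels are assumptions.

```python
from statistics import median

def reduce_collection(images, reducer):
    """Collapse a list of same-shaped images into one by reducing per pixel.

    Masked pixels (None) are skipped; a pixel masked in every image stays None.
    """
    rows, cols = len(images[0]), len(images[0][0])
    composite = []
    for i in range(rows):
        row = []
        for j in range(cols):
            samples = [img[i][j] for img in images if img[i][j] is not None]
            row.append(reducer(samples) if samples else None)
        composite.append(row)
    return composite

# Three hypothetical 2 x 2 acquisitions of the same area.
collection = [
    [[0.25, None], [0.125, 1.0]],
    [[0.5, 0.25], [None, 0.0]],
    [[0.75, 0.75], [None, 0.5]],
]
median_composite = reduce_collection(collection, median)
max_composite = reduce_collection(collection, max)
```

Each output pixel depends only on the same pixel location across the collection, so the work divides naturally by tile, which is the kind of parallelism the paragraph above describes.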

10.7.2 EarthServer—an Open Data Cube

EarthServer⁴ uses a datacube as the base for analysis-ready data. This is an implementation of the Open Data Cube (ODC).⁵ Figure 10.10 shows the overall architecture and major components. An array DBMS, rasdaman, is the core data

Fig. 10.10 Technology stack and major components of EarthServer

Fig. 10.11 Technology stack and major components of NEX

⁴ https://www.earthserver.eu/
⁵ https://www.opendatacube.org/

management system for the storage and retrieval of massive multidimensional arrays. Spatial indexing is enabled in the array database to improve performance. The EarthServer data service component consists of a data ingestion system, the rasdaman spatial data management system, and geospatial Web services (i.e., WCS and WCPS) (Baumann et al. 2016). It can be accessed through Jupyter notebooks, open-standard geospatial Web services, and datacube APIs. The WCPS (Web Coverage Processing Service) allows array-based data filtering and processing.
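A WCPS request is a short query expression sent to the server, which filters and processes the array before returning a result. The Python sketch below composes such a request as a URL. Several details are assumptions that depend on the actual deployment: the endpoint address, the coverage name (`DemoCoverage`), and the axis names (`Lat`, `Long`) are hypothetical, and the key-value form of the `ProcessCoverages` request follows the convention commonly used by rasdaman-based services.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def wcps_request_url(endpoint, wcps_query):
    """Wrap a WCPS query in a WCS ProcessCoverages key-value request."""
    params = {
        "service": "WCS",
        "version": "2.0.1",
        "request": "ProcessCoverages",
        "query": wcps_query,
    }
    return endpoint + "?" + urlencode(params)

# Trim a hypothetical coverage to a latitude/longitude window and
# ask the server to encode the result as a PNG image.
wcps_query = (
    'for $c in (DemoCoverage) '
    'return encode($c[Lat(35:45), Long(-10:5)], "image/png")'
)
request_url = wcps_request_url("https://example.org/rasdaman/ows", wcps_query)
decoded = parse_qs(urlparse(request_url).query)
```

The point of the design is that the subsetting and encoding happen on the server next to the array store, so only the requested window travels over the network.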

10.7.3 NASA Earth Exchange

The NASA Earth Exchange (NEX)⁶ is a service system that consists of supercomputing services, Earth science models, and NASA remotely sensed data (Nemani et al. 2011, 2020). Figure 10.11 shows the overall architecture and major infrastructure of NEX. The supercomputing capability includes the Pleiades Supercomputer,

⁶ https://www.nasa.gov/nex

which is highly ranked among the world's supercomputers. Data collections include satellite observations, weather and climate observations, and model outputs. NEX has software libraries and modules for the discovery, processing, and analytics of Earth data. Interaction with the big data analytic platform can be done through APIs and service operations provided as OPeNDAP, ArcServer, and OGC-compliant data access and rendering services (WCS, WMS, WPS, and WCPS).

10.7.4 NASA Giovanni

NASA Giovanni (Goddard Interactive Online Visualization ANd aNalysis Infrastructure) is an open-source⁷ Web environment for displaying and analyzing geophysical parameters. It discovers, accesses, processes, and renders data and results using metadata from an Earth science data repository, NASA's Common Metadata Repository (CMR).⁸ Figure 10.12 shows a simplified architecture and infrastructure stack for Giovanni (Berrick et al. 2009). Data are discovered by searching the metadata in CMR. Data may be cached and managed in local data stores. Data can be accessed following standards and commonly supported models, including Earth science data from the Open-source Project for a Network Data Access Protocol (OPeNDAP) (Cornillon et al. 2003), the GrADS (Grid Analysis and Display System) Data Server (GDS) (Berman et al. 2001), the OGC Web Coverage Service (WCS) (Evans 2006), and known formats directly through File Transfer Protocol (FTP). The data analysis and processing capabilities are mostly developed in the algorithm module. The module has algorithms for statistical summaries, time series analysis, charts, and animation data preparation. It works with the data access module

Fig. 10.12 Technology stack and major components of Giovanni

⁷ https://github.com/nasa/Giovanni
⁸ https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr

and the internal workflow engine to access, process, and serve Earth science data. Giovanni also includes a Web client based on OpenLayers to support user interaction in a Web browser.
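The kind of statistical summary and time series analysis described for the algorithm module can be sketched as an area-averaged time series over a gridded variable. The Python sketch below is an illustration under stated assumptions rather than Giovanni's implementation: the function names and sample grids are hypothetical, and weighting each row by the cosine of its latitude is a common convention for regular latitude-longitude grids that is assumed here.

```python
import math

def area_average(grid, lats):
    """Latitude-weighted mean of one time step.

    grid[i][j] is the value in row i (at latitude lats[i], in degrees);
    None marks missing data and is excluded from the average.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for row, lat in zip(grid, lats):
        weight = math.cos(math.radians(lat))  # cells shrink toward the poles
        for value in row:
            if value is not None:
                weighted_sum += weight * value
                weight_total += weight
    return weighted_sum / weight_total if weight_total else None

def area_averaged_series(grids, lats):
    """One area average per time step, ready to be plotted as a time series."""
    return [area_average(grid, lats) for grid in grids]

# Two hypothetical time steps on a 2 x 2 grid at latitudes 0 and 60 degrees.
lats = [0.0, 60.0]
grids = [
    [[2.0, 2.0], [4.0, None]],
    [[1.0, 3.0], [2.0, 2.0]],
]
series = area_averaged_series(grids, lats)
```

Reducing each time step to a single number on the server means that only a short series, rather than the full grids, has to be sent to the Web client for charting.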

10.7.5 Others

Other platforms include Sentinel Hub (SH),⁹ the System for Earth Observation Data Access, Processing and Analysis for Land Monitoring (SEPAL),¹⁰ openEO,¹¹ and the JRC (Joint Research Centre) Big Data Platform (JEODPP).¹² More remote sensing big data analytic platforms can be found in review articles (Pawar and Attar 2016; Gomes et al. 2020).

⁹ https://www.sentinel-hub.com/
¹⁰ https://github.com/openforis/sepal/
¹¹ https://openeo.org/
¹² https://jeodpp.jrc.ec.europa.eu/

References

Amani M, Ghorbanian A, Ahmadi SA et al (2020) Google Earth Engine cloud computing platform for remote sensing big data applications: a comprehensive review. IEEE J Sel Top Appl Earth Obs Remote Sens 13:5326–5350. https://doi.org/10.1109/JSTARS.2020.3021052
Anil R, Capan G, Drost-Fromm I et al (2020) Apache Mahout: machine learning on distributed Dataflow systems. J Mach Learn Res 21:1–6
Armbrust M, Xin RS, Lian C et al (2015) Spark sql: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1383–1394
Awaysheh FM, Alazab M, Gupta M et al (2020) Next-generation big data federation access control: a reference model. Future Gener Comput Syst 108:726–741. https://doi.org/10.1016/j.future.2020.02.052
Azavea (2020) GeoTrellis. Version 2.0. Azavea. https://geotrellis.io/
Bansal K, Chawla P, Kurle P (2019) Analyzing performance of Apache Pig and Apache Hive with Hadoop. In: Engineering vibration, communication and information processing. Springer, pp 41–51
Baumann P, Mazzetti P, Ungar J et al (2016) Big data analytics for earth sciences: the EarthServer approach. Int J Digit Earth 9:3–29
Belcastro L, Marozzo F, Talia D, Trunfio P (2017) Big data analysis on clouds. In: Zomaya AY, Sakr S (eds) Handbook of big data technologies. Springer International Publishing, Cham, pp 101–142
Bell DG, Kuehnel F, Maxwell C et al (2007) NASA World Wind: Opensource GIS for mission operations. In: 2007 IEEE aerospace conference. IEEE, pp 1–9
Berman F, Chien A, Cooper K et al (2001) The GrADS Project: software support for high-level grid application development. Int J High Perform Comput Appl 15:327–344. https://doi.org/10.1177/109434200101500401

Berrick SW, Leptoukh G, Farley JD et al (2009) Giovanni: a web service workflow-based data visualization and analysis system. IEEE Trans Geosci Remote Sens 47:106–113. https://doi.org/10.1109/TGRS.2008.2003183
Berrick SW, Leptoukh G, Farley JD, Rui H (2008) Giovanni: a web service workflow-based data visualization and analysis system. IEEE Trans Geosci Remote Sens 47:106–113
Bhatia S, Kumar R (2018) Review of graph processing frameworks. In: 2018 IEEE international conference on data mining workshops (ICDMW). IEEE, Singapore, Singapore, pp 998–1005
Bo L, Zhenliu Z, Xiangfeng W (2012) A survey of HPC development. In: 2012 international conference on computer science and electronics engineering. IEEE, pp 103–106
Bonnie MMD, Ligon B, Marshall M et al (2011) OrangeFS: advancing PVFS. In: FAST’11 poster session. USENIX, San Jose
Borkar VR, Carey MJ, Li C (2012) Big data platforms: What’s next? XRDS Crossroads ACM Mag Stud 19:44–49. https://doi.org/10.1145/2331042.2331057
Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Proj Website 11:21
Brown L (2015) The SAS® Scalable Performance Data Engine: moving your data to Hadoop without giving up the SAS features you depend on. SAS Institute Inc.
Brown Z (2019) scoutfs: large scale POSIX archiving. USENIX, Boston
Brown Z, Coverston H, McClelland B (2018) The ScoutFS archiving file system. Versity
Caban W (2019) The OpenShift architecture. In: Architecting and operating OpenShift clusters. Apress, Berkeley, CA, pp 1–29
Caldarola EG, Rinaldi AM (2017) Big data visualization tools: a survey - the new paradigms, methodologies and tools for large data sets visualization. In: Proceedings of the 6th international conference on data science, technology and applications. SCITEPRESS - Science and Technology Publications, Madrid, Spain, pp 296–305
Campbell DG, Kakivaya G, Ellis N (2010) Extreme scale with full SQL language support in microsoft SQL Azure. In: Proceedings of the 2010 international conference on management of data - SIGMOD ’10. ACM Press, Indianapolis, Indiana, USA, p 1021
Cechini M, Murphy K, Boller R et al (2013) Expanding access and usage of NASA near real-time imagery and data. AGUFM 2013:IN14A–04
Chilan CM, Yang M, Cheng A, Arber L (2006) Parallel i/o performance study with hdf5, a scientific data package. TeraGrid 2006 Adv Sci Discov
Clemente F, Matos C, Zanikolas S et al (2020) SocNetV. https://socnetv.org/
Cornillon P, Gallagher J, Sgouros T (2003) OPeNDAP: accessing data in a distributed, heterogeneous environment. Data Sci J 2:164–174
DeCandia G, Hastorun D, Jampani M et al (2007) Dynamo: amazon’s highly available key-value store. ACM SIGOPS Oper Syst Rev 41:205–220. https://doi.org/10.1145/1323293.1294281
Deka GC, Mazumder S, Singh Bhadoria R (eds) (2017) Distributed computing in big data analytics: concepts, technologies and applications, 1st edn. Springer International Publishing: Imprint: Springer, Cham
Ding Y, Groth P, Hendler J (eds) (2020) LINKED DATA VISUALIZATION: techniques, tools and big data. Morgan & Claypool, San Rafael
Dutta K (2017) Distributed computing technologies in big data analytics. In: Mazumder S, Singh Bhadoria R, Deka GC (eds) Distributed computing in big data analytics. Springer International Publishing, Cham, pp 57–82
Erraissi A, Belangour A, Tragha A (2017) Digging into Hadoop-based big data architectures. Int J Comput Sci Issues IJCSI 14:52–59
Evans JD (2006) Web Coverage Service (WCS) implementation specification. Open Geospatial Consortium Inc., Wayland
Feick M, Kleer N, Kohn M (2018) Fundamentals of real-time data processing architectures Lambda and Kappa. In: Becker M (ed) SKILL 2018 - Studierendenkonferenz Informatik. Gesellschaft für Informatik e.V, Bonn, pp 55–66
Fizians S (2014) RozoFS: a fault tolerant I/O intensive distributed file system based on Mojette erasure code. In: Workshop autonomic Oct, p 17

Folk M, Heber G, Koziol Q et al (2011) An overview of the HDF5 technology suite and its applications. In: Proceedings of the EDBT/ICDT 2011 workshop on array databases, pp 36–47
Foster I, Kesselman C, Nick JM, Tuecke S (2002) Grid services for distributed system integration. Computer 35:37–46
Foster I, Kesselman C, Tuecke S (2001) The anatomy of the grid: enabling scalable virtual organizations. Int J High Perform Comput Appl 15:200–222. https://doi.org/10.1177/109434200101500302
Foster I, Zhao Y, Raicu I, Lu S (2008) Cloud computing and grid computing 360-degree compared. In: 2008 grid computing environments workshop. IEEE, Austin, TX, USA, pp 1–10
Furht B, Villanustre F (2016) Big data technologies and applications. Springer
Gandhi P, Pruthi J (2020) Data visualization techniques: traditional data to big data. In: Anouncia SM, Gohel HA, Vairamuthu S (eds) Data visualization. Springer Singapore, Singapore, pp 53–74
Garcia Lopez P, Montresor A, Epema D et al (2015) Edge-centric computing: vision and challenges. ACM SIGCOMM Comput Commun Rev 45:37–42. https://doi.org/10.1145/2831347.2831354
Garlasu D, Sandulescu V, Halcu I et al (2013) A big data implementation based on Grid computing. In: 2013 11th RoEduNet international conference. IEEE, Sinaia, pp 1–4
Genaud S, Rattanapoka C (2007) P2P-MPI: a peer-to-peer framework for robust execution of message passing parallel programs on grids. J Grid Comput 5:27–42. https://doi.org/10.1007/s10723-006-9056-2
Ghazi MR, Gangodkar D (2015) Hadoop, MapReduce and HDFS: a developers perspective. Proc Comput Sci 48:45–50. https://doi.org/10.1016/j.procs.2015.04.108
Gomes V, Queiroz G, Ferreira K (2020) An overview of platforms for big earth observation data management and analysis. Remote Sens 12:1253. https://doi.org/10.3390/rs12081253
Gómez-Romero J, Molina-Solana M, Oehmichen A, Guo Y (2018) Visualizing large knowledge graphs: a performance analysis. Future Gener Comput Syst 89:224–238. https://doi.org/10.1016/j.future.2018.06.015
Gonzalez JE, Xin RS, Dave A et al (2014) Graphx: graph processing in a distributed dataflow framework. In: 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp 599–613
Gorelick N, Hancher M, Dixon M et al (2017) Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens Environ 202:18–27
Goudarzi M (2017) Heterogeneous architectures for big data batch processing in mapreduce paradigm. IEEE Trans Big Data 5:18–33
Gropp W, Thakur R, Lusk E (1999) Using MPI-2: advanced features of the message passing interface. MIT Press
Group FAS (2020) Sentinel Visualizer 8.0: the new standard for data visualization and analysis. http://www.fmsasg.com/
Guo S, Dhamankar R, Stewart L (2017) DistributedLog: a high performance replicated log service. In: 2017 IEEE 33rd international conference on data engineering (ICDE). IEEE, pp 1183–1194
Hajibaba M, Gorgin S (2014) A review on modern distributed computing paradigms: cloud computing, jungle computing and fog computing. J Comput Inf Technol 22:69. https://doi.org/10.2498/cit.1002381
Hausenblas M, Nadeau J (2013) Apache drill: interactive ad-hoc analysis at scale. Big Data 1:100–104
He C, Weitzel D, Swanson D, Lu Y (2012) HOG: distributed Hadoop MapReduce on the grid. In: 2012 SC companion: high performance computing, networking storage and analysis. IEEE, Salt Lake City, UT, pp 1276–1283
Herold F, Breuner S (2018) An introduction to BeeGFS
Herrera VM, Khoshgoftaar TM, Villanustre F, Furht B (2019) Random forest implementation and optimization for Big Data analytics on LexisNexis’s high performance computing cluster platform. J Big Data 6. https://doi.org/10.1186/s40537-019-0232-1
Hoffman S (2013) Apache Flume: distributed log collection for Hadoop. Packt Publishing Ltd

Hughes JN, Annex A, Eichelberger CN et al (2015) Geomesa: a distributed architecture for spatio-temporal fusion. In: Geospatial informatics, fusion, and motion video analytics V. International Society for Optics and Photonics, p 94730F
Hupfeld F, Cortes T, Kolbeck B et al (2008) The XtreemFS architecture—a case for object-based file systems in Grids. Concurr Comput Pract Exp 20:2049–2060
Jain A (2017) Mastering apache storm: Real-time big data streaming using kafka, hbase and redis. Packt Publishing Ltd
Kalamaras D (2014) Social Networks Visualizer (SocNetV): social network analysis and visualization software. Soc Netw Vis
Kalid S, Syed A, Mohammad A, Halgamuge MN (2017) Big-data NoSQL databases: a comparison and analysis of “Big-Table”, “DynamoDB”, and “Cassandra”. In: 2017 IEEE 2nd international conference on big data analysis (ICBDA). IEEE, Beijing, China, pp 89–93
Kang Y-S, Park I-H, Rhee J, Lee Y-H (2015) MongoDB-based repository design for IoT-generated RFID/sensor big data. IEEE Sensors J 16:485–497
Kjerrumgaard D (2020) Apache Pulsar in action. Manning
Ko I, Chang H (2017) Interactive visualization of healthcare data using tableau. Healthc Inform Res 23:349–354
Koranne S (2011) Hierarchical data format 5: HDF5. In: Handbook of open source tools. Springer US, Boston, pp 191–200
Korenkov VV, Kutovskiy NA, Balashov NA et al (2015) JINR cloud infrastructure. Proc Comput Sci 66:574–583. https://doi.org/10.1016/j.procs.2015.11.065
Korotaev A, Lyadova L (2018) Method for the development of recommendation systems, customizable to domains, with deep GRU network. In: KEOD, pp 229–234
Krintz C (2013) The appscale cloud platform: enabling portable, scalable web application deployment. IEEE Internet Comput 17:72–75
Krishnan S, Gonzalez JLU (2015) Google cloud SQL. In: Building your next big thing with Google cloud platform. Springer, pp 159–183
Kumar R, Jain K, Maharwal H et al (2014) Apache cloudstack: open source infrastructure as a service cloud computing platform. Proc Int J Adv Eng Technol Manag Appl Sci 111:116
Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing - PODC ’09. ACM Press, Calgary, AB, Canada, p 5
Lee J, Malcolm G, Matthews A et al (2009) Overview of Microsoft SQL Azure database. Microsoft Tech Whitepaper
Li H (2018) Alluxio: a virtual distributed file system. PhD Thesis, UC Berkeley
Locationtech (2020). https://rasterframes.io/. Locationtech
Lomotey RK, Deters R (2015) Unstructured data mining: use case for CouchDB. Int J Big Data Intell 2:168–182
Loshin D (2013a) NoSQL data management for big data. In: Big data analytics. Elsevier, pp 83–90
Loshin D (2013b) Introduction to high-performance appliances for big data management. In: Big data analytics. Elsevier, pp 49–59
Lv Z, Li X, Lv H, Xiu W (2019) BIM big data storage in WebVRGIS. IEEE Trans Ind Inform 16:2566–2573
Malewicz G, Austern MH, Bik AJC et al (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 international conference on management of data - SIGMOD ’10. ACM Press, Indianapolis, Indiana, USA, pp 135–146
Martella C, Shaposhnik R, Logothetis D, Harenberg S (2015) Practical graph analytics with apache giraph. Springer
Martini B, Choo K-KR (2014) Distributed filesystem forensics: XtreemFS as a case study. Digit Investig 11:295–313
Meier A, Kaufmann M (2019) NoSQL databases. In: SQL & NoSQL databases. Springer Fachmedien Wiesbaden, Wiesbaden, pp 201–218
Mell P, Grance T (2011) The NIST definition of cloud computing

Meng X, Bradley J, Yavuz B et al (2016) Mllib: machine learning in apache spark. J Mach Learn Res 17:1235–1241
Michalakes J (2020) HPC for weather forecasting. In: Grama A, Sameh AH (eds) Parallel algorithms in computational science and engineering. Springer International Publishing, Cham, pp 297–323
Morrison RS (2003) Cluster computing architectures, operating systems, parallel processing and programming languages. GNU Gen Public Licence 5
Nemani R, Lee T, Kalluri S et al (2020) GeoNEX: earth observations from operational geostationary satellite systems. In: EGU general assembly conference abstracts, p 2463
Nemani R, Votava P, Michaelis A et al (2011) Collaborative supercomputing for global change science. EOS Trans Am Geophys Union 92:109–110. https://doi.org/10.1029/2011EO130001
Neumeyer L, Robbins B, Nair A, Kesari A (2010) S4: distributed stream computing platform. In: 2010 IEEE international conference on data mining workshops. IEEE, pp 170–177
Noghabi SA, Paramasivam K, Pan Y et al (2017) Samza: stateful scalable stream processing at LinkedIn. Proc VLDB Endow 10:1634–1645
Pawar K, Attar V (2016) A survey on data analytic platforms for Internet of Things. In: 2016 international conference on computing, analytics and security trends (CAST). IEEE, Pune, India, pp 605–610
Pennefather PS, Suhanic W (2009) BioTIFF: a new BigTIFF file structure for organizing large image datasets and their associated metadata. Biophys J 96:30a
Perçuku A, Minkovska D, Stoyanova L (2017) Modeling and processing big data of power transmission grid substation using neo4j. Proc Comput Sci 113:9–16
Piotr Robert Konopelko (2016) MooseFS 3.0 storage classes manual
Puangsaijai W, Puntheeranurak S (2017) A comparative study of relational database and key-value database for big data applications. In: 2017 international electrical engineering congress (iEECON). IEEE, pp 1–4
Qiao L, Li Y, Takiar S et al (2015) Gobblin: unifying data ingestion for Hadoop. Proc VLDB Endow 8:1764–1769
Read T (2011) Oracle Solaris Cluster essentials. Prentice Hall, Upper Saddle River
Reyes-Ortiz JL, Oneto L, Anguita D (2015) Big data analytics in the cloud: spark on Hadoop vs MPI/OpenMP on Beowulf. Proc Comput Sci 53:121–130. https://doi.org/10.1016/j.procs.2015.07.286
Rodrigues AP, Chiplunkar NN (2018) Real-time Twitter data analysis using Hadoop ecosystem. Cogent Eng 5:1534519
Russom P (2013) Integrating Hadoop into business intelligence and data warehousing. TDWI Best Pract Rep
Sakr S, Orakzai FM, Abdelaziz I, Khayyat Z (2016) Large-scale graph processing using Apache Giraph. Springer
Sefraoui O, Aissaoui M, Eleuldj M (2012) OpenStack: toward an open-source solution for cloud computing. Int J Comput Appl 55:38–42
Selimi M, Freitag F (2014) Tahoe-lafs distributed storage service in community network clouds. In: 2014 IEEE fourth international conference on big data and cloud computing. IEEE, pp 17–24
Selvaganesan M, Liazudeen MA (2016) An insight about GlusterFS and its enforcement techniques. In: 2016 international conference on cloud computing research and innovations (ICCCRI). IEEE, pp 120–127
Shaheen J (2017) Apache Kafka: real time implementation with Kafka architecture review. Int J Adv Sci Technol 109:35–42
Siddique K, Akhtar Z, Yoon EJ et al (2016) Apache Hama: an emerging bulk synchronous Parallel computing framework for big data applications. IEEE Access 4:8879–8887. https://doi.org/10.1109/ACCESS.2016.2631549
Siebers B, Balaji V (2013) Data storage. In: Earth system modelling - volume 4. Springer, Berlin, Heidelberg, pp 21–24

194

10 Big Data Analytic Platforms

Sievert O, Casanova H (2004) A simple MPI process swapping architecture for iterative applications. Int J High Perform Comput Appl 18:341–352. https://doi.org/10.1177/1094342004047430 Singh D, Reddy CK (2015) A survey on platforms for big data analytics. J Big Data 2. https://doi. org/10.1186/s40537-014-0008-6 Târnăveanu D (2012) Pentaho business analytics: a business intelligence open source alternative. Database Syst J 3:23–34 Tatebe O, Hiraga K, Soda N (2010) Gfarm grid file system. New Gener Comput 28:257–275 Tummalapalli S, Rao Machavarapu V (2016) Managing mysql cluster data using cloudera impala. Proc Comput Sci 85:463–474 Ullman R, Bane B, Yang J (2008) HDF-EOS 2 and HDF-EOS 5 compatibility library Van-Dai Ta, Chuan-Ming Liu, Nkabinde GW (2016) Big data stream computing in healthcare real-time analytics. In: 2016 IEEE international conference on cloud computing and big data analysis (ICCCBDA). IEEE, Chengdu, China, pp 37–42 Vavilapalli VK, Murthy AC, Douglas C et al (2013) Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing, pp 1–16 Venkataraman S, Yang Z, Liu D et al (2016) Sparkr: Scaling r programs with spark. In: Proceedings of the 2016 international conference on management of data, pp 1099–1104 Vohra D (2016) Using apache sqoop. In: Pro Docker. Springer, pp 151–183 Wahid A, Kashyap K (2019) Cassandra—a distributed database system: an overview. In: Emerging technologies in data mining and information security. Springer, pp 519–526 Wang F, Oral S, Shipman G et al (2009) Understanding lustre filesystem internals. Oak Ridge Natl Lab Natl Cent Comput Sci Tech Rep Wang K, Liu G, Zhai M et al (2019) Building an efficient storage model of spatial-temporal information based on HBase. J Spat Sci 64:301–317 Wang L, Ma Y, Zomaya AY et al (2014) A parallel file system with application-aware data layout policies for massive remote sensing image processing in digital earth. 
IEEE Trans Parallel Distrib Syst 26:1497–1508 Warneke D, Kao O (2009) Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd workshop on many-task computing on grids and supercomputers - MTAGS ‘09. ACM Press, Portland, Oregon, pp 1–10 Weili Kou, Xuejing Yang, Changxian Liang et al (2016) HDFS enabled storage and management of remote sensing data. In: 2016 2nd IEEE international conference on computer and communications (ICCC). IEEE, Chengdu, China, pp 80–84 Williams M, Benfield C, Warner B et al (2019) Tahoe-LAFS: the least-authority file system. In: Expert twisted. Springer, pp 223–251 Wu D, Sakr S, Zhu L (2017) Big data programming models. In: Zomaya AY, Sakr S (eds) Handbook of big data technologies. Springer International Publishing, Cham, pp 31–63 Xin RS, Rosen J, Zaharia M et al (2013) Shark: SQL and rich analytics at scale. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp 13–24 Yadav S (2013) Comparative study on open source software for cloud computing platform: Eucalyptus, openstack and opennebula. Int J Eng Sci 3:51–54 Yang F, Tschetter E, Léauté X et al (2014) Druid: a real-time analytical data store. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data, pp 157–168 Yang M, Folk M, McGrath RE (2004) Investigation of parallel netCDF with ROMS. NCSA HDF Group April 15 Yang Y, Ning Z, Cai Y et al (2018) Research on parallelisation of collaborative filtering recommendation algorithm based on Spark. Int J Wirel Mob Comput 14:312–319 Zaharia M, Xin RS, Wendell P et al (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59:56–65. https://doi.org/10.1145/2934664 Zhang R, Freitag M, Albrecht C, et al (2019) Towards scalable geospatial remote sensing for efficient OSM labeling. Editors. 27

Chapter 11

Algorithmic Design Considerations of Big Data Analytics

Abstract An algorithm is evaluated primarily on two metrics: temporal and spatial complexity. This chapter presents algorithm design challenges for remote sensing big data along five dimensions: Volume, Velocity, Variety, Veracity, and Value. Volume requires algorithm design to consider processing performance, modularity, data sparsity, dimensionality, feature selection/extraction, nonlinearity, Bonferroni's principle, and balanced variance/bias. Velocity requires algorithm design to consider data streaming, real-time processing, concept drift, and skewed distributions. Variety requires algorithm design to consider data locality and data heterogeneity. Veracity requires algorithm design to consider data provenance and data quality. Value requires algorithm design to consider domain-specific models and knowledge.

Keywords Algorithm · Big data analytics · Performance · Modularity · Data sparsity · Data dimensionality · Feature selection · Streaming · Concept drift · Skewed distribution · Data provenance · Data quality

The 5 Vs of big data dictate the design of data analytics for remote sensing big data (Ishwarappa and Anuradha 2015). The five Vs are Volume, Velocity, Variety, Veracity, and Value. For many applications, algorithm design mostly considers the first three Vs. For remote sensing and scientific applications, Veracity has to be considered too. Each of the Vs brings technical requirements to the algorithmic design process, and each of these requirements needs to be considered.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_11

11.1 Complexity of Remote Sensing Big Data Analytic Algorithms

In computational complexity theory, we evaluate algorithms primarily on two types of complexity: time complexity and space complexity (Sipser 2012). Time complexity, or temporal complexity, is a measure that indicates how fast the algorithm
performs relative to the size of its input. Space complexity, or spatial complexity, tells how much space (i.e., computational memory) the algorithm needs, again relative to the input size, to accomplish the designed computation. Algorithm design often aims at reducing these two complexities, ideally to at most linear space complexity and log-linear time complexity (van Zyl 2014). With remote sensing big data, the complexities increase. The following summarizes some of the reasons for the increase in temporal and spatial complexity that call for reconsidering classic algorithms:
• Complexity of geospatial and remote sensing big data: Algorithms have to deal with the intrinsically complex characteristics of geospatial and remote sensing big data. The same algorithm carries a higher complexity on remote sensing big data than on simple, small data (Khader and Al-Naymat 2020). For example, the time complexity of algorithms processing unstructured data is higher than that for structured data of the same size. The space complexity increases as well with unstructured data compared to structured data.
• Unmet assumption of independent and identically distributed (IID) data: The IID assumption may not hold for remote sensing big data (Shu 2016). Data may be autocorrelated. Spatial and temporal autocorrelation exists in most remote sensing images (Jiang 2016; Jiang and Shekhar 2017). The autocorrelation prevents images from being processed in parallel at the level of individual pixels or time steps. The time and space complexity of remote sensing algorithms is therefore higher than that of their traditional counterparts. As a result, we may have to adopt a more sophisticated analytic algorithm instead of a simple version to solve the same problem. For example, spatial autoregression may be used instead of linear regression, which increases the complexities (van Zyl 2014). Other examples are geographically weighted regression versus regression (van Zyl 2014), and co-location pattern mining versus association rule mining (van Zyl 2014).
• Extended set theory: Big data analytics extends set theory to include probability-measured sets and topological spaces (Shu 2016). A probability-measured set mathematically contains a sample space, a set of events, and a probability-measured function. In machine learning for big data, a topological space is used to characterize a set with topological relations. Terms like topological space, Euclidean metric space, normed space, inner product space, and vector space are often used interchangeably in machine learning (Shu 2016).
• Utilization of stochastic data access patterns: To deal with the increased volume and complexity, we may consider working with part of the data. Somewhat stochastic data access patterns may be used to process remote sensing big data (van Zyl 2014). Many existing sophisticated or complex algorithms for remote sensing data require variable patterns and frequencies of access to the underlying data. As a result, some algorithms may have increased time complexity.
• Change of computing infrastructure: To support big data computation, we may have to adopt a different infrastructure than a conventional single-node computer. For example, the distributed computing model MapReduce may be used to support big data processing. Many of the iterative algorithms, which are
commonly used in remote sensing data processing, are not directly applicable since MapReduce does not natively support iterative algorithms (Kang and Lee 2017). Alternative approaches may have to be taken to achieve the effect of the counterpart iterative algorithms, which increases time complexity.
• Maintaining internal relationships among data: Remote sensing data analysis often requires tracking the relationship between every pair of observations, which builds a matrix of quadratic space complexity. For example, a spatial weight matrix may be calculated for geographically weighted regression. With a large volume of data, it is impossible to keep all the data in memory. Certain decomposition and aggregation strategies may have to be considered in algorithm design to realize the computation on a cluster of distributed nodes. This process of decomposition and aggregation increases temporal and spatial complexity.
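The decomposition and aggregation idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited work: it assumes a Gaussian distance kernel, and the names `gaussian_weight`, `partial_sums`, `kernel_weighted_mean`, and the `block_size` parameter are hypothetical. Each block of observations contributes partial sums that are aggregated afterwards, so the full n × n spatial weight matrix is never held in memory.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian_weight(p: Point, q: Point, bandwidth: float) -> float:
    """Spatial weight between two locations (assumed Gaussian kernel)."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-0.5 * d2 / bandwidth ** 2)


def partial_sums(targets: Sequence[Point], block_points: Sequence[Point],
                 block_values: Sequence[float],
                 bandwidth: float) -> List[Tuple[float, float]]:
    """Decomposition step: one block's contribution to every target.

    Returns (sum of w * value, sum of w) per target; a block could be
    handled by a separate node in a distributed setting.
    """
    out = []
    for t in targets:
        num = den = 0.0
        for p, v in zip(block_points, block_values):
            w = gaussian_weight(t, p, bandwidth)
            num += w * v
            den += w
        out.append((num, den))
    return out


def kernel_weighted_mean(points: Sequence[Point], values: Sequence[float],
                         bandwidth: float, block_size: int) -> List[float]:
    """Aggregation step: add up block contributions, then normalize."""
    totals = [(0.0, 0.0)] * len(points)
    for start in range(0, len(points), block_size):
        part = partial_sums(points,
                            points[start:start + block_size],
                            values[start:start + block_size],
                            bandwidth)
        totals = [(a + c, b + d) for (a, b), (c, d) in zip(totals, part)]
    return [num / den for num, den in totals]
```

The result does not depend on `block_size` apart from floating-point rounding, but the work is still quadratic in the number of observations; the decomposition only bounds what has to be held at once.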

11.2 Challenges and Algorithm Design Considerations from Volume

The geospatial community is challenged by the large volume of data in comparison to the capability of computing resources. The big data problem, or data volume beyond the capability of the available computing resources, has been a hot topic of research and commercial applications involving all sectors of society. Some of the challenges facing the broader big data community have long existed in the geospatial and temporal data analytics community. Remote sensing data are large in volume, which demands high capacity in computing resources, that is, high-performance processing capabilities for data processing, large memory for holding data during processing, big storage for persistent archiving of data, and a high-bandwidth network for exchanging data among nodes (Loshin 2013). Remote sensing big data analytics needs to deal with large quantities of raster data, point clouds, and even vector data. Dealing with a large volume of geospatial remote sensing data has been a common problem in remote sensing big data analytics. Spatial data mining has long endeavored to unlock information from large databases with spatial attributes. Algorithmic approaches have been adapted to overcome the data volume. The problem of big data is well acknowledged and long studied in the geospatial community. With the advancements in computing methods and technologies, there are emerging opportunities to take advantage of advances made by the broader geospatial community.

Spatial data includes three major forms: raster, vector, and areal (point cloud). Raster data have always been considered big data since most remote sensing data are collected in raster form. For example, NASA EOSDIS managed more than 32 PB of data in its archive by the end of 2019, projected to reach 37 PB by the end of 2020, most of it in raster form (Murphy 2020). Vector data, which used to be smaller in size, were not considered big data, but that perception is changing with technological advancement. One of the data-driving forces in remote sensing is the inclusion of the Internet of Things (IoT) (Ge et al.
2018; Granell et al. 2020) and Sensor Web technologies (Di 2016; Garcia Alvarez et al. 2019). The data from these sensor networks flow in at high velocity and in large volume. These data could also place velocity constraints on the algorithms if near real-time processing is required. A point cloud consists of data points in space that can represent any type of data (Wang et al. 2017; Zou et al. 2019). Such high-variety, unstructured data may arrive at high velocity, in large volume, or in any combination of these characteristics.

Machine learning and related algorithms are often used in analyzing remote sensing big data. Volume poses specific challenges for applying machine learning algorithms to big data. Algorithm design may need to consider the following (L'Heureux et al. 2017; Augenstein et al. 2017, 2020): (1) reducing temporal and spatial complexity to meet the processing performance requirements, (2) using parallelism or iterative processing to mitigate the curse of modularity, (3) feature selection, (4) nonlinearity, (5) Bonferroni's principle to identify and use the proper model on a proper subset of data for specific information extraction, (6) sampling approaches for data with skewed distribution, (7) high-dimensional analysis and data reduction to deal with the curse of dimensionality, and (8) balancing the variance and bias of learned models to avoid generalization failure and overfitting. The following lists some of the challenges and algorithmic design considerations for the dimension of volume in remote sensing big data analytics:
• Parallel computation: A single node does not meet the time requirements for processing remote sensing big data. A concrete example of the challenge is the biomass monitoring application using MODIS data (Prasad et al. 2017). The time complexity of the simple Gaussian process-based monitoring algorithm becomes unmanageable on a single workstation.
The inclusion of parallelism in an algorithm is necessary to complete the computation in a reasonable time frame. This solution often responds to the Curse of Modularity (Parker 2012; L'Heureux et al. 2017).
• Adopting data reduction as the first step in algorithm design: In remote sensing big data, the large volume may eventually exhaust the available computing resources in terms of temporal and spatial complexity. The simplest data reduction strategy is to select a representative subset and work on the subset to approximate the big dataset. Other data-reduction approaches, such as principal component analysis, may be applied as the first step. This strategy responds to the Curse of Dimensionality. As shown in Tables 11.1 and 11.2, the dimension (d) of data affects the complexity of algorithms. The data can be too large to be processed because of the increased time or space complexity. This is especially true during the training stage if supervised algorithms are used. With a large number of dimensions, machine learning algorithms may face challenges in efficiently computing similarity metrics among data (e.g., Euclidean and Manhattan distance) and similarity metrics among semantic relationships (e.g., connectedness or semantic similarity needs to be accounted for). The computation operates on a high-dimensional matrix. Possible strategies include the following (van Zyl 2014; Shu 2016): (1) iteration over the matrix, (2) using subsets that represent the population to get fast results that converge to an optimal solution, and (3) specialized high-dimensional analysis (e.g., multidimensional scaling).

Table 11.1 Temporal and spatial complexity of selected unsupervised machine learning algorithms

Algorithm                         Time complexity   Space complexity
k-nearest neighbors (k-NN)        O(knd)            O(nd)
Logistic regression               O(nd)             O(d)
Spatial autoregressive modeling   O(n^3)            O(n^2)

1. k: number of classes
2. n: number of samples/instances
3. d: number of dimensions

Table 11.2 Temporal and spatial complexity of selected supervised machine learning algorithms

Algorithm                      Time complexity (training)   Time complexity (runtime)
Naïve Bayes                    O(nd)                        O(cd)
Decision tree                  O(n log(n) d)                O(depth of decision tree)
Support vector machine (SVM)   O(n^2)                       O(kd)
Random forest                  O(n log(n) d l)              O(depth of tree * k)

Space complexity (training, runtime): O(cd) O(1) O(depth of tree * k) O(nd)

1. k: number of support vectors
2. n: number of samples/instances
3. d: number of dimensions
4. l: number of decision trees

• Representativeness of subspaces: As shown in Tables 11.1 and 11.2, the time and space complexity can run beyond the capacity of a single computational node. One solution is to divide and conquer. A selected subset of data may be used instead of the full data in the analysis. The representativeness of the data subset may affect pattern discovery since error is introduced during the selection of subsets to be processed (van Zyl 2014). The process of breaking the whole data into working subsets may introduce artificial errors due to biases.
• Computational and input/output challenges of different data mining algorithms with remote sensing big data: Different algorithms have different limitations when they are applied to remote sensing big data. As a result, different strategies should be developed and applied to optimize the processing of remote sensing big data. For example, the spatial autoregressive model (SAR), which is often used to find spatial dependencies in remote sensing big data, has a complexity of O(n^3) in time and O(n^2) in space (Vatsavai et al. 2012). Parameter estimation uses Bayesian statistics. The Markov Chain Monte Carlo (MCMC) sampling approach or similar stochastic relaxation approaches may be used in model building with subsets of data to reduce the actual temporal and spatial complexity (Pfarrhofer and Piribauer 2019). Markov random field
classifiers may be optimized by stochastic relaxation, iterated conditional modes, dynamic programming, and graph cut (Vatsavai et al. 2012). The temporal and spatial complexity of the Markov random field may be reduced with a Gaussian process learning approach from O(N^3) and O(N^2) to O(N^2) and O(N), respectively (Chandola and Vatsavai 2010). Big data problems may be handled efficiently by using an ensemble of models with reduced temporal and spatial complexity, such as the Gaussian mixture model (GMM) (Vatsavai et al. 2012).
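The points on data reduction and on the representativeness of subsets can be made concrete with a small sampling sketch. It is a minimal illustration, assuming each observation carries a class label and that keeping the class proportions is what "representative" means for the task at hand; the function name `stratified_sample` and the land-cover labels in the usage note are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_sample(labels: Sequence[Hashable], fraction: float,
                      seed: int = 0) -> List[int]:
    """Return indices of a subset that keeps each class's share.

    A plain random sample of a skewed dataset can miss rare classes;
    sampling inside each class (stratum) keeps at least one member of
    every class in the reduced data.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    chosen: List[int] = []
    for idx in by_class.values():
        k = min(len(idx), max(1, round(len(idx) * fraction)))
        chosen.extend(rng.sample(idx, k))
    return sorted(chosen)
```

For example, with 900 "water", 90 "urban", and 10 "wetland" pixels, `stratified_sample(labels, 0.1)` keeps 90, 9, and 1 indices respectively, whereas a simple random sample of 100 pixels could easily contain no "wetland" pixel at all.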

11.3 Challenges and Algorithm Design Considerations from Velocity

Remote sensing big data grows rapidly, and the growth rate is increasing too. The ingest rate of data into the EOSDIS archive is projected to grow from the current 3.9 petabytes (PB) per year to as much as 47.7 PB per year by 2022.¹ There are two general approaches to dealing with the accrued remote sensing data. One is batch processing, which processes the data at certain intervals. The other is stream processing, which processes data as they arrive. Algorithm design for batch processing faces challenges mainly due to volume, as described in the previous section. Algorithm design for stream processing demands higher performance because the algorithm must be efficient enough to handle the data as they flow in. The following are some algorithm design considerations specific to the velocity of remote sensing big data:
• Selection of infrastructure and big data modeling language: Batch processing and stream processing are two different types of computational processes that require different computational frameworks or stacks to deal with remote sensing big data. For example, Hadoop and the MapReduce model are good for batch processing, which is based on partitioning and processing stored data. Apache Storm and its modeling language are designed to handle streamed data with a high-performance clustering core. Apache Spark and its modeling languages may be used for either batch or stream processing with different configurations.
• Specialized streaming algorithms: Algorithm design has to consider data availability, since an algorithm working on streamed data sees only part of the data at a time. Incremental algorithms, or online algorithms, are designed to deal with streamed data. For online algorithms to work on remote sensing big data, common requirements are as follows (Benczúr et al. 2019): (1) online learning updates the model only when a data instance arrives, without revisiting past data, (2) online algorithms adapt to concept drift, and (3) online algorithms are adapted to work in a distributed stream-processing architecture.

¹ https://earthdata.nasa.gov/esds/continuous-evolution

Examples of streaming algorithms are the Hoeffding tree (Pfahringer et al. 2007), StreamKM++ (a k-means algorithm for streamed data) (Ackermann et al. 2012), D-stream (Chen and Tu 2007), incremental linear discriminant analysis (Chu et al. 2015), and AdaBoost (Freund and Schapire 1997). On-board processing is another type of algorithmic strategy to cope with live streaming from remote sensors. On-board processing reconfigures and moves processing algorithms close to the data sources (sensors) to reduce, format, filter/select, compress, encrypt, tag data with metadata, or preprocess data into data products at higher (smaller) data production levels (van Duijn and Redi 2017). On-board processing algorithms can be implemented as reconfigurable hardware/firmware, for example, to support advanced feature detection from hyperspectral imaging systems (Yuan et al. 2022; Bakken et al. 2022; Zhang et al. 2022).
• Complexity evaluation: The evaluation of temporal and spatial complexity of specialized stream algorithms is different from that of a regular algorithm. Amortized (or average) temporal and spatial complexity (Tarjan 1985) may be used in evaluating online algorithms instead of worst-case big-O complexity. Specialized evaluation approaches may be developed as metrics to evaluate the performance of algorithms for big data streams (Bifet et al. 2015).
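The first two requirements for online algorithms listed above (update on arrival without revisiting past data, and adapt to concept drift) can be illustrated with two very small estimators. This is a sketch under stated assumptions rather than one of the cited streaming algorithms: `RunningStats` uses Welford's well-known one-pass update for mean and variance, and `ForgettingMean` uses exponential forgetting as one simple way to follow a drifting signal. Both class names and the `alpha` parameter are hypothetical.

```python
from typing import Optional


class RunningStats:
    """One-pass mean and sample variance (Welford's update).

    Each observation is used once and then discarded, so memory stays
    constant no matter how long the stream is.
    """

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


class ForgettingMean:
    """Exponentially weighted mean that down-weights old observations.

    A larger alpha forgets faster, which helps when the underlying
    distribution drifts, at the cost of a noisier estimate.
    """

    def __init__(self, alpha: float) -> None:
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        if self.value is None:
            self.value = x
        else:
            self.value = (1.0 - self.alpha) * self.value + self.alpha * x
        return self.value
```

On a stream that switches from 0 to 10 halfway through, `RunningStats.mean` settles at the overall average of 5, while `ForgettingMean` with `alpha=0.2` ends close to 10, the level after the drift. Each update costs constant time, so the amortized and worst-case costs per observation coincide here.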

11.4 Challenges and Algorithm Design Considerations from Variety

The variety dimension of remote sensing big data includes two aspects: one is the variety due to unstructured and semi-structured sensor data, and the other is the variety due to the large number of features from different sources (sensors) (van Zyl 2014). The following are algorithm considerations specifically dealing with challenges from the dimension of variety in remote sensing big data:
• Data locality: Data may not be available in local storage. Some data may be distributed and accessible only through the Internet. One way to achieve performance with a variety of distributed data is to exploit data locality, that is, the ability to move computation algorithms and programs close to the node where the data actually reside instead of moving large data. Some platforms support data locality (Lee et al. 2019). For example, MapReduce in Hadoop supports node-local (data-local) locality, intra-rack locality, and inter-rack locality (Wang and Ying 2016; Naik et al. 2019; Lee et al. 2019). Apache Spark also supports data locality at four levels: the same process as the source data, the same machine as the source data, the same rack as the source data, and no restriction (Hu et al. 2020). Algorithm design for remote sensing big data needs to consider data locality to achieve performance (Li 2020).
• Data heterogeneity: Semantically, data heterogeneity involves data with different concepts. Statistically, data heterogeneity represents different distributions from different data sources. Algorithm design for remote sensing big data needs to deal with the integration of heterogeneous data (Li et al. 2020).
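The locality levels described in the data locality bullet can be sketched as a simple placement rule. This is a hypothetical illustration and not the scheduler API of Hadoop or Spark: the functions `locality_level` and `pick_node`, the node names, and the `rack_of` mapping are all assumptions made for the example.

```python
from typing import Dict, Sequence


def locality_level(node: str, replicas: Sequence[str],
                   rack_of: Dict[str, str]) -> int:
    """0 = node holds the data, 1 = same rack as a replica, 2 = other rack."""
    if node in replicas:
        return 0
    if rack_of[node] in {rack_of[r] for r in replicas}:
        return 1
    return 2


def pick_node(free_nodes: Sequence[str], replicas: Sequence[str],
              rack_of: Dict[str, str]) -> str:
    """Send the task to the free node closest to the data.

    Ties are broken by node name so that the choice is deterministic.
    """
    return min(free_nodes,
               key=lambda n: (locality_level(n, replicas, rack_of), n))
```

With `rack_of = {"a1": "rackA", "a2": "rackA", "b1": "rackB"}` and the data block stored on `a1`, the rule prefers `a1` when it is free, then `a2` on the same rack, and falls back to `b1` only when nothing closer is available.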

11.5 Challenges and Algorithm Design Considerations from Veracity

The quality of remotely sensed data varies among sensors. Efficient algorithms and models need to tolerate a certain degree of data noise and uncertainty (Milman 2000). The following are algorithm considerations specifically dealing with challenges from the dimension of veracity in remote sensing big data:
• Data provenance: Managing and using metadata properly is important for tracing errors and evaluating the quality of the data sources used as inputs to the algorithm under research (Di et al. 2013). Remote sensing big data analytics may require the composition of simple modules (Yue et al. 2010). As a result, it is necessary to track the origin and data quality of component services and data (Di et al. 2013). Data provenance and related considerations are required to enable algorithm development in remote sensing big data (Lynnes and Huang 2018).
• Data quality: Data quality and noise should be considered in remote sensing analytic algorithm design to mitigate their negative effects (Li et al. 2018). The uncertainty of data is rooted in the increasingly complex data structures and inconsistencies in remote sensing big data. Data quality challenges include information incompleteness (missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)) and errors (inconsistency, unrepresentativeness, unreliability, etc.) (Allison 2002; Liu et al. 2016). Data quality can be classified into accuracy (correctness, validity, and precision), completeness (pertinence and relevance), readability (comprehensibility, clarity, and simplicity), accessibility (availability), consistency (cohesion and coherence), and trust (believability, reliability, and reputation) (Batini et al. 2016; Ramasamy and Chowdhury 2020). Algorithm design should consider proper preprocessing and error-tolerant models. The reliability of data sources and their error propagation should be taken into account in algorithm development.
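As a small illustration of preprocessing that tolerates missing and noisy observations, the sketch below builds a per-pixel composite from a time series. It is a minimal example under stated assumptions, not a method from the cited works: missing observations are represented as `None`, the median is used because a single outlier (such as an undetected cloud) does not shift it much, and the function name `robust_composite` and the `min_valid` threshold are hypothetical.

```python
import statistics
from typing import Optional, Sequence, Tuple


def robust_composite(series: Sequence[Optional[float]],
                     min_valid: int = 3) -> Tuple[Optional[float], float]:
    """Median of the valid observations plus a completeness score.

    Returns (value, completeness), where completeness is the share of
    non-missing observations. The value is None when fewer than
    `min_valid` observations are available, so downstream steps can
    treat the pixel as unreliable instead of trusting a weak estimate.
    """
    valid = [v for v in series if v is not None]
    completeness = len(valid) / len(series) if series else 0.0
    if len(valid) < min_valid:
        return None, completeness
    return statistics.median(valid), completeness
```

For the series `[0.31, None, 0.30, 0.95, 0.29, 0.32]`, the composite is 0.31 with a completeness of 5/6: the outlier 0.95 and the missing observation do not distort the result, and the completeness score can travel with the value as simple quality metadata.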

11.6 Challenges and Algorithm Design Considerations from Value

The value dimension of remote sensing big data is related to its significance in different applications. Different applications have different domain-specific models that need to be considered in algorithm design (Saggi and Jain 2018). Application-specific
algorithms are needed to meet the requirements from different domain areas, such as societal benefit areas identified for Earth Observations—ecosystem (biodiversity and ecosystem sustainability), disaster (disaster resilience), energy (energy and mineral resource management), agriculture (food security and sustainable agriculture), infrastructure (infrastructure and transportation management), health (public health surveillance), urban (sustainable urban development), and water (water resource management) (Nativi et al. 2015).

References Ackermann MR, Märtens M, Raupach C et al (2012) StreamKM++: a clustering algorithm for data streams. ACM J Exp Algor 17. https://doi.org/10.1145/2133803.2184450 Allison P (2002) Missing data. SAGE Publications, Thousand Oaks Augenstein C, Spangenberg N, Franczyk B (2017) Applying machine learning to big data streams: an overview of challenges. In: 2017 IEEE 4th international conference on soft computing & machine intelligence (ISCMI). IEEE, Mauritius, pp 25–29 Augenstein C, Zschörnig T, Spangenberg N et al (2020) A generic architectural framework for machine learning on data streams. In: Filipe J, Śmiałek M, Brodsky A, Hammoudi S (eds) Enterprise information systems. Springer International Publishing, Cham, pp 97–114 Bakken S, Honore-Livermore E, Birkeland R et al (2022) Software development and integration of a hyperspectral imaging payload for HYPSO-1. In: 2022 IEEE/SICE international symposium on system integration (SII). IEEE, Narvik, Norway, pp 183–189 Batini C, Rula A, Scannapieco M, Viscusi G (2016) From data quality to big data quality. In: Khosrow-Pour M, Clarke S, Jennex ME et al (eds) Big data: concepts, methodologies, tools, and applications. IGI Global, pp 1934–1956 Benczúr AA, Kocsis L, Pálovics R (2019) Online machine learning in big data streams: overview. In: Sakr S, Zomaya AY (eds) Encyclopedia of big data technologies. Springer International Publishing, Cham, pp 1207–1218 Bifet A, de Francisci MG, Read J et al (2015) Efficient online evaluation of big data stream classifiers. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining - KDD ’15. ACM Press, Sydney, NSW, Australia, pp 59–68 Chandola V, Vatsavai RR (2010) Scalable time series change detection for biomass monitoring using Gaussian process. In: Srivastava AN, Chawla NV, Yu PS, Melby P (eds) Proceedings of the 2010 conference on intelligent data understanding, CIDU 2010, October 5–6, 2010. 
NASA Ames Research Center, Mountain View, California, USA Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining - KDD ’07. ACM Press, San Jose, California, USA, p 133 Chu D, Liao L-Z, Ng MK-P, Wang X (2015) Incremental linear discriminant analysis: a fast algorithm and comparisons. IEEE Trans Neural Netw Learn Syst 26:2716–2735. https://doi. org/10.1109/TNNLS.2015.2391201 Di L (2016) Big data and its applications in agro-geoinformatics. In: 2016 IEEE international geoscience and remote sensing symposium (IGARSS). IEEE, Beijing, China, pp 189–191 Di L, Yue P, Ramapriyan HK, King RL (2013) Geoscience data provenance: an overview. IEEE Trans Geosci Remote Sens 51:5065–5072. https://doi.org/10.1109/TGRS.2013.2242478 Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55:119–139. https://doi.org/10.1006/jcss.1997.1504 Garcia Alvarez M, Morales J, Kraak M-J (2019) Integration and exploitation of sensor data in smart cities through event-driven applications. Sensors 19:1372. https://doi.org/10.3390/s19061372 Ge M, Bangui H, Buhnova B (2018) Big data for Internet of Things: a survey. Future Gener Comput Syst 87:601–614. https://doi.org/10.1016/j.future.2018.04.053


Granell C, Kamilaris A, Kotsev A et al (2020) Internet of Things. In: Guo H, Goodchild MF, Annoni A (eds) Manual of digital earth. Springer Singapore, Singapore, pp 387–423
Hu F, Yang C, Jiang Y et al (2020) A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data. Int J Digit Earth 13:410–428. https://doi.org/10.1080/17538947.2018.1523957
Ishwarappa, Anuradha J (2015) A brief introduction on big data 5Vs characteristics and Hadoop technology. Proc Comput Sci 48:319–324. https://doi.org/10.1016/j.procs.2015.04.188
Jiang Z (2016) Spatial big data analytics: classification techniques for earth observation imagery. Ph.D. Dissertation, University of Minnesota
Jiang Z, Shekhar S (2017) Spatial big data science. Springer International Publishing, Cham
Kang M, Lee J-G (2017) An experimental analysis of limitations of MapReduce for iterative algorithms on Spark. Clust Comput 20:3593–3604. https://doi.org/10.1007/s10586-017-1167-y
Khader M, Al-Naymat G (2020) Density-based algorithms for big data clustering using MapReduce framework: a comprehensive study. ACM Comput Surv 53:1–38. https://doi.org/10.1145/3403951
Lee S, Jo J-Y, Kim Y (2019) Survey of data locality in Apache Hadoop. In: 2019 IEEE international conference on big data, cloud computing, data science & engineering (BCD). IEEE, Honolulu, HI, USA, pp 46–53
L’Heureux A, Grolinger K, Elyamany HF, Capretz MAM (2017) Machine learning with big data: challenges and approaches. IEEE Access 5:7776–7797. https://doi.org/10.1109/ACCESS.2017.2696365
Li D, Shen X, Wang L (2018) Connected geomatics in the big data era. Int J Digit Earth 11:139–153. https://doi.org/10.1080/17538947.2017.1311953
Li H, Zhang Z, Tang P (2020) A web-based remote sensing data processing and production system with the unified integration of multi-disciplinary data and models. IEEE Access 8:162961–162972. https://doi.org/10.1109/ACCESS.2020.3021791
Li Z (2020) Geospatial big data handling with high performance computing: current approaches and future directions. In: Tang W, Wang S (eds) High performance computing for geospatial applications. Springer International Publishing, Cham, pp 53–76
Liu J, Li J, Li W, Wu J (2016) Rethinking big data: a review on the data quality and usage issues. ISPRS J Photogramm Remote Sens 115:134–142. https://doi.org/10.1016/j.isprsjprs.2015.11.006
Loshin D (2013) Big data analytics: from strategic planning to enterprise integration with tools, techniques, NoSQL, and graph. Elsevier, Morgan Kaufmann, Amsterdam
Lynnes C, Huang T (2018) Future of big earth data analytics. NASA
Milman AS (2000) Mathematical principles of remote sensing: making inferences from noisy data, 0th edn. CRC Press
Murphy K (2020) NASA earth science data systems program highlights 2019
Naik NS, Negi A, Bapu BRT, Anitha R (2019) A data locality based scheduler to enhance MapReduce performance in heterogeneous environments. Future Gener Comput Syst 90:423–434. https://doi.org/10.1016/j.future.2018.07.043
Nativi S, Mazzetti P, Santoro M et al (2015) Big data challenges in building the global earth observation system of systems. Environ Model Softw 68:1–26. https://doi.org/10.1016/j.envsoft.2015.01.017
Parker C (2012) Unexpected challenges in large scale machine learning. In: BigMine-12: 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications. ACM, Beijing, China
Pfahringer B, Holmes G, Kirkby R (2007) New options for Hoeffding trees. In: Orgun MA, Thornton J (eds) AI 2007: Advances in artificial intelligence. Springer, Berlin, Heidelberg, pp 90–99
Pfarrhofer M, Piribauer P (2019) Flexible shrinkage in high-dimensional Bayesian spatial autoregressive models. Spat Stat 29:109–128. https://doi.org/10.1016/j.spasta.2018.10.004


Prasad SK, Aghajarian D, McDermott M et al (2017) Parallel processing over spatial-temporal datasets from Geo, Bio, climate and social science communities: a research roadmap. In: 2017 IEEE international congress on big data (BigData Congress). IEEE, Honolulu, HI, USA, pp 232–250
Ramasamy A, Chowdhury S (2020) Big data quality dimensions: a systematic literature review. J Inf Syst Technol Manag 17. https://doi.org/10.4301/S1807-1775202017003
Saggi MK, Jain S (2018) A survey towards an integration of big data analytics to big insights for value-creation. Inf Process Manag 54:758–790. https://doi.org/10.1016/j.ipm.2018.01.010
Shu H (2016) Big data analytics: six techniques. Geo-Spat Inf Sci 19:119–128. https://doi.org/10.1080/10095020.2016.1182307
Sipser M (2012) Introduction to the theory of computation, 3rd edn. Course Technology Cengage Learning, Boston
Tarjan RE (1985) Amortized computational complexity. SIAM J Algebr Discrete Methods 6:306–318. https://doi.org/10.1137/0606031
van Duijn P, Redi S (2017) Big data in space. In: Proceeding of the 31st annual conference on small satellite, Logan, Utah, USA
van Zyl T (2014) Algorithmic design considerations for geospatial and/or temporal big data. Big Data Tech Technol Geoinformatics. CRC Press, Boca Raton, pp 117–132
Vatsavai RR, Ganguly A, Chandola V et al (2012) Spatiotemporal data mining in the era of big spatial data: algorithms and applications. In: Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data - BigSpatial ’12. ACM Press, Redondo Beach, California, pp 1–10
Wang C, Hu F, Sha D, Han X (2017) Efficient Lidar point cloud data managing and processing in a Hadoop-based distributed framework. ISPRS Ann Photogramm Remote Sens Spat Inf Sci IV-4/W2:121–124. https://doi.org/10.5194/isprs-annals-IV-4-W2-121-2017
Wang W, Ying L (2016) Data locality in MapReduce: a network perspective. Perform Eval 96:1–11. https://doi.org/10.1016/j.peva.2015.12.002
Yuan S, Sun Y, He W et al (2022) MSLM-RF: a spatial feature enhanced random forest for on-board hyperspectral image classification. IEEE Trans Geosci Remote Sens 60:1–17. https://doi.org/10.1109/TGRS.2022.3194075
Yue P, Gong J, Di L (2010) Augmenting geospatial data provenance through metadata tracking in geospatial service chaining. Comput Geosci 36:270–281. https://doi.org/10.1016/j.cageo.2009.09.002
Zhang Z, Qu Z, Liu S et al (2022) Expandable on-board real-time edge computing architecture for Luojia3 Intelligent remote sensing satellite. Remote Sens 14:3596. https://doi.org/10.3390/rs14153596
Zou W, Jing W, Chen G et al (2019) A survey of big data analytics for smart forestry. IEEE Access 7:46621–46636. https://doi.org/10.1109/ACCESS.2019.2907999

Chapter 12

Machine Learning and Data Mining Algorithms for Geospatial Big Data

Abstract This chapter focuses on strategies to extend and adapt traditional machine learning algorithms for remote sensing and geospatial big data. Ten major strategies are discussed: distributed and parallel learning, data reduction and approximate computing, feature selection and feature extraction, incremental learning, deep learning, ensemble analysis, granular learning, stochastic algorithms, transfer learning, and active learning.

Keywords Machine learning · Parallel learning · Approximate computing · Feature selection · Incremental learning · Deep learning · Ensemble analysis · Granular learning · Stochastic algorithm · Transfer learning · Active learning

Classic machine learning algorithms include algorithms for clustering, decision trees, and association rules. To be applicable to remote sensing big data, these classic algorithms may need to be adapted to deal with challenges rooted in the different dimensions of big data, that is, the five Vs (Volume, Velocity, Variety, Veracity, and Value). Big data processing primarily has to deal with five critical issues (Qiu et al. 2016): (1) learning from a large volume of data, (2) learning with a large variety of data, (3) learning with the high velocity of streaming data, (4) learning from uncertain and incomplete data of varying veracity, and (5) learning the value of data with low value density and meaning diversity.

Many machine learning and data mining algorithms have been developed without considering the nature of remote sensing big data. Most common geospatial and temporal algorithms were historically developed for sparse data scenarios. For example, Kriging was developed for interpolating geological properties from a limited number of core samples (Oliver and Webster 1990). Classical algorithms seldom consider all the big data challenges; instead, they focus on maximizing the information extracted from a relatively small sample by making appropriate assumptions. Big data algorithms, in contrast, commonly need to deal with the population directly instead of samples. To deal with big data, many of the classical algorithms need to be adapted.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/ Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_12


There are two major categories of approaches to applying these algorithms to big data: (1) bigger and faster computers and storage and (2) data reduction through preprocessing to reduce either the data volume or its complexity. The former is expensive in terms of space and time complexity when big data is considered. The latter may be inexpensive in time and space complexity if a significant reduction is achieved during big data analytics. The balance between information content and data reduction needs to be considered for different algorithm design strategies and applications.

Several studies have surveyed strategies to adapt machine learning and data mining algorithms for big data, especially geospatial and remote sensing big data, and they take slightly different perspectives on algorithm adaptation. Table 12.1 lists several reviews of algorithm adaptation methods for big data analytics. This book covers ten methods for adapting or extending traditional machine learning and data mining algorithms for big data analytics: (1) distributed and parallel learning, (2) data reduction and approximate computing, (3) feature selection and feature extraction, (4) incremental learning, (5) deep learning, (6) ensemble analysis, (7) granular learning, (8) stochastic algorithms, (9) transfer learning, and (10) active learning.

Table 12.1 Strategies for adapting machine learning and data mining algorithms for remote sensing big data

Reference | Classes of strategies
van Zyl (2014) | Listed 12 approaches to adapt algorithms for geospatial big data: (1) divide and conquer, (2) subsampling, (3) aggregation, (4) filtering, (5) online algorithms, (6) streaming algorithms, (7) iterative algorithms, (8) relaxation, (9) convergent algorithms, (10) stochastic algorithms, (11) batch versus online algorithms, and (12) dimensionality reduction.
Wang and He (2016) | Seven strategies for big data analytics: (1) divide and conquer, (2) parallelism, (3) incremental learning, (4) sampling, (5) granular computing, (6) feature selection, and (7) hierarchical classes.
Qiu et al. (2016) | Six advanced learning methods for solving big data problems: (1) representation learning (Bengio et al. 2013), (2) deep learning, (3) distributed and parallel learning, (4) transfer learning (Pan and Yang 2010; Zhuang et al. 2020), (5) active learning, and (6) kernel-based learning.
Shu (2016) | Six techniques for big data analytics: (1) ensemble analysis, (2) association analysis, (3) high-dimensional analysis, (4) deep analysis, (5) precision analysis, and (6) divide-and-conquer analysis.
Hariri et al. (2019) | Five machine learning techniques for big data analytics: (1) feature learning, (2) deep learning, (3) transfer learning, (4) distributed learning, and (5) active learning.


12.1 Distributed and Parallel Learning

Traditional algorithm design strategies work with relatively manageable datasets, in terms of the capacity of either one computer or a cluster. Small problem domains may be handled by brute force, that is, by exhaustively iterating over every component. For large datasets and complex problems, divide and conquer may be used to work on small component problems if all sub-problems are completely separable. Dynamic programming applies when sub-problems overlap.

Big data is often managed in distributed systems because of its sheer volume. To deal with the large volume of data, parallelism in algorithm design is required (Kumar and Mohbey 2019). An algorithm for big data needs to decompose the data into workable partitions and process them in parallel to achieve the desired timeliness. Big data processing is facilitated by big data frameworks, such as Hadoop and Spark, and algorithm design needs to consider the programming model that such frameworks use to handle big data. For example, MapReduce is a programming paradigm that works well with the “divide-and-conquer” strategy (Shu 2016; Hohl et al. 2020). Figure 12.1 illustrates how the divide-and-conquer strategy is applied to process big data with the MapReduce model. MapReduce manages information as key-value pairs. Distributed big data is processed in four stages. First, the data are split based on their domain or other computing characteristics. Second, the map stage maps the inputs into key-value groups to be processed. Third, the shuffle stage sorts the intermediate data by key for processing. Finally, the reduce stage processes the grouped data to produce outputs. An algorithm for processing big data managed and modeled in such a framework should take this paradigm into account at the design stage.
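The four stages described above (split, map, shuffle, reduce) can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch and not the Hadoop or Spark API: the function names, the thread pool standing in for distributed workers, and the toy (band, pixel value) records are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split stage: cut the input into roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(partition):
    """Map stage: emit (key, value) pairs; here key = band, value = pixel value."""
    return [(band, value) for band, value in partition]


def shuffle(mapped_parts):
    """Shuffle stage: group all values that share a key, sorted by key."""
    groups = defaultdict(list)
    for part in mapped_parts:
        for key, value in part:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_group(values):
    """Reduce stage: collapse one group into a single output (the mean)."""
    return sum(values) / len(values)


def band_means(records, n_parts=3):
    parts = split(records, n_parts)
    with ThreadPoolExecutor() as pool:      # partitions are mapped in parallel
        mapped = list(pool.map(map_partition, parts))
    grouped = shuffle(mapped)
    return {key: reduce_group(vals) for key, vals in grouped.items()}


# Hypothetical toy input: (band name, pixel value) pairs.
pixels = [("red", 10), ("nir", 40), ("red", 20), ("nir", 60), ("red", 30)]
result = band_means(pixels)
```

In a real framework the partitions would live on different nodes and the shuffle would move data across the network, which is why the choice of key matters for performance.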
Data partitioning is one of the strategies for improving performance by analyzing sub-problems of the problem, either using divide and conquer for completely separable sub-problems or using dynamic programming for overlapping sub-problems (Mahmud et al. 2020). Based on the relationships among the decomposed partitions, data partitioning schemes may be classified into three major groups: horizontal partitioning, vertical partitioning, and functional partitioning (Mahmud et al. 2020). Horizontal partitioning splits big data into subcomponents that each contain the same attributes (Siddiqa et al. 2017; Hohl et al. 2020; Mahmud et al. 2020). For remotely sensed imagery, each partition has all the bands. Vertical partitioning splits big data into subgroups by columns or attributes (Siddiqa et al. 2017; Mahmud et al. 2020). For example, each band of hyperspectral remotely sensed data may be treated as one subgroup and managed in one partition for easy processing. Functional partitioning is a hybrid approach that splits data into subgroups by a combination of columns and rows (Huang and Lai 2016; Hohl et al. 2020; Mahmud et al. 2020). It is also called domain decomposition in geospatial big data handling (Li 2020). Common geospatial decompositions include spatial decomposition and temporal decomposition. Proper domain decomposition needs to be considered to avoid unnecessary communication among distributed sub-domains during computing (Li 2020).

Fig. 12.1 MapReduce programming model for big data
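The three partitioning schemes can be contrasted on a toy multiband image. The sketch below assumes each pixel is stored as a tuple of band values; the function names and the band groupings are hypothetical and chosen only to illustrate the definitions above.

```python
def horizontal_partition(pixels, n_parts):
    """Split by pixel (record); every partition keeps all bands."""
    size = max(1, -(-len(pixels) // n_parts))   # ceiling division
    return [pixels[i:i + size] for i in range(0, len(pixels), size)]


def vertical_partition(pixels, n_bands):
    """Split by band (attribute); partition b holds band b of every pixel."""
    return [[p[b] for p in pixels] for b in range(n_bands)]


def functional_partition(pixels, n_parts, band_groups):
    """Hybrid: split by blocks of pixels and by groups of bands."""
    return [
        [[p[b] for b in group] for p in block]
        for block in horizontal_partition(pixels, n_parts)
        for group in band_groups
    ]


# Hypothetical image: 4 pixels with 3 bands each.
image = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]
h = horizontal_partition(image, 2)
v = vertical_partition(image, 3)
f = functional_partition(image, 2, [(0, 1), (2,)])
```

For imagery, the pixel blocks would normally follow spatial tiles or acquisition dates rather than list order, so that neighboring pixels needed by the same computation stay in the same partition.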

12.2 Data Reduction and Approximate Computing

Data reduction is necessary when dealing with big data because of its sheer volume. The time and space complexity of processing all data instances may be too high to be feasible in practice. Data reduction not only reduces the data to be analyzed but also helps eliminate noisy, redundant, or irrelevant data. The primary strategy for data instance reduction is sampling, which is also a major technique for approximate computing of big data.


12.2.1 Sampling

Samples are used to represent the population distribution in big data analytics, which significantly reduces the amount of data to be processed. This is a straightforward approach to approximate computing through data instance reduction, and it works when the distribution model underlying the selected samples matches that of the population well. Data sampling methods include random sampling, Bernoulli sampling, stratified sampling, reservoir sampling, and bootstrapping (Mahmud et al. 2020). When these sampling methods are used in big data analytics, algorithm design should consider the big data computing framework and its specific architecture. For example, a big data analytic system based on Apache Hadoop should consider the block-based architecture and adapt the sampling method to work with the distributed filesystem (Mahmud et al. 2020).
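Of the sampling methods listed above, reservoir sampling is the one designed for streams whose length is not known in advance. A minimal sketch of the classic single-pass version follows; the function name, the seed parameter, and the integer stream are assumptions for illustration, and a Hadoop-based system would still need to adapt the idea to its block layout.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # item i is kept with probability k / (i + 1)
    return reservoir


# Hypothetical stream of 100,000 observation identifiers.
sample = reservoir_sample(range(100_000), k=5, seed=42)
```

Only k items are held in memory at any time, which is what makes the method attractive for data instance reduction on large streams.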

12.2.2 Approximate Computing

Approximate computing is a general computational approach that produces an approximate result. Strategies may include approximate circuits at the hardware level, approximate storage at the data storage level, and processing approximation at the software level (Mittal 2016; Barua and Mondal 2019). In the context of remote sensing big data and algorithm design, approximate computing is achieved by data instance reduction (e.g., sampling) and approximate algorithms (Ma and Huai 2019). Sampling is often used in approximate computing to reduce the data to be computed while the underlying model is approximated. The core problem for sampling is how to find a representative subset of the input data that reduces the data size while retaining information. ApproxHadoop (Goiri et al. 2015), MRPR (Luengo et al. 2020a), and Paraprox (Samadi et al. 2014) are example approximation frameworks that use data sampling for processing large data.

ApproxHadoop is a framework that consists of multistage sampling implementations for the Map and Reduce processes (Goiri et al. 2015). Multistage sampling theory is applied in sampling the input data for approximation, and based on it certain tasks are dropped without execution. The sampling-based approximation algorithms can be applied in ranking, count, aggregation, and frequency computation.

MRPR stands for MapReduce for prototype reduction, a framework to support prototype reduction under the MapReduce big data computational model (Luengo et al. 2020a). The framework adopts a prototype reduction (PR) method that reduces samples through representative instance selection with an instance-based classifier that uses a distance measure, such as k-nearest neighbors (KNN). MRPR is built on the MapReduce model to realize parallelism and handle big data. With MRPR, the classification problem for big data can be solved with reduced computing time and storage requirements.
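The idea of sampling-based approximate aggregation can be illustrated with a much simpler scheme than the multistage sampling used by ApproxHadoop. The sketch below estimates a mean from a simple random sample and reports an approximate 95% error bound from the normal approximation. It is not the ApproxHadoop algorithm; the function name, the sample fraction, and the synthetic data are assumptions.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean from a simple random sample, with a ~95% error bound."""
    rng = random.Random(seed)
    n = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, 1.96 * stderr          # normal approximation, no finite-population correction


# Hypothetical pixel values drawn from a synthetic distribution.
rng = random.Random(1)
data = [rng.gauss(100.0, 15.0) for _ in range(200_000)]
estimate, margin = approximate_mean(data, sample_fraction=0.01)
exact = statistics.fmean(data)
```

Here only 1% of the values are read, and the reported margin tells the analyst how much accuracy was traded for the reduced work.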
Paraprox is a framework to support approximate computing by applying different sampling reductions and approximations to different underlying data-parallel patterns (Samadi et al. 2014). The Paraprox framework recognizes six data-parallel patterns (Samadi et al. 2014): (1) map has a one-to-one relationship with regular memory access, (2) scatter/gather has a one-to-one map but randomized memory access, (3) reduction has multiple inputs to generate one output, (4) scan applies an associative function to an input array and generates an output array, (5) stencil applies a function to its corresponding input array and its neighbors to generate an output element, and (6) partition is similar to stencil but with multiple partitions. Four approximation techniques have been applied to deal with these patterns: (1) approximate memoization is applied for the map and scatter/gather patterns, (2) sampling is used for reduction, (3) approximate array replication is applied for stencil and partition, and (4) operation on a subset is applied for scan.

Approximate computing can be implemented as a specific approximate algorithm for certain groups of problems, as hardware systems, or as software frameworks. Approximate computing strategies include precision scaling on inputs, loop perforation (skipping some iterations), load value approximation, memoization, task dropping, memory access skipping, program accuracy reduction, approximate hardware, voltage scaling, refresh rate reduction, imperfect input/output, lossy compression, branch divergence avoidance, and neural network hardware accelerators (Mittal 2016). Specialized compilers and toolchains can be used to generate approximating execution code that results in approximate instruction processing (Barua and Mondal 2019).
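Loop perforation, one of the software-level strategies listed above, is simple enough to show directly: the loop visits only every stride-th element and accepts the resulting error. The function name and the synthetic pixel values below are assumptions for illustration.

```python
def mean_brightness(pixels, stride=1):
    """Average pixel value; stride > 1 perforates the loop (skips iterations)."""
    total = 0.0
    count = 0
    for i in range(0, len(pixels), stride):
        total += pixels[i]
        count += 1
    return total / count


# Hypothetical 8-bit image values.
pixels = [(i * 37) % 256 for i in range(100_000)]
exact = mean_brightness(pixels)
approx = mean_brightness(pixels, stride=10)     # about 10% of the iterations
```

Whether the skipped iterations matter depends on the data: a regular stride can interact badly with periodic structure in an image, so the acceptable stride is usually determined empirically.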

12.3 Feature Selection and Feature Extraction

The growth of big data leads to a significant increase in dimensionality. The growth of dimensionality is closely related to the variety of big data, which concerns data from different sources. The number of features can quickly grow into the millions. For example, KDD2010, Webspam, and Gas sensor in the UCI machine learning data repositories each have more than one million features (Bolón-Canedo et al. 2016; Luengo et al. 2020b). Such dimensionality can quickly exceed the capacity of computing resources, both computing power and storage. In addition, certain features may act as noise for a given machine learning algorithm. There are broadly two types of dimension reduction approaches: feature selection and feature extraction. Feature selection selects a subset of the original features that represents the problem without significant degradation of performance. Feature extraction derives a set of informative and nonredundant features by transforming the original features. Figure 12.2 shows the general process to select a subset of features from data.

Fig. 12.2 General procedure for feature selection and feature extraction

There are three approaches to evaluate the performance of selected feature subsets: the filter-based approach, the wrapper-based approach, and the embedded method (Chandrashekar and Sahin 2014). The filter-based approach uses a general model or criterion, independent of the learning algorithm, to select features. Ranking methods are filter-based approaches that filter out less relevant features before a classification algorithm is applied. Ranking criteria include correlation among features (e.g., the Pearson correlation coefficient) and information-theoretic criteria (e.g., Kullback-Leibler divergence, Mutual Information Maximization (Information Gain), Mutual Information Feature Selection, Minimum Redundancy Maximum Relevance, and Conditional Mutual Information) (Li et al. 2018; AlNuaimi et al. 2022).

The wrapper-based approach uses the actual learning algorithm to evaluate performance when selecting features. Rapid image information mining (RIIM) is an example of wrapper-based feature selection that uses a genetic algorithm to develop the wrapper to select features for a support vector machine (SVM) (Durbha et al. 2010). Because of the expensive computation (i.e., the large search space of 2^d subsets for d features) (Li et al. 2018), the wrapper-based approach is not as popular as the filter-based approach.

The embedded approach incorporates feature selection into the training of the machine learning algorithm. The feature selection is adapted specifically for the algorithm and becomes an intrinsic part of it. SVM-RFE (Support Vector Machine Recursive Feature Elimination) is an example of an embedded feature selection algorithm that incorporates a pruning method based on the weights in the discrimination functions (Guyon et al. 2002). In SVM-RFE, the weights are proportionally related to the correlation coefficients that are used to prune (eliminate) less relevant features. Feature Selection-Perceptron (FS-P) is another example of an embedded feature selection algorithm; it relies on weight pruning with a neural network as the classification algorithm (Mejía-Lavalle et al. 2006). The neural network drops attributes with small interconnection weights.

The following are some of the challenges for feature selection and feature extraction in big data analytics (Li and Liu 2017; Tiwari and Rana 2021):
1. Distributed computing environment: A large volume of data may have to be processed in a distributed computing environment.

2. Curse of dimensionality: The process has to deal with extremely high dimensionality.
3. Class imbalance: The amount of available training data differs among classes, so the ratio between features and training data varies from class to class.
4. Streaming data: Data is incrementally streamed in.

The algorithms for feature selection and feature extraction need to be adapted to work with big data. MR-EFS (MapReduce for Evolutionary Feature Selection) (Peralta et al. 2015) is an example feature selection framework that adapts feature selection to big data under the Hadoop MapReduce computational model. A genetic algorithm, the CHC Adaptive Search Algorithm (Eshelman 1991), is adopted to select features. An information theory-based feature selection framework on Apache Spark is another example that deals with high dimensionality and large data volume (Ramírez-Gallego et al. 2018). Apache Spark is adopted to support cluster computing across distributed cloud environments to handle big data. Information theory-based algorithms, such as minimum redundancy-maximum relevance (mRMR), conditional mutual information (CMI), and joint mutual information (JMI), are used to select features (AlNuaimi et al. 2022). The framework can handle millions of instances with high dimensionality.
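The filter-based ranking approach described in this section can be sketched with the Pearson correlation coefficient as the ranking criterion. Each feature is scored on its own against the class label, and only the highest-scoring features are passed to the classifier. The feature names, values, and labels below are hypothetical, and absolute correlation is just one of several possible criteria.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(columns, target, top_k):
    """Filter approach: score each feature independently, keep the top_k."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three features and a binary class label.
features = {
    "ndvi":  [0.1, 0.2, 0.8, 0.9, 0.7, 0.15],
    "noise": [0.5, 0.1, 0.4, 0.2, 0.6, 0.3],
    "red":   [0.9, 0.8, 0.3, 0.2, 0.25, 0.85],
}
labels = [0, 0, 1, 1, 1, 0]
selected = rank_features(features, labels, top_k=2)
```

Because every feature is scored independently, the scoring step is trivially parallel, which is one reason filter methods adapt well to frameworks such as MapReduce or Spark. The price is that redundancy between the selected features is ignored, which criteria such as mRMR are designed to address.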

12.4 Incremental Learning

One of the dimensions of big data is its velocity, which describes the fast streaming and flowing of data. For remote sensing big data, many sensors at different levels collect data continuously. This requires machine learning algorithms that can handle streamed data as it arrives to provide information extraction in real time or near real time. Incremental learning algorithms are specialized in dealing with streaming data. Figure 12.3 shows the process for incremental learning. As the data flow in, new training data is extracted from the data stream to update the model. The updated model should still be valid for all classes seen so far. Incremental learning should meet the following requirements: (1) the model should be updated with new training data, (2) only new training data are used for training at each step, (3) the updated model should still be valid for all data previously seen, (4) the updated model should be able to add new classes from new training data, and (5) there should be no assumption of underlying models (Polikar et al. 2002; Yang et al. 2019).

Fig. 12.3 Incremental learning

Many incremental learning algorithms have been prototyped during the past decades. Most of them adapt batch-processing machine learning algorithms, including support vector machines (SVM), Bayesian classifiers, decision trees, random forests, artificial neural networks, ensemble classifiers (He et al. 2011; Gepperth and Hammer 2016; Losing et al. 2018; Yang et al. 2019; Luo et al. 2020), and clustering algorithms (Chaudhari et al. 2019). Incremental learning with SVM was reported with an algorithm that leverages the group effect of support vectors (Syed et al. 1999a, b). Instead of keeping all samples seen, the incremental algorithm keeps the learned support vectors. The incrementally trained SVM achieved accuracy comparable to that of the SVM trained with all the training samples. An exact incremental online learning algorithm for SVM was later developed (Cauwenberghs and Poggio 2001). This incremental SVM (ISVM) can be as accurate as batch processing given that all previously seen data are modeled by the set of candidate vectors. LASVM (Bordes et al. 2005) is an incremental learning algorithm that builds on sequential minimal optimization (Platt 1998). Unlike ISVM, LASVM does not keep “candidate vectors,” which leads to an approximate solution. A recent review of incremental SVM learning algorithms can be found in Lawal (2019). The improvements in incremental SVM learning concern reduced computational complexity (space or time), concept shift, big data, and long-lived streams.

Learn++ (Polikar et al. 2002) is an incremental learning algorithm that uses the concept of ensemble classifiers. The incremental extreme learning machine (ELM) (Ding et al. 2015) is a neural network for incremental learning, either one sample at a time or in batches. Ensemble methods are also popular for realizing incremental learning because of the additive effect of multiple weak classifiers. Example ensemble incremental learning algorithms are online random forest (Saffari et al. 2009) and incremental random forest (Ma and Ben-Arie 2014).

One of the major challenges for incremental learning on a stream of big data is concept drift (Ditzler et al. 2015; Gepperth and Hammer 2016), that is, the underlying distribution of the data may shift over a long stream (Gama et al. 2014; Tennant et al. 2017; Lu et al. 2018). In other words, the data in the stream are in no way guaranteed to be independent and identically distributed. A learned model may become inapplicable to new data streamed in.
This is one of the major areas on which the development of new incremental learning algorithms has focused (Alex and Nayahi 2020). Algorithm design needs to be adapted to deal with concept drift. For example, Hoeffding tree (Pfahringer et al. 2007), G-eRules (Le et al. 2014), and VFDR (Very Fast Decision Rule) (Gama et al. 2014) take concept drift into consideration in real-time data stream processing. MC-NN (micro-cluster nearest neighbor) is a parallel adaptive data stream classifier that adapts to concept drift (Tennant et al. 2017). Other challenges for incremental learning on streams of big data include (He and Garcia 2009; He et al. 2011; Gepperth and Hammer 2016): (1) adapting models over time, (2) the stability-plasticity dilemma, (3) large volume, (4) difficulty of evaluation, and (5) imbalanced data.

There are several trends in the development of incremental learning algorithms for streams of big data. Incremental learning with ensembles of adaptive models or kernels is one of the approaches to cope with complexity and concept shift, for example, ADAIN (He et al. 2011). Deep learning is adopted to handle complexity in a big data stream, for example, GeoBoost with a convolutional neural network (CNN) (Yang and Tang 2020).
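Several of the requirements for incremental learning listed in this section (updating with new data only, keeping no history, and admitting new classes) can be seen in a small Bayesian classifier that maintains running statistics. The sketch below is an illustrative incremental Gaussian naive Bayes, not any of the cited algorithms. The class and method names, the variance smoothing constant, and the toy stream are assumptions, and concept drift is not handled.

```python
import math
from collections import defaultdict


class IncrementalGaussianNB:
    """Gaussian naive Bayes updated one sample at a time (no stored history)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = {}
        self.m2 = {}    # running sum of squared deviations (Welford's method)

    def learn_one(self, x, label):
        if label not in self.mean:              # a new class may appear at any time
            self.mean[label] = [0.0] * len(x)
            self.m2[label] = [0.0] * len(x)
        self.count[label] += 1
        n = self.count[label]
        for i, value in enumerate(x):
            delta = value - self.mean[label][i]
            self.mean[label][i] += delta / n
            self.m2[label][i] += delta * (value - self.mean[label][i])

    def predict_one(self, x):
        total = sum(self.count.values())
        best, best_score = None, -math.inf
        for label, n in self.count.items():
            score = math.log(n / total)         # log prior
            for i, value in enumerate(x):
                var = self.m2[label][i] / n + 1e-6      # smoothed variance
                score -= 0.5 * (math.log(2 * math.pi * var)
                                + (value - self.mean[label][i]) ** 2 / var)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical stream of (feature vector, class label) pairs.
model = IncrementalGaussianNB()
stream = [([0.1, 0.2], "water"), ([0.2, 0.1], "water"),
          ([0.8, 0.9], "crop"), ([0.9, 0.8], "crop")]
for x, y in stream:
    model.learn_one(x, y)
model.learn_one([0.5, 5.0], "urban")    # class first seen later in the stream
model.learn_one([0.6, 5.2], "urban")
```

Because all past samples are weighted equally, this model adapts slowly when the distribution shifts. Drift-aware algorithms typically add a forgetting mechanism or a drift detector on top of such running statistics.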

12.5 Deep Learning

Deep learning is a set of neural networks with deep architectures that can be used to model complex problems (LeCun et al. 2015). Representative architectures include the stacked autoencoder (SAE), deep belief network (DBN), convolutional neural network (CNN), and recurrent neural network (RNN) (LeCun et al. 2015; Zhang et al. 2018). An SAE consists of stacked autoencoders and finds its applications in clustering or unsupervised classification (Vincent et al. 2010). DBN is a feedforward neural network with many layers (Hinton et al. 2006). Deep Boltzmann Machine (DBM) is a classical DBN (Hinton and Salakhutdinov 2012). CNN consists of convolutional layers, pooling layers, and fully connected layers that can effectively analyze data at multiple scales (LeCun et al. 1998). It is widely used in extracting objects from high-resolution imagery (Maggiori et al. 2017; Tong et al. 2020). RNN is a stacked neural network that takes the outputs of a previous neural network as inputs of the next neural network in a sequential manner (Hochreiter and Schmidhuber 1997). This architecture effectively models temporal sequences and finds extensive applications in time series prediction (Tealab 2018) and speech recognition (Graves et al. 2013).

The challenges for deep learning come from the different dimensions of big data, while the characteristics of big data also provide opportunities for training deep neural networks (Zhang et al. 2018). First, the volume of big data poses a challenge in training a deep learning machine. On the other hand, the volume of big data makes it possible to train deep neural networks with up to millions of parameters. Second, the variety of big data is a big challenge to machine learning, while deep neural networks can approximate complex models with an increased number of nodes and layers. Third, the velocity of big data introduces challenges in timely processing and effectively modeling long time series.
Nonstationary characteristics of streaming big data invalidate the applicability of conventional learning algorithms. Deep connectionist models provide the flexibility to be incrementally adapted and trained. Fourth, the veracity of big data challenges learning algorithms to deal with imperfect and noisy data. Deep neural networks have specialized models to tolerate noise and deal with imbalanced data (Johnson and Khoshgoftaar 2019).

To deal with big data, deep neural networks have been advanced to work with clusters of graphics processing units (GPUs) and central processing units (CPUs) in distributed and cloud computing environments (Chen and Lin 2014; Zhang et al. 2018). Large-scale versions of deep neural networks have been emerging to address all of these aspects of big data. Distributed deep neural networks have been developed to run across cloud, edge, and end devices (Teerapittayanon et al. 2017). DBN training with GPUs speeds up the process up to 15-fold and scales up the model to over 100 million free parameters (Raina et al. 2009). A multisource deep learning model has been applied to leverage the variety of multiple data sources for improved classification on fused data (Ienco et al. 2019).
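
To illustrate the convolution and pooling layers that a CNN is built from, the following is a minimal pure-Python sketch rather than a trainable network. The function names, the 5 x 5 image chip, and the edge-detecting kernel are all assumptions made for illustration; a practical model would learn its kernels from data and use a deep learning framework.

```python
def conv2d_valid(image, kernel):
    """Cross-correlate a 2D list `image` with a 2D list `kernel` (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical single-band image chip with a dark-to-bright edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # hand-picked kernel that responds to left-to-right brightening

features = relu(conv2d_valid(image, kernel))  # 4 x 4 feature map
pooled = max_pool2d(features, size=2)         # 2 x 2 summary at a coarser scale
print(pooled)
```

Pooling keeps the strongest response in each window, which is one way the architecture summarizes features at a coarser scale.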

12.6 Ensemble Analysis

Ensemble learning trains a set of weak models and combines their results to produce the final result. Figure 12.4 shows a simplified structure of the ensemble learning algorithm for classification. Weak classifiers can be created to

Fig. 12.4 Ensemble learning

12 Machine Learning and Data Mining Algorithms for Geospatial Big Data

achieve diversity by input manipulation (e.g., bagging and random forest), output manipulation (e.g., one-versus-all decision trees, error-correcting output codes), base learner manipulation (e.g., adaptive-size Hoeffding trees), and heterogeneous base learners (e.g., HSMiner) (Gomes et al. 2017; Dong et al. 2020). The strength of ensemble learning has been shown in three aspects: class imbalance, concept drift, and the curse of dimensionality (Sagi and Rokach 2018). Machine learning from big data has to deal with all three issues. Concept shift is common in streams of big data; such a stream is also called a nonstationary stream. Ensemble learning algorithms have been developed to deal with concept shift by dynamically adding or removing component weak classifiers (Krawczyk et al. 2017; Ghomeshi et al. 2019). Ensemble classifiers are regarded as one of the effective methods for imbalanced learning in big data analytics (Juez-Gil et al. 2021). There are two approaches to generating ensembles for imbalanced data: one is resampling or rebalancing at the data level (e.g., bagging and boosting) (Galar et al. 2012), and the other is the use of heterogeneous ensembles (Ghaderi Zefrehi and Altınçay 2020). Resampling methods for ensemble learning include random undersampling (RUS), random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random oversampling examples (ROSE), and random balance (RB) (Juez-Gil et al. 2021).
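
The combination of data-level rebalancing and voting described above can be sketched as follows. This is an illustrative toy, not one of the cited methods: each weak classifier is an assumed one-dimensional two-class threshold rule trained on its own random undersample (RUS) of the data, and the ensemble predicts by majority vote. The function names, the class labels "other" and "water", and the sample values are hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, rng):
    """RUS: randomly drop samples of larger classes until every class matches the smallest one."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    return balanced


def train_stump(pairs):
    """Weak two-class learner on 1-D data: threshold halfway between the two class means."""
    means = {}
    for label in {y for _, y in pairs}:
        values = [x for x, y in pairs if y == label]
        means[label] = sum(values) / len(values)
    (low_label, low_mean), (high_label, high_mean) = sorted(means.items(), key=lambda kv: kv[1])
    threshold = (low_mean + high_mean) / 2
    return lambda x: high_label if x > threshold else low_label


def train_ensemble(samples, labels, n_models=11, seed=0):
    """Input manipulation: every weak classifier sees a different balanced subsample."""
    rng = random.Random(seed)
    return [train_stump(random_undersample(samples, labels, rng)) for _ in range(n_models)]


def predict(ensemble, x):
    """Combine the weak classifiers by majority vote."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


# Hypothetical imbalanced data: 40 majority samples and 4 minority samples.
samples = [0.1 * i for i in range(40)] + [8.0, 8.5, 9.0, 9.5]
labels = ["other"] * 40 + ["water"] * 4

ensemble = train_ensemble(samples, labels)
print(predict(ensemble, 1.0), predict(ensemble, 9.0))
```

Because every member is trained on a balanced subsample, the minority class is not outvoted during training, while the vote across members reduces the variance introduced by discarding majority samples.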

12.7 Granular Computing

Granular computing is an information-processing paradigm that processes data at different abstraction levels (Yao et al. 2013). It can be viewed as three components: granulation, granularity, and hierarchical structures (Yao 2016). The purpose of applying granular computing in big data analytics is often the reduction of dimensionality, redundancy, and storage. Granulation can be applied at different stages of machine learning on big data, that is, to the input variables, in preprocessing, and in the machine learning approaches (Peters and Weber 2016).

For input variables, the direct approach to granulation is adopting different discretization or quantization schemes. Increasing the interval size results in coarser scales of the input variables and hence a different granulation of the input variables. In the preprocessing stage, variable transformation or variable aggregation may be adopted to achieve different levels of granulation. Transformation methods include principal component analysis and factor analysis. The granules are formed by new variables computed from the original data. The components may yield a reduced dataset that represents most of, or a specific aspect of, the original data, which results in dimensionality reduction. Variable aggregation may be built by using simple linear clustering approaches to group similar variables (Pedrycz 2018). In remote sensing, granulation may be achieved by reducing the spatial or temporal resolution, which leads to sparser data. Autocorrelation may also be used to measure the similarity of geospatial data across locations. A cluster of similar pixels may be aggregated into an area that represents all the underlying pixels, which would significantly reduce the number of entities to be analyzed. If a logarithmic reduction of entities is achieved, the computational complexity would be significantly reduced from a typical O(N^2) to O(N log N) (van Zyl 2014).

In the processing stage of machine learning, different approaches may be applied to support granular computing. The machine learning models can be sets, fuzzy sets, shadowed sets, probability-based information granules (quotient space model), rough sets, lattice models (or cloud space models), association analysis models, and classification models (Han and Lin 2009; Yao et al. 2013; Peters and Weber 2016; Yao 2016; Pal 2020; Aydav and Minz 2020; Xiaona et al. 2020).
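
Two of the granulation steps described in this section, quantizing an input variable and reducing spatial resolution by aggregation, can be sketched in a few lines. The function names and the sample values are assumptions for illustration only; block averaging is just one possible aggregation rule.

```python
def quantize(values, interval):
    """Map each value to the lower bound of its interval; larger intervals give coarser granules."""
    return [interval * (v // interval) for v in values]


def block_mean(grid, block):
    """Aggregate a 2D grid into block x block granules by averaging (coarser spatial resolution)."""
    rows = len(grid) // block
    cols = len(grid[0]) // block
    return [
        [
            sum(grid[r * block + i][c * block + j] for i in range(block) for j in range(block))
            / (block * block)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical input variable quantized with an interval size of 10.
print(quantize([3, 12, 27], 10))   # [0, 10, 20]

# Hypothetical 4 x 4 raster aggregated to 2 x 2: 16 entities become 4.
raster = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
coarse = block_mean(raster, 2)
print(coarse)                      # [[1.0, 2.0], [3.0, 4.0]]
```

In this made-up raster each block is homogeneous, so aggregation loses no information; on real imagery the choice of block size trades detail against the number of entities to analyze.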

12.8 Stochastic Algorithms

Stochastic algorithms are a family of iterative algorithms with randomization (Nickson et al. 2014; Lai and Yuan 2021). They are used either for root-finding or for optimization of a regression function. There are two types of randomized, or probabilistic, algorithms (van Zyl 2014). One category uses random input and always returns the correct answer, with a running time that varies with the random choices but is finite in expectation (Las Vegas algorithms). The expectation of finite runtime is taken over the space of random information or entropy (Luby et al. 1993). The other category (commonly called Monte Carlo algorithms) does not guarantee a correct result within a limited time; the correct solution is reached only given unlimited time (Lai and Yuan 2021). The Robbins-Monro paper laid the foundation for the optimization of a univariate regression function (Robbins and Monro 1951). More recently, stochastic approximation has evolved to support machine learning of regression models on big data. Applications include high-dimensional sparse linear stochastic regression models for time series big data (Basu and Michailidis 2015), recursive gradient boosting for nonlinear stochastic regression (Friedman 2001; Lai and Yuan 2021), and stochastic approximation for particle swarm optimization (Yuan and Yin 2015).
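
The Robbins-Monro scheme mentioned above can be sketched as a short root-finding loop. The sketch assumes a hypothetical increasing regression function M(x) = 2x - 4 observed with Gaussian noise, and uses step sizes a_n = 1/n, one common choice whose sum diverges while the sum of squares converges. The function and variable names are illustrative, not taken from the cited papers.

```python
import random


def robbins_monro(noisy_measure, target, x0, n_steps):
    """Iterate x_{n+1} = x_n - a_n * (y_n - target) with step sizes a_n = 1/n."""
    x = x0
    for n in range(1, n_steps + 1):
        y = noisy_measure(x)            # noisy observation of the regression function at x
        x -= (1.0 / n) * (y - target)   # move against the observed deviation from the target
    return x


rng = random.Random(42)  # fixed seed so the sketch is reproducible


def noisy_measure(x):
    # Hypothetical regression function M(x) = 2x - 4 plus unit-variance Gaussian noise.
    return 2.0 * x - 4.0 + rng.gauss(0.0, 1.0)


root = robbins_monro(noisy_measure, target=0.0, x0=0.0, n_steps=5000)
print(root)  # close to the true root x = 2
```

Each iteration uses a single noisy observation, which is why the same idea scales to streaming or very large datasets where evaluating the exact function is not feasible.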

12.9 Transfer Learning

Transfer learning is a suite of algorithms that apply the knowledge gained on one problem to a different problem in a similar domain (Pan and Yang 2010; Zhuang et al. 2020). In general, if the target domain is completely different from the source domain where the knowledge is gained, transfer learning is unlikely to succeed. Even so, similarities between the source domain and the target domain do not guarantee the success of transfer learning. In classification of remote sensing big data, one of the main problems is finding sufficient labeled instances to train the machine learning algorithms. Finding labeled training instances can be very expensive and sometimes even impossible. Transfer learning can help reduce the number of labeled instances needed for training.

Transfer learning algorithms can be grouped into different categories. Comprehensive classifications can be found in Zhuang et al. (2020). Based on the relationship between the source domain and the target domain, transfer learning algorithms can be grouped into inductive transfer learning, transductive transfer learning, and unsupervised transfer learning (Pan and Yang 2010). Based on “what to transfer,” that is, the approach, transfer learning algorithms can be categorized into four cases: instance-transfer, feature-representation-transfer, parameter-transfer, and relational-knowledge-transfer (Pan and Yang 2010).

Transfer learning can be applied to big data along with deep learning. Deep transfer learning approaches include instance-based (reusing instances in the source domain through a specific weight adjustment strategy), mapping-based (mapping both the source and target domains into a new space), network-based (reusing partial networks pretrained in the source domain), and adversarial-based (using adversarial technology to find transferable representations from both the source and target domains) (Tan et al. 2018). Deep transfer learning is applied in machine learning and information retrieval from remote sensing big data (Liu et al. 2020). Transfer learning is also useful in improving the performance of learning from streaming big data or in nonstationary data environments (Minku 2019). Dynamic Cross-company Mapped Model Learning (Dycom) is an online inductive transfer learning approach for regression from a big data stream (Minku and Yao 2014). Diversity for Dealing with Drifts (DDD) is an online ensemble transfer learning approach that reuses old ensembles to deal with concept drift when streaming big data (Minku and Yao 2012).
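
As a rough illustration of the parameter-transfer case, the sketch below fits a linear model on a well-labeled source domain and then fine-tunes the same parameters on only three labeled target samples. Everything here is an assumption made for illustration: the one-variable linear model, the gradient-descent settings, and the synthetic source and target relationships, which differ only in their intercept. It is not a reproduction of any cited method.

```python
def fit_linear(xs, ys, w=0.0, b=0.0, lr=0.1, epochs=20):
    """Least-squares fit of y = w*x + b by batch gradient descent, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical source domain with plenty of labels: y = 2x + 1.
source_x = [i / 10 for i in range(11)]
source_y = [2.0 * x + 1.0 for x in source_x]

# Hypothetical target domain with only three labels: same slope, shifted intercept.
target_x = [0.2, 0.5, 0.8]
target_y = [2.0 * x + 1.5 for x in target_x]
test_x = [i / 10 for i in range(11)]
test_y = [2.0 * x + 1.5 for x in test_x]

# Learn on the source domain, then transfer the parameters and fine-tune briefly.
w_src, b_src = fit_linear(source_x, source_y, epochs=2000)
w_ft, b_ft = fit_linear(target_x, target_y, w=w_src, b=b_src, epochs=20)

# Baseline: the same small training budget on the target domain, starting from scratch.
w_sc, b_sc = fit_linear(target_x, target_y, epochs=20)

transfer_error = mse(test_x, test_y, w_ft, b_ft)
scratch_error = mse(test_x, test_y, w_sc, b_sc)
print(transfer_error, scratch_error)
```

In this constructed case the transferred model starts close to the target relationship and therefore reaches a lower test error than the model trained from scratch under the same budget. If the two domains were unrelated, the warm start could just as well hurt, which matches the caution stated at the start of this section.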

12.10 Active Learning

Active learning algorithms iteratively involve alternative sources or users to optimize the sampling and training of the classification model (Tuia et al. 2011). The general framework for active learning starts with labeled data to train a machine learning model. The learning process then generates requests to label certain groups of unlabeled data. These requests are answered by another machine learning algorithm or a human user, and the new labels are fed back to retrain the machine learning model. The process is iterative (Krishnakumar 2007).

Active learning algorithms have been applied in remote sensing image classification. In the survey (Tuia et al. 2011), active learning algorithms for remote sensing image classification are grouped into four groups: committee-based heuristics (using a committee of learners to quantify the uncertainty), large-margin-based heuristics (improving margin-based classifiers, e.g., Support Vector Machine), posterior probability-based heuristics (using posterior probabilities to rank class membership), and cluster-based heuristics (unsupervised automation through pruning a hierarchical clustering tree with user-provided labels) (Xia et al. 2015).

Active learning algorithms have recently been advanced for big data analytics in combination with deep learning algorithms (Liu et al. 2017; Yang et al. 2018). Using deep learning together with active learning improves the capability to process complex, high-dimensional, hyperspectral remote sensing big data. Another trend is combining the active learning strategy with ensemble learning approaches (Tüysüzoğlu and Yaslan 2018). The accuracy of ensemble classifiers improves as the number of active learning iterations increases.
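
The iterative loop described in this section can be sketched with a toy margin-based heuristic. The code assumes a one-dimensional pool, a simple threshold classifier, and an `oracle` function that stands in for the human user or auxiliary algorithm; all names and values are hypothetical. At each iteration the learner asks for the label of the unlabeled sample closest to its current decision boundary, which is where it is least certain.

```python
def fit_threshold(labeled):
    """Train a 1-D margin classifier: boundary halfway between the closest opposite-class samples."""
    max_negative = max(x for x, y in labeled if y == 0)
    min_positive = min(x for x, y in labeled if y == 1)
    return (max_negative + min_positive) / 2


def active_learning(pool, oracle, seed_labeled, n_queries):
    """Uncertainty sampling: repeatedly query the unlabeled sample nearest the boundary."""
    labeled = list(seed_labeled)
    known = {x for x, _ in labeled}
    unlabeled = [x for x in pool if x not in known]
    for _ in range(n_queries):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)                        # train on current labels
        query = min(unlabeled, key=lambda x: abs(x - threshold))  # most uncertain sample
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))                    # ask the oracle, feed label back
    return fit_threshold(labeled), labeled


def oracle(x):
    # Hypothetical labeler: the true class boundary lies between 61 and 62.
    return int(x >= 62)


pool = list(range(101))          # hypothetical pool of 101 candidate samples
seed = [(0, 0), (100, 1)]        # one labeled example per class to start
threshold, labeled = active_learning(pool, oracle, seed, n_queries=7)
print(threshold, len(labeled))   # 61.5 9
```

In this constructed example the boundary is located with 9 labels out of 101 candidates, which is the kind of labeling saving that motivates active learning when labels are expensive.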

References

Alex SA, Nayahi JJV (2020) Deep incremental learning for big data stream analytics. In: Pandian AP, Senjyu T, Islam SMS, Wang H (eds) Proceeding of the international conference on computer networks, big data and IoT (ICCBI – 2018). Springer International Publishing, Cham, pp 600–614
AlNuaimi N, Masud MM, Serhani MA, Zaki N (2022) Streaming feature selection algorithms for big data: a survey. Appl Comput Inform. https://doi.org/10.1016/j.aci.2019.01.001
Aydav PSS, Minz S (2020) Granulation-based self-training for the semi-supervised classification of remote-sensing images. Granul Comput 5:309–327. https://doi.org/10.1007/s41066-019-00161-x
Barua HB, Mondal KC (2019) Approximate computing: a survey of recent trends—bringing greenness to computing and communication. J Inst Eng India Ser B 100:619–626. https://doi.org/10.1007/s40031-019-00418-8
Basu S, Michailidis G (2015) Regularized estimation in sparse high-dimensional time series models. Ann Stat 43. https://doi.org/10.1214/15-AOS1315
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35:1798–1828. https://doi.org/10.1109/TPAMI.2013.50
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2016) Feature selection for high-dimensional data. Prog Artif Intell 5:65–75. https://doi.org/10.1007/s13748-015-0080-y
Bordes A, Ertekin S, Weston J et al (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619
Cauwenberghs G, Poggio T (2001) Incremental and decremental support vector machine learning. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems 13: proceedings of the 2000 conference. MIT Press, Cambridge, pp 409–415
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28. https://doi.org/10.1016/j.compeleceng.2013.11.024
Chaudhari A, Joshi RR, Mulay P et al (2019) Bibliometric survey on incremental clustering algorithms. Libr Philos Pract:1–23
Chen X-W, Lin X (2014) Big data deep learning: challenges and perspectives. IEEE Access 2:514–525. https://doi.org/10.1109/ACCESS.2014.2325029
Ding J-L, Wang F, Sun H, Shang L (2015) Improved incremental Regularized Extreme Learning Machine Algorithm and its application in two-motor decoupling control. Neurocomputing 149:215–223. https://doi.org/10.1016/j.neucom.2014.02.071
Ditzler G, Roveri M, Alippi C, Polikar R (2015) Learning in nonstationary environments: a survey. IEEE Comput Intell Mag 10:12–25. https://doi.org/10.1109/MCI.2015.2471196
Dong X, Yu Z, Cao W et al (2020) A survey on ensemble learning. Front Comput Sci 14:241–258. https://doi.org/10.1007/s11704-019-8208-z
Durbha SS, King RL, Younan NH (2010) Wrapper-based feature subset selection for rapid image information mining. IEEE Geosci Remote Sens Lett 7:43–47. https://doi.org/10.1109/LGRS.2009.2028585
Eshelman LJ (1991) The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. In: Rawlins GJE (ed) Foundations of genetic algorithms. Morgan Kaufmann, San Mateo, CA, USA, pp 265–283
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29. https://doi.org/10.1214/aos/1013203451
Galar M, Fernandez A, Barrenechea E et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C Appl Rev 42:463–484. https://doi.org/10.1109/TSMCC.2011.2161285
Gama J, Žliobaitė I, Bifet A et al (2014) A survey on concept drift adaptation. ACM Comput Surv 46:1–37. https://doi.org/10.1145/2523813
Gepperth A, Hammer B (2016) Incremental learning algorithms and applications. In: Verleysen M (ed) 24th European symposium on artificial neural networks, computational intelligence and machine learning: ESANN 2016: Bruges, Belgium, 27–29 April 2016: proceedings. Bruges, Belgium
Ghaderi Zefrehi H, Altınçay H (2020) Imbalance learning using heterogeneous ensembles. Expert Syst Appl 142:113005. https://doi.org/10.1016/j.eswa.2019.113005
Ghomeshi H, Gaber MM, Kovalchuk Y (2019) Ensemble dynamics in non-stationary data stream classification. In: Sayed-Mouchaweh M (ed) Learning from data streams in evolving environments. Springer International Publishing, Cham, pp 123–153
Goiri I, Bianchini R, Nagarakatte S, Nguyen TD (2015) ApproxHadoop: bringing approximations to MapReduce frameworks. ACM SIGPLAN Not 50:383–397. https://doi.org/10.1145/2775054.2694351
Gomes HM, Barddal JP, Enembreck F, Bifet A (2017) A survey on ensemble learning for data stream classification. ACM Comput Surv 50:1–36. https://doi.org/10.1145/3054925
Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, Vancouver, pp 6645–6649
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422. https://doi.org/10.1023/A:1012487302797
Han J, Lin TY (2009) Granular computing: models and applications. Int J Intell Syst n/a-n/a. https://doi.org/10.1002/int.20390
Hariri RH, Fredericks EM, Bowers KM (2019) Uncertainty in big data analytics: survey, opportunities, and challenges. J Big Data 6. https://doi.org/10.1186/s40537-019-0206-3
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21:1263–1284. https://doi.org/10.1109/TKDE.2008.239
He H, Chen S, Li K, Xu X (2011) Incremental learning from stream data. IEEE Trans Neural Netw 22:1901–1914. https://doi.org/10.1109/TNN.2011.2171713
Hinton GE, Salakhutdinov RR (2012) A better way to pretrain deep boltzmann machines. Adv Neural Inf Process Syst 25:2447–2455
Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527–1554
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hohl A, Saule E, Delmelle E, Tang W (2020) Spatiotemporal domain decomposition for high performance computing: a flexible splits heuristic to minimize redundancy. In: Tang W, Wang S (eds) High performance computing for geospatial applications. Springer International Publishing, Cham, pp 27–50
Huang Y-F, Lai C-J (2016) Integrating frequent pattern clustering and branch-and-bound approaches for data partitioning. Inf Sci 328:288–301. https://doi.org/10.1016/j.ins.2015.08.047
Ienco D, Interdonato R, Gaetano R, Ho Tong Minh D (2019) Combining Sentinel-1 and Sentinel-2 Satellite Image Time Series for land cover mapping via a multi-source deep learning architecture. ISPRS J Photogramm Remote Sens 158:11–22. https://doi.org/10.1016/j.isprsjprs.2019.09.016
Johnson JM, Khoshgoftaar TM (2019) Survey on deep learning with class imbalance. J Big Data 6. https://doi.org/10.1186/s40537-019-0192-5
Juez-Gil M, Arnaiz-González Á, Rodríguez JJ, García-Osorio C (2021) Experimental evaluation of ensemble classifiers for imbalance in Big Data. Appl Soft Comput 108:107447. https://doi.org/10.1016/j.asoc.2021.107447
Krawczyk B, Minku LL, Gama J et al (2017) Ensemble learning for data stream analysis: a survey. Inf Fusion 37:132–156. https://doi.org/10.1016/j.inffus.2017.02.004
Krishnakumar A (2007) Active learning literature survey. Technical Reports 42 (University of California Santa Cruz, 2007) pp 1–13
Kumar S, Mohbey KK (2019) A review on big data based parallel and distributed approaches of pattern mining. J King Saud Univ – Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2019.09.006
Lai TL, Yuan H (2021) Stochastic approximation: from statistical origin to big-data, multidisciplinary applications. Stat Sci 36. https://doi.org/10.1214/20-STS784
Lawal IA (2019) Incremental SVM learning: review. In: Sayed-Mouchaweh M (ed) Learning from data streams in evolving environments. Springer International Publishing, Cham, pp 279–296
Le T, Stahl F, Gomes JB et al (2014) Computationally efficient rule-based classification for continuous streaming data. In: Bramer M, Petridis M (eds) Research and development in intelligent systems XXXI. Springer International Publishing, Cham, pp 21–34
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Li Z (2020) Geospatial big data handling with high performance computing: current approaches and future directions. In: Tang W, Wang S (eds) High performance computing for geospatial applications. Springer International Publishing, Cham, pp 53–76
Li J, Liu H (2017) Challenges of feature selection for big data analytics. IEEE Intell Syst 32:9–15. https://doi.org/10.1109/MIS.2017.38
Li J, Cheng K, Wang S et al (2018) Feature selection: a data perspective. ACM Comput Surv 50:1–45. https://doi.org/10.1145/3136625
Liu P, Zhang H, Eom KB (2017) Active deep learning for classification of hyperspectral images. IEEE J Sel Top Appl Earth Obs Remote Sens 10:712–724. https://doi.org/10.1109/JSTARS.2016.2598859
Liu Y, Ding L, Chen C, Liu Y (2020) Similarity-based unsupervised deep transfer learning for remote sensing image retrieval. IEEE Trans Geosci Remote Sens 58:7872–7889. https://doi.org/10.1109/TGRS.2020.2984703
Losing V, Hammer B, Wersing H (2018) Incremental on-line learning: a review and comparison of state of the art algorithms. Neurocomputing 275:1261–1274. https://doi.org/10.1016/j.neucom.2017.06.084
Lu J, Liu A, Dong F et al (2018) Learning under concept drift: a review. IEEE Trans Knowl Data Eng:1–1. https://doi.org/10.1109/TKDE.2018.2876857
Luby M, Sinclair A, Zuckerman D (1993) Optimal speedup of Las Vegas algorithms. Inf Process Lett 47:173–180. https://doi.org/10.1016/0020-0190(93)90029-9
Luengo J, García-Gil D, Ramírez-Gallego S et al (2020a) Data reduction for big data. In: Big data preprocessing. Springer International Publishing, Cham, pp 81–99
Luengo J, García-Gil D, Ramírez-Gallego S et al (2020b) Dimensionality reduction for big data. In: Big data preprocessing. Springer International Publishing, Cham, pp 53–79
Luo Y, Yin L, Bai W, Mao K (2020) An appraisal of incremental learning methods. Entropy 22:1190. https://doi.org/10.3390/e22111190
Ma K, Ben-Arie J (2014) Compound exemplar based object detection by incremental random forest. In: 2014 22nd international conference on pattern recognition. IEEE, Stockholm, pp 2407–2412
Ma S, Huai J (2019) Approximate computation for big data analytics. arXiv:1901.00232 [cs]
Maggiori E, Tarabalka Y, Charpiat G, Alliez P (2017) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Trans Geosci Remote Sens 55:645–657. https://doi.org/10.1109/TGRS.2016.2612821
Mahmud MS, Huang JZ, Salloum S et al (2020) A survey of data partitioning and sampling methods to support big data analysis. Big Data Min Anal 3:85–101. https://doi.org/10.26599/BDMA.2019.9020015
Mejía-Lavalle M, Sucar E, Arroyo G (2006) Feature selection with a perceptron neural net. In: Liu H, Stine R, Auslender L (eds) Proceedings of the international workshop on feature selection for data mining, Bethesda, pp 131–135
Minku LL (2019) Transfer learning in non-stationary environments. In: Sayed-Mouchaweh M (ed) Learning from data streams in evolving environments. Springer International Publishing, Cham, pp 13–37
Minku LL, Yao X (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Trans Knowl Data Eng 24:619–633. https://doi.org/10.1109/TKDE.2011.58
Minku LL, Yao X (2014) How to make best use of cross-company data in software effort estimation? In: Proceedings of the 36th international conference on software engineering. ACM, Hyderabad, pp 446–456
Mittal S (2016) A survey of techniques for approximate computing. ACM Comput Surv 48:1–33. https://doi.org/10.1145/2893356
Nickson T, Osborne MA, Reece S, Roberts SJ (2014) Automated machine learning on big data using stochastic algorithm tuning. arXiv:1407.7969 [stat]
Oliver MA, Webster R (1990) Kriging: a method of interpolation for geographical information systems. Int J Geogr Inf Syst 4:313–332. https://doi.org/10.1080/02693799008941549
Pal SK (2020) Granular mining and big data analytics: rough models and challenges. Proc Natl Acad Sci India Sect Phys Sci 90:193–208. https://doi.org/10.1007/s40010-018-0578-3
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22:1345–1359. https://doi.org/10.1109/TKDE.2009.191
Pedrycz W (2018) Granular computing: analysis and design of intelligent systems, 1st edn. CRC Press
Peralta D, del Río S, Ramírez-Gallego S et al (2015) Evolutionary feature selection for big data classification: a MapReduce approach. Math Probl Eng 2015:1–11. https://doi.org/10.1155/2015/246139
Peters G, Weber R (2016) DCC: a framework for dynamic granular clustering. Granul Comput 1:1–11. https://doi.org/10.1007/s41066-015-0012-z
Pfahringer B, Holmes G, Kirkby R (2007) New options for Hoeffding trees. In: Orgun MA, Thornton J (eds) AI 2007: advances in artificial intelligence. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 90–99
Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines. Microsoft
Polikar R, Byorick J, Krause S et al (2002) Learn++: a classifier independent incremental learning algorithm for supervised neural networks. In: Proceedings of the 2002 international joint conference on neural networks. IJCNN’02 (Cat. No.02CH37290). IEEE, Honolulu, pp 1742–1747
Qiu J, Wu Q, Ding G et al (2016) A survey of machine learning for big data processing. EURASIP J Adv Signal Process 2016. https://doi.org/10.1186/s13634-016-0355-x
Raina R, Madhavan A, Ng AY (2009) Large-scale deep unsupervised learning using graphics processors. In: Proceedings of the 26th annual international conference on machine learning. ACM Press, New York, pp 873–880
Ramírez-Gallego S, Mouriño-Talín H, Martínez-Rego D et al (2018) An information theory-based feature selection framework for big data under apache spark. IEEE Trans Syst Man Cybern Syst 48:1441–1453. https://doi.org/10.1109/TSMC.2017.2670926
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Saffari A, Leistner C, Santner J et al (2009) On-line random forests. In: 2009 IEEE 12th international conference on computer vision workshops, ICCV workshops. IEEE, Kyoto, pp 1393–1400
Sagi O, Rokach L (2018) Ensemble learning: a survey. Wiley Interdiscip Rev Data Min Knowl Discov 8. https://doi.org/10.1002/widm.1249
Samadi M, Jamshidi DA, Lee J, Mahlke S (2014) Paraprox: pattern-based approximation for data parallel applications. ACM SIGARCH Comput Archit News 42:35–50. https://doi.org/10.1145/2654822.2541948
Shu H (2016) Big data analytics: six techniques. Geo-Spat Inf Sci 19:119–128. https://doi.org/10.1080/10095020.2016.1182307
Siddiqa A, Karim A, Gani A (2017) Big data storage technologies: a survey. Front Inf Technol Electron Eng 18:1040–1070. https://doi.org/10.1631/FITEE.1500441
Syed NA, Liu H, Sung KK (1999a) Incremental learning with support vector machines. In: KDD’99. San Diego
Syed NA, Liu H, Sung KK (1999b) Handling concept drifts in incremental learning with support vector machines. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining – KDD’99. ACM Press, San Diego, pp 317–321
Tan C, Sun F, Kong T et al (2018) A survey on deep transfer learning. In: Kůrková V, Manolopoulos Y, Hammer B et al (eds) Artificial neural networks and machine learning – ICANN 2018. Springer International Publishing, Cham, pp 270–279
Tealab A (2018) Time series forecasting using artificial neural networks methodologies: a systematic review. Future Comput Inform J 3:334–340. https://doi.org/10.1016/j.fcij.2018.10.003
Teerapittayanon S, McDanel B, Kung HT (2017) Distributed deep neural networks over the cloud, the edge and end devices. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS). IEEE, Atlanta, pp 328–339
Tennant M, Stahl F, Rana O, Gomes JB (2017) Scalable real-time classification of data streams with concept drift. Future Gener Comput Syst 75:187–199. https://doi.org/10.1016/j.future.2017.03.026
Tiwari SR, Rana KK (2021) Feature selection in big data: trends and challenges. In: Kotecha K, Piuri V, Shah HN, Patel R (eds) Data science and intelligent applications. Springer Singapore, Singapore, pp 83–98
Tong X-Y, Xia G-S, Hu F et al (2020) Exploiting deep features for remote sensing image retrieval: a systematic investigation. IEEE Trans Big Data 6:507–521. https://doi.org/10.1109/TBDATA.2019.2948924
Tuia D, Volpi M, Copa L et al (2011) A survey of active learning algorithms for supervised remote sensing image classification. IEEE J Sel Top Signal Process 5:606–617. https://doi.org/10.1109/JSTSP.2011.2139193
Tüysüzoğlu G, Yaslan Y (2018) Sparse coding based classifier ensembles in supervised and active learning scenarios for data classification. Expert Syst Appl 91:364–373. https://doi.org/10.1016/j.eswa.2017.09.024
van Zyl T (2014) Algorithmic design considerations for geospatial and/or temporal big data. Big Data Tech Technol Geoinformatics. CRC Press, Boca Raton, pp 117–132
Vincent P, Larochelle H, Lajoie I et al (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
Wang X, He Y (2016) Learning from uncertainty for big data: future analytical challenges and strategies. IEEE Syst Man Cybern Mag 2:26–31. https://doi.org/10.1109/MSMC.2016.2557479
Xia G-S, Wang Z, Xiong C, Zhang L (2015) Accurate annotation of remote sensing images via active spectral clustering with little expert knowledge. Remote Sens 7:15014–15045. https://doi.org/10.3390/rs71115014
Xiaona D, Chunfeng L, Baoxiang L (2020) Research on image granulation in granular computing. In: 2020 IEEE 3rd international conference on information systems and computer aided education (ICISCAE). IEEE, Dalian, pp 667–674
Yang N, Tang H (2020) GeoBoost: an incremental deep learning approach toward global mapping of buildings from VHR remote sensing images. Remote Sens 12:1794. https://doi.org/10.3390/rs12111794
Yang L, MacEachren A, Mitra P, Onorati T (2018) Visually-enabled active deep learning for (geo) text and image classification: a review. ISPRS Int J Geo-Inf 7:65. https://doi.org/10.3390/ijgi7020065
Yang Q, Gu Y, Wu D (2019) Survey of incremental learning. In: 2019 Chinese control and decision conference (CCDC). IEEE, Nanchang, pp 399–404
Yao Y (2016) A triarchic theory of granular computing. Granul Comput 1:145–157. https://doi.org/10.1007/s41066-015-0011-0
Yao JT, Vasilakos AV, Pedrycz W (2013) Granular computing: perspectives and challenges. IEEE Trans Cybern 43:1977–1989. https://doi.org/10.1109/TSMCC.2012.2236648
Yuan Q, Yin G (2015) Analyzing convergence and rates of convergence of particle swarm optimization algorithms using stochastic approximation methods. IEEE Trans Autom Control 60:1760–1773. https://doi.org/10.1109/TAC.2014.2381454
Zhang Q, Yang LT, Chen Z, Li P (2018) A survey on deep learning for big data. Inf Fusion 42:146–157. https://doi.org/10.1016/j.inffus.2017.10.006
Zhuang F, Qi Z, Duan K et al (2020) A comprehensive survey on transfer learning. Proc IEEE:1–34. https://doi.org/10.1109/JPROC.2020.3004555

Chapter 13

Modeling, Prediction, and Decision Making Based on Remote Sensing Big Data

Abstract This chapter covers the life cycle of using remote sensing big data in real-world actions. The application life cycle includes three stages: modeling, prediction, and decision making. The modeling stage processes remote sensing data and produces information. The prediction stage extrapolates and forecasts phenomena from time series of data products derived from remote sensing big data; the predictive methods and algorithms involved are discussed in Sect. 9.1.2. The decision support stage links the data products with the decision problems to produce and deliver actionable information.

Keywords Modeling · Prediction · Decision making · Data model · Life cycle

Remote sensing is one of the major sources of continuous Earth monitoring through different levels of sensors. Remote sensing big data has found extensive applications in modeling the Earth, extrapolating time series, and producing decision-ready information (DRI).

13.1 A General Framework

The general framework for using remote sensing big data is shown in Fig. 13.1. The inclusion of remote sensing big data in data analytics and decision support starts with data acquisition. The data can be collected by physical sensors onboard satellites, aircraft, unmanned aerial vehicles (UAVs), high towers, or smart field vehicles. For example, the Arctic-Boreal Vulnerability Experiment (ABoVE) uses airborne altimeter, radar, and laser sensors to collect data on the Arctic and boreal ecosystems (Scholten et al. 2021). Sentinel-1, a polar-orbiting satellite mission, uses synthetic aperture radar (SAR) sensors to capture images of land and oceans day and night with the all-weather capability of radar technology (Ienco et al. 2019, p. 1). The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) is a hyperspectral sensor (224 contiguous spectral bands) flown on aircraft to capture molecular absorption and particle scattering signatures of the Earth's surface and atmosphere (Green et al. 1998). UAVs can be mounted with different sensors or cameras to capture detailed imagery and data of fields or the Earth's surface quickly (Maes and Steppe 2019). A suite of sensors is mounted on AmeriFlux towers to collect in situ data on carbon, water, and energy fluxes (Novick et al. 2018). Sensors mounted on combines collect data on crop yield and on the quality of grain and straw (Reyns et al. 2002). Sensors are not limited to physical sensors; they also include virtual sensors and citizen sensors. Virtual sensors may be model outputs that result from assimilating different live streams from remote sensors (Srivastava et al. 2005). Citizen sensors collect different data for different campaigns or applications (Villatoro and Nin 2013).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_13

Fig. 13.1 A general framework from remote sensing data to decision-ready information

The next phase is the data processing and modeling stage (Rathore et al. 2015). This stage may include ingesting sensor data and feeding them into analytic models and/or assimilating them into models. Different frameworks and models can be used to analyze remote sensing big data. The data modeling stage also includes preprocessing, that is, preparing the data for analytics according to the analytic models or frameworks used. Data may be fused from a variety of sources to form a uniform base dataset for modeling.

The prediction stage is an extension of observations or time series of observations (Junqué de Fortuny et al. 2013; Yu et al. 2016; Sayad et al. 2019). The data accumulated from remote sensors over time build up the base for short-, middle-, or even long-term time series analysis. The analysis of time series data provides the basis for extrapolating or predicting the physical measurements into the near future. The prediction can be a simple extrapolation, or the time series can be incorporated into a simulation model to generate predictions. Predictive analytic methods are discussed in Sect. 9.1.2.

The outputs of modeling and predictive analytics are used to support decision making (Boulila et al. 2018; Huang et al. 2018; Wang et al. 2021). The information may go through a further preparation and processing stage to produce decision-ready information. The decision-support information may be delivered through standard interfaces and formats so that decision makers can easily access and use it in their decision activities.
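The simple-extrapolation case of the prediction stage can be sketched as follows. This is a minimal illustration only, assuming an evenly spaced time series and a linear trend; the function names `fit_linear_trend` and `extrapolate` and the sample values are hypothetical and are not taken from the cited studies.

```python
from statistics import mean


def fit_linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    return slope, v_mean - slope * t_mean


def extrapolate(values, steps_ahead):
    """Extend the fitted trend by the requested number of future time steps."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [slope * (n + k) + intercept for k in range(steps_ahead)]


# Hypothetical, evenly spaced vegetation-index observations (assumed values).
series = [0.30, 0.35, 0.40, 0.45]
forecast = extrapolate(series, 2)
```

In practice the trend model would be replaced by whichever predictive method from Sect. 9.1.2 suits the phenomenon; the linear fit is used here only because it is the simplest form of extrapolation.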

13.2 Modeling

13.2.1 Data Models and Structures

The first task in modeling with remote sensing big data is the preparation and transformation of data into a form that is readily usable by models or processing algorithms. These steps are generally called preprocessing. They may include the selection of data models/structures for processing algorithms, reformatting of data, re-projection to match spatial reference systems, and rescaling to match the granular resolution of the models to be used. Remote sensing data can be seen as a special type of geospatial data. In addition, different auxiliary data may be introduced in the modeling process to calibrate, train, or validate models along with remote sensing big data.

The selection of data models and structures may be an important step in enabling analytics and modeling with remote sensing big data. Table 13.1 shows selected data models and their applicability to remote sensing big data (Li et al. 2016). The most comprehensive spatial relationships may be modeled with network and topological data models for accurate geospatial modeling. However, a comprehensive network spatial data model may be difficult to manage and maintain for big data because of the consistency requirements on connectivity and adjacency. Regular tessellations, or gridded data, can be easily partitioned and managed in a matrix with reasonable accuracy. Irregular tessellations represent spatial data sparsely, typically with tree structures that support indexing. Distributed data streams are the natural form of streaming data; remote sensing big data often arrive in streams because sensors are constantly sensing the Earth. A distributed spatial database is backed by a spatial database system in a distributed environment. Such databases can be readily used in a distributed computing environment for managing and processing remote sensing big data.

Table 13.1 Selective remote sensing big data models
Data model | Suitable for big data? | Suitability reasoning
Network and topological data models | No | 1. Connectivity 2. Adjacency
Regular tessellations | Yes | 1. Soft errors (inconsistencies in data that do not necessarily cause erroneous results) 2. In matrix with parallelized vectorization
Irregular tessellations (e.g., Voronoi, k-d-tree, binary space partitioning tree) | Yes | 1. Sparse tree 2. Indexing
Tabular data | Yes | 1. Summaries 2. Support spatial statistics
Distributed data streams | Yes | 1. Ready for parallelisms through incremental processing 2. May be affected by concept shifts in modeling
Distributed spatial database | Yes | 1. Spatial indexing 2. Backed with database management systems
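The reason regular tessellations partition easily can be illustrated with a short sketch. It assumes a gridded dataset held as a list of rows; the helper names `tile_windows` and `tile_means` and the 4 × 4 sample grid are hypothetical. Each tile is processed independently, so in a distributed setting every window could be handed to a separate worker.

```python
def tile_windows(n_rows, n_cols, tile_size):
    """Yield (row_start, row_stop, col_start, col_stop) windows covering the grid."""
    for r in range(0, n_rows, tile_size):
        for c in range(0, n_cols, tile_size):
            yield r, min(r + tile_size, n_rows), c, min(c + tile_size, n_cols)


def tile_means(grid, tile_size):
    """Compute the mean of each tile independently of all other tiles."""
    results = {}
    for r0, r1, c0, c1 in tile_windows(len(grid), len(grid[0]), tile_size):
        cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        results[(r0, c0)] = sum(cells) / len(cells)
    return results


# Hypothetical 4 x 4 gridded dataset with values 0..15.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
means = tile_means(grid, 2)
```

Edge tiles are clipped to the grid bounds, so the windows cover every cell exactly once even when the grid size is not a multiple of the tile size.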

13.2.2 Modeling with Remote Sensing Big Data

Remote sensors can be connected to form a network that feeds live data into the modeling system. To support the continuous integration of remote sensing big data, a set of standard-compliant services may be used. The networked remote sensors can be called a sensor Web when the connections among them follow well-known specifications or standards (Di 2007a). The Self-adaptive Earth Predictive System (SEPS), described in Sect. 4.3, is an example framework (Di 2007b). SEPS can be used to connect sensors to Earth science models and build a live data link from sensors to science models. Once a workflow is established, remote sensing big data can be streamed and processed as they become available. Figure 13.2 shows a live workflow established by chaining standard geospatial Web services to process satellite sensor data from the Moderate Resolution Imaging Spectroradiometer (MODIS) and produce crop condition indices in an operational service, VegScape (Yang et al. 2016). Different models are enabled as Web Processing Services (WPS), including processes to calculate the normalized difference vegetation index (NDVI), vegetation condition index (VCI), and median VCI (MVCI). The WPS services can be deployed at different endpoints in a distributed computing environment. The computation can be scaled through a load balancer in a cloud computing environment to support the processing of remote sensing big data.

Fig. 13.2 Continuous production of crop condition indices from remotely sensed data

To achieve parallelism in processing and modeling with remote sensing big data, distributed programming models may be used to program and scale the computation. MapReduce is a programming model that supports distributed computing and can be used to scale the processing of remote sensing big data. Rathore et al. (2015) adopted the MapReduce programming model to scale the processing of real-time remote sensing big data. Their system consists of three major components: a remote sensing data acquisition unit (RSDU), a data processing unit (DPU), and a data analysis and decision unit (DADU). The system was set up to process live streams from sensors such as the Advanced Synthetic Aperture Radar (ASAR) and the Medium Resolution Imaging Spectrometer (MERIS) onboard Envisat. Sun et al. (2019) adopted Apache Spark to support high-performance processing in a cloud computing environment. Optimization is achieved through the MapReduce programming model and task scheduling using a directed acyclic graph (DAG) model of remote sensing applications.

Deep learning algorithms have been applied to modeling and classifying land cover (Parente et al. 2019). Random forest was used as a reference classifier to evaluate the performance of the deep learning algorithms. The study showed that deep learning algorithms (i.e., U-net convolutional neural networks and long short-term memory (LSTM) recurrent neural networks) performed better than random forest on remote sensing big data for land cover mapping.
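The index calculation and the MapReduce pattern described above can be sketched together in a few lines. The sketch assumes the commonly used definitions NDVI = (NIR − Red)/(NIR + Red) and VCI = 100 × (NDVI − NDVI_min)/(NDVI_max − NDVI_min); it runs in a single process and only imitates the map, shuffle, and reduce phases. It is not the VegScape implementation or the system of Rathore et al. (2015), and the record layout, function names, and sample values are hypothetical.

```python
from collections import defaultdict


def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    # Assumption: return 0.0 when both bands are zero to avoid division by zero.
    return (nir - red) / total if total else 0.0


def vci(ndvi_now, ndvi_min, ndvi_max):
    """Vegetation condition index in percent, relative to the historical NDVI range."""
    if ndvi_max == ndvi_min:
        raise ValueError("historical NDVI range is empty")
    return 100.0 * (ndvi_now - ndvi_min) / (ndvi_max - ndvi_min)


def map_pixel(record):
    """Map step: emit (key, value) = (field id, NDVI) for one pixel record."""
    field_id, nir, red = record
    return field_id, ndvi(nir, red)


def reduce_mean(values):
    """Reduce step: average all NDVI values that share a key."""
    return sum(values) / len(values)


def run_job(records):
    """Imitate map -> shuffle (group by key) -> reduce in one process."""
    grouped = defaultdict(list)
    for key, value in map(map_pixel, records):
        grouped[key].append(value)
    return {key: reduce_mean(vals) for key, vals in grouped.items()}


# Hypothetical pixel records: (field id, NIR reflectance, red reflectance).
pixels = [("A", 0.6, 0.2), ("A", 0.5, 0.3), ("B", 0.4, 0.4)]
mean_ndvi = run_job(pixels)
condition_a = vci(mean_ndvi["A"], ndvi_min=0.2, ndvi_max=0.6)
```

Because the map step touches one pixel at a time and the reduce step only needs the values of one key, both phases can be distributed across workers, which is the property the MapReduce-based systems above rely on.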


13.2.3 Validation with Remote Sensing Big Data

Another way of using remote sensing big data in modeling is to assimilate remote sensing data with an existing model (Reichle 2008). The remotely sensed data may not be used directly as model parameters; instead, they are used to validate the simulation model by comparing comparable parameters. Figure 13.3 shows the general framework. Both the remote sensing processing and the simulation model are run, and comparable parameters are retrieved from the model and from the remote sensing algorithms. The parameters are then compared. If they do not converge to the given criteria, the model is rerun with a different set of initial settings, while more remotely sensed data may be acquired to produce comparable outputs. The models are optimized through this comparison and validation of parameters.

Fig. 13.3 Assimilation of remotely sensed data with models through validation

In such an assimilation paradigm, computation is intensive because both the simulation models and the remote sensing processing models must be executed. Physically based simulation models may have to be reset with different initial settings and rerun several times before the results or intermediate parameters converge.

The assimilation of remote sensing big data is often applied with simulation models or prediction models. In Miyoshi et al. (2016), remote sensing big data from phased array weather radar and the geostationary satellite Himawari-8 are assimilated with a numerical weather prediction (NWP) simulation model. In Chen et al. (2019), Google Earth Engine (GEE) is used to process remote sensing data, and the results are fed into a flood inundation simulation model to model and predict flood inundation. More than 15 years of Landsat images were used to build up a time series of water level and inundation maps. The time series are then used to build and validate the simulation and prediction of flood extent using the relationship between inundation extent and water level.
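The compare-and-rerun loop of the validation framework can be sketched as below. This is a toy illustration under strong assumptions: `simulate` is a hypothetical linear stand-in for a physically based model, the observed value stands for a parameter retrieved from remotely sensed data, and the proportional update with a fixed `gain` is one simple way to choose the next initial setting, not a method taken from the cited studies.

```python
def simulate(initial_setting):
    """Hypothetical stand-in for a physically based simulation model."""
    return 2.0 * initial_setting + 1.0


def calibrate(observed, setting, tolerance=1e-3, gain=0.25, max_runs=100):
    """Rerun the model with adjusted settings until it agrees with the observation."""
    for run in range(1, max_runs + 1):
        simulated = simulate(setting)
        mismatch = observed - simulated
        if abs(mismatch) <= tolerance:
            return setting, simulated, run
        # Nudge the initial setting in proportion to the remaining mismatch.
        setting += gain * mismatch
    raise RuntimeError("model did not converge to the given criteria")


# Hypothetical comparable parameter retrieved from remotely sensed data.
observed_value = 7.0
setting, simulated, runs = calibrate(observed_value, setting=0.0)
```

Every pass through the loop costs one full model run, which is why this paradigm is computationally intensive when the stand-in is replaced by a real simulation model.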

13.3 Decision Making

The use of remote sensing big data in decision making involves three steps: the discovery and access of decision-relevant information, the production of decision-ready information (DRI), and the efficient delivery of DRI to decision makers. Figure 13.4 shows a decision support system based on remote sensing big data.

Fig. 13.4 A decision support system based on remote sensing big data

The first step in decision support is to find or produce relevant information (Rathore et al. 2015; Huang et al. 2018; Tantalaki et al. 2019). Decision making is often goal-driven. For a given goal, relevant information should be gathered. These data may be derived from remote sensing or from other sources. They should be integrated, fused, or aligned to provide a better understanding of the decision problem. To support the discovery of relevant information for a given decision goal, a catalog system may be used to index and search the information. Contextual information retrieval may be required as the data grow and the information is shared across different tasks (Li et al. 2021). For example, an environmental decision may call for information on soil and coastal erosion, water quality, forest resources, and land cover, and the relevant data may be produced by different organizations. One solution for enabling the discovery of and access to different data products is the adoption of the Open Data Cube (ODC) infrastructure and service systems (Killough 2018). ODC enables standard data search and access, so that discovery of and access to data products can be achieved through standard interfaces.

The second step in decision support is the preparation of decision-ready information (DRI). The DRI should be actionable. The data products found in the first step may not make sense to decision makers because each may reflect only one aspect of the whole problem, and the link between the information and the decision goal may not yet be established. Multiple data products may be analyzed and linked to the decision goal through a decision-evaluation system, such as multiple criteria decision making (MCDM), to produce actionable decision information (Triantaphyllou et al. 1998; Pektürk and Ünal 2018). For example, soil moisture and evapotranspiration data from remote sensing and models may not make sense to farmers, the final decision makers who need to decide whether the irrigation system should be turned on and how much water should be applied (Di et al. 2018). These parameters need to be integrated and evaluated through a decision-making system to produce the actual irrigation map that recommends irrigation (Lin et al. 2021).

The final step in decision support is to deliver the DRI to decision makers in time (Pektürk and Ünal 2018; Pavithra and Murali 2018; Tantalaki et al. 2019). The DRI is normally time-sensitive: it is useful only if it reaches the decision maker in time and with proper visualization. To speed up delivery and to present the results visually to decision makers, standard services may be used. For example, the OGC Web Map Service (WMS) may be used to present map-based information that can be readily visualized at the client's end. The irrigation decision support service system discussed above uses open geospatial standards to serve irrigation-relevant information and decision-ready information (Di et al. 2018). Farmers, the decision makers in the field, may access this information visually using a standard-compliant client, such as GeoFairy (Sun et al. 2021).
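The second step above, turning several data products into one actionable recommendation, can be sketched with the weighted sum model, one of the simplest MCDM methods. The criteria names, weights, threshold, and field values below are hypothetical and assume that each input has already been normalized to the range 0 to 1; this is not the decision logic of the WaterSmart system.

```python
def weighted_sum_score(criteria, weights):
    """Weighted sum model: combine normalized criteria (0..1) into one score."""
    if set(criteria) != set(weights):
        raise ValueError("criteria and weights must use the same names")
    total_weight = sum(weights.values())
    return sum(criteria[name] * weights[name] for name in criteria) / total_weight


def recommend_irrigation(fields, weights, threshold=0.5):
    """Turn per-field scores into an actionable yes/no recommendation."""
    return {
        field_id: weighted_sum_score(criteria, weights) >= threshold
        for field_id, criteria in fields.items()
    }


# Hypothetical, already-normalized inputs; higher means a stronger case for irrigating.
weights = {"soil_dryness": 0.5, "evapotranspiration": 0.3, "no_rain_forecast": 0.2}
fields = {
    "field_1": {"soil_dryness": 0.9, "evapotranspiration": 0.7, "no_rain_forecast": 0.8},
    "field_2": {"soil_dryness": 0.2, "evapotranspiration": 0.4, "no_rain_forecast": 0.1},
}
decision = recommend_irrigation(fields, weights)
```

The per-field yes/no result is the kind of output that could then be rendered as an irrigation map and served to a client through a map service.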

References

Boulila W, Farah IR, Hussain A (2018) A novel decision support system for the interpretation of remote sensing big data. Earth Sci Inf 11:31–45. https://doi.org/10.1007/s12145-017-0313-7
Chen Z, Luo J, Chen N et al (2019) RFim: a real-time inundation extent model for large floodplains based on remote sensing big data and water level observations. Remote Sens 11:1585. https://doi.org/10.3390/rs11131585
Di L (2007a) Geospatial sensor web and self-adaptive earth predictive systems (SEPS). In: ESTO-AIST sensor web PI meeting. NASA, San Diego, California, USA
Di L (2007b) A general framework and system prototypes for the self-adaptive earth predictive systems (SEPS)--dynamically coupling sensor web with earth system models (AIST-05-0064). In: ESTO-AIST sensor web PI meeting. NASA, San Diego, California, USA
Di L, Chen F, Yang H et al (2018) WaterSmart: a cyberinfrastructure-based integrated decision-support web service system to facilitate informed irrigation decision-making. In: AGU fall meeting abstracts, p GC52B-08


Green RO, Eastwood ML, Sarture CM et al (1998) Imaging spectroscopy and the airborne visible/infrared imaging spectrometer (AVIRIS). Remote Sens Environ 65:227–248. https://doi.org/10.1016/S0034-4257(98)00064-9
Huang Y, Chen Z, Yu T et al (2018) Agricultural remote sensing big data: management and applications. J Integr Agric 17:1915–1931. https://doi.org/10.1016/S2095-3119(17)61859-8
Ienco D, Interdonato R, Gaetano R, Ho Tong Minh D (2019) Combining Sentinel-1 and Sentinel-2 Satellite Image Time Series for land cover mapping via a multi-source deep learning architecture. ISPRS J Photogramm Remote Sens 158:11–22. https://doi.org/10.1016/j.isprsjprs.2019.09.016
Junqué de Fortuny E, Martens D, Provost F (2013) Predictive modeling with big data: is bigger really better? Big Data 1:215–226. https://doi.org/10.1089/big.2013.0037
Killough B (2018) Overview of the open data cube initiative. In: IGARSS 2018 – 2018 IEEE international geoscience and remote sensing symposium. IEEE, Valencia, pp 8629–8632
Li S, Dragicevic S, Castro FA et al (2016) Geospatial big data handling theory and methods: a review and research challenges. ISPRS J Photogramm Remote Sens 115:119–133. https://doi.org/10.1016/j.isprsjprs.2015.10.012
Li J, Liu Z, Lei X, Wang L (2021) Distributed fusion of heterogeneous remote sensing and social media data: a review and new developments. Proc IEEE:1–14. https://doi.org/10.1109/JPROC.2021.3079176
Lin L, Di L, Guo L et al (2021) Developing a semantic irrigation ontology to support WaterSmart System: a demonstration of reducing water and energy consumption in Nebraska. Geography
Maes WH, Steppe K (2019) Perspectives for remote sensing with unmanned aerial vehicles in precision agriculture. Trends Plant Sci 24:152–164. https://doi.org/10.1016/j.tplants.2018.11.007
Miyoshi T, Lien G-Y, Satoh S et al (2016) “Big data assimilation” toward post-petascale severe weather prediction: an overview and progress. Proc IEEE 104:2155–2179. https://doi.org/10.1109/JPROC.2016.2602560
Novick KA, Biederman JA, Desai AR et al (2018) The AmeriFlux network: a coalition of the willing. Agric For Meteorol 249:444–456. https://doi.org/10.1016/j.agrformet.2017.10.009
Parente L, Taquary E, Silva A et al (2019) Next generation mapping: combining deep learning, cloud computing, and big remote sensing data. Remote Sens 11:2881. https://doi.org/10.3390/rs11232881
Pavithra M, Murali G (2018) Implementation of scientific architecture of real-time big information in remote sensing applications. In: 2018 3rd international conference on communication and electronics systems (ICCES). IEEE, Coimbatore, India, pp 481–486
Pektürk MK, Ünal M (2018) Performance-aware high-performance computing for remote sensing big data analytics. In: Thomas C (ed) Data mining. InTech
Rathore MMU, Paul A, Ahmad A et al (2015) Real-time big data analytical architecture for remote sensing application. IEEE J Sel Top Appl Earth Obs Remote Sens 8:4610–4621. https://doi.org/10.1109/JSTARS.2015.2424683
Reichle RH (2008) Data assimilation methods in the Earth sciences. Adv Water Resour 31:1411–1418. https://doi.org/10.1016/j.advwatres.2008.01.001
Reyns P, Missotten B, Ramon H, De Baerdemaeker J (2002) A review of combine sensors for precision farming. Precis Agric 3:169–182. https://doi.org/10.1023/A:1013823603735
Sayad YO, Mousannif H, Al Moatassime H (2019) Predictive modeling of wildfires: a new dataset and machine learning approach. Fire Saf J 104:130–146. https://doi.org/10.1016/j.firesaf.2019.01.006
Scholten RC, Jandt R, Miller EA et al (2021) Overwintering fires in boreal forests. Nature 593:399–404. https://doi.org/10.1038/s41586-021-03437-y
Srivastava AN, Oza NC, Stroeve J (2005) Virtual sensors: using data mining techniques to efficiently estimate remote sensing spectra. IEEE Trans Geosci Remote Sens 43:590–600. https://doi.org/10.1109/TGRS.2004.842406
Sun J, Zhang Y, Wu Z et al (2019) An efficient and scalable framework for processing remotely sensed big data in cloud computing environments. IEEE Trans Geosci Remote Sens 57:4294–4308. https://doi.org/10.1109/TGRS.2018.2890513


Sun Z, Di L, Cvetojevic S, Yu Z (2021) GeoFairy2: a cross-institution mobile gateway to location-linked data for in-situ decision making. ISPRS Int J Geo Inf 10:1
Tantalaki N, Souravlas S, Roumeliotis M (2019) Data-driven decision making in precision agriculture: the rise of big data in agricultural systems. J Agric Food Inf 20:344–380. https://doi.org/10.1080/10496505.2019.1638264
Triantaphyllou E, Shu B, Sanchez SN, Ray T (1998) Multi-criteria decision making: an operations research approach. Encycl Electr Electron Eng 15:175–186
Villatoro D, Nin J (2013) Citizens sensor networks. In: Nin J, Villatoro D (eds) Citizen in sensor networks. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 1–5
Wang R-Y, Lin P, Chu J-Y et al (2021) A decision support system for Taiwan’s forest resource management using remote sensing big data. Enterp Inf Syst:1–22. https://doi.org/10.1080/17517575.2021.1883123
Yang Z, Hu L, Yu G et al (2016) Web service-based SMAP soil moisture data visualization, dissemination and analytics based on vegscape framework. IEEE, pp 3624–3627
Yu C-H, Ding W, Morabito M, Chen P (2016) Hierarchical spatio-temporal pattern discovery and predictive modeling. IEEE Trans Knowl Data Eng 28:979–993. https://doi.org/10.1109/TKDE.2015.2507570

Chapter 14

Examples of Remote Sensing Applications of Big Data Analytics—Fusion of Diverse Earth Observation Data

Abstract This chapter describes data fusion, a common task in remote sensing big data analytics that produces improved images by fusing data with different (spatial, spectral, radiometric, and temporal) resolutions. One newly developed, learning-based spatiotemporal fusion model, the Deep Convolutional Spatiotemporal Fusion Network (DCSTFN), is described and compared with two alternative spatiotemporal fusion models: the spatial and temporal adaptive reflectance fusion model (STARFM), the earliest and most popular spatiotemporal fusion model, and the Flexible Spatiotemporal Data Fusion (FSDAF) model, a representative unmixing-based spatiotemporal fusion model. These three models represent three categories of spatiotemporal fusion models: DCSTFN is learning-based, STARFM is filter-based, and FSDAF is unmixing-based. The learning-based category is relatively new and emerging among the three. This study shows the advantages of incorporating deep learning: robustness and improved accuracy. The training of deep learning models can be completed quickly using parallel computing.

Keywords Data fusion · Learning-based fusion · Filter-based fusion · Unmixing-based fusion · Deep learning · Parallel computing

14.1 The Concept of Data Fusion

14.1.1 Definitions

Data fusion is the process of combining data from multiple sources into one unified dataset that improves on any of the source data in some respect (Schmitt and Zhu 2016). Many definitions of data fusion have been proposed. Schmitt and Zhu (2016) listed 12 definitions that appeared between 1990 and 2013. Table 14.1 lists a few additional definitions, which show the evolution of the term over time and across application domains. The early study of data fusion can be traced back to the mathematical models for data manipulation in the 1960s (Rashinkar and Krushnasamy 2017).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_14

Table 14.1 Additional list of definitions for data fusion
Definition | Reference
“Multisensor fusion is posed as a labeled set covering problem where the report subsets are labeled as targets or false alarms” | Bowman and Morefield (1980)
Data fusion constitutes the following: (1) “The integration of information from multiple sources to produce the most comprehensive and specific unified data about an entity.” (2) “The analysis of intelligence information from multiple sources covering a number of different events to produce a comprehensive report of activity that assesses its significance. The analysis is often supported by the inclusion of operational data.” (3) “Intelligence usage, the logical blending of related information/intelligence from multiple sources.” | Goodman (1987) and White (1991)
“Data fusion is the process of combining data to refine state estimates and predictions.” | Steinberg et al. (1999)
“Data fusion is a process of integration of multiple data and knowledge representing the same real-world object into a consistent, accurate and useful representation.” | Rashinkar and Krushnasamy (2017)
Data fusion includes elements of (1) “Data sources: Single or multiple data sources from different positions and at different points of time are involved in data fusion.” (2) “Operation: One needs an operation of a combination of data and refinement of information, which can be described as ‘transforming’.” (3) “Purpose: Gaining improved information with less error possibility in detection or prediction and superior reliability as the goal of fusion.” | Meng et al. (2020)

14.1.2 Classification of Data Fusion

The Joint Directors of Laboratories (JDL) defines data fusion as “[a] process dealing with the association, correlation, and combination of data and information from single and multiple sources to achieve refined position and identity estimates, and complete and timely assessments of situations and threats as well as their significance” (White 1991). According to this definition and model, data fusion can be classified into five levels: L0 (Sub-Object Assessment), L1 (Object Assessment), L2 (Situation Assessment), L3 (Impact Assessment), and L4 (Process Refinement) (Steinberg et al. 1999). Based on information theory and levels of abstraction, data fusion can be grouped into low level (data level), medium level (feature level), and high level (decision level) (Luo and Su 1999). Dasarathy (1997) expands this three-level hierarchy into five fusion processes based on their input and output types, plus one for temporal data modeling: Data In-Data Out (DAI-DAO) fusion, Data In-Feature Out (DAI-FEO) fusion, Feature In-Feature Out (FEI-FEO) fusion, Feature In-Decision Out (FEI-DEO) fusion, Decision In-Decision Out (DEI-DEO) fusion, and Temporal (data/feature/decision) fusion. In remote sensing big data applications, data fusion includes pansharpening and resolution enhancement, multitemporal data fusion, elevation data fusion, and the integration of big data and social media (Ghamisi et al. 2019). When applied to remote sensing data fusion, the JDL levels can be grouped into data alignment, data/object correlation, attribute estimation, and identity estimation (Schmitt and Zhu 2016).
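The five input/output fusion processes listed above can be written down as a small lookup, which makes the naming scheme explicit. This is an illustrative sketch only; the `Level` enum and the `fusion_mode` function are hypothetical names, and the temporal fusion category is left out because it is not defined by an input/output pair.

```python
from enum import Enum


class Level(Enum):
    DATA = "DA"
    FEATURE = "FE"
    DECISION = "DE"


# The five input/output combinations named in the text.
VALID_MODES = {
    (Level.DATA, Level.DATA),
    (Level.DATA, Level.FEATURE),
    (Level.FEATURE, Level.FEATURE),
    (Level.FEATURE, Level.DECISION),
    (Level.DECISION, Level.DECISION),
}


def fusion_mode(input_level, output_level):
    """Return the abbreviation, e.g. DAI-FEO, for a listed input/output pair."""
    if (input_level, output_level) not in VALID_MODES:
        raise ValueError("combination is not one of the five listed fusion processes")
    return f"{input_level.value}I-{output_level.value}O"


mode = fusion_mode(Level.DATA, Level.FEATURE)
```

In this scheme, a fusion process never outputs at a lower abstraction level than its input, which is why a pair such as decision in, data out is rejected.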

14.2 Data Fusion Architectures

Figure 14.1 illustrates a general framework for data fusion, which can take heterogeneous data into consideration. The framework is generalized from different data fusion models (Goodman 1987; White 1991; Dasarathy 1997; Luo and Su 1999; Steinberg et al. 1999; Schmitt and Zhu 2016; Meng et al. 2020). The focus of data fusion may differ depending on the data and the target. For remotely sensed data, the goal may be the enhancement of spatial, spectral, radiometric, or temporal resolution. The first step is the alignment of the different data. This preprocessing may involve reprojection, rescaling, or registration, and data transformation or feature extraction may also be accomplished during this step. The next step is to establish the relationship between the goal and the input data. The correlation may be established as a regression model, a machine learning model, or a pixel unmixing model. The final step applies the model to estimate the target attribute or identity at the target (spatial, spectral, radiometric, and/or temporal) resolution or level of detail in an abstraction hierarchy.

Fig. 14.1 A generalized data fusion framework

Designing a data fusion algorithm for remote sensing big data needs to take the following into consideration:

1. The large volume of data: The massive volume of remote sensing data poses a challenge for data fusion. Many data fusion algorithms may become computationally intractable (Ma and Kang 2020). In addition, fusing remote sensing big data streams may significantly increase the computational demand (Li et al. 2021). As a result, the design of data fusion algorithms for remote sensing big data not only needs to select correlation models carefully by their computational tractability but also needs to shift the computational infrastructure fundamentally from a single workstation to a distributed computing environment that supports massive parallelism (Schmitt and Zhu 2016; Ghamisi et al. 2019; Li et al. 2021).
2. Change of support: Remote sensing data come at different resolutions, and data fusion often has to deal with data of different (spatial, temporal, spectral, or radiometric) resolutions. The statistical support may differ from one resolution to another. This is called the change of support (Cressie 1996; Nguyen et al. 2012; Ma and Kang 2020). The data fusion algorithm design should consider models that are capable of resolving the change-of-support problem.
3. Variety of data: Data from different sensors with different modalities require data fusion to work with data of high dimensionality. Different sensor models also bring challenges in aligning and correlating different data. For example, SAR and optical remote sensing are two different models of remote sensing, and their alignment may not be a trivial task. Added dimensionality may result from considering single-sensor data from different viewing angles, at different resolutions, or from different points in time (Schmitt and Zhu 2016). The capability of handling high-dimensional data should be one of the selection criteria for data fusion methods to be used with remote sensing big data.

Table 14.2 lists major data fusion types for remote sensing data. Three major types are covered here: spatiospectral fusion (SS), spatiotemporal fusion (ST), and point cloud fusion (PC) (Ghamisi et al. 2019). The major methods in each category are summarized with a comparison of their advantages and limitations/challenges, which should give a brief guide for selecting data fusion models for different goals or applications in remote sensing.
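The three steps of the general framework (alignment, correlation, estimation) can be sketched with a deliberately simple stand-in for each step. The sketch assumes nearest-neighbor upsampling for alignment and a single global linear regression for correlation; it is not STARFM, FSDAF, DCSTFN, or any other published fusion model, and all function names and sample values are hypothetical.

```python
def upsample_nearest(coarse, factor):
    """Alignment: replicate each coarse pixel onto a factor x factor block of the fine grid."""
    return [
        [value for value in row for _ in range(factor)]
        for row in coarse
        for _ in range(factor)
    ]


def flatten(image):
    return [value for row in image for value in row]


def fit_line(x, y):
    """Correlation: ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((a - x_mean) ** 2 for a in x)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def fuse(coarse_ref, fine_ref, coarse_target, factor):
    """Estimation: apply the fitted relationship to a new coarse image."""
    aligned_ref = upsample_nearest(coarse_ref, factor)
    slope, intercept = fit_line(flatten(aligned_ref), flatten(fine_ref))
    aligned_target = upsample_nearest(coarse_target, factor)
    return [[slope * value + intercept for value in row] for row in aligned_target]


# Hypothetical reference pair (same date) and a later coarse-only observation.
coarse_ref = [[1.0, 2.0]]
fine_ref = [[3.0, 3.0, 5.0, 5.0], [3.0, 3.0, 5.0, 5.0]]
coarse_target = [[3.0, 4.0]]
predicted_fine = fuse(coarse_ref, fine_ref, coarse_target, factor=2)
```

A real fusion model would replace the global regression with a spatially adaptive filter, an unmixing model, or a learned network, but the flow of data through the three steps stays the same.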

14.3 Fusion of MODIS and Landsat with Deep Learning 14.3.1 The Problem Crop monitoring and condition assessment require remote sensing with two high resolutions: spatial resolution and temporal resolution (Gao et al. 2015). The most popular data with high-temporal resolution up to daily coverage is the Moderate Resolution Imaging Spectroradiometer (MODIS) that is currently onboard both Aqua and Terra satellites. However, its spatial resolution is too low for agricultural applications which expect to get some sensing information at fields. In the conterminous United States (CONUS), mean and median field sizes are 0.193 km2 and 0.278 km2, respectively (Yan and Roy 2016). The spatial resolution of nominally


Table 14.2 Data fusion for remote sensing data

SS: Spatiospectral fusion (SS) is to fuse fine spatial resolution images with coarse spatial resolution images to create fine spatial resolution images for all bands.

| Type | Definition | Advantages | Challenges |
|---|---|---|---|
| Component substitution | Three steps: transform to separate spatial and spectral; substitute spatial information by high-resolution panchromatic image; inverse transformation | Spatial fidelity; simple | Global spectral distortion; registration |
| Multiresolution analysis | Extract spatial details from the panchromatic image; inject the details into multispectral data | Spectral consistency | Complexity |
| Geostatistical analysis | Maintain the value: the downscaled prediction, upscaled to the original coarse spatial resolution, is identical to the original coarse spatial resolution image | Spectral consistency | Kriging |
| Subspace representation | Unmixing | Unmixing | Time complex; spectral artifacts |
| Sparse representation | Dictionary | Fast | Limited |

ST: Spatiotemporal fusion (ST) is to blend fine spatial resolution data with coarse temporal resolution data and fine temporal resolution data with coarse spatial resolution.

| Type | Definition | Advantages | Challenges |
|---|---|---|---|
| Image pair-based | Relation between pairs | Widely used | Consistent change |
| Spatial unmixing | Unmixing | No assumption | Limited |
| Combined | Use both pair-based and unmixing | Both | Limited use |

PC: Point cloud fusion (PC) is to make use of 3D geometric, spatial-structural, and lidar backscatter information inherent in point clouds and combine it with spectral data sources.

| Type | Definition | Advantages | Challenges |
|---|---|---|---|
| Point cloud level | Point cloud co-registration | Precise | Co-registration |
| Voxel level | Transformation into 2D | Image | From 3D to 2D |
| Feature level | Classification | Feature | Machine learning |

250 m for one visible (VIS) and one near-infrared (NIR) band and 500 m for other VIS and NIR bands. Few pixels would fall into one field with the same crop. On the other hand, Landsat provides multiple bands with 30-m resolution, making it a good source of Earth Observation for crop mapping and monitoring in terms of spatial resolution. However, the temporal resolution of Landsat is quite low, about 16 days per revisit. Crop monitoring expects daily observations to form a continuous understanding of crop growth and crop conditions. Spatiotemporal fusion is one of the approaches that can combine both MODIS and Landsat to form a time series of fine spatial resolution (30 m) and fine temporal resolution (daily) (Gao et al. 2015, 2017). That is, the approach is to get a time


series of data with a fine spatial resolution (30 m) and a fine temporal resolution (daily) by fusing Landsat ETM+ data, which have a fine spatial resolution (30 m) but a coarse temporal resolution (about every 16 days), with MODIS data, which have a fine temporal resolution (daily) but a coarse spatial resolution (250 m for bands 1 and 2 and 500 m for the other visible and near-infrared bands). Building a long time series of high spatial and temporal resolution data for crop monitoring and condition assessment can be a challenging problem with the characteristics of remote sensing big data. The following summarizes some of the major challenges:
1. Massive data: MODIS data has been available since 2000, and corresponding Landsat data are available too. Data volume has been growing with continuous observations from both MODIS and Landsat.
2. Change of support: In the spatial domain, the fusion has to make inferences at one spatial resolution using a model formed with data at another spatial resolution. Applying a model built at one resolution precisely to another resolution requires that the data scale linearly, which is not guaranteed. Stochastic approaches may have to be applied to deal with the uncertainty.
3. Velocity of the data stream: MODIS data can be streamed in near real time. The time complexity of the data fusion is expected to be low enough to process the data in time.
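The change-of-support challenge can be made concrete with a small numeric sketch. The reflectance values below are invented for illustration and do not come from MODIS or Landsat. The sketch only shows that a nonlinear quantity, here the normalized difference vegetation index (NDVI), gives different answers depending on whether it is computed before or after aggregating fine pixels into one coarse pixel.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel."""
    return (nir - red) / (nir + red)


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Four hypothetical fine-resolution pixels inside one coarse pixel:
# two vegetated and two bare-soil pixels (values are made up).
red = [0.04, 0.05, 0.20, 0.22]
nir = [0.50, 0.45, 0.25, 0.26]

# Route 1: compute NDVI at fine resolution, then aggregate.
mean_of_ndvi = mean([ndvi(r, n) for r, n in zip(red, nir)])

# Route 2: aggregate reflectance first (roughly what a coarse sensor
# observes), then compute NDVI at the coarse resolution.
ndvi_of_mean = ndvi(mean(red), mean(nir))

# The two routes disagree because NDVI is not linear in reflectance.
print(round(mean_of_ndvi, 3), round(ndvi_of_mean, 3))
```

A model fitted to one of these quantities cannot be transferred to the other without accounting for the gap, which is why the text suggests stochastic treatment of the uncertainty.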

14.3.2 Data Fusion Methods

Spatiotemporal fusion can be further classified into subcategories. Depending on how images are used, spatiotemporal fusion methods can be grouped into image-pair-based, unmixing, and combined methods (Ghamisi et al. 2019). Most methods are image-pair-based: they establish a correlation between images at one time and apply the model at another time. Based on the core fusion model, spatiotemporal fusion methods can be grouped into filter-based, unmixing-based, and learning-based methods (Song et al. 2018). The filter-based approach forms a filter that is a weighted sum of spectrally similar neighboring pixels, which are determined according to their spectral difference, temporal difference, and spatial proximity (Song et al. 2018). This group of methods is the most popular for fusing MODIS and Landsat data (Wei et al. 2017). Among this group, the most popular spatiotemporal fusion model is the spatial and temporal adaptive reflectance fusion model (STARFM) (Gao et al. 2006). These algorithms assume that pixels of a spectrally similar class share similar pixel values. In other words, land cover changes over the time series are assumed to be negligible. This assumption may not be easily met. Based on sparse representation theory, machine learning can be used to build a sparse dictionary to correlate data between Landsat and MODIS in data fusion (Song et al. 2018; Gao et al. 2020). The computational complexity of


learning-based approaches is higher than that of the other types of spatiotemporal data fusion methods, that is, unmixing and filter-based methods. However, the performance of learning-based spatiotemporal fusion is generally better than that of unmixing or filter-based methods (Song et al. 2018).

In this study, a Deep Convolutional Spatiotemporal Fusion Network (DCSTFN), which is a deep learning algorithm, is developed (Tan et al. 2018). A convolutional neural network (CNN) is selected as the base of this spatiotemporal fusion algorithm. The reason for selecting a CNN among deep learning algorithms is that CNNs perform well in object recognition owing to their internal structure, which uses multiscale information in learning (Han et al. 2017; Zhao et al. 2017). The direct nonlinear relationship between MODIS and Landsat is learned using a convolutional neural network. The architecture of DCSTFN is shown in Fig. 14.2. DCSTFN consists of three components: the expansion of MODIS images at both times Ta and Tb, the extraction of high-frequency components from Landsat at time Ta, and the fusion of extracted features at time Tb.

First, the expansion of MODIS images at both times Ta and Tb is completed through a convolutional neural network consisting of two convolution layers, three deconvolution layers, and one convolution layer in sequence. This step aligns the MODIS images at both Ta and Tb with the features extracted from Landsat at Ta in terms of spatial resolution. In this design, the aligned spatial resolution is halfway between that of Landsat and MODIS, that is, roughly 150 m. Second, the extraction of high-frequency features from Landsat at time Ta is done through a convolutional neural network consisting of two convolution layers, one max-pooling layer, and two convolution layers in sequence. The resulting Landsat layer should be aligned with the intermediate layers from MODIS at times Ta and Tb.
Finally, the fusion of extracted features is completed with one merge of the extracted features using the equation given below (where Landsat (Tb) is the mapped feature with matching spatial resolution of Landsat at time Tb), one deconvolution layer to restore merged features from intermediate spatial resolution to the spatial

Fig. 14.2 Architecture of the Deep Convolutional Spatiotemporal Fusion Network (DCSTFN)


resolution of Landsat at Ta, and two fully connected neural networks for the fine-tuning of fusion outputs.

$$\mathrm{Landsat}(T_b) = \mathrm{MODIS}(T_b) + \mathrm{Landsat}(T_a) - \mathrm{MODIS}(T_a)$$

To compare and evaluate the performance of the deep learning spatiotemporal fusion model, two popular spatiotemporal fusion models are used. The first one is the spatial and temporal adaptive reflectance fusion model (STARFM), the earliest spatiotemporal fusion model, developed by Gao et al. (2006). Figure 14.3 shows the algorithm and its overall flow. Here, coarse data is MODIS, while fine

Fig. 14.3 Algorithm of the spatial and temporal adaptive reflectance fusion model (STARFM)


Fig. 14.4 Algorithm of the Flexible Spatiotemporal Data Fusion (FSDAF)

data is Landsat. The fused result is data with a temporal resolution of MODIS and spatial resolution of Landsat. The time t2 can be any of the closest dates and times to time t1 when both MODIS and Landsat are available. The second algorithm for comparison is the Flexible Spatiotemporal Data Fusion (FSDAF), an unmixing spatiotemporal data fusion model (Zhu et al. 2016). It has six steps as shown in Fig. 14.4. The major steps are (1) classify Landsat at time t1, (2) estimate the temporal changes of each class, (3) predict fine spatial resolution


Fig. 14.5 Data fusion result comparison (DCSTFN, STARFM, and FSDAF)

image and residuals from temporal changes, (4) get the thin-plate spline (TPS) interpolation function for guiding residual distribution, (5) distribute residuals to pixels with fine spatial resolution, and (6) predict the image with fine spatial resolution using the neighborhood. All three algorithms perform spatiotemporal fusion: each can produce fine spatial resolution data at time t2, given a pair of MODIS and Landsat images at time t1 and a MODIS image at time t2. The algorithm can be applied repeatedly to the MODIS image at any time t, taking as the reference pair the closest date on which both Landsat and MODIS are available. Figure 14.5 shows one example output using all three algorithms. The result from DCSTFN is the most similar to the Landsat image at the target date among all the results. In the inset images, the lower right corner shows the difference in details


among all three results. The DCSTFN result shows more similar details than any other model result. Statistical analyses of the accuracy have been done with repetitions, and the results are reported in Tan et al. (2018). They showed that DCSTFN performed better than the other two models.
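The merge relation used in the fusion step, Landsat(Tb) = MODIS(Tb) + Landsat(Ta) − MODIS(Ta), can be illustrated element by element on small arrays. This is only a sketch of the arithmetic. In DCSTFN the relation is applied to learned feature maps at a matching intermediate resolution, whereas the function name `merge_features`, the nested lists, and the integer values below are hypothetical stand-ins.

```python
def merge_features(modis_tb, landsat_ta, modis_ta):
    """Element-wise merge: Landsat(Tb) = MODIS(Tb) + Landsat(Ta) - MODIS(Ta).

    All three inputs are 2D lists of equal shape, assumed to be
    already aligned to the same spatial resolution.
    """
    return [
        [mb + la - ma for mb, la, ma in zip(row_mb, row_la, row_ma)]
        for row_mb, row_la, row_ma in zip(modis_tb, landsat_ta, modis_ta)
    ]


# Hypothetical 2 x 2 feature maps (integer values chosen for easy checking).
landsat_ta = [[1200, 1500], [900, 1100]]    # fine detail at time Ta
modis_ta = [[1250, 1250], [1000, 1000]]     # coarse signal at time Ta
modis_tb = [[1350, 1350], [1050, 1050]]     # coarse signal at time Tb

landsat_tb = merge_features(modis_tb, landsat_ta, modis_ta)
print(landsat_tb)
```

Read this way, the relation adds the temporal change seen by the coarse sensor between Ta and Tb to the fine spatial detail observed at Ta.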

References

Bowman C, Morefield C (1980) Multisensor fusion of target attributes and kinematics. In: 1980 19th IEEE conference on decision and control including the symposium on adaptive processes. IEEE, Albuquerque, pp 837–839
Cressie NA (1996) Change of support and the modifiable areal unit problem. Geogr Syst 3:159–180
Dasarathy BV (1997) Sensor fusion potential exploitation-innovative architectures and illustrative applications. Proc IEEE 85:24–38. https://doi.org/10.1109/5.554206
Gao F, Masek J, Schwaller M, Hall F (2006) On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance. IEEE Trans Geosci Remote Sens 44:2207–2218. https://doi.org/10.1109/TGRS.2006.872081
Gao F, Hilker T, Zhu X et al (2015) Fusing Landsat and MODIS data for vegetation monitoring. IEEE Geosci Remote Sens Mag 3:47–60. https://doi.org/10.1109/MGRS.2015.2434351
Gao F, Anderson MC, Zhang X et al (2017) Toward mapping crop progress at field scales through fusion of Landsat and MODIS imagery. Remote Sens Environ 188:9–25. https://doi.org/10.1016/j.rse.2016.11.004
Gao J, Li P, Chen Z, Zhang J (2020) A survey on deep learning for multimodal data fusion. Neural Comput 32:829–864. https://doi.org/10.1162/neco_a_01273
Ghamisi P, Gloaguen R, Atkinson PM et al (2019) Multisource and multitemporal data fusion in remote sensing: a comprehensive review of the state of the art. IEEE Geosci Remote Sens Mag 7:6–39. https://doi.org/10.1109/MGRS.2018.2890023
Goodman IR (1987) A general theory for the fusion of data. Naval Ocean Systems Center, San Diego
Han X, Zhong Y, Zhang L (2017) An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sens 9:666. https://doi.org/10.3390/rs9070666
Li J, Liu Z, Lei X, Wang L (2021) Distributed fusion of heterogeneous remote sensing and social media data: a review and new developments. Proc IEEE:1–14. https://doi.org/10.1109/JPROC.2021.3079176
Luo RC, Su KL (1999) A review of high-level multisensor fusion: approaches and applications. In: Proceedings. 1999 IEEE/SICE/RSJ. International conference on multisensor fusion and integration for intelligent systems. MFI’99 (Cat. No.99TH8480). IEEE, Taipei, pp 25–31
Ma P, Kang EL (2020) Spatio-temporal data fusion for massive sea surface temperature data from MODIS and AMSR-E instruments. Environmetrics 31. https://doi.org/10.1002/env.2594
Meng T, Jing X, Yan Z, Pedrycz W (2020) A survey on machine learning for data fusion. Inf Fusion 57:115–129. https://doi.org/10.1016/j.inffus.2019.12.001
Nguyen H, Cressie N, Braverman A (2012) Spatial statistical data fusion for remote sensing applications. J Am Stat Assoc 107:1004–1018. https://doi.org/10.1080/01621459.2012.694717
Rashinkar P, Krushnasamy VS (2017) An overview of data fusion techniques. In: 2017 international conference on innovative mechanisms for industry applications (ICIMIA). IEEE, Bangalore, pp 694–697
Schmitt M, Zhu XX (2016) Data fusion and remote sensing: an ever-growing relationship. IEEE Geosci Remote Sens Mag 4:6–23. https://doi.org/10.1109/MGRS.2016.2561021
Song H, Liu Q, Wang G et al (2018) Spatiotemporal satellite image fusion using deep convolutional neural networks. IEEE J Sel Top Appl Earth Obs Remote Sens 11:821–829. https://doi.org/10.1109/JSTARS.2018.2797894
Steinberg AN, Bowman CL, White FE (1999) Revisions to the JDL data fusion model. In: Dasarathy BV (ed) Proceedings of the SPIE. Sensor fusion: architectures, algorithms and applications. SPIE, Orlando, p 430
Tan Z, Yue P, Di L, Tang J (2018) Deriving high spatiotemporal remote sensing images using deep convolutional network. Remote Sens 10:1066. https://doi.org/10.3390/rs10071066
Wei J, Wang L, Liu P et al (2017) Spatiotemporal fusion of MODIS and Landsat-7 reflectance images via compressed sensing. IEEE Trans Geosci Remote Sens 55:7126–7139. https://doi.org/10.1109/TGRS.2017.2742529
White FR (1991) Data fusion lexicon. Joint Directors of Labs, Washington, DC
Yan L, Roy DP (2016) Conterminous United States crop field size quantification from multi-temporal Landsat data. Remote Sens Environ 172:67–86. https://doi.org/10.1016/j.rse.2015.10.034
Zhao W, Du S, Emery WJ (2017) Object-based convolutional neural network for high-resolution imagery classification. IEEE J Sel Top Appl Earth Obs Remote Sens 10:3386–3396. https://doi.org/10.1109/JSTARS.2017.2680324
Zhu X, Helmer EH, Gao F et al (2016) A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sens Environ 172:165–177. https://doi.org/10.1016/j.rse.2015.11.016

Chapter 15

Examples of Remote Sensing Applications of Big Data Analytics—Agricultural Drought Monitoring and Forecasting

Abstract This chapter describes an example of processing remote sensing big data in a distributed computing environment for realizing the agricultural drought monitoring and forecasting system. The system demonstrated the event-based processing workflow using a service-oriented architecture. Standards of geospatial Web services are adopted to achieve reusability, flexibility, and scalability in handling remote sensing big data.

Keywords Agricultural drought · Monitoring · Event-based processing · Workflow · Service-oriented architecture · Standard geospatial Web service · Web portal · Earth Observation · Remote sensing · Vegetation index

15.1 Agricultural Drought

Drought is one of the major natural hazards to agriculture. The definition of agricultural drought has evolved over time. Table 15.1 quotes several recent direct definitions of agricultural drought, in contrast to other drought types: meteorological drought, hydrological drought, and socioeconomic drought. The core is to reflect a functional relationship between crop yield and major agricultural drought factors, that is, crop, soil, and climate factors (Mannocchi et al. 2004; Dalezios et al. 2017). The most common way to quantify agricultural drought is to use agricultural drought indices. A review of twentieth-century drought indices in the United States can be found in Heim (2002). Table 15.2 lists major agricultural drought indices. Early drought indices consider the deficiency of precipitation. Meteorological observations are the major source for calculating agricultural drought indices, such as the aridity index (AI) and the standardized precipitation index (SPI). More and more factors are taken into consideration in calculating agricultural drought indices, including vegetation, soil type, antecedent soil moisture, and evapotranspiration (Heim 2002). With the introduction of remote sensing, agricultural drought indices can be calculated in near real time to be applied in agricultural practices. A comprehensive

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/ Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_15


Table 15.1 Selected definitions of agricultural drought

| Year | Reference | Definition |
|---|---|---|
| 2000 | Maracchi (2000) | Agricultural drought is “the interaction between climatic conditions and some other factors that lead to a strong decrease in agricultural production or to a worsening of product quality.” |
| 2002 | Wilhelmi and Wilhite (2002) | Agricultural drought is evaluated with a weighted vulnerability of key factors based on their significance for agricultural sectors. The key factors include land use, soil root zone available holding capacity, probability of seasonal crop moisture deficiency, and irrigation support. |
| 2012 | Dalezios et al. (2012) | Agricultural drought “exists when soil moisture is depleted so that crop yield is reduced considerably.” |
| 2016 | Hazaymeh and Hassan (2016) | Agricultural drought is “the deficiency of soil moisture below the optimal level required for the proper growth of plants during different growing stages, resulting in growth stress and yield reduction.” |

review of remote sensing-based agricultural drought indices can be found in Hazaymeh and Hassan (2016). Among all the agricultural drought indices, the Vegetation Condition Index (VCI) is one of the earliest that can be calculated directly from remote sensing data. In the example that follows, it is used as the base index for building the agricultural drought monitoring and forecasting system by leveraging remote sensing big data.

15.2 Remote Sensing Big Data for Agricultural Drought

Different types of remote sensing data have been used in calculating different agricultural drought indices, including both optical remote sensing and microwave remote sensing (Hazaymeh and Hassan 2016). Table 15.3 lists commonly used remote sensing data for the calculation of agricultural drought indices. In general, the remote sensing technology needs to meet the following requirements to be meaningfully used in monitoring and forecasting agricultural drought.
• High temporal resolution: Agricultural activities are seasonal. Crop condition and growth change quickly, especially during the growing season, when sufficient soil moisture is essential in addition to temperature. The most common requirement is to have a repeated, clear observation at intervals of less than a week. Optical remote sensing with multiple spectral bands is mostly used in agricultural drought monitoring, and cloudy and foggy weather affects its observations. As a result, a daily temporal resolution is often required.
• Spatial resolution: The spatial resolution should also be reasonably high to be relevant to agricultural fields and their management. Remote sensor observations with a spatial resolution coarser than 1 km would not meet the requirements of agricultural drought monitoring down to the field level. For the large


Table 15.2 Agricultural drought indices

| Year | Index | Full name | Description | Reference |
|---|---|---|---|---|
| 1925 | AI | Aridity Index | The ratio of precipitation to temperature. | De Martonne (1925) and Baltas (2007) |
| 1965 | CMI | Crop Moisture Index | The sum of the evapotranspiration anomaly and the moisture excess, usually weekly. | Palmer (1965, 1968) |
| 1965 | PDSI | Palmer Drought Severity Index | The moisture excess or deficiency. | Palmer (1965) and Alley (1984) |
| 1968 | KBDI | Keetch-Byram Drought Index | The net effect of evapotranspiration and precipitation in producing a moisture deficiency in the upper layers of the soil. | Keetch and Byram (1968) |
| 1988 | SMA | Soil Moisture Anomaly | A degree of dryness of the soil compared with normal conditions. | Bergman et al. (1988) |
| 1990 | VHI | Vegetation Health Index | A combination of VCI and TCI to identify drought-related agricultural impacts. | Kogan (1990, 1997) |
| 1993 | VCI | Vegetation Condition Index | A temporally normalized vegetation index. | Kogan and Sullivan (1993), Kogan (1995, 1997) |
| 1993 | SPI | Standardized Precipitation Index | A probability of precipitation based on history. | McKee et al. (1993) and Wu et al. (2005) |
| 1995 | TCI | Temperature Condition Index | An index to reflect stress on vegetation caused by temperature and excessive wetness. | Kogan (1995) |
| 1999 | EDI | Effective Drought Index | A standardized precipitation needed for a return to a normal or deviation of effective precipitation. | Byun and Wilhite (1999) |
| 2002 | ADV | Agricultural Drought Vulnerability | A weighted vulnerability of key agricultural drought vulnerability factors, including land use, soil root zone available holding capacity, probability of seasonal crop moisture deficiency, and irrigation support. | Wilhelmi and Wilhite (2002) |
| 2003 | SIWSI | Shortwave Infrared Water Stress Index | A remote sensing index for canopy water content. | Fensholt and Sandholt (2003) |
| 2004 | Sc-PDSI | Self-Calibrated PDSI | A self-calibrated PDSI based on each station and its climate regime. | Wells et al. (2004) |
| 2005 | ETDI | Evapotranspiration Deficit Index | A weekly ratio of water stress ratio against the median over a long term. | Narasimhan and Srinivasan (2005) |
| 2005 | SMDI | Soil Moisture Deficit Index | A weekly soil moisture of total soil column and at 0.61, 1.23, and 1.83 m. | Narasimhan and Srinivasan (2005) |
| 2008 | VegDRI | Vegetation Drought Response Index | A drought index to reflect the drought-induced vegetation stress using a combination of remote sensing, climate-based indicators, and biophysical information and land-use data. | Brown et al. (2008) |
| 2010 | AAI | Aridity Anomaly Index | A ratio of the aridity index (AI) compared to the normal AI. | Gommes et al. (2010, p. 6) |
| 2010 | SPEI | Standardized Precipitation Evapotranspiration Index | An index based on SPI but including a temperature component. | Vicente-Serrano et al. (2010) |
| 2010 | SDCI | Scaled Drought Condition Index | A remote sensing-based drought index combining land surface temperature, NDVI, and precipitation from TRMM. | Rhee et al. (2010) |
| 2011 | ESI | Evaporative Stress Index | An index comparing evapotranspiration to potential evapotranspiration. | Anderson et al. (2011) |
| 2011 | SWDI | Soil Wetness Deficit Index | An index combining vegetation index and land surface temperature from remote sensing data. | Keshavarz et al. (2011) |
| 2012 | ARID | Agricultural Reference Index for Drought | A combination of water stress approximation and crop models (e.g., CERES-Maize). | Woli et al. (2012) |
| 2012 | CDI | Combined Drought Indicator | A combination of SPI, SMA, and fraction of absorbed photosynthetically active radiation (fAPAR). | Sepulcre-Canto et al. (2012) |
| 2013 | MIDI | Microwave Integrated Drought Index | An integrated drought index from microwave remote sensing data. | Zhang and Jia (2013) |
| 2013 | PCI | Precipitation Condition Index | A precipitation index. | Zhang and Jia (2013) |
| 2017 | PADI | Process-Based Accumulated Drought Index | An integrated index to quantify the accumulative drought impacts on crops. | Zhang et al. (2017) |
| 2019 | ADCI | Agricultural Dry Condition Index | A combination of hydrometeorological parameters, including soil moisture, vegetation, land surface temperature, VHI, and MIDI. | Sur et al. (2019) |


Table 15.3 Remote sensing data for agricultural drought monitoring

| Sensor | Type | Agricultural drought indices | Reference |
|---|---|---|---|
| AVHRR (Advanced Very High Resolution Radiometer) | Optical | VCI, VHI | Kogan (1990, 1995, 1997) |
| MODIS (Moderate Resolution Imaging Spectroradiometer) | Optical | SIWSI, VCI, SDCI | Fensholt and Sandholt (2003), Rhee et al. (2010), and Kukunuri et al. (2020) |
| SPOT-VEG | Optical | VCI | Owrangi et al. (2011) |
| TRMM (Tropical Rainfall Measuring Mission) | Microwave | PCI, SDCI | Rhee et al. (2010), Zhang and Jia (2013) |
| AMSR-E (Advanced Microwave Scanning Radiometer) | Microwave | MIDI | Zhang and Jia (2013) |

fields in the United States, the spatial resolution should be finer than a quarter mile (about 400 m).
• Spectral bands for water content or vegetation: Detecting water content in the topsoil and crop growth conditions is the key technique in monitoring agricultural drought. The remote sensor needs spectral bands that detect soil water content or crop growth condition in order to infer soil moisture and crop water deficit. For example, in optical remote sensing, infrared bands (e.g., 1.55–1.75 μm and 2.08–2.35 μm) are sensitive to moisture content variation in soils and crops. In microwave remote sensing, L-band (1.4 GHz) is correlated with soil moisture content at a deeper soil layer (at least 5 cm), while C-band (6.8 GHz) is sensitive to surface soil moisture and is affected by vegetation and surface roughness (Macelloni et al. 2003; Gruhier et al. 2010).
• Long time series: Forecasting models rely on time series of agricultural drought monitoring for training and validation. Long time series of observations are required to build forecasting models.
• Near real time: Agricultural decisions need timely information. Agricultural drought information should be produced with low latency. Automating the retrieval and processing of observation data is important for delivering the information in time.

15.3 Geospatial Data Analysis Infrastructure GeoBrain

GeoBrain is a geospatial data analysis infrastructure developed under a service-oriented architecture (Di 2004). It extensively adopts geospatial standards in the implementation of component services and the composition of service workflows. Geospatial Web services are the core functional components. The computation is distributed. The system is scalable to support big data analysis by distributing analysis across the network through standard interfaces, such as OGC


Web Coverage Service (WCS), Web Feature Service (WFS), Web Processing Service (WPS), Catalogue Service for the Web (CSW), and Web Notification Service (WNS). The overall architecture of GeoBrain is shown in Fig. 15.1 (Deng and Di 2006). The system can be extended by adding standard pluggable service components as long as these components follow open geospatial Web interface specifications. New data sources can be added to the system and used in analysis. Satellite remote sensing data is added as a WCS service. If the data provider has a WCS interface, the data can be added to the system directly. If the data provider does not have a WCS interface, a plugin can be implemented to proxy the source data as a WCS, which is then added to the system. Information production in GeoBrain adopts an approach of virtual data production by leveraging workflows to chain standard geospatial Web services.
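As an illustration of what accessing data through a WCS interface looks like on the client side, the sketch below assembles a WCS 2.0.1 GetCoverage request in key-value-pair form. The endpoint `https://example.org/wcs`, the coverage identifier `MOD13Q1_NDVI`, and the helper name `build_getcoverage_url` are hypothetical and are not GeoBrain services. The subset axis labels `Lat` and `Long` are also an assumption, since the actual labels depend on the coordinate reference system of the coverage being served.

```python
from urllib.parse import urlencode


def build_getcoverage_url(endpoint, coverage_id, lat_range, lon_range,
                          fmt="image/tiff"):
    """Build a WCS 2.0.1 GetCoverage URL (key-value-pair encoding).

    A list of tuples is used because the 'subset' key is repeated,
    once per axis being trimmed.
    """
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("subset", f"Lat({lat_range[0]},{lat_range[1]})"),
        ("subset", f"Long({lon_range[0]},{lon_range[1]})"),
        ("format", fmt),
    ]
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and coverage; only the URL is built, no request is sent.
url = build_getcoverage_url(
    "https://example.org/wcs", "MOD13Q1_NDVI", (35.0, 40.0), (-100.0, -95.0)
)
print(url)
```

A proxy plugin of the kind described above would answer requests of this shape on behalf of a data provider that has no native WCS interface.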

Fig. 15.1 Service oriented architecture of GeoBrain


Figure 15.2 shows the life cycle of information production in GeoBrain (Di 2005; Di et al. 2006; Yue et al. 2011). The information production takes two steps. The first step is the virtual product design at the abstract workflow level. At this level, the workflow design is concerned with finding data source services, finding processing services that implement the required processing algorithms, and defining the output product. At this conceptual level, components are chained with the required data objects and processing algorithms. The second step is the instantiation of the abstract workflow, that is, the process of actually binding instances of Web services to the workflow. This involves iterating over the available geospatial Web services and filtering and matching them by description and function. The Web services may be registered in a standard OGC-compliant Catalogue Service for the Web (CSW). Additional processing Web services may be added for reformatting, scaling, and reprojecting.

The advantages of using GeoBrain as the base infrastructure to develop the agricultural drought monitoring and forecasting system are as follows:
• Reusing existing component services: GeoBrain has already developed and implemented many standard geospatial Web services for data access, processing, and delivery. These services can be directly reused in the development of the agricultural drought monitoring and forecasting system.
• Cloud-ready for distributed computing: The GeoBrain service system was implemented under a service-oriented architecture. Each component was implemented as a standard geospatial Web service that can be redeployed into the cloud and hosted in a virtual machine. The processing and analysis workflows can be easily adapted to work in a distributed cloud computing environment and scaled out to handle remote sensing big data.
• Flexibility in modeling: The virtual product design, which leverages Web service workflows, provides flexibility in modeling and chaining services to meet the requirements of big data processing in the distributed computing environment. The abstract workflow can be re-instantiated with updated Web services when the services and catalog evolve to work with different deployments of geospatial Web services.

Fig. 15.2 The concept of virtual data product


• Timely processing with streaming data flow from the Web: The service workflow can be designed to be triggered by events, such as the arrival of new Earth Observations or the completion of a high-level product. The prompt processing capability of the workflow enables low latency in data processing and result distribution.
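The two-step production described above, an abstract workflow followed by its instantiation against a catalog, can be sketched in a few lines. Everything in this sketch is hypothetical: the step names, the record fields, the example URLs, and the `instantiate` function are illustrative and do not reproduce GeoBrain's implementation or the CSW record schema. The catalog is a plain list standing in for the records that a catalog query might return.

```python
# Step 1: abstract workflow. Each step names the kind of service and
# the function it needs, without binding to a concrete endpoint.
abstract_workflow = [
    {"step": "access_ndvi", "service_type": "WCS", "function": "vegetation_index"},
    {"step": "compute_vci", "service_type": "WPS", "function": "vci"},
    {"step": "publish_map", "service_type": "WMS", "function": "render"},
]

# Hypothetical catalog records (stand-ins for registered services).
catalog = [
    {"url": "https://example.org/wms", "service_type": "WMS", "functions": {"render"}},
    {"url": "https://example.org/wcs", "service_type": "WCS",
     "functions": {"vegetation_index"}},
    {"url": "https://example.org/wps", "service_type": "WPS",
     "functions": {"vci", "reproject"}},
]


def instantiate(workflow, records):
    """Step 2: bind each abstract step to the first matching service."""
    bound = []
    for step in workflow:
        matches = [
            r for r in records
            if r["service_type"] == step["service_type"]
            and step["function"] in r["functions"]
        ]
        if not matches:
            raise LookupError(f"no service found for step {step['step']!r}")
        bound.append({"step": step["step"], "url": matches[0]["url"]})
    return bound


concrete_workflow = instantiate(abstract_workflow, catalog)
for binding in concrete_workflow:
    print(binding["step"], "->", binding["url"])
```

Because the abstract workflow holds no endpoints, calling `instantiate` again with an updated catalog re-binds the same design to a different deployment, which is the flexibility the list above describes.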

15.4 The Global Agricultural Drought Monitoring and Forecasting System Portal

The agricultural drought calculation is model-based and uses the GeoBrain service system. One active deployment is available at https://gis.csiss.gmu.edu/GADMFS/ (Deng et al. 2013). Figure 15.3 shows a screenshot of the portal. The backend processing workflow roughly includes the following steps:
• Data access: A proxy service accesses the MODIS Earth Observation data and products directly from the data provider. These products include vegetation index products, including MOD13Q1 from MODIS on Terra and MYD13Q1 from MODIS on Aqua.
• Calculation of Vegetation Condition Index: The vegetation condition index is calculated using the equation in Kogan (1995).
• Categorization of drought severity levels: The severity levels of agricultural drought are classified based on VCI. Six levels are evaluated: no sign of drought, dry, moderate drought, severe drought, extreme drought, and exceptional drought.

Fig. 15.3 The portal of Global Agricultural Drought Monitoring and Forecasting System

References

257

• Publication of results in standard map services: The results are published as standard Web Map Services for rendering and Web Coverage Service for machineready data access. • Registering the results in the catalog: The catalog for data products is updated through transaction. New records include vegetation index, vegetation condition index, and drought classification map. • Synchronization of current data records in the Portal: The availability of data is updated at the Portal. The current data record is update to allow the Portal to show the latest result. The Global Agricultural Drought Monitoring and Forecasting System allow interactive query and analysis by leveraging standard Web Map Services. Location- based analysis is supported. The area of interest can be defined by onscreen drawing, uploaded geographic data, or named administrative regions.
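The VCI calculation and severity categorization steps in the workflow above can be sketched for a single pixel as follows. The VCI formula is the commonly cited form attributed to Kogan (1995), which scales the current NDVI between its multi-year minimum and maximum. The class break values are hypothetical placeholders, since the text names the six levels but not their thresholds; they are not the portal's actual settings.

```python
def vci(ndvi, ndvi_min, ndvi_max):
    """Vegetation Condition Index in percent for one pixel.

    ndvi      current NDVI value
    ndvi_min  multi-year minimum NDVI for the same pixel and period
    ndvi_max  multi-year maximum NDVI for the same pixel and period
    """
    if ndvi_max == ndvi_min:
        raise ValueError("NDVI range is zero; VCI is undefined")
    return 100.0 * (ndvi - ndvi_min) / (ndvi_max - ndvi_min)


# Hypothetical upper bounds (VCI percent) for each class, driest first.
# These break values are assumptions for illustration only.
SEVERITY_BREAKS = [
    (10.0, "exceptional drought"),
    (20.0, "extreme drought"),
    (30.0, "severe drought"),
    (40.0, "moderate drought"),
    (50.0, "dry"),
]


def severity(vci_value):
    """Map a VCI value to one of the six severity levels named in the text."""
    for upper_bound, label in SEVERITY_BREAKS:
        if vci_value < upper_bound:
            return label
    return "no sign of drought"


value = vci(ndvi=0.5, ndvi_min=0.25, ndvi_max=0.75)
print(value, severity(value))
```

In an operational workflow the same arithmetic would be applied per pixel to whole rasters; the scalar version here only makes the logic explicit.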

References
Alley WM (1984) The Palmer drought severity index: limitations and assumptions. J Clim Appl Meteorol 23:1100–1109. https://doi.org/10.1175/1520-0450(1984)0232.0.CO;2
Anderson MC, Hain C, Wardlow B et al (2011) Evaluation of drought indices based on thermal remote sensing of evapotranspiration over the continental United States. J Clim 24:2025–2044. https://doi.org/10.1175/2010JCLI3812.1
Baltas E (2007) Spatial distribution of climatic indices in northern Greece. Meteorol Appl 14:69–78. https://doi.org/10.1002/met.7
Bergman K, Sabol P, Miskus D (1988) Experimental indices for monitoring global drought conditions. In: Proceedings of the 13th annual climate diagnostics workshop, Cambridge, pp 190–197
Brown JF, Wardlow BD, Tadesse T et al (2008) The vegetation drought response index (VegDRI): a new integrated approach for monitoring drought stress in vegetation. GIScience Remote Sens 45:16–46. https://doi.org/10.2747/1548-1603.45.1.16
Byun H-R, Wilhite DA (1999) Objective quantification of drought severity and duration. J Clim 12:2747–2756. https://doi.org/10.1175/1520-0442(1999)0122.0.CO;2
Dalezios NR, Blanta A, Spyropoulos NV (2012) Assessment of remotely sensed drought features in vulnerable agriculture. Nat Hazards Earth Syst Sci 12:3139–3150. https://doi.org/10.5194/nhess-12-3139-2012
Dalezios NR, Gobin A, Tarquis Alfonso AM, Eslamian S (2017) Agricultural drought indices: combining crop, climate, and soil factors. In: Eslamian S, Eslamian FA (eds) Handbook of drought and water scarcity: principles of drought and water scarcity. CRC Press, Boca Raton, pp 73–89
De Martonne E (1925) Traite de geographie physique: ouvrage couronne par l’Academie des sciences, Prix Binoux, et par la Societe de geographie de Paris. A. Colin
Deng M, Di L (2006) Utilization of latest geospatial web service technologies for remote sensing education through GeoBrain system. In: 2006 IEEE international symposium on geoscience and remote sensing. IEEE, Denver, pp 2013–2016
Deng M, Di L, Han W, Yagci AL, Peng C, Heo G (2013) Web-service-based monitoring and analysis of global agricultural drought. Photogramm Eng Remote Sens 79(10):929–943


Di L (2004) GeoBrain-A web services based geospatial knowledge building system. In: Proceedings of NASA earth science technology conference 2004. NASA Earth Science Technology Office, Palo Alto, pp 1–8
Di L (2005) Customizable virtual geospatial products at web/grid service environment. In: Proceedings of 2005 IEEE international geoscience and remote sensing symposium, Seoul, South Korea, 25–29 July 2005, pp 4215–4218
Di L, Chen A, Bai Y, Wei Y (2006) Implementation of geospatial product virtualization in grid environment. In: Proceedings of the sixth annual NASA earth science technology conference – ESTC2006, College Park, MD, USA, 27–29 June 2006
Fensholt R, Sandholt I (2003) Derivation of a shortwave infrared water stress index from MODIS near- and shortwave infrared data in a semiarid environment. Remote Sens Environ 87:111–121. https://doi.org/10.1016/j.rse.2003.07.002
Gommes HD, Mariani L, Challinor A et al (2010) Chapter 6 Agrometeorological forecasting. In: Guide to agricultural meteorological practices. World Meteorological Organization, Geneva
Gruhier C, de Rosnay P, Hasenauer S et al (2010) Soil moisture active and passive microwave products: intercomparison and evaluation over a Sahelian site. Hydrol Earth Syst Sci 14:141–156. https://doi.org/10.5194/hess-14-141-2010
Hazaymeh K, Hassan QK (2016) Remote sensing of agricultural drought monitoring: a state of art review. AIMS Environ Sci 3:604–630. https://doi.org/10.3934/environsci.2016.4.604
Heim RR (2002) A review of twentieth-century drought indices used in the United States. Bull Am Meteorol Soc 83:1149–1166. https://doi.org/10.1175/1520-0477-83.8.1149
Keetch JJ, Byram GM (1968) A drought index for forest fire control. US Department of Agriculture, Forest Service, Southeastern Forest Experiment Station, Asheville
Keshavarz M, Vazifedoust M, Alizadeh A (2011) Development of soil wetness deficit index (SWDI) using MODIS satellite data. Iranian Journal of Irrigation & Drainage 4(3):465–477
Kogan FN (1990) Remote sensing of weather impacts on vegetation in non-homogeneous areas. Int J Remote Sens 11:1405–1419. https://doi.org/10.1080/01431169008955102
Kogan FN (1995) Application of vegetation index and brightness temperature for drought detection. Adv Space Res 15:91–100. https://doi.org/10.1016/0273-1177(95)00079-T
Kogan FN (1997) Global drought watch from space. Bull Am Meteorol Soc 78:621–636. https://doi.org/10.1175/1520-0477(1997)0782.0.CO;2
Kogan F, Sullivan J (1993) Development of global drought-watch system using NOAA/AVHRR data. Adv Space Res 13:219–222. https://doi.org/10.1016/0273-1177(93)90548-P
Kukunuri ANJ, Murugan D, Singh D (2020) Variance based fusion of VCI and TCI for efficient classification of agriculture drought using MODIS data. Geocarto Int:1–22. https://doi.org/10.1080/10106049.2020.1837256
Macelloni G, Paloscia S, Pampaloni P et al (2003) Microwave radiometric measurements of soil moisture in Italy. Hydrol Earth Syst Sci 7:937–948
Maracchi G (2000) Agricultural drought—a practical approach to definition, assessment and mitigation strategies. In: Vogt JV, Somma F (eds) Drought and drought mitigation in Europe. Springer, Netherlands, Dordrecht, pp 63–75
McKee TB, Doesken NJ, Kleist J, Others (1993) The relationship of drought frequency and duration to time scales. In: Proceedings of the 8th conference on applied climatology, California, pp 179–183
Narasimhan B, Srinivasan R (2005) Development and evaluation of Soil Moisture Deficit Index (SMDI) and Evapotranspiration Deficit Index (ETDI) for agricultural drought monitoring. Agric For Meteorol 133:69–88. https://doi.org/10.1016/j.agrformet.2005.07.012
Owrangi MA, Adamowski J, Rahnemaei M et al (2011) Drought monitoring methodology based on AVHRR images and SPOT vegetation maps. J Water Resour Prot 03:325–334. https://doi.org/10.4236/jwarp.2011.35041
Palmer WC (1965) Meteorological drought. US Department of Commerce, Weather Bureau, Washington, DC


Palmer WC (1968) Keeping track of crop moisture conditions, nationwide: the new crop moisture index. Weatherwise 21:156–161. https://doi.org/10.1080/00431672.1968.9932814
Rhee J, Im J, Carbone GJ (2010) Monitoring agricultural drought for arid and humid regions using multi-sensor remote sensing data. Remote Sens Environ 114:2875–2887. https://doi.org/10.1016/j.rse.2010.07.005
Sepulcre-Canto G, Horion S, Singleton A et al (2012) Development of a combined drought indicator to detect agricultural drought in Europe. Nat Hazards Earth Syst Sci 12:3519–3531. https://doi.org/10.5194/nhess-12-3519-2012
Sur C, Park S-Y, Kim T-W, Lee J-H (2019) Remote sensing-based agricultural drought monitoring using hydrometeorological variables. KSCE J Civ Eng 23:5244–5256. https://doi.org/10.1007/s12205-019-2242-0
Mannocchi F, Todisco F, Vergini L (2004) Agricultural drought: indices, definition and analysis. In: The Basis of Civilization—Water Science? Proceedings of the UNESCO/IAHS/IWIIA Symposium, Rome, Italy, December 2003, IAHS Publication 286. IAHS, Wallingford, pp 246–254. Available online: https://iahs.info/uploads/dms/12839.246-254-286-Mannocchi.pdf (accessed on 29 June 2023)
Vicente-Serrano SM, Beguería S, López-Moreno JI (2010) A multiscalar drought index sensitive to global warming: the standardized precipitation evapotranspiration index. J Clim 23:1696–1718. https://doi.org/10.1175/2009JCLI2909.1
Wells N, Goddard S, Hayes MJ (2004) A self-calibrating Palmer Drought Severity Index. J Clim 17:2335–2351. https://doi.org/10.1175/1520-0442(2004)0172.0.CO;2
Wilhelmi OV, Wilhite DA (2002) Assessing vulnerability to agricultural drought: a Nebraska case study. Nat Hazards 25:37–58
Woli P, Jones JW, Ingram KT, Fraisse CW (2012) Agricultural reference index for drought (ARID). Agron J 104:287–300. https://doi.org/10.2134/agronj2011.0286
Wu H, Hayes MJ, Wilhite DA, Svoboda MD (2005) The effect of the length of record on the standardized precipitation index calculation. Int J Climatol 25:505–520. https://doi.org/10.1002/joc.1142
Yue P, Gong J, Di L, He L (2011) Automatic geospatial metadata generation for earth science virtual data products. GeoInformatica 16:1–29. https://doi.org/10.1007/s10707-011-0123-x
Zhang A, Jia G (2013) Monitoring meteorological drought in semiarid regions using multi-sensor microwave remote sensing data. Remote Sens Environ 134:12–23. https://doi.org/10.1016/j.rse.2013.02.023
Zhang X, Chen N, Li J et al (2017) Multi-sensor integrated framework and index for agricultural drought monitoring. Remote Sens Environ 188:141–163. https://doi.org/10.1016/j.rse.2016.10.045

Chapter 16

Examples of Remote Sensing Applications of Big Data Analytics—Land Cover Time Series Creation

Abstract This chapter demonstrates an experiment in constructing a long time series of land cover maps using a temporal segment modeling method. The long time series involves many different types of data and a large volume of data. Mining temporal patterns from all the accumulated Earth Observations makes it possible to build long time series of land cover maps by retrospective modeling or forward prediction along the temporal dimension.

Keywords Time series · Land cover · Mapping · Earth Observation · Google Earth Engine · Temporal segment · Dynamic time warping

Land cover classification is one of the most common applications of remote sensing for monitoring ecosystems at large scale. Long time series of land cover classification data are required for monitoring and studying changes in the environment, climate, urban areas, land resources, water, and ecosystems over time. This study demonstrates the possibility of producing a long time series land cover dataset using remote sensing big data.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_16

16.1 Remote Sensing Big Data for Land Cover Classification

Land cover is the observed biophysical (e.g., vegetation and crops) and physical (e.g., bare rock and bare soil) cover on the Earth’s surface (Di Gregorio and Jansen 2001). One of the first well-known land cover classification systems for use with remote sensing data is the system by Anderson (1976). The International Geosphere-Biosphere Programme (IGBP) land cover classification system has 17 major categories that are used in moderate resolution land cover mapping (Friedl et al. 2010). The Food and Agriculture Organization (FAO) Land Cover Classification System (LCCS) provides a portable classification scheme with hierarchical levels of detail and different classifiers (Di Gregorio and Jansen 2001; Di Gregorio 2005). The first level has eight major land cover types: (1) cultivated and managed terrestrial areas, (2) natural and seminatural terrestrial vegetation, (3) cultivated aquatic or regularly flooded areas, (4) natural and seminatural aquatic or regularly flooded vegetation, (5) artificial surfaces and associated areas, (6) bare areas, (7) artificial waterbodies, snow and ice, and (8) natural waterbodies, snow and ice (Di Gregorio 2005).

Satellite remote sensing has been applied to land cover mapping since its inception. Open access to Earth Observations has further encouraged continuous land cover mapping at large scale. Medium spatial resolution sensors have often been used in global land cover mapping. Table 16.1 lists major sensors and their large-scale land cover products. The Advanced Very High Resolution Radiometer (AVHRR) is one of the earliest sources of open access satellite observations used to map land cover at global scale. The global land cover product IGBP-DISCover was one of the results produced under the IGBP with its land cover classification scheme (Loveland et al. 2000). Around the turn of the twenty-first century, several land cover map products were produced at either large country scale or global scale using high spatial resolution Landsat TM data, including the National Land Cover Dataset (NLCD) of the United States, the Earth Observation for Sustainable Development of Forests (EOSD) land cover map of Canada (Wood et al. 2002), the continuously updated CORINE (Coordination of information on the environment) land cover of the European Union (Büttner et al. 2004), and the National Carbon Accounting System Land Cover Change Project (NCAS-LCCP) land cover map of Australia (Caccetta et al. 2007). More recent global land cover products produced from Landsat data include GlobeLand30, produced by the National Geomatics Center of China (NGCC) (Chen and Chen 2018), and the Global Land Cover with Fine Classification System product (GLC_FCS30) (Zhang et al. 2021).
The Global Land Cover 2000 Project (GLC2000) land cover map was produced using the SPOT VEGETATION sensor onboard the SPOT 4 satellite (Bartholomé and Belward 2005). The Moderate Resolution Imaging Spectroradiometer (MODIS) sensors onboard Terra and Aqua have been used to produce a global land cover product at moderate resolution (500 m spatial resolution) (Friedl et al. 2002). GlobCover, the successor to the 1-km GLC-2000 (Global Land Cover 2000), has produced global land cover products at 300 m spatial resolution from the Medium Resolution Imaging Spectrometer (MERIS) of the ESA Envisat mission (Arino et al. 2007). Recently, high-resolution Sentinel-2 data have been used to produce land cover maps at higher spatial resolution, such as the ESA Climate Change Initiative Land Cover Sentinel-2 (ESA-CCI-LC20) (Mousivand and Arsanjani 2019) at 20 m, the Sentinel-2 Global Land Cover (S2GLC) (Kukawska et al. 2017) at 10 m, and the Finer Resolution Observation and Monitoring of Global Land Cover (FROM-GLC10) (Gong et al. 2019) at 10 m.

Table 16.1 Satellite sensors and representative land cover products
Sensor | Product | References
Thematic Mapper (TM), Enhanced Thematic Mapper Plus (ETM+), and Operational Land Imager (OLI) | NLCD 2001/2006/2011 in the US; Australian NCAS-LCCP; EOSD 2000 in Canada; CORINE Land Cover 2000 in Europe; GlobeLand30; GLC_FCS30 | Wood et al. (2002), Büttner et al. (2004), Fry et al. (2008), Lehmann et al. (2013), Gong et al. (2013), Chen et al. (2015), Wickham et al. (2017), Chen and Chen (2018), Zhang et al. (2021)
AVHRR | IGBP-DISCover | Loveland and Belward (1997), Hansen et al. (2000), and Loveland et al. (2000)
SPOT VEGETATION | GLC-2000 | Bartholomé and Belward (2005)
MODIS | MODIS Collection 4 and 5 Land Cover product | Friedl et al. (2002)
MERIS | GlobCover | Defourny et al. (2006, 2008)
Sentinel-2 | Copernicus Global Land Cover Collection 2, ESA-S2-LC20, FROM-GLC10 | Cover (2017), Gong et al. (2019), and Buchhorn et al. (2019)

The availability of remote sensing data with global coverage makes it possible to build long time series of land cover products. Landsat data at 30 m spatial resolution have been used in land cover mapping since the 1980s. Historical data can be processed retrospectively to produce a time series of land cover datasets going back to the early 1980s. Producing such a long time series has to cope with the big data nature of remote sensing data, and the volume can be quite large. As an example application of remote sensing big data to producing a time series of land cover datasets, this study chose a relatively small country, Nepal, and focused on one compatible data source, Landsat. A search for Landsat (4, 5, 7, and 8) scenes covering Nepal returns a staggering 12,713 scenes. The raw file size of a scene is around 500 MB for Landsat 4 and 5 TM, 785 MB for Landsat 7 ETM+, and 1.61 GB for Landsat 8 OLI/TIRS. The total data volume is more than 10 TB. In this study, the Google Earth Engine (GEE) (Gorelick et al. 2017; Amani et al. 2020) is used to handle this large volume of remote sensing data by processing it in tiles. An overview of GEE and its technology stack is given in Sect. 10.7.1. The Global LANd Cover mapping and Estimation (GLANCE) Grids (Tarrio et al. 2019) are used to partition the processing areas.
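The data volume quoted above can be checked with simple bounding arithmetic. The text does not give the number of scenes per sensor, so the sketch below only brackets the total by assuming every scene has the size of one sensor type. It also assumes decimal units (1 GB = 1000 MB, 1 TB = 1,000,000 MB), which is an assumption rather than something stated in the chapter.

```python
SCENES = 12_713  # Landsat 4, 5, 7, and 8 scenes found for Nepal

# Approximate raw size per scene in megabytes, as quoted in the text
# (1.61 GB is taken as 1610 MB under the decimal-unit assumption).
SIZE_MB = {
    "TM (Landsat 4/5)": 500,
    "ETM+ (Landsat 7)": 785,
    "OLI/TIRS (Landsat 8)": 1610,
}


def total_tb(scenes, mb_per_scene):
    """Total volume in decimal terabytes (1 TB = 1,000,000 MB)."""
    return scenes * mb_per_scene / 1_000_000


for sensor, mb in SIZE_MB.items():
    print(f"all scenes sized like {sensor}: {total_tb(SCENES, mb):.2f} TB")
```

Under these assumptions the total lies between about 6.36 TB (all scenes TM-sized) and about 20.47 TB (all scenes OLI/TIRS-sized), so a mixed archive exceeding 10 TB is consistent with the quoted figures.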

16.2 Land Cover Classification Methodology

Methods for constructing long time series of land cover datasets from remote sensing include recurrent neural networks that model seasonal patterns and order relationships (Wang et al. 2019), features of temporal segmentation and trajectories (Liu and Cai 2012), spatiotemporal change modeling (Boucher et al. 2006), variations of Dynamic Time Warping (DTW) classifiers (Viana et al. 2019; Yan et al. 2019), classification based on Breaks For Additive Seasonal and Trend (BFAST) (Verbesselt et al. 2010; Xu et al. 2020), Landsat-based detection of Trends in Disturbance and Recovery (LandTrendr) (Kennedy et al. 2010), Continuous Change Detection and Classification (CCDC) (Zhu and Woodcock 2014), Detecting Breakpoints and Estimating Segments in Trend (DBEST) (Jamali et al. 2015), Sub-annual Change Detection (SCD) (Cai and Liu 2015), Continuous monitoring of Land Disturbance (COLD) (Zhu et al. 2020), and the Noise Insensitive Trajectory Algorithm (pyNITA) (Alonzo et al. 2021).

This study chose algorithms based on CCDC (Zhu et al. 2016) to classify land cover and build up the time series of land cover classification since 1980 for Nepal. The Google Earth Engine implementation of CCDC is used to compute the features (Zhu and Woodcock 2014; Arévalo et al. 2020). Figure 16.1 shows the workflow for building the long time series of land cover classification in Nepal.

Fig. 16.1 Time series land cover classification based on CCDC

The remote sensing data search starts in the Google Earth Engine Catalog with a search for Landsat products, which identifies the collection identifiers for Landsat 4, 5, 7, and 8. The partition into computing tiles is based on the GLANCE grids (Bauer-Marschallinger et al. 2014). Fifteen tiles cover Nepal completely: H19V44, H20V44, H21V44, H19V45, H20V45, H21V45, H22V45, H20V46, H21V46, H22V46, H23V46, H24V46, H22V47, H23V47, and H24V47 in the Asia land grids (BU-GLANCE-Project 2022).

The CCDC calculation uses surface reflectance bands from Landsat data. The bands used for spectral change detection and modeling are green (0.5–0.6 µm), red (0.6–0.7 µm), near infrared (0.8–0.9 µm), shortwave infrared 1 (1.5–1.7 µm), and shortwave infrared 2 (2.1–2.3 µm). The bands used for cloud and cloud-shadow masking are green (0.5–0.6 µm) and shortwave infrared 2 (2.1–2.3 µm). The other parameter settings are 4 as the number of consecutive observations exceeding the threshold required to flag a change, 0.995 as the chi-square probability threshold, 1.33 as the number of years after which a new model fit is calculated, fraction of a year as the date format, 0.002 as the lambda value for lasso regression fitting, and 20,000 as the maximum number of iterations for lasso regression fitting.

The training dataset is built from multiple years of land cover data. Figure 16.2 shows the sampling strategy for training data collection. The Copernicus Global Land Cover classification is used as the land classification scheme. Copernicus Global Land Cover Layers data were used to collect training samples between 2015 and 2019 (Buchhorn et al. 2020). The classifier used in the study is random forest (ee.Classifier.smileRandomForest). The ancillary data for training and classification include elevation (Farr et al. 2007; Takaku et al. 2014), rainfall (Hijmans et al. 2005), aspect (Farr et al. 2007; Takaku et al. 2014), temperature (Hijmans et al. 2005), and slope (Farr et al. 2007; Takaku et al. 2014).

Fig. 16.2 Training data sampling and collection
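The CCDC parameter settings listed in this section can be gathered into one configuration object. The sketch below is an assumption-laden illustration, not the study's script. The dictionary keys follow the argument names of `ee.Algorithms.TemporalSegmentation.Ccdc` as I understand that API, and should be checked against the Earth Engine documentation. The band names are hypothetical labels for renamed Landsat surface reflectance bands, and the block deliberately avoids importing `ee` so that it runs without Earth Engine credentials.

```python
# Hypothetical band labels; the chapter gives wavelength ranges, not names.
BREAKPOINT_BANDS = ["GREEN", "RED", "NIR", "SWIR1", "SWIR2"]
TMASK_BANDS = ["GREEN", "SWIR2"]  # cloud and cloud-shadow masking bands


def ccdc_parameters():
    """Return the CCDC settings reported in the text as keyword arguments."""
    return {
        "breakpointBands": BREAKPOINT_BANDS,
        "tmaskBands": TMASK_BANDS,
        "minObservations": 4,           # consecutive observations to flag change
        "chiSquareProbability": 0.995,  # chi-square test threshold
        "minNumOfYearsScaler": 1.33,    # years before a new model fit
        "dateFormat": 1,                # assumed code for fractional years
        "lambda": 0.002,                # lasso regression penalty
        "maxIterations": 20000,         # lasso regression iteration cap
    }


params = ccdc_parameters()
print(params)

# With the Earth Engine Python client installed and authenticated, and with
# `collection` an ee.ImageCollection of masked surface reflectance images,
# the call would look roughly like this (untested assumption):
#   segments = ee.Algorithms.TemporalSegmentation.Ccdc(
#       collection=collection, **params)
```

Keeping the settings in one place makes it easier to apply identical parameters to each of the 15 tiles.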

16.3 Results and Discussions

The CCDC algorithm relies on temporal segments to model all the bands. Figure 16.3 shows the modeled shortwave-infrared band for selected pixels. The pixel in the selected urban area underwent a change around 1993, which is captured by the modeled shortwave-infrared segments. Seasonal changes in forest areas and croplands are noticeable. Permanent glacier remains unchanged during the modeled years of 1980–2022. The pixel on the river bank shows several breaks, which may represent changes in the river course during the modeled years.

Fig. 16.3 Modeled temporal segments of selected points (using shortwave infrared 0.7–0.9)

The classification scheme is compatible with that of the Copernicus Global Land Cover Layers (Buchhorn et al. 2020). The codes and names are as follows: (20) shrubs; (30) herbaceous vegetation; (40) cultivated and managed vegetation/agriculture; (50) urban/built-up; (60) bare/sparse vegetation; (70) snow and ice; (80) permanent water bodies; (90) herbaceous wetland; (100) moss and lichen; (111) closed forest, evergreen needle leaf; (112) closed forest, evergreen broad leaf; (113) closed forest, deciduous needle leaf; (114) closed forest, deciduous broad leaf; (115) closed forest, mixed; (116) closed forest, not matching any of the other definitions; (121) open forest, evergreen needle leaf; (122) open forest, evergreen broad leaf; (123) open forest, deciduous needle leaf; (124) open forest, deciduous broad leaf; (125) open forest, mixed; (126) open forest, not matching any of the other definitions; and (200) oceans, seas. The codes are used in all the legends shown in this chapter, including those of Figs. 16.2 and 16.4.

The land cover map can be retrieved for any year from 1980 up to the current year, 2022. Figure 16.4 shows the time series of land cover maps in steps of 5 years. The city of Kathmandu shows noticeable expansion over the years.

Fig. 16.4 Time series of land cover maps (vicinity of Kathmandu, Nepal, from 2000 to the present, in 5-year steps)

The computational demand of modeling long time series is still quite high. With the Google Earth Engine, building the change segments takes around 3 h per tile. Completing the 15 tiles that cover Nepal therefore takes at least 45 h of continuous processing if a single instance is used to compute and process the long time series.
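The 45 h figure follows directly from 15 tiles at roughly 3 h each on a single instance. The sketch below reproduces that arithmetic and, as a hypothetical extension not discussed in the text, estimates the wall-clock time if tiles were processed by several independent instances in parallel. Whether such concurrency is available or permitted on Google Earth Engine is not stated in the chapter.

```python
import math

TILES = 15            # GLANCE tiles covering Nepal
HOURS_PER_TILE = 3.0  # approximate time to build change segments per tile


def wall_clock_hours(tiles, hours_per_tile, instances=1):
    """Estimate elapsed hours when tiles are split across equal instances.

    Assumes tiles are independent and each takes the same time, so the
    work proceeds in ceil(tiles / instances) sequential batches.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    batches = math.ceil(tiles / instances)
    return batches * hours_per_tile


for n in (1, 4, 5, 15):
    print(n, "instance(s):", wall_clock_hours(TILES, HOURS_PER_TILE, n), "h")
```

With one instance the estimate is 45 h, matching the text. Under the stated assumptions, five instances would need three batches (9 h) and fifteen instances a single batch (3 h).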

References
Alonzo M, Van Den Hoek J, Murillo-Sandoval PJ et al (2021) Mapping and quantifying land cover dynamics using dense remote sensing time series with the user-friendly pyNITA software. Environ Model Softw 145:105179. https://doi.org/10.1016/j.envsoft.2021.105179
Amani M, Ghorbanian A, Ahmadi SA et al (2020) Google Earth Engine Cloud Computing Platform for remote sensing big data applications: a comprehensive review. IEEE J Sel Top Appl Earth Obs Remote Sens 13:5326–5350. https://doi.org/10.1109/JSTARS.2020.3021052
Anderson JR (1976) A land use and land cover classification system for use with remote sensor data. US Government Printing Office, Washington, DC
Arévalo P, Bullock EL, Woodcock CE, Olofsson P (2020) A suite of tools for continuous land change monitoring in Google Earth Engine. Front Clim 2:576740. https://doi.org/10.3389/fclim.2020.576740
Arino O, Gross D, Ranera F et al (2007) GlobCover: ESA service for global land cover from MERIS. In: 2007 IEEE international geoscience and remote sensing symposium. IEEE, Barcelona, pp 2412–2415
Bartholomé E, Belward AS (2005) GLC2000: a new approach to global land cover mapping from Earth observation data. Int J Remote Sens 26:1959–1977. https://doi.org/10.1080/01431160412331291297


Bauer-Marschallinger B, Sabel D, Wagner W (2014) Optimisation of global grids for high-resolution remote sensing data. Comput Geosci 72:84–93
Boucher A, Seto KC, Journel AG (2006) A novel method for mapping land cover changes: incorporating time and space with geostatistics. IEEE Trans Geosci Remote Sens 44:3427–3435. https://doi.org/10.1109/TGRS.2006.879113
Buchhorn M, Smets B, Bertels L et al (2019) Copernicus global land service: land cover 100m: collection 2: epoch 2015: globe
Buchhorn M, Smets B, Bertels L et al (2020) Copernicus global land service: land cover 100m: collection 3: epoch 2015: globe. Version V3.0.1. Data set
BU-GLANCE-Project (2022) Global LANd Cover mapping and Estimation (GLANCE) Grids
Büttner G, Feranec J, Jaffrain G et al (2004) The CORINE land cover 2000 project. EARSeL EProceedings 3:331–346
Caccetta P, Furby S, O’Connell J et al (2007) Continental monitoring: 34 years of land cover change using Landsat imagery. In: 32nd International symposium on remote sensing of environment. Citeseer, pp 25–29
Cai S, Liu D (2015) Detecting change dates from dense satellite time series using a sub-annual change detection algorithm. Remote Sens 7:8705–8727. https://doi.org/10.3390/rs70708705
Chen J, Chen J (2018) GlobeLand30: operational global land cover mapping and big-data analysis. Sci China Earth Sci 61:1533–1534. https://doi.org/10.1007/s11430-018-9255-3
Chen J, Chen J, Liao A et al (2015) Global land cover mapping at 30m resolution: a POK-based operational approach. ISPRS J Photogramm Remote Sens 103:7–27. https://doi.org/10.1016/j.isprsjprs.2014.09.002
Cover CL (2017) S2 prototype land cover 20 m map of Africa. ESA
Defourny P, Vancutsem C, Bicheron P et al (2006) GLOBCOVER: a 300 m global land cover product for 2005 using Envisat MERIS time series. In: Proceedings of ISPRS Commission VII mid-term symposium: remote sensing: from pixels to processes, Enschede (NL). Citeseer, pp 8–11
Defourny P, Vancutsem C, Pekel J-F et al (2008) Towards a 300 m global land cover product—the globcover initiative. In: Proceedings of second workshop of the EARSeL Special Interest Group on Land Use and Land Cover
Di Gregorio A (2005) Land cover classification system: classification concepts and user manual: LCCS. Food & Agriculture Organization of the United Nations, Rome
Di Gregorio A, Jansen LJM (eds) (2001) Land cover classification system (LCCS): classification concepts and user manual; for software version 1.0, Repr. FAO, Rome
Farr TG, Rosen PA, Caro E et al (2007) The shuttle radar topography mission. Rev Geophys 45:RG2004. https://doi.org/10.1029/2005RG000183
Friedl MA, McIver DK, Hodges JC et al (2002) Global land cover mapping from MODIS: algorithms and early results. Remote Sens Environ 83:287–302
Friedl MA, Strahler AH, Hodges J (2010) ISLSCP II MODIS (Collection 4) IGBP Land Cover, 2000–2001. 3.564251 MB. https://doi.org/10.3334/ORNLDAAC/968
Fry J, Coan M, Homer CG et al (2008) Completion of the National Land Cover Database (NLCD) 1992–2001 land cover change retrofit product. US Geol Surv Open-File Rep 1379:18
Gong P, Wang J, Yu L et al (2013) Finer resolution observation and monitoring of global land cover: first mapping results with Landsat TM and ETM+ data. Int J Remote Sens 34:2607–2654. https://doi.org/10.1080/01431161.2012.748992
Gong P, Liu H, Zhang M et al (2019) Stable classification with limited sample: transferring a 30-m resolution sample set collected in 2015 to mapping 10-m resolution global land cover in 2017. Sci Bull 64:370–373. https://doi.org/10.1016/j.scib.2019.03.002
Gorelick N, Hancher M, Dixon M et al (2017) Google Earth Engine: planetary-scale geospatial analysis for everyone. Remote Sens Environ 202:18–27
Hansen MC, Defries RS, Townshend JRG, Sohlberg R (2000) Global land cover classification at 1 km spatial resolution using a classification tree approach. Int J Remote Sens 21:1331–1364. https://doi.org/10.1080/014311600210209


Hijmans RJ, Cameron SE, Parra JL et al (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978. https://doi.org/10.1002/joc.1276
Jamali S, Jönsson P, Eklundh L et al (2015) Detecting changes in vegetation trends using time series segmentation. Remote Sens Environ 156:182–195. https://doi.org/10.1016/j.rse.2014.09.010
Kennedy RE, Yang Z, Cohen WB (2010) Detecting trends in forest disturbance and recovery using yearly Landsat time series: 1. LandTrendr—Temporal segmentation algorithms. Remote Sens Environ 114:2897–2910. https://doi.org/10.1016/j.rse.2010.07.008
Kukawska E, Lewiński S, Krupiński M et al (2017) Multitemporal Sentinel-2 data-remarks and observations. In: 2017 9th international workshop on the analysis of multitemporal remote sensing images (MultiTemp). IEEE, Piscataway, pp 1–4
Lehmann EA, Wallace JF, Caccetta PA et al (2013) Forest cover trends from time series Landsat data for the Australian continent. Int J Appl Earth Obs Geoinformation 21:453–462. https://doi.org/10.1016/j.jag.2012.06.005
Liu D, Cai S (2012) A spatial-temporal modeling approach to reconstructing land-cover change trajectories from multi-temporal satellite imagery. Ann Assoc Am Geogr 102:1329–1347. https://doi.org/10.1080/00045608.2011.596357
Loveland TR, Belward AS (1997) The International Geosphere Biosphere Programme Data and Information System global land cover data set (DISCover). Acta Astronaut 41:681–689. https://doi.org/10.1016/S0094-5765(98)00050-2
Loveland TR, Reed BC, Brown JF et al (2000) Development of a global land cover characteristics database and IGBP DISCover from 1 km AVHRR data. Int J Remote Sens 21:1303–1330. https://doi.org/10.1080/014311600210191
Mousivand A, Arsanjani JJ (2019) Insights on the historical and emerging global land cover changes: the case of ESA-CCI-LC datasets. Appl Geogr 106:82–92. https://doi.org/10.1016/j.apgeog.2019.03.010
Takaku J, Tadono T, Tsutsui K (2014) Generation of high resolution global DSM from ALOS PRISM. ISPRS Ann Photogramm Remote Sens Spat Inf Sci 2:243
Tarrio K, Friedl MA, Woodcock CE et al (2019) Global Land Cover mapping and Estimation (GLanCE): a multitemporal Landsat-based data record of 21st century global land cover, land use and land cover change. In: AGU fall meeting abstracts. p GC21D-1317
Verbesselt J, Hyndman R, Newnham G, Culvenor D (2010) Detecting trend and seasonal changes in satellite image time series. Remote Sens Environ 114:106–115. https://doi.org/10.1016/j.rse.2009.08.014
Viana CM, Girão I, Rocha J (2019) Long-term satellite image time-series for land use/land cover change detection using refined open source data in a rural region. Remote Sens 11:1104. https://doi.org/10.3390/rs11091104
Wang H, Zhao X, Zhang X et al (2019) Long time series land cover classification in China from 1982 to 2015 based on Bi-LSTM deep learning. Remote Sens 11:1639. https://doi.org/10.3390/rs11141639
Wickham J, Stehman SV, Gass L et al (2017) Thematic accuracy assessment of the 2011 National Land Cover Database (NLCD). Remote Sens Environ 191:328–341. https://doi.org/10.1016/j.rse.2016.12.026
Wood JE, Gillis MD, Goodenough DG et al (2002) Earth Observation for Sustainable Development of Forests (EOSD): project overview. In: IEEE international geoscience and remote sensing symposium. IEEE, Toronto, pp 1299–1302
Xu Y, Yu L, Peng D et al (2020) Annual 30-m land use/land cover maps of China for 1980–2015 from the integration of AVHRR, MODIS and Landsat data using the BFAST algorithm. Sci China Earth Sci 63:1390–1407. https://doi.org/10.1007/s11430-019-9606-4
Yan J, Wang L, Song W et al (2019) A time-series classification approach based on change detection for rapid land cover mapping. ISPRS J Photogramm Remote Sens 158:249–262. https://doi.org/10.1016/j.isprsjprs.2019.10.003


Zhang X, Liu L, Chen X et al (2021) GLC_FCS30: global land-cover product with fine classification system at 30 m using time-series Landsat imagery. Earth Syst Sci Data 13:2753–2776. https://doi.org/10.5194/essd-13-2753-2021
Zhu Z, Woodcock CE (2014) Continuous change detection and classification of land cover using all available Landsat data. Remote Sens Environ 144:152–171. https://doi.org/10.1016/j.rse.2014.01.011
Zhu Z, Fu Y, Woodcock CE et al (2016) Including land cover change in analysis of greenness trends using all available Landsat 5, 7, and 8 images: a case study from Guangzhou, China (2000–2014). Remote Sens Environ 185:243–257. https://doi.org/10.1016/j.rse.2016.03.036
Zhu Z, Zhang J, Yang Z et al (2020) Continuous monitoring of land disturbance based on Landsat time series. Remote Sens Environ 238:111116. https://doi.org/10.1016/j.rse.2019.03.009

Chapter 17

Geospatial Big Data Initiatives in the World

Abstract This chapter reviews major big data initiatives that have a geospatial component of remote sensing and Earth Observations. These include initiatives in selected countries and international organizations. The review highlights that geospatial standards play an important role in these initiatives by supporting the interoperation of data, metadata, and services.

Keywords Big data initiative · Big Earth Data Initiative · EarthCube · Big Earth Data Science Engineering · Destination Earth · Australian Geoscience Data Cube · Research Data Alliance · United Nations Global Pulse

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_17

17.1 US Federal Government Big Data Initiative

One of the biggest big data initiatives in the United States is the National Big Data Research and Development Initiative, announced on March 29, 2012 by President Obama’s Office of Science and Technology Policy (Kalil 2012; Weiss and Zgorski 2012). The Initiative allocated an additional $200 million for big data research and development from six participating agencies: the National Science Foundation (NSF), Department of Defense (DOD), National Institutes of Health (NIH), Department of Energy (DOE), US Geological Survey (USGS), and the Defense Advanced Research Projects Agency (DARPA) (Weiss and Zgorski 2012). The Initiative aims at (1) advancing the technologies for collecting, storing, preserving, managing, analyzing, and sharing big data; (2) accelerating discovery in science and engineering, strengthening national security, and transforming teaching and learning by harnessing big data technologies; and (3) expanding the workforce for big data research and development (Weiss and Zgorski 2012).

Among the projects and activities funded during the first wave of the Initiative, EarthCube at NSF is a cyberinfrastructure that lets geoscientists access, analyze, and share information about our planet (Allison 2012; Jacobs 2012; Allison et al. 2013). The USGS funded research projects in its Big Data for Earth System Science initiative through the John Wesley Powell Center for Analysis and Synthesis, which focused on improving understanding of species responses to climate change, earthquake recurrence rates, and ecological indicators (Bertot and Choi 2013). Other Big Earth Data projects were the Next Generation Networking program, NASA’s Advanced Information System Technology (AIST) programs, NASA’s Earth Science Data and Information System (ESDIS) project, the Global Earth Observation System of Systems (GEOSS), and the Earth System Grid Federation (Executive Office of the President 2012).

In May 2016, the Federal Big Data Research and Development Strategic Plan was released (Marzullo 2016). More agencies have become involved in advancing the Initiative: the plan is the collaborative result of 15 participating agencies. It identifies seven strategies for advancing big data research and development (Big Data Senior Steering Group 2016):

(1) creating next-generation capabilities by leveraging emerging big data foundations, techniques, and technologies (scaling up for the size, speed, and complexity of data and developing new methods for future big data capabilities);
(2) supporting research and development to explore and understand the trustworthiness of data and resulting knowledge, to make better decisions, enable breakthrough discoveries, and take confident action (understanding the trustworthiness of data and validity of knowledge, and designing tools to support data-driven decision making);
(3) building and enhancing research cyberinfrastructure to support agency missions through innovative big data technologies (strengthening the data infrastructure, empowering the advanced scientific cyberinfrastructure, and addressing community needs);
(4) increasing the value of data through policies that promote sharing and management of data (developing best practices for metadata to increase data transparency and utility and providing efficient, sustainable, and secure access to data assets);
(5) understanding big data collection, sharing, and use with regard to privacy, security, and ethics (providing privacy protection, enabling big data security, and understanding ethics for data governance);
(6) improving the national landscape for big data education and training to meet the increasing demand for deep analytical talent and analytical capacity for the broader workforce (growing data scientists, expanding the community of data-empowered domain experts, broadening the data-capable workforce, and improving the data literacy of the public); and
(7) creating and enhancing connections among the national big data innovation ecosystem (encouraging cross-sector, cross-agency big data collaboration, and promoting policies and frameworks for faster responses and measurable impacts).

17.1.1 Big Earth Data Initiative

The Big Earth Data Initiative (BEDI) is a cross-agency collaboration specifically for Big Earth Data under the National Big Data Research and Development Initiative (Holdren 2013; NSTC 2014). BEDI is coordinated through the US Group on Earth Observations (USGEO) Subcommittee of the National Science and Technology Council (NSTC) Committee on Environment, Natural Resources, and Sustainability (CENRS) (Tilmes et al. 2014). More than 12 agencies directly participated in the initiative. The overall goal is to “improve the interoperability of civil Earth observing data across U.S. federal agencies, systems, and platforms by improving the usability, discoverability, and accessibility of these data and systems along with improving data management practices” (EarthData 2016).

Under this initiative, the first National Plan for Civil Earth Observations was released in July 2014 (NSTC 2014). The plan identified five priorities (NSTC 2014): (1) continuity of sustained observations for public services; (2) continuity of sustained observations for Earth system research; (3) continued investment in experimental observations; (4) planned improvements to sustained observation networks and surveys for all observation categories; and (5) continuity of, and improvements to, a rigorous assessment and prioritization process. Eight supporting actions were also identified: (1) coordinate and integrate observations; (2) improve data access, management, and interoperability; (3) increase efficiency and cost savings; (4) improve observation density and sampling; (5) maintain and support infrastructure; (6) explore commercial solutions; (7) maintain and strengthen international collaboration; and (8) engage in stakeholder-driven data innovation. The observations include those from airborne, terrestrial, and marine platforms. The 2019 National Plan for Civil Earth Observations set three goals (NSTC 2019): (1) support and balance the portfolio of Earth Observations, (2) engage the Earth Observation enterprise, and (3) improve the impact of Earth Observations.
The participating agency NASA provides sensor development and launching services for continuity of observations for public services, including the Geostationary Operational Environmental Satellite (GOES) series and the Ozone Mapping and Profiler Suite (OMPS) Nadir sensor for air quality and ozone, the Landsat series for land imaging, the Visible Infrared Imaging Radiometer Suite (VIIRS) instrument for ocean-color observations, the altimeter on the Jason series for ocean surface and water-level monitoring, the Deep Space Climate Observatory (DSCOVR) for space weather monitoring, and the Polar-orbiting Operational Environmental Satellite (POES) series for weather, hazards, and seasonal/interannual climate variability. NASA also conducts Earth system research through the Stratospheric Aerosol and Gas Experiment (SAGE III) for aerosols and trace gases, the Orbiting Carbon Observatory (OCO) for atmospheric carbon dioxide, the Gravity Recovery and Climate Experiment (GRACE) for groundwater, the Clouds and the Earth’s Radiant Energy System (CERES) sensor and the Total Solar Irradiance Sensor (TSIS-1) for net energy balance, and other experimental satellite observations for studying the integrated Earth system. Other participating agencies, such as NOAA and USGS, mainly participate in operating these sensors for public services and Earth system research.

The Common Framework for Earth Observation Data was released in 2016 (NSTC 2016). The Framework provides a set of standards and recommendations for data search and discovery services, data access services, data documentation, and compatible formats and vocabularies (NSTC 2016). The Earth Science Data and Information System (ESDIS) Project of NASA manages the data and services of Earth observing systems (EOS). The Framework was implemented through the following improvements (EarthData 2016):


• Data usability: The data are organized around the Societal Benefit Areas (SBAs). More data are searchable and viewable through Earthdata Search and Global Imagery Browse Services (GIBS). Time-critical data service is improved through systems designed to deliver data rapidly. Compatibility with GIS software (open source or commercial) is also improved.
• Data discoverability: Metadata are improved in archives to make datasets more discoverable by popular search engines (e.g., Google or Bing). The Common Metadata Repository (CMR) unifies metadata standards and management by merging the existing capabilities and metadata of NASA Earth science metadata systems. Standards are adopted to support wide interoperability. Digital object identifiers (DOIs) are used to register datasets, which enhances the discoverability of data objects.
• Data accessibility: Data access end points are provided in search results. Open data formats are supported. Data access software is open-sourced for unrestricted data access. Open geospatial Web service standards are adopted to support data interoperation.
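As one illustration of the discoverability improvements listed above, the following minimal sketch composes a keyword query against the CMR collection search. The endpoint URL, the `keyword` and `page_size` parameters, and the `feed`/`entry` response layout are assumptions based on the publicly documented CMR search interface and are not taken from this chapter; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the public CMR collection search, JSON flavour.
CMR_COLLECTIONS = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_search_url(keyword, page_size=5):
    """Compose a keyword query URL for collection-level metadata."""
    query = urllib.parse.urlencode({"keyword": keyword, "page_size": page_size})
    return f"{CMR_COLLECTIONS}?{query}"


def search_collections(keyword, page_size=5, timeout=30):
    """Fetch matching collections as (short_name, title) pairs.

    Needs network access; the feed -> entry layout is an assumption.
    """
    url = build_search_url(keyword, page_size)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [(e.get("short_name"), e.get("title")) for e in payload["feed"]["entry"]]


# Only the URL is built here, so the sketch runs without a network connection.
url = build_search_url("sea surface temperature")
print(url)
```

Calling `search_collections("sea surface temperature")` would issue the request itself; it is left out of the runnable part because it depends on network access and on the service being available.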

17.1.2 NSF EarthCube

EarthCube is an interdisciplinary program that aims at developing a community-guided cyberinfrastructure to improve access, sharing, visualization, and analysis of data and related resources in geosciences (Black et al. 2014). It is an initiative and partnership between the Directorate for Geosciences and the Office of Advanced Cyberinfrastructure in the National Science Foundation that links geosciences, cyberinfrastructure, computer science, and associated communities to foster a better understanding of the planet through improved data collection, analytics, visualization, sharing, and archiving (Black et al. 2014). EarthCube has been developing building blocks, or products, that provide tools, data, services, and apps for geoscience researchers. Integrated solutions are also funded to aggregate the results of the developed building blocks and components. Interoperation and community standards play important roles in linking the building blocks into a larger, improved collective result (Kirkpatrick et al. 2021; Valentine et al. 2021).

Example building blocks are CyberConnector (Di et al. 2017; Sun and Di 2018; Di and Sun 2021) and CyberWay (Di et al. 2019; Gaigalas et al. 2019), which extensively use open geospatial standards in connecting data services to bridge the data flow from sensors to Earth science models. CyberConnector supports the automatic preparation and feeding of both historic and near-real-time Earth Observation data and on-demand derived products into Earth science models. It establishes a live link from sensors to Earth science models. CyberWay is an integrative activity that utilizes several building blocks to achieve the complete process of data discovery, access, analytics, production, and dissemination for interdisciplinary collaboration.


17.2 Big Data Initiative in China

The State Council of China announced the Big Data Initiative in 2015 (The State Council of China 2015). Geospatial big data is one of the foundational systems to be further developed. A series of big data developments and experiments have been carried out in different agencies at different administrative levels (Cheng 2014; Hajirahimova and Aliyeva 2017). The Big Earth Data Science Engineering (CASEarth) (Guo 2017) is a Big Earth Data initiative led by the Chinese Academy of Sciences (CAS) that covers a Big Earth Data collaborative research mechanism, Big Earth Data infrastructure, and a decision support system. The initiative will invest in (1) small satellites, (2) a Big Earth Data cloud platform, (3) international collaboration on the Digital Belt and Road, (4) Big Earth Data decision support, (5) integrated data for biodiversity and ecological security, (6) 3-D data, (7) the study of polar areas and extremely high mountains, and (8) a Digital Earth Science Platform to host data and analytical tools.

17.3 Big Data Initiatives in Europe

The Horizon 2020 program of the European Union (EU) has funded many big data projects that had a significant impact on big data research and development in Europe (Veugelers et al. 2015; Hajirahimova and Aliyeva 2017). The Copernicus Initiative is a Big Earth Data initiative that consists of Earth Observation satellite series and spatial information systems that support different societal benefit areas, including atmosphere, marine, land, climate, emergency, and security (Bereta et al. 2018, 2019; Hristov and Alexandrov 2019; Koubarakis et al. 2019). It consists of space components (satellites, ground data stations), in situ measurements (ground-based observations and sensors), and services (especially data Web services for data distribution and processing) (Smets et al. 2013; Jutz and Milagro-Pérez 2020).

Destination Earth (DestinE) is a European Commission initiative to develop a very high resolution digital model of the Earth for the Common European Green Deal data space for climate change and its impact (European Commission, Joint Research Centre 2020a; Nativi et al. 2021). The big data system is designed to handle large amounts of data distributed in the cloud computing environment. The system supports different special application topics (European Commission, Joint Research Centre 2021). Thirty use cases have been identified with different priorities for thematic digital twin implementations (European Commission, Joint Research Centre 2020b).


17.4 Big Data Initiatives in Australia

The Australian government released a big data strategy for public service in 2013 (Australian Government 2013; Hajirahimova and Aliyeva 2017). The strategy aims at improving public services, delivering new services, and providing better policy advice on big data. The Better Practice Guide for Big Data provides guidelines for the adoption of big data analytics by Australian government agencies (Australian Government 2015). The Australian Geoscience Data Cube (AGDC) is a research project that evolved into a technology and platform effectively addressing the big data challenges of Earth Observation data (Lewis et al. 2017). The project was originally funded by Geoscience Australia in 2011. The core components of the AGDC include data preparation, management software, and a high-performance computing environment (Lewis et al. 2017). The Open Data Cube evolved from the AGDC with the endorsement of the Committee on Earth Observation Satellites (CEOS) (Killough 2018).

17.5 Other Big Data Initiatives

The geospatial open standard organization, OGC, has a Big Data Domain Working Group (DWG) that provides a forum to work on geospatial big data interoperability, access, and analytics (OGC 2022). The DWG focuses on standards for Big Earth Data. It explores the characteristics of Big Earth Data and fits geospatial standards to them. The group promotes best practices in adopting geospatial standards to support the management and processing of Big Earth Data (Baumann 2014, 2018; Bermudez 2017; Percivall and Simonis 2020).

The United Nations Global Pulse is a development initiative that leverages big data for global and sustainable development (Hajirahimova and Aliyeva 2017; Hidalgo-Sanchis 2021). Heterogeneous data have been studied to monitor and analyze developments in public health, climate and resilience, and food and agriculture (UN Global Pulse 2015).

The Research Data Alliance (RDA) has a Big Data Interest Group that aims at providing recommendations for the scientific community to select an appropriate solution for a particular domain application (Kuo et al. 2016; Research Data Alliance 2022). The big data characteristics of research data have been recognized since the conception of the Alliance in 2012 by a joint venture of the National Science Foundation (NSF) of the United States and the European Commission (Demchenko and Stoy 2021).


References

Allison M (2012) A governance roadmap and framework for EarthCube. In: AGU fall meeting abstracts. p IN21A-1467
Allison ML, Cutcher-Gershenfeld J, Patten K et al (2013) EarthCube End-user principal investigator workshop: executive summary, Tucson, AZ, 14–15 Aug 2013
Australian Government (2013) The Australian public service big data strategy: improved understanding through enhanced data-analytics capability
Australian Government (2015) Australian public service better practice guide for big data. Commonwealth of Australia, Sydney, Australia
Baumann P (2014) Big geo data: standards and best practices. In: 2014 Fifth international conference on computing for geospatial research and application. IEEE, Washington, DC, pp 127–128
Baumann P (2018) Datacube standards and their contribution to analysis-ready data. In: IGARSS 2018–2018 IEEE international geoscience and remote sensing symposium. IEEE, Valencia, pp 2051–2053
Bereta K, Caumont H, Goor E et al (2018) From Copernicus big data to big information and big knowledge: a demo from the Copernicus App Lab Project. In: Proceedings of the 27th ACM international conference on information and knowledge management – CIKM ’18. ACM Press, Torino, pp 1911–1914
Bereta K, Caumont H, Daniels U et al (2019) The Copernicus App Lab project: easy access to Copernicus data. Advances in Database Technology – EDBT
Bermudez L (2017) New frontiers on open standards for geo-spatial science. Geo-Spat Inf Sci 20:126–133. https://doi.org/10.1080/10095020.2017.1325613
Bertot JC, Choi H (2013) Big data and e-government: issues, policies, and recommendations. In: Proceedings of the 14th annual international conference on digital government research. ACM, Quebec, pp 1–10
Big Data Senior Steering Group (2016) The federal big data research and development strategic plan. Executive Office of the President, National Science and Technology Council, Washington, DC, USA
Black R, Katz A, Kretschmann K (2014) EARTHCUBE: a community-driven organization for geoscience cyberinfrastructure. Limnol Oceanogr Bull 23:80–83. https://doi.org/10.1002/lob.201423480a
Cheng J (2014) Big data for development in China. UNDP China Working Paper
Demchenko Y, Stoy L (2021) Research data management and data stewardship competences in university curriculum. In: 2021 IEEE global engineering education conference (EDUCON). IEEE, Piscataway, pp 1717–1726
Di L, Sun Z (2021) Big data and its applications in agro-geoinformatics. In: Agro-geoinformatics. Springer, Cham, pp 143–162
Di L, Sun Z, Zhang C (2017) Facilitating the easy use of earth observation data in earth system models through CyberConnector. In: AGU fall meeting abstracts. p IN21D-0072
Di L, Sun Z, Yu E et al (2019) CyberWay–an integrated geospatial cyberinfrastructure to facilitate innovative way of inter-and multi-disciplinary geoscience studies. In: Geophysical research abstracts
EarthData (2016) NASA EOSDIS role in the big earth data initiative. In: earthdata.nasa.gov. https://earthdata.nasa.gov/learn/articles/tools-and-technology-articles/eosdis-role-in-bedi. Accessed 20 Mar 2022
European Commission, Joint Research Centre (2020a) Destination earth: survey on “Digital Twins” technologies and activities, in the Green Deal area. Publications Office of the European Union, Luxembourg
European Commission, Joint Research Centre (2020b) Destination earth: use cases analysis. Publications Office of the European Union, Luxembourg
European Commission, Joint Research Centre (2021) Destination earth: ecosystem architecture description. Publications Office of the European Union, Luxembourg


Executive Office of the President (2012) Big data across the federal government. The White House, Washington, DC, USA
Gaigalas J, Di L, Sun Z (2019) Advanced cyberinfrastructure to enable search of big climate datasets in THREDDS. ISPRS Int J Geo-Inf 8:494. https://doi.org/10.3390/ijgi8110494
Guo H (2017) Big earth data: a new frontier in earth and information sciences. Big Earth Data 1:4–20. https://doi.org/10.1080/20964471.2017.1403062
Hajirahimova M, Aliyeva A (2017) Big data initiatives of developed countries. Probl Inf Soc 08:10–19. https://doi.org/10.25045/jpis.v08.i1.02
Hidalgo-Sanchis P (2021) UN global pulse: a UN innovation initiative with a multiplier effect. In: Data science for social good. Springer, New York, pp 29–40
Holdren JP (2013) National strategy for civil earth observations. National Science and Technology Council, Washington, DC, USA
Hristov AA, Alexandrov CI (2019) Overview of ESA “COPERNICUS” Program. In: Сборник с доклади от Симпозиум “Стратегически алианс-фактор за развитието на икономическите коридори”. pp 78–85
Jacobs C (2012) A vision for, and progress towards EarthCube. In: EGU general assembly conference abstracts. p 1227
Jutz S, Milagro-Pérez M (2020) Copernicus: the European Earth Observation programme. Rev Teledetec
Kalil T (2012) Big data is a big deal. In: obamawhitehouse.archives.gov. https://obamawhitehouse.archives.gov/blog/2012/03/29/big-data-big-deal. Accessed 20 Mar 2022
Killough B (2018) Overview of the open data cube initiative. In: IGARSS 2018–2018 IEEE international geoscience and remote sensing symposium. IEEE, Piscataway, pp 8629–8632
Kirkpatrick C, Daniels MD, McHenry K et al (2021) EarthCube: a community-driven cyberinfrastructure for the geosciences–a look ahead. In: AGU fall meeting 2021. AGU
Koubarakis M, Bereta K, Bilidas D et al (2019) From Copernicus big data to extreme earth analytics. Open Proc:690–693
Kuo K-S, Baumann P, Evans B, Riedel M (2016) Earth science big data activities at research data alliance. In: EGU general assembly conference abstracts. p EPSC2016-17309
Lewis A, Oliver S, Lymburner L et al (2017) The Australian geoscience data cube—foundations and lessons learned. Remote Sens Environ 202:276–292. https://doi.org/10.1016/j.rse.2017.03.015
Marzullo K (2016) Administration issues strategic plan for big data research and development. In: obamawhitehouse.archives.gov. https://obamawhitehouse.archives.gov/blog/2016/05/23/administration-issues-strategic-plan-big-data-research-and-development. Accessed 20 Mar 2022
Nativi S, Mazzetti P, Craglia M (2021) Digital ecosystems for developing digital twins of the earth: the destination earth case. Remote Sens 13:2119. https://doi.org/10.3390/rs13112119
NSTC (2014) National plan for civil earth observations. National Science and Technology Council, Executive Office of the President, Washington, DC
NSTC (2016) Common framework for earth-observation data. U.S. Government, Washington, DC
NSTC (2019) 2019 National plan for civil earth observations. U.S. Government, Washington, DC
OGC (2022) Big Data DWG. In: www.ogc.org. https://www.ogc.org/projects/groups/bigdatadwg. Accessed 22 Mar 2022
Percivall G, Simonis I (2020) Exploitation of earth observations: OGC contributions to GRSS earth science informatics. In: IGARSS 2020–2020 IEEE international geoscience and remote sensing symposium. IEEE, Waikoloa, pp 601–604
Research Data Alliance (2022) RDA big data interest group. In: www.rd-alliance.org. https://www.rd-alliance.org/groups/big-data-analytics-ig.html. Accessed 20 Mar 2022
Smets B, Lacaze R, Freitas S et al (2013) Operating the Copernicus global land service. In: ESA living planet symposium, Edinburgh, 66p
Sun Z, Di L (2018) CyberConnector COVALI: enabling inter-comparison and validation of earth science models. In: AGU fall meeting abstracts. p IN23B-0780


The State Council of China (2015) The state council on printing and distributing to promote the development of big data notice of action plan
Tilmes C, deLaBeaujardiere J, Bristol S (2014) Developing the big earth data initiative (BEDI) common framework (BCF). In: In the U.S. group on earth observations (USGEO). Data Management Working Group
UN Global Pulse (2015) Big data for development in action – UN Global Pulse Project Series
Valentine D, Zaslavsky I, Richard S et al (2021) EarthCube Data Discovery Studio: a gateway into geoscience data discovery and exploration with Jupyter notebooks. Concurr Comput Pract Exp 33:e6086
Veugelers R, Cincera M, Frietsch R et al (2015) The impact of horizon 2020 on innovation in Europe. Intereconomics 50:4–30. https://doi.org/10.1007/s10272-015-0521-7
Weiss R, Zgorski L-J (2012) Obama administration unveils “big data” initiative: announces $200 million in new R&D investments. Office of Science and Technology Policy, Executive Office of the President

Chapter 18

Challenges and Opportunities in the Remote Sensing Big Data

Abstract This chapter discusses challenges and opportunities in remote sensing big data. Three challenges are discussed: data complexity, data quality, and infrastructure change. The growth of remote sensing big data also introduces several new opportunities. The changes discussed are single scale to multiscale, on-premise servers to distributed services, data-focused Sensor Web to modeled-simulation Digital Twins, “meteorology” snapshots to “climate” series, isolated case to intertwined “teleconnections,” and one-level system to hierarchical knowledge graph.

Keywords Challenge · Opportunity · Remote sensing big data · Trend · Data complexity · Data quality · Infrastructure · Sensor Web · Digital Twin

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5_18

18.1 Challenges

Several major challenges of remote sensing big data remain as remote sensing technologies, computing technologies, and application development advance. The frontiers of these challenges are summarized as follows.

1. Increasing volume and complexity

The growth of information outpaces the growth of the capability to handle it (Ma et al. 2015). The computing complexity of storing, managing, and processing remote sensing big data is a big challenge resulting from its volume growth (Liu 2015; Huang et al. 2018). Indexing and searching relevant datasets from the massive data distributed across the world need to deal not only with the volume but also with the high dimensionality (Ma et al. 2015; Zhu et al. 2021a, b). Data retrieval needs to deal with multidimensional data cubes (Sabri et al. 2021). Content-based retrieval can be especially difficult. Open research questions include efficient approaches to annotating large-scale remote sensing data samples, robust training of deep networks for data retrieval under weak supervision (i.e., imbalanced labels or noisy labels), and highly intelligent reasoning abilities to recognize objects and their topological relationships in remote sensing data (Li et al. 2021b). Research in data partition, data loading, and task scheduling is of high priority for parallel processing of remote sensing big data (Ma et al. 2015; Zhu et al. 2021a).

High dimensionality is one of the characteristic complexities of remote sensing big data and leads to computing complexity (Liu 2015). Dimension reduction has been shown to improve the performance of machine learning for image classification (Chao et al. 2019; Kang et al. 2021). Dimension reduction technologies include feature extraction and feature selection. Feature extraction methods include principal component analysis, canonical correlation analysis, nonnegative matrix factorization, and manifold learning-based algorithms (e.g., isomap, locally linear embedding, and Laplacian eigenmap) (Ray et al. 2021). Feature selection methods are structured as filter, embedded, or wrapper approaches. Optimization approaches for feature selection include genetic algorithms, particle swarm optimization, ReliefF, minimum redundancy maximum relevance, recursive feature elimination, and simultaneous perturbation stochastic approximation (Ray et al. 2021). However, the effectiveness of dimension reduction technologies varies depending on the data, machine learning algorithms, and application, and certain unwanted artifacts may be introduced or enhanced (Lorenz et al. 2021; Ahmad et al. 2021). Selecting a suitable dimension reduction technology for the given data, application, and methodology is a challenge in applications of remote sensing big data.

One of the open research questions for remote sensing big data analytics is how to build specialized, well-fitted analytic systems that extract and retrieve relevant information for specific applications. Conventional machine learning and analytic algorithms need to be adapted to work with remote sensing big data (Cravero and Sepúlveda 2021). Some applications require fast ingestion and processing capabilities, which means adapting analytic processes to work with real-time data streams (Chen et al. 2019).
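The feature-extraction route to dimension reduction mentioned above can be illustrated with a minimal sketch of principal component analysis. It uses only the Python standard library and finds the first principal component by power iteration; the synthetic six-band "image", the band weights, and all function names are hypothetical choices made for illustration, and an operational workflow would more likely rely on an optimized numerical library.

```python
import math
import random


def mean_center(pixels):
    """Subtract the per-band mean from every pixel spectrum."""
    n_bands = len(pixels[0])
    means = [sum(p[b] for p in pixels) / len(pixels) for b in range(n_bands)]
    return [[p[b] - means[b] for b in range(n_bands)] for p in pixels]


def covariance(centered):
    """Band-by-band sample covariance matrix of mean-centered pixels."""
    n, n_bands = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / (n - 1)
             for j in range(n_bands)] for i in range(n_bands)]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov via power iteration (unit length)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each pixel spectrum to one score along the component."""
    return [sum(x * c for x, c in zip(p, component)) for p in centered]


# Synthetic data: 500 pixels x 6 bands, all bands driven by one hidden factor.
random.seed(0)
weights = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
pixels = []
for _ in range(500):
    t = random.gauss(0.0, 1.0)
    pixels.append([w * t + random.gauss(0.0, 0.05) for w in weights])

centered = mean_center(pixels)
cov = covariance(centered)
pc1 = first_component(cov)
scores = project(centered, pc1)

total_variance = sum(cov[i][i] for i in range(len(cov)))
pc1_variance = sum(s * s for s in scores) / (len(scores) - 1)
explained = pc1_variance / total_variance
print(f"6 bands -> 1 feature, explained variance ratio: {explained:.3f}")
```

Because the synthetic bands are strongly correlated, one component retains almost all of the variance. Real multispectral or hyperspectral data usually need several components, and the caveat in the text still applies: whether the reduced features help depends on the data, the learning algorithm, and the application.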
Some applications need to integrate multisource data (Fan et al. 2017; Wang et al. 2018a). Limited by remote sensing big data infrastructures, some applications have to prioritize the data they use in analysis. Google Earth Engine, for example, is a popular remote sensing big data analysis platform that limits training data size, the available machine learning algorithms, and the size of external data uploads (Amani et al. 2020).

2. Inconsistency and incompleteness

Integration of remotely sensed data from multiple sources often deals with data of different quality. Incompleteness and incorrectness of data introduce noise into image labels, which affects the performance of machine learning and image classifiers (Hua et al. 2022). Sensor noise and data missing due to natural phenomena, such as clouds, fog, haze, and mist, are other major challenges in correctly training machine learning algorithms (Zaytar and El Amrani 2021; Farhangkhah et al. 2021). Data inconsistency and imbalance also introduce errors during the training of classifiers or models using remote sensing big data (Suryono et al. 2022; Gutiérrez et al. 2022). Imbalanced data means that a classifier may see more negative samples than positive samples, or too few samples for certain important classes. The result can be misleading, with incorrect classifications. Resampling
approaches and data synthesis are common ways to balance training data. Resampling approaches include oversampling, which increases the number of samples for classes with insufficient data, and undersampling, which reduces the number of samples for classes with an overwhelming number of samples. Oversampling methods include random oversampling (Ghazikhani et al. 2012), synthetic minority oversampling technique (SMOTE) (Chawla et al. 2002), borderline SMOTE (Han et al. 2005), borderline oversampling (Pérez-Ortiz et al. 2013), adaptive synthetic sampling (ADASYN) (He et al. 2008), etc. Undersampling methods include random undersampling (Liu and Tsoumakas 2020), condensed nearest neighbor rule (Hart 1968), near-miss undersampling (Goswami and Roy 2021), Tomek link undersampling (Devi et al. 2017), edited nearest neighbors rule (Wagner 1973), one-sided selection (Kubat et al. 1997), neighborhood cleaning rule (Agustianto and Destarianto 2019), etc. It is also common to combine oversampling of minority classes with undersampling of majority classes to mitigate data imbalance in machine learning applications (Galar et al. 2012). Hybrid methods include SMOTEBoost (Chawla et al. 2003), RUSBoost (Seiffert et al. 2010), RUSSMOTE, USOS (Abedin et al. 2022), the combination of SMOTE and Tomek links (bin Alias et al. 2021), SMOTE-WENN (Guan et al. 2021), SMOTE-RSB* (Ramentol et al. 2012), etc. However, the applicability of these imbalance mitigation approaches to remote sensing big data requires further study and experiments with different applications (Juez-Gil et al. 2021).

Data inconsistency is also one of the major barriers preventing data from being readily accessed by different applications (Futia et al. 2017). The veracity of remote sensing big data refers to data quality. Spatiotemporal data cubes are one of the prevalent approaches to enabling the distribution of analysis-ready data (Zhao and Yue 2019).
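The resampling approaches listed above can be illustrated with a short Python sketch. It shows random undersampling of a majority class, random oversampling of a minority class, and a simplified SMOTE-style interpolation between a minority sample and one of its nearest minority neighbors. The function names (resample_balanced, smote_like), the class names, and the sample values are hypothetical. The interpolation is a reduced version of the idea in Chawla et al. (2002), not a faithful reimplementation.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(samples, labels, target, seed=0):
    """Random under/oversampling so that every class ends up with `target` samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    out_x, out_y = [], []
    for label in sorted(by_class):
        group = by_class[label]
        if len(group) >= target:
            # Majority class: random undersampling (draw without replacement).
            chosen = rng.sample(group, target)
        else:
            # Minority class: random oversampling (duplicate existing samples).
            chosen = group + [rng.choice(group) for _ in range(target - len(group))]
        out_x.extend(chosen)
        out_y.extend([label] * target)
    return out_x, out_y


def smote_like(minority, n_new, k=1, seed=0):
    """Simplified SMOTE-style synthesis: interpolate between a minority sample
    and one of its k nearest minority neighbors (needs at least 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Hypothetical two-feature samples: 8 "cropland" pixels and only 2 "water" pixels.
samples = [[0.1 * i, 0.2] for i in range(8)] + [[0.9, 0.8], [0.7, 0.6]]
labels = ["cropland"] * 8 + ["water"] * 2

balanced_x, balanced_y = resample_balanced(samples, labels, target=4)
print(Counter(balanced_y))

water = [s for s, y in zip(samples, labels) if y == "water"]
new_water = smote_like(water, n_new=3)
print(new_water)
```

Random oversampling only repeats existing samples, whereas the interpolation step creates new points inside the region spanned by the minority class. Whether either choice helps for a given remote sensing task is an empirical question, as the text notes.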
Uncertainty analysis and the improvement of data consistency are important research questions for applications of remote sensing big data (Fan et al. 2017; Meng et al. 2020).

3. Changing computing infrastructure

Management and processing of remote sensing big data have been migrating to the cloud computing environment. More and more remote sensing data are becoming directly accessible in the cloud (Wang et al. 2020b). Earth on AWS hosts a growing number of Earth observation datasets. The Google Earth Engine Data Catalog manages many Earth observations and derived climate, weather, and geophysical products. The migration of computing infrastructure from a monolithic desktop computer or on-premise server to distributed computing resources in the cloud has significant impacts on the management, processing, and distribution of remote sensing big data (Wang et al. 2020b; Sabri et al. 2021).

First, the discovery and access of data in the cloud differ from those on a single-point server. Data management capabilities are provided as services in the cloud, such as Amazon S3 buckets, DynamoDB, RDS, Azure Storage services, Azure relational database services, and Google Cloud Storage (object/block). Computing capabilities are also provided as services, such as AWS Lambda, Machine Learning on AWS, and Google Machine Learning App Engine. Specialized data formats have been developed to enable optimized storage and
range subsetting, such as cloud-optimized GeoTIFF and Cloud Optimized Raster Encoding (CORE) (Iosifescu Enescu et al. 2021). Different cloud providers have their own data catalogs and search endpoints. The open questions are standard cloud-optimized storage formats and discovery mechanisms across cloud providers to support interoperation and flexible data discovery and access.

Second, the technologies that enable high-performance computing have changed with the migration of data into the cloud. Cloud computing provides Platform as a Service (PaaS) and Software as a Service (SaaS). High-performance computing clusters may be formed using different technologies, such as Spark, MapReduce, or direct cloud-optimized software services (Wang et al. 2020b). Specialized remote sensing high-performance frameworks have emerged, such as pipsCloud (Wang et al. 2018b) and GeoPySpark (Guo et al. 2022). Because data storage in the cloud is distributed, algorithms can be moved close to the data to improve performance and cost-effectiveness. The open question is how to optimize high-performance computing in the cloud for applications of remote sensing big data.

Third, privacy and security are a challenge for remote sensing cloud services (Wang et al. 2020b). The isolation of cloud services differs from that of on-premise or single servers, where physical isolation is managed by data providers or service providers. In the cloud computing environment, security is provided through a service agreement between cloud providers and cloud users. Besides, big data analytics often requires data from different sources that may be hosted by different cloud providers. Such cross-cloud access requires multiple authentication and security controls. Security and privacy management needs to be studied from the perspectives of data providers, cloud service providers, and data users for applications of remote sensing big data.
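The range subsetting that cloud-optimized formats are designed for, mentioned under the first point above, can be sketched as follows. The sketch assumes a raster stored as fixed-size square tiles in row-major order, with a per-tile index of byte offsets and byte counts, loosely modeled on the tile index of a tiled TIFF. Given a pixel window, it finds the intersecting tiles and merges contiguous tiles into byte ranges that could be placed in HTTP Range headers. The function names, tile size, offsets, and byte counts are invented for illustration, and no network request is made.

```python
def tiles_for_window(col_off, row_off, width, height, tile_size, tiles_across):
    """Return row-major indices of the tiles that intersect a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        row * tiles_across + col
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


def byte_ranges(tile_ids, offsets, byte_counts):
    """Merge byte-contiguous tiles into (start, end) ranges, end inclusive."""
    spans = sorted((offsets[t], byte_counts[t]) for t in tile_ids)
    merged = []
    for start, count in spans:
        end = start + count - 1
        if merged and merged[-1][1] + 1 == start:
            # This tile starts right after the previous range: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Hypothetical layout: a 1024 x 1024 image in 256 x 256 tiles (4 x 4 tiles),
# each tile compressed to 1000 bytes, with the first tile at byte 4096.
TILE_SIZE = 256
TILES_ACROSS = 4
offsets = [4096 + 1000 * i for i in range(16)]
byte_counts = [1000] * 16

tiles = tiles_for_window(col_off=300, row_off=100, width=300, height=200,
                         tile_size=TILE_SIZE, tiles_across=TILES_ACROSS)
ranges = byte_ranges(tiles, offsets, byte_counts)
headers = [f"bytes={start}-{end}" for start, end in ranges]
print(tiles)
print(headers)
```

In this toy layout the window touches 4 of the 16 tiles, so a client would fetch two byte ranges instead of the whole file. This is the benefit that range-readable formats aim for when the data sit in object storage.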

18.2 Opportunities

Advances in remote sensing technologies enable the rapid collection of large volumes of Earth science data at high spectral, spatial, and temporal resolutions. The advancement of remote sensing big data presents many opportunities that were not possible before. The following briefly describes new trends in research and development on remote sensing big data.

1. From one scale to multiscale

Multiscale is one of the intrinsic characteristics of remote sensing big data (Liu 2015). Advances in remote sensing expand the range of spatial and temporal resolutions. More scale levels become available along the spatial and temporal dimensions, and more multiscale features become available to remote sensing big data analytics. Specialized multiscale analytic methods and technologies become possible. For example, the multiscale object detection algorithm with deep convolutional neural networks detects multi-class objects in
remote sensing images with large variability in object scale (Deng et al. 2018). The multiscale spectral-spatial cross-extraction network (MSSCEN) is a deep learning algorithm that utilizes multiscale characteristics in the classification of hyperspectral remote sensing data (Gao et al. 2022). NestNet is a multiscale convolutional neural network that uses multiscale properties in change detection with remote sensing images (Yu et al. 2021). MSResNet is a multiscale residual network that uses multiscale properties to detect water bodies in remotely sensed imagery (Dang and Li 2021). Deep learning algorithms have been used to explore multiscale properties for object detection, classification, and change detection using remote sensing data (Seydi et al. 2022; Wang et al. 2022a, b).

2. From on-premise systems to distributed services

Distributed computing provides scalable opportunities for managing, processing, and distributing remote sensing big data (Wang et al. 2020b; Benediktsson and Wu 2021; Wu et al. 2021). Moving remote sensing data to the cloud makes it possible for data providers to take more responsibility for producing analysis-ready data through flexible processing capabilities in the cloud computing environment (Sabri et al. 2021). The elasticity of computing resources (e.g., processing capabilities and storage capacities) enables the scalable deployment of data services and processing services for remote sensing big data. Open geospatial standards will play an important role in enabling the interoperation of data and services across multiple providers of cloud services and data (Yao et al. 2019). New security and privacy protection strategies are needed in the distributed computing environment (Wang et al. 2020b).

3. From Sensor Web snapshots to Digital Twins

The variety of remote sensing big data allows the modeling and simulation of the real world from different aspects.
The constellation of remote sensors serves not only as a data source in a Sensor Web but also as a simulator in a Digital Twin of Earth. Long time series of observations of the physical environment through the Sensor Web build up the comprehensive understanding and modeling of the Earth needed for Digital Twins. The concept of a Digital Twin refers to the comprehensive physical and functional description of a system. Advances in remote sensing have deployed many remote sensors at different altitudes with different capabilities. Digital Twins can be built with remote sensing big data (Yang et al. 2021; Yu and He 2022). Smart cities use remote sensing data to build Digital Twins that model and simulate different aspects of cities (Shahat et al. 2021; Deren et al. 2021). A Disaster City Digital Twin integrates remote sensing data, social sensing, and crowdsourcing to monitor and simulate disasters (Fan et al. 2021). Destination Earth (DestinE) is a very high-resolution digital model of the Earth for the Common European Green Deal data space for climate change and its impact (European Commission. Joint Research Centre. 2020a; Nativi et al. 2021). The concept of Digital Twins of the Earth was used in designing, implementing, and managing data and their analytic capabilities by binding them tightly to specific data on selected topics (European Commission. Joint Research Centre. 2021). Their survey found 30 use cases with
different priorities for thematic twin implementations (European Commission. Joint Research Centre. 2020b).

4. From “meteorology” to “climate”

The accumulation of long time series of remote sensing data since its inception in the 1970s makes it possible to study long-term trends and evaluate dynamic changes. The data have expanded from snapshots of the “world” to growing long time series of the “world.” Applications of the long time series include lake basin changes (Wang et al. 2021), dynamic changes in urban land and its fractional cover (Yin et al. 2021), cropland changes (Gumma et al. 2020), coastline changes (Ding et al. 2021), etc.

5. From monolith application to intertwined “teleconnections”

The large-scale coverage of remote sensing big data enhances the study of teleconnections among different phenomena around the world (Li et al. 2020; Jiao et al. 2021). Remote sensing data are used for validation, calibration, or direct incorporation in modeling the relationships of physical phenomena (Karpatne et al. 2019; Banerjee et al. 2020). The data-driven approach helps in identifying teleconnections between different phenomena (Cheng et al. 2020).

6. From one level to hierarchical knowledge systems

Remote sensing big data enables the extraction of hierarchical levels of knowledge and semantics. Knowledge graphs can be constructed from remote sensing big data. The semantic representation of scene categories obtained by representation learning on a remote sensing knowledge graph not only identifies objects but also recognizes the relationships among objects (Li et al. 2021a). A knowledge graph can be constructed from the conceptual attributes, interspecies relationships, and plant functions of grassland entities to guide a full understanding of grassland (Yang and Liu 2021). Remote sensing knowledge graphs can be used in heterogeneous data integration (Zárate et al. 2020; Wang et al. 2020a; Hao et al. 2021).
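To make the idea of a knowledge graph over remote sensing entities concrete, the following Python sketch stores subject-predicate-object triples and answers two kinds of query: direct relationships among objects, and hierarchical membership obtained by following an "is_a" relation transitively. The class KnowledgeGraph, the predicate names, and the entities are hypothetical and are not drawn from the cited knowledge graphs.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._out = defaultdict(set)  # subject -> {(predicate, object), ...}

    def add(self, subject, predicate, obj):
        self._out[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """All objects linked to `subject` by `predicate`, in sorted order."""
        edges = self._out.get(subject, ())
        return sorted(o for (p, o) in edges if p == predicate)

    def ancestors(self, entity, predicate="is_a"):
        """Follow a hierarchical predicate transitively, nearest level first."""
        seen, stack = [], [entity]
        while stack:
            node = stack.pop()
            for parent in self.objects(node, predicate):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


# Hypothetical scene-level knowledge: categories, parts, and spatial relations.
kg = KnowledgeGraph()
kg.add("airport", "is_a", "transport_facility")
kg.add("airport", "contains", "runway")
kg.add("airport", "contains", "terminal")
kg.add("runway", "is_a", "impervious_surface")
kg.add("impervious_surface", "is_a", "built_up")
kg.add("runway", "adjacent_to", "taxiway")

print(kg.objects("airport", "contains"))  # relationships among objects
print(kg.ancestors("runway"))             # hierarchical levels of knowledge
```

The same triple structure can hold attributes from heterogeneous sources, which is one reason knowledge graphs are attractive for data integration. Production systems would normally rely on a graph database or an RDF store rather than an in-memory dictionary.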
Multilevel semantics can be constructed from remote sensing big data using machine learning approaches (Lu et al. 2021; Jiang et al. 2022). Crop semantic segmentation and classification from remote sensing data can be automated with particle swarm optimization and a deep residual network (Jadhav and Singh 2018). Multilevel semantics can be extracted and constructed from remote sensing data at different levels: spectral scene categorization at the visual-primitive level, object classification at the object level, spatial relationships at the partial semantic level, and spatiotemporal relationships at the full semantic level (Ghazouani et al. 2019). For scenes, deep convolutional neural networks can be used to extract different semantic features, such as multiscale deep semantic representations and multilevel deep semantic representations (Hu et al. 2020). These features can be applied to improve the classification of image scenes.

References

Abedin MZ, Guotai C, Hajek P, Zhang T (2022) Combining weighted SMOTE with ensemble learning for the class-imbalanced prediction of small business credit risk. Complex Intell Syst. https://doi.org/10.1007/s40747-021-00614-4
Agustianto K, Destarianto P (2019) Imbalance data handling using neighborhood cleaning rule (NCL) sampling method for precision student modeling. In: 2019 International conference on computer science, information technology, and electrical engineering (ICOMITEE). IEEE, Jember, pp 86–89
Ahmad M, Shabbir S, Raza RA et al (2021) Artifacts of different dimension reduction methods on hybrid CNN feature hierarchy for Hyperspectral Image Classification. Optik 246:167757. https://doi.org/10.1016/j.ijleo.2021.167757
Amani M, Ghorbanian A, Ahmadi SA et al (2020) Google earth engine cloud computing platform for remote sensing big data applications: a comprehensive review. IEEE J Sel Top Appl Earth Obs Remote Sens 13:5326–5350. https://doi.org/10.1109/JSTARS.2020.3021052
Banerjee A, Chen R, Meadows ME et al (2020) An analysis of long-term rainfall trends and variability in the Uttarakhand Himalaya using Google earth engine. Remote Sens 12:709. https://doi.org/10.3390/rs12040709
Benediktsson JA, Wu Z (2021) Distributed computing for remotely sensed data processing [scanning the section]. Proc IEEE 109:1278–1281. https://doi.org/10.1109/JPROC.2021.3094335
Bin Alias MSA, Ibrahim NB, Zin ZBM (2021) Improved sampling data workflow using Smtmk to increase the classification accuracy of imbalanced dataset. Eur J Mol Clin Med 8:2021
Chao G, Luo Y, Ding W (2019) Recent advances in supervised dimension reduction: a survey. Mach Learn Knowl Extr 1:341–358. https://doi.org/10.3390/make1010020
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 107–119
Chen Z, Luo J, Chen N et al (2019) RFim: a real-time inundation extent model for large floodplains based on remote sensing big data and water level observations. Remote Sens 11:1585. https://doi.org/10.3390/rs11131585
Cheng Q, Oberhänsli R, Zhao M (2020) A new international initiative for facilitating data-driven earth science transformation. Geol Soc Lond Spec Publ 499:225–240. https://doi.org/10.1144/SP499-2019-158
Cravero A, Sepúlveda S (2021) Use and adaptations of machine learning in big data—applications in real cases in agriculture. Electronics 10:552. https://doi.org/10.3390/electronics10050552
Dang B, Li Y (2021) MSResNet: multiscale residual network via self-supervised learning for water-body detection in remote sensing imagery. Remote Sens 13:3122. https://doi.org/10.3390/rs13163122
Deng Z, Sun H, Zhou S et al (2018) Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J Photogramm Remote Sens 145:3–22. https://doi.org/10.1016/j.isprsjprs.2018.04.003
Deren L, Wenbo Y, Zhenfeng S (2021) Smart city based on digital twins. Comput Urban Sci 1:4. https://doi.org/10.1007/s43762-021-00005-y
Devi D, Biswas SK, Purkayastha B (2017) Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance. Pattern Recogn Lett 93:3–12. https://doi.org/10.1016/j.patrec.2016.10.006
Ding Y, Yang X, Jin H et al (2021) Monitoring coastline changes of the Malay Islands based on Google earth engine and dense time-series remote sensing images. Remote Sens 13:3842. https://doi.org/10.3390/rs13193842
European Commission. Joint Research Centre (2020a) Destination earth: survey on “Digital Twins” technologies and activities, in the Green Deal area. Publications Office, LU
European Commission. Joint Research Centre (2020b) Destination earth: use cases analysis. Publications Office, LU
European Commission. Joint Research Centre (2021) Destination earth: ecosystem architecture description. Publications Office, LU
Fan J, Yan J, Ma Y, Wang L (2017) Big data integration in remote sensing across a distributed metadata-based spatial infrastructure. Remote Sens 10:7. https://doi.org/10.3390/rs10010007
Fan C, Zhang C, Yahja A, Mostafavi A (2021) Disaster city digital twin: a vision for integrating artificial and human intelligence for disaster management. Int J Inf Manag 56:102049. https://doi.org/10.1016/j.ijinfomgt.2019.102049
Farhangkhah N, Samadi S, Khosravi MR, Mohseni R (2021) Overcomplete pre-learned dictionary for incomplete data SAR imaging towards pervasive aerial and satellite vision. Wirel Netw. https://doi.org/10.1007/s11276-021-02821-w
Futia G, Melandri A, Vetrò A et al (2017) Removing barriers to transparency: a case study on the use of semantic technologies to tackle procurement data inconsistency. In: Blomqvist E, Maynard D, Gangemi A et al (eds) The semantic web. Springer International Publishing, Cham, pp 623–637
Galar M, Fernandez A, Barrenechea E et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C Appl Rev 42:463–484. https://doi.org/10.1109/TSMCC.2011.2161285
Gao H, Wu H, Chen Z et al (2022) Multiscale spectral-spatial cross-extraction network for hyperspectral image classification. IET Image Process 16:755–771. https://doi.org/10.1049/ipr2.12382
Ghazikhani A, Yazdi HS, Monsefi R (2012) Class imbalance handling using wrapper-based random oversampling. In: 20th Iranian conference on electrical engineering (ICEE2012). IEEE, Tehran, pp 611–616
Ghazouani F, Farah IR, Solaiman B (2019) A multi-level semantic scene interpretation strategy for change interpretation in remote sensing imagery. IEEE Trans Geosci Remote Sens 57:8775–8795. https://doi.org/10.1109/TGRS.2019.2922908
Goswami T, Roy UB (2021) Classification accuracy comparison for imbalanced datasets with its balanced counterparts obtained by different sampling techniques. In: Kumar A, Mozar S (eds) ICCCE 2020. Springer Singapore, Singapore, pp 45–54
Guan H, Zhang Y, Xian M et al (2021) SMOTE-WENN: solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell 51:1394–1409. https://doi.org/10.1007/s10489-020-01852-8
Gumma MK, Thenkabail PS, Teluguntla PG et al (2020) Agricultural cropland extent and areas of South Asia derived using Landsat satellite 30-m time-series big-data using random forest machine learning algorithms on the Google Earth Engine cloud. GIScience Remote Sens 57:302–322. https://doi.org/10.1080/15481603.2019.1690780
Guo J, Huang C, Hou J (2022) A scalable computing resources system for remote sensing big data processing using GeoPySpark based on Spark on K8s. Remote Sens 14:521. https://doi.org/10.3390/rs14030521
Gutiérrez R, Rampérez V, Paggi H et al (2022) On the use of information fusion techniques to improve information quality: taxonomy, opportunities and challenges. Inf Fusion 78:102–137. https://doi.org/10.1016/j.inffus.2021.09.017
Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 878–887
Hao X, Ji Z, Li X et al (2021) Construction and application of a knowledge graph. Remote Sens 13:2511. https://doi.org/10.3390/rs13132511
Hart P (1968) The condensed nearest neighbor rule (corresp.). IEEE Trans Inf Theory 14:515–516
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, Hong Kong, pp 1322–1328
Hu F, Xia G-S, Yang W, Zhang L (2020) Mining deep semantic representations for scene classification of high-resolution remote sensing imagery. IEEE Trans Big Data 6:522–536. https://doi.org/10.1109/TBDATA.2019.2916880
Hua Y, Mou L, Jin P, Zhu XX (2022) MultiScene: a large-scale dataset and benchmark for multiscene recognition in single aerial images. IEEE Trans Geosci Remote Sens 60:1–13. https://doi.org/10.1109/TGRS.2021.3110314
Huang Y, Chen Z, Yu T et al (2018) Agricultural remote sensing big data: management and applications. J Integr Agric 17:1915–1931. https://doi.org/10.1016/S2095-3119(17)61859-8
Iosifescu Enescu I, de Espona L, Haas-Artho D et al (2021) Cloud optimized raster encoding (CORE): a web-native streamable format for large environmental time series. Geomatics 1:369–382. https://doi.org/10.3390/geomatics1030021
Jadhav JK, Singh RP (2018) Automatic semantic segmentation and classification of remote sensing data for agriculture. Math Models Eng 4:112–137. https://doi.org/10.21595/mme.2018.19840
Jiang B, An X, Xu S, Chen Z (2022) Intelligent image semantic segmentation: a review through deep learning techniques for remote sensing image analysis. J Indian Soc Remote Sens. https://doi.org/10.1007/s12524-022-01496-w
Jiao C, Yu G, Yi X et al (2021) The impact of teleconnections on the temporal dynamics in aboveground net primary productivity of the Mongolian Plateau grasslands. Int J Climatol 41:6541–6555. https://doi.org/10.1002/joc.7211
Juez-Gil M, Arnaiz-González Á, Rodríguez JJ, García-Osorio C (2021) Experimental evaluation of ensemble classifiers for imbalance in Big Data. Appl Soft Comput 108:107447. https://doi.org/10.1016/j.asoc.2021.107447
Kang L, Zhang X, Zhang K, Yuan B (2021) Comparative study on different dimension reduction methods in remote sensing ground object recognition. In: 2021 4th international conference on artificial intelligence and big data (ICAIBD). IEEE, Chengdu, pp 382–386
Karpatne A, Ebert-Uphoff I, Ravela S et al (2019) Machine learning for the geosciences: challenges and opportunities. IEEE Trans Knowl Data Eng 31:1544–1554. https://doi.org/10.1109/TKDE.2018.2861006
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML. Citeseer, p 179
Li H, Zhong X, Ma Z et al (2020) Climate changes and their teleconnections with ENSO over the last 55 years, 1961–2015, in floods-dominated basin, Jiangxi Province, China. Earth Space Sci 7. https://doi.org/10.1029/2019EA001047
Li Y, Kong D, Zhang Y et al (2021a) Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification. ISPRS J Photogramm Remote Sens 179:145–158. https://doi.org/10.1016/j.isprsjprs.2021.08.001
Li Y, Ma J, Zhang Y (2021b) Image retrieval from remote sensing big data: a survey. Inf Fusion 67:94–115. https://doi.org/10.1016/j.inffus.2020.10.008
Liu P (2015) A survey of remote-sensing big data. Front Environ Sci 3. https://doi.org/10.3389/fenvs.2015.00045
Liu B, Tsoumakas G (2020) Dealing with class imbalance in classifier chains via random undersampling. Knowl-Based Syst 192:105292. https://doi.org/10.1016/j.knosys.2019.105292
Lorenz S, Ghamisi P, Kirsch M et al (2021) Feature extraction for hyperspectral mineral domain mapping: a test of conventional and innovative methods. Remote Sens Environ 252:112129. https://doi.org/10.1016/j.rse.2020.112129
Lu H, Liu Q, Liu X, Zhang Y (2021) A survey of semantic construction and application of satellite remote sensing images and data. J Organ End User Comput 33:1–20. https://doi.org/10.4018/JOEUC.20211101.oa6
Ma Y, Wu H, Wang L et al (2015) Remote sensing big data computing: challenges and opportunities. Future Gener Comput Syst 51:47–60. https://doi.org/10.1016/j.future.2014.10.029
Meng T, Jing X, Yan Z, Pedrycz W (2020) A survey on machine learning for data fusion. Inf Fusion 57:115–129. https://doi.org/10.1016/j.inffus.2019.12.001
Nativi S, Mazzetti P, Craglia M (2021) Digital ecosystems for developing digital twins of the earth: the destination earth case. Remote Sens 13:2119. https://doi.org/10.3390/rs13112119
Pérez-Ortiz M, Gutiérrez PA, Hervás-Martínez C (2013) Borderline kernel based over-sampling. In: Pan J-S, Polycarpou MM, Woźniak M et al (eds) Hybrid artificial intelligent systems. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 472–481
Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33:245–265. https://doi.org/10.1007/s10115-011-0465-6
Ray P, Reddy SS, Banerjee T (2021) Various dimension reduction techniques for high dimensional data analysis: a review. Artif Intell Rev 54:3473–3515. https://doi.org/10.1007/s10462-020-09928-0
Sabri Y, Bahja F, Siham A, Maizate A (2021) Cloud computing in remote sensing: big data remote sensing knowledge discovery and information analysis. Int J Adv Comput Sci Appl 12. https://doi.org/10.14569/IJACSA.2021.01205104
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A (2010) RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum 40:185–197. https://doi.org/10.1109/TSMCA.2009.2029559
Seydi ST, Amani M, Ghorbanian A (2022) A dual attention convolutional neural network for crop classification using time-series Sentinel-2 imagery. Remote Sens 14:498. https://doi.org/10.3390/rs14030498
Shahat E, Hyun CT, Yeom C (2021) City digital twin potentials: a review and research agenda. Sustainability 13:3386. https://doi.org/10.3390/su13063386
Suryono H, Kuswanto H, Iriawan N (2022) Rice phenology classification based on random forest algorithm for data imbalance using Google Earth engine. Procedia Comput Sci 197:668–676. https://doi.org/10.1016/j.procs.2021.12.201
Wagner T (1973) Convergence of the edited nearest neighbor (Corresp.). IEEE Trans Inf Theory 19:696–697. https://doi.org/10.1109/TIT.1973.1055059
Wang H, Skau E, Krim H, Cervone G (2018a) Fusing heterogeneous data: a case for remote sensing and social media. IEEE Trans Geosci Remote Sens 56:6956–6968. https://doi.org/10.1109/TGRS.2018.2846199
Wang L, Ma Y, Yan J et al (2018b) pipsCloud: high performance cloud computing for remote sensing big data management and processing. Future Gener Comput Syst 78:353–368. https://doi.org/10.1016/j.future.2016.06.009
Wang J, Lan Y, Zhang S et al (2020a) Knowledge graph for multi-source data fusion topics research. In: 2020 international conference on high performance big data and intelligent systems (HPBD&IS). IEEE, Shenzhen, pp 1–5
Wang L, Yan J, Ma Y (2020b) Cloud computing in remote sensing. CRC Press, Boca Raton
Wang Z, Gan S, Lv J, Yang M (2021) Lake basin based on long time series remote sensing data. In: 2021 IEEE 2nd international conference on big data, artificial intelligence and internet of things engineering (ICBAIE). IEEE, Nanchang, pp 211–215
Wang D, Cao W, Zhang F et al (2022a) A review of deep learning in multiscale agricultural sensing. Remote Sens 14:559. https://doi.org/10.3390/rs14030559
Wang J, Bretz M, Dewan MAA, Delavar MA (2022b) Machine learning in modelling land-use and land cover-change (LULCC): current status, challenges and prospects. Sci Total Environ 822:153559. https://doi.org/10.1016/j.scitotenv.2022.153559
Wu Z, Sun J, Zhang Y et al (2021) Recent developments in parallel and distributed computing for remotely sensed big data processing. Proc IEEE 109:1282–1305. https://doi.org/10.1109/JPROC.2021.3087029
Yang K, Liu Y (2021) Construction of knowledge graph in the field of grassland plants based on ontology database. In: He Y, Weng C-H (eds) International conference on environmental remote sensing and big data (ERSBD 2021). SPIE, Wuhan, p 13
Yang W, Zheng Y, Li S (2021) Application status and prospect of digital twin for on-orbit spacecraft. IEEE Access 9:106489–106500. https://doi.org/10.1109/ACCESS.2021.3100683
Yao X, Li G, Xia J et al (2019) Enabling the big earth observation data via cloud computing and DGGS: opportunities and challenges. Remote Sens 12:62. https://doi.org/10.3390/rs12010062
Yin Z, Kuang W, Bao Y et al (2021) Evaluating the dynamic changes of urban land and its fractional covers in Africa from 2000–2020 using time series of remotely sensed images on the big data platform. Remote Sens 13:4288. https://doi.org/10.3390/rs13214288
Yu D, He Z (2022) Digital twin-driven intelligence disaster prevention and mitigation for infrastructure: advances, challenges, and opportunities. Nat Hazards. https://doi.org/10.1007/s11069-021-05190-x
Yu X, Fan J, Chen J et al (2021) NestNet: a multiscale convolutional neural network for remote sensing image change detection. Int J Remote Sens 42:4898–4921. https://doi.org/10.1080/01431161.2021.1906982
Zárate M, Buckle C, Mazzanti R et al (2020) Harmonizing big data with a knowledge graph: OceanGraph KG uses case. In: Rucci E, Naiouf M, Chichizola F, De Giusti L (eds) Cloud computing, big data & emerging topics. Springer International Publishing, Cham, pp 81–92
Zaytar MA, El Amrani C (2021) Satellite imagery noising with generative adversarial networks. Int J Cogn Inform Nat Intell 15:16–25. https://doi.org/10.4018/IJCINI.2021010102
Zhao J, Yue P (2019) Spatiotemporal data cube modeling for integrated analysis of multi-source sensing data. In: IGARSS 2019–2019 IEEE international geoscience and remote sensing symposium. IEEE, Yokohama, pp 4791–4794
Zhu L, Su X, Hu Y et al (2021a) A spatio-temporal local association query algorithm for multi-source remote sensing big data. Remote Sens 13:2333. https://doi.org/10.3390/rs13122333
Zhu L, Su X, Tai X (2021b) A high-dimensional indexing model for multi-source remote sensing big data. Remote Sens 13:1314. https://doi.org/10.3390/rs13071314

Index

A Active learning, 208, 220–221 Administration, 6, 95, 99 Agricultural drought, 11, 68, 249–257 Algorithm, 11, 60, 67, 68, 91, 109, 113, 156, 173, 188, 195–203, 207, 208, 211, 213–221, 229, 231, 232, 239, 240, 242–246, 255, 264, 265, 282, 284, 285 Approximate computing, 208, 210–212 Australian Geoscience Data Cube, 276 B Batch process, 178, 200 Big data, 1–11, 17, 45, 53, 55, 73, 95, 135, 155, 171, 195, 207, 229, 239, 253, 263, 271, 281 Big data analytic platform, 11, 171–189 Big data analytics, 3, 10, 11, 55, 95, 155–167, 172–174, 176, 180–183, 185, 195–203, 208, 211, 213, 218, 221, 237–247, 249–257, 261–267, 276, 284 Big data features, 4–9, 11 Big data initiative, 11, 271–276 Big data management, 10, 11, 95–104, 107–130, 135, 166 Big data management system, 11, 135–153 Big Earth Data Initiative (BEDI), 272–275 Big Earth Data Science Engineering, 275 C Catalog, 48, 60, 64, 77, 101–104, 117, 120–123, 125, 129, 136, 140–144, 148, 150, 185, 233, 255, 264, 283, 284

Challenges, 1, 2, 10, 11, 53–68, 96, 101, 163–165, 174, 197–203, 207, 213, 215–217, 239–242, 276, 281–286 Cloud computing, 5, 73, 85, 88–92, 162, 166, 173, 174, 217, 230, 255, 275, 283–285 Cluster computing, 73, 86, 91, 92 The Committee on Earth Observation Satellites (CEOS), 11, 61, 121, 125, 135, 136, 138–140, 276 Concept drifts, 200, 215, 216, 218, 220 Curation, 3, 10, 95, 99–102, 166 Cyberinfrastructure, 11, 53–68, 271, 272, 274 D Data access, 11, 60, 61, 64, 96, 102, 104, 107, 126–130, 140, 142, 150, 166, 172, 188, 189, 196, 255–257, 273, 274 Data collections, 3, 53–61, 139–140, 142, 147, 152, 171, 188, 264, 272, 274 Data complexity, 282 Data dimensionality, 198, 212, 214, 218, 240, 281, 282 Data discovery, 11, 56, 63, 64, 95, 99–104, 107, 120–125, 163, 274, 284 Data formats, 11, 54, 99, 100, 107, 119–120, 128, 130, 166, 274, 283 Data fusion, 55, 160, 237–247 Data governance, 96–100 Data granule, 11, 95, 140, 218 Data mining, 4, 11, 49, 56, 156, 181, 197, 199, 207, 208 Data models, 64, 67, 80, 127, 156, 229–230 Data provenance, 48, 202 Data quality, 3, 9, 48, 95–99, 101, 110, 113, 116–118, 165, 202, 283

© Springer Nature Switzerland AG 2023 L. Di, E. Yu, Remote Sensing Big Data, Springer Remote Sensing/Photogrammetry, https://doi.org/10.1007/978-3-031-33932-5


Data science, 3, 156
Data visualization, 55, 56, 177, 181–182
Decision-making, 2, 11, 68, 96, 156, 157, 182, 227–234, 272
Deep learning, 4, 158, 159, 208, 216–217, 220, 221, 231, 240–247, 285
Descriptive analytics, 156, 158
Destination Earth (DestinE), 275, 285
Diagnosis analytics, 155, 156
Digital Twin, 275, 285
Dissemination, 11, 95, 102–104, 274
Distributed computing, 73–76, 85–87, 90, 121, 128, 173, 174, 180, 183, 196, 213, 230, 240, 255, 283, 285
Dynamic Time Warping (DTW), 263

E
EarthCube, 271, 274
Earth Observation, 11, 38, 46, 47, 54, 56, 58, 60–62, 102–104, 125, 136, 145, 147, 150, 152, 153, 182, 184, 185, 189, 203, 237–247, 256, 262, 272–276, 283
Earth Observing System (EOS), 10, 35, 49, 54, 56, 58–60, 100, 114, 119, 136, 273
Earth Science Modeling Framework (ESMF), 65–67
Earth system (ES), 62–65, 67, 271–273
Ensemble analysis, 208, 217–218
Event-based processing, 178

F
Feature selection, 208, 212–214, 282
Federated catalog, 103
Filter-based fusion, 242, 243

G
Geospatial computing platform, 74–76
Global Earth Observation System of Systems (GEOSS), 11, 56–58, 62, 135, 145–153, 272
Google Earth Engine (GEE), 185–186, 232, 263, 264, 266, 282, 283
Granular learning, 208
Grid computing, 73, 85–88, 90, 91, 172, 173, 175

H
Hadoop, 6, 90, 162, 173–175, 177, 180, 181, 200, 201, 209, 211, 214

I
Incremental learning, 208, 214–216
Infrastructure, 4, 11, 49, 56, 60, 62, 66, 85–92, 109, 135, 136, 146, 171, 172, 185, 187, 188, 196, 200, 203, 234, 240, 253–256, 272, 273, 275, 282, 283
International Organization for Standardization (ISO), 18, 60, 64, 78, 84, 107–109, 115–121, 123–125, 128, 137, 138, 140, 141, 167

L
Land cover, 11, 185, 231, 233, 242, 261–267
Learning-based fusion, 242, 243
Life cycle, 96, 98, 99, 255

M
Machine learning, 11, 158, 160, 162, 177, 180, 181, 183, 185, 196, 198, 199, 207–221, 239, 241, 242, 282, 283, 286
Mapping, 34, 38, 60, 103, 119, 136, 137, 140, 182, 220, 241, 261–263, 273
MapReduce, 90, 177, 180, 182, 183, 196, 197, 200, 201, 209, 211, 214, 230, 231, 284
Metadata, 11, 48, 60, 61, 64, 80, 83, 84, 96, 99–104, 107–122, 124, 127, 129, 136, 137, 139, 140, 147, 148, 152, 153, 166, 175, 176, 188, 201, 202, 272, 274
Microwave remote sensing, 20–29, 47, 250, 252, 253
Modeling, 11, 49, 62, 65, 66, 98, 104, 116, 121, 155–158, 180, 199, 200, 216, 227–234, 238, 255, 263, 264, 266, 285, 286
Modularity, 198
Monitoring, 11, 28, 29, 39, 48, 49, 54, 64, 65, 68, 98, 99, 147, 152, 189, 198, 227, 240–242, 249–257, 261, 263, 273, 276

N
Not only SQL (NoSQL), 172, 174, 176, 181, 183

O
Open Geospatial Consortium (OGC), 60, 61, 64–66, 78, 81–85, 100, 103, 104, 119–129, 137–142, 147, 148, 150, 151, 177, 181, 184, 188, 234, 253, 276
OpenSearch, 61, 120, 121, 125, 137, 138, 140, 141, 145

Opportunity, 11, 81, 158, 197, 216, 281–286
Optical remote sensing, 19, 20, 29, 240, 250, 253
Organization, 9, 10, 54, 56, 81, 85, 91, 95–100, 102, 109, 110, 112, 113, 115, 122, 136, 145, 147, 166, 167, 173, 233, 261, 276

P
Parallel computing, 11, 73, 87, 173
Parallel learning, 208–210
Performance, 8, 55, 66, 85, 90, 91, 98, 99, 103, 145, 173, 187, 198, 200, 201, 209, 212, 213, 220, 231, 242–244, 282, 284
Predictions, 11, 62, 63, 66, 68, 156, 158, 159, 181, 216, 227–234, 238, 241
Predictive analytics, 49, 156, 158, 228
Predictive modeling, 160
Processing, 1–4, 18, 48, 49, 64, 65, 67, 73, 74, 83, 90–92, 98, 101, 109, 113, 117, 118, 156, 158, 160, 162, 165, 171–174, 177–180, 182–184, 186–189, 196–201, 207, 209, 211, 212, 215–217, 219, 221, 228–230, 232, 242, 253, 255, 256, 263, 266, 267, 275, 276, 281–283, 285

R
Radiometric spectrum, 19
Remote sensing, 1, 10–11, 17–39, 45, 53, 73, 95, 135, 158, 173, 218, 227, 239, 249, 261, 281
Remote sensing big data, 1, 3, 10–11, 17, 19, 45–49, 53–68, 73–92, 95–104, 108–130, 135, 158, 163–166, 183–189, 195–202, 207, 208, 211, 214, 219, 221, 227–234, 239, 240, 242, 250–253, 255, 261–263, 281–286
Research Data Alliance (RDA), 276

S
Sensor Web, 11, 53–68, 198, 230, 285
Sensor work mode, 30–35
Service-oriented architecture (SOA), 63, 66, 76–85, 87, 88, 148, 253, 255
Skewed distribution, 198
Sonar, 30, 39
Spark, 177, 230, 284
Standard, 2, 10, 11, 60, 62, 64, 66, 80, 95, 107, 136, 166, 172, 229, 253, 273, 284
Standard geospatial web services, 230, 254, 255

Stochastic algorithms, 208, 219
Storage, 1, 3–5, 10, 55, 56, 73, 75, 88, 91, 101, 114, 120, 156, 173–176, 181, 184, 187, 197, 201, 208, 211, 212, 218, 283–285
Streaming, 104, 120, 174, 177–179, 183, 184, 200, 201, 207, 208, 214, 216, 220, 230, 256
Stream process, 179, 200
Structured Query Language (SQL), 98, 121, 173, 174, 177, 179, 185
Super computing, 85

T
Temporal segment, 265
Time series, 10, 11, 101, 158, 159, 161, 180, 188, 216, 219, 227, 228, 233, 241–242, 253, 261–267, 285, 286
Transfer learning, 208, 219–220
Trends, 5, 101, 156–158, 180, 216, 221, 263, 284, 286

U
United Nations Global Pulse, 276
Unmixing-based fusion, 242

V
Value, 4, 6, 9, 45, 54, 60, 99, 111, 155, 174, 195, 202, 207, 242, 264, 272
Variety, 3–6, 8, 10, 29, 45–47, 53–55, 73, 99, 128, 155, 163, 164, 195, 201–202, 207, 212, 216, 217, 228, 240, 285
Vegetation index, 158, 230, 251, 252, 256, 257
Velocity, 3–6, 8, 25, 45, 48, 54, 73, 99, 163, 195, 198, 200–201, 207, 214, 216, 242
Veracity, 3, 6, 8–9, 45, 48–49, 54, 99, 163, 202, 207, 217, 283
Version control, 95, 99, 102
Volume, 1–6, 10, 45, 46, 48, 49, 53–55, 58, 73, 96, 99–101, 163, 174–176, 180, 182, 185, 195–200, 207–210, 213, 214, 216, 239, 242, 263, 281, 284

W
Web portal, 56, 136
Web service, 4, 5, 65, 66, 78, 80–82, 90, 102, 104, 142, 187, 253, 255, 274, 275
Web service chaining, 81
Workflow, 9, 49, 65, 67, 78, 177, 189, 230, 253–256, 264