Introduction to Environmental Statistics
INTRODUCTION TO ENVIRONMENTAL STATISTICS
Edited by: Akansha Singh and Esha Rami
www.delvepublishing.com
Introduction to Environmental Statistics
Akansha Singh and Esha Rami
Delve Publishing
224 Shoreacres Road
Burlington, ON L7L 2H2
Canada
www.delvepublishing.com
Email: [email protected]
e-book Edition 2023
ISBN: 978-1-77469-616-3 (e-book)
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated, and copyright remains with the original owners. Copyright for images and other graphics remains with the original owners as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data. The authors, editors, and publisher are not responsible for the accuracy of the information in the published chapters or the consequences of their use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods, or ideas in the book. The authors, editors, and publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so that we may rectify the omission.
Notice: Registered trademarks of products or corporate names are used only for explanation and identification, without intent to infringe.
© 2023 Delve Publishing
ISBN: 978-1-77469-460-2 (Hardcover)
Delve Publishing publishes a wide variety of books and eBooks. For more information about Delve Publishing and its products, visit our website at www.delvepublishing.com.
ABOUT THE EDITORS
Dr. Akansha Singh is presently working as a Project Scientist in the Department of Genetics and Plant Breeding, Institute of Agricultural Sciences, Banaras Hindu University, India. She has also worked as a Postdoctoral Fellow at Mumbai University and as an Associate Professor in the Department of Genetics and Plant Breeding, College of Agriculture, Parul University, India. She obtained her Ph.D. (Ag.) in Genetics and Plant Breeding from Banaras Hindu University in 2012 and was awarded the ICAR Senior Research Fellowship in 2010 to pursue her Ph.D. She has authored numerous national and international publications in journals of repute.
Dr. Esha Rami is presently working as an Assistant Professor in the Department of Life Science, Parul Institute of Applied Science, Parul University, India. She completed her postgraduate studies at Ganpat University and obtained her Ph.D. in Biotechnology from Hemchandracharya North Gujarat University in 2015. She has authored a number of national and international publications in reputed journals.
TABLE OF CONTENTS
List of Figures...............................................................................................xiii
List of Abbreviations.................................................................................... xvii
Abstract....................................................................................................... xix Preface..................................................................................................... ....xxi Chapter 1
Statistics..................................................................................................... 1 1.1. Introduction......................................................................................... 2 1.2. History of Statistics.............................................................................. 4 1.3. Types of Statistics................................................................................. 9 1.4. Exploratory Data Analysis (EDA)........................................................ 15 1.5. Types of Descriptive Statistics............................................................. 15 1.6. Measure of Variability........................................................................ 16 1.7. Inferential Statistics............................................................................ 17
Chapter 2
Development of Environment Statistics.................................................... 23 2.1. Introduction....................................................................................... 24 2.2. Framework for the Development of Environment Statistics................. 25 2.3. State-of-Environment Statistics in Developing Member Countries...... 30
Chapter 3
Environmental Data................................................................................. 33 3.1. The Frameworks of the Data............................................................... 36 3.2. Goals of Collecting Data About the Environment............................... 38 3.3. Additional Information and Analysis Regarding Risk Indices.............. 40 3.4. Methods for Public Relations and Retail Sales that Are Adapted Precisely to the Environment....................................... 41 3.5. Environment APIs!............................................................................. 42 3.6. The Amounts of Humidity.................................................................. 43
3.7. The State of the Atmosphere Has a Role............................................. 45 3.8. Parameters That are Used to Measure Biodiversity.............................. 48 3.9. Diversity of Species and Representation of Taxonomic Groups in the Data.......................................................................... 50 3.10. Concerning Measurements,Accuracy, and Possible Bias................... 50 3.11. The Benefits and Drawbacks of “Averaged” Indexes......................... 52 3.12. Considering the Numerous Error Causes.......................................... 53 3.13. Analysis of Biodiversity Data............................................................ 54 3.14. Societal and Occupational Health Information................................ 55 3.15. Excellent Air Quality........................................................................ 56 3.16. Examine the Locations of Monitors Using an Interactive Map.......... 56 3.17. Data Visualization............................................................................ 57 3.18. Some Fundamental Air Quality Concepts........................................ 58 3.19. Particulate Matter Data.................................................................... 59 3.20. Ozone Depletion Trends.................................................................. 59 3.21. Sources That Give Information on the Environment.......................... 60 3.22. How Should Data About the Environment Be Evaluated?................. 61 3.23. What is the Cost of Environmental Data on Average?....................... 61 3.24. What Questions Should You Ask Environmental Data Providers?...... 62 Chapter 4
The Role of Statistics in Environmental Science....................................... 63 4.1. Uses of Statistics in Environmental Science........................................ 66 4.2. Sources of Information....................................................................... 67 4.3. Methods............................................................................................ 68 4.4. Basic Concepts.................................................................................. 68 4.5. Applications of Statistical Tools in Environment Science..................... 69 4.6. Statistical Models............................................................................... 75 4.7. Goodness of Fit Test........................................................................... 77 4.8. Theoretical or Biological Models........................................................ 78 4.9. Fitting Niche Apportionments Models to Empirical Data.................... 81 4.10. Species Accumulation Curves.......................................................... 82 4.11. Users of Environmental Data........................................................... 84 4.12. Environmental Information.............................................................. 85 4.13. Sources of Environmental Statistics.................................................. 88 4.14. Monitoring Systems......................................................................... 90 4.15. Scientific Research........................................................................... 90 viii
4.16. Geospatial Information and Environment Statistics........................... 92 4.17. Institutional Dimensions of Environment Statistics........................... 93 4.18. Importance of Environmental Statisticians........................................ 94 Chapter 5
Types of Data Sources.............................................................................. 97 5.1. What Are Sources of Data?................................................................ 99 5.2. Types of Data Sources........................................................................ 99 5.3. Statistical Surveys............................................................................ 102 5.4. Collection of Data........................................................................... 104 5.5. Processing and Editing of Data......................................................... 105 5.6. Estimates and Projections Are Created............................................. 105 5.7. Analysis of Data............................................................................... 106 5.8. Procedures for Review..................................................................... 106 5.9. Dissemination of Information Products............................................ 107 5.10. The Benefits of Administrative Data................................................ 110 5.11. Limitations of Administrative Data................................................. 111 5.12. Obtaining and Learning from Administrative Data......................... 113 5.13. Remote Sensing and Mapping........................................................ 114 5.14. Technologies of Digital Information and Communication............... 119 5.15. Environmental Monitoring Types.................................................... 121 5.16. Iot-Based Environmental Monitoring.............................................. 123 5.17. Reasons For Environmental Monitoring.......................................... 124 5.18. Data From Scientific Research and Special Projects....................... 124 5.19. Global And International Sources of Data...................................... 127 5.20. Key Government Databases........................................................... 128 5.21. A Data Source is the Location Where Data That is Being Used Originates From...................................................... 131 5.22. Data Source Types......................................................................... 131 5.23. Sources of Machine Data............................................................... 132
Chapter 6
Environmental Sampling......................................................................... 133 6.1. Introduction..................................................................................... 134 6.2. Importance of Environmental Sampling........................................... 135 6.3. Environmental Sampling Methods.................................................... 136 6.4. Hydrological Traces......................................................................... 140 6.5. Measuring PH and Electrical Conductivity (EC)................................ 141 ix
6.6. Stream Gaging................................................................................. 143 6.7. Winkler Method for Measuring Dissolved Oxygen........................... 146 6.8. Measuring Turbidity Using a Secchi Disk......................................... 148 6.9. Conductivity, Temperature, and Depth Rosette (CTD)....................... 150 6.10. Stable Isotope Primer and Hydrological Applications..................... 152 6.11. Challenges of Environmental Sampling.......................................... 154 Chapter 7
Models for Data..................................................................................... 155 7.1. Introduction..................................................................................... 156 7.2. Literature Review............................................................................. 158 7.3. The Process of Developing Models for Data..................................... 163 7.4. Types of Data Models...................................................................... 164 7.5. The Advantages that Come With Using The ER Model...................... 167 7.6. Importance of Data Models............................................................. 170 7.7. What Makes a Data Model Good?................................................... 176 7.8. Data Properties................................................................................ 183 7.9. Data Organization........................................................................... 184 7.10. Data Structure................................................................................ 184 7.11. Data Modeling Tools to Know........................................................ 185 7.12. ER/Studio....................................................................................... 186 7.13. Db Modeling................................................................................. 186 7.14. Erbuilder........................................................................................ 186 7.15. Heidisql......................................................................................... 187 7.16. Open-Source................................................................................. 187 7.17. A Modeling Tool for SQL Databases............................................... 188 7.18. Data Flow Diagram (DFD)............................................................. 188 7.19. Data Conceptualization................................................................. 188 7.20. Unified Modeling Language (UML) Models................................... 189 7.21. Data Modeling Features................................................................. 190 7.22. Data Modeling Examples............................................................... 191 7.23. Summary....................................................................................... 192
Chapter 8
Spatial-Data Analysis.............................................................................. 193 8.1. Sa Geometric................................................................................... 197
8.2. History............................................................................................. 201 8.3. Spatial Data Analysis In Science...................................................... 203 8.4. Functions of Spatial Analysis............................................................ 204 8.5. Spatial Processes.............................................................................. 206 8.6. The Spatial Data Matrix: It’s Quality................................................. 209 8.7. Sources of Spatial Data.................................................................... 211 8.8. The Purpose and Conduct of Spatial Sampling................................. 213 8.9. Models for Measurement Error......................................................... 214 8.10. Analysis of Spatial Data and Data Consistency............................... 214 8.11. EDA (Exploratory Data Analysis) and ESDA (Exploratory Spatial Data Analysis).................................................................... 215 8.12. Data Visualization: Approaches and Tasks...................................... 220 Chapter 9
Challenges in Environmental Statistics................................................... 225 9.1. Introduction..................................................................................... 226 9.2. Statistical Models for Spatiotemporal Data (STD)............................. 227 9.3. Spatiotemporal (ST) Relationships.................................................... 228 9.4. Data Characteristics......................................................................... 229 9.5. Random Fields................................................................................. 231 9.6. Gaussian Processes and Machine Learning (Ml)............................... 232 9.7. Neural Networks............................................................................. 232 9.8. Population Dynamics Stochastic Modeling...................................... 233 9.9. Population Dynamics...................................................................... 234 9.10. Spatial Extended System................................................................ 235 9.11. Non-Gaussian Noise Sources......................................................... 236 9.12. Environmental Exposures and Health Effects in Collection of Environmental Statistics........................................ 237 9.13. General Logic and Strategy............................................................ 237
Chapter 10 Future of Environment Statistics............................................................. 241
10.1. Use of New Technologies.............................................................. 242 10.2. Technologies that Can Be Used in Environment Statistics: Predictive Analytics........................................................ 243 10.3. Changes in Utilization of Resources............................................... 245 Bibliography........................................................................................... 249 Index...................................................................................................... 255
LIST OF FIGURES Figure 1.1. Statistics Figure 1.2. Terminologies Used In Statistics Figure 1.3. Global carbon dioxide emissions by industry Figure 1.4. History of statistics Figure 1.5. Types of statistics Figure 1.6. Descriptive statistics Figure 1.7. Graphical comparison of abiotic depletion and global warming capacity Figure 1.8. Outliers Figure 1.9. A boxplot for yield traits (a); and showing outliers (b) Figure 1.10. Central tendency Figure 1.11. Measure of variability Figure 1.12. Inferential statistics Figure 1.13. Hypothesis testing Figure 1.14. Regression analysis Figure 2.1. A book with the title environmental statistics Figure 2.2. Framework for the development of environment statistics Figure 2.3. An ecological model Figure 2.4. Environmental statistics Figure 2.5. Development of environment statistics Figure 3.1. Environmental data representation Figure 3.2. Young Asian woman looking at see through global and environmental data whilst seated in a dark office. The data is projected on a see through (see-thru) display Figure 3.3. Trees regulate carbon dioxide levels (a greenhouse gas) in the atmosphere Figure 3.4. Environmental technology concept – sustainable development goals (SDGs) Figure 3.5. Hands of farmer with tablet with infographics on the screen Figure 3.6. Infographics template. Set of graphic design elements, histogram, arc, and Venn diagram, timeline, radial bar, pie charts, area, and line graph. Vector choropleth world map Figure 3.7. A silhouette of a man stands on the background of large office windows and views a hologram of corporate infographic with work data
Figure 3.8. Spring colored birds flirting, natural design, and unique moments in the wild Figure 3.9. Abstract visualization of data and technology in graph form (3D illustration) Figure 4.1. Environmental statistics Figure 4.2. There are various applications of statistics in environmental science Figure 4.3. Data from environmental statistics is used by government agencies among other groups Figure 4.4. Survey is a good source of information Figure 4.5. Descriptive statistics explores and visualizes environmental data Figure 4.6. Inferential statistics is formed on the basis of probability Figure 4.7. A Whittaker plot Figure 4.8. The log series model Figure 4.9. There are two methods of generating asymptotic curves Figure 4.10. Government agencies utilize environmental data Figure 4.11. Environmental data is used in environment statistics Figure 4.12. Environment indicators gives information on the state of the environment Figure 4.13. Environmental indices are part of environment Figure 5.1. Types of data sources Figure 5.2. Factorial design experiments Figure 5.3. A basic questionnaire in the Thai language Figure 5.4. Karl Pearson, a founder of mathematical statistics Figure 5.5. A manager keeping an administrative record of a meeting Figure 5.6. During meetings, a note taker is typically assigned to keep an administrative record that will be distributed later Figure 5.7. Importance of administrative data sources Figure 5.8. Geographic information system (GIS) Figure 5.9. Layers in a GIS Figure 5.10. Providers of remote sensing data Figure 5.11. Air quality monitoring station Figure 5.12. Collecting a soil sample in Mexico for pathogen testing Figure 5.13. Top 10 tips for good research data management Figure 5.14. Data activities Figure 5.15. The United Nations Statistics Division is committed to the advancement of the global statistical system. We compile and disseminate global statistical information, develop standards and norms for statistical activities, and support countries’ efforts
to strengthen their national statistical systems. We facilitate the coordination of international statistical activities and support the functioning of the United Nations Statistical Commission as the apex entity of the global statistical system Figure 7.1. An image showing how models of data are Figure 7.2. Data models Figure 7.3. Types of data models Figure 7.4. An entity-relationship (E-R) model diagram Figure 7.5. An example of a hierarchical model Figure 7.6. An example of a network model Figure 7.7. An example of a relational mode Figure 7.8. Importance of data models Figure 7.9. Leverage Figure 7.10. Components of quality data Figure 7.11. Practical environmental monitoring and aseptic models Figure 7.12. Communication Figure 7.13. Performance Figure 7.14. Conflicts Figure 8.1. Data management and analysis Figure 8.2. Ornate world maps were characteristic during the “age of exploration” in the 15th through 17th centuries Figure 8.3. The application of mathematical models Figure 8.4. Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 broad street cholera outbreak. This was one of the first uses of map-based spatial analysis Figure 8.5. Listing of parcel number and value with land use = ‘commercial’ is an attribute query. Identification of all parcels within 100-m distance is a spatial query Figure 8.6. Landowners within a specified distance from the parcel to be rezoned identified through spatial query Figure 8.7. Device controlling agricultural robot Figure 8.8. Macro of Washington DC on the map Figure 8.9. Assessment icon set Figure 8.10. The concept of collecting data on humidity, temperature, illumination of acidity, fertilizers, and pests without human intervention, the transmission of the obtained data and their analysis to increase the yield Figure 8.11. Consistency check Figure 8.12. What is exploratory data analysis?
Figure 8.13. Administrative regions of the Netherlands Figure 8.14. Businessmen in a dark room standing in front of a large data display Figure 8.15. Types of data visualization Figure 8.16. Land cover surrounding Madison, WI. Fields are colored yellow and brown, water is colored blue, and urban surfaces are colored red
LIST OF ABBREVIATIONS
µS/cm	microsiemens per centimeter
APIs	application programming interfaces
AQI	air quality index
ASA	American Statistical Association
B-G	Boltzmann-Gibbs
BLUP	best linear unbiased predictor
CAST	computer-assisted statistics
CBSA	core-based statistical areas
CRD	completely randomized design
CTD	conductivity, temperature, and depth rosette
DBMS	database management system
DFD	dataflow diagram
DIBS	database of international business statistics
DMP	data management plan
DMPS	dimethyl phenyl sulfone
DOM	document object model
DPU	diphenylurea
EC	electrical conductivity
EDA	exploratory data analysis
EDMS	environmental data management systems
EPA	Environmental Protection Agency
E-R	entity relationships
ERD	entity relationship diagram
ESP	electrostatic precipitator
FD	factorial designs
FDES	framework for development of environmental statistics
GFD	global financial information
GIS	geographic information system
GISc	geographic information science
GPA	grade point average
GPR	GP regression
HDI	human development index
HVAC	heating, ventilation, and air conditioning
ICPSR	inter-university political and social research consortium
LSD	Latin square design
LSS	London Statistical Society
ML	machine learning
mS/m	millisiemens per meter
NGOs	non-governmental organizations
NOAA	National Oceanic and Atmospheric Administration
NPOESS	National Polar-orbiting Operational Environmental Satellite System
NWS	National Weather Service
PDM	physical data model
PRS	political risk services
QuickSCAT	quick scatterometer
RBD	randomized block design
RDBMS	relational database management systems
RDM	research data management
RothC	Rothamsted carbon model
SEM	smart environmental monitoring
SPE	solid-phase extraction
ST	space-time
ST	spatiotemporal
STD	spatiotemporal data
SVOCs	semi-volatile organic compounds
TAP	total annual precipitation
UCSD	University of California, San Diego
UML	unified modeling language
UNECE	United Nations Economic Commission for Europe
UNSD	United Nations Statistics Division
USLE	Universal Soil Loss Equation
VOCs	volatile organic compounds
WSN	wireless sensor network
ABSTRACT
Environmental statistics involves the application of statistical approaches to environmental science. It covers methods for addressing questions about the natural ecosystem in its undisturbed condition. This reader-friendly book emphasizes the areas of probability theory and measurement that are most important in environmental data analysis, monitoring, research, ecological field studies, and environmental decision making. It discusses fundamental statistical theory with minimal formality, but without omitting significant details and assumptions. The book also presents a theory of how and why physical processes in the environment generate right-skewed, lognormal distributions. The volume likewise presents the Rollback Statistical Theory, which permits data analysts and administrators to estimate the effect of various emission control strategies on frequency distributions of environmental quality. Assuming only a basic understanding of algebra and calculus, Environmental Data Analysis and Statistics provides a superior reference and collection of statistical methods for investigating environmental data and developing accurate environmental predictions.
PREFACE
In contemporary society, people are increasingly mindful of environmental issues, whether these involve global warming, disruption of rivers and oceans, encroachment on forests, contamination of land, poor air quality, environmental health problems, and so forth. At the most basic level, it is important to monitor what is in the environment, gathering data to describe the evolving scene. More critically, it is essential to formally describe the environment with sound and validated models, and to analyze and interpret the information we obtain in order to take action. Environmental Statistics gives a broad overview of the statistical methodology used in the study of the environment, written in an accessible style by a leading expert on the subject. It serves both as a textbook for students of environmental statistics and as a comprehensive source of reference for anyone working in the statistical analysis of environmental problems. The volume covers a wide range of key themes, including sampling, methods for extreme data, correlation models and techniques, time series, spatial analysis, and environmental standards, and it discusses practical examples that illustrate the use of statistical techniques on environmental problems. Environmental degradation is coming back to haunt humanity. Globally, people are dying from pollution at an alarming rate. Meanwhile, the list of affected animal species continues to grow, and the loss of their natural habitats could give rise to diseases far worse than coronavirus. Sadly, environmental statistics show that developing nations are the worst affected, with many people dying from unhygienic conditions. All is not lost, however: researchers are attempting to fast-track efforts to save the Earth. We have only one planet, so we all must preserve it. Statistical analysis is crucial for the environmental sciences, allowing specialists to acquire a deep understanding of environmental phenomena by investigating everyday problems and developing possible solutions to them. The applications of statistical methods to the environmental sciences are diverse. Environmental statistics are used in many settings, such as environmental standardization bodies, research centers, health and safety institutions, meteorological departments, fisheries, and regulatory offices concerned with environmental management. Environmental statistics is particularly relevant, and broadly used, in the academic, administrative, technological, and environmental consulting sectors. Particular uses of statistical analysis in environmental science include earthquake risk analysis, policymaking, biological sampling techniques,
and environmental forensics. In environmental statistics, there are two fundamental categories of purpose. Descriptive statistics is used not just to summarize data but also to describe its qualities, whereas inferential statistics can be applied to test hypotheses, draw inferences from data, or make predictions. Some forms of study captured in environmental statistics are: baseline studies, which document the current state of an environment to provide a point of reference in the event of unforeseen changes; targeted studies, which describe the likely impact of planned modifications or of unplanned events; and regular monitoring, which is intended to detect changes in the environment. The data sources of environmental statistics are diverse and include surveys of human populations and of the ecosystem, as well as records from organizations that manage environmental resources.
CHAPTER 1

STATISTICS
CONTENTS
1.1. Introduction......................................................................................... 2 1.2. History of Statistics.............................................................................. 4 1.3. Types of Statistics................................................................................. 9 1.4. Exploratory Data Analysis (EDA)........................................................ 15 1.5. Types of Descriptive Statistics............................................................. 15 1.6. Measure of Variability........................................................................ 16 1.7. Inferential Statistics............................................................................ 17
1.1. INTRODUCTION Both “the science of learning from data” and “the science of measuring, managing, and conveying uncertainty” are included in the definition of statistics provided by the American Statistical Association (ASA). “The science of expressing uncertainty and learning from data” is how statistics are typically described. Using this method, even though not all statisticians would agree with it, provides a broad beginning point that has a history of being successful. It incorporates and encapsulates concisely the “broader perspectives” of Hahn and Doganaksoy (2012); and Fienberg (2012), as well as the definitions given in the first pages of those two authors’ works, as well as the “greater statistics” of Chambers (1993), the “broader field” of Bartholomew (1995), and the “broader vision” of Brown and Kass (2012). In addition to this, it addresses some of the more restricted points of view. Even though statisticians have investigated every facet of this cycle, researchers, and thinkers in statistical theory and practice have concentrated on a variety of subjects at various times (Figure 1.1).
Figure 1.1. Statistics. Source: https://online.stat.psu.edu/statprogram/sites/statprogram/files/2018-08/statistics-review.jpg.
At least for the previous half-worth centuries, the primary focus of attention has been placed on the application of probabilistic models during the phases of analysis and conclusion, and to a lesser extent, on sampling and experimental design procedures during the stage of plan. This has been the case even though probabilistic models have been used for at least the previous half-century. On the other hand, to glimpse into the future of statistical education, it is required to take a more all-encompassing perspective (Zio et al., 2004). Because of the nature of the statistician’s work, many people believe that the statistical field as a whole, and education within the statistical field, in particular, will thrive “in the future.” An education in statistics should provide students with both conceptual frameworks (tools for organized thought) and practical talents that will better equip them for their future lives in a world that is always evolving. Because the digital world is in a constant state of flux and evolution, educators have little choice but to concentrate on the future rather than on the lessons of the past in order to ensure that their students are adequately prepared for the digital world of the future. Even while it is necessary to think about the past, our major goal should be to find ways to make better use of the wealth of historical data that is available to us in order to improve our level of readiness for the future (Ying and Sheng-Cai, 2012). In educational settings, the term “statistics” must not be defined in terms of the methods that statisticians have traditionally used to accomplish their goals; rather, statistics ought to be characterized in terms of the outcome’s statisticians want to accomplish (Figure 1.2).
Figure 1.2. Terminologies used in statistics. Source: https://editor.analyticsvidhya.com/uploads/75819stats_1050x520.png.
Changing capabilities, such as those made accessible by developments in technology, may cause the preferred method of achieving goals to change
over time; nonetheless, the fundamental goals will continue to be the same. The all-encompassing description that we initially began with is “keeps our eyes on the ball” because it centers our universe on the innate human need to learn about how our environment works through data, while also recognizing sources of uncertainty and varying degrees of it. The phrase “keeps our eyes on the ball” comes from the phrase “keeps our eyes on the ball” (Xia et al., 2017). In order to solve a particular substantive issue, statisticians develop innovative approaches to the problem-solving process. In addition, students take a step back and use concepts from logic as well as statistics in order to combine what they have studied into a more comprehensive structure “if one is to believe what Mr. Fienberg has to say about the matter (2014). After that, they can take their notions elsewhere and experiment with new permutations of those ideas in order to go forward. In the vast majority of academic fields, there is a strong desire to acquire further knowledge regarding the natural world, living beings, and the economic and political structures that regulate these categories. The realization that these concerns are necessary to our continued existence is the source of this motivation. As a result of its focus on the mental processes involved in transforming raw data into knowledge that can be used, statistics are regarded as a meta-discipline. Statistics as a metadiscipline advance whenever the methodological lessons and principles learned from a single piece of work are abstracted and incorporated into a theoretical framework. This makes it possible for them to be applied to a broad range of issues in several situations, which in turn advances statistics as a meta-discipline.
1.2. HISTORY OF STATISTICS Despite the fact that census information has been collected since ancient times, rulers “were concerned with keeping track of their people, money, and important events (such as battles and Nile floods), but nothing more in terms of quantitative appraisal of the entire world.” As a direct outcome of his work, John Graunt is often regarded as the man who invented statistical data analysis (including the publishing of his book Natural and Political Observations in 1662). Graunt made the observation that the plague was passed from person to person, which is in contrast to the alternative hypothesis, which held that “filthy air” was to blame for the progression of illnesses that occurred over time. Graunt and other “political arithmeticians” in Western Europe during the Renaissance were affected by the emergence
of science that was based on observation of the natural world (Wieland et al., 2010). The manner in which they “reasoned about their facts in the same way that they do” is comparable to the way that we think now. They did not only gather or acquire information; rather, they evaluated, forecasted, and gained knowledge based on it. In addition to this, they contended that the direction of state policy should be determined by data, rather than by the authority of the church and the nobility (Porter, 1986). On the other hand, the approach to statistics that were taken by the political mathematician lacked discipline in the process of collecting data and analyzing it (Figure 1.3).
Figure 1.3. Global carbon dioxide emissions by industry. Source: https://www.eraa.org/sites/default/files/environment_0.jpg.
Sampling for surveys and counting persons for censuses were still in their infancy throughout the eighteenth century. In addition, Pascal (1623– 1662); and Bernoulli (1654–1705) laid the groundwork for probability, which ultimately led to the creation of statistics as we know it today. Pascal was alive from the years 1623 to 1662, whereas Bernoulli lived from the years 1654 to 1705. Bayes (1764); and Laplace (1749–1827) made major conceptual improvements in the application of probability to quantitative inference by making use of the Bayesian inversion of probability. The body of work that Bayes produced is universally acknowledged as being among the most important to have been done in the field. Around the year 1800, astronomy was considered the most important subject to study, and many of the most talented mathematicians of the time period made substantial contributions to the field. Legendre, Gauss, and Laplace were all affected
in their various fields of mathematics by astronomical difficulties and discoveries. Legendre developed the least-squares method, while Gauss developed the normal theory of errors (least squares and the central limit theorem). Quetelet (1796–1874) wanted to build universal laws that regulate human activity, and he did so by applying these concepts to social reality and using them in his studies. He lived from 1796 to 1874. After the French Revolution, there was some progress made in the idea of statistics as a state science, and statisticians began conducting surveys on a variety of topics, including trade, industrial growth, labor, poverty, education, sanitation, and crime. These surveys were conducted in the decades that followed the French Revolution (Figure 1.4) (Porter, 1986).
Figure 1.4. History of statistics. Source: https://alchetron.com/cdn/history-of-statistics-aca1db15-f7d8-476fac40-55af872841d-resize-750.jpeg.
The development of statistical graphics is the third factor that has played a role in the expansion of the field of statistics. William Playfair, who lived from 1759 to 1823, was the first well-known person. He is credited with creating the line chart, bar chart, and pie chart, all of which are used for the visualization of economic data. According to Friendly (2008), the time period between 1850 and 1900 is considered to have been the “golden period of statistical graphics.” This was the era that saw the creation of John Snow’s cholera data dot map and the Broad Street pump, Minard’s famous graph depicting soldier losses during Napoleon’s march on Moscow and subsequent retreat, Florence Nightingale’s coxcomb plot, which was used to persuade of the need for improved military field hospitals, and the introduction of the majority of the graphic forms that are still used today for conveying geographically-linked information on maps, such as the coxcomb
plot (Horton and Utts, 2015). Adolphe Quetelet, Charles Babbage, and Thomas Malthus were among the illustrious men responsible for establishing the London Statistical Society (LSS); Malthus is well known for the views he expressed concerning the growth of the human population. In 1858, Florence Nightingale became the first woman to join the LSS. These early members of the LSS and the ASA were noteworthy because they represented a diverse assortment of real-world endeavors (spanning the scientific, economic, political, and social realms). Many consider the contributions to statistics made toward the latter half of the nineteenth century by individuals such as Francis Galton, Francis Ysidro Edgeworth, Karl Pearson, and George Udny Yule to be the point at which the field first took shape. Because of their varied experience in domains including biology, commerce, and the social sciences, they were able to develop statistical approaches that were applicable not just to the areas of research in which they had previously worked, but also to a broad variety of other fields of study (Wagner and Fortin, 2005). Concerns raised by William Gosset in the 1920s sparked a new round of research, which ultimately led to Ronald Fisher's contributions to experimental design, analysis of variance, maximum likelihood estimation, and the extension of significance testing. Egon Pearson and Jerzy Neyman (1933) laid the groundwork for hypothesis testing and the development of confidence intervals in the 1930s, according to Fienberg. By 1940, the foundation for most of the ideas that make up "modern statistics" had been laid, thanks to the pioneering work of Bruno de Finetti and Harold Jeffreys on subjective and objective Bayesian inference, respectively. During World War II, there was a significant increase in the employment available to young mathematicians, much of which required them to deliver immediate solutions to problems relating to the war; as a consequence, the field experienced an era of extraordinary advancement, and the contributions made by a great number of statisticians during this period were essential to its development. We also place a strong emphasis on the ground-breaking work that John Tukey did in the field of exploratory data analysis (EDA) in the 1970s. Fienberg, Scheaffer (2001), and Pfankuch and Wild (2004) investigated how the belief systems of mathematicians have changed over time; in particular, these researchers studied how
mathematicians were paid or hired, and how this influenced their views and how those views changed over time. Far more comprehensive accounts are given by Fienberg (1992); Porter (1986); Stigler (1986, 2016); and Hacking (1990). Vere-Jones (1995); Scheaffer (2001); Holmes (2003); and Forbes are well-known examples of historical references relevant to the study of statistics education. Statisticians are required to have expertise in a broad range of methodologies, including mathematical, computational, and statistical procedures. Dealing with mathematical derivations, which ultimately rest on mathematical proofs, calls for one set of cognitive abilities, while producing computer code requires a different set. Although each of these modes of cognition has robust connections within itself, the interconnections that bind them to one another are not quite as strong. In the following paragraphs, we discuss "statistical reasoning," often understood as the process of applying statistics to events that occur in the real world (Toivonen et al., 2001). The phrase "solving real-world (or practical) problems," however, is used rather loosely in statistical research. To the general public, "solving a real-world problem" means attempting to abolish or, at the very least, significantly reduce the problem (for example, lowering unemployment rates). It is therefore essential to differentiate between "the desire to act" and "the need to know." Most of the time, deciding how to act on a topic calls for the accumulation of additional knowledge so that an informed decision can be made, and this is where statistical research proves advantageous: it satisfies the "need to know." Statisticians frequently cite a lack of information or understanding as a recurring issue when they propose potential remedies for problems that occur in the real world.
1.3. TYPES OF STATISTICS

There are two types of statistics (Figure 1.5):
•	Descriptive statistics;
•	Inferential statistics.
Figure 1.5. Types of statistics. Source: https://media.geeksforgeeks.org/wp-content/uploads/20200625233042/Untitled-Diagram-918.png.
1.3.1. Descriptive Statistics Data summaries in the form of tables, graphs, and numerical summaries are all covered under descriptive statistics. The goal of descriptive statistics is to make data understandable to a broader audience by reducing it to its simplest form. For the most part, the information you’ll find in publications like newspapers and magazines is written in the form of a description. When studying the associations between two or more variables, multivariate descriptive statistics use statistics; when studying a single variable, univariate descriptive statistics makes use of data. Descriptive statistics approaches will be demonstrated by evaluating this case study, which entailed collecting data on the age, gender, marriage status, and annual income of a hundred people (Sivak and Thomson, 2014). A data set might represent the complete population or merely a subset of it, and descriptive coefficients are used to give a concise description of the data set’s contents. Descriptive statistics include measurements of central tendency and measures of variability (spread). While central tendency metrics include the median, the mode, and the mean, the standard deviation, variance, kurtosis, and skewness are also used to quantify variability. There are a number of measures that may be used to determine central tendency (Figure 1.6).
Figure 1.6. Descriptive statistics. Source: https://miro.medium.com/max/707/1*w2hGJO5gUD6se5yQ6Efsdw.png.
Distinguishing the characteristics of a given data set is made easier by descriptive statistics, which provide brief summaries of the sample and of the measurements that have been collected. The three most well-known descriptive statistics are the mean, median, and mode, and they are used almost universally in math and statistics classes. The mean, or average, of a data collection is calculated by taking the total and dividing it by the number of individual values. For example, the values (2, 3, 4, 5, 6) add up to a total of 20, and dividing by the five values gives a mean of 4. The mode is the most common value in a collection of data, whereas the median is the value right in the middle of the ordered distribution. The range is the distance between the greatest and lowest values in a dataset. Less familiar descriptive statistics also have a role to play: they are used to summarize quantitative conclusions, drawn from a vast data collection, that would otherwise be difficult to grasp (Simoncelli and Olshausen, 2001). A student's grade point average (GPA) is a familiar example of a descriptive statistic. The GPA averages a student's performance across a range of exams, assignments, and grades, so overall academic success can be measured with a single number.
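As a minimal illustration (not taken from the book, and assuming Python's standard statistics module as the tool), the calculations just described can be reproduced for the example values (2, 3, 4, 5, 6); the list used for the mode is a separate hypothetical one, since every value in the example occurs only once:

```python
# Illustrative sketch: mean, median, mode, and range of small data sets.
from statistics import mean, median, mode

data = [2, 3, 4, 5, 6]

print(mean(data))             # (2 + 3 + 4 + 5 + 6) / 5 = 4.0, the average
print(median(data))           # middle value after sorting -> 4
print(max(data) - min(data))  # range: greatest minus lowest value -> 4

other = [2, 3, 3, 5, 6]       # hypothetical list with a repeated value
print(mode(other))            # most frequent value -> 3
```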
1.3.2. Tabular Methods A table that summarizes data for a variable that is used most frequently in a given context is called the frequency distribution. This table may be
found in the frequency distribution. The number of unique data values that may be placed into each of a large number of separate categories can be shown graphically in the form of frequency distributions. The term “relative frequency distributions” refers to sets of tables that illustrate the percentage of data points that belong to each group. This type of distribution is often referred to by the name frequency distribution. A cross-tabulation is essentially a frequency distribution with two variables; it is the most common way that data for two variables are represented in tables. When looking at the frequency distribution of a variable, it is feasible to determine the number of data points that fall into each qualitative category. For instance, the male and the female categories are both viable options for the gender variable. In a gender frequency distribution, there would be two distinct clusters of males and females, neither of which would overlap with the other. These clusters would be males and females. This variable’s relative frequency distribution can be used to make an educated guess about the proportion of men and females that are present in a particular population (Seid et al., 2014). When developing a frequency distribution for a quantitative variable, it is necessary to take into consideration the classification points as well as the division points between neighboring classes. Using the age range of 22 to 78 as an example, we may categorize the data using the categories 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79, respectively. Alternatively, we can utilize the whole age range. A frequency distribution, in contrast to a relative frequency distribution, represents the total number of data values that belong to each of these classes rather than the number of data values that belong to each of these classes individually. A cross-tabulation in a type of two-way table in which each row represents the classes of one variable and each column represents the classes of the other variable. This kind of table also goes by the moniker of a two-way table in some circles. Two rows can be used to represent men and females, while six columns can be used to represent age groups ranging from 20 to 29 years old, 30 to 39 years old, 40 to 49 years old, 50 to 59 years old, 60 to 69 years old, and 70 to 79 years old (Rovetta and Castaldo, 2020). Each cell in the table would have a total number of data values, and the row and column headers would indicate the age and gender of the subjects, accordingly. By using this method of cross-tabulation, we have a better chance of gaining further insight into the connection that exists between age and gender.
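The frequency distribution, relative frequency distribution, and cross-tabulation described above can be sketched in a few lines of Python; the survey records below are hypothetical, and the helper age_class() is introduced here only for illustration:

```python
# Sketch: frequency distribution, relative frequencies, and a gender-by-age
# cross-tabulation with the classes 20-29, 30-39, ..., 70-79.
from collections import Counter

records = [("male", 23), ("female", 31), ("female", 45), ("male", 62),
           ("female", 29), ("male", 71), ("female", 54), ("male", 38)]

def age_class(age):
    lower = (age // 10) * 10          # e.g., 45 -> 40
    return f"{lower}-{lower + 9}"     # e.g., "40-49"

# Frequency distribution of the qualitative variable "gender"
gender_counts = Counter(gender for gender, _ in records)
print(gender_counts)

# Relative frequency distribution (proportion of data values in each category)
n = len(records)
print({g: c / n for g, c in gender_counts.items()})

# Cross-tabulation: rows are gender categories, columns are age classes
crosstab = Counter((gender, age_class(age)) for gender, age in records)
for (gender, cls), count in sorted(crosstab.items()):
    print(gender, cls, count)
```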
1.3.3. Graphical Methods There are many different graphical representations from which you may select when explaining data. A bar graph is a type of graph that uses qualitative data to graphically represent the frequency distribution of data. The categories that are associated with the qualitative variable are labeled along the axis that is situated horizontally in the middle of the graph. The height of the bar that appears above each label is determined by the total amount of data points that are contained inside a category. A bar graph is used in Figure 1.7 to illustrate the marital status of the 100 individuals who participated in the scenario that came before. On the graph, each of the four categories is denoted by a separate bar. The pie chart is an extra visual method that may be utilized to summarize qualitative data. The amount of data values that are contained inside each category is directly proportionate to the size of each category’s slice of the pie (Pleil et al., 2014). Histograms are by far the most common type of graphical representation used when quantitative data is provided in the form of a frequency distribution. On the horizontal axis of the graph are labels indicating the values of the quantitative variable. A rectangle is displayed above each class; its width is proportional to the interval’s breadth, and its height is proportional to the total amount of data values included in the class (Figure 1.7).
Figure 1.7. Graphical comparison of abiotic depletion and global warming capacity. Source: https://www.researchgate.net/figure/Graph-comparing-the-environmental-impact-categories-analyzed-between-scenarios-RS-HS_fig1_268818096.
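A rough sketch of the three chart types discussed above, using matplotlib (an assumed tool; the book does not prescribe software) and hypothetical survey data for 100 people:

```python
# Sketch: bar graph and pie chart for a qualitative variable (marital status),
# and a histogram for a quantitative variable (age) with 10-year class intervals.
import matplotlib.pyplot as plt

statuses = ["single", "married", "divorced", "widowed"]
counts = [32, 48, 12, 8]                                   # hypothetical frequencies
ages = [23, 31, 45, 62, 29, 71, 54, 38, 47, 66, 35, 58]    # hypothetical ages

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].bar(statuses, counts)                # bar height = data values in each category
axes[0].set_title("Bar graph of marital status")

axes[1].pie(counts, labels=statuses)         # slice size proportional to category frequency
axes[1].set_title("Pie chart of marital status")

axes[2].hist(ages, bins=range(20, 90, 10))   # class intervals 20-29, 30-39, ...
axes[2].set_title("Histogram of age")

plt.tight_layout()
plt.show()
```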
1.3.4. Numerical Measures

Numerous numerical measures are used to summarize data. The most important numerical measure for qualitative data is the proportion of the data values that fall into each category, which can also be expressed as a percentage. The most common numerical measures for quantitative data are the mean, median, mode, range, variance, and standard deviation; percentiles are another. The mean, also known as the average, is calculated by adding all of the data values for a variable and dividing the total by the number of data points (Piegorsch and Edwards, 2002). The mean can be thought of as the center of gravity of the data. In contrast to the mean, which is influenced by extreme values at either end of the distribution, the median is a measure of central location that is unaffected by such extremes. Before computing the median, the data values are sorted from lowest to highest. When there is an even number of data values, the median is the average of the two values in the middle; when there is an odd number, the median is the middle value. The mode, the final measure of central tendency, is the value that occurs most frequently. Percentiles describe how the data values are spread over their range, ordered from smallest to largest: approximately p percent of the data values lie below the pth percentile and the remainder lie above it (Antweiler and Taylor, 2008). The percentiles reported on most standardized tests are a familiar example. The 25th percentile is the first quartile, the 50th percentile (which is also the median) is the second quartile, and the 75th percentile is the third quartile. The range, as its name suggests, is the simplest measure of how much a data set varies; it is based only on the two most extreme data values. The variance (s²) and standard deviation (s), on the other hand, are measures of variability based on all of the data. To compute the sample variance, add up all of the squared deviations from the mean and divide the sum by n–1, where n is the number of observations. The standard deviation is the square root of the variance (Pleil et al., 2014). Many people prefer the standard deviation as a descriptive measure of variability because it is expressed in the same unit of measure as the data.
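The following sketch, using Python's standard statistics module and invented values, computes the measures just described; note that the sample variance divides by n–1:

```python
import statistics

# Hypothetical quantitative data values (illustrative only)
data = [12, 15, 15, 18, 21, 24, 30]

n = len(data)
mean = sum(data) / n                    # center of gravity of the data
median = statistics.median(data)        # middle value after sorting
mode = statistics.mode(data)            # most frequently occurring value
data_range = max(data) - min(data)      # based only on the two extreme values
variance = statistics.variance(data)    # sample variance: sum of squared deviations / (n - 1)
std_dev = statistics.stdev(data)        # square root of the sample variance

print(mean, median, mode, data_range, variance, std_dev)
```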
1.3.5. Outliers

A variable may have one or more data values that stand out from the rest of the dataset because they are unusually large or unusually small. Outliers can arise when inaccurate values are included during data collection. Statisticians identify and evaluate outliers so that a decision can be made on whether to retain them in the dataset. If a mistake is found, appropriate remedial steps can be taken, such as rejecting the data value responsible for the problem. The mean and standard deviation are two statistics that may be used to locate and analyze outliers. A z-score can be computed for any value in the data set using the formula z = (x – x̄)/s, where x is the data value, x̄ is the sample mean, and s is the sample standard deviation (Parise et al., 2014). The z-score expresses how many standard deviations a data point lies from the mean. A decent rule of thumb is that any value whose z-score is lower than –3 or higher than +3 may be considered an outlier (Figure 1.8).
Figure 1.8. Outliers.
Source: https://datascience.foundation/img/pdf_images/knowing_all_about_outliers_in_machine_learning_sample_points_in_green_are_near_to_each_other.jpg.
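A minimal sketch of the z-score rule just described, using invented measurements; the deliberately extreme value is expected to be flagged because its z-score exceeds +3:

```python
import statistics

# Hypothetical daily readings: 30 typical values plus one suspiciously large one
values = [9.8, 10.1, 9.9, 10.2, 10.0] * 6 + [20.0]

mean = statistics.mean(values)
s = statistics.stdev(values)   # sample standard deviation

# z = (x - mean) / s ; values with |z| > 3 are flagged as potential outliers
z_scores = [(x, (x - mean) / s) for x in values]
outliers = [x for x, z in z_scores if abs(z) > 3]
print(outliers)   # expected to contain 20.0
```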
1.4. EXPLORATORY DATA ANALYSIS (EDA)

Exploratory data analysis (EDA) offers a number of strategies for rapidly summarizing and investigating a dataset. Two of these ideas are the five-number summary and the box plot. A five-number summary consists of the smallest data value, the first quartile, the median, the third quartile, and the largest data value. A five-number summary can be illustrated graphically with a box plot. A rectangle (the box) is drawn starting at the first quartile and ending at the third quartile; it depicts the middle portion of the data. A vertical line is drawn inside the box at the median. Finally, whiskers are drawn from one end of the box to the smallest data value and from the other end to the largest data value (Patil and Taillie, 2003). When there are no data points that can be categorized as outliers, the whiskers extend to the minimum and maximum values of the data set. When outliers fall beyond the whiskers, asterisks or dots are often used to indicate their presence (Figure 1.9).
Figure 1.9. A boxplot for yield traits (a); and showing outliers (b).
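A sketch of the five-number summary in Python follows; the data values are invented, and the 1.5 × IQR whisker rule shown at the end is a common plotting convention rather than part of the five-number summary itself:

```python
import numpy as np

# Hypothetical yield-like measurements (illustrative only)
data = np.array([42, 47, 51, 53, 55, 58, 60, 62, 66, 71, 95])

q1, median, q3 = np.percentile(data, [25, 50, 75])
five_number_summary = (data.min(), q1, median, q3, data.max())
print(five_number_summary)

# Whisker limits used by many box plots: points beyond 1.5 * IQR are drawn as outliers
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < lower or x > upper])   # 95 falls beyond the upper whisker
```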
1.5. TYPES OF DESCRIPTIVE STATISTICS

There are two types of descriptive statistics: measures of central tendency and measures of variability.
1.5.1. Central Tendency

Measures of central tendency focus on the mean or median value of a data collection, whereas measures of variability focus on the dispersion of the data. Both kinds of measures are used, together with graphs, tables, and general commentary, to help readers understand the meaning of the analyzed data. Measures of central tendency describe where the center of a data set's distribution is located (Adelman, 2003). The mean, median, or mode of a study represents the most prevalent patterns in the collection (Figure 1.10).
Figure 1.10. Central tendency. Source: https://cdn.corporatefinanceinstitute.com/assets/central-tendency4.png.
1.6. MEASURE OF VARIABILITY

Measures of variability (also known as measures of spread) are used to analyze the dispersion of the distribution of a data set. Measures of central tendency supply only the center of a collection of data; they do not reveal how the values within the set are distributed (Figure 1.11).
Figure 1.11. Measure of variability. Source: https://slideplayer.com/4909712/16/images/slide_1.jpg.
The shape and dispersion of a data collection may be described with measures of variability, which makes communication much easier (Notaro et al., 2019). Variability may be measured in a variety of ways; examples include the range, quartiles, absolute deviation, and variance. Consider the numbers 5, 19, 24, 62, 91, and 100. The range of this data set is 95, obtained by subtracting the minimum value (5) from the maximum value (100).
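Using the six example values from the text, a quick check of the range (and, for comparison, the interquartile range) might look like this:

```python
import numpy as np

values = np.array([5, 19, 24, 62, 91, 100])   # example values from the text

data_range = values.max() - values.min()      # 100 - 5 = 95
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                                  # spread of the middle half of the data

print(data_range, iqr)
```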
1.7. INFERENTIAL STATISTICS

In inferential statistics, sample data are subjected to a number of analytical techniques in order to draw conclusions about the population as a whole. The study of statistics includes both inferential statistics and descriptive statistics: inferential statistics are used to make generalizations about populations, whereas descriptive statistics are used to describe the attributes of a data set (Austin et al., 2005). Inferential statistics is the process of drawing conclusions about a population from a random sample with the help of analysis tools. Using inferential statistics, and basing the inferences on an adequate sample, it is feasible to draw broad generalizations about the population. Conclusions about a population parameter (for example, the population mean) can be derived by employing a statistic computed from sample data. Examining representative samples collected from a larger population provides a more in-depth understanding of that population, and a wide range of analytical tools and procedures makes it considerably easier to develop such generalizations (Ng and Vecchi, 2020). Several different sampling processes may be used to choose a random sample that correctly represents a population. Essential techniques for data collection and sample design include simple random sampling, stratified sampling, cluster sampling, and systematic sampling (Figure 1.12).
Figure 1.12. Inferential statistics. Source: https://i.ytimg.com/vi/84H8HNV9mk0/maxresdefault.jpg.
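The sampling designs just listed can be sketched in a few lines of Python; the population below is a hypothetical frame of 1,000 units, and real surveys would use proper sampling frames and weights:

```python
import random

random.seed(1)
population = list(range(1, 1001))          # hypothetical sampling frame of 1,000 units

# Simple random sampling: every unit has the same chance of selection
srs = random.sample(population, 50)

# Systematic sampling: every k-th unit after a random start
k = len(population) // 50
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample separately within predefined strata (here, two halves)
strata = [population[:500], population[500:]]
stratified = [unit for stratum in strata for unit in random.sample(stratum, 25)]

print(len(srs), len(systematic), len(stratified))
```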
Inferential statistics employ a range of methodologies, the most common of which are hypothesis testing and regression analysis. If inferential statistics are to have any validity, they must be drawn from a sample that is representative of the population being researched.
1.7.1. Hypothesis Testing

Hypothesis testing is an inferential procedure in which data from a sample are used to test claims and draw conclusions about the population as a whole. To do this, a null hypothesis and an alternative hypothesis are formulated, and a statistical test is performed to determine whether the sample provides significant evidence against the null hypothesis. To reach a conclusion, the value of the test statistic, the critical value, and the confidence interval must all be considered (Bankole and Surajudeen, 2008). Left-tailed, right-tailed, and two-tailed hypothesis tests differ in how this evidence is assessed. The main types of hypothesis tests used in inferential statistics are as follows (Figure 1.13):
Figure 1.13. Hypothesis testing. Source: https://www.datasciencecentral.com/wp-content/uploads/2021/10/2808326942.png.
•	The Z Test: This test is used to decide whether the sample mean and the population mean are comparable when the population variance is known and the data are normally distributed or the sample contains at least 30 observations; it cannot be used when the distribution is not normal and fewer than 30 samples are available. The variability of the population must be taken into account when making the comparison.
•	The t-Test: This test is used when the data follow a Student's t distribution, the sample contains fewer than 30 observations, and the population variance is unknown; the sample mean is compared with the population mean. The null hypothesis is rejected when the t statistic exceeds the t critical value.
•	The F Test: A comparison of the variances (or standard deviations) of two separate sets of data helps determine whether the two sets are distinct. In the right-tailed version of the test, the F statistic is the ratio of the variances of the first and second populations; if the F test statistic is greater than the tabulated F value, the null hypothesis is rejected (Marohasy, 2003).
Confidence intervals may be used to estimate the parameters of a population. A 95% confidence interval means that if the sampling procedure were repeated 100 times with fresh samples under similar conditions, roughly 95 of the resulting intervals would contain the true parameter value (Bustos-Korts et al., 2016). Confidence intervals can also be used to determine the critical value against which a hypothesis is judged. Other inferential statistics tests include ANOVA, the Wilcoxon signed-rank test, the Mann-Whitney U test, and the Kruskal-Wallis H test.
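As a concrete illustration, a one-sample t-test can be sketched with SciPy; the measurements and the hypothesized population mean of 50 are invented for demonstration:

```python
from scipy import stats

# Hypothetical sample of 12 measurements; H0: population mean = 50
sample = [48.2, 51.1, 49.5, 52.3, 50.8, 47.9, 53.0, 49.2, 51.7, 48.8, 50.1, 52.6]

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Two-tailed critical value at the 5% level with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)

# Reject H0 when |t| exceeds the critical value (equivalently, when p < 0.05)
print(t_stat, p_value, t_crit, abs(t_stat) > t_crit)
```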
1.7.2. Regression Analysis

Regression analysis is a statistical method used in statistical modeling to assess the relationship between a dependent variable, often called the "response" or "outcome" variable, and one or more independent variables (often referred to as "predictors," "covariates," "explanatory variables," or "features"). Linear regression is the most popular form of regression analysis; it consists of selecting the line (or a more sophisticated linear combination) that most closely matches the data according to a preset mathematical criterion (Browning et al., 2015). The slope of the best-fitting line is obtained during this sort of regression analysis. Ordinary least squares, for example, finds the line (or hyperplane) that minimizes the sum of squared differences between the fitted values and the observed data (Figure 1.14).
Figure 1.14. Regression analysis. Source: https://cdn.wallstreetmojo.com/wp-content/uploads/2019/05/Regression-Analysis-Formula-2.jpg.
This allows the researcher to estimate the conditional expectation of the dependent variable, that is, its population mean value when the independent variables take on a specific set of values (see linear regression). Alternative location parameters, such as the conditional expectation over a wider variety of nonlinear models, can be obtained with related approaches such as quantile regression or necessary condition analysis, and nonparametric regression extends these ideas further. Regression analysis is performed in almost all contexts for a variety of reasons. When it comes to making predictions and forecasts, regression analysis and machine learning (ML) have a lot in common. Second, regression analysis may be used to find correlations between the independent and dependent variables: regressions can identify the degree to which a dependent variable and several independent variables in a particular dataset are connected (Chai et al., 2020). Based on these correlations, one may make predictions about the variable under consideration. A researcher's first step is to explain why previously observed correlations can be expected to carry over to a new setting, or why the association between two variables is strong enough to be called causal.
This step is required before regressions are used to make predictions or to infer the presence of causal links. Using observational data to estimate the causal linkages between events requires careful thought about how regression analysis can produce accurate statements about the connection between two variables. Regressions are classified into several types, including simple linear regression, multiple linear regression, nominal regression, logistic regression, and ordinal regression. In inferential statistics, linear regression is the most frequently used regression technique. Linear regression is used to determine how a change in the value of an independent variable affects the value of a dependent variable.
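A minimal ordinary least squares fit can be sketched with NumPy; the x and y values below are hypothetical observations, and np.polyfit returns the slope and intercept that minimize the sum of squared residuals:

```python
import numpy as np

# Hypothetical observations of an independent variable x and a dependent variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Ordinary least squares: choose slope and intercept minimizing the sum of squared residuals
slope, intercept = np.polyfit(x, y, deg=1)

predicted = slope * x + intercept
residual_ss = np.sum((y - predicted) ** 2)

print(slope, intercept, residual_ss)
```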
CHAPTER 2

DEVELOPMENT OF ENVIRONMENT STATISTICS
CONTENTS
2.1. Introduction....................................................................................... 24
2.2. Framework for the Development of Environment Statistics................. 25
2.3. State-of-Environment Statistics in Developing Member Countries...... 30
2.1. INTRODUCTION

The term "environmental statistics" refers to the application of statistical procedures to the study of the environment. This book covers the natural environment as well as human interactions with the environment and urban environments (Madavan and Balaraman, 2016). As environmental awareness among individuals, organizations, and government agencies has grown in recent decades, the collection of environmental statistics has expanded rapidly (Figure 2.1).
Figure 2.1. A book with the title environmental statistics. Source: https://media.wiley.com/product_data/coverImage300/19/04714897/0471489719.jpg.
Statisticians working in the field of environmental science take into account not only the biological and physical components of their surroundings but also the ways in which these components interact with one another. Because data on the economy, society, and the environment are so interconnected, it is not only difficult but also pointless to draw a sharp line between them. When describing processes and activities that directly impact or interact with the environment, environmental statistics therefore rely heavily on data from other fields, particularly the social and economic fields.
Policymakers are the primary audience for environmental data (and for statistics in general). Various forms of essential, readily available environmental data are made available to this target audience. To show the importance of environmental statistics, the basic forms (statistics, indicators, and indices) are examined, from the most fundamental to the most derived (Kobayashi and Ye, 2014). In practice, the forms are not mutually exclusive. Statistics on the state of the environment, as shaped by human activity and natural phenomena, reveal underlying causes, consequences, and possible responses. In practice, environmental statistics are a collection of data presented in a way that demonstrates natural or logical connections. Because of this expected connection and the selection of components, all presented environmental data have a framework, whether explicit or implicit. Under the UN Framework for the Development of Environment Statistics (FDES), national publications are expected to share certain qualities. The unique feature of environmental statistics is that they combine data from the environmental, social, and economic sectors into one comprehensive picture: the environment is a chemical and physical reality, shaped by human economic activity, that directly influences humans, plants, and animals alike.
2.2. FRAMEWORK FOR THE DEVELOPMENT OF ENVIRONMENT STATISTICS

Environmental statistics are inherently multidisciplinary. They are assembled from dispersed sources using a variety of strategies, so improved coordination and organizational structure are important requirements for success in this challenging area of statistics. To attain this objective, statistical frameworks and methodologies from the fields of social, demographic, and economic statistics have been used effectively. Making use of such a framework to obtain data about the environment saves time and money (Khormi and Kumar, 2011). To make socioeconomic and environmental programs and policies more accessible, the statistical framework gathers and combines data from numerous data-collection authorities, which makes the data more complete (Figure 2.2).
Figure 2.2. Framework for the development of environment statistics. Source: https://unstats.un.org/unsd/envstats/assets/img/sliders/envstats/FDES.jpg.
Various conceptual frameworks predominate in various countries. Four conceptual models are applied for environmental statistics (Figure 2.3):
•	Environmental media-based framework;
•	Resource accounting model;
•	Ecological model; and
•	Stress-response model.
Figure 2.3. An ecological model. Source: https://s3-us-west-2.amazonaws.com/courses-images/wp-content/uploads/sites/3814/2018/12/18200236/image14.png.
A preliminary examination of these frameworks shows that the media-based framework structures environmental issues according to air, water, land and soil, and the human-built environment. Rather than tracking environmental processes linearly, it focuses on the state of each environmental medium. The media-based approach is consistent with the conventions and classifications used in statistics, administration, and organization, as well as with the way the general public views the natural environment (Christensen and Himme, 2017). Its primary flaw is that it places greater emphasis on the natural environment than on the human aspect and does not take into account the connections among the many different components. In the resource accounting paradigm, natural resources are tracked from the moment they are harvested from their natural surroundings, through the various stages of processing and final use, and eventually back into the economic sector for recycling. This system is also compatible with the 1993 SNA. Although appealing, the method is difficult to implement in practice because it requires a high degree of collaboration across a large number of organizations in order to monitor resources over their entire life cycle (Köhl et al., 2000). The ecological approach to data collection and analysis, on the other hand, makes use of a broad variety of models, monitoring techniques, and ecological indicators. This category includes not only the diversity and dynamics of populations but also biomass production and the productivity, stability, and resilience of ecosystems. The stress-response model was developed because of the limits of the media-based approach in conveying environmental change. Within the stress-response paradigm, the interactions of humans with their environments (the stress) and the changes that result from those interactions (the environmental response) are investigated. The framework makes it easier to develop statistics that can be used for both preventative and corrective measures. This is accomplished by establishing causal relationships between activities and subsequent environmental impacts; these causal relationships are important for the protection of the environment and the mitigation of the negative environmental impacts of development. A link between cause and effect can thus be established between an activity and the environmental repercussions that follow from it (Kessler et al., 2015). A number of other systems have been built on top of this basic design. The most fundamental flaw in this strategy is that it oversimplifies the surrounding environment by disregarding the interactions among its components. Beyond the existing state of a particular environmental resource, the impact of an activity on that resource may also be influenced by the way that activity interacts with other aspects of the surrounding environment, and one environmental impact can set off a chain of further impacts (Figure 2.4).
Figure 2.4. Environmental statistics. Source: https://unstats.un.org/unsd/envstats/assets/img/sliders/envstats/Newsletter.jpg.
The FDES format was developed out of a debate on the goals of such a framework, on the scope and kind of environmental data, and on the framework's purposes and characteristics. The combination of these considerations led to the construction of a two-way table that links the fundamental components of the environment to a number of different sources of information. The way in which developing nations define the scope of environmental statistics determines how the environmental components are treated (Kandlikar et al., 2018). Living things, together with the air, water, and land or soil that make up the environmental media, constitute the natural environment. Human settlements are part of the man-made environment; they include the physical components of a city, such as its buildings and infrastructure, as well as the social and economic functions that these components provide. The information categories rest on the idea that environmental issues are caused by a combination of natural and man-made occurrences and processes. Relevant data are information pertaining to social and economic activities, natural phenomena, and their effects on the environment, as well as the responses of governments, non-governmental organizations (NGOs), businesses, and individuals to these consequences. Extra time and resources may be required to collect data for each environmental area, so work should begin with the data that are already available, used as effectively as possible. In many developing countries, the proliferation of environmental organizations that conduct research and surveys has led to a significant amount of duplicated work (John et al., 2021). These organizations collect a vast amount of environmental information through the many environmental activities for which they are responsible; ad hoc research and administrative records kept by the government, for instance, can offer a wealth of information on the environment. A graph or chart should always begin with some explanatory text so that misconceptions may be avoided, technical terms can be clarified, and the relationships within the data can be seen. The FDES of each participating nation compiled a list detailing the environmental and informational features of each category component, along with the units of measurement and levels of calculation for every variable, the data sources, the availability of the data, and the procedures used to acquire the data; it is accompanied by specific information on the environmental conditions of the country (Dutilleul et al., 2000). The most challenging problem currently confronting researchers is how to organize forthcoming research and surveys so that essential data about the environment can be gathered and maintained permanently. Every NSO is obligated to conduct research in order to ensure that environmental considerations are incorporated into all of its ongoing surveys.
2.3. STATE-OF-ENVIRONMENT STATISTICS IN DEVELOPING MEMBER COUNTRIES

Trustworthy environmental and socioeconomic data are essential for the interpretation and analysis of environmental information. When such data are missing, every environmental report is reduced to descriptive, anecdotal, and unsystematic observations, which are insufficient for logical decision-making. A wide range of information is required. Data on the quantities of natural resources and on environmental conditions are indispensable. Data on human activities that affect the surrounding environment, pollutant emissions, natural occurrences, and human responses to environmental change are all necessary to quantify ecosystem interactions (Iwai et al., 2005). Many organizations collect data about the environment and the economic climate in isolated silos, using various categories and methodologies and for very particular purposes. Environmental data may be gathered using a variety of approaches, including monitoring programs and remote-sensing imagery, while socioeconomic data are gathered through statistical surveys and administrative records. Data used to analyze environmental conditions at national, regional, and global levels often lack sufficient geographical and temporal resolution (Figure 2.5).
Figure 2.5. Development of environment statistics. Source: https://www.adb.org/sites/default/files/Publication/28328/files/cover-28328_0.png.
An approach that addresses the ecosystem as a whole is absent from the majority of the material that has been made available to date. For example, commercial forest area databases tend to emphasize the forestry management agency's production purpose. Thus, these databases fail to reflect the numerous benefits of forest ecozones, including their significance in protecting habitat and biodiversity, conserving water, and supporting traditional and alternative land use. Rather than actively seeking out additional data, should governments instead focus on improving the efficiency with which they use the data they already have? Some observers feel that developing countries suffer from a lack of information, while others believe that they have an excess of data. The biota of the earth, particularly its bacteria, is still poorly understood despite the daily delivery of huge volumes of raw data from observation satellites (Farrell et al., 2010). The databases we use to monitor social, economic, and demographic conditions and trends are significantly more comprehensive and interconnected than the data we gather on the environment. Even so, there are several key issues linked with sustainability for which we do not have adequate or trustworthy data. Inaccuracies and low-quality data make it difficult to analyze and report statistically on environmental conditions. Today's data must be able to support the generation of more comprehensive information, understanding, and knowledge.
CHAPTER 3

ENVIRONMENTAL DATA
CONTENTS
3.1. The Frameworks of the Data............................................................................. 36
3.2. Goals of Collecting Data About the Environment............................................. 38
3.3. Additional Information and Analysis Regarding Risk Indices............................. 40
3.4. Methods for Public Relations and Retail Sales that Are Adapted Precisely to the Environment.................................................... 41
3.5. Environment APIs!............................................................................................ 42
3.6. The Amounts of Humidity................................................................................ 43
3.7. The State of the Atmosphere Has a Role........................................................... 45
3.8. Parameters That are Used to Measure Biodiversity............................................ 48
3.9. Diversity of Species and Representation of Taxonomic Groups in the Data....... 50
3.10. Concerning Measurements, Accuracy, and Possible Bias................................. 50
3.11. The Benefits and Drawbacks of “Averaged” Indexes....................................... 52
3.12. Considering the Numerous Error Causes........................................................ 53
3.13. Analysis of Biodiversity Data.......................................................................... 54
3.14. Societal and Occupational Health Information............................................... 55
3.15. Excellent Air Quality...................................................................................... 56
3.16. Examine the Locations of Monitors Using an Interactive Map......................... 56
3.17. Data Visualization.......................................................................................... 57
3.18. Some Fundamental Air Quality Concepts....................................................... 58
3.19. Particulate Matter Data................................................................................... 59
3.20. Ozone Depletion Trends................................................................................ 59
3.21. Sources That Give Information on the Environment........................................ 60
3.22. How Should Data About the Environment Be Evaluated?............................... 61
3.23. What is the Cost of Environmental Data on Average?...................................... 61
3.24. What Questions Should You Ask Environmental Data Providers?.................... 62
Environmental data is information that pertains to environmental conditions and pressures, as well as the effects that these factors have on the ecosystem. In practice it is often organized using the DPSIR model, where D refers to drivers, P to pressures, S to state, I to impact, and R to response. It is a compilation of quantitatively, qualitatively, and geographically related information about the current state of the environment and the trajectory of its change (Homma et al., 2020). Examples of environmental data include the weather, the quality of the air, and pollen. This information is frequently collected and used by a great number of non-governmental groups, environmental ministries, statistical agencies, and other organizations. Innovative approaches to data collection, distribution, and analysis create a significant opportunity to secure data that is accurate and free of errors. Data about the environment can be collected remotely using technologies such as satellites and sensors, which reduces the need for manual data entry (Figure 3.1).
Figure 3.1. Environmental data representation. Source: https://media.istockphoto.com/vectors/innovative-green-technologiessmart-systems-and-recycling-vector-id1164502175?k=20&m=1164502175&s=612x612&w=0&h=N9BjYxw2vzlfZQySNRQk-zY-6eWeg8IaUTOf6j_T_30=.
In the field of environmental research, many institutions regularly collect, store, process, and evaluate data. It is rare for a single person to have a comprehensive understanding of all of the complexities involved in data gathering and processing, including the mathematical methods and algorithms for statistical evaluation and scientifically sound presentation of results. Information about the environment is quite complicated. Data obtained from a wide variety of sources and analyzed using a number of approaches are examined together, and the links between observed values exhibit random fluctuation rather than established laws or principles. In addition, environmental problems have temporal and spatial components (Fourcade et al., 2018), and time-dependent measurements cannot be repeated. Disruptions in the measuring equipment may lead to lost data and gaps in a series. During the exploration phase, measurements gathered at intermediate intervals might be beneficial. Consequently, statistical analysis in this field involves a steep learning curve. It is also essential to take a variety of time frames into account: some measurements are taken every half hour, while others are made no more than once or twice a month. A further issue for environmental statistics is the massive amount of data involved. Although today's computer technology can handle compatibility, standards, and the harmonization of data, these issues continue to be obstacles (Fuentes et al., 2007). Data storage is typically associated with coding, and because codes are occasionally supplied incorrectly, it can be difficult to combine data from a variety of sources. If you have not personally taken the measurements, you risk being unable to understand how they were carried out; there may be a lack of awareness of what is being measured and of the reliability and quality of the data. In environmental statistics, closed information systems, in which every stage of data gathering is directed by a single idea, are quite uncommon (Figure 3.2).
Figure 3.2. Young Asian woman looking at see through global and environmental data whilst seated in a dark office. The data is projected on a see through (see-thru) display. Source: https://media.istockphoto.com/photos/see-through-screen-pictureid1170687091?s=612x612.
3.1. THE FRAMEWORKS OF THE DATA

Data pertaining to the environment are frequently arranged in the form of a matrix. In this format, each row stands for a particular item, while each column holds the measurements of one variable. Column types include logical values, categories, integers or reals, time information, and codes for the measurement site (Higuchi and Inoue, 2019). During coding, corrections must be made for data that have been suppressed or are missing, and each column may contain a distinct type of entry (Figure 3.3).
Figure 3.3. Trees regulate carbon dioxide levels (a greenhouse gas) in the atmosphere. Source: https://media.istockphoto.com/photos/environmental-technology-concept-sustainable-development-goals-sdgs-picture-id1297158594?s=612x612.
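A minimal sketch of such a data matrix using pandas; the site codes, timestamps, and measurements are invented, and one value is deliberately missing:

```python
import pandas as pd
import numpy as np

# Hypothetical environmental data matrix: one row per observation,
# one column per variable, with mixed column types and a missing value
records = pd.DataFrame({
    "site_code": ["ST01", "ST02", "ST01", "ST03"],          # measurement-site coding
    "timestamp": pd.to_datetime([
        "2022-06-01 08:00", "2022-06-01 08:00",
        "2022-06-01 08:30", "2022-06-01 08:00"]),            # time information
    "land_use": pd.Categorical(["urban", "rural", "urban", "forest"]),  # category
    "no2_ugm3": [41.2, 12.5, np.nan, 8.7],                   # real values, one missing
    "exceeds_limit": [True, False, False, False],            # logical value
})

print(records.dtypes)
print(records["no2_ugm3"].isna().sum())    # count of missing measurements
```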
Gathering information for the vast majority of environmental statistics requires precise measuring tools and a limited scale range. Both of these factors, and their relative roles, must be considered in the data analysis. Inadvertent malfunctions of measurement instruments and interruptions in transmission lines can produce inaccurate or missing results. If there is sufficient variation in the data being compared, outliers can be identified through the application of statistical techniques (Han and Li, 2020). Other factors may also require values to be adjusted. If the interests of the people whose data are being handled are at stake, the possibility of data manipulation cannot be ruled out: it is easy to delete or change measurements that are inconvenient. The date of the measurements and the places at which they were taken can both affect the final conclusion. When examining data related to the environment, threshold levels are quite useful. It is up to the user to decide how short-term exceedances and long-term averages should be combined to express particular conclusions. It is not unusual for researchers to learn, while a study is still being conducted, that some of the results do not apply to the subject being investigated; results from other studies are then required before any inferences can be drawn. Testing at an early stage can help identify which pieces of information are most important. Some feel that environmental statistics should be exempt from the purview of data protection policy, while others argue for the privacy of environmental data (Frankenhuis et al., 2019). In any case, one is obligated to comply with all economic and safety standards, and when research touches the medical field, patients' rights must be respected. The time period chosen is also very important; in many situations one must simply make the most of the circumstances and the resources at hand. Intervals may be shaped by the social norms of a society or by technological concerns, and in some cases it only becomes obvious after a considerable amount of time which scale would have been most suitable. A basic survey of the environment may be carried out in a matter of minutes. The difficulty is that monitoring the environment requires rapid alerts and notifications in order to protect the ecosystem from harm. Statisticians must take this into consideration (Girshick et al., 2011); alerts are issued automatically, with human intervention reserved for the most serious circumstances (Figure 3.4).
Figure 3.4. Environmental technology concept – sustainable development goals (SDGs). Source: https://edinburghsensors.com/news-and-events/impact-of-technology-on-the-environment-and-environmental-technology/.
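As a small illustration of working with threshold levels, the sketch below counts brief exceedances of an illustrative limit in an invented hourly series and also computes the daily averages:

```python
import numpy as np

rng = np.random.default_rng(0)
hourly = rng.uniform(20, 80, size=48)            # hypothetical hourly concentrations for two days

threshold = 65.0                                  # illustrative limit value
exceedances = int((hourly > threshold).sum())     # count of brief (hourly) exceedances

daily_means = hourly.reshape(2, 24).mean(axis=1)  # long-term view: average per day
print(exceedances, daily_means)
```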
3.2. GOALS OF COLLECTING DATA ABOUT THE ENVIRONMENT

Scientists are always on the lookout for high-quality observational data sets that can be linked to simulations in order to improve their understanding of how ecosystems evolve over time and their ability to anticipate that evolution. To better understand earth processes, they gather, compile, and examine environmental data; carry out field experiments; parameterize models; and communicate information to both broad audiences and specialized users (Gotelli et al., 2012). Such data make it easier to develop infrastructure, environmental, and energy-use strategies. The general public relies heavily on trustworthy environmental statistics for knowledge of the state of the environment, and such information helps people cut down on the amount of carbon dioxide they release into the atmosphere.
Decisions are formulated and significant changes to environmental policy are carried out on the basis of analyses of environmental data. Thanks to advances in technology, policymakers can now close the data gaps that have hampered them for decades, which enables them to take preventative steps before problems arise. Businesses in every region of the world are actively working to reduce their negative impact on the environment; they gather information on environmental performance not only for regulatory reasons but also for branding, so the data serve both goals (Giraldo et al., 2010). The application of Big Data and AI has increased the reliability of environmental forecasts. Raw data are a good starting point for finding out what is going on in the world, but because end users need to organize and modify their behavior on the basis of these data, the data must be translated into actionable insights (Figure 3.5).
Figure 3.5. Hands of farmer with tablet with infographics on the screen. Source: https://media.istockphoto.com/photos/tablet-with-infographics-on-thescreen-picture-id476777720?s=612x612.
Thanks to advancements in environmental monitoring and forecasting techniques, we are now able to collect and store enormous amounts of environmental data in the form of astronomically large datasets. We can then computationally analyze these datasets to discover patterns, trends, and connections that were not previously known:
Modeling Climate Change

Data scientists apply Big Data and AI tools within a system-of-systems methodology to analyze and forecast future climate change. In climate computing, several fields of study and different lines of research are brought together and organized before being merged into a more complete picture (Girshick et al., 2011). Reconstruction allows scientists to fill gaps in historical environmental datasets, gaps that are often caused by missing observations. Researchers in Europe are using ML on vast quantities of environmental data from a variety of sources to make highly accurate weather forecasts. By accelerating the processing of combined data, it is now feasible to provide weather forecasts that are both hyper-local and hyper-accurate.
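As a toy illustration of gap reconstruction (not the method used by any particular research group), missing values in a series can be filled by simple linear interpolation:

```python
import pandas as pd
import numpy as np

# Hypothetical temperature series with missing observations (NaN gaps)
observed = pd.Series([14.2, 14.8, np.nan, np.nan, 16.1, 16.4, np.nan, 15.2])

# Simple reconstruction: fill gaps by linear interpolation between neighboring values
reconstructed = observed.interpolate(method="linear")

print(reconstructed)
```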
3.2.1. Innovative Financial Structures and Environmental Perspectives in the Health Industry

Through the analysis and reporting of one's surrounding environment, it is now feasible to attain "personalized health." The health repercussions of being exposed to the environment are gaining more attention as a result of recent events such as a worldwide epidemic and harsh weather. There are projections that the global wellness business will experience annual growth rates of up to 10% (Gallyamov et al., 2007). On the basis of findings from health-oriented environmental research, companies of varying sizes are proposing novel and inventive use cases.
3.3. ADDITIONAL INFORMATION AND ANALYSIS REGARDING RISK INDICES

Environmental intelligence (EI) is utilized by insurance firms to gain a deeper understanding of an individual's risk profile, as well as to verify and improve the accuracy of claims. Environmental data can also help insurance companies warn policyholders of imminent calamities, maintain asset value, and avoid larger claims for losses that could have been prevented. Improvements in weather forecasting services can be attributed in part to user input received via mobile applications. By making available a greater variety of insightful environmental data, producers of weather forecasts can increase their credibility as environmental information sources (Antweiler and Taylor, 2008). Because users are more likely to check their applications often, it is simpler for service providers to monetize usage and advertise to users. Smart air purifiers and HVAC (heating, ventilation, and air conditioning) systems use environmental data that is both real-time and predictive to improve performance, maintenance, and remote management. Apps related to recreation, fitness, and health: as long as conditions are agreeable, a variety of consumer applications can promote outdoor activities, for example by informing users of the best times to exercise or to take their children out for a stroll. Connected devices and patient apps allow for more efficient clinical research and healthcare delivery. Medication usage and symptom reports, in conjunction with environmental data, are used by specialists in the treatment industry to deliver individualized predictive insights. Patients can also be told to take their medicine before going outdoors, or to remain indoors, in order to prevent their symptoms from worsening because of exposure to the outside environment (Zio et al., 2004). Using a customized value-based medicine strategy, clinicians can remotely monitor patient symptoms and environmental threats and, if necessary, change treatment regimens to meet the patient's needs. Patients and healthcare professionals can both reduce the number of unnecessary hospitalizations and in-person medical visits, resulting in cost savings for both parties. In the context of clinical research, environmental triggers can be related to patient symptoms and to other personal and lifestyle factors in order to gain a deeper understanding of how effective a treatment is and how well it is adhered to.
3.4. METHODS FOR PUBLIC RELATIONS AND RETAIL SALES THAT ARE ADAPTED PRECISELY TO THE ENVIRONMENT

Ad agencies have the ability to communicate the appropriate message to the appropriate viewers at the appropriate moment based on real-time and forecasted changes in the surrounding environment, which significantly increases ROI and campaign engagement.
By including data about the surrounding environment in their sales presentation dashboards, pharmaceutical companies may be better able to convey to physicians and other healthcare professionals the importance and value of their products.

Enhanced Capacity Planning of Routes and Safety Measures for Vehicles

Reducing exposure to air pollution: users of safe route-planning software can be shown, in real time, the potential health hazards associated with the various route options (Ying and Sheng-Cai, 2012). Such tools offer lower-pollution choices and compare estimated travel times in order to persuade consumers to use more environmentally friendly routes. Keeping the air inside the vehicle cabin healthy should also be a primary focus. Dashboard interfaces that are compatible with mobile applications enable drivers to receive real-time updates on air quality and weather conditions while they are on the road, and environmental data maps let drivers design routes through some of the most attractive parts of the city.
3.5. ENVIRONMENT APIS!

Environmental application programming interfaces (APIs) provide businesses and other organizations a convenient method for obtaining direct access to huge amounts of environmental data without requiring them to construct their very own monitoring infrastructure from the ground up; a minimal sketch of such a call follows the list below.
•	Interfaces for Software Applications Dealing with Air Pollution: APIs related to air pollution make it feasible to map real-time and forecast outdoor air pollution and to notify users about it (Xia et al., 2017).
•	APIs for Gaining Access to Various Forms of Weather Information: Weather APIs help people organize their daily and weekly routines and stay on top of severe weather and extreme climate events so that they are better prepared for them.
•	Wildfire Application Programming Interfaces (APIs): Wildfire APIs report wildfires in real time based on a user's location, providing information ranging from warnings about smoke-related air quality to suggestions for evacuating the area.
•	Pollen APIs: APIs for pollen data can give individualized daily pollen forecasts, medication reminders, and actionable health and lifestyle information in order to make allergy treatment more effective (Wieland et al., 2010).
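As a minimal illustration of how such an API might be called, the sketch below sends a single HTTP request with the requests library. The endpoint URL, parameter names, and API key are entirely hypothetical placeholders, not any real provider's interface:

```python
import requests

# Hypothetical endpoint and parameters; replace with a real provider's documented API
BASE_URL = "https://api.example.com/v1/air-quality"
params = {
    "lat": 51.5074,           # latitude of the location of interest
    "lon": -0.1278,           # longitude
    "apikey": "YOUR_API_KEY"  # placeholder credential
}

response = requests.get(BASE_URL, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors

data = response.json()        # most environmental APIs return JSON
print(data)
```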
3.6. THE AMOUNTS OF HUMIDITY

Access to forensic meteorological data is usually granted for a fee. Important data sources in the United States include the National Climatic Data Center of the National Oceanic and Atmospheric Administration (NOAA), regional climate centers, and state climatologists. The official records kept by the meteorological services of various nations can be compared. As part of their specialized monitoring systems, a wide variety of local government agencies, non-profit organizations, and private enterprises, in addition to the United States federal government, gather and analyze weather data and run weather observation stations (Wagner and Fortin, 2005). Forensic meteorologists, in turn, are trained to locate and collect pertinent data and to organize it into a logical sequence for the purpose of analyzing meteorological events. In the past few years, there has been a significant shift in the technologies used for collecting, storing, and retrieving meteorological data, and the field of forensic meteorology has changed with them. Previously, meteorological data were gathered and recorded quite differently: radar pictures were captured on film, and surface observations were recorded either by hand or on instrument recorder charts. Both the analytical tools and the volume of the data were frequently lacking. Even so, it is standard procedure for investigators to submit a request to the government agency that originally provided the information, asking for physical copies of the data and images along with a certificate stating that they are authentic.
Even in modern times, it is common practice to use both a physical seal and a certification of the data copy to validate it before it is admitted into a legal proceeding (Toivonen et al., 2001). When hard-copy data and photographs had to be produced and distributed, the data source often experienced delays ranging from a few weeks to several months. Data analysis by forensic meteorologists required a significant amount of time because the information had to be extracted from certified paper forms, especially when data spanning many years had to be analyzed to ascertain normal climatological conditions. Paper, microfiche, and film are quickly becoming obsolete as media for gathering, storing, and transmitting data. More complex computer models of the atmosphere and ocean, digitalized surface weather sensors, computer networks, and the Internet are revolutionizing the work of forensic meteorologists. End users have access, either for free or at a reduced fee, to a wide variety of public and private data sets hosted on the internet. Data collected by satellites and radar can be retrieved from computer tapes stored in government archives and then sent over the Internet to end users, and multiple data vendors offer huge data sets for sale that may be downloaded or obtained on CD-ROM or DVD. As a direct consequence, forensic meteorologists can now access and scrutinize enormous amounts of data on their own computers (Sivak and Thomson, 2014). In the great majority of cases, the wait for data transmission has been reduced from weeks or months to a few minutes or hours. Although the time required to acquire and analyze each unit of data has fallen sharply, the amount and variety of data available for forensic work has dramatically increased, so the labor requirements of a typical investigation have not decreased. Even so, specialists can now deliver preliminary research and assessments to their customers more quickly than before because of the increased availability of data (Figure 3.6).
Figure 3.6. Infographics template. Set of graphic design elements, histogram, arc, and Venn diagram, timeline, radial bar, pie charts, area, and line graph. Vector choropleth world map. Source: https://media.istockphoto.com/vectors/infographics-template-set-ofgraphic-design-elements-vector-id519623338?s=612x612.
The certification of physical data can be challenging at times and in other cases may even be impossible. If the expert witness can establish that the data were obtained from reputable sources and are regularly used by meteorologists, the data can often be admitted as trial evidence, because reliance on trustworthy sources and common professional usage are what make such data credible.
3.7. THE STATE OF THE ATMOSPHERE HAS A ROLE

Because current meteorological data for Pantepui are lacking, the vast bulk of the findings have been based on the elevation range of the tepui summits. At elevations between 1,500 and 2,400 meters, a mesothermic ombrophilous climate may be anticipated, with annual average temperatures between 12°C and 18°C, total annual precipitation (TAP) of 2,000–3,500 mm, and less than one dry month.
According to Huber's scheme, this should be the case. On the higher slopes, conditions would be submicrothermal ombrophilous (annual average temperature, AAT, between 8°C and 12°C), with rainfall patterns similar to those of the mesothermal environment (Adelman, 2003). Data to support these statements exist for only three tepui summits: Auyantepui, Guaiquinima, and Kukenán. At each of these peaks, which span a range of elevations, the crucial meteorological parameters were monitored for at least 10 years. As altitude increased, the AAT decreased from 16.5°C to 11.4°C. Given a temperature gradient of only 0.6°C per 100 meters, the altitudinal range available to the life zones of the Guiana region is practically limited. The TAP increased from 2,800 to 5,300 mm over the same elevations, which is likewise consistent with ombrophilous climates. For these reasons, the mesothermic/submicrothermic ombrophilous climatic type may be the most accurate classification of the tepui summits (Simoncelli and Olshausen, 2001); it is supported by the most recent data and may be regarded as the best alternative in the absence of more complete meteorological study. The collection of meteorological data in real time is of critical importance for ensuring the safety of both people and property. Many nations all over the world, frequently in close proximity to one another, maintain observers or arrays of automated sensors tasked with collecting meteorological data. Regrettably, there are far fewer means for monitoring the oceans than there used to be. It can be challenging to obtain reliable weather information at sea: when it is most important for observers aboard ships to gather data, they are typically preoccupied with other duties and unable to do so, and clouds can deceive the sensors of satellites. Whether anchored to the seafloor or allowed to float freely, buoys remain the tool of choice for the most effective data collection at sea. We are going to discuss a range of buoy types and the data that they collect, in addition to some of the benefits and drawbacks associated with each kind (Seid et al., 2014). Because the author specializes in this field, the focus will be on the systems used by NOAA, the National Weather Service (NWS), and the US National Data Buoy Center.
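The figures quoted above invite a simple arithmetic check: with a gradient of about 0.6°C per 100 m, the drop from 16.5°C to 11.4°C corresponds to roughly 850 m of additional elevation. The 400 m elevation difference in the second line of the sketch is purely illustrative, not an actual station elevation:

```python
# Approximate lapse-rate arithmetic based on the gradient quoted in the text
gradient_per_100m = 0.6          # °C decrease per 100 m of elevation gain

aat_low, aat_high = 16.5, 11.4   # annual average temperatures at the lower and upper sites
implied_elevation_gain = (aat_low - aat_high) / gradient_per_100m * 100
print(implied_elevation_gain)    # about 850 m

# Expected AAT at a hypothetical site 400 m above the lower station
print(aat_low - gradient_per_100m * 400 / 100)   # 16.5 - 2.4 = 14.1 °C
```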
The CAST (computer-assisted statistics) model cannot function without climate data, monthly plant input and soil cover measurements, clay content measurements, and sample depth measurements. CAST is able to model the consequences of climate change since it makes use of the same plant input and meteorological data that Rothamsted carbon model (RothC) does. Not only does the CAST model replicate the buildup of carbon through time, but it also models change in the structural makeup of the soil. Therefore, the amount of silt and clay present in the soil is an essential extra input parameter that must be taken into consideration. In order to provide a more accurate approximation of soil texture for the model input, the soil texture of the Damma forefield was simulated by utilizing the average of the physical attributes of each age group’s silt and clay content. This is because silt and clay content rise with soil age (Rovetta and Castaldo, 2020). During the process of setting up the model, the carbon content of the various carbon pools was adjusted such that it would be comparable to the carbon content of the most recently tilled soil. In our opinion, there are no potential candidates in the pool provided by the IOM. The amount of unique species that may be grouped together into a certain category is referred to as their abundance. The average number of years a species lives, in addition to the rate at which its population grows. Since the need of quality evaluation in commercial settings has been acknowledged for a significant amount of time, a whole certification industry has developed around it. This novel information is beginning to be useful in a variety of other domains as well, such as environmental studies. The richness of a river’s aquatic invertebrates may be used as an indicator of the river’s hydro biological health, and the biotic index can be used to determine the river’s overall level of health. In France, the AFNOR NF T90-350 index is the one that is utilized (Austin et al., 2005). On the other hand, the requirement for consistency is not nearly as great in the field of ecology as it is in other areas. There are many different causes for this to explain it. There is a possibility that ecologists will be hesitant to move to a “quantitative” and “coded” technique of study due to the fact that they experience more freedom and confidence in “naturalist” work. Taking precise measurements is a difficult and time-consuming operation. In contrast to numbers that can be measured, such as physical or chemical quantities, biodiversity cannot be quantified. A diverse set of elements, some of which are outside of our sphere of influence, might give
rise to unpredictability. Due to the complexity of the situation, reaching a consensus on the most effective course of action can be challenging. There are a few different ways to talk about the variety of life on this planet, the most common of which are genetic diversity, specific diversity, and ecological diversity. In spite of the fact that the most of the concerns outlined below are just as applicable to the other two levels, the emphasis of this article will be placed on diversity in particular. The monitoring of direct biodiversity, on the other hand, will take precedence over the monitoring of indirect biodiversity. Indirect biodiversity monitoring focuses on quantifying resources or potential habitats, such as the amount of dead wood, rather than biodiversity itself. Direct biodiversity monitoring, on the other hand, monitors biodiversity directly.
3.8. PARAMETERS THAT ARE USED TO MEASURE BIODIVERSITY When France’s first biodiversity laws were enacted in the 1960s and 1970s, serious efforts to preserve the country’s flora and wildlife got underway for the first time. When coming to their findings, naturologists have historically relied on a mix of observations and hypotheses, which leaves room for some author bias. The end result of all of these efforts was the production of atlases, either regional or national, that depict the distribution of flora and animals. The amount of work put in to collecting samples, the degree to which habitats were represented, the ease with which the experiment could be repeated, and the statistical power it possessed were all inadequate. Although it is a relatively new example of a program for monitoring biodiversity, the STOC-ESP program was built from the bottom up with the intention of including all of the characteristics that have been discussed in the preceding paragraphs (Pleil et al., 2014). It is sad that the scientific community does not agree on how to proceed, and there are practically just as many sampling processes as there are monitoring programs that may be used. The UNECE’s (United Nations Economic Commission for Europe) establishment of the ICP-Forest network in 1985 was the most notable example. Researchers from all around the globe have gathered together in order to further their understanding of the impacts that international pollution has on forest ecosystems. The network may be broken down into two distinct parts. In the first phase, physical, and chemical testing is carried out at 800 different locations in line with the protocols that have been set. In 1995, monitoring of flora was initiated at all
locations. When conducting studies on flora, the standard plot size ranged from 4,000 to 5,500 square meters, and this was often even the case within the same nation! The number of plant species on the plot rises in proportion to its growing area (Browning et al., 2015). Because of this, it was never feasible to conduct an in-depth study of the data. It took the network 10 years to come to an agreement on a standardized methodology and plot size in order to be able to analyze the evolution of flora across time. A more standardized approach to monitoring of biodiversity might potentially act as an external baseline for every research area. You can receive similar data and answers to questions that you can’t get from isolated sources if you pay the expenditure of combining databases and conducting research. This is possible if you combine the two activities. Through the utilization of common datasets and the development of national monitoring systems, such Vigie-Nature, many of the initiatives that are financed by the Natural History Museum in Paris contributed to the standardization of procedure. Standardization of procedures is crucial for the reasons that have been discussed above; nevertheless, it is also required to conduct an in-depth analysis of the quality of the measurements. Utilizing a standard process that consistently produces subpar outcomes is a waste of time (Figure 3.7).
Figure 3.7. A silhouette of a man stands in the background of large office windows and views a hologram of corporate infographic with work data. Source: https://media.istockphoto.com/photos/silhouette-of-the-man-in-theoffice-and-corporate-infographic-picture-id926799886?s=612x612.
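The point made above, that the number of plant species on a plot rises with its area and therefore makes 4,000 and 5,500 square meter plots hard to compare, can be illustrated with the classic power-law species-area relation S = cA^z. The text does not name this relation, and the constants c and z in the sketch below are illustrative assumptions.

```python
# Sketch: classic power-law species-area relation S = c * A**z.
# The functional form and parameter values are illustrative assumptions,
# not taken from the book; they only show why unequal plot sizes are
# not directly comparable.
c, z = 5.0, 0.25   # hypothetical constants

def expected_richness(area_m2: float) -> float:
    return c * area_m2 ** z

for area in (400, 4000, 5500):   # includes the 4,000-5,500 m2 plot sizes mentioned
    print(f"{area} m2 -> ~{expected_richness(area):.0f} species")
```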
3.9. DIVERSITY OF SPECIES AND REPRESENTATION OF TAXONOMIC GROUPS IN THE DATA The phrase “biodiversity” refers to a broad variety of different subject areas. When making decisions, one just takes into account a very small percentage of the whole. When looking at the genetic variety that exists within a species or community, researchers often only look at a very small section of a genome. Because there is such a broad range of species, evaluations of biodiversity are typically restricted to a particular order, family, or genus, or to an ecological category, such as creatures that are found in rotting wood. This is because there are so many different species (Piegorsch and Edwards, 2002). If the decisions taken had no impact on the findings of the research, then those decisions would be meaningless. Responses from different taxonomic classifications do not always agree with one another. As forest stands age, the number of vulnerable plant species decreases, but the number of fungi, insects, and vertebrates that live there increases. This is the case with saproxylic creatures as well as plants with vascular systems. Even among members of the same taxonomic group, reactions are not always the same. It is not always the case that the behaviors of a taxonomic group reflect the entirety of the biodiversity. Nevertheless, representativeness of samples is a difficulty that is independent of the taxonomy being applied. In most cases, just a tiny proportion of the members of the target family are subjected to in-depth sampling. Listening stations, for instance, don’t sample enough ducks and birds of prey when it comes to the sounds they record. You won’t be able to catch any flying insects with a glass trap since they have to be able to reach the top of the glass. There are a number of things that might have an effect on the outcome, including completion, identity, and the reality of the world.
3.10. CONCERNING MEASUREMENTS, ACCURACY, AND POSSIBLE BIAS It is necessary to take measurements that are objective and accurate. This is accurate to a certain extent; the results of any measurement will include errors. As a consequence of this, the analysis of data is affected in a variety of unique ways. It is more difficult to depict changes in biodiversity levels between habitats, management choices, or through time in a research
study without delving into unneeded detail when measuring imprecision is included since it reduces the statistical power of the data (Pleil et al., 2014). The greatest danger for those in charge of making decisions is the possibility that they may fail to recognize an issue in time, which would result in a delay in taking remedial action. The utilization of biased measurements has the potential to either disguise or exaggerate variations in biodiversity. As a consequence of this, it’s likely that remedial actions that aren’t essential will be carried out. Listening stations are commonly utilized in research on bird populations, particularly popular bird species. On the other hand, due to reverberation, the sounds of birds tend to travel shorter distances in close quarters. It is also possible for an ornithologist to overstate the number of birds in a limited setting as opposed to an open one, which might lead to the incorrect conclusion that distinctions exist.
3.10.1. The Many Benefits and Potential Drawbacks of Having a Large Number of Species in an Area
The number of different kinds of organisms found in a place is a notion referred to as "species richness," and it may be used as a measurement tool for biodiversity. All of the studies that looked at this measurement, which is praised for its ease of use, revealed that surveys of biodiversity miss a significant number of species. A botanical survey will miss around one in every five plants on average, and surveys of birds carried out with listening stations behave in a similar fashion. The most concerning aspect is not so much that the surveys are incomplete, but that the degree of incompleteness varies from one survey to another (Bustos-Korts et al., 2016). Simulations may be used to determine how safe it is to conclude that one habitat or treatment is more diverse than another when the chance of detection varies between them; the risk is still present even if the likelihood of detection shifts by only a few percentage points. There is a wealth of evidence that the capacity of biodiversity studies to detect individual species varies considerably, and this variation cannot be attributed simply to taxonomic category. These disparities might be the result of a wide range of different reasons. For flora, the outcome of a survey depends on the quantity of vegetation, the degree of expertise held by the botanist, the number of
people who take part in the survey, the length of time spent, the amount of exhaustion felt by the team, and the overall amount of experience held by the group. Both the climate and the time of year play a role in determining the ultimate outcomes. Mistakes in identification can occasionally make detection errors appear much worse; nonetheless, these errors have a tendency to become less frequent as the botanist gains experience (Chai et al., 2020). It can be challenging to evaluate the influence that workers have had over a protracted period of time because, as workers gain expertise, they are gradually replaced by new employees. This makes it more difficult to observe the impact that workers have had. The presence of singing birds and flying insects is another element that makes it difficult to determine which families are present. This is especially the case when there are unfavorable weather conditions present. When it is cloudy and cold outside, birds are less likely to sing, bats are less likely to go hunting, and insects are less likely to fly. In the case of insects, additional potential sources of mistake include the kind of trap, its height and exposure to the outdoors, the lures that are employed, and, finally, the employees’ level of experience when it comes to setting the traps. An expert entomologist has a greater probability of discovering the preferred migration patterns of entomofauna and, as a result, collecting more insects than a less experienced entomologist does.
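A minimal simulation, in the spirit of the detection issues discussed above, shows how imperfect and unequal detection can reverse apparent richness rankings between two sites. The detection probability of 0.8 echoes the "one in five plants missed" figure quoted earlier; the community sizes and the second probability are hypothetical.

```python
import numpy as np

# Sketch: how imperfect detection biases observed species richness.
# p = 0.8 mirrors the "one in five plants missed" figure in the text;
# the community sizes and the 0.95 comparison value are assumptions.
rng = np.random.default_rng(42)

def observed_richness(true_richness: int, p_detect: float, n_sim: int = 10_000) -> float:
    """Mean number of species detected when each is seen independently with p_detect."""
    detections = rng.random((n_sim, true_richness)) < p_detect
    return detections.sum(axis=1).mean()

print(observed_richness(100, 0.80))   # site A: ~80 species recorded out of 100
print(observed_richness(90, 0.95))    # site B: ~85 recorded out of 90, so it looks richer
```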
3.11. THE BENEFITS AND DRAWBACKS OF “AVERAGED” INDEXES Due to the sensitivity of species-richness assessments to a variety of factors, some authors recommend “averaged” indices that include all species found. In theory, all that is required to calculate the averaged index without bias is a sample of species that is representative of the community. One of these factors is the average degree of specialization of bird communities. A community dominated by specialist species is preferable than one dominated by generalist species, according to the underlying concept. At the national level, each common bird species was awarded a specialization index, enabling scientists to place species along a gradient ranging from specialist species to generalist species that can withstand a wider variety of conditions (Christensen and Himme, 2017). To determine the average degree of specialization of a community, the index mean for all identified species at a particular site must be calculated. Nonetheless, because the probability of detection fluctuates with species specialization, this method is susceptible
to the critique leveled against species diversity. A specialized species, for instance, may be detected less frequently than a generalist one. Comparable average indices exist for invertebrates and plants. However, they offer more information on habitat quality than biodiversity. These averaged assessments cannot replace absolute metrics like species diversity. Consider the two choices listed below. In the first community, more generalist than specialized species were introduced. The community lost a comparable proportion of specialized and generalist species in the second period. The community’s specialization index declined in the first scenario, but stayed steady in the second. On the basis of this one indication, it is possible to assume that the first community’s developments are more worrisome than those of the second, despite the fact that the opposite is manifestly true. Due to the fact that averaged indices are not always devoid of sample bias, they should be used in conjunction with absolute indices rather than in substitute of them.
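A minimal sketch of the averaging step described above: the community specialization index of a site is simply the mean of the per-species specialization indices of the species recorded there. The species names and index values used here are hypothetical.

```python
# Sketch: community specialization index as the mean of per-species indices
# over the species recorded at a site, as described in the text.
# Species names and index values are hypothetical.
specialization_index = {
    "skylark": 0.9, "yellowhammer": 0.8, "blackbird": 0.2,
    "wood pigeon": 0.1, "great tit": 0.3,
}

def community_specialization(recorded_species):
    """Mean specialization index over the species detected at one site."""
    values = [specialization_index[s] for s in recorded_species]
    return sum(values) / len(values)

print(community_specialization(["skylark", "yellowhammer", "blackbird"]))
```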
3.12. CONSIDERING THE NUMEROUS ERROR CAUSES There are ways to adjust data to account for the inaccuracy of surveys. In the 1930s, these techniques were originally utilized for bird ringing. Ringing is the activity of collecting, identifying, and recapturing individuals at varying intervals for the purpose of analyzing biological features such as life expectancy, site attachment, and population size. A bird’s failure to be captured on a particular occasion does not always indicate that it is dead; it may have simply evaded capture. A number of statistical methods were created to differentiate between the probability of survival and the probability of discovery. To utilize these technologies, further local surveys must be undertaken. End of the 1990s was the first time these approaches were utilized in community ecology research by comparing individuals and species, calculating species and other characteristics such as local extinction and colonization rates rather than the number of individuals (Parise et al., 2014). Returning to the same location many times, setting a greater number of traps, or summoning multiple naturalists at once can all enhance the likelihood of capture. If the average probability of identifying species during a visit is sufficiently high, these strategies are beneficial. However, they are especially susceptible to determination mistakes, which can be difficult to uncover during data collection, and differences in detection likelihood between people and species.
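One simple way to see why returning to the same location many times enhances the likelihood of capture is the cumulative detection probability 1 − (1 − p)^k over k independent visits. This identity is a standard illustration rather than the specific capture-recapture models mentioned above, and the p values are illustrative.

```python
# Sketch: probability of detecting a species at least once over k independent
# visits, each with per-visit detection probability p. The values of p are
# illustrative; the identity itself is standard probability, not a method
# prescribed by the text.
def p_detected_at_least_once(p_per_visit: float, n_visits: int) -> float:
    return 1.0 - (1.0 - p_per_visit) ** n_visits

for p in (0.3, 0.6):
    for k in (1, 3, 5):
        print(f"p={p}, visits={k}: {p_detected_at_least_once(p, k):.2f}")
```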
Despite these limitations, this field of study continues to expand. It is capable of handling a growing number of variables and potential identification mistakes, and it should continue to develop rapidly. For birds, another approach is employed to calculate the distance between singing males and the central survey point. To estimate bird numbers, models relating detection probability to distance are revised to incorporate distance data. Different functions and parameters can be adjusted such that the detection probability declines linearly, non-linearly, fast, or slowly as the distance from the center point increases. By measuring the distance between each specimen and the transect line, this approach may also be utilized for line-transect sampling of plants (Dutilleul et al., 2000). However, it is essential to recognize that they are only Band-Aid solutions, and that it is always desirable to minimize detection and identification mistakes during surveys.
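The distance-based correction described above rests on a detection function that declines with distance from the observer or transect line. The half-normal form used in the sketch below is one common choice; the text does not prescribe a particular function, and the sigma parameter is an assumption.

```python
import numpy as np

# Sketch: a detection function that declines with distance from the observer.
# The half-normal form g(d) = exp(-d**2 / (2 * sigma**2)) is one common choice
# in distance sampling; the text does not fix a form, and sigma is assumed.
def halfnormal_detection(distance_m: np.ndarray, sigma: float = 50.0) -> np.ndarray:
    return np.exp(-distance_m ** 2 / (2.0 * sigma ** 2))

distances = np.array([0, 25, 50, 100, 150])
for d, g in zip(distances, halfnormal_detection(distances)):
    print(f"{d:>4} m: detection probability ~{g:.2f}")
```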
3.13. ANALYSIS OF BIODIVERSITY DATA Standardizing statistical data analysis has advantages, just as standardizing data gathering has advantages. Frequently, a researcher may wish to determine if the stated changes are also present in neighboring regions. It is extremely difficult to draw quantitative comparisons, much alone qualitative ones, when diverse statistical methods are employed. The scientific community is becoming increasingly conscious of the need for more standard data analysis, and non-governmental organizations (NGOs) such as the European Bird Census Council in Denmark offer free online monitoring tools for data processing (Patil and Taillie, 2003). The objective is not to provide a list of possible tactics. The usually complex nature of biodiversity data is a persuasive argument for preserving method diversity. Univariate and multivariate analysis have to be viewed as allies, not competitors. However, these operations must adhere to severe technical specifications: • the statistical model must account for the geographical and temporal structure of the data; • the underlying assumptions must be consistent with the data; and if they aren’t, the findings of the analysis must be robust enough to resist non-compliance with the assumptions. Although it appears that the present statistical methodologies are satisfactory in general, there is need for improvement in a number of areas (Figure 3.8).
Figure 3.8. Spring colored birds flirting, natural design, and unique moments in the wild. Source: https://image.shutterstock.com/image-photo/spring-colored-birds-flirting-natural-600w-1100237423.jpg.
3.14. SOCIETAL AND OCCUPATIONAL HEALTH INFORMATION
• Total number of slums in a particular location;
• Affected population by state;
• Community health condition;
• Amenities of infrastructure in a certain area.
There are few statistics on socioeconomic differences in health-care usage grouped by different care schemes, or on exclusive and concurrent health-care utilization under different schemes among various socioeconomic groups. In Finland, government, occupational, and commercial programs offer simultaneous outpatient primary health care services. Because of variances in availability, cost, and gatekeeping, each plan primarily targets different demographic subgroups. A study was therefore carried out to determine how socioeconomic position affects the chance of working-age individuals gaining access to health care provided under the three schemes. Individual-level register-based data on the use of public, occupational, and private outpatient primary health care in 2013 were linked with sociodemographic factors for the complete 25–64-year-old population of Oulu, Finland (Farrell et al., 2010). The data were evaluated using descriptive methods and multinomial logistic regression models. The bulk of the research population depended
primarily on occupational or public health care or had no health insurance coverage. With lower socioeconomic position, the probability of not receiving treatment or receiving only public care rose. Socioeconomic status enhanced the chance of utilizing occupational care, either exclusively or in conjunction with private care. When sociodemographic characteristics and chronic disease were adjusted for, education, occupational class, and income were all linked with care usage, but income was the greatest predictor of the three. The findings are consistent with the architecture of the Finnish healthcare system, which includes a comprehensive occupational health-care plan for the working population, which leads to inequalities in health-care usage and probably socioeconomic disparities in health. Air quality, groundwater quality, particle matter, volatile organic compounds (VOCs), ozone, and nitrogen dioxide levels, as well as the number of wildlife fires and open area burning, are included in pollution data.
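A minimal sketch of the kind of multinomial logistic regression described for the Oulu analysis, fitted here to synthetic data. The variable names, the scikit-learn tooling, and all values are assumptions for illustration and are not the study's data or code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: multinomial logistic regression of care scheme used on income and
# education, mirroring the type of analysis described above.
# All data below are synthetic, not the Oulu study's data.
rng = np.random.default_rng(0)
n = 500
income_k = rng.normal(30, 10, n)       # income in thousands of euros, synthetic
education = rng.integers(1, 4, n)      # 1 = basic, 3 = tertiary, synthetic
X = np.column_stack([income_k, education])

# Synthetic outcome loosely tied to income so the model has signal to find.
scheme = np.where(income_k > 35, "occupational",
                  np.where(income_k > 22, "public", "none"))

model = LogisticRegression(max_iter=1000).fit(X, scheme)
print(model.classes_)
print(model.predict_proba([[40, 3]]).round(2))   # predicted scheme probabilities
```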
3.15. EXCELLENT AIR QUALITY The Air Data website offers air quality data gathered from outdoor sensors throughout the United States, Puerto Rico, and the United States Virgin Islands. The majority of data are provided by the AQS database. Data analysis choices include: saving data to a file, exporting data to one of Air Data’s basic reports, and creating graphical displays using one of the visualization tools (Notaro et al., 2019).
3.16. EXAMINE THE LOCATIONS OF MONITORS USING AN INTERACTIVE MAP
Air Data serves a wide range of users, from concerned citizens who want to know how many days of poor air quality their area experienced last year, to air quality analysts in the regulatory, academic, and health research sectors who need raw monitoring data.
3.17. DATA VISUALIZATION
Seeing data is sometimes the greatest way to comprehend it. The visualization tools provided by Air Data exhibit data in unique and useful ways:
• Air Quality Index (AQI) Plot: Compare the AQI readings of different pollutants at a specific place and time. This tool provides a whole year's worth of AQI readings, for two pollutants at once, and is useful for studying how the number of unhealthy days fluctuates over the year for each pollutant (a small sketch of the underlying AQI calculation follows this list).
• Tile Plot: Plot daily AQI values over time for a particular place. Each square or "tile" symbolizes a particular day of the year and is colored according to that day's AQI score. The legend displays the total number of days spent in each AQI category (Fourcade et al., 2018).
• Concentration Plot: Generate a time series plot for a specified region and time interval. This tool provides daily data summaries on air quality through the monitoring of criteria pollutants. You may plot all monitors in a county or core-based statistical area (CBSA), or you can choose specific monitors to plot.
• Concentration Map: Generate an animated sequence of daily concentration maps for a given time period. The AQI for the criteria pollutants and concentration levels for certain PM species, such as organic carbon, nitrates, and sulfates, are published daily. This application may be useful for locating air pollution incidents, such as wildfires.
• Ozone Exceedances: Compare the 8-hour ozone "exceedances" of this year to those of prior years. There are three possible ways to illustrate comparisons (Ng and Vecchi, 2020). The first graph sorts comparisons by month, the second illustrates comparisons by day, and the last illustrates year-by-year comparisons (Figure 3.9).
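For readers who want to reproduce AQI values from raw concentrations, the sketch below applies the EPA's piecewise-linear AQI interpolation to PM2.5. The breakpoint table shown is illustrative and may lag the current EPA tables, which are revised periodically.

```python
# Sketch: the piecewise-linear AQI interpolation used for criteria pollutants,
#   AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo.
# The PM2.5 breakpoints below are illustrative; consult the current EPA tables
# before relying on them, since breakpoints are periodically revised.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),       # Good
    (12.1, 35.4, 51, 100),    # Moderate
    (35.5, 55.4, 101, 150),   # Unhealthy for sensitive groups
    (55.5, 150.4, 151, 200),  # Unhealthy
]

def pm25_aqi(conc_ug_m3: float) -> int:
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc_ug_m3 <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc_ug_m3 - c_lo) + i_lo)
    raise ValueError("concentration outside the illustrative breakpoint table")

print(pm25_aqi(8.0))    # falls in the "Good" range
print(pm25_aqi(40.0))   # falls in the "Unhealthy for sensitive groups" range
```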
Figure 3.9. Abstract visualization of data and technology in graph form (3D illustration). Source: https://analyticalscience.wiley.com/do/10.1002/imaging.3292.
3.18. SOME FUNDAMENTAL AIR QUALITY CONCEPTS
Air Data users should bear in mind the following air quality concepts:
• Monitoring Data: Approximately 4,000 monitoring stations, the majority of which are owned and operated by state environmental agencies, measure ambient pollution levels. The agencies report hourly or daily pollutant concentration values to the EPA's AQS database, and Air Data retrieves its data from AQS.
• Data on Emissions: The Environmental Protection Agency (EPA) tracks the quantity of pollutants emitted by autos, power plants, and factories, among other sources. The EPA can receive emissions data from state environmental agencies in the form of actual measurements collected at the source or estimates based on mathematical calculations (Marohasy, 2003). Air Data does not yet include any emissions data; such data can be obtained from the Air Emissions Inventories webpage.
• Data on Water Quality: Monitoring water quality is an essential element of safeguarding water resources. Under the Clean Water Act, state, tribal, and federal authorities analyze the water quality of lakes, streams, rivers, and other water bodies. These monitoring operations collect data that support water resource managers in determining where pollution problems occur, where pollution management efforts should be focused, and where progress has been achieved. The Water Quality Exchange allows data partners to submit water monitoring data to the EPA, whereas the Water Quality Portal allows anybody, including the general public, to retrieve EPA data. The Water Quality Portal is the United States' most comprehensive source of water quality monitoring data, distributing about 380 million water quality records from 900 federal, state, tribal, and other partners using the Water Quality Exchange data format (Madavan and Balaraman, 2016).
3.19. PARTICULATE MATTER DATA
The EPA has established ambient air quality trends for particle pollution, also known as particulate matter (PM), using a national network of monitoring stations. PM2.5 refers to fine inhalable particles with diameters of 2.5 micrometers or less. The EPA establishes and reviews national air quality standards for PM under the Clean Air Act. Nationwide air quality monitors measure PM concentrations, and the EPA, state, tribal, and local governments use this information to ensure that airborne PM levels are safe for people and the environment. In recent years, PM2.5 concentrations across the nation have declined (Kobayashi and Ye, 2014).
3.20. OZONE DEPLETION TRENDS Air quality sensors around the nation assess ozone concentrations. The EPA, state, tribal, and local governments utilize this information to ensure that ozone levels are safe for people and the environment. In the 1980s, average ozone levels decreased, reached a plateau in the 1990s, and then plunged dramatically after 2002. The EPA uses a statistical model to account for weather-related variation in seasonal ozone concentrations in order to offer a more precise evaluation of the underlying trend in precursor emissions that lead to ozone.
The air quality database of the World Health Organization offers information on annual mean particulate matter and nitrogen dioxide concentrations measured at ground level. Since 2011, the database has been updated every two to three years. WHO is the custodian of the Sustainable Development Goal Indicator 11.6.2, Air quality in cities, which is derived from the data in WHO’s database (Khormi and Kumar, 2011). The World Health Organization’s Fifth Air Quality Database, the largest in the world, consists of almost 6,000 cities/human settlements, predominantly cities, in 117 countries and highlights where air pollution levels and associated health risks are the highest. Environmental data collecting typically requires meticulous preparation, technical proficiency, and a comprehensive understanding of environmental regulations. Goals, timelines, project structure, data needs, data collecting techniques, data quantity and quality, QA, and QC requirements, and analysis methodologies are all components of systematic planning that guarantee the acquisition of relevant and correct data for the intended purpose.
3.21. SOURCES THAT GIVE INFORMATION ON THE ENVIRONMENT
• Statistical Surveys: You may conduct statistical surveys to collect environmental data from a particular population segment.
• Administrative Records: Environmental reports can be created using administrative records from government and other organizations.
• Remote Sensing and Thematic Mapping: Using satellites and airplanes, remote sensors collect high-quality environmental data on inaccessible or hazardous sites and objects (Fuentes et al., 2007). This technique is commonly used to gather and evaluate information regarding forests, soil erosion, pollution levels, animal population estimates, and the effects of natural disasters.
• Field Monitoring Stations: Field monitoring stations collect qualitative and quantitative environmental data, such as soil, air, or water quality and meteorological or hydrological parameters. The most advantageous aspect of this method is that data are often obtained using established scientific procedures, and modeling is commonly used to improve data quality.
• Scientific Research: Accurate environmental data can also be gathered through scientific investigation. Typically, this type of information is offered for free or at little cost.
3.22. HOW SHOULD DATA ABOUT THE ENVIRONMENT BE EVALUATED?
Environmental judgments, analysis, and conclusions rely heavily on the precision of environmental data. To get the intended result, the proper data quality models and procedures must be applied (Köhl et al., 2000). The technique for examining the quality of environmental data is as follows:
• Step 1: Research Sources and Strategies for Data Collection: Before reviewing environmental data, you must be familiar with its sources and gathering methods. To avoid future problems, the source must also be legal, ethical, and entirely original.
• Step 2: Request a List of References: Request references from prior clients of your data source in order to verify the quality of the data. You may also contact their prior customers directly.
• Step 3: Conduct an Analysis of the Data: Request an example data set from your data source and examine it closely. Examine the results carefully to ensure they match your expectations.
3.23. WHAT IS THE COST OF ENVIRONMENTAL DATA ON AVERAGE?
If you fail to plan ahead and specify your exact environmental requirements, you will be charged extra by your data provider. The following are some of the most prevalent pricing strategies:
• Subscription/Licensing: The data source requires a subscription. This model delivers constantly updated datasets once API access has been granted (Kessler et al., 2015).
• One-Time Purchase: You may make a one-time payment for each significant batch of environmental datasets.
• Custom Quotes: If you have any special requirements, inform your environmental data provider; your expenses will vary based on your requirements.
The greatest difficulty is managing environmental data. Long-term environmental data collection is essential for monitoring environmental changes on a global or regional scale, and the most challenging issue for academics is the collection and management of high-quality environmental data sets (Frankenhuis et al., 2019).
Even if companies employ cutting-edge technology to overcome data shortages, there will always be unique, difficult-to-solve challenges. Existing environmental data must be investigated thoroughly to be comprehended.
3.24. WHAT QUESTIONS SHOULD YOU ASK ENVIRONMENTAL DATA PROVIDERS? Here are some important questions to ask a prospective environmental data provider:
• How Are Environmental Data Analyzed and Validated?: It is difficult to determine the quality of enormous amounts of environmental data in a reasonable length of time, and it is easy to become lost in the mountains of paperwork when acquiring data for a large site or set of sites. Appropriate preparation is therefore required to manage and evaluate the enormous volume of data, and this information will assist you in selecting a vendor (Kandlikar et al., 2018).
• Where Can Environmental Data Be Purchased?: Data suppliers and vendors sell environmental data products and samples. Popular environmental data products and datasets on our platform include: AWIS Weather Services' Hourly Weather Data – Worldwide – 1940s to Present; Custom Weather's Historical Hourly and Daily Weather Observations – 100 Years; and AWIS Weather Services' Historical Hourly Weather Data – Worldwide – 1940s to Present.
• How Can I Obtain Environmental Data?: Environmental data is available in a variety of forms; the ideal one for your needs will depend on your specifications. Typically, historical environmental data is available in bulk and delivered via an S3 bucket (John et al., 2021). If your use case demands real-time information, you may obtain APIs, feeds, and streams for real-time environmental data.
• What Data Formats Are Similar to Environmental Data?: Environmental data is related to B2B data, energy data, real estate data, geographical data, and commerce data, and is frequently employed in analytics and weather analytics.
CHAPTER 4
THE ROLE OF STATISTICS IN ENVIRONMENTAL SCIENCE
CONTENTS
4.1. Uses of Statistics in Environmental Science........................................ 66 4.2. Sources of Information....................................................................... 67 4.3. Methods............................................................................................ 68 4.4. Basic Concepts.................................................................................. 68 4.5. Applications of Statistical Tools in Environment Science..................... 69 4.6. Statistical Models............................................................................... 75 4.7. Goodness of Fit Test........................................................................... 77 4.8. Theoretical or Biological Models........................................................ 78 4.9. Fitting Niche Apportionments Models to Empirical Data.................... 81 4.10. Species Accumulation Curves.......................................................... 82 4.11. Users of Environmental Data........................................................... 84 4.12. Environmental Information.............................................................. 85 4.13. Sources of Environmental Statistics.................................................. 88 4.14. Monitoring Systems......................................................................... 90 4.15. Scientific Research........................................................................... 90 4.16. Geospatial Information and Environment Statistics........................... 92 4.17. Institutional Dimensions of Environment Statistics........................... 93 4.18. Importance of Environmental Statisticians........................................ 94
The environment is warming day by day as global warming intensifies and ever greater amounts of greenhouse gases are released into the atmosphere. This changing climate adversely affects life forms on Earth and reduces crop productivity through several abiotic stresses. The environment is therefore a global concern, which has widened the scope and importance of environmental science. Throughout the development of civilization, humans were accompanied by both statistics and the environment: in the early days they were knowingly attuned to the environment and unknowingly worked with statistics (Iwai et al., 2005). The environment and statistics have thus shared a long history of mutual reciprocation, and both subjects now attract the academic attention of scholars across the world (Figure 4.1).
Figure 4.1. Environmental statistics. Source: https://www.shutterstock.com/search/environmental+statistics.
There is an exclusive branch of the United Nations Statistics Division (UNSD) for environment statistics. This branch was established in 1995. It deals with data collection, methodology, capacity development, and coordination of environmental statistics and indicators. The branch also publishes a dedicated newsletter, ENVSTATS, which reports the activities of UNSD within the scope of environment statistics. The framework used in the development of environmental statistics is an updated version of the original Framework for the Development
of Environment Statistics (FDES) (Gallyamov et al., 2007), the original version of which was published by UNSD in 1984. Environmental statistics is handled differently depending on the country. In India, for instance, the Ministry of Statistics and Program Implementation produces a dedicated publication on environment statistics called EnviStats, which reports recent developments in the field. Statistics has been used extensively in environment statistics because statistical reasoning is indispensable in any field of enquiry. Environmental statistics has an integrated, multidisciplinary character that, owing to its analytical nature, helps shed light on the biological branches of modern science. The existing interconnection between environmental science and statistics creates an easy access point for future investigators. Statistical techniques are used to understand various environmental phenomena. The available tools include stochastic and probabilistic models, data collection, data analysis, and inferential statistics, among others. When conducting statistical analysis, it is important to note that particular principles and models apply to different kinds of environmental issues, including monitoring, sampling, standards, control, management, conservation, and pollution (Girshick et al., 2011). These methods and principles can be used across all fields and concern water quality, air quality, radiation, forestry, food, climate, noise, fisheries, soil conditions, and environmental standards. More sophisticated statistical techniques include extreme-value processes, stimulus-response models, sampling principles and methods, spatial models, and the design of experiments, among others (Figure 4.2).
Figure 4.2. There are various applications of statistics in environmental science. Source: https://www.pixtastock.com/illustration/10823549.
The use of statistics in environmental science has led to what is known as environmental statistics. This is because, statistical methods are used in environmental science. There are various procedures used in dealing with questions concerning the natural environment while in the undisturbed state, interactions between humans and the environment and urban environments. The use of statistics in environmental science has grown throughout the years as a response to the increasing concern over the environment in the public, government sectors and organizations. The use or scope of statistics in environmental science as environmental statistics has been defined by the United Nation’s Framework for the development of Environment Statistics as follows: the scope of environmental statistics covers biophysical aspects of the environment and aspects of the socio-economic system that directly influences and interacts with the environment. This means that there is an overlap in the scope of environment, economic, and social statistics. However, it may not necessarily or easily draw a clear line that divides such areas (Homma et al., 2020). Economic and social statistics can be used to describe activities or processes with a direct impact on or direct interaction with the environment used widely in environment statistics.
4.1. USES OF STATISTICS IN ENVIRONMENTAL SCIENCE Statistics is used in conducting analysis considered essential to the field of environmental sciences. This allows researchers to gain an understanding of environmental issues through researching and developing potential solutions to the issues of study. There are numerous applications of statistical methods to environment sciences. They also vary depending on the nature of the study being done (Higuchi and Inoue, 2019). Use of statistics in environmental science is used in various fields, including standard bodies, research institutes, health, and safety organizations, water, and river authorities, meteorological organizations, protection agencies, fisheries, and in pollution, risk, regulation, and control concerns. Statistics in environment sciences is mostly pertinent and widely used in the academic, governmental, technological, regulatory, and consulting industries. Some specific applications of statistical analysis in the fields of environmental science include environmental policy making, earthquake risk analysis, environmental forensics, and ecological sampling planning (Figure 4.3).
Figure 4.3. Data from environmental statistics is used by government agencies, among other groups. Source: https://www.alamy.com/stock-photo-studio-cut-out-concept-of-government-agencies-leaking-information-20949842.html.
In the scope of environmental statistics, there are two categories of their uses that includes inferential statistics and descriptive statistics. Inferential statistics is used in making inferences about data, making predictions, and testing hypotheses, while descriptive statistics is not used in making inferences about data but is instead used in describing its characteristics (Giraldo et al., 2010). Some studies covered in environmental statistics include targeted studies to describe the likely impact of changes being planned or of accident occurrences, baseline studies to document the present state of an environment to provide background in case of unknown changes in the future and regular monitoring with the aim of detecting changes in the environment.
4.2. SOURCES OF INFORMATION
There are various sources of data used in environmental statistics. They include surveys relating to the environment and human populations, maps, and images, records from agencies managing environmental resources, equipment used in examining the environment, and research studies around
the world. Direct observation is the key component of the data, though most environmental statistics draw on a variety of sources (Figure 4.4).
Figure 4.4. Survey is a good source of information. Source: https://www.dreamstime.com/survey-concept-shot-survey-written-paper-chit-image163484143.
4.3. METHODS
There are various methods of statistical analysis in environmental sciences, and the same applies to their applications. Most of these methods have their basis in other fields and are modified to suit the needs or limitations of data in environmental science. Linear regression models, non-linear models, and generalized models are examples of methods widely used in environmental science for studying relationships between variables (Gotelli et al., 2012).
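As a small illustration of the first of these model families, the sketch below fits a simple linear regression relating temperature to elevation. The data are synthetic and the variable names are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

# Sketch: a simple linear regression relating an environmental response to a
# predictor, one of the model families mentioned above. Data are synthetic.
rng = np.random.default_rng(1)
elevation_m = rng.uniform(0, 2500, 50)
temperature_c = 28.0 - 0.006 * elevation_m + rng.normal(0, 1.0, 50)

slope, intercept = np.polyfit(elevation_m, temperature_c, deg=1)
print(f"fitted lapse rate: {slope * 100:.2f} degC per 100 m")
print(f"fitted sea-level temperature: {intercept:.1f} degC")
```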
4.4. BASIC CONCEPTS
The theory of probability and games of chance developed in the mid-seventeenth century, and from it the concept of modern statistics was born. The word statistics is said to come from the German 'Statistik,' the Latin 'status,' or the Italian 'statista.' They all mean
political state or statecraft. The term statistics can be used in two different senses. When used in the plural sense, it means a collection of numerical facts. Horace Secrist says that statistics can be defined as an aggregate of facts, affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated, or estimated according to a reasonable standard of accuracy (Han and Li, 2020). In this sense, such data are collected in a systematic manner for a predetermined purpose and placed in relation to each other. This definition explains the characteristics of statistical data. When used in the singular sense, the term statistics refers to the statistical methods of dealing with numerical data. Croxton and Cowden state that statistics is the science of collection, analysis, presentation, and interpretation of numerical data. This definition is useful as it points out the different stages of statistical investigation. Statistics is thus concerned with exploring, summarizing, and making inferences about the state of complex systems, including the state of a nation (social statistics), the state of the environment (environmental statistics), and the state of people's health (medical and health statistics). Statistics is known for its wide range of applications and advantages. Among the allegations made against statistics is that interested parties may make misleading statements to favor their beliefs. It is therefore important that, in science, experts use statistical tools effectively (Girshick et al., 2011) and that any statistical study is conducted by the right person. Such studies should be carried out with an open mind, as there are many ways a study can be done properly and numerous ways it can be done wrongly. A high level of care should therefore be maintained when selecting the tools to be used during the process.
4.5. APPLICATIONS OF STATISTICAL TOOLS IN ENVIRONMENT SCIENCE
Data analysis in statistics is usually divided into two sections, namely inferential and descriptive statistics. These are described in detail as follows:
4.5.1. Descriptive Statistics
This kind of statistics is considered the initial stage of data analysis. It involves the exploration, visualization, and summarization of data, and it makes use of populations and random samples. Different kinds of data are used in this kind of analysis, including quantitative or qualitative and continuous or discrete data. Such data are used to study features of data patterns, distributions, and associations. Tools such as frequency tables, pie diagrams, histograms, and bar charts are used to represent data position, distribution, shape, and spread efficiently (Antweiler and Taylor, 2008). Descriptive statistics is used to interpret the information contained in data, allowing one to draw conclusions. Measures of central tendency, such as the median and mean, can be calculated when analyzing environmental data. Other elements taken into consideration are measures of dispersion, such as the standard deviation and range, which are used to quantify variability in small samples. Concepts of correlation and association are used to demonstrate relationships between variables; they are useful tools that allow a clear understanding of linear and non-linear relationships (Figure 4.5). Some important measures are discussed in the following sections.
Figure 4.5. Descriptive statistics explores and visualizes environmental data. Source: https://www.shutterstock.com/search/descriptive+statistics.
4.5.2. Central Tendency
Central tendency is defined as the tendency of observations to cluster around some central value. Measures of central tendency are termed averages. Among the commonly used averages in environmental statistics is the mean, which is the average of all values (Zio et al., 2004). There is also the median, which is the middle-most observation when all the obtained values are arranged in ascending or descending order. Finally, there is the mode, which is the most frequently occurring observation.
4.5.3. Dispersion
Dispersion is defined as the scattering of observations about the central value. Important measures of dispersion include the range, quartile deviation, mean deviation, standard deviation, and coefficient of variation. The first four of these depend on the unit of measurement of the observations and are therefore absolute measures, whereas the coefficient of variation (the standard deviation expressed as a percentage of the mean) is a relative, unit-free measure. The range is defined as the difference between the largest and smallest observations; the quartile deviation is based on the quartiles of the distribution; and the mean deviation and standard deviation are computed from the deviations of the observations about a central value, each with its own formula.
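The central tendency and dispersion measures described above can be computed directly; the short sketch below uses a hypothetical series of monthly rainfall totals.

```python
import statistics as st

# Sketch: central-tendency and dispersion measures for a small hypothetical
# series of monthly rainfall totals (mm).
rainfall = [120, 85, 95, 110, 85, 200, 150, 90, 85, 130, 105, 95]

mean = st.mean(rainfall)
median = st.median(rainfall)
mode = st.mode(rainfall)
data_range = max(rainfall) - min(rainfall)
std_dev = st.stdev(rainfall)              # sample standard deviation
coef_variation = 100 * std_dev / mean     # relative (unit-free) measure

print(mean, median, mode, data_range, round(std_dev, 1), round(coef_variation, 1))
```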
4.5.4. Skewness
Skewness, also known as asymmetry, describes the lack of symmetry in a distribution. When observations are plotted on a frequency curve and both sides of the mode are distributed in a similar manner, the distribution is said to be symmetric; otherwise it is skewed. If more area lies to the right of the mode, the distribution is considered positively skewed; if more area lies to the left of the mode, it is said to be negatively skewed (Ying and Sheng-Cai, 2012).
4.5.5. Kurtosis
Kurtosis is the measure of the degree of flatness or peakedness of a curve. A normal curve is called mesokurtic. A curve that is more peaked than normal is called leptokurtic, and a curve that is flatter than normal is called platykurtic.
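Sample skewness and kurtosis can be computed in the same spirit; in the sketch below, scipy reports excess (Fisher) kurtosis, so zero corresponds to a mesokurtic curve. The rainfall values are the same hypothetical series used above.

```python
from scipy.stats import skew, kurtosis

# Sketch: sample skewness and excess kurtosis for a hypothetical rainfall series.
# scipy's kurtosis() uses the Fisher definition by default, so 0 is mesokurtic,
# positive values are leptokurtic, and negative values are platykurtic.
rainfall = [120, 85, 95, 110, 85, 200, 150, 90, 85, 130, 105, 95]

print(f"skewness: {skew(rainfall):.2f}")     # > 0 indicates a long right tail
print(f"kurtosis: {kurtosis(rainfall):.2f}")
```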
4.5.6. Inferential Statistics
This kind of statistics makes use of the concept of probability, which is important when studying uncertainties in the environment. For example, whether or not it will rain on a given day can be inferred using probability. There are a number of theoretical probability distributions, including the Poisson distribution, the binomial distribution, and the Bernoulli distribution, among others. They are all useful for modeling the probability distribution of real environmental data. Binary decisions, such as rain or no rain, coin tossing, or yes or no, can be described using Bernoulli variables because their outcomes are binary in nature. If one wants to count the number of flood years in a given area out of a total number of seasons, treating each season as a Bernoulli event with flood probability p, the probability distribution of the count is given by the binomial distribution (Xia et al., 2017). If the total number of trials is unknown but the mean number of flood occurrences is known, the distribution can instead be modeled by the Poisson distribution. Statistical tools that play a vital role in analyzing environmental data include estimation and hypothesis testing; among the frequently used statistical tests in environmental and atmospheric science are the F-test, t-test, and Z-test, among others. Another statistical approach is time series analysis, which studies environmental quantities with respect to time. Good examples are monthly or yearly mean temperature, rainfall, and humidity, all of which can be studied using time series methods (Figure 4.6).
Figure 4.6. Inferential statistics is formed on the basis of probability. Source: https://www.shutterstock.com/search/inferential+statistics.
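A minimal sketch of the flood-count example: with a known number of seasons and a per-season flood probability the count is binomial, while with only a mean count known the Poisson distribution is the usual model. The parameter values are illustrative, not data from the book.

```python
from scipy.stats import binom, poisson

# Sketch: flood counts as Bernoulli/binomial and Poisson models, as described
# above. n, p, and the mean are illustrative values.
n, p = 20, 0.15
print(binom.pmf(3, n, p))        # P(exactly 3 flood years in 20 seasons)
print(1 - binom.cdf(4, n, p))    # P(more than 4 flood years)

mean_floods = 3.0
print(poisson.pmf(3, mean_floods))       # P(exactly 3 floods) when only the mean is known
print(1 - poisson.cdf(4, mean_floods))   # P(more than 4 floods)
```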
As mentioned earlier, there are many statistical techniques available that can be used in the field of environmental science, and there are practical examples in the context of environmental data. There are cases of collaboration between quantitative researchers and environmental scientists, and such collaborations have aided further learning in both fields, based primarily on two kinds of work: one deals with statistical techniques, while the other deals with practical examples (Wieland et al., 2010). Statistical techniques can also be classified according to the method of plotting species abundance data; one such method is the Whittaker plot.
4.5.7. Whittaker Plot
Among the most informative methods is the abundance/rank plot, or dominance-diversity curve. In such a curve, the species are plotted from the most to the least abundant along the x-axis, and abundance is plotted on the y-axis, usually on a logarithmic scale. This enables abundances spanning several orders of magnitude to be accommodated on the same graph, and easy comparison can be achieved by using percentage or proportional abundance. These plots were named Whittaker plots in remembrance of R. H. Whittaker for his famous contribution. There are several advantages to using this plot. It allows clear contrasting of patterns of species richness, and when there are only a few species, information concerning their relative abundance remains visible and can be represented in histogram form. The plot is also very effective when following environmental impacts and succession (Adelman, 2003); in such cases a rank/abundance graph is used. The shape or behavior of the curve gives inferences about the species abundance model suitable for the given data. A graph with a steep plot describes assemblages with high dominance, while one with a shallow plot signifies low dominance. In most cases, high-dominance plots are consistent with the log series or geometric series, while low-dominance plots suit the log-normal or broken stick model. The challenge, however, is that the curves of the different models are rarely fitted to the empirical data (Figure 4.7).
Figure 4.7. A Whittaker plot. Source: https://www.researchgate.net/figure/Whittaker-plot-for-the-total-collection-of-fish-sampled-at-the-Itupararanga-reservoir_fig3_24405182.
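A minimal sketch of a Whittaker (rank/abundance) plot for a hypothetical assemblage, with species ranked on the x-axis and proportional abundance on a log-scaled y-axis.

```python
import matplotlib.pyplot as plt

# Sketch: a Whittaker (rank/abundance) plot. Species are ranked from most to
# least abundant on the x-axis and proportional abundance is plotted on a log
# y-axis. The abundance values are hypothetical.
abundances = sorted([420, 180, 95, 60, 33, 20, 12, 7, 4, 2, 1, 1], reverse=True)
total = sum(abundances)
ranks = range(1, len(abundances) + 1)
proportions = [a / total for a in abundances]

plt.plot(ranks, proportions, marker="o")
plt.yscale("log")
plt.xlabel("Species rank (most to least abundant)")
plt.ylabel("Proportional abundance (log scale)")
plt.title("Whittaker plot (hypothetical assemblage)")
plt.show()
```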
4.5.8. K-Dominance Plot
This kind of plot is used when one wants to show the relationship between percentage cumulative abundance, on the y-axis, and species rank (or log rank), on the x-axis. Curves that sit higher on such a plot represent the less diverse assemblages (Wagner and Fortin, 2005).
4.5.9. Biomass/Abundance Comparison Curve or ABC Curve
This curve is a variant of the k-dominance plot. The related curves in the plot are constructed by using two measures of abundance: biomass and number of individuals. The level of disturbance affecting the assemblage, pollution-induced or otherwise, can be inferred from the resulting curves. This method is mostly used with benthic macrofauna, as it was developed for that purpose, and it has been used productively by various investigators. In most cases, the ABC plot is used to study the entire species abundance distribution. The result of the plot is said to be positive when the biomass curve is
consistently above the individuals curve; this is interpreted as an undisturbed assemblage. A grossly perturbed assemblage, on the other hand, gives negative values, with the individuals curve lying consistently above the biomass curve. Curves that overlap and produce a value of W close to 0 signify moderate disturbance. Values of W range from –1 to +1. W statistics are generally computed for every sample separately, and significant differences can be tested using ANOVA, mostly when treatments have been replicated. Another benefit of graphing W values is that they are an effective way of illustrating shifts in the composition of the assemblage when un-replicated samples have been taken over a time series or along a transect, such as before, during, and after pollution events. The W statistic is particularly useful when ABC curves are used to discriminate among samples (Toivonen et al., 2001).
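The text does not give a formula for W; the sketch below uses the form usually credited to Clarke (1990), W = Σ(Bi − Ai) / [50(S − 1)], where Bi and Ai are the cumulative percentage dominance of biomass and of individuals at rank i. Treat both the formula and the sample data as assumptions for illustration.

```python
import numpy as np

# Sketch: a W statistic for ABC (abundance/biomass comparison) curves, using
# W = sum(B_i - A_i) / (50 * (S - 1)), where B_i and A_i are cumulative
# percentage dominance of biomass and of individuals. The formula (usually
# credited to Clarke, 1990) and the sample data are assumptions here.
def cumulative_dominance(values: np.ndarray) -> np.ndarray:
    ranked = np.sort(values)[::-1]                    # most dominant first
    return 100.0 * np.cumsum(ranked) / ranked.sum()

def abc_w(abundance: np.ndarray, biomass: np.ndarray) -> float:
    s = len(abundance)
    a = cumulative_dominance(abundance)
    b = cumulative_dominance(biomass)
    return float(np.sum(b - a) / (50.0 * (s - 1)))

abundance = np.array([50, 30, 10, 5, 3, 2])                   # hypothetical counts
biomass = np.array([400.0, 60.0, 20.0, 10.0, 8.0, 7.0])       # hypothetical grams
print(abc_w(abundance, biomass))   # positive W suggests an undisturbed assemblage
```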
4.5.10. Species Abundance Models
Statistical species abundance models were devised as the best empirical fits to the observed data. They allow the investigator to compare different assemblages objectively, which is one of their advantages. There are cases where a parameter of the distribution can be used as an index of diversity. There is another set of models: the theoretical or biological models.
4.6. STATISTICAL MODELS There are various kinds of models used when analyzing environmental data. They are discussed as follows:
4.6.1. Log Series Models
The log series model displays the number of species on the y-axis in relation to the number of individuals per species on the x-axis. The abundance classes are usually represented on a log scale. Such a plot is used when the log-normal distribution is chosen. This kind of graph is at times called the Preston plot, in remembrance of F. W. Preston, who pioneered the use of the log model in analysis. It is noted that when using the log model, the mode will fall in the class with the lowest abundance, which often represents a single individual (Austin et al., 2005). Such plots are known to place more focus on rare species. It is also notable that log transformation of the x-axis tends to shift the mode to the right so as to reveal a log-normal pattern (Figure 4.8).
Figure 4.8. The log series model. Source: https://www.researchgate.net/figure/The-log-series-distribution_fig2_12607564.
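The abundance-class (Preston) plot described above can be produced by binning species abundances into log2 classes. The sketch below is illustrative only; the function name and the example abundance vector are hypothetical.

```python
import numpy as np

def preston_octaves(abundances):
    """Bin species abundances into log2 abundance classes (Preston 'octaves').

    Returns (edges, counts): counts[k] is the number of species whose
    abundance falls in the class [2**k, 2**(k+1))."""
    n = np.asarray(abundances, dtype=float)
    n = n[n > 0]
    octave = np.floor(np.log2(n)).astype(int)      # abundance class per species
    counts = np.bincount(octave)                    # species per abundance class
    edges = 2 ** np.arange(len(counts) + 1)         # class boundaries: 1, 2, 4, 8, ...
    return edges, counts

# Hypothetical abundance vector (individuals per species) for one assemblage.
edges, counts = preston_octaves([1, 1, 2, 3, 3, 5, 8, 13, 21, 55, 144])
for k, c in enumerate(counts):
    print(f"octave {k}: [{edges[k]}, {edges[k + 1]}) individuals -> {c} species")
```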
4.6.2. Negative Binomial Model
In the field of ecology, there are numerous applications of the negative binomial model. This model is mostly used in estimating the richness of species. Some researchers, however, have pointed out that it is only rarely fitted to species abundance data. There is potential interest in this model because it is related to the log series model.
4.6.3. Zipf-Mandelbrot Model
Linguistics and information theory are said to be part of the origin of this model, and it has several applications in environmental diversity studies. The model is valued for its assumption of a rigorous sequence of colonists, with the same species always present at the same point in the succession in identical habitats. Some researchers argue that this model is no better than the log-normal model or the log series, yet it has been used successfully in various environmental fields.
Its areas of use include terrestrial studies and aquatic systems. The model is known to be effective in testing the performance of various diversity estimators (Sivak and Thomson, 2014). Biomass data are known to be compatible with the log-normal distribution, while the Zipf-Mandelbrot model is very effective in giving the best description of cover data.
4.7. GOODNESS OF FIT TEST
In the analysis of data, the goodness of fit test is essential as it determines whether a model is suitable for the data being analyzed. This kind of test is used to compare the observed and expected frequencies of species in each abundance class. The conventional method is to assign the observed data to abundance classes and then test how well the data fit a deterministic model. The classes used are usually based on log to the base 2. According to the model used, the number of species expected in each abundance class is determined. The model takes the observed number of species (S) and the total abundance (N) and determines how the N individuals should be distributed among the S species. The model is rejected when it does not describe the pattern of species abundance adequately; this happens when the value of p is less than 0.05. The fit is considered good, and the model is not rejected, when the value of p is greater than 0.05. In most cases, tests of empirical data are done with a small number of abundance classes, which is known to cause a reduction in the available degrees of freedom (Bankole and Surajudeen, 2008). The fewer the degrees of freedom, the harder it becomes to reject a model. It is said that goodness of fit tests work most effectively with large assemblages, though these might not be ecologically coherent units. For this reason, one is advised to adopt the Kolmogorov-Smirnov (K-S) test as the standard method of assessing the goodness of fit of deterministic models, to gauge whether they are suitable for the analysis. The advantage of the K-S two-sample test is that it can be used to compare two datasets directly, allowing the description of their abundance patterns. It is important to note that though a model may fit one type of data, it may not necessarily be applicable to another. Also, if one model fits the data and another does not, it is not possible to conclude that the fits of the two are significantly different. For this reason, one may be advised to use replicated observations. Any deviations noted in the study
can be log transformed with the aim of achieving normality. Whether models are significantly different from one another can then be assessed using multiple comparison tests such as Duncan's new multiple range test.
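A hedged sketch of the two tests mentioned above, using SciPy: a K-S two-sample comparison of two abundance vectors, and a chi-square goodness of fit test of observed against model-expected species counts per abundance class. All of the data values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical species abundance vectors from two assemblages (individuals per species).
site_a = np.array([120, 60, 30, 15, 8, 4, 2, 1, 1])
site_b = np.array([40, 35, 30, 28, 25, 20, 18, 15, 10, 8])

# Two-sample Kolmogorov-Smirnov test comparing the two abundance distributions.
statistic, p_value = stats.ks_2samp(site_a, site_b)
print(f"D = {statistic:.3f}, p = {p_value:.3f}")

# Chi-square goodness of fit: observed vs. model-expected species per abundance class
# (the expected counts would come from the fitted species abundance model).
observed = np.array([12, 8, 5, 3, 2])
expected = np.array([11.0, 9.0, 5.5, 3.0, 1.5])
expected = expected * observed.sum() / expected.sum()   # make totals match
chi2, p = stats.chisquare(observed, expected)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```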
4.8. THEORETICAL OR BIOLOGICAL MODELS
There are various categories of such models, discussed in the following subsections.
4.8.1. Deterministic and Stochastic Models
When using a deterministic model, the assumption is that N individuals will be distributed amongst the S species in the assemblage; with the deterministic niche apportionment models, the geometric series is used. The stochastic models work by recognizing that replicate communities structured according to the same set of rules will nonetheless vary in the relative abundances of their species. They seek to capture the random elements inherent in natural processes, which is said to make biological sense. When the stochastic models are compared to the deterministic models, the stochastic models are said to be more challenging (Simoncelli and Olshausen, 2001). Practically, it is very necessary for a researcher to know whether a model is stochastic or deterministic. When using a stochastic model, one should know that its complexity requires replicated data, although some methods have been developed to deal with this challenge.
4.8.2. Geometric Series
The underlying assumption here is that the dominant species pre-empts a percentage k of a limiting resource, the second most dominant species pre-empts the same percentage k of the remaining part, and so on until all S species have been accommodated. This assumption is fulfilled when species abundance is proportional to the amount of resource used. The resulting pattern follows a geometric series, also called the niche pre-emption hypothesis. In this series, species abundances are ranked from the most to the least abundant. The ranked list is very important, more so because the ratio of the abundance of each species to the abundance of its predecessor is constant through the series. When the observations are plotted on a log abundance/species rank graph, the series will appear as a straight line. Such plots are very useful in identifying whether or not a dataset is consistent with a geometric series. Analysis makes use of the full mathematical treatment of the geometric series. It can
also be used to present the species abundance distribution corresponding to the abundance/rank series.
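Under the assumptions just described, the expected abundances of a geometric series can be written down directly. The following sketch assumes a hypothetical assemblage; the scaling constant simply ensures the expected abundances sum to N.

```python
import numpy as np

def geometric_series_abundances(N, S, k):
    """Expected abundances under the geometric (niche pre-emption) series.

    The most dominant species takes a fraction k of the limiting resource,
    the next takes k of what remains, and so on for all S species; abundance
    is assumed proportional to the resource captured."""
    ranks = np.arange(1, S + 1)
    c_k = 1.0 / (1.0 - (1.0 - k) ** S)            # scaling so abundances sum to N
    return N * c_k * k * (1.0 - k) ** (ranks - 1)

# Hypothetical assemblage: 1,000 individuals, 8 species, k = 0.4.
expected = geometric_series_abundances(1000, 8, 0.4)
print(np.round(expected, 1))
# On a log-abundance vs. rank plot these values fall on a straight line,
# the signature of a geometric series noted in the text.
print(np.round(np.diff(np.log(expected)), 3))      # constant step = log(1 - k)
```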
4.8.3. Broken Stick Model
This model is also known as the random niche boundary hypothesis. When using this model, relative species abundance is plotted on the y-axis on a linear scale against the logged species abundance sequence on the x-axis, arranged from most to least abundant. The graph usually appears as a straight line. This model, however, has its own demerit, which is that it can be derived from more than one hypothesis. It is useful in providing evidence of some ecological factor being divided more or less evenly between species (Seid et al., 2014). The model can be used to describe a group of S species with equal competitive ability vying for niche space. In this model, data are arranged in rank order of abundance. The model may present a challenge when fitting empirical data.
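The broken stick expectation can likewise be computed in closed form, E(n_i) = (N/S) Σ_{j=i..S} 1/j for the i-th ranked species. The sketch below uses hypothetical values of N and S.

```python
import numpy as np

def broken_stick_abundances(N, S):
    """Expected abundance of the i-th ranked species under the broken stick
    (random niche boundary) model: E(n_i) = (N / S) * sum_{j=i..S} 1/j."""
    ranks = np.arange(1, S + 1)
    tail_sums = np.array([np.sum(1.0 / np.arange(i, S + 1)) for i in ranks])
    return (N / S) * tail_sums

# Hypothetical assemblage of 500 individuals divided among 10 species.
expected = broken_stick_abundances(500, 10)
print(np.round(expected, 1))        # ranked from most to least abundant
print(round(expected.sum(), 1))     # the expectations sum back to N
```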
4.8.4. Tokeshi's Model
As the name suggests, this family of models was developed by Tokeshi, who is known for developing several niche apportionment models, including the power fraction, random fraction, dominance pre-emption, dominance decay, and MacArthur fraction models. When making use of these models, the assumption is that abundance is proportional to the fraction of niche space occupied by a species, and that the target niche selected is divided at random. The difference between the models lies in how the target niche is selected. When the niche selected is large, the resulting species abundance will be large (Rovetta and Castaldo, 2020). Evenness ranges from least to most across the models, following the order in which they are described. A random collection of niches of arbitrary sizes is represented using the random assortment model.
4.8.5. Random Fraction
The random fraction model is one in which the niche space is divided at random into two pieces; one of the two is then selected randomly and subjected to further subdivision until all species are accommodated. The related sequential breakage model is used to depict a situation in which a new colonist competes for the niche of a species that is already in the community and takes over a random proportion of the previously existing niche. Among the uses of this model is the modeling of speciation events.
Another advantage of this model is that it is conceptually simple and can be suitable when dealing with a small community. A further advantage is that a Microsoft Excel program exists that can model the species abundance distribution associated with it (Pleil et al., 2014).
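Because the random fraction model is stochastic, it is usually explored by simulation. The following is a minimal sketch of one plausible implementation of the breakage rule described above, with each existing fragment equally likely to be chosen and split at a uniform random point; the function name and community size are hypothetical.

```python
import random

def random_fraction(S, seed=None):
    """Simulate a random fraction apportionment: starting from a single niche
    of size 1, repeatedly pick one existing fragment at random (each fragment
    equally likely) and split it at a uniformly random point, until there are
    S fragments. Fragment sizes are taken as relative species abundances."""
    rng = random.Random(seed)
    niches = [1.0]
    while len(niches) < S:
        i = rng.randrange(len(niches))       # every fragment equally likely
        piece = niches.pop(i)
        split = rng.random() * piece         # uniform breakage point
        niches.extend([split, piece - split])
    return sorted(niches, reverse=True)

# One realization for a hypothetical 12-species community; averaging many
# replicates gives the expected rank-abundance pattern of the model.
print([round(x, 3) for x in random_fraction(12, seed=42)])
```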
4.8.6. Power Fraction Model
The power fraction model was also developed by Tokeshi and, unlike the other models, is intended for handling species-rich assemblages. The niche space is subdivided in a manner similar to the random fraction, but with the power function (k) the probability of a niche splitting increases, albeit only slightly, in relation to its size. Under this approach the largest niche may be selected for fragmentation. In some cases the power fraction resembles the MacArthur fraction model. In most cases, the value of k is set to 0.5 when using the power fraction model. With this family of models, Tokeshi accounts for virtually all assemblages. It is important to note that larger niches have a higher probability of fragmenting, which could occur either evolutionarily or ecologically.
4.8.7. Dominance Pre-Emption Model
The dominance pre-emption model makes use of the assumption that each species pre-empts more than half of the niche space remaining, which makes it dominant over the remaining species. In most cases, the proportion of available niche space assigned ranges between 0.5 and 1. Any increase in the number of replications causes the model to become more similar to the geometric series (Browning et al., 2015). This method can be applied to niche fragmentation.
4.8.8. MacArthur Fraction Model
This model is used in predicting species abundance distributions and is known to produce the same results as the broken stick model. When using this model, the probability of niche fragmentation is inversely proportional to size. This leads to the creation of a very uniform distribution of species abundances. The method is considered plausible for small communities of taxonomically related species. The disadvantage of using this method is that un-replicated data are not compatible with it; the same applies to the broken stick model.
4.8.9. Dominance Decay Model
This model is known to produce a uniform pattern of species abundances. When using it, the niche space to be fragmented is selected at random. The dominance decay model has been used in prediction, but to date there is no empirical data indicating that communities of the kind predicted by the model are found in nature (Piegorsch and Edwards, 2002).
4.9. FITTING NICHE APPORTIONMENT MODELS TO EMPIRICAL DATA
More research has led to the development of new ways of testing stochastic models. In these cases, species are arranged and listed in decreasing order of abundance. A niche apportionment model is considered to fit when the mean observed abundance falls within the confidence limits of the expected abundance. In most cases, the mean abundance will constitute the observed distribution. An estimation is usually done when the assemblage has the same number of species. When using this approach, large values of N and n have to be chosen. It is easy to use because the confidence limits are assigned to each rank of expected abundance by considering n rather than the number of times the model was simulated (N).
4.9.1. Species Richness Indices
There are two well-known species richness indices, Menhinick's index and Margalef's index, and their advantage is that they are easy to calculate. Both indices represent attempts to correct the species count for sample size, though the challenge remains that both measures are strongly influenced by sampling effort. Both are considered meaningful indices that can be used in biological diversity research.
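Both indices are simple functions of the number of species S and the number of individuals N: Margalef's D = (S − 1)/ln N and Menhinick's D = S/√N. The sketch below uses hypothetical sample values.

```python
import math

def margalef_index(S, N):
    """Margalef's richness index: D_Mg = (S - 1) / ln(N)."""
    return (S - 1) / math.log(N)

def menhinick_index(S, N):
    """Menhinick's richness index: D_Mn = S / sqrt(N)."""
    return S / math.sqrt(N)

# Hypothetical sample: 42 species among 1,250 individuals.
S, N = 42, 1250
print(f"Margalef  D_Mg = {margalef_index(S, N):.2f}")
print(f"Menhinick D_Mn = {menhinick_index(S, N):.2f}")
```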
4.9.2. Estimation of Species Richness
There are two main methods used when estimating species richness from samples. The first approach involves the extrapolation of species accumulation or species-area curves, while the second approach uses non-parametric estimators (Bustos-Korts et al., 2016).
4.10. SPECIES ACCUMULATION CURVES
These kinds of curves are also known as collector curves. They plot S, the total number of species observed, as a function of sampling effort. Species accumulation curves are mostly used in botanical research. There are various kinds of such curves; some plot S versus A (area) for different areas, while others are used for increasingly larger parcels of the same region. The order of the samples is known to affect the overall shape of species accumulation curves, so the curves are made smoother by randomizing the sample order. These curves are known to resemble rarefaction curves. In most cases, accumulation curves move from left to right with the introduction of new species, whereas in ideal cases rarefaction curves conventionally move from right to left. In several studies, scientists have plotted species accumulation curves using linear scales on both axes (Pleil et al., 2014). Distinguishing asymptotic curves from logarithmic curves can be achieved through the use of a log-transformed x-axis, since semi-log plots make the identification process much easier. An extrapolation is done on the graph when estimating total species richness. When conducting the extrapolation process, the functions used can be classified as either asymptotic or non-asymptotic. They play different roles, and both are used to help users predict the increase in species richness with additional sampling effort rather than to estimate total species richness.
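A sample-based accumulation curve of the kind described above, smoothed by randomizing the sample order, can be computed as follows. The incidence matrix, the function name, and the number of permutations are all hypothetical.

```python
import numpy as np

def accumulation_curve(incidence, n_permutations=200, seed=0):
    """Sample-based species accumulation curve, smoothed by randomizing the
    order of samples.

    `incidence` is a samples x species presence/absence (0/1) matrix; the
    returned vector is the mean number of species accumulated after
    1, 2, ..., n samples across random sample orderings."""
    incidence = np.asarray(incidence, dtype=bool)
    n_samples = incidence.shape[0]
    rng = np.random.default_rng(seed)
    totals = np.zeros(n_samples)
    for _ in range(n_permutations):
        order = rng.permutation(n_samples)
        seen = np.cumsum(incidence[order], axis=0) > 0   # species seen so far
        totals += seen.sum(axis=1)                       # richness after each sample
    return totals / n_permutations

# Hypothetical incidence matrix: 5 samples (rows) by 6 species (columns).
matrix = [[1, 1, 0, 0, 0, 0],
          [1, 0, 1, 0, 0, 0],
          [0, 1, 1, 1, 0, 0],
          [1, 0, 0, 1, 1, 0],
          [0, 0, 1, 0, 1, 1]]
print(np.round(accumulation_curve(matrix), 2))
```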
4.10.1. Asymptotic Curves Asymptotic curves can be generated through two main methods. The first method involves the use of a negative exponential model while the second method makes use of the Michaelis-Menten equation (Figure 4.9).
Figure 4.9. There are two methods of generating asymptotic curves. Source: https://www.researchgate.net/figure/Graph-of-the-maximal-asymptotic-yield-Hx-as-a-function-of-the-harvesting-threshold-x_fig8_320190990.
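As an illustration of the second method mentioned above, the Michaelis-Menten curve S(n) = S_max·n/(B + n) can be fitted to accumulation data with a standard least-squares routine; S_max is then the asymptotic richness estimate. The effort and richness values below are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(n, s_max, b):
    """Asymptotic accumulation model: S(n) = S_max * n / (B + n)."""
    return s_max * n / (b + n)

# Hypothetical accumulation data: sampling effort and species observed.
effort   = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
richness = np.array([5, 9, 14, 20, 25, 28, 30], dtype=float)

# Fit the curve; s_max is the estimated asymptotic (total) species richness.
(s_max, b), _ = curve_fit(michaelis_menten, effort, richness, p0=[35.0, 5.0])
print(f"Estimated total richness S_max = {s_max:.1f} (half-saturation B = {b:.1f})")
```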
4.10.2. Non-Asymptotic Curves
Non-asymptotic curves are used in estimating species richness. It has been proposed that the relationship between species and area can best be described by a log-linear model that can be extrapolated to a larger area. To avoid extremely high estimates of species richness, an asymptote can be imposed on the log-log species-area curve (Chai et al., 2020). Under parametric methods, there are the log-normal distribution and the log series distribution, which are considered the two most potent abundance models. The log series distribution is considered the easiest to fit and is very simple to apply. The log series model allows users to obtain a good estimate of total species richness when the number of individuals in the target area can be estimated, although it may lead to the underestimation of S. This method is mostly used during rarefaction. When fitting a continuous log-normal distribution, a pragmatic approach is most likely to be adopted; this is considered inappropriate when the observed data are in discrete form. The method is nonetheless widely used, as it has the property of generating a mode in the second or third abundance class, thus giving the appearance of a log-normal distribution even when the data do not follow one.
4.11. USERS OF ENVIRONMENTAL DATA
There are various kinds of users of environmental data. The type of use determines the type of environmental statistics required, their levels of thematic, spatial, and temporal aggregation, and their format. Such data are usually modified so as to meet the intended use. For instance, decision-makers and policymakers make use of environmental indicators and more aggregated statistics (Parise et al., 2014). The general public, including civil society and the media, also makes use of more aggregated statistics and environmental indicators. Academia, researchers, and analysts make use of extensive and detailed environmental statistics (Figure 4.10).
Figure 4.10. Government agencies utilize environmental data. Source: https://www.researchgate.net/figure/Distribution-of-environmentaldifferences-in-government-agencies_fig4_353910622.
When discussing products of environmental statistics, some of the main products include environmental indicators and detailed descriptive environmental statistics series. These products of statistics can be disseminated in the form of thematic reports, publications, online databases, and analytical publications. They can also be stored in multi-purpose databases. There are various kinds of environmental information that include
environmental data, environmental indicators, environmental statistics, environmental indices, and environmental economic accounts.
4.12. ENVIRONMENTAL INFORMATION
This kind of information describes qualitative, quantitative, or geographically referenced facts that represent the current or previous state of the environment and some of the changes it has undergone. Quantitative environmental information is made up of data, indicators, and statistics, and is generally disseminated through spreadsheets, databases, compendia, and yearbook-type products. Qualitative environmental information, on the other hand, is made up of descriptions of the environment, or of those of its constituent parts that cannot be adequately represented by accurate quantitative or geographically referenced descriptors (Patil and Taillie, 2003). Finally, geographically referenced environmental data provide information on the environment and its components through satellite imagery, digital maps, and other sources linked to a location or map feature (Figure 4.11).
Figure 4.11. Environmental data is used in environment statistics. Source: https://www.istockphoto.com/search/2/image?phrase=digital+environment.
Under quantitative environmental information, there is environmental data. Environmental data are defined as large amounts of unprocessed observations and measurements about the environment and its components and all related processes. Environmental data can be collected or compiled by NSOs, sectoral authorities, and environmental ministries, among others (Notaro et al., 2019). There are different sources of environmental data, including administrative records, inventories, registers, remote sensing, monitoring works, scientific research, statistical surveys, and field studies. Another part of quantitative environmental information is environmental statistics. Such statistics structure, synthesize, and aggregate environmental data, among other kinds of data. Environmental statistics subject collected data to processing, producing meaningful statistics that describe the state and trends of the environment and some of the processes affecting it. It is important to note that not all environmental data are used in the production of environmental statistics. For instance, the FDES provides a framework that identifies the environmental data that fall within its scope (Christensen and Himme, 2017), and it contributes to the synthesizing, structuring, and aggregation of data into statistical series and indicators. In environmental statistics, there are units involved in the compilation, collection, validation, description, and structuring of environmental data to produce environmental statistics series (Figure 4.12).
Figure 4.12. Environment indicators give information on the state of the environment.
Source: https://www.shutterstock.com/search/environmental+indicators.
Environmental indicators also fall under quantitative environmental information. They are produced because environmental statistics typically need further processing and interpretation: in most cases, environmental statistics are too numerous and detailed to satisfy the needs of the general public and policymakers. Environmental indicators are used for a variety of reasons. They are involved in the synthesis and presentation of complex statistics and are considered measures that summarize, simplify, and communicate information. They are used in the definition of objectives and in assessing present and future direction with respect to goals and targets. Environmental indicators are used in evaluating specific programs, demonstrating progress, and measuring change in a situation over time or in a specific condition. They are also used when determining the impact of programs and conveying messages (Ng and Vecchi, 2020). Environmental indicators are developed using an indicator framework. One example is the drivers, pressures, state, impact, and response (DPSIR) model of intervention. Policy frameworks, such as the MDGs, are also used. All of these are used in the identification and structuring of indicators, and various SD indicator and regional environmental frameworks are already in place for use (Figure 4.13).
Figure 4.13. Environmental indices are part of environmental information. Source: Environmental sustainability indicators, researchgate.net.
Environmental indices are also part of environmental information. They are defined as more complex or composite measures that combine and synthesize more than one statistic or indicator, weighted according to different methods. There are numerous benefits of using environmental indices, among them the fact that an index can provide a valuable summary measure that communicates important messages in a popular way, therefore raising awareness. Among the limitations of environmental indices is that they may raise questions related to their methodological soundness, the quality of the underlying statistics, their proper interpretation, and the subjectivity of the weighting.
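As a schematic illustration of how an index can weight and combine several indicators, the sketch below normalizes hypothetical indicator values and takes a weighted average. The indicators, weights, and orientation choices are purely illustrative; real indices follow documented methodologies.

```python
import numpy as np

def composite_index(indicators, weights, higher_is_better):
    """A simple weighted composite index: min-max normalize each indicator to
    [0, 1], reverse it where lower raw values are better, then take the
    weighted average across indicators for each unit (e.g., region)."""
    x = np.asarray(indicators, dtype=float)       # rows: units, cols: indicators
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                               # weights sum to 1
    norm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    for j, better in enumerate(higher_is_better):
        if not better:                            # e.g., a pollutant concentration
            norm[:, j] = 1.0 - norm[:, j]
    return norm @ w

# Hypothetical data for three regions: forest cover (%), PM2.5 (ug/m3), protected area (%).
data = [[40.0, 25.0, 12.0],
        [55.0, 18.0, 20.0],
        [30.0, 35.0,  8.0]]
print(np.round(composite_index(data, weights=[0.4, 0.4, 0.2],
                               higher_is_better=[True, False, True]), 3))
```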
4.13. SOURCES OF ENVIRONMENTAL STATISTICS
There is a wide range of sources of environmental data, all of which are synthesized by environmental statistics. After data have been collected for use in environmental statistics, they are compiled by many different collection institutions and techniques. Usually, those conducting the analysis in environmental statistics know and understand the pros and cons of each source. The sources of environmental data include statistical surveys, such as sample or census surveys of population, agriculture, housing, enterprises, employment, households, and different aspects of environmental management. Government and non-government agencies in charge of natural resources can also generate administrative records, which can be considered a source of environmental data; the same applies to authorities and ministries (Marohasy, 2003). Other sources of environmental data are remote sensing and thematic mapping, which include satellite imaging of water bodies, land use, or forest cover. Monitoring systems provide data on various phenomena; there are field monitoring stations for water quality, climate, or air pollution. Finally, there are special research and scientific research projects undertaken to fulfill both international and national demand for environmental data.
4.13.1. Statistical Surveys This kind of survey can be used to generate data through two main methods, which include the use of sample surveys and censuses. Censuses involve the collection of data from the entire population of interest, while sample surveys are usually carried out using a specific sampling method whereby data collected from a given portion will represent the entire population of
interest. Surveys can be used in the collection of environmental statistics. This is achieved by adding environment-related questions to surveys primarily intended to collect data on other topics, or through surveys that are primarily intended to collect environmental statistics. In most cases, environmental data are collected through environment statistics surveys, which are designed according to the objective of producing environment statistics. When conducting environmental statistics surveys, it is important to note that they are not always economical or feasible with restricted budgets. Data can also be collected from other existing sectoral, economic, social, and demographic statistical surveys whose objectives differ from the generation of environmental statistics (Dutilleul et al., 2000).
4.13.2. Administrative Records
Such data are usually kept by organizations or government agencies and can be used in producing environmental statistics. There are numerous benefits of using administrative records as a source of environmental statistics: less is spent on collecting the data, the response burden is minimized, and complete coverage of the units under administration is possible. There are also some disadvantages of using administrative records: there are differences between statistical and administrative terms and definitions; there is a greater risk of deliberate misreporting; collected data may not be checked or validated for statistical purposes; access to the data may be limited; and coverage of the data may not meet statistical requirements.
4.13.3. Remote Sensing and Mapping
Remote sensing is defined as the science of obtaining information about areas or objects from a distance, typically from satellites or aircraft. The use of remote sensing makes it possible to replace costly and slow data collection on the ground, thereby ensuring that objects or areas are not disturbed during the process. It also makes it possible to collect data in areas considered inaccessible or dangerous. Some of the platforms used in remote sensing include helicopters, balloons, ships, buoys, spacecraft, aircraft, and satellites. Results generated from remote sensing can be observed, tracked, imaged, or mapped. For instance, remote
sensing data can be captured and analyzed so as to measure forest cover, compare the impacts of natural disasters, measure changes in the area of soil erosion, measure changes in land cover, measure the extent of pollution, or estimate the populations of different animal species (Farrell et al., 2010). Where remote sensing is combined with sufficient validation using actual measurements in the field, it can provide high-quality data for environment statistics.
4.14. MONITORING SYSTEMS
Monitoring systems are comprised of field-monitoring stations used in describing the quantitative and qualitative aspects of environmental media such as water, soil, or air quality, and meteorological or hydrological parameters and characteristics. The use of monitoring systems to generate data for environmental statistics is very advantageous: the data are usually reliable as time series, usually validated, usually collected using verifiable scientific methods, and often improved through modeling. Some limitations are that field monitoring stations are usually located in hot spot areas with high levels of pollution, in areas considered highly sensitive, or where large numbers of the population are affected. Measurements from monitoring systems are therefore likely to be location specific, and as a result of these challenges to representativeness, it is very difficult to aggregate them over space.
4.15. SCIENTIFIC RESEARCH
Data from scientific research are often used in environmental statistics because this method of data collection is very advantageous. Data from scientific research are usually available for low cost or for free. Such data are known to minimize response burden; they are at times used to fill in data gaps and are very useful in developing coefficients for models. There are also some limitations. Data from this source may use terms and definitions different from those used in statistics, there may be difficulties in accessing microdata, metadata may be missing, data may be available on a one-time basis, and often data are available only for case examples (Madavan and Balaraman, 2016). There are certain classifications or groupings that are relevant to environmental statistics. Currently, there is no single overarching,
internationally agreed environmental statistics classification. Researchers make use of co-existing and emerging classifications, groupings, and categorizations for specific subject areas that are relevant to environmental statistics. In some cases, environmental statistics make use of specific classifications such as the Classification of Environmental Activities, the UN Classification for Energy and Mineral Resources, and the FAO Land Cover Classification System. They also make use of categories, classifications, and groupings such as those for protected areas and for threatened species, some of which were not developed for statistical purposes. Beyond these, environmental statistics make use of social, economic, and demographic classifications, including the Central Product Classification, the International Standard Industrial Classification of All Economic Activities, and the International Classification of Diseases. These classifications are very important as they facilitate the integration of environment statistics with social and economic demographic statistics.

There are also temporal considerations in environmental statistics, including the use of different time scales; shorter or longer time periods have been found useful for aggregating environmental data over time. Determining the appropriate temporal aggregation and periodicity of production of environmental statistics may require certain considerations, all of which depend on the nature of the measured phenomena. Environmental data are often produced at different intervals; when there are enough data points in each period, the environment statistics based on such data can still be produced at regular intervals. Another temporal consideration arises when dealing with environmental phenomena themselves. Environmental phenomena are usually fluid, so careful consideration of the temporal dimension is required, as there are factors that can influence measurement (Kobayashi and Ye, 2014), including runoff, snow, floods, and ebbs and flows in the case of water resources. In some cases daily variations are exhibited, and in other cases the variations are seasonal, depending on what is being measured. Seasonal variations can be seen in fluctuations in temperature, precipitation, certain types of fish biomass, ice cap surface, surface water levels, or the incidence of fires. When monitoring seasonal variations, more focus is placed on certain months compared to others.

Other than temporal considerations, there are spatial considerations, which arise when dealing with space. These may include
the occurrence and impacts of environmental phenomena that are distributed through space without regard for political-administrative boundaries. In environmental statistics, some meaningful spatial units are natural units, such as ecosystems, watersheds, landscapes, land cover units, and eco-zones. Natural units also include planning and management units based on natural units, such as coastal areas, river basin districts, and protected areas. Administrative units, by contrast, are usually the basis for the aggregation of economic and social statistics (Khormi and Kumar, 2011). This difference is said to complicate the analysis and collection of environment statistics, especially when there is a great need to combine them with data that originate from economic and social statistics. Challenges related to spatial analysis are often overcome by using geo-referenced data.
4.16. GEOSPATIAL INFORMATION AND ENVIRONMENT STATISTICS
Geospatial information presents the location and characteristics of different attributes of the surface, sub-surface, and atmosphere. This kind of information is also used in describing, displaying, and analyzing data that have discernible spatial aspects, such as natural disasters, water resources, and land use, among others. It also allows the visual display of different statistics when a map-based layout is used. There are numerous benefits of geospatial information, including ease of handling, in that users can easily work with and understand such data. Geospatial data also enable deeper analysis of the relationships among phenomena such as environmental health, environmental quality, and population.
4.16.1. GIS, Remote Sensing, and Satellite Images
Satellites are used in generating remote sensing data. These data are usually acquired digitally and communicated for processing and analysis in a GIS, where digital satellite images can be analyzed to produce maps of land use and land cover. Different kinds of geospatial data are combined in a GIS; in most cases, such data have to be transformed to ensure they fit the same coordinates. A good example is when satellite remote sensing land use information is combined with aerial photographic data on housing development growth (Fourcade et al., 2018). To work effectively, a GIS utilizes the processing power of a computer alongside geographic mapping techniques to transform data from different sources onto one projection and one scale so that they can be analyzed together.
4.16.2. Geographic Information System (GIS)
The geographic information system (GIS) is a computer system with the ability to capture, store, analyze, and display geographically referenced information. Various technologies are used in acquiring geospatial data; examples include remote sensing satellites and the Global Positioning System. There are two main ways of entering collected data. The first is manual entry, as is the case for land-use information, landscape features, and demographics. The other is digital entry, where data from a map are entered in digital format through electronic scanning. The final representation of the data is constructed by superimposing different layers of information as required by analytical and policy needs.
4.17. INSTITUTIONAL DIMENSIONS OF ENVIRONMENT STATISTICS
Among the elements considered very important in the development of environmental statistics at the national level is technical capacity. Environmental statistics is a multi-disciplinary field with a cross-cutting nature, and the production of environmental data and statistics usually involves various stakeholders, producers, and actors. Among the challenges faced by countries with regard to environment statistics are insufficient institutional development, overlapping mandates and functions, and inadequate inter-agency coordination, among other institutional issues (Fuentes et al., 2007). Some of these challenges are experienced at the international level, where multiple partner agencies can operate with different mandates, production timetables, and work programs. Resolving institutional concerns is very important, as most of these concerns affect environment statistics. One important step in resolving them is the identification of the primary institutional obstacles that impede the production of environment statistics; the next step is to develop a strategy to overcome the issue. This is very important for countries that are keen to develop or strengthen their environment statistics programs. Some of the key elements of the institutional dimension that need to be considered and dealt with simultaneously during the development of environment statistics include inter-institutional collaboration, institutional development, the legal framework, a clear mandate, and institutional cooperation of national, regional, and global bodies. Under the legal framework, the chosen framework should be relevant to environment statistics production and include environmental,
statistical, and sectoral legislation, such as that for agriculture, energy, and water. National statistical legislation is given the responsibility of creating and coordinating the national statistical system (Frankenhuis et al., 2019). Regulations and laws often do not explicitly refer to environment statistics, and in some countries there are insufficient guidelines for statistical coordination among relevant statistical parties. Environmental ministries typically handle the responsibility for national environmental information systems. Certain challenges are encountered within such a complex institutional context, including duplication of efforts, overlapping of mandates, and other coordination difficulties. The increasing importance of the environment in the development agenda has led NSOs to include the production of environment statistics in their programs, though challenges such as a loss of clarity may be faced.
4.18. IMPORTANCE OF ENVIRONMENTAL STATISTICIANS
In the field of environmental statistics, those who conduct research are known as environmental statisticians. They provide critical insights on the environment and guidance on how to make the world a better place. When involved in ecological studies, they are able to study and analyze the relationships between various organisms and the environment, which enables them to predict changes in those relationships. They are well suited to creating models that help quantify and explain variability in certain areas, from the microscopic scale up to entire biological systems. They give more information on issues that concern world populations, including the effects of pollution, the biodiversity crisis, and the potential effects of global climate change on ecosystems; the information they obtain from statistical analysis is vital in understanding some of these issues (Gallyamov et al., 2007). They also make predictions of climate change. These scientists are able to measure and model the Earth's current circumstances and create future projections. They work hand in hand with governments, educating decision-makers in various parts of the world on how nations can handle the global climate. They also evaluate how Earth's resources have been used, usually working with governments and independent organizations to ensure that natural resources are well managed. They provide useful insights used in forming policies on issues such as climate change, conservation, and alternative energy.
They also help improve pollution standards. These statisticians work closely with environmental scientists in regulating pollution standards, making meaningful change by improving the environment and human health (Köhl et al., 2000). They are involved in measuring and analyzing data from the Earth's atmosphere, soil, and water systems, and the insights learned from statistical methods are used in developing safe and efficient systems for offsetting the impact of pollution on the environment. Generally, environmental statistics is very important in the proper management of the environment. It provides information that is used by government agencies and other organizations in making laws and regulations that will ensure that the environment is safe from environmental threats.
CHAPTER 5
TYPES OF DATA SOURCES
CONTENTS
5.1. What Are Sources of Data?
5.2. Types of Data Sources
5.3. Statistical Surveys
5.4. Collection of Data
5.5. Processing and Editing of Data
5.6. Estimates and Projections Are Created
5.7. Analysis of Data
5.8. Procedures for Review
5.9. Dissemination of Information Products
5.10. The Benefits of Administrative Data
5.11. Limitations of Administrative Data
5.12. Obtaining and Learning from Administrative Data
5.13. Remote Sensing and Mapping
5.14. Technologies of Digital Information and Communication
5.15. Environmental Monitoring Types
5.16. IoT-Based Environmental Monitoring
5.17. Reasons for Environmental Monitoring
5.18. Data from Scientific Research and Special Projects
5.19. Global and International Sources of Data
5.20. Key Government Databases
5.21. A Data Source Is the Location Where Data That Is Being Used Originates From
5.22. Data Source Types
5.23. Sources of Machine Data
A data source is a location from which information is obtained. The source can be a database, a flat file, an XML file, or any other configuration that a system can read. The input is saved as a set of records containing information used in the business process: customer information, accounting figures, sales, logistics, and other data can all be included. Such knowledge can assist businesses in responding to changes in the market, overcoming logistical challenges, and identifying innovative methods to enhance customer satisfaction, and these particulars can give a unique perspective on a company's operations. Data sources often include data that has already been collected as well as data that will be gathered during the course of the study (Kessler et al., 2015). "Data sources" is a term that can be used to describe various data collection techniques and/or tools; there are no guidelines or specific requirements concerning how this should be documented, rather it is a tool to help characterize a study whenever data need to be collected.

Data sources can vary depending on the application or field in question. Depending on their purpose or function, computer applications can have numerous data sources defined. Databases are used as the main source of data by applications such as relational database management systems (RDBMSs) and even websites. The environment serves as the primary data source for hardware like input devices and sensors. A temperature and pressure regulation system for a fluid circulation system, like those used in industrial facilities and oil refineries, takes in all pertinent environmental data for whatever it is monitoring; thus, the data source in this case is the environment (Girshick et al., 2011). Data such as fluid temperature and pressure are collected on a regular basis by sensors and stored in a database, which then serves as the primary source of data for another software program that manipulates and displays this data. The term data source is most commonly used in relation to databases and DBMSs, or any system that mainly deals with data, and is referred to as a data source name, which is represented in the application so that it can locate the data. It simply means what it says: where the data is coming from.

Data is the foundation of any data analysis performed during the research process; data analysis and interpretation are based solely on the collection of various data sources. Data refers to unorganized statistical facts and figures gathered from various sources. The researcher or analyst collects data in order to gather information (Kandlikar et al., 2018). Data sources can differ based on the need for data for the research project.
The type of data also has an impact on data collection. All data is classified into two types: primary and secondary data. Both kinds of data are gathered from various data sources. The sources are trustworthy and broadly used for gathering specialized data about the research project.
5.1. WHAT ARE SOURCES OF DATA?
Data analysis begins with gathering all sources of data, either through primary or secondary research. A data source refers to the collection of statistical facts and figures, or non-statistical material, gathered by the researcher or analyst to further the research work. Data sources are primarily of two types:
• Sources of statistical data; and
• Sources of census data.
Both data sources are widely used by researchers in their research. Primary or secondary research methods are used to collect data from these data sources.
5.2. TYPES OF DATA SOURCES
Both data sources are common in the research field and are effective at gathering information (Figure 5.1).
Figure 5.1. Types of data sources. Source: https://www.marketing91.com/sources-of-data/.
5.2.1. Statistical Data Sources
Statistical data sources are data sources used for official purposes such as surveys and other statistical reports. Data are collected by asking respondents a series of questions, which can be qualitative or quantitative; qualitative data are of the non-numerical variety, whereas quantitative data are of the numerical variety (Giraldo et al., 2010). The data sampling method employs both types of statistical data. A statistical survey is typically carried out by conducting a sample survey. The method entails gathering sample data and examining it later using statistical methods and tools. A questionnaire process can also be used to conduct the surveys in this case.
5.2.2. Census Data Sources The data for this method is gathered from a previously published census report. It is the polar opposite of statistical polls. The Census method relies on all of the population items that are further investigated for the research process. The data is gathered over a specific time period known as reference time. The researchers execute their research at a specific time and then analyze it further to reach a conclusion. The country conducts a census for official purposes. The participants are asked questions and must respond. This interaction can take place in person or over the phone. The census, on the other hand, is time-consuming and laborious because it involves the entire population (Gotelli et al., 2012). Aside from the data sources mentioned above, some other data sources are recognized for data collection. These include:
5.2.3. Internal Data Sources
Internal data sources are those from which data are derived from reports and records compiled within the organization. Internal data sources are used to conduct primary research on a specific subject, which makes the research work straightforward. As a researcher, you can collect data from internal sources such as accounting resources, sales force reviews, internal experts, and miscellaneous reports.
5.2.4. Data from Outside Sources
When data is collected outside of the institution, it is referred to as an external source of data. The data sources are external in every way; as a researcher, you can work on external data collection. Data collected from external sources is more difficult to collect because the data is more diverse, and external sources can be numerous (Girshick et al., 2011). External data sources can be classified into the following categories:
• Publications from the Government: Researchers can consider an extremely rich range of data from government sources. Furthermore, much of this information is available for free on the internet.
• Publications by Non-Governmental Organizations (NGOs): Researchers can also find various non-government publications that provide various industry-related data. The only disadvantage of non-government publications is that their data may be biased at times.
• Services of Syndication: Certain organizations provide such services, and in the process, they consistently collect and tabulate marketing data for various clients. Surveys, mail diary panels, e-services, wholesalers, manufacturing sectors, retailers, and other methods are used to collect data from households (John et al., 2021).
5.2.5. Data Sources from Experiments Data is collected from experiments and other associated hardware in this data source. The researcher conducts an experiment to collect all data. Experiment data is statistical and factual. Researchers can learn about various methodological approaches that can be used to conduct experiments. The four most common experimental designs are as follows:
• Completely Randomized Design (CRD): Treatments are assigned at random in a CRD, so each experimental unit has the same likelihood of obtaining any treatment. It is suitable when the experimental materials are homogeneous. This strategy is simple to grasp, offers a good deal of flexibility, and provides the most degrees of freedom (a randomization sketch follows Figure 5.2).
• Randomized Block Design (RBD): In agriculture, it is the most frequently used experimental design. It is thought to be more precise and productive than the CRD (Iwai et al., 2005). It also provides more flexibility and allows for simple statistical analysis, which is appealing to researchers. Even so, it is not recommended for a very wide range of treatments.
• Latin Square Design (LSD): The experimental material is split into 'm' rows, 'm' columns, and 'm' treatments. Statistical analysis is relatively simple in this method, and it is thought to be the most efficacious design compared to the CRD and RBD. However, it is difficult to implement for agricultural treatments and can be intimidating when there are multiple treatments.
• Factorial Designs (FD): Factorial experiments are used to determine the effects of more than one factor. Investigators can use this to study a problem that is influenced by a large number of variables (Figure 5.2).
Figure 5.2. Factorial design experiments. Source: https://www.marketing91.com/sources-of-data/.
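The CRD randomization referred to in the first bullet above can be carried out with a few lines of code. The plot names, treatments, and replicate count below are hypothetical.

```python
import random

def completely_randomized_design(units, treatments, replicates, seed=None):
    """Assign treatments to experimental units completely at random (CRD):
    every unit has the same chance of receiving any treatment."""
    if len(units) != len(treatments) * replicates:
        raise ValueError("number of units must equal treatments x replicates")
    labels = [t for t in treatments for _ in range(replicates)]
    random.Random(seed).shuffle(labels)          # random allocation of labels to units
    return dict(zip(units, labels))

# Hypothetical layout: 12 plots, 4 treatments (A-D), 3 replicates each.
plots = [f"plot_{i:02d}" for i in range(1, 13)]
layout = completely_randomized_design(plots, ["A", "B", "C", "D"], 3, seed=7)
for plot, treatment in layout.items():
    print(plot, treatment)
```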
5.3. STATISTICAL SURVEYS
5.3.1. Survey Planning
Agencies generating a new survey, or a major revision of an existing survey, must create a written plan that includes the following:
goals and objectives; potential users; the decisions the survey is intended to inform; key questionnaire estimates; the accuracy required of the projections (e.g., the size of differences that must be detected); tabulations and analytic outcomes that will inform decisions and other uses; linked and previous surveys; steps taken to avoid duplicating information from other sources; when and how frequently users require the data; and the level of detail required in calculations, highly classified microdata, and public-use data files (Homma et al., 2020).
5.3.2. Survey Design Agencies must create a survey design, which includes defining the target population, creating a sampling plan, specifying the data collection instrument and methods, creating a realistic timeframe and cost approximation, and selecting samples using commonly accepted statistical techniques. Nonprobability sampling methods must be statistically justified and capable of measuring estimation error. The sample size and design must reflect the level of detail required in tabulations and other data products, as well as the precision required for key estimates. Documentation of all of these activities and the resulting judgments must be kept in the project files for future reference (Figure 5.3).
Figure 5.3. A basic questionnaire in the Thai language. Source: https://en.wikipedia.org/wiki/Survey_methodology#/media/ File:Questionaire_in_Thai.png.
5.3.3. Survey Response Rates
To make sure that survey data are indicative of the target population and can be used with confidence to inform decisions, agencies must design the survey to obtain the maximum practicable response rates, proportionate with the significance of the survey's uses, participant burden, and data collection costs. Nonresponse bias analyses must be performed when unit or item response rates or other factors indicate the possibility of bias (Higuchi and Inoue, 2019).
5.3.4. Pretesting Survey Systems
Agencies must ensure that all components of a survey function as intended when implemented in the full-scale survey, and that measurement error is controlled, by conducting a pretest of the survey components or by having successfully fielded those components on a previous occasion.
5.4. COLLECTION OF DATA
Sampling: Agencies must ensure that the sampling frames for the intended sample survey or census are appropriate for the study design and are quality-checked against the target population.
Notifications to potential survey respondents are required. Agencies must ensure that each data collection instrument clearly states: the reasons the information is being collected; how the information will be used to further the agency's functions; whether responses are voluntary or mandatory (citing authority); the nature and extent of confidentiality to be provided, if any (citing authority); an estimate of the average respondent burden, together with a request that the public direct to the agency any comments on the accuracy of this burden estimate and any suggestions for reducing it; the OMB control number; and a statement that an agency may not conduct, and no person is required to respond to, a data collection request unless it displays a currently valid OMB control number.
5.4.1. Methodology of Data Collection
Agencies must design and administer their data collection methods in a way that maximizes data quality and controls measurement error while minimizing respondent burden and cost.
5.5. PROCESSING AND EDITING OF DATA
• Data Editing: To mitigate or correct detectable errors, agencies must edit the data appropriately based on available information.
• Analysis of Nonresponse and Calculation of Response Rates: To assess the effects of unit and item nonresponse on data quality and to inform users, agencies must appropriately measure, adjust for, report, and evaluate unit and item nonresponse. As an indicator of potential nonresponse bias, response rates must be calculated using standard formulas to measure the proportion of the eligible sample represented by the responding units in each study (a small worked sketch of such a calculation follows this list).
• Coding: So that users can analyze the data properly, agencies must add codes to the collected data to identify aspects of data quality in the collection (e.g., missing data). To improve comparability, codes added to convert data collected as text into a form that permits immediate analysis must use standardized codes when available (Zio et al., 2004).
• Evaluation: To allow users to interpret the results of analyses and to help designers of recurring surveys focus improvement efforts, agencies must evaluate the quality of the data and make the evaluation public (through technical notes and documentation included in reports of results, or through a separate report).
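Following on from the response-rate requirement above, here is a small worked sketch of a unit response rate calculation in Python. The case counts are hypothetical, and the formula (completed interviews divided by the estimated number of eligible cases) follows the general logic of standard response rate formulas such as those published by AAPOR; an agency would substitute its own case dispositions and its formally adopted formula.

complete   = 640    # completed interviews
refusals   = 210    # eligible non-respondents
noncontact = 95     # eligible, never reached
unknown    = 120    # eligibility could not be determined
ineligible = 35     # screened out of the target population

# Estimated share of the unknown-eligibility cases that are truly eligible
known = complete + refusals + noncontact + ineligible
e = (complete + refusals + noncontact) / known

eligible_est = complete + refusals + noncontact + e * unknown
response_rate = complete / eligible_est
print(f"Estimated eligibility rate e = {e:.3f}")
print(f"Unit response rate = {response_rate:.1%}")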
5.6. CREATION OF ESTIMATES AND PROJECTIONS
5.6.1. Creating Estimates and Forecasts
When deriving direct survey-based estimates, as well as model-based estimates and projections based on survey data, agencies must use accepted theory and methods. Error estimates must be calculated and disseminated to aid in determining the suitability of the estimates or
projections for their intended uses. Agencies must plan and carry out evaluations to assess the accuracy of their estimates and projections (Figure 5.4).
Figure 5.4. Karl Pearson, a founder of mathematical statistics. Source: https://en.wikipedia.org/wiki/Statistics#/media/File:Karl_Pearson,_1910.jpg.
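As a minimal illustration of producing a direct survey-based estimate together with a measure of its error, the Python sketch below computes a design-weighted mean of a hypothetical pollutant measurement and an approximate standard error. The values, weights, and simplifying assumption of independent observations are illustrative; a real design-based analysis would also account for stratification and clustering.

import numpy as np

y = np.array([12.1, 9.8, 15.3, 11.0, 13.7, 10.2, 14.9, 12.6])   # measurements
w = np.array([1.0, 2.5, 1.8, 1.2, 3.0, 2.0, 1.5, 2.2])           # survey weights
n = len(y)

mean_w = np.sum(w * y) / np.sum(w)                # weighted (Hajek-type) mean

# Approximate variance of the weighted mean, treating the sample as drawn
# with replacement and observations as independent (no strata or clusters)
resid = y - mean_w
var_w = n / (n - 1) * np.sum((w * resid) ** 2) / np.sum(w) ** 2
se_w = np.sqrt(var_w)

print(f"Weighted mean = {mean_w:.2f}, approximate SE = {se_w:.2f}")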
5.7. ANALYSIS OF DATA
5.7.1. Analysis and Report Preparation
Before a specific analysis begins, agencies must develop a plan for the analysis of survey data to ensure that statistical tests are used appropriately and that adequate resources are available to complete the analysis (Antweiler and Taylor, 2008).
5.7.2. Comparisons and Inference
Statements of comparison and other statistical conclusions derived from survey data must be based on acceptable statistical practice.
5.8. PROCEDURES FOR REVIEW
Review of information products: Agencies must implement appropriate content/subject-matter, statistical, and methodological review procedures to comply with OMB and agency information quality guidelines.
5.9. DISSEMINATION OF INFORMATION PRODUCTS
5.9.1. Information Dissemination
Agencies must release information intended for the general public according to a dissemination plan that provides equal, timely access to all users and informs the public about the agencies' dissemination policies and procedures, including those related to any planned or unanticipated data revisions.
Data protection and disclosure avoidance for dissemination: When releasing information products, agencies must strictly honor any pledge of confidentiality made to respondents, as well as applicable federal regulations.
5.9.2. Documentation of Surveys
Agencies must produce survey documentation that includes the materials needed to understand how to properly analyze data from each survey, as well as the information needed to replicate and evaluate each survey's results (Ying and Sheng-Cai, 2012). Users must have easy access to survey documentation, unless access is restricted to protect confidentiality.
5.9.3. Microdata Documentation and Distribution
Agencies that release microdata to the general public must include documentation that clearly describes how the information is constructed and must provide the metadata required for users to access and work with the data. Public-use microdata files and their metadata must be readily accessible to users.
5.9.4. Administrative Records
An administrative record is any type of document that relates to how an entity is organized or run. Professionals keep administrative records to track operations, meetings, clients, and finances. In some cases an administrative record may be required for legal purposes. These documents are frequently stored electronically; when official copies are required, professionals may print documents on watermarked or sealed paper to show that they are approved records. A business manager may use an administrative record to make economic decisions. For instance, when deciding how much to budget for a specific project, he or she may consult records to see how much was spent previously (Adelman, 2003). Financial administrative records can also be
used in auditing processes conducted internally for financial management purposes or by regulatory agencies (Figure 5.5).
Figure 5.5. A manager keeping an administrative record of a meeting. Source: https://www.wise-geek.com/what-is-an-administrative-record.htm.
Administrative data are a valuable resource for social science research. School records, for example, have been used to find patterns in students' academic performance. Administrative data are information gathered as part of the administration and operation of a publicly funded program or service (Xia et al., 2017). Such data are increasingly used in early childhood care and education studies today. They are frequently a relatively low-cost way to learn about the people and families who use a specific service or participate in a specific program, although they do have some significant limitations.
The benefits and drawbacks of using administrative data, and the issues surrounding data access, are discussed under the following headings:
• Benefits of administrative data;
• Limitations of the data;
• Acquiring and learning about administrative data;
• Sampling.
It is also common to keep an administrative record for the purpose of facility management. When a manager is responsible for overseeing facilities to ensure that they comply with safety and health regulations, such records can help demonstrate that he or she took the correct precautions. If an incident occurs at a facility, the manager can produce the required administrative records to show that he or she is not liable for any damages or bodily harm. Administrative documentation can also be extremely useful for contract compliance. If a person is accused of acting in a way that violates the terms of a contract, legal action may be taken, and administrative records can be used to help establish innocence or guilt. Administrative records are frequently used when a company or government agency is suspected of operating in a manner that harms the environment, public health, or safety. A judge in such cases may order that an institution submit its administrative record (Wieland et al., 2010). Here, the record is a paper trail that helps legal officials understand the actions an organization took, and judges can use these documents to determine whether an organization acted arbitrarily or capriciously. Any professional in an organization who has access to information about managerial decisions can keep an administrative record. Managers and executives may hire assistants to keep track of meetings and actions taken. Digitally tracked work, such as financial transactions and data entry, can also form part of an administrative record (Figure 5.6).
Figure 5.6. During meetings, a note taker is typically assigned to keep an administrative record that will be distributed later. Source: https://www.wise-geek.com/what-is-an-administrative-record. htm#comments.
5.10. THE BENEFITS OF ADMINISTRATIVE DATA
• Administrative data allow for analyses at the local and state levels that would be impossible with national survey data.
• Such data frequently include detailed, accurate measures of participation in numerous social programs.
• They typically include a large number of cases, allowing for a wide range of analyses.
• Longitudinal and trend studies can be conducted using data on the same individuals and/or programs over time (Wagner and Fortin, 2005).
• Data from multiple programs can be linked to obtain a more comprehensive picture of people and the services they receive.
• At the state level, such data are useful for assessing state-specific initiatives and can be used for a variety of program evaluations.
• Because of the large sample sizes, small program effects can be detected more easily, and effects can be estimated for various groups.
• Obtaining administrative data is less expensive than collecting new data directly from the same group.
5.11. LIMITATIONS OF ADMINISTRATIVE DATA
• Administrative data are collected in order to manage services and meet government reporting requirements. This introduces several challenges, because the primary purpose of the data is not research.
• Administrative data describe only individuals or families who use a service; they provide no information about similar individuals who do not use the service.
• The potential observation period for any subject being studied (e.g., a person, a household, a child care program) is confined to the time the subject uses the service for which data are being collected.
• In general, administrative data include only services that are publicly funded. A researcher, for instance, cannot rely on subsidy data to learn about all child care providers in a state, or about non-subsidized forms of child care used to supplement subsidized care.
• Because many variables in administrative data are not updated on a regular basis, it is critical to understand how and when each variable is collected. For example, in administrative data for child care financial assistance, an "earnings" variable is typically entered when eligibility is determined and then revised when eligibility is redetermined (Austin et al., 2005). When this is the case, there is no way to know what a family earns in the months between eligibility determination and redetermination using administrative data alone.
• Critical variables required for a specific research study may not be captured in administrative data.
• Because the data are confined to program participants, data on those who are eligible but not enrolled are frequently unavailable. As a result, administrative data may be of limited use for estimating certain characteristics, such as participation rates.
• Measurement error can be a significant challenge for analysts working with administrative data. The following factors contribute to measurement error:
– data entered incorrectly at the agency;
– incomplete or incorrect data items, especially those the agency does not require for management or reporting purposes;
– values that are missing or have been overwritten by updated values when cases are reviewed.
• Data access procedures for research purposes can be time consuming and complicated. Protecting program participants' privacy and the confidentiality of the data when they are used for research is a major concern for program officials.
Researchers who use administrative data for research should expect to spend a significant amount of time learning about the administrative data system's details, the specific data elements being used, the data entry process and guidelines, and changes in the data system and data definitions over time. Transforming administrative data into research data sets that can be used in statistical analyses also takes time (Figure 5.7).
Figure 5.7. Importance of administrative data sources. Source: https://www.summitllc.us/hs-fs/hub/355318/file-2536336408-png/Adminitrative_Data_Infographic.png#keepProtocol.
5.12. OBTAINING AND LEARNING FROM ADMINISTRATIVE DATA
When researchers decide to use administrative data in their research, they frequently face serious barriers. Among the most pressing of these concerns are:
• Gaining Data Access and Maintaining Confidentiality: In order to acquire administrative data, the researcher and the agency in charge of the data must agree on how the data will be used and analyzed, how confidentiality will be maintained, and how the research findings will be disseminated (Toivonen et al., 2001).
• Source Data Documentation:
– After obtaining the administrative records of interest, researchers must become acquainted with the peculiarities of the data.
– When combining data from multiple administrative databases, researchers must take care to evaluate the accuracy and consistency of the data elements as well as the performance of record matching procedures.
– In addition, researchers must learn about, understand, and document changes in the interpretation and meaning of data elements over time, as well as the procedures for updating data values.
– Researchers should meticulously document variable definitions, value codes, any recodes implemented by the agency, definition changes and their effective dates, and information on how the agency gathered the data.
• Program Criteria and Context Documentation: In addition to examining the source data, researchers should take special care to document the key parameters of the program that gathered the data and to describe the policy context at the time the information was collected, for instance, whether all those who were eligible and applied for the program could actually be served, given that available funding and service provision were limited (Sivak and Thomson, 2014).
When administrative data are used for research purposes, sampling is rarely needed because information on the entire population of recipients is available. However, in order to protect subject confidentiality, a subsample from the entire population may be chosen. Studies that combine survey data and administrative data may also choose only a subset of the population to reduce data collection costs.
5.13. REMOTE SENSING AND MAPPING
With the adoption of geographic information systems (GIS) in both industry and government for a wide range of applications, demand for remote sensing as a data acquisition source for spatial database development has grown rapidly. Remote sensing products are particularly appealing for GIS database development because they can offer cost-effective, large-area coverage in a digital format that can be input directly into a GIS. Because remote sensing data are typically collected in raster format, they can be converted to vector or quadtree formats for data interpretation or application at relatively low cost. Although the use of remote sensing data for geospatial database development is increasing rapidly, our understanding of the associated data processing errors, particularly when multiple spatial data sets are integrated, lags far behind (Figure 5.8).
Figure 5.8. Geographic information system (GIS). Source: https://www.nationalgeographic.org/encyclopedia/geographic-information-system-gis/.
Performing spatial data analysis operations on data of unknown accuracy or error characteristics yields a product in which users can place only limited confidence and which is of limited use in decision making. Although some research has been conducted on spatial error, we must clearly identify the kinds of errors
that may enter the process, understand how the error propagates through the processing flow, and establish methods to better measure and report the error using standardized techniques, i.e., techniques applicable to all spatial data users. The process of incorporating remote sensing data into a GIS typically includes the following analytical procedures: data acquisition, data processing, data analysis, data conversion, error assessment, and final product presentation (Simoncelli and Olshausen, 2001). Error may pass from one processing phase to the next without being detected by analysts until it appears in the finished product; error may accumulate throughout the process in an additive or multiplicative fashion; and individual process errors may be overshadowed by other errors of greater magnitude. Although the usual processing flow proceeds in sequence, bidirectional and cross-element computation flows are also possible. Data conversion, for example, is typically performed after data analysis, but in some cases it may occur during the data processing stage. These conversions are typically raster-to-raster (e.g., resampling pixel size) or vector-to-raster. The amount of error entering the system at each step can in principle be estimated. In practice, however, error is usually assessed only at the end of data analysis (i.e., on the final product), if at all. Typically, decision-makers are given graphic products, statistical summaries, or modeling results with little or no information about the level of confidence that can be placed in them, which reduces confidence in the resulting decision(s). It is critical that we improve our ability to quantify data error and to track the error as it propagates through a GIS application. A GIS is a computer system that captures, stores, manipulates, analyzes, manages, and displays many types of geographic data. GIScience refers to the discipline or profession of working with GISs and is a large field within the broader discipline of geoinformatics. A GIS allows the entry, management, extraction, analysis, and visualization of spatial data. GIS implementation is frequently driven by jurisdictional (such as a city), purpose, or application requirements. In general, a GIS implementation can be tailored to a specific organization; as a result, a GIS developed for one application, jurisdiction, enterprise, or purpose may not be interoperable or compatible with another application, jurisdiction, enterprise, or purpose (Seid et al., 2014). A spatial data infrastructure, which has no such constraints, goes beyond a GIS in this respect (Figure 5.9).
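One simple way to quantify how input error moves through a GIS processing flow is Monte Carlo simulation: perturb the inputs with errors of assumed size, repeat the overlay many times, and examine the spread of the results. The Python sketch below does this for two tiny synthetic raster layers combined by multiplication; the layer values, error levels, and overlay operation are all assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

layer_a = np.array([[1.0, 1.2], [0.9, 1.1]])   # synthetic raster layer 1
layer_b = np.array([[0.5, 0.7], [0.6, 0.4]])   # synthetic raster layer 2
sigma_a, sigma_b = 0.05, 0.10                   # assumed input error levels

n_sim = 5000
outputs = np.empty((n_sim,) + layer_a.shape)
for i in range(n_sim):
    a = layer_a + rng.normal(0, sigma_a, layer_a.shape)   # perturbed inputs
    b = layer_b + rng.normal(0, sigma_b, layer_b.shape)
    outputs[i] = a * b                                     # overlay operation

print("Mean of overlay:\n", outputs.mean(axis=0))
print("Std. dev. (propagated error):\n", outputs.std(axis=0))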
Figure 5.9. Layers in a GIS. Source: https://www.geologyin.com/2014/05/a-geographic-information-system-gis.html.
GIS is a broad term that can refer to a variety of technologies, processes, and methods. It is associated with many operations and has numerous applications in engineering, planning, management, transportation/logistics, insurance, telecommunications, and business. As a result, GIS and location intelligence applications can serve as the foundation for a variety of location-enabled systems that focus on the analysis, visualization, and dissemination of results for collaborative decision-making. For decades, aircraft and satellites have collected data and images of the Earth's land, atmosphere, and oceans. During the 1930s and 1940s, aerial photography was increasingly used for military reconnaissance. The first US meteorological satellite, TIROS-1, was launched in 1960, and Landsat, the first US civil satellite to examine and measure the land surface, was launched in 1972 (Rovetta and Castaldo, 2020). Over the last four decades, the United States government has contributed to the advancement of civil space-based remote sensing. Through NASA and the National Oceanic and Atmospheric Administration (NOAA), it has launched an ongoing series of increasingly capable Earth-observing satellites to support an operational weather monitoring service and to conduct
investigations to better understand the Earth's land, ocean, atmosphere, and biosphere, their interactions, and how the Earth system changes over time. In addition, the US Geological Survey has been responsible for cataloging and maintaining civil land remote sensing data (Pleil et al., 2014). The Land Remote Sensing Policy Act of 1992 established commercial land remote sensing as a policy goal for the United States and included provisions for the licensing of private remote sensing satellite operators. The first licenses were granted to private remote sensing operators in the early 1990s, and the first commercial remote sensing satellite was launched in 1999 (Bankole and Surajudeen, 2008). The advantages of remote sensing include the ability to collect information over large spatial areas; to characterize natural features or physical objects on the ground; to observe surface areas and objects systematically and monitor their changes over time; and to integrate this information with other data to support decision-making. Remote sensing data from aircraft or satellites can be acquired at various spatial resolutions (the spatial resolution is the smallest feature that can be resolved in an image). High resolution remote sensing images can discern features as small as a meter in size, whereas medium or lower resolution images can only resolve features on the order of tens to hundreds of meters. Remote sensing instruments may also collect data in various spectral bands of the electromagnetic spectrum, which can be used to recognize and classify vegetation. Data gathered in the thermal infrared portion of the spectrum are particularly useful for water management. Topographic data from light detection and ranging (lidar) instruments can be used to create digital elevation models. Local governments frequently require high resolution data, which aerial imagery has long provided. The introduction of commercial high resolution remote sensing imagery in the late 1990s created another source of data for local and regional governments. States have also benefited from moderate resolution imagery: Landsat data from the United States Government are used to monitor natural resources such as forests and wetlands over large areas, to evaluate the ecological condition of land and watershed regions, and to help protect nature reserves (Browning et al., 2015).
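As a small example of how multispectral data are used to recognize vegetation, the sketch below computes the Normalized Difference Vegetation Index (NDVI) from red and near-infrared reflectance values; the two tiny arrays stand in for co-registered image bands and are invented for the illustration.

import numpy as np

red = np.array([[0.10, 0.08, 0.30],
                [0.12, 0.09, 0.28]])
nir = np.array([[0.45, 0.50, 0.32],
                [0.40, 0.48, 0.30]])

ndvi = (nir - red) / (nir + red + 1e-9)   # small constant avoids divide-by-zero
print(np.round(ndvi, 2))                  # values near 1 suggest dense vegetation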
State and local governments can also benefit from remote sensing data to improve land use monitoring, transportation planning, and the handling of other infrastructure and public safety concerns. Businesses also use the information to support their operations. Real estate companies, for example, use imagery to enrich the information in property listings, and transportation companies can use remote sensing data to plan truck routes (Figure 5.10).
Figure 5.10. Providers of remote sensing data. Source: https://agfundernews.com/remote-sensing-market-map.
NASA operates fourteen Earth-observing research satellites to advance our understanding of the Earth's atmosphere, oceans, land surface, and biosphere. Some of these spacecraft are also used for practical purposes by public and private organizations. For example, the Terra and Aqua satellites collect data that support a range of environmental and hazard monitoring applications, while the Quick Scatterometer (QuikSCAT) and the Tropical Rainfall Measuring Mission collect data that aid in the forecasting of tropical cyclones and hurricanes. In addition, NASA's Earth Applied Sciences Program, part of its Earth Science Division, collaborates with federal agencies, partner organizations, and other institutions to apply NASA's Earth remote sensing information to decision support systems in agriculture, air quality, aviation, pollution abatement, coastal management, emergency preparedness, ecological forecasting, energy management, national security, invasive species, public health, and water management (Bustos-Korts et al., 2016). Many of the
organizations and agencies that use these tools provide services at the state, municipal, and regional levels. NOAA, within the Department of Commerce, operates the nation's fleet of civil operational weather monitoring satellites, which supply data for the National Weather Service's (NWS) forecasts. Future NOAA satellite systems, such as the National Polar-orbiting Operational Environmental Satellite System (NPOESS), will gather global data on the Earth's weather, oceans, land, and space environments. NOAA also runs data centers that archive geophysical, atmospheric, ocean, and coastal data and provide information products to support scientific research and other uses. Aerial photography is collected by a number of private firms throughout the country, and aerial imagery is also collected by federal agencies to support their missions. Commercial space remote sensing companies in the United States operate satellites and sell imagery and applications to clients in the public, private, and non-governmental sectors. High resolution imagery has dominated the commercial remote sensing data market, and commercial remote sensing imagery has been widely used by the Department of Defense (Chai et al., 2020). Among the many applications supported by the commercial remote sensing industry are mapping, national security, surveillance, urban planning, sustainable use of natural resources, homeland security, and emergency management and relief efforts. A number of non-US companies also gather and sell space remote sensing data.
5.14. TECHNOLOGIES OF DIGITAL INFORMATION AND COMMUNICATION
Advances in computer and communication technology have aided the development of remote sensing applications. Digital remote sensing data can be obtained and distributed via the Internet and manipulated on desktop computers. GISs allow multiple layers of geographic information (such as power plant and hospital locations) to be combined with remote sensing images. To enable applications that depend on precise locational information, global positioning data can be merged with remote sensing data sources (Christensen and Himme, 2017). In addition, software tools enable multiple sources of remote sensing data to be combined in order to maximize the information available for remote sensing applications. Due to the availability of civil remote sensing data, companies dedicated to processing and transforming remote sensing data into data products and solutions for
users have emerged. These companies produce mapping products such as topographic line maps and digital elevation models, as well as three-dimensional visualization data and other remote sensing applications.
5.14.1. Environmental Monitoring
Environmental monitoring is the use of tools and techniques to observe an environment, characterize its quality, and establish environmental parameters, in order to accurately quantify the impact an activity has on the environment. The results are collected, statistically analyzed, and then reported in risk assessment, environmental monitoring, and impact assessment reviews. Environmental monitoring refers to the processes and activities required to characterize and monitor the state of the environment. It is used in the preparation of environmental impact assessments and in many other situations where human activities may harm the natural environment. All monitoring strategies and programs have reasons and justifications, which are frequently designed to establish the current status of an environment or trends in environmental parameters. In all cases, monitoring results are reviewed, statistically analyzed, and published, so the design of a monitoring program must consider the final use of the data before monitoring begins. Environmental monitoring includes monitoring of air quality, soil quality, and water quality (Figure 5.11).
Figure 5.11. Air quality monitoring station. Source: https://en.wikipedia.org/wiki/Environmental_monitoring#/media/ File:Perugia,_2012_-_Air_quality_monitoring_station.jpg.
The primary goal of environmental monitoring is to manage and minimize the environmental impact of an organization's activities, either to ensure compliance with legislation or to reduce the risk of harmful effects on the natural environment and to safeguard human health. As the human population, industrial activity, and energy consumption grow, the continued development of sophisticated, automated monitoring applications and equipment is critical for improving the accuracy of environmental monitoring reviews and the cost-effectiveness of the environmental monitoring process (Dutilleul et al., 2000). Within an organization, monitoring programs are published outlines that detail the components being monitored, overall objectives, particular strategies, proposed sampling techniques, projects within each approach, and time frames. Environmental data management systems (EDMS), which include a central data management hub, electronic environmental monitoring notifications, compliance checking, validation, quality control, and the collation of information for dataset comparisons, make it easier to implement and track environmental monitoring and assessment activities.
5.15. ENVIRONMENTAL MONITORING TYPES
Soil, atmosphere, and water are the three main targets of environmental monitoring. Filtration, sedimentation, electrostatic samplers, impingers, absorption, humidification, scoop sampling, and composite sampling are some common environmental sampling and monitoring techniques. Data collected through these environmental monitoring methods can be entered into a DBMS, where they can be classified, analyzed, visualized, and used to generate actionable insights that enable informed decision making.
• Air Monitoring: Environmental data collected from multiple environmental networks and institutes using specialized observation tools such as sensor networks and GIS configurations are incorporated into air dispersion models, which integrate emissions, meteorological, and topographic information to pinpoint and predict concentrations of air pollutants (Piegorsch and Edwards, 2002); a simple numerical sketch of such a dispersion calculation is given at the end of this section.
• Soil Monitoring: To monitor soil, set benchmarks, and detect risks such as acidification, habitat destruction, compaction, pollutants, erosion, loss of organic matter, soil salinity, and slope instability, grab sampling (individual samples) and composite sampling (multiple combined samples) are used.
• Soil Salinity Monitoring: Remote sensing, GIS, and electromagnetic induction are used to track soil salinity, which can have negative consequences for water quality, infrastructure, and plant yield if it is out of balance.
• Contamination Monitoring: Toxic substances such as nuclear waste, coal ash, nanoplastics, petrochemicals, and acid rain, which can contribute to pollution-related illness if ingested by humans or animals, are measured using chemical techniques such as chromatography and spectrometry (Figure 5.12).
Figure 5.12. Collecting a soil sample in Mexico for pathogen testing. Source: https://en.wikipedia.org/wiki/Environmental_monitoring#/media/ File:Soil_samples_(15083707144).jpg.
• Erosion Monitoring: Monitoring and modeling soil erosion is a complex process, and accurate predictions for large areas are nearly impossible. The Universal Soil Loss Equation (USLE) is most commonly used to predict soil loss due to water erosion. Rainfall, surface runoff, rivers, streams, floods, wind, mass movement, climate, soil composition and structure, topography, and a lack of vegetation management can all contribute to erosion.
Water monitoring techniques include judgmental, simple random, stratified, systematic, and grid sampling; adaptive cluster sampling; grab and passive sampling; semi-continuous and continuous monitoring with environmental sensors; remote sensing; and biomonitoring (Pleil et al., 2014). Environmental monitoring of water is managed by federal, state, and local agencies, universities, and volunteers, and is critical for characterizing waterways, establishing the effectiveness of existing pollution prevention programs, recognizing trends and emerging concerns, redirecting pollution control efforts as needed, and supporting emergency response efforts.
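As noted under air monitoring above, dispersion models combine emissions and meteorological information to predict pollutant concentrations. The Python sketch below is a minimal steady-state Gaussian plume calculation of ground-level concentration downwind of a single point source; the emission rate, wind speed, stack height, and power-law dispersion coefficients are illustrative assumptions, not values from any particular model or site.

import numpy as np

Q = 50.0      # emission rate, g/s (assumed)
u = 4.0       # wind speed at stack height, m/s (assumed)
H = 40.0      # effective stack height, m (assumed)

def concentration(x, y):
    """Ground-level concentration (g/m^3) at downwind distance x and
    crosswind offset y, with ground reflection and simple power-law
    dispersion coefficients (assumed, roughly rural-type)."""
    sigma_y = 0.08 * x ** 0.9
    sigma_z = 0.06 * x ** 0.85
    return (Q / (np.pi * u * sigma_y * sigma_z)
            * np.exp(-y**2 / (2 * sigma_y**2))
            * np.exp(-H**2 / (2 * sigma_z**2)))

for x in (200, 500, 1000, 2000):        # distances downwind, m
    print(f"x = {x:5d} m  C = {concentration(x, 0.0):.2e} g/m^3")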
5.16. IOT-BASED ENVIRONMENTAL MONITORING
Over time, environmental monitoring solutions have evolved into smart environmental monitoring (SEM) systems that include modern sensors, machine learning (ML) tools, and the Internet of Things (IoT). IoT devices and wireless sensor networks (WSNs), for example, have made sophisticated environmental monitoring a more streamlined, largely automated process. Data collected by IoT environmental sensors under a wide range of environmental conditions can be integrated via the WSN into a single, cloud-based environmental system, in which IoT devices combined with ML can document, characterize, monitor, and evaluate elements of a specific environment. IoT-based environmental monitoring enables wireless, remote monitoring systems, allowing operators to remove much of the human interaction from system operation, reducing human labor, increasing the range and frequency of sampling and monitoring, facilitating sophisticated on-site testing, providing lower latency, and connecting detection systems directly to response teams, ultimately leading to faster detection and containment of serious incidents (Parise et al., 2014).
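As a small illustration of the kind of automated check an IoT-based monitoring system might run on incoming sensor readings, the sketch below keeps a rolling baseline of recent values and flags readings that depart from it by more than a threshold. The readings, window length, and threshold are simulated assumptions.

from collections import deque

WINDOW = 6             # number of recent readings in the rolling window
THRESHOLD = 3.0        # flag readings this far above/below the rolling mean

window = deque(maxlen=WINDOW)
readings = [21.2, 21.4, 21.1, 21.5, 21.3, 21.6, 21.4, 28.9, 21.5, 21.2]  # e.g., deg C

for t, value in enumerate(readings):
    if len(window) >= 3:                       # need a short baseline first
        baseline = sum(window) / len(window)
        if abs(value - baseline) > THRESHOLD:
            print(f"t={t}: reading {value} flagged (baseline {baseline:.1f})")
    window.append(value)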
5.17. REASONS FOR ENVIRONMENTAL MONITORING
The benefits of environmental monitoring stem from its ability to improve societal quality of life by highlighting the relationship between the environment and health. It is critical to keep citizens informed about the state of their environment by converting environmental monitoring data into useful information and communicating it to the community in a timely manner. Practical environmental monitoring applications include protection of water sources; management of hazardous and radioactive waste; identification and analysis of pollution sources that affect urban air quality and their impacts on human health; protection and conservation of natural resources such as soil and water supplies; weather forecasting; resource allocation for land planning; and economic development supported by energy analytics and energy business intelligence.
5.18. DATA FROM SCIENTIFIC RESEARCH AND SPECIAL PROJECTS
A research project that does not include proper research data management (RDM) is akin to building a house without first laying the foundation. Good RDM is essential to the success of a research project; even so, it is a component that is frequently overlooked, even when funding bodies require it. Planning your RDM should consider not only the present, but also how you will look back on your data years from now and how you might share the data with colleagues, potentially across disciplines, in an easily understandable format. Many aspects of the research data lifecycle contribute to making research projects and their data discoverable, accessible, interoperable, and reusable (Farrell et al., 2010). In this section, we cover some of the key aspects of RDM, such as organizing, storing, and sharing your data, developing data management plans (DMPs), and ensuring that any studies conducted are both socially responsible and reproducible. We explain why these areas are essential and how they can be incorporated into your work, and we conclude with a list of top ten RDM tips. Your ability to find, manage, publish, and re-use data will be shaped by how you organize and store it. Your future self is the first person positioned to benefit from a sensible organizational system. Using simple, logical folder and file structures will allow you to easily locate the various
pieces of your data. Simply naming files at random and putting them in a disorderly folder structure will not benefit you or anyone else who might want to use your information in the future. Another thing to consider is where you are permitted to store your research data, for example because of regulations or specific collaborator requirements on security and international data transfer. This is something you should consider from the start of your project, because it is much easier to add to an existing structure than to go back and rework years of files at the end (if you can even remember what each file referred to). If you store many similar types of data, consider creating template directories that you can reuse every time you establish a new dataset (Patil and Taillie, 2003); a small sketch of this idea follows Figure 5.13. When working on a collaborative research project, it is also critical to agree on organizational strategies as a group at the outset, to ensure consistency across the team in where and how members store and organize project data (Figure 5.13).
Figure 5.13. Top 10 tips for good research data management. Source: https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-02205908-5.
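As a small sketch of the template-directory idea mentioned above, the following Python helper creates the same folder skeleton, plus a minimal README, every time a new dataset is set up. The folder names are an illustrative convention rather than a prescribed standard.

from pathlib import Path

TEMPLATE = ["raw", "processed", "scripts", "outputs", "docs"]

def create_dataset(root, name):
    """Create a new dataset folder with the standard sub-folders."""
    base = Path(root) / name
    for sub in TEMPLATE:
        (base / sub).mkdir(parents=True, exist_ok=True)
    (base / "README.txt").write_text(f"Dataset: {name}\nBrief description goes here.\n")
    return base

create_dataset("research_data", "river_quality_2023")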
It is also worth determining which parts of your data you will need to retain in the short and long term, as well as how you will store them. It is also worth considering the trade-off between data storage and data recreation, since data that are costly to store but simple to recreate do not necessarily need to be saved. If your data are generated in a proprietary format, you must work out how to store them so that you and anyone else can use them later, even without access to the application that generated them. One way to ensure the
longevity of your data is to save a copy in a plain .txt file, so that even if the proprietary format becomes unusable, the data will still be readable and editable. However, rather than relying on this as the main storage method, this strategy should be used alongside saving the data in the original formats (Notaro et al., 2019). Furthermore, it is critical to provide context alongside the data, possibly in the form of a README data summary file or as additional metadata, since simply saving the information in raw text files is not very helpful when it comes to sharing or understanding it later on. It is also important to remember that communication is a key aspect of data organization when working on collaborative research projects. Projects and data cannot be managed successfully, even with the most organized group members, unless everyone communicates and agrees on who is doing what with the various pieces of data. The more data you have, and the more complicated the data, the more time you must devote to data planning, organization, and communication. Poor data management can result in data loss and, as a result, failed, and possibly harmful, research projects. Managing your data effectively, and planning how to do so from the start, is critical to a successful research project and is frequently a requirement of research funding bodies. To accomplish this, a DMP should be developed that outlines how the data will be managed throughout the project's lifecycle. Many universities have their own internal systems for developing these plans, but DMP Online also provides templates. A DMP should be a living document that is referred to and used to check whether the project is on track. It is critical that these documents are revised throughout the project to reflect any changes, and data stewards should be consulted about the DMP throughout the project lifecycle, not just at the start. In collaborative data projects these plans become even more important, because all group members must collect and handle data in the same manner (Ng and Vecchi, 2020). Version control is another important aspect of data management. All research project work should be backed up, but versions should also be kept so that changes can be tracked and records or data can be rolled back to a previous version if needed. Depending on the nature of the data and the project team's expertise, there are several ways to version data. For code or datasets, there are version control systems such as Git and hosting services such as GitHub. For documents, you can use integrated track changes, or separate files with version
numbers can be created, or version control tables can be added at the top of documents to record in-document changes (Figure 5.14).
Figure 5.14. Data activities. Source: http://kids.nceas.ucsb.edu/DataandScience/index.html.
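As a lightweight illustration of the file versioning described above, the Python sketch below copies a data file to a numbered version and appends an entry to a simple change log; a dedicated tool such as Git does this far more robustly, and the file names here are invented for the example.

import shutil
from datetime import date
from pathlib import Path

def save_version(data_file, log_file="CHANGELOG.txt", note=""):
    """Copy data_file to <name>_v<N> and append a line to the change log."""
    path = Path(data_file)
    existing = sorted(path.parent.glob(f"{path.stem}_v*{path.suffix}"))
    version = len(existing) + 1
    copy = path.with_name(f"{path.stem}_v{version}{path.suffix}")
    shutil.copy(path, copy)
    with open(path.parent / log_file, "a") as log:
        log.write(f"v{version}  {date.today()}  {note}\n")
    return copy

Path("survey_data.csv").write_text("site,ph\nA,7.1\n")     # dummy data file
save_version("survey_data.csv", note="initial import")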
5.19. GLOBAL AND INTERNATIONAL SOURCES OF DATA
The Central Intelligence Agency (CIA) of the United States publishes the World Factbook every year. Profiles can be accessed by country and contain data on the history, people, government, economy, geography, communications, transportation, military, and transnational issues of 267 world entities. Additional reference tabs include maps of major world regions, world flags, and a useful guide to country comparisons. The Factbook can be downloaded free of charge in compressed .zip format. The globalEDGE Database of International Business Statistics (DIBS) is a comprehensive data source. There are 2,460 data fields available, which can be retrieved for specific years and countries. Academics (students, faculty, and researchers associated with an institution of higher education, government agency, or research group) who wish to download data may do so free of charge (Marohasy, 2003). The only prerequisites are that you register on globalEDGE with an institutional email address (.edu or .gov) and provide a summary of the DIBS study you are conducting (each time you download data). The International Labour Organization's flagship publication, World Employment and Social Outlook – Trends, examines the most recent statistics and projections for a variety of labor market indicators. Global employment and unemployment, continuing disparities, falling wage shares, and other
variables influencing the rising middle class are all included in the data. The report also discusses structural factors influencing the labor market.
5.20. KEY GOVERNMENT DATABASES
• Eurostat: EU publications and statistics on the economy, population, agriculture, trade, transportation, the environment, and other topics.
• FAOSTAT: FAO data on agriculture, food supply, food and nutrition security, prices, commodities, forest products, and fisheries.
• Global Health Observatory: Epidemiological and statistical data on health and health-related issues from the World Health Organization.
• ILOSTAT: International labor statistics from the International Labour Organization on employment, earnings, wages, migration, strikes, and other topics from 1979 to the present (Fourcade et al., 2018).
• The IMF eLibrary: Exchange and interest rate data, balance of payments, government finance, national accounts, trade, foreign reserves, and more.
• The OECD iLibrary: Economic, financial, trade, telecommunications, development, aid, migration, and other data.
• UNdata: Economic, social, cultural, and demographic indicators from the United Nations and a number of UN-affiliated or related organizations.
• UNCTADstat: Data on FDI and the creative and information economies from the United Nations Conference on Trade and Development.
• United Nations Development Programme Indicators: Includes the Human Development Index (HDI) and the Gender Inequality Index.
• The UNIDO Data Portal: Data on industry and mining from the United Nations Industrial Development Organization.
• Data Catalog of the World Bank: The World Bank's catalog of international economic, financial, and socioeconomic data (Madavan and Balaraman, 2016).
• CEIC Global: A worldwide database of countries, covering national accounts, industry, sales, construction and property, demography, labor, foreign trade, stock markets, banking, inflation, monetary indicators, foreign exchange, investment, and other data.
• China Data Online: China's macroeconomic statistics, city and county information, industrial sector data, and survey updates, available on a monthly and annual basis. Select "direct access for institutional users."
• Cross-National Time Series: Longitudinal national-level data spanning 200 years and more than 200 countries, covering a wide range of demographic, social, political, and economic topics.
• Data Planet: Data points from licensed and public domain data sets from local, state, and international governments and organizations.
• Global Financial Data (GFD): Long runs of historical data (over 6,000 series) and macroeconomic series for over 150 countries, including financials, interest rates, exchange rates, and more. Each dataset is described in online encyclopedias.
• Indiastat: Statistical data from the Indian government and private sources on finance, agriculture, health, housing, mass transit, and many other areas. Registration is required to access the database (Kobayashi and Ye, 2014).
• Inter-university Consortium for Political and Social Research (ICPSR): A group of institutions working together to collect and maintain social science data. ICPSR, based at the University of Michigan, collects, processes, and disseminates data on social phenomena in 130 countries.
• Country Data from the PRS Group: Political risk data for 140 countries dating back to 1980, as well as macroeconomic data and Political Risk Services (PRS) forecasts on bureaucracy, corruption, civil unrest, ethnic tensions, poverty, inflation, terrorism, and other metrics.
• NIS Statistical Publications of Russia: Reports and data sets from the Russian Federation's State Statistics Committee and the Commonwealth of Independent States' Interstate Statistical Committee. The 2002 All-Russia Census is also included.
• Statista: Information on the media, business, politics, and other topics, drawing on market research reports, trade journals, scientific journals, and official sources (Khormi and Kumar, 2011).
• Statistical Abstracts of the World: Statistical abstracts published by the national statistical offices of foreign governments, as well as global and regional statistical compendia from international organizations (Figure 5.15).
Figure 5.15. The United Nations Statistics Division is committed to the advancement of the global statistical system. We compile and disseminate global statistical information, develop standards and norms for statistical activities, and support countries’ efforts to strengthen their national statistical systems. We facilitate the coordination of international statistical activities and support the functioning of the United Nations Statistical Commission as the apex entity of the global statistical system. Source: https://unstats.un.org/home/.
5.21. A DATA SOURCE IS THE LOCATION WHERE THE DATA BEING USED ORIGINATES
A data source can be the place where data are first created or where physical documentation is first digitized, but even the most refined data can serve as a source if another process accesses and uses them. A database, a flat file, live measurements from hardware devices, scraped web data, or any of the numerous static and streaming data services available on the internet are all examples of data sources. Here is an illustration of a data source in action. Consider a fashion brand that sells products online. The website uses an inventory database to determine whether an item is out of stock. In this case, the inventory tables are a data source that the web application serving the site to customers can access (Köhl et al., 2000). Concentrating on how the term is used in the familiar database management context helps clarify what types of data sources exist, how they work, and when they are useful.
5.21.1. Data Source Nomenclature
Databases remain the most common data sources, serving as the primary data stores in ubiquitous relational database management systems (RDBMSs). The Data Source Name (DSN) is an important concept in this context. Within destination databases or applications, the DSN is defined as a pointer to the actual data, whether the data exist locally or on a remote server (and whether in a single physical location or virtualized). The DSN is not necessarily the same as the relevant database or file name; rather, it is an address or label used to access the data at its source easily. Because the systems doing the ingesting of data ultimately determine the context for any discussion of data sources, definitions and nomenclature differ widely and can be confusing, particularly in technical documentation. For instance, in the Java platform a 'DataSource' is an object that represents a connection to a database, while some newer systems use the term 'DataSource' to refer to any collection of data that provides a standardized means of access.
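As an illustration of how a DSN keeps connection details out of application code, the Python sketch below connects to a database through an ODBC data source. It assumes that an ODBC DSN named 'inventory_db' has already been configured on the machine and that the pyodbc package is installed; the table and column names are likewise hypothetical.

import pyodbc

# The DSN is just a label; driver, server, and database details live in the
# machine's ODBC configuration, not in the application code.
conn = pyodbc.connect("DSN=inventory_db;UID=report_user;PWD=secret")

cursor = conn.cursor()
for row in cursor.execute("SELECT item_id, quantity FROM stock WHERE quantity = 0"):
    print(f"Out of stock: {row.item_id}")
conn.close()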
5.22. DATA SOURCE TYPES
Though the diversity of data content, format, and location continues to grow as a result of technologies such as the IoT and the adoption of big
data methodologies, most data sources can still be divided into two broad categories: machine data sources and file data sources (Kessler et al., 2015). Although both serve the same basic purpose, pointing to the location of the data and describing similar connection characteristics, machine and file data sources are stored, accessed, and used in different ways.
5.23. MACHINE DATA SOURCES
Users define the names of machine data sources, which must reside on the machine ingesting the data and cannot easily be shared. Like other data sources, machine data sources provide all the information necessary to connect to the data, such as the relevant software drivers and a driver manager, but users need only refer to the DSN as shorthand to invoke the connection or query the data. The connection information is stored in environment variables, database configuration options, or a location internal to the machine or application being used. An Oracle data source, for example, will contain the server location for accessing the remote DBMS, information about which drivers to use, the driver engine, and any other relevant parts of a typical connection string, such as system and user IDs and authentication (Kandlikar et al., 2018).
5.23.1. File Data Sources
In file data sources, all connection information is contained within a single, shareable computer file (typically with a .dsn extension). File data sources are not assigned names by users, because they are not registered to individual applications, systems, or users and do not have a DSN in the way machine data sources do. Each file stores the connection string for a single data source. Unlike machine data sources, file data sources can be edited and copied just like any other computer file. This allows users and systems to share a common connection (by moving the data source between individual machines or servers) and streamlines data connection processes (for example, by keeping a source file on a shared resource so it can be used simultaneously by multiple applications and users). It is important to note that 'unshareable' .dsn files also exist. These are the same type of file as described above, but they exist on a single machine and cannot be moved or copied. Such files point directly to machine data sources (John et al., 2021). This means that unshareable file data sources are wrappers for machine data sources, serving as a proxy for applications that expect only files but also need to connect to machine data.
CHAPTER 6
ENVIRONMENTAL SAMPLING
CONTENTS
6.1. Introduction..................................................................................... 134
6.2. Importance of Environmental Sampling........................................... 135
6.3. Environmental Sampling Methods.................................................... 136
6.4. Hydrological Traces......................................................................... 140
6.5. Measuring pH and Electrical Conductivity (EC)................................. 141
6.6. Stream Gaging................................................................................. 143
6.7. Winkler Method for Measuring Dissolved Oxygen........................... 146
6.8. Measuring Turbidity Using a Secchi Disk......................................... 148
6.9. Conductivity, Temperature, and Depth Rosette (CTD)....................... 150
6.10. Stable Isotope Primer and Hydrological Applications..................... 152
6.11. Challenges of Environmental Sampling.......................................... 154
6.1. INTRODUCTION
Environmental sampling is a technique used to collect samples of a material or substance from the environment. It is often used to determine whether an area has been contaminated by hazardous materials. Environmental sampling may be carried out indoors or outdoors, by hand, with shovels and other tools, or using automated equipment, and samples may be taken from soil, water, air, plants, or animals (Iwai et al., 2005). The purpose of environmental sampling is to determine what types of chemicals are present in an area and in what quantities. In some cases, environmental sampling may help determine whether there are any health risks associated with exposure to these chemicals. Environmental sampling has been practiced in some form for centuries, with early documented instances reported from China during the Song Dynasty (960–1279). However, it was not until the 20th century that technology became available to allow its widespread use. In the 19th century, chemists in England began analyzing water quality at various locations and found that some areas had higher contaminant levels than others because of pollution from industrial processes such as mining and manufacturing. This led to methods for determining these levels accurately, so that they could be measured more easily by later scientists and by people who wanted information about their local environments (Homma et al., 2020). Many further developments followed, including the advances in microbiology associated with scientists such as Robert Koch and Louis Pasteur. In the early 20th century, new radiation detectors made it possible to measure radioactive elements in soil samples more accurately, which proved useful when analyzing areas believed to be contaminated by radioactive waste. Electrostatic precipitators (ESPs) were later used to collect airborne particulates such as dust and soot from industrial chimneys and smoke stacks, so that they could be analyzed for environmental contaminants like lead or mercury
oxides, which might cause health problems downwind if inhaled by people living nearby. The history of environmental sampling can also be traced through early 20th-century public health work. In 1914, an article published in Scientific American by Dr. F. L. Crane detailed how to take samples from the environment for analysis, and around 1900 a commission of scientists led by Dr. Walter Reed collected samples and carried out field experiments in Cuba that helped establish how yellow fever was transmitted during an outbreak there (Higuchi and Inoue, 2019). Environmental sampling has been used throughout history to detect pollutants in water and air, as well as to identify hazardous materials on land or at sea.
6.2. IMPORTANCE OF ENVIRONMENTAL SAMPLING
Environmental sampling is the process of collecting samples from the environment and analyzing them for contaminants. It can be used to determine whether a certain area has been contaminated with hazardous materials, such as chemicals or other pollutants. It is also used to detect pathogens that may cause disease and to find out what types of organisms live in an area. Environmental samples can be collected from soil, water, plants, and animals. Sampling can be done on-site or off-site, depending on what is being tested and where the analysis needs to be done: on-site means in the place where the sample is needed, and off-site means somewhere else. Environmental sampling may be performed using different methods depending on what is being tested and how much time is available (Han and Li, 2020). Some tests take only minutes, while others may take several days or longer, depending on how many samples need to be taken from each area before enough information has been collected for analysis. For example, a program may take only one sample per month per location because of the cost savings of that approach, whereas taking multiple samples per week per location requires more equipment and greater personnel costs. Environmental sampling can be performed in several different ways. Air sampling involves collecting a sample from the air for testing; this is done by drawing air through a filter and then analyzing the filter for contaminants. Air samples can be taken indoors or outdoors, depending on the situation and what needs to be analyzed. Soil sampling involves collecting soil samples from an area of land to determine if contaminants
are present in that area. Soil samples can be taken using several different methods, including digging them up with a shovel or trowel, or using a machine called an auger to extract deeper cores, and then analyzing the samples individually. Sediment and water-table sampling involves collecting samples from the sediment at the bottom of lakes or rivers and from groundwater, the water below ground level. These samples can be analyzed for contaminants such as heavy metals, pesticides, and herbicides. For example, an environmental sample might include the air around an animal or person who is experiencing asthma, allowing scientists to analyze what exactly is triggering the asthma attack. Another example might be collecting water samples from a river or lake near a factory that is polluting it with chemicals like mercury (Girshick et al., 2011). These samples could then help scientists determine how much pollution is being released into the environment and what it is doing there. There are many benefits to using environmental sampling as part of a quality control program. First, it is relatively inexpensive compared with other methods because little special equipment or dedicated infrastructure needs to be purchased or set up. Second, environmental sampling allows information to be gathered about an entire area at once rather than just one specific location. Third, environmental sampling does not require extensive specialized training or experience in order to perform it correctly. Environmental sampling also allows samples that have been exposed to different conditions to be collected in real time, which makes it easier to detect any changes caused by contamination or other factors, and it makes it possible to test for contaminants that are not easily detected by other approaches such as visual inspection alone.
6.3. ENVIRONMENTAL SAMPLING METHODS
Direct observation is one of the most common methods of environmental sampling. It involves recording conditions that can be observed or measured on the spot, such as temperature or humidity, and comparing them against known measurements taken at other times or in other places. This method can be effective when you need to measure something whose importance has already been established, like water quality, but it is less effective when you are trying to establish a baseline for something new, like a previously unstudied form of pollution. Samples of soil and water can be collected by digging holes in the ground or by using specialized equipment like pumps and filters (Gotelli et al., 2012). The samples can then be analyzed for chemicals, bacteria, or other substances
that might affect human health or safety. Radiation levels can also be measured using specialized equipment such as Geiger counters. The most common methods of environmental sampling include:
• Air Sampling: This involves collecting a sample of air from different areas for analysis. For example, it can be used to determine whether any chemicals are present in the air or whether unusual odors indicate a pollution problem.
• Soil Sampling: Soil samples are collected from different locations and analyzed for contaminants and other issues that could affect plant life in the area. They can also be used to determine whether any harmful substances are being released into the environment by nearby businesses or factories.
• Water Sampling: Water samples can be taken from lakes, streams, rivers, and other bodies of water to evaluate their quality and detect any contaminants that may be present.
There are two main types of environmental sampling: physical and chemical. Physical sampling involves taking a sample from an area of concern and examining its physical characteristics for signs of contamination, while chemical sampling involves taking a sample and analyzing it for the chemical contaminants it contains. Physical methods include direct contact sampling, in which the surface of a contaminated site is touched with a swab, tape, or other material that has been treated with a reagent to detect contaminants; direct contact sampling can be done by hand or by machine. Indirect contact sampling involves placing an absorbent material onto a contaminated site so that it picks up contaminants as it dries out or absorbs them over time while resting on the contaminated surface (Fuentes et al., 2007). Chemical environmental sampling is a process that involves collecting samples from a physical environment in order to determine the presence and concentration of chemicals. Such sampling can be done for many reasons, including determining whether a toxic substance is present at dangerous levels or testing water quality. There are five types of chemical environmental sampling: passive, active, visual inspection and smell, direct contact, and indirect contact using sampling equipment. Passive chemical environmental sampling is a technique that uses a passive sampler to capture air samples, which are then analyzed for the
presence of chemicals. It is described as passive because it relies on diffusion rather than pumps or other powered equipment to move the sampled air into the collection medium. The samplers are typically deployed in the open environment (for example, above a soil or water surface) rather than inside a contained site. Passive samplers are typically used in the field and include devices such as charcoal tubes, XAD-2 cartridges, and filter pads. These devices are often used to detect volatile organic compounds (VOCs), which can be harmful to human health and the environment. Passive chemical environmental sampling is used by scientists to measure contaminants in the air: a passive sampler is placed in an environment where it can collect airborne material, which is then analyzed for the presence of chemicals (Giraldo et al., 2010). Airborne contaminants may come from natural sources or from human activities such as industrial processes or automobile exhaust fumes. Because no pumping occurs, the sampler simply accumulates the chemicals that are dispersed in the surrounding environment; this can be done with devices such as passive diffusion tubes and badge- or cartridge-type diffusive samplers. Active chemical environmental sampling is a technique used to determine the presence of chemicals in the environment by drawing or pumping a sample from the point of interest and then analyzing it for its chemical content. The active approach involves taking samples from areas where pollutants are known to be present or suspected to exist. This can include areas near factories or manufacturing plants as well as remote locations that are difficult for people to reach. The samples are then analyzed using laboratory equipment such as gas chromatography and mass spectrometry. The process involves collecting a sample from an area of interest, typically from the environment or from humans, animals, or plants, using a method that depends on the type of sample and where it is being collected. Once samples have been collected and prepared for analysis, they are analyzed to determine their chemical makeup (Frankenhuis et al., 2019). This can involve analytical techniques such as spectroscopy, chromatography, and mass spectrometry, depending on what information needs to be extracted from each sample type. The active chemical sample is typically processed using a solid-phase extraction (SPE) cartridge containing an extraction medium (typically silica gel), which allows for maximum recovery of the analyte from its surrounding matrix without interference from other substances in the matrix.
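To make the passive-sampler idea concrete, the short sketch below estimates a time-averaged air concentration from the mass collected on a diffusive sampler, assuming the common uptake-rate model in which mass collected equals sampling rate times concentration times exposure time. The function name and all numbers are hypothetical illustrations, not calibration data for any real device.

# Minimal sketch: time-weighted average concentration from a passive (diffusive) sampler.
# Assumes the common uptake-rate model  mass = sampling_rate * concentration * time.
# All numbers below are hypothetical examples, not calibration data for any real device.

def passive_sampler_concentration(mass_ug, sampling_rate_ml_per_min, exposure_min):
    """Return the time-averaged air concentration in ug/m^3."""
    volume_m3 = sampling_rate_ml_per_min * exposure_min / 1.0e6  # mL -> m^3
    return mass_ug / volume_m3

if __name__ == "__main__":
    # e.g., 0.05 ug collected over 7 days at an assumed uptake rate of 0.5 mL/min
    conc = passive_sampler_concentration(mass_ug=0.05,
                                         sampling_rate_ml_per_min=0.5,
                                         exposure_min=7 * 24 * 60)
    print(f"Time-weighted average concentration: {conc:.1f} ug/m^3")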
Visual inspection of an area is a common chemical environmental sampling technique. This method involves observing the area and noting any visible evidence of contamination. The human eye can detect many relevant qualities, such as color, discoloration, sheen, or residue. When used in conjunction with other methods, visual inspection can provide great insight into whether or not an object or area is contaminated with chemicals. The first step in visual inspection is to determine what you are looking for. The most common contaminants are oil, heavy metals, solvents, and chlorinated solvents in particular. If you are unsure what to look for, contact your local health department or the Environmental Protection Agency (EPA) for guidance. Once you have determined which contaminants you want to find, examine the area from different angles and distances; certain types of contamination are easier to spot from some angles or distances than from others. For example, if there is a spill of oil on the ground, it will be easier to see from above than up close (Girshick et al., 2011). Photographs can be used as evidence if there are any disputes about whether or not contamination was present at an incident site. Visual inspection is a widely used screening method because it allows you to detect and flag the presence of hazardous materials quickly and easily with minimal equipment, and it allows you to gauge the extent of contamination within an area without having to take samples or perform other complicated procedures that require specialized equipment. Smell is another technique that can be used to collect information about the environment. The sense of smell is one of the oldest biological mechanisms, and it has many functions, including detecting chemicals in the air. When people smell something, they are actually detecting chemicals in the air with their noses. This can be done in several ways, including sniffing, breathing in through the mouth or nose, or holding a paper or cloth up to your face. When you smell something new or unfamiliar, your brain sends signals to your olfactory bulb, the part of your brain that processes smell; these signals can tell you whether to approach or avoid something based on how similar it smells to things you have encountered before (Gallyamov et al., 2007). Smell is a useful tool for collecting information about an area because it can suggest that chemicals are present in the air without requiring any specialized equipment or training, and it does not require protective clothing in the way that some other chemical analysis techniques might. Direct contact is a chemical environmental sampling technique that involves collecting a substance from its surrounding environment, whether
that be soil, water, or air. This technique is used to collect specific chemicals, including VOCs and semi-volatile organic compounds (SVOCs). It is most commonly used for soil and water samples because it is less invasive than other techniques and can be done more quickly. The analyte is then extracted from the collected sample into a solvent, usually water; the sample may be filtered first to remove particulates or other unwanted compounds, and then analyzed by some type of detection method. Direct contact works by using a device that can collect the sample without altering it. The device does not have to touch the bulk material being sampled directly; rather, it uses a collection material, commonly a metal such as copper, to pick up small amounts of material from its surroundings. Indirect contact is a chemical environmental sampling technique that involves collecting samples from the air and from surfaces. This method can be used to determine whether chemicals are present in the environment, but on its own it cannot identify the type of chemical or how much of it is present. Indirect contact is useful because it can be performed quickly and easily and requires little technical expertise (Zio et al., 2004). It can be used to detect traces of chemicals in food, water, or air; a common use is the detection of pesticide residues on fruit and vegetables. It is also less expensive than techniques such as gas chromatography or mass spectrometry, which require specialized equipment and training.
6.4. HYDROLOGICAL TRACES
Hydrological traces are a type of environmental sampling technique that involves the collection of water samples from various locations with the purpose of identifying potential sources of pollution. The technique relies on the fact that water flows through an area in a predictable, measurable way, and it can be used to track changes in water quality, including its chemical, biological, and physical components. The samples are analyzed to determine whether they contain any contaminants that could be damaging to human health or to aquatic ecosystems. The main goal of hydrological tracing is to identify where pollutants may be coming from and how they are being transported through the waterways. This information can then be used to pinpoint areas where pollution prevention efforts should be focused, as well as to develop effective solutions for eliminating contaminants from these areas. Tracer methods of this kind have been developed and refined over many decades by scientists looking for easier ways to track the sources and movement of water pollution. One early form of hydro-trace was created by
adding three different chemicals—diphenylurea (DPU), dimethyl phenyl sulfone (DMPS), and methylene blue—to water samples from various sites around the world. These chemicals were added in order to create a visible color change when they came into contact with certain types of pollutants, allowing scientists to test for those pollutants with ease. This makes it easier for researchers to identify specific types of pollutants that they could not otherwise see directly (Ying and Sheng-Cai, 2012). However, there are some challenges associated with hydrological tracing as an environmental sampling technique. One challenge is that it requires specialized equipment to detect trace substances present in water samples. In addition, hydrological tracing relies on chemicals that may themselves have negative impacts on aquatic life if released into rivers or lakes. First, the method can only be used if there is enough water for the tracer to travel through; if there is not, the tracer will be lost before it reaches its destination. Second, if multiple water types or contaminants are present in the supply, for example both saltwater and freshwater, they may mix and produce a signal that cannot be interpreted by conventional methods. Another challenge is that hydrological traces do not always have a clear source or destination; they can form anywhere along their path through an environment, which makes them difficult to study because it is hard to know where to start looking for answers when the starting point itself is unknown.
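Tracer concentrations measured at different points are often interpreted with a simple mass-balance (mixing) calculation. The sketch below is a minimal illustration, assuming conservative (non-reactive) mixing of exactly two end-members; the chloride concentrations and function names are hypothetical.

# Minimal two-end-member mixing sketch (hypothetical concentrations, conservative tracer assumed).
# f solves:  c_mix = f * c_source + (1 - f) * c_background

def source_fraction(c_mix, c_source, c_background):
    """Fraction of the downstream water attributable to the suspected source."""
    if c_source == c_background:
        raise ValueError("End-members must differ for the mixing model to be identifiable.")
    return (c_mix - c_background) / (c_source - c_background)

if __name__ == "__main__":
    # Hypothetical chloride concentrations in mg/L
    f = source_fraction(c_mix=42.0, c_source=180.0, c_background=12.0)
    print(f"Estimated contribution from the suspected source: {f:.1%}")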
6.5. MEASURING pH AND ELECTRICAL CONDUCTIVITY (EC)
pH and electrical conductivity (EC) are two measurements that can be used to assess water quality in an environment. EC is a measure of the ability of a solution to carry an electric current. It is reported in microsiemens per centimeter (µS/cm), millisiemens per centimeter (mS/cm), or millisiemens per meter (mS/m). It is often used as an indicator of pollution in water because it reflects how many dissolved ions, the electrical charge carriers, are available to move through the solution. A low EC value means that there are few charged particles present in the water, which would indicate that there is not much dissolved pollution present; a high EC value means that there are many charged particles present, which can indicate pollution or excessive nutrients in the area. pH is a measure of how acidic or basic a solution is on a scale from
zero to 14: values below 7 are acidic, values above 7 are basic, and 7 itself is neutral. When measuring pH for environmental purposes, it is important to use the correct calibrations for your equipment so that you know what each reading actually means (Xia et al., 2017). Environmental sampling techniques such as these can be used to determine whether there are problems with the local environment where you live or work. You can test for things like toxic chemicals in the soil or air around your home or office building by taking samples from various locations and sending them to an accredited laboratory for testing. In order to measure the pH and EC of a sample, the following steps can be used. Fill a clean, dry glass container with enough water to cover the pH paper and conductivity probe that will be used. Place a piece of pH paper into the container and let it soak for 30 minutes so that all of its fibers are completely wetted. Remove the paper from the container and allow it to dry completely for at least 30 minutes. Place a strip of exposed pH paper into an unmarked plastic bag along with a blank strip of pH paper that has not been exposed in any way, and seal the bag tightly so that no air can enter or escape. Add distilled water to fill the spaces between the strips so that they are completely submerged, making sure no air bubbles remain between them. Close the bag securely again and place it inside a second bag containing only distilled water and no other items. Place all of the samples into a refrigerator set at 4°C overnight. Measuring pH and EC as an environmental sampling technique has many benefits. The first is ease of use: the equipment is inexpensive, can be purchased at most hardware stores, and does not require any special training or experience. The second is that it gives much faster answers than many other methods: other approaches to assessing water quality require a laboratory analysis that can take weeks or months to return results, whereas this method gives readings on the spot and provides information on both pH and EC levels (Wieland et al., 2010). The third benefit is that it saves money by reducing the need for laboratory analyses, which can cost hundreds or even thousands of dollars per sample. Ease of access is another advantage: it is possible to measure the pH and EC of a sample in a remote location with minimal equipment, which makes the technique usable in environments where other measurements would be difficult to take, such as deep underground
or underwater. Versatility is another benefit: the two tests work in virtually all types of environments, so they can be used almost anywhere without concern about whether they will function properly. The pH measurement tells us whether the water is acidic or alkaline; alkaline water has a pH above 7, while acidic water has a pH below 7. This information helps determine whether conditions favor harmful bacteria or other microbes in the water. The EC tells us how many dissolved ions are in the water; a high ion content can indicate that pollutants such as metals are present. Measuring pH and EC as an environmental sampling technique can nevertheless be challenging for several reasons. First, the equipment needed to perform these tests is not always readily available; some sites may have only a few instruments for testing, which leads to long waits for results. In addition, the equipment that is available may not be properly calibrated or maintained by those responsible for it, which can lead to inaccurate readings or false positives and, in turn, to poor decisions about the environment. Another problem is that some chemicals used in testing can themselves harm the environment: for example, if someone uses a chemical they do not know how to dispose of properly, such as a solvent, it can pollute groundwater or another nearby body of water, with serious consequences such as illness among people who drink the contaminated supplies (Wagner and Fortin, 2005). To measure EC with a handheld meter, you must first clean the electrodes so that residue does not interfere with your readings; this is done by rubbing them with a cloth dipped in distilled water, which can be difficult to do properly in the field when conditions are not ideal.
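Field meters normally report EC referenced to 25°C; when a reading is taken at another temperature, a linear correction of roughly 2% per °C is a common approximation. The sketch below applies that approximation together with the pH ranges described above. The 2% coefficient and the example readings are illustrative assumptions, not values for any particular meter.

# Minimal sketch: temperature-compensating an EC reading and classifying a pH value.
# The 2%/degC linear compensation is a common approximation; real meters may use other models.

def ec_at_25c(ec_measured_us_cm, temp_c, alpha=0.02):
    """Convert a raw EC reading (uS/cm) to the value referenced to 25 degC."""
    return ec_measured_us_cm / (1.0 + alpha * (temp_c - 25.0))

def classify_ph(ph):
    if ph < 7.0:
        return "acidic"
    if ph > 7.0:
        return "basic"
    return "neutral"

if __name__ == "__main__":
    ec25 = ec_at_25c(ec_measured_us_cm=480.0, temp_c=18.0)   # hypothetical field reading
    print(f"EC referenced to 25 degC: {ec25:.0f} uS/cm ({ec25 / 1000:.2f} mS/cm)")
    print(f"pH 6.4 is {classify_ph(6.4)}")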
6.6. STREAM GAGING
Stream gaging is an environmental sampling technique used to determine the flow rate of a stream. A gage is installed at a point on the stream, and the water level (stage) at that point is measured over time. This allows us to calculate how much water is passing that point in a given period of time. Stream gaging helps us understand how much water is flowing through streams and rivers, which can help us predict when floods or droughts are likely to occur. It also helps us understand how much sediment is being carried by these rivers, which in turn indicates what pollutants are affecting
these rivers as well as how best to clean them up (Toivonen et al., 2001). This information can be used in other ways as well: by tracking changes in stream flow over time, we can see whether there has been any change in the amount of pollution entering the river system from upstream sources such as factories, and whether this has resulted in changes in water quality downstream, such as increased levels of mercury. Stream gaging is usually performed by government agencies, but it can also be done by private groups or individuals. It is often used as part of a larger study on water quality and quantity, particularly where there are concerns about pollution or contamination from nearby industrial operations. Stream gaging has been used to monitor water quality for more than 100 years. It was originally developed as a way to measure stream discharge, which is defined as the volume of water flowing past a point in the channel per unit of time. Stream discharge is one of the most common measurements made at stream gaging stations and is used in many applications, including flood prediction, water supply management, and environmental monitoring. The earliest stream gage records were collected manually by an observer stationed at the location where measurements were taken. In 1874, John Wesley Powell devised an automated method for collecting data on river flow through the use of weighted floats that moved up and down with changing river levels. These floats were placed in six-foot-long wooden tubes driven into the ground along streamsides so that the tubes remained stationary while the floats rose and fell with the water level. The resulting data allowed Powell to estimate annual averages for rainfall over entire watersheds based on how much water flowed past his sampling sites each month in different seasons over yearlong surveys. Stream-gaging stations typically consist of a stage recorder, a telephone line or radio transmitter to transmit data, and a battery-powered recording meter or gauge that logs continuous measurements of stage height over time (Sivak and Thomson, 2014). The recorders send data to the USGS via radio telemetry or the Internet each day, providing near-real-time measurements that the USGS uses to help manage river flow and to provide flood warnings during storms. Stream gaging uses flumes, weirs, and other devices to measure river discharge, and it is an effective method for measuring water levels and flow rates. It can help us understand how rivers and other bodies of water are being used by humans and wildlife, which in
turn helps us protect them. Stream gaging has many benefits as a sampling technique. First, it is relatively inexpensive compared with other techniques such as remote sensing. Second, it can be used to monitor changes over time and across seasons; this allows information to be gathered about seasonal variation in stream flow rates or water quality conditions over periods ranging from weeks to decades, depending on the location. Third, stream gaging provides data on water depth and velocity at fixed points along a river's course or within its channel, which helps determine where sediment deposition may occur so that flooding or erosion problems resulting from human activities along these waterways can be prevented. It is also quick: depending on the type of equipment used and how far apart the sampling points are spaced, measurements can be made in a matter of hours or even minutes (Simoncelli and Olshausen, 2001). And it is versatile: stream gaging can be used for anything from monitoring the quality of local water sources to evaluating the impact of global climate change on river levels across vast distances. However, there are some challenges with using stream gaging as an environmental sampling technique. The main challenge is that the equipment must be placed carefully in order to work effectively; the devices must be positioned so that they are not affected by high winds or other factors that can cause inaccurate measurements. Another issue is that if too many measurements are taken at one time, there may be interference between measurement points, which can make it difficult to determine which information is accurate. The devices must also be placed in locations that are accessible to the people who will retrieve and replace them, which can be a problem when trying to find sites with little human activity so that data collection does not disturb stream health (Seid et al., 2014). The devices themselves can become damaged or lost over time, which means they must be replaced regularly so that accurate data can continue to be collected consistently. Stream gaging also has some disadvantages: it requires an expensive piece of equipment called a gage; it measures current velocity at only one point on the stream channel; it requires frequent maintenance; and it does not always provide accurate measurements because of factors such as sediment transport and bank erosion.
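The discharge measurement described above, the volume of water passing a cross-section per unit time, is usually built up from width, depth, and velocity readings taken in a series of subsections across the channel. The sketch below shows this velocity-area calculation with hypothetical field measurements.

# Minimal sketch of a velocity-area discharge calculation (hypothetical measurements).
# Each subsection contributes width * depth * mean velocity to the total discharge.

def total_discharge(subsections):
    """subsections: iterable of (width_m, depth_m, velocity_m_per_s) tuples."""
    return sum(w * d * v for w, d, v in subsections)

if __name__ == "__main__":
    measurements = [
        (1.0, 0.30, 0.15),   # near the left bank: shallow and slow
        (1.0, 0.55, 0.40),
        (1.0, 0.70, 0.55),   # mid-channel: deepest and fastest
        (1.0, 0.50, 0.35),
        (1.0, 0.25, 0.10),   # near the right bank
    ]
    q = total_discharge(measurements)
    print(f"Estimated discharge: {q:.2f} m^3/s")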
6.7. WINKLER METHOD FOR MEASURING DISSOLVED OXYGEN
The Winkler method is a technique used to measure dissolved oxygen in water. It is a quick, inexpensive test that can be conducted by anyone with minimal training, and it is useful for measuring dissolved oxygen in water samples and in water extracted from soil. In the Winkler method, the dissolved oxygen in a freshly collected sample is chemically fixed with a manganese(II) salt and an alkaline iodide reagent; when the sample is then acidified, iodine is released in proportion to the oxygen originally present, and the amount of dissolved oxygen is determined by titrating this iodine. Dissolved oxygen is a measure of the amount of oxygen present in water. It is important because it affects the health of aquatic organisms and can be used as a measure of water quality. The method uses standard laboratory equipment, such as a graduated cylinder, funnel, and burette. In order to use this method, you will first need to prepare your reagent solution, for example a manganese(II) sulfate solution made by dissolving the salt in distilled water (Rovetta and Castaldo, 2020). Once you have prepared your solution, fill up the graduated cylinder with it so that it reaches about two-thirds full; you might need to pour out some excess solution. Then place your funnel into the cylinder's opening and attach its tube to the burette's outlet port with rubber tubing or packing tape; you may need someone else to help hold these pieces together. Place one end of this tubing into your sample container, which should be about half full, and let it sit for 30–60 seconds. As soon as this time has elapsed, remove the tubing from your container, note how much liquid has been absorbed by its filter paper material, and pour this amount into your sample. The volume of sodium thiosulfate titrant used is proportional to the amount of dissolved oxygen present in the sample, and the result is then converted into milligrams per liter (mg/L) using a standard formula. The Winkler method has many benefits as an environmental sampling technique. It does not require any complicated equipment or specialized knowledge to perform: you only need to be able to judge the titration endpoint (or measure the intensity of a color change in kit versions of the test) and then use some simple calculations to determine how much dissolved oxygen is present in your sample. The most obvious benefit is that it provides accurate results with just one sample, which makes it a cost-effective solution for organizations that need accurate data quickly. It also allows users to monitor changes in dissolved oxygen levels over time
so they can see how these levels change over long periods or across different seasons and weather conditions such as floods or droughts. Field versions of the Winkler test are easy to use: the reagents, such as a manganese salt and an alkaline iodide powder, are mixed into the sample until they dissolve completely, the fixed sample is then titrated or compared against a color standard, and the result is read off directly in parts per million (ppm) of dissolved oxygen (Pleil et al., 2014). There are several challenges to using this method. The main one is that it requires a precisely graduated burette for the titration, which can be difficult to find and expensive to purchase, and the process may take more time than other methods because of the time required for each step. The Winkler method also requires chemicals that can be toxic or environmentally harmful, including manganese sulfate, alkaline potassium iodide, concentrated acid, and sodium thiosulfate. These chemicals can be expensive and hard to obtain, especially if they need to be shipped internationally, and they need proper storage conditions so that they do not spoil or lose their potency before they are used. The Winkler method uses glass bottles that can break if they are dropped or mishandled, which can lead to contamination of the sample by glass shards or other impurities and skew the results. The method is also sensitive to light and timing: if a sample is left exposed to strong sunlight before it is fixed, biological activity can change its oxygen content and the results will not be accurate. Dissolved oxygen is a critical measure of water quality because oxygen is necessary for nearly all aquatic organisms, but the Winkler method does present some practical challenges as an environmental sampling technique. The reagent solutions must be made up correctly, for example by dissolving the manganese salt in distilled water, and they degrade if left to stand too long, so they should be prepared fresh; the fixed sample develops a brownish precipitate when oxygen is present, and it should be mixed thoroughly, with any bubbles discarded (Piegorsch and Edwards, 2002). If your results differ from expected standards, there may be something wrong with your equipment or technique, and you should consult an expert before continuing with your testing. Once the sample has been fixed and acidified, you are ready to take measurements
using a spectrophotometer. You will need to take measurements at least three times during each test session; the results should be averaged together before reporting them to others.
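Once a fixed Winkler sample has been titrated, the dissolved-oxygen concentration follows from the volume and normality of the thiosulfate titrant. The sketch below applies the standard relationship DO (mg/L) = (titrant volume × normality × 8000) / sample volume; the titration figures shown are hypothetical.

# Minimal sketch of the Winkler titration calculation (hypothetical titration values).
# DO (mg/L) = (mL of thiosulfate * normality * 8000) / mL of sample titrated,
# where 8000 = 8 g of O2 per equivalent * 1000 mg/g.

def winkler_do_mg_per_l(titrant_ml, titrant_normality, sample_ml):
    return titrant_ml * titrant_normality * 8000.0 / sample_ml

if __name__ == "__main__":
    # With 0.025 N thiosulfate and a 200 mL sample, each mL of titrant equals 1 mg/L of DO.
    do = winkler_do_mg_per_l(titrant_ml=7.4, titrant_normality=0.025, sample_ml=200.0)
    print(f"Dissolved oxygen: {do:.2f} mg/L")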
6.8. MEASURING TURBIDITY USING A SECCHI DISK
Turbidity is one of the many factors that can affect the health of an aquatic ecosystem. In order to determine the level of turbidity in a body of water, an environmental measurement must be taken. A Secchi disk is a tool used to measure water clarity, which is closely related to turbidity. The disk is lowered from the surface of the water until it is no longer visible, and the depth at which this occurs is recorded as the Secchi depth. This technique can be used to monitor changes in water quality over time, as well as to anticipate how those changes may affect fish populations or other organisms that depend on clean water for survival (Antweiler and Taylor, 2008). The results are then compared against established standards in order to determine whether or not the water quality is acceptable. Turbidity itself is a measure of the amount of suspended particulate matter in water and is commonly reported in nephelometric turbidity units (NTU); it can be caused by anything from sediment particles to algae or bacteria. The Secchi disk was introduced by Father Angelo Secchi in 1865 as a simple way to measure the clarity of water. It is a circular plate, typically either plain white or divided into alternating black and white quadrants, attached to a line marked in units of depth. The amount of light reflected back from the disk as it descends determines how deep it remains visible. When lowered into the water, the disk can be seen to different depths depending on how clear or cloudy the water is: the deeper the disk remains visible, the clearer the water, while in very turbid water it disappears only a short distance below the surface. Although the device is usually credited to Secchi himself, some historians suggest that similar ideas circulated earlier; in any case, it has been used ever since its introduction in 1865 as a quick way to get an idea of how clear water is without taking any samples or doing any other measurements (Adelman, 2003). The Secchi disk has been used in many areas of the world
over the years, including in water quality monitoring, fish stock assessment and health management, marine research and conservation efforts, beach restoration projects, and more. Measuring turbidity with a Secchi disk as an environmental sampling technique has many benefits. The Secchi disk is a simple, inexpensive instrument that can be used for routine water clarity measurements; it is easy to operate and does not require special training or equipment. It provides valuable information about the water column because it gives an estimate of the transparency of the water. This means that it can indicate whether there are suspended particles in the water that would make it difficult for organisms to live there, and it gives an idea of what types of organisms could live in the water by indicating how much light penetrates through it. This information can be very useful when deciding where to put a water treatment plant or whether there are potential pollution problems in your area. Further benefits include the fact that it is an easy, low-cost method for assessing water quality, which means it can be used by people who are not trained in science or engineering to monitor the water quality of their area, and that it works on lakes and rivers as well as other bodies of water, making it useful in many different environments, including those with high levels of pollution or algae blooms (Pleil et al., 2014). It does not require any special equipment besides a Secchi disk and a line or measuring tape long enough to reach the depth at which the disk disappears. The Secchi disk method nevertheless has several challenges that limit it as an environmental sampling technique. One is that it is not highly accurate: readings depend on the observer, and the disk can be difficult to see and judge consistently. A second challenge is that it provides little information in very turbid water; if the water carries heavy loads of particles such as silt or mud, the disk disappears within a short distance of the surface and the reading tells you little beyond that. Finally, you need adequate daylight for a Secchi disk reading; if there is too much cloud cover or if it is nighttime, the method will not work well. Other practical considerations include the following. The size of the disk affects its visibility in different types of water; smaller disks are better suited to shallow waters and larger disks to deeper waters, so measuring turbidity over a wide range of depths may require multiple sizes of disk to get consistent readings (Parise et al., 2014). The color pattern of the disk also affects its visibility in different types of water, so comparing readings across a wide range of clarity levels requires using a consistent disk design in clearer and cloudier
waters. The shape and weight of the disk matter as well: disks are made circular and are weighted so that they hang level and sink straight down on the line rather than drifting.
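Secchi readings are often turned into a rough estimate of the light extinction coefficient using the empirical rule of thumb k ≈ 1.7 / Secchi depth, with the constant varying between roughly 1.4 and 1.9 from one water body to another. The sketch below applies that approximation to a hypothetical reading; it should be treated as a rule of thumb rather than a calibrated model.

import math

# Minimal sketch: estimating the light extinction coefficient from a Secchi depth reading.
# Uses the common empirical approximation k ~= 1.7 / secchi_depth; the constant is site-dependent.

def extinction_coefficient(secchi_depth_m, constant=1.7):
    return constant / secchi_depth_m

def euphotic_depth(k):
    """Depth at which light falls to 1% of surface irradiance: z = ln(100) / k."""
    return math.log(100.0) / k

if __name__ == "__main__":
    k = extinction_coefficient(secchi_depth_m=2.5)   # hypothetical reading
    print(f"Estimated extinction coefficient: {k:.2f} 1/m")
    print(f"Approximate euphotic depth: {euphotic_depth(k):.1f} m")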
6.9. CONDUCTIVITY, TEMPERATURE, AND DEPTH ROSETTE (CTD)
A conductivity, temperature, and depth (CTD) rosette is an environmental sampling instrument that measures the conductivity of seawater, from which salinity and other parameters can be derived. The measurement is made by lowering a rosette sampler system through the water column. The rosette consists of a circular frame that carries the CTD sensor package together with a ring of sampling bottles (often called Niskin bottles). The frame is lowered into the ocean on a cable, and individual bottles are closed at chosen depths so that water is captured from different levels and locations within the water column. Once the bottles have been filled at various depths, the samples can be analyzed for temperature, salinity, oxygen levels, and pH, among other properties. The CTD probe itself records conductivity, temperature, and pressure (from which depth is calculated) at regular intervals as it is lowered vertically through the water column, building up a continuous profile from the surface to the bottom of the cast. The CTD is a device that has been used for decades to measure the physical properties of water; it can be deployed from a ship or installed in an underwater habitat to collect data about the surrounding environment. Instruments of this kind were developed in the late 1950s and 1960s by oceanographic researchers in California who wanted a system that could collect such data without requiring divers or technicians to lower and read individual instruments manually (Patil and Taillie, 2003). One early prototype was a cylindrical glass tube equipped with sensors at different depths inside
it, so that conductivity and temperature as well as depth could be measured without the instrument having to be raised and lowered manually by scientists or technicians, as earlier methods required. The team tested this initial prototype on an expedition off the coast of California near San Clemente Island in 1958, where they collected data on temperature profiles and pressure levels within each layer of the surrounding seawater as well as around their own vessel. The benefits of the CTD as an environmental sampling technique include ease of use: the equipment can be operated by anyone with minimal training and does not require specialized knowledge. It is also versatile: it can be used in many different settings, including rivers, lakes, and oceans, and it can be used to investigate pollution sources in urban as well as rural areas. It is cost-effective because it requires only one integrated apparatus, unlike approaches that rely on multiple separate devices such as depth gauges, underwater cameras, and underwater lights, all of which are expensive (Notaro et al., 2019). The method is very effective at collecting data about the water quality of an area, and it is quick and easy enough to use for testing water quality on a regular basis. The main challenge of using this technique is that the instrument package must be lowered into the water at various depths and locations in order to collect data, and this is not always possible, for example when the ship carrying out the sampling lacks the space or winch capacity to deploy it. Another drawback is that it can take longer than other methods: if you want to take six samples over three hours using a simpler method such as a bucket system, each sample might take about 15 minutes, whereas with a CTD rosette each sample can take around 70 minutes because more steps are involved, such as drawing water up from depth. It can also be difficult to assess the samples promptly once they have been collected, because of how long it takes for them to be retrieved from the ocean.
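Because the probe records continuously as it descends, raw CTD casts are usually averaged into regular depth bins before analysis. The sketch below shows a minimal depth-binning step on a hypothetical cast; real processing would also include sensor calibration, despiking, and conversion of conductivity to salinity.

# Minimal sketch: averaging a raw CTD cast into 5 m depth bins (hypothetical readings).
from collections import defaultdict

def bin_profile(cast, bin_size_m=5.0):
    """cast: iterable of (depth_m, temperature_c, conductivity_ms_cm) tuples."""
    bins = defaultdict(list)
    for depth, temp, cond in cast:
        bins[int(depth // bin_size_m)].append((temp, cond))
    result = []
    for idx in sorted(bins):
        temps, conds = zip(*bins[idx])
        result.append((idx * bin_size_m,
                       sum(temps) / len(temps),
                       sum(conds) / len(conds)))
    return result  # (bin top depth, mean temperature, mean conductivity)

if __name__ == "__main__":
    cast = [(1.2, 18.4, 52.1), (3.8, 18.3, 52.0), (6.5, 16.9, 50.7),
            (9.1, 16.4, 50.2), (11.7, 14.8, 48.9)]
    for top, t, c in bin_profile(cast):
        print(f"{top:4.0f}-{top + 5:.0f} m: T = {t:.1f} degC, C = {c:.1f} mS/cm")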
6.10. STABLE ISOTOPE PRIMER AND HYDROLOGICAL APPLICATIONS
Stable isotope analysis and its hydrological applications are common methods for the study of environmental samples. Stable isotopes are naturally occurring forms of an element, found in rocks, soil, and water, whose nuclei contain different numbers of neutrons and which therefore differ slightly in atomic mass; these mass differences allow them to be used as markers of where a material came from. When these isotopes are absorbed by living things such as plants or animals, they become part of their makeup and can be used to determine where they originated. The stable isotopes of carbon, hydrogen, nitrogen, and oxygen are all used to determine the composition and history of water, soil, and other material samples. The first step in this process is to take a sample from the environment, for example a soil or water sample. A small amount of the sample is then placed into a container cooled with liquid nitrogen, which freezes it so that it does not degrade appreciably while being stored for later analysis. The next step is to grind the sample into small pieces using a mortar and pestle so that it can be analyzed properly (Ng and Vecchi, 2020). The third step involves heating the ground-up sample until it is converted into a gas, which is then injected into a mass spectrometer and measured against known reference values generated from other natural materials such as fossil fuels or peat bogs; this allows researchers, for example, to compare the carbon present in the atmosphere today with what was there when those materials formed. The technique has been used extensively in environmental sampling because it allows scientists to determine how much water has been removed from certain areas over time as a result of human activities such as irrigation or farming. The history of stable isotope methods in environmental sampling goes back several decades. Whereas radiocarbon (14C) dating uses a radioactive isotope to determine the age of an object, stable isotope analysis relies on isotopes that do not decay: atoms of the same element that simply have different numbers of neutrons and therefore slightly different masses. The use of stable isotopes as an environmental sampling technique has its origins in hydrological applications. Hydrologists have long been interested in measuring the sources and transport of groundwater, because such knowledge is essential for understanding how water moves through subsurface aquifers and rivers. An early use of stable isotopes for this purpose was reported by R. C. Hartley et al., who applied carbon-13 and nitrogen-15 analyses to groundwater samples from two aquifers near San Francisco Bay. This work showed that different aquifers
could be distinguished based on their isotopic composition, which suggests that they may have distinct geological histories or origins. The use of stable isotopes for environmental sampling began in the 1960s but did not become widely used until the 1990s. In those days, it was used primarily to measure groundwater flow rates by measuring the rate at which water leaches into soil from rocks above ground level (Marohasy, 2003). It was also used to measure infiltration rates and water usage within agricultural fields. Stable isotope primer and hydrological applications are powerful tools for environmental sampling. This technique has the advantage of being able to identify a wide range of different types of materials, including oil spills and other industrial pollutants. This makes it particularly useful in determining whether there are any serious problems with the environment. This process is useful in determining whether or not toxic materials have been dumped into nearby water sources by polluters who have been responsible for polluting the environment. It is also helpful in determining if animals are drinking water from local sources or if they are drinking surface runoff from an area that has been contaminated with toxic chemicals. This knowledge can help prevent further damage to the ecosystem of an area by allowing agencies responsible for overseeing environmental issues to take action before further damage occurs. One of the benefits of stable isotope primer and hydrological applications is that it provides a more accurate picture of what is happening in an area than other techniques such as aerial photography or satellite imagery alone. This is because it combines two different sources of information: one that reveals how much material there is in an area, and one that reveals how much water there is as well (Austin et al., 2005). Stable isotope primer and hydrological applications are also very useful for identifying where certain types of pollution may have come from, whether it’s from natural sources or industrial ones. It can even be used to determine whether there might be any long-term effects on the environment from these sources after they’ve been cleaned up or removed from an area entirely. Stable isotope primer and hydrological applications have been used as an environmental sampling technique for many years. However, there are some challenges that must be overcome in order to make this method work. The first challenge is that the samples need to be collected at the right time. If they aren’t, then you won’t get an accurate reading because the water will have changed its isotopic composition. For example, if you collect your sample in winter, when there is less evaporation than during other times of the year, then your results will not accurately reflect what’s going on in the environment at other times of the year. Another challenge is that you need
to make sure everything runs smoothly when collecting samples; if any part of the sampling or analysis equipment breaks down or malfunctions, you will not get a good result either. Another challenge with using this technique is that it requires extensive training to use properly. Results can differ depending on who conducts the testing, because analysts have different backgrounds and experience levels that can affect how they interpret the data from their samples. A third challenge is that the technique may not always be accurate, owing to errors in measurement or in interpretation by the researchers conducting the tests.
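Stable isotope measurements are almost always reported in delta (δ) notation, the per-mil deviation of a sample's isotope ratio from an agreed reference standard. The sketch below computes δ values from raw ratios; the sample ratios are hypothetical, and the VSMOW reference ratio shown is the commonly cited value for oxygen-18.

# Minimal sketch of delta notation:  delta = (R_sample / R_standard - 1) * 1000  (per mil).
# The VSMOW 18O/16O reference ratio below is the commonly cited value; sample ratios are hypothetical.

R_VSMOW_18O_16O = 2005.2e-6

def delta_per_mil(r_sample, r_standard):
    return (r_sample / r_standard - 1.0) * 1000.0

if __name__ == "__main__":
    r_rain = 1.985e-3         # hypothetical 18O/16O ratio of a rain sample
    r_groundwater = 1.993e-3  # hypothetical 18O/16O ratio of a groundwater sample
    print(f"delta-18O rain:        {delta_per_mil(r_rain, R_VSMOW_18O_16O):+.1f} per mil")
    print(f"delta-18O groundwater: {delta_per_mil(r_groundwater, R_VSMOW_18O_16O):+.1f} per mil")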
6.11. CHALLENGES OF ENVIRONMENTAL SAMPLING
As we move into an increasingly digital world, it can be easy to overlook the importance of environmental sampling. While we may not always be able to see or quantify it directly, the environment provides valuable insight into the health of our society and planet. Environmental sampling can help us understand how humans are affecting the environment and what steps need to be taken to mitigate those effects. It can also help us gain a better understanding of how our own bodies interact with our surroundings. This kind of information is essential for making informed decisions about how we live, work, and play in our world, and it is only possible through environmental sampling. In order for environmental sampling to take place, a sample must be taken from an area where an organism lives. This sample is then analyzed so that scientists can determine what that organism needs in order to survive (Madavan and Balaraman, 2016). The results of such studies are used by governments and organizations such as the United Nations to create policies on how humans interact with their natural environment. There are many challenges facing environmental sampling, including the need for accurate measurement devices, the presence of a wide variety of contaminants that can interfere with the results, and a lack of awareness among people who live in areas where environmental sampling needs to take place. One of the biggest challenges is contamination. Contaminants such as dust, dirt, and aerosols can get into samples and make them unrepresentative of what is actually happening in the environment. Contamination can also occur when people fail to wear gloves or do not wash their hands before taking samples, which can lead to false positive results for bacteria or viruses.
CHAPTER 7
MODELS FOR DATA
CONTENTS
7.1. Introduction................................................................................................... 156 7.2. Literature Review........................................................................................... 158 7.3. The Process of Developing Models for Data................................................... 163 7.4. Types of Data Models..................................................................................... 164 7.5. The Advantages that Come With Using The ER Model..................................... 167 7.6. Importance of Data Models............................................................................ 170 7.7. What Makes a Data Model Good?................................................................. 176 7.8. Data Properties.............................................................................................. 183 7.9. Data Organization......................................................................................... 184 7.10. Data Structure.............................................................................................. 184 7.11. Data Modeling Tools to Know...................................................................... 185 7.12. ER/Studio..................................................................................................... 186 7.13. Db Modeling................................................................................................ 186 7.14. Erbuilder...................................................................................................... 186 7.15. Heidisql....................................................................................................... 187 7.16. Open-Source................................................................................................ 187 7.17. A Modeling Tool for SQL Databases............................................................. 188 7.18. Data Flow Diagram (DFD)........................................................................... 188 7.19. Data Conceptualization............................................................................... 188 7.20. Unified Modeling Language (UML) Models.................................................. 189 7.21. Data Modeling Features............................................................................... 190 7.22. Data Modeling Examples............................................................................. 191 7.23. Summary...................................................................................................... 192
7.1. INTRODUCTION
When a computer system manages real-world things such as commodities, suppliers, customers, or orders in a database, a data model is used to describe these entities and their interactions with one another. A data model may also be used to describe the database itself. Data models are usually employed to facilitate communication between the business people who state the requirements for a computer system and the technical specialists who design the architecture in response to those requirements. These data models define the data that is required and produced by business operations. A data modeling notation is used to produce graphical representations of data models (Kobayashi and Ye, 2014). In computer programming languages, the term "data structure" is often used for much the same idea. Function models are widely used alongside data models in the context of large-scale enterprise models. The process of reviewing and reporting on an organization's data can benefit greatly from data modeling; because of the role it plays in storing and retrieving data, it is a vital component of any IT system study (Figure 7.1).
Figure 7.1. Data modelling Source: https://media.istockphoto.com/photos/risk-management-and-assessment-for-business-picture-id1158174961?k=20&m=1158174961&s=612x612 &w=0&h=TH1ZDTTdDfuvikMnRqdH3PpoBU4VqqVtX-ACnjJ-UlQ=.
A data model is a document that organizes the database structure of an organization. On the other hand, the realization of a shared idea and objective ought to be an easy process. The construction of an information framework should be represented by data modeling in order to fulfill a number of objectives. One of these objectives is to identify some of the fundamental components of the information framework, which can be expressed in terms of entities and the functions that they perform within an association. Data modeling has developed into an imperative need in recent years as a direct result of the proliferation of non-traditional computer applications. Another way to say it is that data is usually hidden, which is why database architecture is becoming more crucial. Many different kinds of organizations are finding that it is becoming more and more important to keep reliable records. In today’s increasingly digital society, the benefits of keeping and utilizing databases, as well as the number of enterprises that do so, are both on the rise (Bankole and Surajudeen, 2008). Data sets are compilations of a diverse assortment of information, including but not limited to corporate records, financial data, email addresses, and phone numbers. Databases cannot exist without a theoretical model that is referred to as a data model. The information consistency criteria, as well as a data representation and an explanation of the paradigm’s semantics, are incorporated into this paradigm. The data model, rather than concentrating on how the information will be used, focuses on the types of information that are necessary and how they are ordered. This is in contrast to the focus on how the information will be utilized in the application. Because they make it simpler to appreciate the information needs at hand, data models may be useful as a problem-solving tool in circumstances like these. In the field of information technology, the production of conceptual models and the establishment of linkages between different data items may benefit from the use of a data model, which is analogous to the structural design created by an architect. The underlying structure and relationships of a collection of data may sometimes be seen through data modeling. This is the most crucial criteria to take into consideration, as shown by the data structure. The gathered information about data modeling may be put to use in the development of database applications, printed reports, and electronic displays. The economic success that may be attributed to data modeling can be ascribed, at least in part, to its ability to arrange and organize data. The use of analysis and modeling is necessary for businesses in order for them to recognize and address their issues and concerns, as well as
develop solutions that are both realistic and implementable. A data model may be described in a variety of ways, including logically, physically, and conceptually; using a data model for the following purposes is recommended:
• To begin with, it ensures the accurate handling of the data entities obtained during data gathering. If the data are not properly collected and documented, it is hard to draw any accurate conclusions from the investigation (Khormi and Kumar, 2011).
• All of the aforementioned aspects of data creation can benefit, to varying degrees, from the use of a data model.
• Database developers gain a credible representation of the essential data, which they are free to use as they see fit to produce an actual database. For instance, a data model can represent relational tables, primary and foreign keys, and stored procedures.
• The model may be used to determine whether data are missing or have been duplicated.
• Because the data model is tailored to your organization, upgrading and maintaining your IT infrastructure is cheaper and faster, and over time the basic structure becomes familiar (Browning et al., 2015).
7.2. LITERATURE REVIEW
The data model is a graphical depiction of how the framework will appear once all of its constituent parts have been assembled. This section explains the informational components and how those components interact with one another. The database management framework uses data models to illustrate how information is stored, how it is linked to other information, and how it is retrieved and updated. So that the members of the organization can communicate with one another and understand the information, the data is described using an agreed set of symbols and notations. One example of a data model is that of a building with offices and processing facilities: in a number of different contexts, a particular data model, such as that of an office, a building, or a factory, is brought up for discussion. Such a model is produced by combining aspects of the office with data and documents pertaining to the workplace (Köhl et al., 2000). In the world of computer languages, the phrases "data model" and "data structure" are frequently used synonymously with one another (Figure 7.2).
Figure 7.2. Data models. Source: https://previews.123rf.com/images/radiantskies/radiantskies1212/radiantskies121200259/16633408-abstract-word-cloud-for-data-model-with-related-tags-and-terms.jpg.
The management of huge quantities of both structured and unstructured data falls under the purview of data systems. Relational database management systems (RDBMSs), for example, rely on data models to represent the structure, control, and integrity of their data; these models can be created manually or automatically. Data models are generally not applied to unstructured data such as word-processing documents, email messages, digital images, digital audio, or digital video. A typical practice is to use data models to integrate database models, construct data systems, and facilitate the exchange of data. Most people will use a data modeling language if they need to present a data model. The following are the components that
make up a data model: a structural component that lays down a predetermined set of rules from which databases can be built in an automated fashion (Bustos-Korts et al., 2016); a manipulation component that outlines the various operations that can be performed on the data (for example, tasks for updating or retrieving data from the database and for modifying the database structure); and, possibly, a predetermined set of rules used to validate and verify the data. There are a few different types of data models, the most notable of which are the conceptual, logical, and physical data models (PDMs). Each kind of data model accomplishes a specific goal when used in the appropriate context. A conceptual data model, also known as a conceptual data schema, is a consistent representation of the relationships and ideas that make up a database. Its objective is to establish entities, their associated attributes, and the relationships between them. At this point in the data modeling process, only the most fundamental information about the database is captured, and constructing this model is usually the responsibility of the information architects and of the business stakeholders associated with the data model. To put together a conceptual data model we need three things: entities (real-world objects), the relationships between them (a connection between two entities or a mutual dependency on one another), and their attributes. A conceptual data model has the following characteristics: it must be accessible to all members of the organization if the whole organization is to profit from it, and it is drawn up without regard to the database management system (DBMS) vendor or other programming-related factors, using only the resources at hand (Chai et al., 2020). When working with this kind of model, it is essential to take the perspective of the eventual user. By establishing the necessary concepts and scope, conceptual data models, sometimes referred to as domain models, allow all stakeholders to communicate using the same vocabulary. The logical data model is used to explain both how data elements are formed and how those elements interact with one another; it adds further detail to the conceptual data model. One benefit of logical data models is that they can act as the basis for physical models. On the
other hand, the structure remains independent of any particular implementation and is shared by all participants. At this stage the data model does not yet contain primary or secondary keys, and before any new relationships are defined at this level, all of the information previously gathered about the established relationships must be checked for accuracy and brought up to date. The defining properties of a logical data model are these: data are specified for a particular project, although the model can be synchronized with other logical data models depending on how far the project has progressed; the model is independent of the database administration technology and must be built from the ground up (Christensen and Himme, 2017); the lengths and precisions of the data types to be used for the attributes are worked out; and the model goes through successive normalization cycles, typically up to third normal form (3NF). A physical data model (PDM) is the kind of data model that describes how the data will actually be stored. Because a PDM makes available a large quantity of additional information, known as metadata, it supports implementation work directly. To make the creation of the database easier to represent, the PDM frequently uses RDBMS features such as partition keys, constraints, and indexes. The following are some of the properties of a PDM: a number of physical data models may be kept synchronized with one another, the scale of the project being the deciding factor; when developing the model it is essential to take into account the DBMS, location, data storage, and technology (Kessler et al., 2015); a model with linked tables can record nullability and cardinality requirements; default values, data types, and lengths must be specified with care; and the keys, views, indexes, and access profiles are described in detail. This layered approach is essential because it allows the three points of view to operate largely independently of one another: it is not necessary to alter the logical or conceptual model in order to make changes at the physical level, although the designs must, in all circumstances, remain consistent with the other models. The table and column design may interpret the entity classes and attributes in a variety of ways, but in the end it should still satisfy the goals that were laid out by the conceptual entity class structure.
During the early stages of product development, a conceptual data model is used in a number of different contexts and settings. It can serve as the foundation on which a logical data model is built, and a PDM may later be constructed from that paradigm. If necessary, a conceptual model can be implemented directly. After a data model has been used to coordinate the various components of the data and specify how they relate to one another, the next step is to use a database model to decide how data may be stored, organized, and managed, as well as to define the logical architecture of the database. The following are examples of database models that may be found in DBMSs.
Structure of the Hierarchical Model
This model is laid out in the form of an upside-down tree, with the higher-level entries coming before the lower-level data. Within the tree, only the root node, sometimes referred to as the root record, has no parent; it sits at the very top of the structure. Below it are the parent records, each of which may own a significant number of child records, whereas a child record can hold the information of only one parent. The relationships between parents and children are therefore one-to-one or one-to-many (Dutilleul et al., 2000). Hierarchical databases are used to store information and retrieve it quickly and accurately from the storage location, and they are deployed on a single server the vast majority of the time. The banking and insurance sectors, for example, both run significant legacy systems of this kind, which call for consistent, high-volume transaction processing (TPS). A hierarchical database organizes its data in accordance with a hierarchy that has been established beforehand. The most efficient way to retrieve the information is to follow the hierarchy step by step, and this can be quite fast provided you know in advance the specific purposes that each item of data will serve. Databases with several layers are appropriate where there is a very low probability that the requirements will be altered.
7.3. THE PROCESS OF DEVELOPING MODELS FOR DATA
The data model explains how a certain category of data might be put to use to solve a specific problem that a customer is having in their business. A company can use its own internal language and set of ideas to explain how a database works and how it is constructed; these concepts describe how the database is structured. The term "data structure" covers all of the information, connections, and constraints that are included in a database. A data model is a specification of the database procedures that, if implemented, will enable the retrieval and update of authoritative data; it captures the conception that an organization or any other entity has of a certain data-demand scenario (Farrell et al., 2010). When trying to explain organizational data and demonstrate the connection between the data and its value, a conceptual data model is the tool of choice. Information about an organization's processes may be gathered and then used to construct and explain a high-level data model. When it comes to managing the data structure, data modeling is an all-encompassing strategy that can be applied to database administration. Data management based on information technology may be performed on both structured and unstructured datasets; a data model allows structured data to be represented within a database, and a solid understanding of hierarchical data is necessary for the creation of automated systems, but the notation is not typically used to represent unstructured data. The requirements for the database architecture as well as the data structure are described and examined in further detail. A data model is an excellent illustration of a product developed with the end-user in mind, and it is adaptable enough to be used as a template for a wide range of other business data frameworks. The data model links the database holding crucial information to the real working data. In the realm of information technology, the term "blueprint" is used frequently for such a document; even though it is quite comparable to other well-known models, only a small number of individuals outside the information technology field are familiar with it (Fourcade et al., 2018). One of the best-known uses of such a blueprint is a series of diagrams that make even the most challenging and complex business tasks intelligible to individuals who are not experts in the subject matter.
Programmers and end-users both stand to benefit from an improved grasp of data needs if data modeling is successful in its overarching mission. It is remarkable in terms of effectively expressing the requirements of a company. Data modeling can seem very different from company to company as a result of the underlying organizational differences that exist in each business. Within an organization, the data model is put to use by both the functional teams and the technical teams. The functional group is comprised of end-users and business experts, while the technical group is composed of software developers and designers. As a point of departure, the group has the presumption that the data model would, in the same vein, live up to their anticipations. The utilization of numerical data is a component of data modeling. The client’s location name must possess a particular set of characteristics chosen from a given list in order for the name to be distinct. For the sake of preserving the reliability of the system’s internal database, these particulars are an absolute necessity.
7.4. TYPES OF DATA MODELS
There are several different data models available (Figure 7.3). Some of them are:
• Entity-relationship (E-R) model;
• Hierarchical model;
• Network model;
• Relational model.
Figure 7.3. Types of data models. Source: https://cdn.educba.com/academy/wp-content/uploads/2019/11/Typesof-Data-Model.png.
7.4.1. Entity Relationship (E-R) Model
The process of creating a database is illustrated by the E-R model through a diagram referred to as an entity-relationship diagram (ERD). ER models can later be incorporated into the design or architecture of databases. This is a high-level data model: it specifies the data components and how they interact with one another. The entity set and the relationship set are the two components of the E-R paradigm that are considered the most significant. With an ER diagram it is possible to examine the interrelationships that exist between different groupings of entities (Fuentes et al., 2007). A collection of related things, whether or not attributes have yet been assigned to it, is known as an entity set. Examples of database entities are tables and the characteristics associated with them. The ER diagram illustrates the links between tables and the attributes of those tables; its purpose is to show that the architecture of a database is sound.
• Entity: Database entities can be of many different kinds, including people, places, things, and events. In an entity-relationship diagram, an entity is denoted by a rectangle; managers, for example, can be modeled as entities. Entities are characterized in one of two ways. A strong entity possesses a distinguishing key attribute that can identify each item in the database, and it is usually shown as a single rectangle. A weak entity lacks a significant key attribute of its own and must rely on the attributes of a related strong entity; on the ER diagram it is denoted by a double rectangle.
• Attributes: When developing an entity-relationship model (ERM), attributes are used to represent the properties of an entity. An attribute is shown as a circle or oval in the ER diagram, and each oval is connected to the rectangle of the entity to which it belongs. A wide range of an object's properties can be considered in order to describe, say, a manager. The ER model distinguishes several kinds of attribute. A simple attribute is one that cannot be broken down into smaller pieces in any further way; an individual's gender is an example. A key attribute is one whose value distinguishes one member of an entity set from another; it is the key referred to in the ER diagram and is drawn as an ellipse with a line through the attribute name. For example, the id of a manager will be unique to each manager (Kandlikar et al., 2018). A composite attribute is formed by combining two or more essential attributes, and it is shown with additional ellipses attached to it; the Name attribute of a manager, for instance, has fields for a person's First Name, Middle Name, and Last Name. A derived attribute is one that can be computed from another attribute, and it is shown as a dashed oval; the age of a manager is a derived attribute because it changes over time and can be determined from the attribute DOB (Date of Birth). Finally, an attribute may hold several values at once, as when a record keeps customers' phone numbers and email addresses in several places.
• Relationship: In the entity-relationship paradigm, a relationship is the connection that exists between two or more entities, and it appears on the ER diagram as a diamond. When operating a firm, it is customary to have a manager on hand to oversee daily operations; the relationship between the two is "Works in." The degree of a relationship describes how many entity sets take part in it: a unary relationship involves just one entity set (for instance, a manager monitoring or ordering the work performed by others in the company), a binary relationship links two independent entity sets, and a ternary relationship involves three distinct entity sets (Figure 7.4). A minimal code sketch of these ideas is given after Figure 7.4.
Figure 7.4. An entity-relationship (E-R) model diagram. Source: https://www.learncomputerscienceonline.com/entity-relationship-diagram/.
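To make the notions of entity, key attribute, composite attribute, derived attribute, and relationship concrete, the following is a minimal sketch in Python, not taken from the book: it models a hypothetical Manager entity with a key attribute (manager_id), a composite Name, a derived Age, and a binary "Works in" relationship to a Department entity. All names and values are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Name:
    # Composite attribute: built from simpler attributes.
    first: str
    middle: str
    last: str


@dataclass
class Department:
    dept_id: int          # key attribute of the Department entity
    dept_name: str


@dataclass
class Manager:
    manager_id: int       # key attribute: unique for each manager
    name: Name            # composite attribute
    date_of_birth: date   # simple (stored) attribute
    works_in: Department  # binary "Works in" relationship to another entity

    @property
    def age(self) -> int:
        # Derived attribute: computed from DOB rather than stored redundantly.
        today = date.today()
        had_birthday = (today.month, today.day) >= (
            self.date_of_birth.month, self.date_of_birth.day)
        return today.year - self.date_of_birth.year - (0 if had_birthday else 1)


sales = Department(dept_id=10, dept_name="Sales")
m = Manager(1001, Name("Asha", "K", "Rao"), date(1985, 6, 12), sales)
print(m.age, m.works_in.dept_name)
```

The point of the sketch is only to show where each ER concept would live in an implementation; a real design would normally be captured first as a diagram and only then turned into tables or classes.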
7.5. THE ADVANTAGES THAT COME WITH USING THE ER MODEL
• In theory, the ER model may be constructed in as little as a few hours. As long as the model's attributes and elements are linked in a meaningful fashion, the diagram can be completed very rapidly.
• Database designers use this paradigm frequently as a means of communicating their thoughts, as it is an efficient tool for doing so.
• Any model can be modified in a straightforward way, because the ER model leads naturally to the relational model and may be quickly transformed into tables (Frankenhuis et al., 2019).
• This model is also capable of being converted into a variety of additional models, including the network and hierarchical models.
7.5.1. Hierarchical Model
When using this data model, the information is structured as a tree with a single root, and every branch is connected back to that root. The hierarchy grows downward from the root, with the number of child nodes increasing beneath their parents. A child node in this diagram is connected to only one parent, although a parent may have more than one child. Because the data are stored in a tree structure in this format, it is necessary to begin at the topmost node in order to traverse the tree and extract the data. In the hierarchical data model, a link that goes from one type of data to another is a one-to-many connection. Each piece of information is connected to the others, and the record of the information is kept as a separate file (Figure 7.5).
Figure 7.5. An example of a hierarchical model. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/Hierarchical_Model.svg/1200px-Hierarchical_Model.svg.png.
Imagine a company that is tasked with the obligation of maintaining complete and precise records on every one of its workers. In addition to the individual’s first name, employee code, and department, the individual’s last name is also shown in the table. In addition, the company gives each employee their personal computer to use throughout their time there. As a consequence of this, storing the data in a distinct computer table is something that is strongly recommended (Gallyamov et al., 2007). A computer table may be used to hold the code, serial number, and type of each employee. In a hierarchical data architecture, employees are represented by a parent table, whilst computers are represented by a child node.
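A hedged sketch of the employee and computer example above: the parent (employee) record owns several child (computer) records, and data are reached by traversing the tree from the root. The field names and values are illustrative, not taken from the text.

```python
# Hierarchical (tree) representation: each parent owns its children,
# and every child has exactly one parent (a one-to-many relationship).
company = {                                   # root node
    "employees": [                            # parent records
        {
            "employee_code": "E-101",
            "first_name": "Ravi",
            "last_name": "Shah",
            "department": "Accounts",
            "computers": [                    # child records of this employee
                {"serial_number": "SN-9001", "type": "laptop"},
                {"serial_number": "SN-9002", "type": "desktop"},
            ],
        },
    ]
}

# Traversal always starts at the root and walks down the hierarchy.
for emp in company["employees"]:
    for pc in emp["computers"]:
        print(emp["employee_code"], pc["serial_number"], pc["type"])
```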
7.5.2. Network Model
The network model, which is a form of database model, is dependent on the availability of a versatile method for expressing both the items and the links that link them together. The schema is an essential part of the network data
model, which may be visually represented as a graph. In this representation, the edges of the graph denote relationships, while the nodes of the graph denote entities. The manner in which data are structured inside a hierarchy is the primary contributor to the disparities that may be observed between the two categories of models. On the other hand, data in a network data model are often represented utilizing a graph (Figure 7.6).
Figure 7.6. An example of a network model. Source: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Bachman_order_processing_model.tiff/lossless-page1-1200px-Bachman_order_ processing_model.tiff.png.
Utilizing a network model provides a number of benefits, one of which is the ability to see the most important relationships. This data model is capable of supporting both one-to-one and many-to-many relationship configurations. When compared with the hierarchical data model, which is more difficult to access, this model is much easier to navigate. They will always be connected so long as there is a connection that can be made between the parent and child nodes (Girshick et al., 2011). In addition, the other node does not have any effect on the data in any way, shape, or form. The method suffers from the additional fatal defect of being unable to accommodate newly acquired information or altered circumstances. There is evidence to suggest that in order to make any changes, it will be necessary to make adjustments to the entire system. This would necessitate a significant investment of both time and effort. The incapacity of this paradigm to preserve data, in conjunction with the fact that each record is connected by pointers, results in the creation of a system that is both complicated and highly developed.
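A minimal sketch, under assumed record names, of the network model as a graph: nodes stand for records and edges for the links between them, so a member record (an order, say) can be connected to more than one owner record (a customer and a product), which is how many-to-many relationships arise.

```python
# Network model sketch: a record may participate in links with several
# "owner" records, so the structure is a graph rather than a tree.
records = {
    "customer:C1": {"name": "Acme Ltd."},
    "product:P7":  {"name": "Rain gauge"},
    "order:O42":   {"quantity": 3},
}

# Each edge links an owner record to a member record (many-to-many overall).
links = [
    ("customer:C1", "order:O42"),   # the customer placed the order
    ("product:P7", "order:O42"),    # the order refers to the product
]

# Navigation follows pointers (edges) instead of walking a single tree.
def members_of(owner: str) -> list[str]:
    return [member for o, member in links if o == owner]

print(members_of("customer:C1"))   # ['order:O42']
print(members_of("product:P7"))    # ['order:O42']
```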
7.5.3. Relational Model
In this data model, a collection of items is gathered into relations through the use of data tables. Connected tables are used to depict the various components of this paradigm. Each table is composed of a number of rows and columns, with each row denoting a separate record and each column a particular attribute of the item. In this data structure, each row in the table is distinguished from the others by a unique primary key. SQL (Structured Query Language) is used to retrieve the data from the system. The primary key is the most important tool that the relational data model has to offer (John et al., 2021). In addition, every single piece of data that is gathered must be unique. It is important to eliminate any inconsistencies in the data tables wherever possible, because they might create access issues. The relational data model has a number of issues, including duplicated data, insufficient data, and incorrect linkages (Figure 7.7).
Figure 7.7. An example of a relational model. Source: https://raw.githubusercontent.com/gulvaibhav20/assets/master/Scaler/Relational Model/Table.jpg.
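A small sketch using Python's built-in sqlite3 module shows the relational ideas just described: each row of a table is identified by a primary key, and SQL is used to retrieve the data. The table and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throw-away relational database
cur = conn.cursor()

# Every row is identified by a unique primary key.
cur.execute("""
    CREATE TABLE policy (
        policy_number INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL,
        premium       REAL NOT NULL
    )
""")

cur.executemany(
    "INSERT INTO policy VALUES (?, ?, ?)",
    [(1, "A. Kumar", 120.0), (2, "B. Mehta", 95.5)],
)

# SQL (Structured Query Language) retrieves rows by attribute values.
for row in cur.execute(
        "SELECT policy_number, premium FROM policy WHERE premium > 100"):
    print(row)                               # (1, 120.0)

conn.close()
```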
7.6. IMPORTANCE OF DATA MODELS
Data are the fundamental elements upon which an information system is constructed. Applications are utilized for the management and transformation of data and information. On the other hand, various individuals view the
data in a variety of different ways, which is why relational databases need data modeling. Compare the way the management of a firm and a clerk at the company interpret the same information. Even though the manager and the clerk are employed by the same company, it is much more likely that the manager has an enterprise-wide perspective on the data that the company maintains than that the clerk does (Figure 7.8).
Figure 7.8. Importance of data models. Source: https://www.csc.gov.sg/articles/bring-data-in-the-heart-of-digital-government.
Here are some of the benefits of data models:
• Reduce Costs and Shorten the Time it Takes to Realize Benefits: With the use of data modeling, business users can take an active part in the design of key business rules, which reduces the number of modifications that have to be made during implementation. When the requirements-gathering and development phases of the process are merged, a significant amount of time can be saved, and new products and initiatives reach the market sooner. We were able to assist a client in reducing the time it would take to get their product into production from nine months to three by using data modeling (Giraldo et al., 2010). The earlier issues are discovered through data modeling, the more money can be saved: your staff will be far less likely to deliver products riddled with faults to top management or, worse, to consumers. For this reason, data modeling can reduce software development expenditure by as much as 75%. Data modeling demands perseverance and detailed documentation, on both the IT and the business side, yet it remains one of the most effective ways to gain control over your data, cut expenses, and speed up growth. A data analyst or programmer working on an application looks at the information very differently from a typical end-user. Application programmers are the people who take corporate operations and policies and transform them into user-friendly forms presented in reports and queries, and this translation takes place during software development. Whether the application programmer's perspective differs from that of management or the end-user matters little when an adequate database design has been provided; problems are far more likely to arise when there is no reliable database architecture in place (Iwai et al., 2005). Different kinds of software, such as inventory management and order entry systems, may otherwise end up using a variety of distinct product-numbering schemes, and a firm can lose hundreds of thousands, or even millions, of dollars as a result.
• Examine and Make Enhancements to the Company's Workflow: Data modeling presents your company's processes and procedures in a way that allows other contributors to take part in the modeling exercise. If you do not have a firm understanding of how the inner workings of your organization function, you will not be able to express your facts, or what the organization does, in an effective manner. Consider, for example, how your company is using the consumer data it already holds to construct the client database it is now working on. If your firm uses data modeling, its business operations may become easier to understand and may also be improved.
• Cut Down on Both the Risk and the Complexity: To handle the avalanche of data that every company must contend with, you need to ensure that your data is both vital and low-risk. As soon as you gain access to additional information, you should begin to consider how you will organize it, and you must do so in the appropriate manner, taking into account the many data-compliance obligations that all businesses share. To accomplish this, every detail must be carefully documented and linked to the ever-growing data collection (Homma et al., 2020). Data models, the graphical representations of your data processes, provide an overview of your data architecture. They make it feasible to examine all of your data, which reduces risk, because transformations, metadata, and filters are no longer hidden or scattered. Your firm then has the chance of rapid access to a single version of the truth, and it should take that chance. Data modeling also makes difficult and highly technical business areas more accessible to employees who are not technically trained, such as business leaders and C-level executives.
• The Coordination is Much More Effective Now: Your non-technical staff and your information technology department should now be able to interact more easily. They can communicate by using data models in a way that is independent of the underlying technology, in addition to generating actual data structures. Through data modeling, enterprise-level business processes can be connected with data rules, data structures, and the actual physical deployment of data technology (Gotelli et al., 2012). Data models bring together all of the processes and data consumption inside your company in a form that everyone involved can understand.
• Leverage: The idea of leverage, which holds that seemingly insignificant alterations to a data model can have a significant impact on the entire system, is one of the primary reasons why data organization is becoming such an important topic. In contrast to databases, the majority of commercial information systems are built on top of computer programs whose creation requires a large investment of both time and labor. A database that has been properly constructed, on the other hand, has a significant impact not only on the information contained in its entries but also on the organization of the data. The vast majority of applications will, in some capacity, save, update, remove, or modify database data, or print or display that data, or some combination of these activities. They are required to adhere to the structure of the data model, since it explains how the information is arranged (Figure 7.9).
Figure 7.9. Leverage. Source: https://image.shutterstock.com/image-vector/leverage-dollar-financial-technology-vector-260nw-1459488365.jpg.
It is possible that the way data is organized will have a big impact on how a program is put together. In the beginning stages, having a carefully established data model can make the programming process simpler and cheaper. Even very minor adjustments to the model can result in significant cost reductions in programming. The cost of fixing data that has been badly organized can be prohibitively high. Keep this in mind as you prepare your approach, in case an exception must be made to the rule that a customer may have only one address (Girshick et al., 2011). Changing the data model should be a rather straightforward process, at least in theory. The policy table may need two or three additional address columns in order to accommodate any prospective demand, and if modern database administration tools are used, the database can be altered so that it better satisfies the requirements of the new design. Nevertheless, the effect is felt across the entire system. Loops will need to be added to programs in order to manage a variable number of addresses, and screens will need to give users the ability to enter and read several addresses at the same time. The report forms will also need to be modified to handle the additional addresses. Changing the structure of the database is not difficult in itself, but it is expensive, because every application that relies on the affected region will need to be modified to accommodate the change. Fixing a single problem in software is a fairly straightforward and contained task, even if it means starting the coding from scratch. However, if the fundamental needs of the business are not met, or if the company changes after the database is established, difficulties with the structure of the data can follow. Even a billing database that accurately records a single customer for each phone call may have to be revised because of changes to the company's billing policy, new product lines, or technical advances. The costs of implementing such alterations frequently lead to the scrapping of an entire system, or to a company being unable to carry out a product or strategy it had previously planned. Attempts to "fix" the problem in an inefficient manner have sometimes made the system harder to manage and contributed to its eventual collapse.
• Improves Data Quality: Maintaining a database requires devoting a large amount of time and effort to acquiring the information stored in it. If that data is inaccurate, its value diminishes like that of a damaged object, and it may be beyond repair. The lack of consistency in defining and interpreting data, and in developing ways to enforce those definitions, is by far the most prevalent cause of problems with the quality of the
data. Those whose jobs involve the collection and analysis of data are prone to making false assumptions, and the way in which the information was collected creates a significant risk that a large proportion of it cannot be relied upon, a risk that grows with the amount of information in question (Higuchi and Inoue, 2019). A date of birth, for example, may be subject to several different integrity requirements: it needs both a permitted range and a format definition. The data model is therefore a crucial component in preserving the high quality of the data, because it ensures that everyone knows what is contained in each table, together with the different interpretations of those contents (Figure 7.10). A small validation sketch is given after Figure 7.10.
Figure 7.10. Components of quality data. Source: https://media.istockphoto.com/photos/diagram-of-data-quality-picture-id825975188?k=20&m=825975188&s=612x612&w=0&h=G96c6x5Xfs k64XuYBy0lwo9--VZHwLjW0imPNJnsRKA=.
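As a hedged illustration of the integrity requirements just mentioned for a date of birth, the short check below enforces both a format definition and a plausible range; the exact format string and limits are assumptions, not rules taken from the text.

```python
from datetime import date, datetime


def valid_date_of_birth(value: str) -> bool:
    """Check a date-of-birth string against a format and a range constraint."""
    try:
        dob = datetime.strptime(value, "%Y-%m-%d").date()   # format definition
    except ValueError:
        return False
    return date(1900, 1, 1) <= dob <= date.today()           # permitted range


print(valid_date_of_birth("1985-06-12"))   # True
print(valid_date_of_birth("12/06/1985"))   # False: wrong format
print(valid_date_of_birth("2190-01-01"))   # False: outside the range
```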
7.7. WHAT MAKES A DATA MODEL GOOD?
To evaluate the efficacy of several data models with regard to the same business challenge, we will need a selection of quality indicators. We ask, "How effectively does this model build a robust overall system design that fulfills business requirements?" in a more general sense. A further step that may be taken would be to establish some fundamental criteria for the evaluation and comparison of models (Han and Li, 2020). As we gain more knowledge about data models and data modeling approaches, as well as the ways
in which they may be utilized in a variety of contexts, we will circle back around to these.
7.7.1. Completeness
Completeness: is the model able to handle all of the necessary data? For instance, our insurance model has no column to record a customer's occupation, nor a table to track premium payments. If the system does require such information, its absence is a serious problem. In addition, we have observed that a commission rate cannot be recorded if no insurance has yet been sold at that rate.
7.7.2. Non-Redundancy
Non-redundancy: does the model specify a database in which the same information could be recorded more than once? The earlier example showed that the commission rate could end up repeated across the policy table. Likewise, Age carries the same information as Birth Date. If we create a new database specifically for monitoring insurance agents, we may end up with duplicate information about individuals who have served as both customers and agents (Zio et al., 2004). Duplicating data without ensuring that all copies are kept current leads to additional database storage, extra steps and processing, and a greater likelihood of inconsistency (Figure 7.11).
Figure 7.11. Practical environmental monitoring and aseptic models. Source: Effective Environmental Monitoring and Aseptic Techniques Course – Texas International Produce Association (texipa.org).
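One way to read the non-redundancy test above is to ask whether the same fact is stored in more than one place. The sketch below, using invented figures, contrasts a policy table that repeats the commission rate in every row with a design that stores each rate once and looks it up by key.

```python
# Redundant: the commission rate for each policy type is repeated in every
# row, so the copies can drift apart over time.
policies_redundant = [
    {"policy_no": 1, "policy_type": "motor", "commission_rate": 0.05},
    {"policy_no": 2, "policy_type": "motor", "commission_rate": 0.05},
    {"policy_no": 3, "policy_type": "home",  "commission_rate": 0.04},
]

# Non-redundant: each fact is recorded exactly once and referenced by key.
commission_rate = {"motor": 0.05, "home": 0.04}
policies = [
    {"policy_no": 1, "policy_type": "motor"},
    {"policy_no": 2, "policy_type": "motor"},
    {"policy_no": 3, "policy_type": "home"},
]

for p in policies:
    print(p["policy_no"], commission_rate[p["policy_type"]])
```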
7.7.3. Observance of Standard Operating Procedures
More specifically, how well does the model capture and enforce the business rules that govern the data? Because each row of the Policy table can hold only a single Customer Number, our insurance model requires each policy to be held by a single client, even though this may not be obvious at first glance. Since there is no way to register several customers against one policy, neither a user nor a programmer of the system can break this rule (short of such extreme measures as holding a separate row in the policy table for each customer associated with a policy). If the rule correctly represents the business requirement, the resulting database will be a helpful tool for ensuring good practice and protecting data quality. If, on the other hand, the model distorts a business rule, it can be difficult to fix afterwards or to code around.
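A minimal sqlite3 sketch of the rule just described: because the policy table carries a single, mandatory customer number, the schema itself rejects a policy that is not held by any customer. The table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customer (customer_number INTEGER PRIMARY KEY, name TEXT)")
# One mandatory customer_number column per policy row: each policy is held
# by exactly one customer.
conn.execute("""
    CREATE TABLE policy (
        policy_number   INTEGER PRIMARY KEY,
        customer_number INTEGER NOT NULL REFERENCES customer(customer_number)
    )
""")

conn.execute("INSERT INTO customer VALUES (1, 'A. Kumar')")
conn.execute("INSERT INTO policy VALUES (100, 1)")          # accepted

try:
    conn.execute("INSERT INTO policy VALUES (101, NULL)")   # no customer
except sqlite3.IntegrityError as exc:
    print("rejected by the schema:", exc)
```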
7.7.4. Making Use of Previously Collected Data Sets
Will the information included in the database be used for many purposes apart from those outlined in the process model? After a company has amassed enough data to satisfy a particular need, there is virtually always an increase in the number of new applications and consumers. The first step that a company may take to simplify the billing process is to maintain a record of the specifics of the policy. The marketing department wants demographic information, the regulators require statistical reports, and the sales department wishes to leverage the data in order to generate commissions. It is not common to be able to anticipate all of these aspects in advance (Ying and Sheng-Cai, 2012). When data has been arranged with a single purpose in mind, it is typically difficult to use that data for another reason. There are few things that users of a system find more frustrating than paying for the gathering and storage of data, only to be advised that the data collected cannot be used to fulfill a new information requirement without lengthy and costly reorganization. This need is frequently discussed in terms of the solution that it requires, which is that the data be organized in a manner that is as independent as possible from any specific application.
7.7.5. Stability and Flexibility
The ability to adapt to changing corporate requirements is an important trait to look for. Is it feasible to include any new data that may be necessary for these
changes in the existing tables? Are simple additions sufficient to accomplish the role of a replacement? Or will we be forced to make drastic structural changes that will have far-reaching consequences for the whole system? The speed with which a company can modify its system to meet changing market demands is an important aspect in assessing how effectively it can manage a number of different circumstances. It is feasible that the rate at which information systems may be updated can determine how soon a new product can be put to the market or a corporation can comply with new legislation. The underlying database either no longer accurately reflects the business rules or requires costly continuous maintenance to keep up with change, which are both major reasons why the system must be changed. If the data model does not require any updates, it is stable, despite the fact that demands are always evolving. Models can be classified as more or less stable based on the degree of change planned for them. A data model is deemed flexible if it can be easily updated to meet newly created requirements while having little impact on the structure it presently possesses. If we utilize a generic policy table instead of individual tables (with related processes, screens, and reports) for each type of policy, our insurance model will be more stable if the product line is changed. Following that, it is possible that new types of policies will be able to fit inside the existing policy table and make use of the same programming logic that exists across all types of policies (Xia et al., 2017). The type of modification that is offered determines the amount of adaptability. Modifying the insurance model to add information on the agent responsible for selling each policy appeared to be a simple task.
7.7.6. Elegance
It is essential to search for candidates who have the ability to adjust their work to meet the ever-changing demands of the organization. Is it possible to update the tables that are already in place so that they can handle any more data that could be required? Is it possible that the replacement function might be fulfilled by only making additions? Or are we going to be compelled to make significant structural adjustments that will have far-reaching implications on the whole system? In order to determine how well a company can handle a variety of situations, one of the most important factors to consider is the rate at which it can adapt its system to the shifting demands of the market. How quickly information systems are updated may be a factor in determining how quickly a new product may be introduced to the market or how quickly a firm can bring itself into compliance with new regulations. The primary reasons why the system needs to be updated are because either
the underlying database does not correctly represent the business rules or it requires costly ongoing maintenance to keep up with change. Either of these is an important reason the system needs to be updated. A data model is considered stable if there are no modifications that need to be made to it, regardless of how frequently requirements change. Models can be categorized as more stable or less stable depending on the amount of change that is anticipated for them. In order for a data model to be flexible, it must provide quick adaptation to accommodate new requirements while at the same time preserving the structure it now possesses. If we change the products that we offer, our insurance model will be less susceptible to disruption if we use a generic policy table rather than separate tables (together with the associated procedures, screens, and reports) for each type of policy. After that, it is quite probable that new types of policies will be able to fit within the current policy table and use the same programming logic that is used for all different kinds of policies (Antweiler and Taylor, 2008). This will be possible since the existing policy table will be modified. The kind of modification that is provided determines the level of flexibility that is available. It seems unnecessary to make any changes to the insurance model in order to add information on the agent who is responsible for the sale of each policy.
7.7.7. Communication
How well does this strategy assist various people to communicate with one another when there are many different persons working on the design of the system? Users and professionals in the relevant fields have to be able to verify the information included in the tables and columns. Will programmers be able to understand the model to an appropriate degree? The quality of the final model will be significantly impacted by the intelligent business input that is provided. On the other hand, programmers need to have a comprehensive comprehension of the model before they can make good use of it. The most common causes of difficulties in communicating are unfamiliarity with the issues being discussed as well as the language being used. If it is not explained, ideally with pictures, the majority of people who are not professionals will find a model that has 20 or 30 tables to be quite daunting. It's possible that providing larger models with varying degrees of complexity will be necessary in order to make it possible for the reader to learn using the "divide and conquer" method. It's possible that new ideas, such as highly generic tables that can store a variety of data, may make the model more stable and beautiful, but it's also possible that business
executives and programmers would have a hard time understanding them (Adelman, 2003). The aim of a data modeler to be accurate and consistent when naming tables and columns may lead to the usage of uncommon terminology rather than terms that are commonly used in the industry but may be misleading depending on the context (Figure 7.12).
Figure 7.12. Communication. Source: https://image.shutterstock.com/image-vector/communication-colorfultypography-banner-overlapping-260nw-1398444674.jpg.
7.7.8. Performance
Performance did not appear in the list of quality criteria presented in the preceding part. The user of the system will nevertheless be greatly annoyed if our complete, non-redundant, flexible, and elegant database is unable to meet the criteria for throughput and response time. On the other hand, performance is heavily dependent on the software and hardware platforms on which the database will run, and this can have a significant impact on the overall quality of the database. Tuning for those platforms is a technical undertaking, which stands in contrast to the business-oriented modeling activities that we have been outlining up to this point in the discussion. Common practice, and the recommended approach, is to construct data models without giving performance much consideration and then to implement them on the pre-existing hardware and software. The model is revisited only if its performance turns out to be considerably worse than anticipated (Wieland et al., 2010). The general rule that "performance requirements should be brought in after the other criteria" does have a few notable exceptions (Figure 7.13).
Figure 7.13. Performance. Source: https://media.istockphoto.com/photos/performance-increase-pictureid579421538?k=20&m=579421538&s=612x612&w=0&h=0SnlhveJRwOLH k387yf4uK_OPsESCq2UdA939no7d3Y=.
7.7.9. Conflicting Objectives
All of the aforementioned objectives usually compete with one another. An elegant but radical solution may be a hard sell to conservative users. We may be so enchanted by a magnificent model that we overlook requirements it does not actually satisfy. A model that rigidly enforces a large number of different business rules will become unstable if even one of those rules is altered. A model that is easy to understand because it incorporates the perspectives of the system's immediate users may not be reusable or may not interface successfully with other databases. Our primary mission is to create a model that strikes the best balance between these seemingly incompatible goals (Wagner and Fortin, 2005). As in other design professions, this is done through a process of suggestion and evaluation rather than by step-by-step development toward a single eventual answer. It is possible that we will not recognize a better option or compromise until we see it (Figure 7.14).
Figure 7.14. Conflicts. Source: https://image.shutterstock.com/image-illustration/word-cloud-conflictmanagement-concept-260nw-2134415505.jpg.
7.8. DATA PROPERTIES
Some fundamental properties of data for which requirements need to be met are:
Definition-related properties:
•	Relevance: the usefulness of the data in the context of your business.
•	Clarity: the availability of a clear and common definition for the data.
•	Consistency: the compatibility of the same type of data from numerous sources.
Content-related properties:
•	Timeliness: the availability of data at the moment requested, and how up to date that data is.
•	Accuracy: how near to the truth the data is.
Properties tied to both definition and content (Toivonen et al., 2001):
•	Completeness: how much of the relevant knowledge is available.
•	Accessibility: where, how, and to whom the data is available or not available.
•	Cost: the cost involved in acquiring the data and making it available for consumption.
7.9. DATA ORGANIZATION
Another sort of data model describes how data is organized using a DBMS or other data management technology: for example, relational tables and columns, or object-oriented classes and attributes. Such a data model is commonly referred to as the physical data model (PDM), although in the original ANSI three-schema architecture it is labeled "logical"; in that architecture, the physical model describes the storage media. Ideally, this model is derived from the more conceptual data model discussed above, though it may be adjusted to accommodate constraints such as processing capacity and usage patterns. Although data analysis is a common term for data modeling, the activity has more in common with the ideas and methods of synthesis (inferring general concepts from particular instances) than with analysis (identifying component concepts from more general ones). Data modeling seeks to bring the data structures of interest together into a cohesive, inseparable whole by eliminating unnecessary data redundancy and by relating data structures through links. A third possibility is to deploy adaptive systems such as artificial neural networks that can automatically construct implicit representations of the data.
7.10. DATA STRUCTURE
A data structure is a way of storing data in a computer so that it can be used efficiently; it is an organization of mathematical and logical concepts of data. Often a carefully designed data structure allows the most efficient algorithm to be selected, and the choice of data structure generally follows from the choice of an abstract data type. A data model specifies the organization of data within a given domain and, by implication, the underlying structure of that domain itself. This means that a data model in effect provides a dedicated grammar for a specialized artificial language for that subject. A data model comprises the classes of entities (kinds of things) about which a business wants to hold information, the attributes of that information, the relationships among those entities, and the (often implicit) relationships among those attributes (Sivak and Thomson, 2014).
The model describes the organization of the data to some extent irrespective of how it may be represented in a computer system. The entities represented by a data model can be concrete, real-world things, although models that incorporate such concrete item types tend to change over time. Robust data models therefore often identify abstractions of such items. For example, a data model may include an entity class called "Person," representing all the people who interact with an organization. Such an abstract entity class is usually more appropriate than ones named "Vendor" or "Employee," which identify specific roles played by those people.
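To make the idea of an abstract entity class concrete, here is a minimal sketch in Python; the class name, attributes, and role values are invented for illustration and are not taken from any particular system. Roles such as "Vendor" or "Employee" are recorded as data about a Person rather than as separate entity classes, so the model stays stable when the same individual takes on new responsibilities.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Person:
        """Abstract entity class: any individual who interacts with the organization."""
        person_id: int                                   # unique key for the entity instance
        name: str
        roles: List[str] = field(default_factory=list)   # e.g., "Vendor", "Employee"

    # The same person can play several roles without changing the model.
    p = Person(person_id=1, name="A. Brown", roles=["Employee"])
    p.roles.append("Vendor")
    print(p)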
7.11. DATA MODELING TOOLS TO KNOW
When developing database systems, many database developers neglect data modeling. This can result in substantial changes to the data structure over time. For a proof of concept, such tinkering is acceptable; when working on a large and dynamic project, however, the data structure should be determined from the start. If the development and business teams are separated, a large quantity of potentially important data may be lost, forgotten, or underused. Is data modeling useful for bridging the gap between database administration and the business? What are the best tools to use for this task? Data modeling tools allow you to model data quickly and efficiently, and they act as a link between the different layers of the data model. Most data modeling tools can generate database schemas or DTDs automatically, integrate and compare schemas and models, and reverse engineer existing databases into models. With strong data modeling tools, non-technical people can easily model conceptual data.
Erwin Data Modeler has been around for about 30 years as of this writing, and Erwin's understanding of data and data modeling is unquestionable. This data modeling tool integrates with common databases such as MySQL and PostgreSQL, allowing you to examine your data in real time and make changes as you go. Customers can select from several editions, which offer comparison tools and ways of displaying metadata, and the editions differ significantly from one another. Model creation and deployment are included in the base edition, and users can view data in read-only mode using the navigator (Simoncelli and Olshausen, 2001). The workshop edition is a collaborative, repository-based solution. The NoSQL edition works with non-relational databases, making it the most specialized option available. The standard and workshop editions both
provide comparison tools to let users see the differences between various databases or versions.
7.12. ER/STUDIO
ER/Studio's history, like Erwin's, has both positive and negative aspects. ER/Studio offers a broad feature set that has been developed over many years; however, keeping up with emerging technology can be difficult. It provides tools for merging and comparing Git branches, and the Git integration is up to date and simple to use, with the ability to reverse engineer as well as forward engineer thanks to the use of industry standards such as SSIS and SSRS. ER/Studio was created to link developers and business users in order to maximize the value of your data. When it comes to making the most of your data, ER/Studio provides everything you will need, and it will also help you reduce waste.
7.13. DB MODELING
The DbSchema database designer and manager supports SQL, NoSQL, and cloud databases. DbSchema offers the following:
•	Support for GIT, Mercurial, SVN, and CVS;
•	Regular platform updates and bug patches (every two or three months);
•	A built-in random number generator.
DbSchema lacks version control, and the field definitions are imprecise. Furthermore, many believe that it is less dependable than other programs.
7.14. ERBUILDER
The purpose of ERBuilder Data Modeler is to make data modeling easy for programmers; it is not well suited to conceptual and logical data modeling (Seid et al., 2014). ERBuilder does not provide any version control or collaboration features, but this is offset by the robustness and usability of its graphical interface. In ERBuilder, switching between tables and creating complicated diagrams takes a single click.
7.15. HEIDISQL
HeidiSQL is a free and open-source data modeling program for the physical layer. It is a popular MariaDB and MySQL utility owing to its low cost, though it lacks any unique features compared with its proprietary competitors. It has some stability concerns, although in practice users rarely notice problems beyond an occasional need to restart.
Navicat Data Modeler is an inexpensive and effective data modeling tool that is very simple to use, and it appears to be kept current. It is often compared with ER/Studio and Erwin, two of the most costly data modeling tools, and like them it can be used for reverse engineering. Modeling can take physical, conceptual, or logical forms. Navicat's pricing is substantially more economical than Erwin's and ER/Studio's, despite some customers' concerns about the absence of field descriptions (Rovetta and Castaldo, 2020). Connections, queries, and models can all be synchronized across many platforms using the Navicat cloud.
Toad Data Modeler supports many platforms and operating systems, and Toad has supported MS SQL Server since 2000. Toad's licensing and installation are more involved and could be made easier; the Oracle and MySQL versions must also be run as separate applications, and a one-stop solution would make things easier for customers.
Archi is an open-source modeling tool based on the ArchiMate visual notation and can be used by both large and small enterprises; ArchiMate helps describe complicated systems.
7.16. OPEN-SOURCE
Archi offers a useful handbook and an active online presence, and it is possible to examine its revision history and general development plan. Like the free program HeidiSQL, Archi enables both conceptual and physical data modeling with an easy-to-use interface. Databases is an easy-to-use data modeling tool for programmers and database managers alike, with N:M relationships and other sophisticated features available. Data visualization and data modeling are the core emphasis of the Design program; for
conceptual modeling, it lacks the capabilities needed by developers and database managers.
7.17. A MODELING TOOL FOR SQL DATABASES
SQL Database Modeler is a cutting-edge, web-based SaaS with a contemporary appearance. The program features a comprehensive range of cloud-based capabilities and collaboration tools, and nothing has to be downloaded or set up before use.
7.18. DATA FLOW DIAGRAM (DFD)
The purpose of a data-flow diagram (DFD) is to graphically depict the "flow" of data through an information system; it shows the flow of data rather than the program's control flow. A DFD can also be used to visualize data processing (structured design). Martin and Estrin created the "data-flow graph" computing paradigm, which Larry Constantine, the original pioneer of structured design, used to create data flow diagrams. A context-level DFD is typically used first to show how the system interacts with external entities. The DFD then depicts how the system is subdivided into components and emphasizes the flow of information between them; the context-level DFD is "exploded" to give a more detailed understanding of the system.
7.19. DATA CONCEPTUALIZATION
Information modeling is more than just database modeling, however. In the software engineering discipline, a data model or an information model is an abstract, formal representation of entity types, including their properties, relationships, and the operations that can be performed on them. Depending on the model, an entity type may be real-world or abstract, such as an entity in a billing system. Using such entity types, a restricted domain can be represented by a limited set of entity types, attributes, relationships, and operations. An information model is a representation of concepts, relationships, constraints, rules, and operations that specify data semantics for a chosen domain of discourse, and this framework can be used to organize and share domain-specific information. The broader definition of the term "information model" also covers models of physical objects such as buildings, structures, and processing plants; an information model for a facility of this kind combines a facility model
with facility-related data and documents. An information model, unlike a software specification, does not require the description of the problem domain to be converted into code (Pleil et al., 2014). An information model may have multiple mappings; these mappings are referred to as data models regardless of whether they are object models or not. Object models are collections of objects or classes through which software can examine and manipulate particular aspects of the real world; in other words, a service or system's object-oriented interface. This interface represents the object model of the service or system. There are several examples of this: the document object model (DOM) can be used to analyze and modify a web page, Microsoft Excel can be manipulated from another program through its object model, and an astronomical telescope can be controlled through an object model such as the ASCOM Telescope Driver. The term "object model" is also used to describe how a programming language, technology, notation, or methodology implements objects in general; the COM object model, the Java object model, and the OMT object model are representative examples. Class, message, inheritance, polymorphism, and encapsulation are terms frequently used to describe objects, and formalized object models constitute a significant portion of the formal semantics of programming languages. Object-role modeling (ORM) is a conceptual modeling approach for analyzing data and rules; its fact-based approach can benefit conceptual-level systems analysis. The architecture of a database application has a substantial impact on its quality, so when defining information systems it is critical from the start to use ideas and language that everyone can understand, to help ensure the system's correctness, clarity, adaptability, and productivity (Austin et al., 2005). The conceptual design can include data, process, and behavioral perspectives, and the DBMS used to implement the design can be based on a variety of logical data models: relational, hierarchical, network, object-oriented, and so on.
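As a small illustration of working with an object model from code, the following sketch uses Python's standard xml.dom.minidom module to parse and modify a tiny XHTML fragment through its DOM; the fragment and the edit are invented for this example, and a browser DOM would expose a much richer interface.

    from xml.dom import minidom

    # Parse a tiny XHTML fragment into a DOM tree.
    doc = minidom.parseString("<html><body><h1>Old title</h1></body></html>")

    # Navigate the object model and change the page through it.
    heading = doc.getElementsByTagName("h1")[0]
    heading.firstChild.data = "New title"      # replace the heading's text node

    print(doc.toxml())                         # serialized document with the new heading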
7.20. UNIFIED MODELING LANGUAGE (UML) MODELS
The unified modeling language (UML) is a modeling language used in software engineering. This graphical language can be used to visualize,
define, produce, and document the artifacts of a software-intensive system. The UML can be used to create system blueprints covering:
•	Conceptual things, such as business processes and system functions;
•	Concrete things, such as programming-language statements, database structures, and other tangible components;
•	Reusable software modules;
•	Functional, data, and relational models.
7.21. DATA MODELING FEATURES
The following elements are required for a data modeling technique to be complete:
•	Entities and attributes must be distinguished, because entities are abstractions of real-world things and their attributes are the characteristics that distinguish them. Use them to identify and develop connections (relationships) between objects (Piegorsch and Edwards, 2002).
•	UML, the Unified Modeling Language, can be thought of as a collection of building blocks and best practices for data modeling. Data analysts can use the UML to discover and design model architectures appropriate to their specific needs. When working with a large amount of data, several sets of data would otherwise have to be repeated to show all of the important connections, so data items are assigned unique keys that link the different sets of data and avoid duplication. This labeling strategy lets you normalize, that is, list only keys rather than repeating data entries in the model whenever entities form new associations (a small keyed-table sketch follows the lists in this section).
Data modeling, a component of data management, offers businesses several advantages, including the following:
•	Before you can build a database, you must first clean, classify, and model your data; data modeling therefore improves data quality, reducing database errors and poor design.
•	Data modeling lets you see visually how your information will be structured. Employees gain a better understanding of how data is managed and of how their role fits in, and it facilitates inter-departmental data exchange (Pleil et al., 2014).
•	Data modeling promotes more intelligent database architecture, which may result in more robust applications and data-driven business insights.
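A minimal sketch of the keyed-table idea mentioned in the list above, using plain Python dictionaries; the monitoring sites, pollutants, and measurement values are hypothetical. Each entity is stored once under a unique key, and new associations record only keys instead of repeating the entity data.

    # Entities stored once, each under a unique key ...
    sites = {
        1: {"name": "River station A", "basin": "Upper"},
        2: {"name": "River station B", "basin": "Lower"},
    }
    pollutants = {
        10: {"name": "Nitrate", "unit": "mg/L"},
        11: {"name": "Phosphate", "unit": "mg/L"},
    }

    # ... and associations that list only keys (normalized, no duplication).
    measurements = [
        {"site_id": 1, "pollutant_id": 10, "value": 2.4},
        {"site_id": 1, "pollutant_id": 11, "value": 0.3},
        {"site_id": 2, "pollutant_id": 10, "value": 1.1},
    ]

    # Resolving a measurement back to the full entities via its keys:
    m = measurements[0]
    print(sites[m["site_id"]]["name"], pollutants[m["pollutant_id"]]["name"], m["value"])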
7.22. DATA MODELING EXAMPLES
Data modeling is a key early stage in building and documenting the architecture needed to support any application, whether for business, entertainment, personal use, or something else. This covers any transactional system, data processing application set or suite, or other system that gathers, generates, or uses data. Because a data warehouse is a store for data from many sources, which are likely to hold comparable or related data in varying forms, data modeling is vital there: to describe how each incoming data set is adapted to meet the needs of the warehouse architecture – and thus make the data useful for analysis and data mining – the warehouse formats and structure must first be mapped out. The data model is also leveraged by analytical tools, executive information systems (dashboards), data mining, and interfaces with other data systems and applications. Data modeling is a vital component of the early phases of any system's design, since it functions as the basis for all subsequent procedures and stages. The data model acts as a common language, allowing systems to interoperate by reading and accepting the model's data (Bankole and Surajudeen, 2008). This is more important than ever in today's era of big data, machine learning (ML), artificial intelligence, cloud connectivity, IoT, and distributed systems, including edge computing.
Data modeling has changed over time. Although data modeling has existed for as long as data processing, data storage, and computer programming, the phrase itself did not become prevalent until the 1960s, when DBMSs began to mature. The notion of planning and architecting a new structure is not novel, but data modeling has become increasingly formal and rigorous as more data, more databases, and more data types have become available. Data modeling is more important than ever as technologists grapple with new data sources (IoT sensors, location-aware devices, clickstreams, social media) as well as an onslaught of unstructured data (text, audio, video, raw sensor output) at volumes and speeds that exceed the capabilities of
traditional systems. There is now a continuing demand for new systems, imaginative database structures and methods, and new data models to bring this new development effort together.
7.22.1. What Does the Future Hold for Data Modeling?
The availability of enormous amounts of data from a multitude of sources (sensors, speech, video, email, and more) broadens the scope of modeling activities for IT staff. The Internet is, of course, one of the facilitators of this transformation, and the cloud is a critical component of the solution because it is the only computing infrastructure large, scalable, and flexible enough to satisfy current and future connectivity demands. Options for database design are also growing (Parise et al., 2014). A decade ago, the dominant database structure was a row-oriented relational database using ordinary disk storage; the data for a typical ERP's general ledger or inventory management was kept in hundreds of separate tables that needed to be updated and modeled. Modern ERP solutions use a columnar architecture to keep current data in memory, resulting in fewer tables and better performance and efficiency. Self-service capabilities for line-of-business people will continue to improve, and new tools will make data modeling and visualization easier and more collaborative.
7.23. SUMMARY
A well-thought-out and thorough data model is crucial to building a highly effective, usable, secure, and accurate database. Begin with the conceptual model to establish all of the entities and attributes of the data model. Then translate those concepts into a logical data model that describes how data flows and specifies what data is required, as well as how it will be acquired, managed, stored, and distributed. The logical data model is the full design document that directs the development of database and application software, and it drives the physical data model (PDM) specific to a database product. Good data modeling and database design are essential for building effective, dependable, and secure application systems and databases that work well with data warehouses and analytical tools – and that allow data interchange with business partners and across different application sets (Patil and Taillie, 2003). Well-thought-out data models ensure data integrity, boosting the value and trustworthiness of a company's data.
CHAPTER 8
SPATIAL-DATA ANALYSIS
CONTENTS
8.1. SA Geometric.................................................................................. 197
8.2. History............................................................................................. 201
8.3. Spatial Data Analysis in Science...................................................... 203
8.4. Functions of Spatial Analysis............................................................ 204
8.5. Spatial Processes.............................................................................. 206
8.6. The Spatial Data Matrix: Its Quality................................................. 209
8.7. Sources of Spatial Data.................................................................... 211
8.8. The Purpose and Conduct of Spatial Sampling................................. 213
8.9. Models for Measurement Error......................................................... 214
8.10. Analysis of Spatial Data and Data Consistency............................... 214
8.11. EDA (Exploratory Data Analysis) and ESDA (Exploratory Spatial Data Analysis).................................................................... 215
8.12. Data Visualization: Approaches and Tasks...................................... 220
Despite their importance in geographic information systems, these spatial analysis methods often evolved prior to and independently of GIS technology, and the majority of them were later incorporated into it. GIS platforms perform a variety of functions, including data acquisition, data management, and visualization; when these functions are combined with analytical operations, they become even more powerful. The history of spatial analysis, as it is understood in the GIS context today, goes back many years. The readers edited by Berry et al. (1998) contain a selection of early articles on spatial statistics and quantitative spatial analysis; some of these papers were published in the 1930s, but the majority appeared in the 1950s and 1960s (Notaro et al., 2019). The application of mathematical and other quantitative methods to the analysis of spatial structure and dynamics was particularly prominent in the spatial sciences, which concentrate on the assessment of spatial conditions and trends; regional science, quantitative geography, and landscape ecology are examples of such disciplines. In geography, the use of inferential statistics was associated with a paradigm shift known as the "quantitative revolution," especially noticeable in the 1960s and 1970s and well presented in the books by Abler (1971); Haggett (1965); and Haggett et al. (1977). The geodetic sciences were not focused on advancing SA methods at the time. However, as GIS became a more integral platform in the 1980s, there was a growing consolidation of SA techniques within the sciences involved in advancing GIS technology, and authors with a geomatics background devoted a large portion of their treatments to data analysis, concentrating particularly on SA. But which methods existed prior to the introduction of GIS technology? According to Fisher (1999), these were primarily quantitative methods for characterizing and analyzing:
•	Patterns (e.g., distribution and arrangement);
•	The shapes of geographical features (points, lines, areas, and surfaces).
Originally, a geometric strategy dominated the field, with a particular emphasis on point analysis techniques and network characterization. Later, the emphasis shifted to method development.
The analysis of intrinsic geographical properties (for example, the relative distance between spatial objects), mechanisms of spatial choice (e.g., the orientation of shopping centers), and spatial interactions became more important in method development (Ng and Vecchi, 2020). Multivariate statistical methods were increasingly used and adapted to the requirements of spatial science during this period, and standard statistical packages were linked to GIS for data exploration, statistical analysis, and hypothesis testing. Examples of such statistical methods include:
•	Regression modeling;
•	Principal components analysis;
•	Linear discriminant analysis;
•	Trend surface analysis.
Because these methods were developed in fields of study other than spatial science, their application in spatial analysis raised several issues, since they did not account for spatial heterogeneity or the geographic dependence of measurements distributed in space. As a result, geostatistical analysis methods that perform much better with geographically referenced data were developed beginning in the late 1960s. Geometry-oriented spatial analysis operations (e.g., spatial query, point pattern analysis, polygon overlay, buffering) and geostatistical analysis methods were available in GIS by the late 1970s. Nowadays, most commercial GIS packages include a variety of features for geometric and geostatistical analysis. The link between data acquisition, simulation models, data management, and visualization tools is a significant benefit of incorporating analysis functions into GIS (Marohasy, 2003): it enables the entire spatial data processing cycle to be completed with a single piece of software in order to answer complex questions (Figure 8.1).
Figure 8.1. Data management and analysis. Source: http://www.gitta.info/AnalyConcept/en/text/AnalyConcept.pdf.
The Definition of "Spatial Analysis"
According to Bailey, the definition of "spatial analysis" raises some issues: "One difficulty encountered in any discussion of the links between GIS and spatial analysis is defining what constitutes spatial analysis. The issue arises because GIS is a multidisciplinary field by definition, and each discipline has established a nomenclature and procedure for spatial analysis that reflects the specific interests of that field. Given the breadth of analytical perspectives, it is difficult to define spatial analysis more precisely than as: a general ability to manipulate spatial data into different forms and obtain additional meaning as a consequence." This means that spatial analysis questions and methods evolved "naturally": they have been advanced in numerous sciences connected to GIScience that focus on different interests and research topics. A distance query to reveal all ski resorts within a certain distance, for instance, may be a simple data retrieval for some, but for others it may represent
a comprehensive spatial analysis. The attempt at a definition fails: the term spatial analysis cannot be strictly defined, only described (Madavan and Balaraman, 2016). To give an idea, consider the following descriptions: "A general ability to manipulate spatial information into different forms in order to extract additional meaning," and, in broad terms, "the quantitative study of phenomena that are located in space." The following section provides a broader and more nuanced perspective on the term "spatial analysis," which can have several meanings. But first, the role of SA in the context of GIS should be discussed. Within quantitative SA there are two major streams, one geometrically centered and one geostatistically centered; a third stream, visual data analysis, has recently emerged, since maps are well known for their ability to convey information about spatial conditions and processes. Using these approaches, the three streams can be characterized as follows:
8.1. SA GEOMETRIC
The geometric approach is primarily descriptive and focuses on geometric properties (the location of objects and their attributes); it cannot be used to test hypotheses. Applications of geometric SA include point distribution analysis, network analysis (route measurements, shortest distance), polygon overlay, distance-relation analysis, shape analysis, and the calculation of gradient and exposure in elevation models.
8.1.1. SA in Geostatistics
The geostatistical approach deals with parameters that are spatially distributed (random). This approach employs statistical methods not only for description but also for hypothesis testing. Multivariate statistics, spatial correlation analysis (spatial autocorrelation), and geostatistical analysis are some applications of geostatistical SA (Kobayashi and Ye, 2014).
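Spatial autocorrelation of the kind mentioned above is often summarized with Moran's I. The following is a minimal sketch in Python with numpy, using a small invented set of area values and a simple binary neighbor-weight matrix; the data and weights are hypothetical, and production work would normally rely on a dedicated spatial statistics package.

    import numpy as np

    def morans_i(x, w):
        """Moran's I for values x and spatial weight matrix w (zero diagonal)."""
        x = np.asarray(x, dtype=float)
        n = x.size
        z = x - x.mean()                      # deviations from the mean
        s0 = w.sum()                          # sum of all weights
        num = (w * np.outer(z, z)).sum()      # sum_ij w_ij * z_i * z_j
        den = (z ** 2).sum()
        return (n / s0) * num / den

    # Four areas along a line; each area neighbors the adjacent one(s).
    values = [2.0, 2.5, 7.0, 7.5]
    w = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    print(round(morans_i(values, w), 3))      # positive: similar values cluster in space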
8.1.2. Explorative Spatial Visualization
This is an exclusively visual approach, in both presentation and interpretation. Visualization is a descriptive and exploratory method for exploring new data, identifying outliers, and developing hypotheses. As shown in the image below, these combined approaches provide a variety of spatial analysis methods. Spatial analysis is used to obtain
answers rather than as an end in itself. As a result, methods for representing phenomena and processes are required, as are appropriate methods for analyzing those phenomena and processes; that is, there is a direct relationship between the modeling and depiction of geographic reality and spatial analytical techniques. The module "Basic Spatial Modeling" discusses spatial modeling methods and their effect on spatial analysis. There are three major components to spatial analysis. It begins with cartographic modeling: each data set is depicted as a map, and map-based operations (map algebra) generate new maps. Buffering, for example, is the operation of delineating all areas on a map that lie within a fixed distance of some spatial object, such as a doctor's clinic, a well, or a linear element such as a road. Overlaying includes logical operations (.AND.; .OR.; .XOR.) as well as arithmetic ones (+; −; ×; /). The logical overlay denoted by .AND. identifies areas on a map that simultaneously satisfy conditions on two or more variables, while the addition arithmetic overlay sums the values of two or more factors area by area. The study and practice of making and using maps is known as cartography. Cartography, which combines science, esthetics, and technique, is based on the premise that reality (or an imagined reality) can be modeled in ways that communicate spatial information efficiently and effectively (Figure 8.2).
Figure 8.2. Ornate world maps were characteristic during the “age of exploration” in the 15th through 17th centuries. Source: https://www.worldatlas.com/what-is-cartography.html.
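The buffering and overlay operations just described can be sketched as map algebra on rasters. The following minimal Python/numpy example uses two invented 4 x 4 layers (an elevation grid and a precomputed distance-to-well grid); a real GIS would compute the distance layer itself and handle projections and cell sizes.

    import numpy as np

    # Two hypothetical raster layers covering the same 4 x 4 study area.
    elevation = np.array([[310, 320, 450, 470],
                          [305, 330, 440, 460],
                          [300, 325, 430, 455],
                          [295, 315, 420, 450]])
    distance_to_well = np.array([[ 40,  80, 120, 160],
                                 [ 30,  70, 110, 150],
                                 [ 20,  60, 100, 140],
                                 [ 10,  50,  90, 130]])

    # Buffering: cells within a fixed distance (here 100 m) of the well.
    buffer_zone = distance_to_well <= 100

    # Logical (.AND.) overlay: cells satisfying both conditions at once.
    suitable = buffer_zone & (elevation < 400)

    # Arithmetic (+) overlay: summing two factor layers cell by cell.
    combined_score = elevation + distance_to_well

    print(suitable)
    print(combined_score)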
Traditional cartography's primary goals are as follows:
•	Set the map's agenda and select the traits of the object to be mapped. This is the concern of map editing. Traits may be physical, such as roads or land masses, or abstract, such as toponyms or political boundaries.
•	Represent the terrain of the mapped object on flat media. This is the concern of map projections.
•	Eliminate characteristics of the mapped object that are irrelevant to the map's purpose. This is the concern of generalization.
•	Reduce the complexity of the characteristics that are mapped. This, too, is a generalization concern.
•	Orchestrate the elements of the map to best convey its message to its audience. This is the task of map design.
Modern cartography provides many of the theoretical and practical foundations of geographic information systems (GIS) and geographic information science (GISc). Cartographic design, also known as map design, is the process of creating a map's appearance by applying design principles and knowledge of how maps are used, so as to produce a map with both esthetic value and practical function. It shares this dual objective with almost every form of design; it also shares the three skill sets of artistic creativity, scientific method, and technology with other forms of design, particularly graphic design. As a discipline it combines design, geography, and spatial information science (Köhl et al., 2000). In the United States, Arthur H. Robinson, regarded as the father of cartography as a scholarly study and discipline, stated that a poorly designed map "will be a cartographic disaster" and claimed that "map layout is possibly the most important aspect of cartography."
Second, spatial analysis encompasses mathematical modeling in which model outcomes depend on the form of spatial interaction between objects in the model, on spatial relationships, or on the geographical positioning of objects within the model. In a hydrological model, for instance, the configuration of streams and the geographical location of their crossings will affect the flow of water through different parts of a catchment. The geographical distribution of different demographic groups and their density in an area may affect the spread of an infectious disease, whereas the location of topographical obstacles may influence the invasion of an area by a new species.
A mathematical model is a description of a system that employs mathematical concepts and language, and mathematical modeling is the process of creating such a model. Mathematical models are employed in the natural sciences (physics, biology, earth science, chemistry), in engineering disciplines (computer programming, electrical engineering), and for non-physical systems such as those studied in the social sciences (economics, psychology, sociology, political science). A large part of the field of operations research is the use of mathematical methods to solve problems in business or military operations; music, language teaching, and philosophy also make use of mathematical models. A model can help to describe a system, investigate the effects of its various components, and predict behavior (Kessler et al., 2015) (Figure 8.3).
Figure 8.3. The application of mathematical models. Source: https://www.semanticscholar.org/paper/Mathematical-Modeling-inSchool-Examples-and-Kaiser/d2f01bc00f86e14be37ed16a2c882dec69b19026.
Finally, spatial analysis encompasses the development and application of statistical methodology for the proper analysis of spatial data, methods which, as a consequence, make use of the spatial referencing in the data. This is the branch of spatial analysis that we refer to here as spatial data analysis.
8.2. HISTORY
Earlier work in cartography and data collection paved the way for spatial analysis. Land surveying has been practiced in Egypt since at least 1400 B.C., when the dimensions of taxable plots of land were measured with ropes and plumb bobs. Numerous fields have contributed to its modern development. Botanical studies of global plant distributions and local plant locations, ethological studies of animal behavior, landscape-ecology studies of vegetation patterns, ecological studies of spatial population dynamics, and research in biogeography all contributed (Kandlikar et al., 2018). Epidemiology contributed early work on disease mapping, most notably John Snow's map of a cholera outbreak, research on the spread of disease, and location studies for health care delivery. Statistics has made significant contributions through work in spatial statistics, and economics has contributed through spatial econometrics. GISs are currently a major contributor because of the importance of geographic software in the modern analytic toolbox. Remote sensing has made significant contributions to morphometric and cluster analysis, and the study of algorithms, particularly in computational geometry, has been a major contribution from computer science. With recent work on fractals and scale invariance, mathematics continues to provide fundamental tools for analysis and to reveal the intricacies of the spatial realm, and scientific modeling can be used to develop new approaches. Spatial analysis, also known as spatial statistics, refers to any of the formal techniques used to study entities based on their topological, geometric, or geographic properties. Spatial analysis encompasses a wide
range of techniques, many of which are still in their early stages, which use various analytic approaches and are used in fields ranging from astronomy, which studies the placement of galaxies in the universe, to chip fabrication engineering, which uses “place and route” algorithms to create large wiring structures. In a more limited sense, spatial analysis is the approach used to analyze human-scale structures, most prominently in the assessment of geographic information or transcriptomics data (John et al., 2021). In spatial analysis, complex issues arise, which are often neither properly delineated nor fully resolved, but serve as the foundation for current research. The most important component of these is determining the spatial position of the entities under investigation. The classification of spatial analysis techniques is difficult due to the large number of various fields of study involved, the various fundamental approaches that can be selected, and the many manifestations the information can take (Figure 8.4).
Figure 8.4. Map by Dr. John Snow of London, showing clusters of cholera cases in the 1854 broad street cholera outbreak. This was one of the first uses of map-based spatial analysis. Source: https://en.wikipedia.org/wiki/Spatial_analysis#/media/File:Snowcholera-map.jpg.
8.3. SPATIAL DATA ANALYSIS IN SCIENCE
All occurrences have space and time coordinates – they happen somewhere and at some point in time. In many areas of experimental science, the precise spatial coordinates of where studies are conducted do not usually need to enter the database: because all the information relevant to explaining the outcomes is carried by the explanatory variables, such information is of no material importance in analyzing the results. Individual experiments are independent, and the indexing of particular cases could be swapped across the set of cases without losing any information pertinent to explaining the outcomes (Browning et al., 2015). The environmental and social sciences, by contrast, are observational rather than experimental. Outcomes must be accepted as found, and the researcher is usually unable to manipulate the levels of the explanatory variables or to replicate the experiment. The design matrix of explanatory variables is frequently fixed in later attempts to model observed variation in the dependent variables, both in terms of which variables have been measured and their levels. As a result, at later stages of modeling, model errors encompass not only the consequences of measurement and sampling error but also various types of possible misspecification error. Recording the location and time of individual events in the database is therefore critical in many areas of observational science. First, the social sciences investigate processes in various types of places and spaces – the structure of places and spaces can influence how social and economic processes unfold, and social and economic processes can shape the structure of places and spaces. Schaeffer (1953) discusses the significance of this kind of theory in geography, and Losch (1939) discusses it in economics. Second, recording where events took place enables linkage with data in other databases, such as linking postcoded or address-based health information with Census socioeconomic data; to achieve precise linkage across databases, location may need to be recorded with a high degree of precision. Spatial data analysis can help in the search for scientific explanations. It also plays a role in more general problem solving, because observations in geographic space are dependent – observations that are geographically close together are more similar than those that are further apart. This is a generic property of geographic space that can be exploited to solve problems such as spatial interpolation (Iwai et al., 2005). Even so, this same property of spatial dependence complicates the application of 'classical' statistical inference
theory, because dependence between observations causes information redundancy, which affects a sample's information content (its 'effective sample size').
8.4. FUNCTIONS OF SPATIAL ANALYSIS
In GIS, what operations are commonly used as spatial analytical functions? Three distinct types can be identified:
•	Attribute query;
•	Spatial query; and
•	Deriving new data from existing data: GIS can be used, for example, to measure slope, contours, and aspect, and to develop a map showing visibility.
•	Attribute Query: Involves the handling of attribute data only, not spatial data; in other words, it is a way of retrieving information through logical questions. A simple attribute query might involve identifying all parcels of a particular land use type in the database of an urban parcel map in which each parcel carries a land use code. Such a query can be handled using the attribute table alone rather than the parcel map (Homma et al., 2020). The query is classified as an attribute query because no spatial data is needed to answer it; in this instance, the attribute table entries with land use codes matching the specified type are identified (Figure 8.5).
Figure 8.5. Listing of parcel number and value with land use = ‘commercial’ is an attribute query. Identification of all parcels within 100-m distance is a spatial query. Source: http://www.wamis.org/agm/pubs/agm8/Paper-8.pdf.
•	Spatial Query: Involves selecting features based on location or spatial relationships, which requires the processing of spatial data. For example, a question may be asked about the parcels within one mile of the motorway; the answer can be obtained from a hard-copy map or from a GIS that contains the necessary geographic information. Similarly, if a request for rezoning is submitted, all owners whose property lies within a specified distance of the parcels to be rezoned must be notified for a public hearing. To identify all parcels within the specified radius, a spatial query is needed. This procedure cannot be carried out without spatial information: the database's attribute table on its own does not provide enough information to solve problems involving location (Figure 8.6).
Figure 8.6. Landowners within a specified distance from the parcel to be rezoned identified through spatial query.
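The difference between the two query types can be sketched in a few lines of Python; the parcel records, land use codes, and coordinates below are invented for illustration, and a GIS would use true geometries rather than centroid points.

    from math import hypot

    # Hypothetical parcel records: attribute data plus a centroid coordinate.
    parcels = [
        {"id": 101, "land_use": "commercial",  "value": 250_000, "x": 120.0, "y": 340.0},
        {"id": 102, "land_use": "residential", "value": 180_000, "x": 150.0, "y": 360.0},
        {"id": 103, "land_use": "commercial",  "value": 310_000, "x": 420.0, "y": 500.0},
    ]

    # Attribute query: answered from the attribute table alone, no geometry needed.
    commercial = [p for p in parcels if p["land_use"] == "commercial"]

    # Spatial query: needs the coordinates, e.g. parcels within 100 m of a site.
    site_x, site_y = 130.0, 350.0
    within_100m = [p for p in parcels
                   if hypot(p["x"] - site_x, p["y"] - site_y) <= 100.0]

    print([p["id"] for p in commercial])    # parcels selected by attribute
    print([p["id"] for p in within_100m])   # parcels selected by location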
It should be noted that only some of these operations create new data; the first two mentioned are simple queries that return a list of objects from the database. Spatial analysis functions can also be classified by the data type involved (point, line, network, polygon/area, surface), by the data flow diagram (DFD), or by the conceptual model of space (discrete entity versus continuous field). Other authors suggest a further differentiation of spatial analysis functionality, distinguishing functions by the level of dynamics, the interplay between objects in space, or the assessment of spatiotemporal (ST) change. The organization and categorization of SA functions is thus described differently depending on the author's perspective and background. The module "Basic Spatial Analysis" is organized in a hybrid style: it is made up of lessons that cover various analysis and application functions such as terrain analysis, functionality analysis, and suitability analysis,
among others. Albrecht provides a useful classification of GIS SA functions, created in order to provide a unified interface for GIS (Higuchi and Inoue, 2019). This categorization has two benefits: it takes the user's perspective (rather than the technical perspective), and it is the shortest description that still covers all of the SA functions available in GIS (at least in commercial GIS). If the phrase "spatial analysis" seems too broad, "pattern analysis" may be a better description of SA.
8.5. SPATIAL PROCESSES
Certain processes, known as "spatial processes," operate in geographic space; four basic types are described here: diffusion processes, exchange and transfer processes, interaction processes, and dispersal processes. A diffusion process occurs when an attribute is adopted by a population, and it is possible to specify which individuals (or areas) possess the attribute and which do not at any moment in time. The mechanism through which the attribute spreads through the population depends on the attribute itself. As with voting behavior or the spread of political power, conscious or unconscious acquisition or adoption may depend on inter-personal contact, communication, or the exertion of influence and pressure. In the case of a contagious disease such as influenza, spread may be caused by contact between infected and uninfected but susceptible individuals, or by the dispersal of a virus, as with foot and mouth disease in livestock. The population's density and spatial distribution, relative to the scale at which the spreading mechanism operates, will have a significant impact on how the attribute diffuses and on its speed of diffusion (a toy grid simulation of such a contact-driven process is sketched below). Exchange and income transfer processes bind urban and regional economic systems together. Income earned from the manufacture and sale of a product in one location may be spent on goods and services in another location, and the economic fortunes of different countries and towns become intertwined as a result of such exchange and transfer processes. The spatial structure of per capita income may reflect the binding together of local spatial economic systems through wage expenditure, also known as wage diffusion, and other "spillover" effects.
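The toy simulation below sketches the contact-driven diffusion process described above: adoption starts at one cell of a grid and spreads to immediate neighbors with a fixed probability per step. The grid size, adoption probability, and neighborhood rule are arbitrary assumptions made for illustration, not a calibrated model.

    import numpy as np

    rng = np.random.default_rng(0)

    # 20 x 20 grid of individuals; True means the attribute has been adopted.
    adopted = np.zeros((20, 20), dtype=bool)
    adopted[10, 10] = True                      # a single initial adopter
    p_contact = 0.3                             # adoption chance per step when exposed

    for step in range(15):
        # A cell is "exposed" if any of its four immediate neighbors has adopted.
        exposed = np.zeros_like(adopted)
        exposed[1:, :]  |= adopted[:-1, :]
        exposed[:-1, :] |= adopted[1:, :]
        exposed[:, 1:]  |= adopted[:, :-1]
        exposed[:, :-1] |= adopted[:, 1:]
        # Exposed, not-yet-adopting cells adopt with probability p_contact.
        new = exposed & ~adopted & (rng.random(adopted.shape) < p_contact)
        adopted |= new
        print(step, int(adopted.sum()))         # how far the attribute has spread

Denser populations (more neighbors within reach) or a larger contact probability speed up the spread, which is the point made in the text about density, spatial distribution, and the scale of the spreading mechanism.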
A third type of process involves interaction, in which outcomes at one location affect and are affected by outcomes at other locations. Prices set at a group of retail outlets in a given area may reflect a system of price action and response among retailers in that market. Whether retailer A responds to a price change by another retailer (B) depends on the anticipated effect of B's change on demand levels at A, and this may also determine whether any price reaction at A must match exactly the price change at B; the nearer the competitor at B, the more likely A will need to respond in full. Such interaction appears to be influenced by the spatial distribution of sellers, such as their density and clustering (Han and Li, 2020). In a diffusion process the attribute propagates through a population whose members have a fixed structure; the final type of process, a dispersal process, represents the dispersal of the population itself. Such dispersal processes include the spread of seeds from a parent plant, the dispersal of physical properties such as atmospheric or marine pollution, or the distribution of nutrients in a soil.
8.5.1. Geographic Space: Objects, Fields, and Geometric Representations
Modeling geographic reality refers to the process of capturing aspects of the real world in a finite representation so that digital processing becomes possible. This abstraction of a "real, continuous, and complex geographical distribution" into a finite sequence of discrete "bits" involves generalization and simplification. Objects and fields are the two basic conceptual models of the entities that make up geographic reality, and the distinction is best conveyed through examples. Temperature, snow depth, and elevation above sea level are all appropriately conceived as fields (Girshick et al., 2011), whereas a house (point), a road (line), or a political unit (area) are typically thought of as objects. Objects are discrete things in the real world, whereas a field is a single-valued function of location in two-dimensional space. Typically, one of these two conceptions corresponds better to our mental picture of the actual world and also provides a better basis for accurate computation (Figure 8.7).
Figure 8.7. Device controlling agricultural robot. Source: https://unsplash.com/photos/Bg0Geue-cY8.
Points, lines, areas, and surfaces are the four most common types of digital object used to represent geographic phenomena. Digitally, object space is depicted by points, lines, or areas. A town can be represented as an area or, at a different scale, as a point. The representation of an area can be refined using census tract-level information (e.g., wards or municipalities). Each enumeration district can be depicted as an area object with its administrative boundary, or as a point object with the areal or population-weighted centroid identified. At the scale of individual households, the town's population can be depicted as address point objects (Gotelli et al., 2012); census tracts are then a spatial aggregation of these foundational entities. A city or a forest is defined as an area object with a boundary line drawn, even if in reality the boundary is ambiguous and "fuzzy." In the case of a field, data values for an attribute are conceivable at each of an infinite number of point locations on the ground, so in order to store data about a field in a data matrix it must be made finite. Contour lines can be used to represent the field as a surface. When a field is represented with areas, the region is frequently divided into small regular spatial units called pixels; pixel size defines the representation's spatial resolution and is the field counterpart of spatial aggregation. To depict a field with points, sample locations must be chosen. For a measure such as soil or snow depth a point measurement may be sufficient, but a measure such as air pollution is a function of the size of the block ("support") used to define the quantity (Giraldo et al., 2010). In the case of area objects and field representations, the areas are either defined independently of the data values, as with census tracts and image pixels, or their boundary lines reflect changes in the data values. In the first case the areas are said to be intrinsic; in the second, the areal division is imposed after analyzing the data values, and the division defines
homogeneous (or quasi-homogeneous) zones or areas. Fields can be divided into blocks of pixels with similar or identical values, and census tracts can be grouped into larger sets of contiguous tracts that are similar with respect to at least a small number of variables. However, there are cases where there is a choice of conceptualization. Population distribution can be thought of as an object or as a field. If objects are chosen, the representation can take the form of points (e.g., by residence) or of counts by regular areas (pixels) or irregular areas (e.g., census tracts). If the conception is in terms of a field, the representation can take the form of a density surface, constructed, for example, by kernel density smoothing or interpolation. Bithell, and Kelsall and Diggle, create relative risk surfaces for disease by converting population counts and disease counts into density surfaces using kernel density methods (Figure 8.8).
Figure 8.8. Macro of Washington DC on the map. Source: https://www.istockphoto.com/photo/washington-dc-on-the-mapgm172657210-589118?utm_source=unsplash&utm_medium=affiliate&utm_ campaign=srp_photos_top&utm_content=https%3A%2F%2Funsplash.com% 2Fs%2Fphotos%2Frepresentations-mapping&utm_term=representations%20 mapping%3A%3Asearch-explore-top-affiliate-outside-feed-x-v2%3Acontrol.
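Kernel density smoothing of point data into a surface, as mentioned above for the field view of a population, can be sketched with scipy's gaussian_kde; the residence coordinates below are randomly generated stand-ins for real address points.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)

    # Hypothetical residence locations (x, y) forming two clusters of households.
    pts = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(200, 2)),
                     rng.normal([6.0, 5.0], 0.8, size=(150, 2))])

    # Fit a 2-D kernel density estimate to the point pattern.
    kde = gaussian_kde(pts.T)

    # Evaluate the smoothed density surface on a regular grid of "pixels."
    gx, gy = np.mgrid[0:8:80j, 0:8:80j]
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

    print(density.shape, float(density.max()))   # an 80 x 80 density surface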
8.6. THE SPATIAL DATA MATRIX: ITS QUALITY
Two aspects of the mapping from reality to any particular data matrix affect the relationship between the real world and the data values, such as
the inheritance of fundamental properties like spatial dependence. These aspects are, first, the decisions made about the representation to be used (in terms of both the depiction of geographic space and the attributes to be included and how they are to be measured) and, second, the accuracy of measurement (of both geographic coordinates and attribute values) given the chosen representation (Girshick et al., 2011). The chosen representation serves as the working model of the real world, so the first stage of evaluating a data matrix can be in terms of model quality. Model quality can be assessed in terms of a representation's precision (as opposed to vagueness), clarity (as opposed to ambiguity), comprehensiveness (in terms of what is included), and consistency. The level of precision or spatial aggregation is also a representational issue. The second stage of evaluation concerns the quality of the data given the model: the accuracy of (or absence of error in) the data, as well as completeness of coverage, are the important considerations here. The overall relationship (arguably the most important one) between the space-(time-)attribute data matrix and the reality it is intended to capture is sometimes described in terms of the uncertainty of the mapping. The structure of uncertainty associated with any matrix is thus a complex combination of the two stages involved in moving from geographic reality to the spatial data matrix: model conceptualization and data acquisition through measurement (Figure 8.9).
Figure 8.9. Assessment icon set. Source: https://www.istockphoto.com/vector/assessment-icon-setgm1156682628-315339724.
Any set of spatial data represents an abstract concept of a complex reality. There are generic features of data quality in terms of attribute, spatial-object, and time accuracy, precision, consistency, and completeness. The main message is that the analyst must be alert not only to these problems, but also to how they may vary across a study region, or when making comparisons across time or between separate study areas. It is up to the user to assess the quality of a data set and determine its suitability (Bustos-Korts et al., 2016). It may not be possible to examine the whole data set, because doing so would significantly increase data capture costs, but a representative sample of the information should be examined to evaluate its quality. It will also be essential to determine which inconsistencies are important and which errors are unlikely to have serious consequences.
8.7. SOURCES OF SPATIAL DATA

A researcher collects primary data to meet the precise objectives of the study. Primary data in observational science come from field research and sample surveys. If hypotheses with a spatial or geographical component are to be tested, surveys should guarantee accurate and careful geo-referencing of each observation – as accurate as privacy and security allow. This will be useful in later stages when data from other surveys may need to be linked. Focused measurement in different areas is required to investigate local-area and contextual influences. When national survey results are applied to local areas, they may not produce accurate estimates because the sample size in the local region is small. If local-area estimates are required, stratification must be incorporated into the sampling strategy. Maps, national and regional social, economic, and demographic survey data, data produced by governmental institutions such as health, police, and local governments, and also commercial data sets produced by private organizations such as the retail and economic sectors, are examples of secondary spatial data sources. Even when such data (for example, national censuses) appear to be full enumerations of the population, it is sometimes safer to treat them as samples. Satellites are an essential source of ecological data, and they can be used in conjunction with socioeconomic, topographic, and other ancillary data to create descriptions of urban areas, for example. These advancements are largely due to recent hardware developments and the development of geoinformation systems that enable the handling, such as linkage, of huge georeferenced data sets. Decker reviews GIS data sources, while Lawson and
Williams discuss spatial epidemiology (Chai et al., 2020). Data integration, the procedure of assigning distinct spatial data sets to a common spatial framework, raises a number of technical issues concerning how such integration should be implemented as well as the reliability of the integrated data sets. Data sets can have varying quality, different lineages, different collection frequencies, and be on different spatial structures that evolve over time (Figure 8.10).
Figure 8.10. The concept of collecting data on humidity, temperature, illumination of acidity, fertilizers, and pests without human intervention, the transmission of the obtained data and their analysis to increase the yield. Source: https://www.istockphoto.com/photo/the-concept-of-collectingdata-on-humidity-temperature-illumination-of-acidity-gm1218970790356394416.
Some data are created by feeding sample data into a model, which produces a spatial surface of values. This may be done when the variable of interest is complicated or costly to measure. Monitoring air pollution is costly. Data on known point, line, and area sources of air pollution are combined with various climatic data and assumptions about how contaminants disperse to create air pollution maps for an area. After calibrating and validating
the model output against available sample data, it is used to generate air pollution maps. If a probability model is specified for the variable of interest, not only the average surface but also the variability around the average will be of interest. Simulation methods are used to display this variability.
8.8. THE PURPOSE AND CONDUCT OF SPATIAL SAMPLING

The goal of spatial sampling is to draw conclusions about a population in which each member has a geographical reference or geo-coding, based on a subset of individuals drawn from that population. For a variety of reasons, sampling is used rather than conducting a full census. A complete census may be physically difficult or impossible due to the size of the population. As in the case of ground-level air quality or soil characteristics in an area with a continuous covering of soil, there may be an infinite number of locations where measurements could be taken. The cost of gathering information on each individual may preclude a full census. The 1991 UK Census only includes data on domestic employment based on a 10% sample of all returns, owing to confidentiality concerns as well as the expenses of manual coding (Gallyamov et al., 2007). Remote sensing data provides a comprehensive census (at a given resolution), but the data is interpreted by ground truthing based on a sample of locations for budget reasons. In other cases, it is the level of precision required by the application, rather than the overall population or the costs of obtaining the data, which necessitates sampling. Sampling introduces error, in the sense that the quantity of interest is estimated only to within some level of precision – the inverse of the error variance or sampling error associated with the estimator. The choice of sample size can restrict the error variance to set limits. Taking a full enumeration (or even a very large sample) could be a waste of time if such precision is not required. Furthermore, if measurement error exists, the precision of a census may be merely an illusion, and, in a reversal of what would be deemed the normal relationship between census and sample, sampling might well be required to improve the quality of 'census' information. In the case of crime data, it is well known that counts based on police records generate undercounts, so household sampling is used to enhance estimates. Particular groups (like the homeless) are underrepresented in population censuses, and sampling may be used to improve information on them.
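The idea that sample size can be chosen to hold the sampling error within set limits can be illustrated with a minimal sketch. It assumes simple random sampling of a mean and a guessed population standard deviation (in practice obtained from a pilot survey); all numbers are illustrative.

```python
# A minimal sketch of choosing a sample size so that the standard error of a
# sample mean stays below a target value, assuming simple random sampling.
import math

def sample_size_for_mean(sigma, margin_of_error, z=1.96):
    """Smallest n such that z * sigma / sqrt(n) <= margin_of_error."""
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Example: a soil contaminant with an assumed sd of 12 mg/kg, where the 95%
# margin of error of the estimated mean should be at most 2 mg/kg.
print(sample_size_for_mean(sigma=12.0, margin_of_error=2.0))  # -> 139
```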
8.9. MODELS FOR MEASUREMENT ERROR

All data contain errors as a result of the inaccuracies intrinsic to the measurement process. Error models are therefore essential in data analysis. An error model quantifies the likelihood that the true value is within a specified range of the value obtained. Valid error models allow investigation of the effects of error propagation when arithmetic or other operations are performed on one or more variables that each contain error. An important aspect of regression modeling is the specification of a suitable model for the errors.
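Error propagation of the kind described can be examined by Monte Carlo simulation. The following sketch assumes a simple additive Gaussian error model for two measured quantities and propagates it through a derived quantity; the variables and error magnitudes are illustrative only.

```python
# A minimal sketch of error propagation under an assumed Gaussian measurement
# error model: two measured quantities are combined arithmetically, and the
# spread of the result is estimated by Monte Carlo simulation.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Assumed error model: measured value = true value + Gaussian error.
area_km2 = 4.0 + rng.normal(0.0, 0.2, n)         # area with sd 0.2 km^2
population = 12_000 + rng.normal(0.0, 500.0, n)  # count with sd 500 persons

density = population / area_km2                  # derived quantity

print("mean density:", density.mean())
print("sd of density (propagated error):", density.std())
```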
8.10. ANALYSIS OF SPATIAL DATA AND DATA CONSISTENCY

Consistency checks are necessary to guarantee that data values do not fall outside permitted ranges, like percentages that must be between 0% and 100%, or measures of spread or distances that must be positive valued. When using interpolation methods to merge spatial units or move to a common spatial framework, problems can arise. Counts can be added together, but when calculating new percentages or averages, the analyst should refer back to the original data. Consistency checks are required to ensure that errors are not introduced into a database as a result of inappropriate or inaccurate data manipulation (Frankenhuis et al., 2019). GIS software, for instance, does not always issue warnings when improper spatial operations are conducted. Many types of statistical analysis must be performed outside of a GIS, and discrepancies can enter a database as a consequence of data transfer errors or the creation of multiple copies of a file, which may then, inadvertently, undergo distinct modifications and updates (Figure 8.11).
Figure 8.11. Consistency check. Source: https://www.computerhope.com/jargon/c/consiste.htm.
Consistency checks are most critical when multiple databases must be merged or synchronized, especially if those databases were assembled by different agencies. The issues are likely to be exacerbated when the data sets were not collected at the same scale. When combining data sets that refer to different time periods, the analyst must be aware of this, report the time-period differences, and consider the potential implications for interpreting findings. When the health data correspond to an intercensal period, population information generated from a census could provide a poor measure of the appropriate denominator for calculating an incidence rate. Along with maintaining consistency in attribute values, it is also essential to ensure consistency when combining spatial objects, so that houses, for instance, are not situated in the middle of water bodies. Data inconsistency is a type of data error, but it is usually treated separately. Inconsistency errors can be subtle or extreme, but they can, in theory at least, be avoided by performing appropriate checks on the database during and after data operations.
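A few of the checks described in this section can be automated, as in the following minimal sketch. The table, its column names, and its deliberately invalid values are synthetic; the point is the range checks and the recomputation of a rate from the original counts after merging units, rather than averaging tract-level rates.

```python
# A minimal sketch of automated consistency checks on a small synthetic table.
import pandas as pd

tracts = pd.DataFrame({
    "tract": ["A", "B", "C"],
    "population": [1200, 800, 1500],
    "cases": [24, 10, 45],
    "pct_unemployed": [6.5, 103.0, 4.2],        # 103.0 violates the 0-100 range
    "distance_to_clinic_km": [1.2, -0.4, 3.1],  # a negative distance is invalid
})

# Range checks: percentages must lie in [0, 100], distances must be positive.
bad_pct = tracts[(tracts.pct_unemployed < 0) | (tracts.pct_unemployed > 100)]
bad_dist = tracts[tracts.distance_to_clinic_km <= 0]
print("rows failing percentage check:\n", bad_pct[["tract", "pct_unemployed"]])
print("rows failing distance check:\n", bad_dist[["tract", "distance_to_clinic_km"]])

# When merging spatial units, recompute rates from the original counts rather
# than averaging the tract-level rates.
merged_rate = tracts.cases.sum() / tracts.population.sum()
print("incidence rate for merged units:", round(merged_rate, 4))
```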
8.11. EDA (EXPLORATORY DATA ANALYSIS) AND ESDA (EXPLORATORY SPATIAL DATA ANALYSIS)

Exploratory data analysis (EDA) is defined by one expert as a set of techniques for summarizing data properties (summary statistics), identifying trends in the data, identifying unexpected or interesting features, detecting data errors, distinguishing accidental from significant features in a data set, and suggesting hypotheses from the data. EDA techniques can also be used to investigate the results of a model, offer indications of whether model assumptions are met, and determine whether there are data effects that influence model fits. EDA techniques make no assumptions about the population from which the data are derived, and hypothesis testing is frequently avoided. The techniques used are visual (such as charts, graphs, and figures) and/or quantitative in the sense that they are statistical summaries of the data. Statistics such as a distribution's median and quartiles are calculated and then presented in a boxplot or added to a dot plot (Fuentes et al., 2007). A scatterplot is created, and the data are summarized using a numerical smoothing operation like a loess curve. Numerous exploratory techniques remain 'close' to the original data by employing only simple transformations of the original data and do not rely on inference theory (Figure 8.12).
Figure 8.12. What is exploratory data analysis? Source: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15.
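The notion of resistance developed in the next paragraph can be illustrated with a small numerical sketch: a single extreme value shifts the mean and standard deviation substantially while the median and inter-quartile range barely move. The values below are synthetic.

```python
# A small numerical sketch of resistant versus non-resistant summaries.
import numpy as np

values = np.array([3.1, 3.4, 3.6, 3.8, 4.0, 4.2, 4.5, 4.7, 4.9, 5.1])
with_outlier = np.append(values, 50.0)  # one gross error or extreme case

def summary(x):
    q1, q3 = np.percentile(x, [25, 75])
    return {"mean": x.mean(), "sd": x.std(ddof=1),
            "median": np.median(x), "iqr": q3 - q1}

print("without outlier:", summary(values))
print("with outlier:   ", summary(with_outlier))
```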
The statistics used are typically 'resistant,' which means they are unaffected by the presence of a small number of extreme values. The median and inter-quartile range are resistant estimators of the center and spread of a numerical distribution, while the mean and standard deviation are not. The median is a resistant measure of the center (or location) of a set of data values because it takes the middle value (the 50th percentile) as the center when the data values are ordered. Even when a few values on either end of a distribution are significantly larger or smaller than the rest, the calculation of the median is unaffected. The mean, on the other hand, as an indicator of the center of a distribution, is influenced by extreme values because each data value in a group of n values contributes (1/n)th of its value to the mean. The inter-quartile range, or the difference between the bottom and top quartiles, is a correspondingly resistant measure of the spread of a distribution of values, while the standard deviation is premised on the mean, and each value's squared difference from the mean contributes (1/n)th of the square of the standard deviation (the variance). ESDA techniques include summarizing the spatial properties of data, detecting spatial patterns in the data, formulating hypotheses that refer to the geography of the data, and identifying cases or subsets of cases that are unexpected given their location on the map. ESDA techniques, like those of EDA, are visual and numerically resistant. The map, which shows the analyst where cases are located and how they relate spatially to one another,
plays an especially important role in data analysis and in investigating model results. It will be necessary to be able to answer questions like: 'where do those exceptional cases on the histogram fall on the map?'; 'where do attribute values from this portion of the map fall on the scatterplot?'; 'which cases fall in this subregion of the map and satisfy these stipulated attribute criteria?'; 'what are the spatial trends and spatial associations in this data set?'; or 'what are the spatial patterns and spatial connections in this designated geographical subset of cases?' In the case of regression modeling, we would want to view a map of the positive and negative residuals and ask whether there is any evidence of spatial pattern in their arrangement (Fourcade et al., 2018). According to this viewpoint, the set of ESDA tools includes those used in EDA as well as supplementary methods that address the special questions that arise as a result of the spatial referencing of the data.

Conceptual models of spatial variation: (a) regionalization. The regional model has a long history in geography as a conceptual model for spatial variation. Many different types of regions have been specified in the literature, reflecting different requirements when theorizing about phenomena with a geography. The focus here is on the description of areas as spatial units used for spatial data analysis. There are three types of regions. The sharp segmentation of space into cohesive or quasi-homogeneous areas creates formal (or uniform) regions. Formal regions are divisions of space based on classification similarity and spatial contiguity. As a result, formal regions should be derived using operations that follow standard classification rules. Changes in attribute levels determine region borders. Such a separation or segmentation into formal regions is easier to rationalize when the number of variables used in the 'regionalization' (the process of partitioning space) is small, and becomes increasingly more challenging as the number increases, unless the variables have very strong covariation. When attribute variation is continuous across space, the formal regional model is also hard to apply (Farrell et al., 2010). Although automated techniques have been created to assist analysts in specifying formal regions, as in the field of land cover mapping, it is still common to see them constructed using a combination of formal methods and local understanding in social science applications. Interaction data is used to demarcate functional (or nodal) regions. While formal regions are characterized by attribute value uniformity, functional
regions are delineated by the sequence of social or economic interplay that takes place within them and distinguishes them from neighboring functional regions. In the United Kingdom, labor market regions are functional areas identified by worker commuting patterns (commute areas) and search areas used by businesses when hiring new employees. Formal and functional criteria are combined in some regional definitions. Community and neighborhood areas can be defined by attribute resemblance, like housing type, but also by reference to social networks and the shared use of local amenities like shops or General Practice surgeries. Landmark linear features (like roads and railway lines) can help define the boundaries of socially constructed functional regions, especially if they act as obstacles to interplay across them. Political and other aspects of decision-making result in administrative regions. Administrative regions are typically defined by precise boundaries and are created by government agencies and public and private sector institutions to manage space. They offer a framework for data collection, service delivery, and the distribution of public funds, and in the scenario of private sector companies operating on a national scale, they can provide structure for product marketing and the implementation of price-setting guidelines (Dutilleul et al., 2000). Administrative regions, more than the other two regional types, have reasonably sharp boundaries when policies are enforced that distinguish the difference between places based on which administrative unit they are located within (Figure 8.13).
Figure 8.13. Administrative regions of the Netherlands. Source: https://www.researchgate.net/figure/Administrative-Division-of-theNetherlands_fig3_313359679.
8.11.1. Exploratory Data Analysis (EDA) and Data Visualization

Visualization is a natural part of EDA. Visualization aids the process of detecting data properties, which is a primary goal of EDA and ESDA. Many forms of graphical display are provided as separate, end-product views of the data. They are intended to communicate clearly to wider audiences (who are unfamiliar with the data) the features of the data that have been identified. These presentation graphics typically provide static representations of the data and serve as a lasting, selective record of what is known. Such visuals are not intended to aid in data analysis. Data visualization or scientific visualization, however, is concerned with the provision of numerous graphical views of a data set as part of an ongoing process of understanding and gaining insight into the data – that is, recognizing data properties. The user of data visualization tools is likely familiar with the data, possibly because he or she gathered it (Christensen and Himme, 2017). The user expects to be able to 'interact' with the data easily, generating numerous, dynamic but probably temporary views of the data, many of which will be used and then discarded. Data visualization frequently entails preserving each individual piece of data in a graph while also utilizing some (resistant) smoother to aid in the detection of patterns in what can be a complex array of data. Rather than providing graphics for a final report, data visualization encompasses a variety of strategies and personal tools to support data analysis. The literature on data visualization in statistics and computer science is primarily concerned with graphic tools. Spatial data visualization must also make use of cartographic tools – various types of map display – referred to as cartographic visualization. Cartographic visualization concerns include, but are not limited to, those of ESDA. Spatial data visualization also includes multi-media and virtual reality (virtual landscape) representations, which are not covered here. The immediately following section focuses on issues inherent in all aspects of data visualization for ESDA and draws specifically on Cleveland's work. There is a great deal of activity in this area, and it is important to highlight the work that has attempted to bring order to it (Figure 8.14) (Christensen and Himme, 2017).
Figure 8.14. Businessmen in a dark room standing in front of a large data display. Source: https://www.freepik.com/premium-photo/businessman-working-withbig-displays-dark-room_16303656.htm.
8.12. DATA VISUALIZATION: APPROACHES AND TASKS

The proposed taxonomy for categorizing work in visual analytics divides methods of information visualization (rather than individual tools) into two categories: rendering and manipulation. Rendering is the process of deciding what to display in a plot and, more specifically, what type of plot to build. For univariate data, this includes techniques for displaying distributions (scatterplots, boxplots, Q–Q and rankit plots) as well as time series (plots). When dealing with multivariate data, one expert recognizes scatterplots (in which cases are represented as points), traces (in which cases are represented as functions of a variable, as in parallel coordinate plots), and graphic symbols or glyphs. Scatterplots are the most fundamental visual tool for two or more variables (Christensen and Himme, 2017). Glyphs are typically placed in a layout to aid their interpretation (Figure 8.15).
Figure 8.15. Types of data visualization. Source: https://boostlabs.com/blog/10-types-of-data-visualization-tools/.
Manipulation is the process of operating on individual plots and organizing multiple plots to explore data. Plot manipulations can be classified according to the data exploration tasks they are intended to support. Buja et al. identify three data exploration tasks: finding gestalt, posing queries, and making comparisons. The task of recognizing trends, shapes, and other characteristics in a data set is known as 'finding gestalt.' Individual cases or subsets of the data set are queried in order to better understand and explore the gestalt features that have been identified. The third task is to make comparisons between variables, between projections of a set of variables, or between subsets of data. Specific manipulations are best suited to specific tasks (Zio et al., 2004). Manipulations that involve determining which variables to include, what projections to use, and what level of resolution and detail to use are appropriate for finding gestalt. A parallel can be drawn between this process and the process of using and, in particular, focusing a camera. Manipulations involving linking multiple views of the data and highlighting data subsets to recognize where the different subsets lie in each of the different views (brushing) are appropriate for posing queries, whereas decisions about how to organize plots have significant implications for the task of making comparisons. The usefulness of any statistical graph in relation to a task is also determined by the quality of its design. One theorist develops a model that differentiates between table look-up and pattern perception, which he claims
are the two primary activities of an observer when reading a graph. These activities overlap with the task classification above. Table look-up is analogous to the task of performing queries on specific cases. The reader recovers the original data values encoded in the graph. The task is completed slowly, one data case at a time. Pattern perception is the identification and organization of geometric objects in order to see patterns in the encoded data. Cleveland describes this as "extremely fast processes that seem to function in parallel to generate objects, or gestalts." To assess the effectiveness of any statistical graph, we must first identify the tasks that the user performs when analyzing graphs in either of these two modes. Table look-up requires the user to perform one or more of the following tasks: scanning (identifying values by reference to the axis or legend); interpolation (for example, between tick marks on an axis); and matching (finding a symbol in the key). Pattern perception requires the user to perform one or more of the following tasks: detection (recognizing the symbols used to encode values); assembly (visual grouping of similar symbols); and estimation (of relationships between assembled groups). Good visual design helps the user complete these tasks. It has been stated that good design principles can be considered as facilitating the reader in completing one or more of these tasks. Experts list graph construction principles that ensure clarity of vision: data points should stand out from other graph features; avoid clutter (Ying and Sheng-Cai, 2012); prevent overlapping data points; and avoid putting notes or other attributes on the graph that cannot be turned off when not needed. Graphs must be insightful rather than merely convenient to read. Visualization plays a role in ESDA, as the preceding review of methods and technologies shows. When fitting produces a summary representation of the data, the role of visualization tools can be extended even further. This aspect of visualization will be discussed further in the following chapter. However, it is also critical to emphasize the limitations of visualization in data analysis. First, without some sort of data summary, the eye is confronted with a dizzying array of images. When confronted with a plethora of exploratory routes through the data, it may be challenging to construct and organize ideas logically. Another expert suggested using graphic commands to help construct such spatial data interrogations. Second, images can be deceptive and ambiguous. Different individuals may see different patterns and correlations. This can be mitigated in part by creating better visuals and maps. However, just as careful data preparation is required prior to visualization, so are good fitting processes and other
statistically based manipulations of the data in order to build understanding and acquire insight into the data (Figure 8.16).
Figure 8.16. Land cover surrounding Madison, WI. Fields are colored yellow and brown, water is colored blue, and urban surfaces are colored red. Source: https://en.wikipedia.org/wiki/Spatial_analysis#/media/File:NLCD_ landcover_MSN_area.png.
Large tables of spatial data collected from censuses and surveys are dealt with in urban and regional studies. To extract the main trends, a massive amount of detailed data must be simplified. Multivariate analysis enables a change of variables, converting the many variables of the census, which are usually correlated among themselves, into a smaller number of independent 'factors' or 'principal components,' which are, in fact, the eigenvectors of the data correlation matrix ranked by the inverse of their eigenvalues. This change of variables has two major benefits: because information is concentrated on the first new factors, it is feasible to keep only some of them while losing only a limited amount of information, and mapping them produces fewer and more significant maps. By construction, the factors, or eigenvectors, are orthogonal, i.e., uncorrelated. In most instances, the dominant factor (with the highest eigenvalue) is the Social Component, which divides the city's rich and poor (Xia et al., 2017).
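The principal-components step described here can be sketched directly from the eigen-decomposition of the correlation matrix. The variables below (income, rent, household age) are synthetic stand-ins for census attributes; the sketch is illustrative, not a full factorial ecology analysis.

```python
# A minimal sketch of principal components from the correlation matrix:
# standardize the variables, decompose their correlation matrix, and order the
# eigenvectors (components) by decreasing eigenvalue (explained variance).
import numpy as np

rng = np.random.default_rng(2)
n = 300
income = rng.normal(30, 8, n)
rent = 0.8 * income + rng.normal(0, 3, n)   # correlated with income
household_age = rng.normal(45, 12, n)       # roughly independent
X = np.column_stack([income, rent, household_age])

Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the variables
corr = np.corrcoef(Z, rowvar=False)         # correlation matrix

eigvals, eigvecs = np.linalg.eigh(corr)     # symmetric eigen-decomposition
order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                        # component scores per observation
print("share of variance by component:", eigvals / eigvals.sum())
print("loadings of first component:", eigvecs[:, 0])
```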
Because the factors are uncorrelated, other, smaller processes than social status, which would otherwise be hidden, show up on the second, third, and subsequent factors. Factor analysis depends on measuring distances between observations, so selecting a relevant metric is critical. Among the more popular are the Euclidean metric (Principal Component Analysis), the Chi-square distance (Correspondence Analysis), and the generalized Mahalanobis distance (Discriminant Analysis). More complex models, such as those based on communalities or rotations, have been proposed. The use of multivariate methods in spatial analysis began in the 1950s (though some examples date back to the turn of the century) and culminated in the 1970s, as computer power and accessibility increased. In a seminal publication in 1948, two sociologists, Wendell Bell and Eshref Shevky, demonstrated that most city populations in the United States and around the world could be represented by three separate factors:
•	The "socioeconomic status," which is divided into rich and poor regions and distributed along highways from the city centers;
•	The "life cycle," or the demographic factors such as age of households, represented by concentric circles (Wieland et al., 2010); and
•	"Race and ethnicity," identifying spots of migrants throughout the city.
British geographers used FA to characterize British towns in a groundbreaking study in 1961.
CHAPTER 9

CHALLENGES IN ENVIRONMENTAL STATISTICS
CONTENTS
9.1. Introduction..................................................................................... 226
9.2. Statistical Models for Spatiotemporal Data (STD)............................. 227
9.3. Spatiotemporal (ST) Relationships.................................................... 228
9.4. Data Characteristics......................................................................... 229
9.5. Random Fields................................................................................. 231
9.6. Gaussian Processes and Machine Learning (ML)............................... 232
9.7. Neural Networks............................................................................. 232
9.8. Population Dynamics Stochastic Modeling...................................... 233
9.9. Population Dynamics...................................................................... 234
9.10. Spatial Extended System................................................................ 235
9.11. Non-Gaussian Noise Sources......................................................... 236
9.12. Environmental Exposures and Health Effects in Collection of Environmental Statistics........................................ 237
9.13. General Logic and Strategy............................................................ 237
226
Introduction to Environmental Statistics
9.1. INTRODUCTION

Two major issues face current ecological and environmental complex systems studies: i) the ability to collect and simulate a profusion of data relevant to diverse elements of environmental, climatic, and ecological processes; and ii) the rising need to uncover and comprehend general principles that underpin the impacts of environmental noise on population dynamics. Ground-based stations, remote sensing, and sensor-network equipment are used to collect this information. In contrast to the past, when environmental data was limited and difficult to get, there is now an abundance of Earth observation data (Wagner and Fortin, 2005). Complex spatiotemporal (ST) dependencies, big volume (large size), high speed (time series of satellite pictures), high dimensionality (many spectral bands/sources, etc.), high uncertainty (due to measurement and registration errors), and non-repeatability (due to non-stationary evolution) are all characteristics of these data. Furthermore, the data frequently have geographical or temporal gaps. New issues arise when modeling complicated environmental data and the behaviors of ecological complex systems. Statistical physics can help solve these issues by: i) offering physically inspired statistical modeling tools; ii) explaining the underlying physical mechanisms through theoretical models; and iii) enhancing our knowledge of the effect of ambient noise. In ecological and environmental modeling, stochastic components are crucial. In statistical physics, stochastic signals are frequently referred to as "noise." It should be noted, however, that "noise" includes both correlated and uncorrelated fluctuations. Short- and long-range correlations contain valuable physical information that may be used to analyze, simulate, and forecast the underlying processes. Furthermore, the interaction of stochastic fluctuations with nonlinear dynamics has a substantial influence on climate, environmental quality, and natural resource availability. Prediction and forecasting require statistically accurate physical models of fluctuations and a knowledge of their interactions with nonlinearity (Toivonen et al., 2001). The continual and inescapable presence of random variations arising in the environment generates noise in ecological systems. As a result, stochastic techniques with multiplicative noise sources should be used to represent community dynamics, epidemics, and genetics. Stochastic fluctuations, in particular, cause phenomena that cannot be described using deterministic techniques that treat noise as a nuisance. For example, stochasticity in gene expression allows cells to adapt to changing surroundings and respond to unexpected stimuli. Throughout
cellular differentiation and development, it also helps to establish population heterogeneity. Through adaptive mechanisms, noise can potentially cause heterogeneity in cell fate. Indeed, the existence of noise can profoundly alter the system's physics. As a result, it is not surprising that noise research in ecological and biological systems has become a "hot" academic area in recent years. The purpose of this chapter is to summarize current research and to pique physicists' interest in open research issues arising from environmental data modeling and ecological complex systems. Such issues include questions about stochastic dynamic models for ecological systems and about data analysis models that may offer precise information about the underlying properties of environmental and ecological systems. Because the interpretation of such data can benefit from physical understanding of the basic processes, the arrival of massive Earth observation and environmental data poses numerous challenges and opportunities for physicists (Sivak and Thomson, 2014).
9.2. STATISTICAL MODELS FOR SPATIOTEMPORAL DATA (STD)

Spatial data analysis is a rapidly expanding field of statistics. Such data are used here to describe objects in space that have a physical position. If these positions are significant to the understanding of the data, the analyses are spatial. Invariably, data analysis entails the evaluation of some type of map; in many cases, it entails nothing more. As a result, graphical approaches for analyzing such maps are an important aspect of the process. A variety of such methods are discussed. Information technology advancements are a driving force in spatial data analysis. Automatic data acquisition (as in remote sensing), image processing hardware, geographic information systems (GISs), and dynamically linked windows are only a few examples. These innovations will have a significant impact on the sorts of spatial studies in which statisticians participate. The interconnections between several forms of data can be complicated. The spatial breakdown of these interactions is critical to the integration process. Observational and multivariate data are commonly used (Simoncelli and Olshausen, 2001). The analyses are frequently exploratory in nature. Linking a variety of distinct but relatively basic data views might be beneficial. Models can be linked to such perspectives.
These are physics-inspired statistical models and methods for representing and analyzing environmental, climatic, and ecological (henceforth, environmental) data. Model creation, parameter inference, model selection (if more than one model is evaluated), and prediction are all common activities in statistical modeling. Extrapolation (prediction outside the data's spatial region), interpolation (covering geographic or temporal gaps inside the data domain), and forecasting (projection at future times) are examples of the latter. Considering the omnipresence of stochastic disturbances and the data properties outlined in the Introduction, the probabilistic framework is well suited to the building of accurate and flexible models.
9.3. SPATIOTEMPORAL (ST) RELATIONSHIPS Spatiotemporal (ST) items that occur in the same place or at the same time and have comparable properties are frequently linked. Finding connections between items is useful for a variety of applications, including ST hotspot prediction. However, because of the unique data properties, identifying meaningful associations from ST data is more difficult than discovering valuable correlations from standard numerical and categorical data. Three of these features of ST connections are discussed in the following subsections: implicitness, complexity, and non-identical distributions:
9.3.1. Complexity Extracting ST patterns is difficult due to the intricacy of ST interactions. This intricacy is due to the fact that ST data are discrete approximations of what are continuous in space and time in reality. For example, traffic sensing devices installed on highways collect data from moving cars in specific areas while they are in motion. Furthermore, co-located ST items impact one another, making connection recognition difficult. In other words, a moving object’s pattern may be influenced by other items, much as a car’s direction, speed, and acceleration are influenced by other automobiles in its vicinity (Seid et al., 2014).
9.3.2. Implicitness

Relations like ordering, subclass-of, instance-of, and member-of are used to indicate explicit connections in non-ST data. Relationships between ST objects, on the other hand, are implicit. Distance, volume, size, and time are all traits or features that are used to construct spatial
connections. These connections can occur between points, lines, regions, or a combination of these. Include, meet, covers, overlap, disjoint, equal, within, and covered-by are topological links between two areas. An ST point can be located next to another. In an ST context, a line can cross, overlap, touch, or be contained by another line or ST region. In the case of migrating birds, for example, flight paths might be depicted as a complicated network of ST lines.
9.3.3. Non-Identical Distributions

Positive autocorrelation, or dependence, exists among spatiotemporal data items. Things that are close together in space and time are more connected and similar. Moving cars, for example, involve many dependent variables, such as position, direction, connectivity, and temporal features. If there is heavy traffic at an intersection at 4:00 p.m., there is a good possibility that there will be traffic at 4:01 p.m. as well. Furthermore, co-occurrence in space and time raises the likelihood of an ST pattern forming. When a victim and a criminal are found in the same place at the same time, for example, crime is more likely to occur. This is autocorrelation, as opposed to conventional data that are independent and identically distributed (Rovetta and Castaldo, 2020). The autocorrelation within ST objects causes standard data mining methods to perform poorly. Furthermore, computing the ST autocorrelation in big data sets is computationally expensive.
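The autocorrelation described here can be quantified in both time and space. The following sketch, on synthetic data, computes the lag-1 temporal autocorrelation of a traffic-count-like series and a simple Moran's I for values on a small grid with rook (edge-sharing) neighbors.

```python
# A minimal sketch of temporal and spatial autocorrelation on synthetic data.
import numpy as np

rng = np.random.default_rng(3)

# Temporal autocorrelation: counts at time t resemble counts at time t-1.
counts = np.cumsum(rng.normal(0, 1, 200)) + 50
lag1 = np.corrcoef(counts[:-1], counts[1:])[0, 1]
print("lag-1 temporal autocorrelation:", round(lag1, 3))

# Moran's I on a 10 x 10 grid with a smooth east-west trend plus noise.
nrow = ncol = 10
field = np.add.outer(np.zeros(nrow), np.linspace(0, 5, ncol)) + rng.normal(0, 0.5, (nrow, ncol))
x = field.ravel()
z = x - x.mean()

num, wsum = 0.0, 0.0
for i in range(nrow):
    for j in range(ncol):
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # rook neighbors
            ni, nj = i + di, j + dj
            if 0 <= ni < nrow and 0 <= nj < ncol:
                num += z[i * ncol + j] * z[ni * ncol + nj]
                wsum += 1.0

moran_i = (x.size / wsum) * num / (z ** 2).sum()
print("Moran's I (positive values indicate spatial clustering):", round(moran_i, 3))
```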
9.4. DATA CHARACTERISTICS

Spatiotemporal data mining (STDM) uses geographical, temporal, and non-ST or thematic data at various levels of granularity. This presents a number of challenging data characteristics, such as specificity, ambiguity, networked and social structure, dynamicity, heterogeneity, privacy, and poor quality. The following subsections discuss several of these characteristics:
9.4.1. Specificity When spatial-temporal data is utilized to create a good model for one application area, it may not be helpful in another. A model designed for bird migrations, for example, will not be appropriate for vehicle movements in a metropolis or molecule movement at a microscopic level. This challenge may be applied to a variety of geographical places with varying natural and cultural qualities. As a result, while ST models are built for specific domains, they cannot be generalized. Two events that are identical may belong to
distinct classes or be generated by different patterns. The ambiguity makes analysis more difficult and increases modeling and processing complexity in a variety of STD activities like classification, clustering, and pattern extraction. As a result, there is a growing demand for more STD semantic enrichment and annotation research.
9.4.2. Dynamicity To represent the development of spatial-temporal data’s distributions or densities, dynamical models are required. Over time, the dispersion of ST moving objects changes. This dynamic development of densities may be seen in a variety of settings.
9.4.3. Heterogeneity

Continuous change in space and time has an impact on every environment. Due to the influence of this continual change, ST data show variation in measurements and connections. For example, the trajectories and behaviors of city road users might change across time and location. As a result, trajectories differ on chilly days versus sunny days. Spatial heterogeneity and temporal non-stationarity are the terms used to describe this variation. As a result, most ST data contain an inherent degree of uniqueness, which can lead to contradictions between global and regional models. As a result of this variability, different mining models for different ST areas are required (Antweiler and Taylor, 2008). Otherwise, a global model derived from a ST dataset may fail to adequately characterize the observed data for a given region and time. Finding the optimal parameters for building local models is therefore a critical task. In many application fields, such as social network analysis, variability in socioeconomic observations between regions is a significant issue that must be handled when analyzing ST data.
9.4.4. Poor Quality The outcomes of the studies are directly influenced by the quality of the ST data. As a result, ensuring high-quality data before analyzing it is critical. When working with transdisciplinary data that is fragmented and distorted in chaotic circumstances, this data quality assurance may be difficult to obtain. Uncertainties, limited information, and conjectures all contribute to low quality. STD on bird migrations, for example, necessitates temperature, water, and forest data that are imprecise, limited, and reflect non-measurable features at all times and in all locations. Errors and noise
induced by malfunctioning or blocked sensors have an impact on monitoring the physical environment. To avoid affecting the STD tasks, these mistakes and noise must be rectified.
9.5. RANDOM FIELDS

Environmental data is dispersed over several time periods and spatial regions of variable sizes. As a result, appropriate statistical models should account for both spatial and temporal dependence. They should also account for the data's intrinsic uncertainty as well as the generating process's a priori unknown, complex variability. Kolmogorov and his students pioneered the use of random fields in fluid turbulence research. Random fields are frequently employed in a variety of areas, including hydrology, statistical field theory, and ST data modeling, because of their flexible space-time (ST) dependence. In statistics and physics, the development of Gaussian random fields takes different paths: in physics, the field is frequently formulated in the Boltzmann-Gibbs (B-G) representation by means of a suitable energy function, whereas in statistics, the field is expressed in terms of its covariance function and expectation (mean). For lattice data, the B-G framework enables sparse representation and computational efficiency, for example. The precision (inverse covariance) matrix may be expressed explicitly, which provides several benefits. The B-G representation produces Gauss-Markov random fields on regular grids. In continuous systems, however, the B-G representation leads to Gaussian field theories (Pleil et al., 2014). If the latter allow closed-form solutions, new spatial covariance functions can be derived, for example. However, the covariance function is difficult to state explicitly in ST field theories. Interpolation, forecasting, and modeling of complicated ST processes are all possible with random field theory. The main technique for prediction is the best linear unbiased predictor (BLUP), commonly known as kriging. Prediction at an unobserved ST point requires solving an N × N linear system (with N being the number of data points) built from the field's covariance function. The computing time and memory requirements for solving these systems with dense covariance matrices scale steeply with N (cubically in time and quadratically in memory for direct solvers). As a result, in order to manage huge datasets, approximations or alternative methodologies are required. The following are some open research problems in statistical physics: 1) the construction of ST covariance functions that are both mathematically valid and physically meaningful; 2) computationally efficient
interpolation and simulation methods for large datasets, such as methods that rely on sparse precision matrices; 3) more flexible B-G random fields for lattice and continuum spaces; and 4) innovative, computationally efficient models for non-Gaussian dependence (Adelman, 2003).
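The BLUP/kriging prediction described above amounts to solving the covariance system for a set of weights. The following minimal sketch performs simple kriging (known zero mean) under an assumed exponential covariance model; the covariance parameters and sample points are illustrative, not estimated from real data, and the example is not a production geostatistics implementation.

```python
# A minimal sketch of simple kriging at one unobserved location under an
# assumed exponential covariance with a small nugget. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(4)
sill, length_scale, nugget = 1.0, 2.0, 0.05

def cov(d):
    return sill * np.exp(-d / length_scale)

# Observed locations and synthetic, roughly zero-mean values.
obs_xy = rng.uniform(0, 10, size=(25, 2))
obs_z = np.sin(obs_xy[:, 0] / 2.0) + rng.normal(0, 0.1, 25)
obs_z = obs_z - obs_z.mean()                 # center to mimic a known mean

# Build the N x N covariance matrix of the data and the covariance vector to
# the prediction point, then solve the kriging system for the weights.
d_matrix = np.linalg.norm(obs_xy[:, None, :] - obs_xy[None, :, :], axis=-1)
C = cov(d_matrix) + nugget * np.eye(len(obs_z))

target = np.array([5.0, 5.0])
c0 = cov(np.linalg.norm(obs_xy - target, axis=1))

weights = np.linalg.solve(C, c0)
prediction = weights @ obs_z
kriging_variance = sill + nugget - weights @ c0
print("prediction at (5, 5):", round(prediction, 3))
print("kriging variance:", round(kriging_variance, 3))
```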
9.6. GAUSSIAN PROCESSES AND MACHINE LEARNING (ML)

Physicists are increasingly interested in machine learning (ML) research. ML approaches can "learn" from data nearly automatically (i.e., with little or no intervention from the modeler) and quickly adapt (generalize) to new data. In the age of big data, such characteristics are particularly attractive. Complex classification problems may be tackled effectively using ML approaches, and they also give novel ways of solving the partial differential equations that describe physical processes. Gaussian processes (GPs) are a generalization of Gaussian random fields; they are random functions of an input vector in a D-dimensional space that is not necessarily limited to the ST domain. GP regression (GPR) is an ML procedure that calculates the best estimate of an output variable based on a set of input vectors and corresponding outputs. The BLUP and GPR predictive equations are comparable. The key difference is that in GPR the ST covariance employed in BLUP is replaced with a kernel function, which expresses output correlations in terms of input-vector distance. For GP parameters, the Bayesian framework enables informed (non-flat) priors (Austin et al., 2005). As a result, the mathematical machinery established for random fields is applicable to GPs as well. Open research topics for GPs also span a variety of domains. A current priority is the creation of scalable GP models that can manage large amounts of data. Sparse GPs are approximations that can reduce the computational cost and memory requirements of GPs. The creation of sparse GPs on the basis of local inverse covariance (precision) operators is a somewhat different approach.
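A minimal GP regression sketch, assuming the scikit-learn library is available, makes the parallel with kriging concrete: the kernel (here an RBF kernel plus white noise) plays the role of the covariance function, and the predictive mean and standard deviation are returned directly. The one-dimensional data below are synthetic.

```python
# A minimal sketch of Gaussian process regression with scikit-learn.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(40, 1))                  # input locations
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 40)          # noisy observations

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.05)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)       # predictive mean and sd
for x, m, s in zip(X_new[:, 0], mean, std):
    print(f"x = {x:4.1f}  prediction = {m:6.3f} +/- {1.96 * s:5.3f}")
```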
9.7. NEURAL NETWORKS

Neural networks (NNs) are widely used ML methods for classification and regression. As a result, NNs are natural candidates for modeling environmental data. When applied to ground contamination data, for example, a GPR network provided better statistical validation measures than other approaches. NNs and GPs
(and hence random fields) have strong linkages that are seldom recognized outside the ML community (Bankole and Surajudeen, 2008). Single-layer, feed-forward Bayesian NNs with an unlimited number of hidden units (i.e., infinitely wide NNs) and independent, identically distributed priors over the parameters, for example, have been shown to be equivalent to GPs. In this instance, the kernel (covariance function) of the corresponding GP may be found in closed form. Deep, infinitely wide NNs have also recently been shown to be equivalent to GPs. The GP covariance for wide neural networks with a finite number of layers was computed using a computationally efficient recipe, and NN accuracy was shown to approach the corresponding GP's accuracy as the layer width increased. Another recent paper shows that, in the limit of infinite depth, the output of a (residual) convolutional NN with an appropriate prior over the weights and biases is equivalent to a GP, and that the corresponding kernel can be calculated exactly.
9.8. POPULATION DYNAMICS STOCHASTIC MODELING

Statistical models for natural science phenomena have traditionally been either linear or black boxes. It is now widely acknowledged that stochastic models whose behavior more closely resembles the scientific structure of the system under investigation provide a more interpretable framework for data analysis. These concepts are currently applied in a variety of domains, including atmospheric science, climate, geophysics, biology, environmental science, and hydrology. This wide range of applications necessitates a wide range of modeling methodologies (Piegorsch and Edwards, 2002). However, a number of issues with quantifying uncertainty in data analysis using these stochastic processes apply across applications. Building stochastic methods that respect current knowledge while still including plausible uncertainty quantification is one example. Another is developing statistical inference tools for analyzing complex stochastic processes observed on potentially large or complex datasets. Solving these difficulties will aid scientists in a variety of fields, including climate change forecasting and flood risk prediction. This section focuses on non-equilibrium processes used to characterize the complex dynamics of ecological systems through three key topics: population dynamics, spatially extended systems, and non-Gaussian noise sources. Researchers are particularly interested in studying stochastic nonlinear
effects in diverse domains of the biological sciences, transdisciplinary physics, and condensed matter. Stochastic pattern formation, noise-induced transport, the transition from order to chaos, noise-induced synchronization and excitability, noise-enhanced stability, resonant activation, stochastic resonance, coherence resonance, and noise-induced phase transitions are examples of new counterintuitive phenomena that can arise from the interaction between the nonlinearity of living systems and environmental noise (Pleil et al., 2014). The study of ecological time series and the modeling of ecosystem dynamics require the characterization of the resulting ST patterns and spatial structure. Neuron models are also being used to investigate such phenomena. In addition, random fluctuations can induce qualitative alterations and ecological shifts in population systems, comparable to phase transitions.
9.9. POPULATION DYNAMICS

Population dynamics is a topic of non-equilibrium statistical physics that might be regarded as a core subfield of complex ecological system dynamics. It has recently emerged as a critical tool for delving into the basic mystery of the origin and stability of biodiversity. Ecological complex systems are open systems with nonlinear interactions between constituent parts that are subject to random environmental disturbances. These systems are particularly sensitive to initial conditions, deterministic external perturbations, and random fluctuations. Modeling ecological dynamics and comprehending the mechanisms that drive the ST dynamics of ecosystems require a knowledge of far-from-equilibrium stochastic processes. Even low-dimensional systems show a wide range of noise-driven behaviors, from chaotic to well-ordered dynamics. Even modest random disruptions in systems of interacting populations might have opposing impacts, such as extinction or exponential population expansion. Identification of the general rules that govern such stochastic processes, as well as the creation of constructive tools for mathematical modeling and analysis, are critical objectives (Browning et al., 2015). The nonlinear nature of interacting biological components, such as restricted ecological niches, age and gender disparities, the dependence of fertility on population size, relationships with other populations, and environmental impacts, contributes to the complex and diverse behavior of population systems.
For wild populations, random environmental variations are a key source of danger. As a result, discovering universal processes that relate global environmental changes to population dynamics is a crucial area of study. In this context, understanding and predicting the spatial and temporal autocorrelation patterns of environmental noise is critical. The influence of environmental autocorrelation on population dynamics and extinction rates is important, and the latter may be reliably predicted provided the memory of the previous environment is taken into consideration. In one experiment, nearly 1,000 lines of the microalga Dunaliella salina were subjected to randomly changing salinity, with autocorrelation varying from negative to substantially positive values. Decreased autocorrelation values resulted in lower population growth and higher extinction rates, suggesting that nongenetic inheritance can be a primary determinant of population dynamics in randomly varying environments.
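The role of environmental autocorrelation can be explored with a very simple simulation. The following sketch runs logistic growth whose growth rate is perturbed by an AR(1) (autocorrelated) noise process; it is only an illustration of the modeling idea discussed above, and every parameter value is arbitrary.

```python
# A minimal sketch of population dynamics under autocorrelated environmental
# noise: logistic growth with a growth rate perturbed by an AR(1) process.
import numpy as np

rng = np.random.default_rng(6)

def simulate(rho, steps=500, r0=0.1, K=1000.0, sigma=0.15, n0=50.0):
    """Return the population trajectory for environmental autocorrelation rho."""
    n, eps = n0, 0.0
    trajectory = []
    for _ in range(steps):
        # AR(1) environmental noise with lag-1 autocorrelation rho.
        eps = rho * eps + np.sqrt(1.0 - rho ** 2) * rng.normal(0.0, sigma)
        r = r0 + eps
        n = max(n + r * n * (1.0 - n / K), 0.0)   # noisy logistic update
        trajectory.append(n)
    return np.array(trajectory)

for rho in (-0.5, 0.0, 0.8):
    final = simulate(rho)[-1]
    print(f"autocorrelation {rho:+.1f}: final population ~ {final:8.1f}")
```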
9.10. SPATIAL EXTENDED SYSTEM

The importance of correlations and noise in biological systems is demonstrated by stochastic population dynamics models in spatially extended systems. Theoretical techniques characteristic of non-equilibrium statistical physics have revealed surprising new and fascinating behavior in basic archetypal model systems, capturing the noisy kinetics of large many-body biological systems. This behavior ranges from the onset of continuous out-of-equilibrium phase transitions and the spontaneous creation of rich ST patterns, to persistent population oscillations stabilized by intrinsic noise and strong renormalization of the associated kinetic parameters induced by correlations (Parise et al., 2014). Through the formation of noise-stabilized structures, spatial degrees of freedom can dramatically increase extinction times, promoting species diversity and ecological stability. For instance, noise-stabilized activity fronts in spatially extended predator-prey systems establish persistent correlations, allowing the directed percolation universality class to govern the key steady-state and non-equilibrium relaxation dynamics near the predator extinction threshold. The spatial structure of ecosystems is complicated. Nevertheless, spatial models still have a number of unsolved issues, restricting quantitative understanding of spatial biodiversity and temporal scales. Indeed, the relationship between the physics of non-equilibrium phase transitions and spatially extended ecological models is a critical open question.
Furthermore, changes in external circumstances have a significant impact on the dynamics and structure of biological systems. Indeed, the distinctive timeframes of environmental changes, and also their correlations, play a critical role in how biological systems adapt to and respond to environmental variability. The fixed population density and the impact of random fluctuations in extinction dynamics inside an ecosystem are both relevant topics. Random fluctuations have an important function in cell biology. Noise in intracellular processes has been shown to restrict cell function by limiting the ideal concentration of the cell’s molecular constituents. Noise, on the other hand, may be utilized to promote variation in a clonal population, which can be used to support bet-hedging methods in changing settings (Patil and Taillie, 2003). As a result, noise reduction can have significant selection effects on genome evolution.
9.11. NON-GAUSSIAN NOISE SOURCES

In models of natural systems, empirical population dynamics data are typically described using additive white Gaussian noise. Some recent research studies inherently non-Gaussian noise signals, which are characterized by rapid random fluctuations, in order to offer a more accurate description of the stochastic dynamics of natural systems. In particular, environmental variations have been approximated using the archetypal pulse noise generator, a succession of rectangular pulses with the following characteristics: i) constant width; ii) heights distributed according to a probability function; and iii) occurrence times distributed according to a probability function. The effects of a pulse noise source represented as Poisson white noise on population dynamics have been investigated. In the presence of a noise source with diverse statistical features, ranging from sub- to super-Poisson processes, the stability criteria for the dynamics of termite populations have recently been explored in two separate cases: 1) negative-defined pulses; and 2) positive-defined pulses. The impact of noise correlations was assessed using a stochastic differential equation and a noise source described as a renewal process with appropriate statistics (Notaro et al., 2019).
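A pulse noise source of the kind described can be simulated directly. The sketch below generates rectangular pulses of constant width, with exponentially distributed heights and Poisson-distributed occurrence times (exponential waiting times between pulses); the pulse rate, width, and height distribution are illustrative choices, not values taken from the cited studies.

```python
# A minimal sketch of a Poisson pulse noise source: rectangular pulses of
# constant width, random heights, and Poisson occurrence times.
import numpy as np

rng = np.random.default_rng(7)

T, dt = 100.0, 0.01            # total time and time step
rate = 0.5                     # mean number of pulses per unit time
width = 0.5                    # constant pulse width
t = np.arange(0.0, T, dt)
noise = np.zeros_like(t)

# Poisson occurrence times: exponential waiting times between pulses.
arrival = rng.exponential(1.0 / rate)
while arrival < T:
    height = rng.exponential(1.0)                        # random pulse height
    noise[(t >= arrival) & (t < arrival + width)] += height
    arrival += rng.exponential(1.0 / rate)

print("mean of pulse noise:", round(noise.mean(), 3))
print("variance of pulse noise:", round(noise.var(), 3))
```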
9.12. ENVIRONMENTAL EXPOSURES AND HEALTH EFFECTS IN COLLECTION OF ENVIRONMENTAL STATISTICS

Environmental epidemiology is concerned with a wide range of diseases and health conditions. Mortality, including case fatality rates, is at the top of the list in terms of gravity; morbidity is measured in a variety of ways, including new cases of illness, worsening of pre-existing illness, episodes of hospitalization or need for medical services, the occurrence of accidents, impairment of function, or the production of symptoms. Other less obvious impacts are also important, such as physiological and biochemical reactions that are difficult to interpret in terms of long-term consequences, the development of irritation reactions, and the accumulation of potentially dangerous materials like lead and pesticides in the human body. Chemical or biological substances can be ingested, inhaled, or, on rarer occasions, absorbed through the skin, and the toxicity of a chemical can vary greatly depending on the route of exposure. Exposure locations and pathways vary widely. Exposure to radiation arises from normal background levels that vary to some extent, with significant contributions from therapeutic and diagnostic medical radiation as well as occupational and community exposures associated with nuclear power development. Similarly, in the case of pesticides, there is the potential for absorption through contact and for occupational exposure of workers (Bustos-Korts et al., 2016), as well as the unintentional exposure of the few individuals who live next to areas being decontaminated, deposition in food, and sometimes a food-chain gradient. The examples of environmental exposures are therefore many, and they frequently interact.
9.13. GENERAL LOGIC AND STRATEGY

According to Hopkins, Massey, and Goldsmith, who reviewed the statistical elements of air pollution medical studies, the concerns may be divided into three categories: (a) multivariate problems involving multiple factors influencing a single response, along with the corollary of multiple responses from a single exposure; (b) problems involving complex interrelationships in space and time (time-space series); and (c) problems best treated as systems problems, since they were so complicated that no single established method seemed suitable for them.
Aside from this, a further set of issues has been identified more recently. First, there is the issue of non-specificity: most environmental exposures do not cause reactions that are unique to those exposures, but they may alter the likelihood of other conditions (coronary heart disease) or responses (increased airway resistance). In general, it is prudent to presume that an environmental exposure will be linked to a specific, well-defined clinical entity or change in health status only in exceptional circumstances; it is better to expect that environmental exposures can affect health in a variety of ways. Second, there is the issue of agent interaction, both inside biological systems and in the environment. Carbon monoxide, oxides of nitrogen, and altitude, for example, can all impede oxygen transport, and oxygen transport may also be affected by the presence or absence of a respiratory illness. The interplay of cigarette smoking with radiation exposure creates a huge potentiation, as evidenced by the concomitant link of both exposures with lung cancer in uranium miners (Ng and Vecchi, 2020). Factors also interact within the environment itself: the complex and poorly understood photochemical reactions, the attachment of atmospheric radioisotopes to whatever particulate matter is present, and the role of particulate matter in carrying sulfur oxides into the lungs, which bears on long-term respiratory disease responses to particulate and sulfur oxide pollution. There are, in addition, many other environmental interactions, including processes that neutralize pollutants, remove them, or make their effects more damaging. A variety of biological processes also shape strategy in environmental statistics. They include, for example, adaptive processes. There may be significant toxicity when mice are exposed to a given amount of ozone for the first time, but if pre-exposure occurs at a lower level, the toxicity of the subsequent exposure is lowered by an unknown mechanism; it is not known whether this occurs in people. Physiologic processes of heat adaptation are extensively understood and researched, respiratory changes in response to inhaled irritants are well documented, and changes in red cell production rate and type following exposure to altitude, carbon monoxide, or bleeding are also well recognized. Noises that are unfamiliar or abrupt can cause vascular responses, whereas noises that are familiar or constant do not. Many similar reactions occur, and it is never clear whether an apparently adaptive reaction is also harmful in some way. In general, experienced experts in this field understand that
there is always a health cost associated with adaptation, although proving this may take years. Sensitization can also occur as a result of inhaled allergens or contact with plants like poison oak or poison ivy. The phenomenon of avoidance is another issue in environmental epidemiology. People may leave if residing in a given region causes them discomfort; as a result, if one investigates that group on a cross-sectional basis alone, the exceptionally vulnerable persons may not be found. Yet it is the exceptionally vulnerable individuals, if they can be identified in terms of age and medical condition, who are most relevant to air quality requirements, for example (Marohasy, 2003). The possibility of avoidance must therefore be considered, especially when working with irritating compounds or those that might cause allergic responses (Ferris). Competing risks have also been observed in prospective studies of mortality: people who are exposed to a dose that may be carcinogenic may not develop cancer because the time required for it to manifest is insufficient, and they die of other causes first. According to research, about 20 years of cigarette smoking are required to cause lung cancer, yet in a realistic population, 20 years of observation is not always possible. One should therefore not rule out the possibility that the phenomenon would occur if a young population were exposed and studied for a sufficient period, or if mortality from other "competing" causes were lower.
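The competing-risks problem described above can be made concrete with a small Monte Carlo sketch; the latency distribution, mortality rate, and follow-up period below are hypothetical values chosen only to illustrate how limited follow-up and competing mortality hide exposure-related cases.

```python
# Competing risks in a hypothetical prospective cohort (illustrative rates only).
# An exposure-related cancer needs a long latency, but death from other causes
# can intervene first, and a short follow-up ends before many cases appear.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
latency = 20.0 + rng.exponential(10.0, n)   # years until exposure-related cancer appears
other_death = rng.exponential(30.0, n)      # years until death from a competing cause
follow_up = 25.0                            # years of observation available

observed = np.sum((latency < other_death) & (latency <= follow_up))
eventual = np.sum(latency < other_death)    # cases that would appear with unlimited follow-up
print(f"cases seen within {follow_up:.0f} years of follow-up: {observed}")
print(f"cases occurring at any time before competing death: {eventual}")
```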
CHAPTER 10

FUTURE OF ENVIRONMENT STATISTICS
CONTENTS

10.1. Use of New Technologies.............................................................. 242
10.2. Technologies that Can Be Used in Environment Statistics: Predictive Analytics........................................................ 243
10.3. Changes in Utilization of Resources............................................... 245
With the increased need to preserve the environment, changes are expected in how environment statistics is used. Environment statistics is expected to grow across the world, and its future includes several emerging trends, outlined below.
10.1. USE OF NEW TECHNOLOGIES

The growing use of cutting-edge technology is expected to extend into environment statistics, including the adoption of recently developed technology when conducting analyses of environmental data. New technologies are often used by researchers in analyzing data, and this usually requires significant effort in data preparation before analysis can commence. Preparation includes cleaning, pre-processing, harmonizing, and integrating data from one or more sources and finally placing them in a computational environment in a form suitable for analysis. Environmental data are usually hosted in research infrastructures and data repositories (Madavan and Balaraman, 2016), which makes the data available to researchers. The challenge with most of these technologies is that they rarely offer a computational environment to facilitate data analysis. In most cases, published data are persistently identified, but most identifiers resolve to landing pages that must be navigated manually to determine data accessibility; this kind of navigation presents a challenge for machines handling the data and is one of the problems faced in environment statistics. A further trend in environment statistics is the use of big data analytics technologies. Such technology is usually a combination of several techniques and processing methods; by combining several techniques, it becomes effective and can therefore be used collectively by enterprises to obtain relevant results for strategic management and implementation. Despite the enthusiasm and ambitious investment aimed at leveraging the power of data to transform environmental work, success varies. Among the challenges reported by environmental organizations, as by other organizations, is forging a data-driven culture; few organizations record success in such projects, and big transformations in environment statistics take time (Chai et al., 2020). There is a growing desire by some organizations to become fully data-driven, but few have been able to actualize this goal.
When handling environmental data, there is a need for good technology and for proper understanding and management of the data.
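As a small illustration of the preparation steps described above (cleaning, harmonizing, and integrating data from multiple sources), the following sketch uses pandas on two invented monitoring tables; the station names, column labels, and unit conventions are assumptions, not references to any real repository.

```python
# Hypothetical harmonization of two air-quality tables with inconsistent
# column names, units, and a missing value (illustrative data only).
import pandas as pd

station_a = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02"],
    "pm25_ugm3": [12.0, None],               # missing reading to be handled
})
station_b = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-02"],
    "PM2.5 (mg/m3)": [0.009, 0.015],         # same quantity, different unit
})

# Harmonize: common column names, common units (ug/m3), parsed dates.
station_b = station_b.rename(columns={"Date": "date", "PM2.5 (mg/m3)": "pm25_ugm3"})
station_b["pm25_ugm3"] = station_b["pm25_ugm3"] * 1000.0   # mg/m3 -> ug/m3

combined = pd.concat([station_a.assign(source="A"), station_b.assign(source="B")],
                     ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"])
combined["pm25_ugm3"] = combined["pm25_ugm3"].interpolate()  # simple gap-filling
print(combined)
```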
10.2. TECHNOLOGIES THAT CAN BE USED IN ENVIRONMENT STATISTICS: PREDICTIVE ANALYTICS

Among the main reasons for conducting environment statistics is to make predictions about expected changes in the environment and the global climate, which makes predictive analytics very important. Predictive analytics refers to hardware and software solutions used in the discovery, evaluation, and development of predictive scenarios by processing environmental data. The analysis of environmental data is used to make predictions that governments can use to put preventive measures in place and reduce the impacts of anticipated change (Kobayashi and Ye, 2014). Various predictive analytics technologies have been invented and improved to help in this process, as listed below; a minimal forecasting sketch follows the list.
•	Environmental Databases: Various databases are utilized for reliable and efficient data management across a scalable number of storage nodes. Such databases can store data as relational tables or as key-value pairings. More databases may be created in the future to store larger amounts of data, as environmental data is large by nature.
•	Knowledge Discovery Tools: These tools allow researchers to mine big environmental data, whether structured or unstructured. The mined data is then stored in multiple sources, which can be different file systems or similar platforms. Search and knowledge discovery tools allow researchers to isolate and utilize this information to their benefit.
•	Stream Analytics: In some cases, environmental data held by an organization are stored on multiple platforms and in multiple formats. This is where stream analytics software becomes useful: it enables the filtering, aggregation, and analysis of large data sets (Khormi and Kumar, 2011), as well as connection to external data sources and their integration into the application flow.
•	In-Memory Data Fabric: A technology useful for distributing large quantities of data across system resources such as solid-state storage devices, flash storage, and dynamic RAM. It enables low-latency access and processing of big data on the connected nodes.
•	Distributed Storage: This technology was developed to cope with node failures and the corruption or loss of big data sources. In most cases, the distributed file stores carry replicated data, and data are sometimes replicated for low-latency, quick access on large computer networks. Such storage usually takes the form of non-relational databases.
•	Data Virtualization: This kind of technology allows applications to retrieve data without being constrained by technical restrictions such as data formats or the physical location of the data. It is used with Apache Hadoop and other distributed data stores for near real-time and real-time access to data stored on various platforms. In environmental statistics, data virtualization is one of the most used big data technologies.
•	Data Integration: Among the challenges faced by environment statisticians is the handling of big data, which is usually processed in terabytes or petabytes and then made available to researchers and other relevant individuals. Data integration tools enable the streamlining of data across a number of big data solutions (Köhl et al., 2000).
•	Data Processing: Various software solutions are used to manipulate data into a consistent format that can be used in subsequent analysis. Data preparation tools accelerate data sharing by formatting and cleansing unstructured data sets. Data pre-processing remains limited because, in most cases, pre-processing tasks cannot be automated and require human oversight that is usually tedious and time-consuming.
•	Data Quality: Data quality is an important parameter in big data processing for environment statistics. Data quality software is used to cleanse and enrich large data sets, which is achieved through parallel processing, and it is widely used to obtain consistent and reliable outputs from big data processing.
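The forecasting sketch referred to above is deliberately minimal: it fits a linear trend to an invented series of annual observations and projects it forward. The values are synthetic, not real measurements, and a real predictive-analytics workflow would use far richer models and validated data.

```python
# Fitting and projecting a simple linear trend (synthetic data, illustration only).
import numpy as np

rng = np.random.default_rng(4)
years = np.arange(2000, 2021)
co2_ppm = 369.0 + 2.1 * (years - 2000) + rng.normal(0.0, 0.4, years.size)  # invented series

slope, intercept = np.polyfit(years, co2_ppm, 1)   # least-squares linear trend
forecast_2050 = slope * 2050 + intercept
print(f"fitted trend: {slope:.2f} ppm/year; projected 2050 level: {forecast_2050:.0f} ppm")
```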
Generally, recent trends in environment statistics allow the use of various technologies: big data is used to improve operational efficiency, and the ability to make informed decisions based on the very latest, up-to-the-moment information has rapidly become the mainstream norm. For this reason, there is no doubt that big data will continue to play an important role in different industries across the world and is expected to revolutionize environment statistics (Kessler et al., 2015). To utilize this technology effectively, it is important to ensure that all researchers are well equipped to use the newly developed tools, so that these technologies allow proper management of big data.
10.3. CHANGES IN UTILIZATION OF RESOURCES

In the future, environment statistics is expected to grow to greater levels, and this will affect the manner in which other resources are utilized. Several predictions are already being made using environment statistics. For instance, without any new policy action, the world economy is expected to cause an increase in the utilization of energy, and by 2050 the global energy mix is not expected to differ significantly from the current situation: fossil fuels are expected to account for about 85% of energy use, renewables such as biofuels for about 10%, and nuclear energy for the balance. Various companies are expected to become major energy users, which is expected to increase reliance on fossil fuels and will help governments feed a growing population with changing dietary preferences. Environment statistics has predicted a global expansion of agricultural land in the next decade to match the increase in food demand, although at a diminishing rate, and a substantial increase in competition for scarce land is expected in the coming decades. Data from environment statistics predict that, in the absence of new policies, global greenhouse gas emissions will increase by about 50% by 2050, attributable mainly to a 70% growth in energy-related CO2, and that atmospheric concentrations of GHGs will reach 685 parts per million by 2050. For this reason, the global average temperature is projected to be 3°C to 6°C above pre-industrial levels by the end of the century, exceeding the internationally agreed goal of limiting the increase to 2°C.
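To put the projected 50% rise in emissions by 2050 into per-year terms, the short sketch below computes the compound annual growth rate implied by that figure; the baseline year and the 40-year window are assumptions made only for this illustration.

```python
# Compound annual growth rate implied by a 50% total rise over an assumed
# 40-year window (illustrative arithmetic, not an official projection method).
base_year, target_year = 2010, 2050
total_growth = 1.50                              # +50% total GHG emissions by 2050
years = target_year - base_year
annual_rate = total_growth ** (1.0 / years) - 1.0
print(f"implied growth: {annual_rate:.2%} per year over {years} years")
```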
The GHG mitigation actions pledged by countries in the Cancún Agreements, made at the United Nations Climate Change Conference, will not be sufficient to prevent the global average temperature from exceeding the 2°C threshold unless rapid and costly reductions are realized after 2020. If efforts are made in response to the predictions released by environment statisticians, there will be several economic and environmental benefits. Current outlook data suggest that global carbon pricing sufficient to lower GHG emissions by nearly 70% by 2050, compared with the baseline scenario, and to limit GHG concentrations to 450 ppm would slow economic growth by about 0.2 percentage points per year on average, costing roughly 5.5% of global GDP in 2050. This pales alongside the potential cost of inaction on climate change, which by some estimates could be as high as 14% of average world consumption per capita. Carbon pricing can also increase revenues: if the emission reduction pledges made by industrialized countries under the Cancún Agreements were implemented through carbon taxes or cap-and-trade schemes with fully auctioned permits, the fiscal revenues would amount to over 0.6% of their GDP, for instance more than USD 250 billion. Studies also show that delayed action will be very costly for the state: any delay, or the adoption of only moderate actions up to a given year, would increase the scale and pace of the efforts needed after that year (Christensen and Himme, 2017). Moderate actions include implementing the Cancún pledges only, or stalling projects while better technologies are expected to enter the market; this is expected to cause a 50% increase in costs in 2050 compared with timely action and could potentially lead to higher environmental risks. The adoption of changes also leads to the reform of fossil fuel subsidies: support for fossil fuel production and use is expected to amount to between USD 45 and 75 billion a year in OECD countries, while developing and emerging economies provide over USD 400 billion in fossil fuel consumer subsidies. In the future, it is expected that more governments will become reliant on the data and information provided by environmentalists through environment statistics. With the increased recognition that the environment is very valuable, more people will need to be sensitive to how they deal with it. This means that more governments will invest heavily in purchasing the new technologies used by environment statisticians, and may also involve the government investing heavily in research projects done
by most environmentalists. Governments can fully or partially support these research projects, and they may also build institutions to be used by researchers during their investigations. It is also expected that some countries will invest in training more people in environment statistics, for example by sponsoring students to study at various institutions. Having invested heavily in environment statistics, governments should also ensure that the data generated are actually used to achieve the environmental goals that have been put in place. All the actions put in place by a government go a long way toward ensuring environmental protection. Governments should also make use of the public (Kandlikar et al., 2018), which can be involved in various ways in ensuring that environmental goals are achieved. The future of environment statistics is very promising.
BIBLIOGRAPHY
1. Adelman, D. E., (2003). Scientific activism and restraint: The interplay of statistics, judgment, and procedure in environmental law. Notre Dame L. Rev., 79, 497.
2. Antweiler, R. C., & Taylor, H. E., (2008). Evaluation of statistical treatments of left-censored environmental data using coincident uncensored data sets: I. Summary statistics. Environmental Science & Technology, 42(10), 3732–3738.
3. Austin, S. B., Melly, S. J., Sanchez, B. N., Patel, A., Buka, S., & Gortmaker, S. L., (2005). Clustering of fast-food restaurants around schools: A novel application of spatial statistics to the study of food environments. American Journal of Public Health, 95(9), 1575–1581.
4. Bankole, P. O., & Surajudeen, M. A., (2008). Major environmental issues and the need for environmental statistics and indicators in Nigeria. Terrain, 280, 320C.
5. Browning, M., Behrens, T. E., Jocham, G., O'reilly, J. X., & Bishop, S. J., (2015). Anxious individuals have difficulty learning the causal statistics of aversive environments. Nature Neuroscience, 18(4), 590–596.
6. Bustos-Korts, D., Malosetti, M., Chapman, S., & Eeuwijk, F. V., (2016). Modeling of genotype by environment interaction and prediction of complex traits across multiple environments as a synthesis of crop growth modeling, genetics, and statistics. In: Crop Systems Biology (pp. 55–82). Springer, Cham.
7. Chai, W., Leira, B. J., Naess, A., Høyland, K., & Ehlers, S., (2020). Development of environmental contours for first-year ice ridge statistics. Structural Safety, 87, 101996.
8. Christensen, B., & Himme, A., (2017). Improving environmental management accounting: How to use statistics to better determine energy consumption. Journal of Management Control, 28(2), 227–243.
9. Dutilleul, P., Stockwell, J. D., Frigon, D., & Legendre, P., (2000). The mantel test versus Pearson's correlation analysis: Assessment of the differences for biological and environmental studies. Journal of Agricultural, Biological, and Environmental Statistics, 131–150.
10. Farrell, S., Ludwig, C. J., Ellis, L. A., & Gilchrist, I. D., (2010). Influence of environmental statistics on inhibition of saccadic return. Proceedings of the National Academy of Sciences, 107(2), 929–934.
11. Fourcade, Y., Besnard, A. G., & Secondi, J., (2018). Paintings predict the distribution of species, or the challenge of selecting environmental predictors and evaluation statistics. Global Ecology and Biogeography, 27(2), 245–256.
12. Frankenhuis, W. E., Nettle, D., & Dall, S. R., (2019). A case for environmental statistics of early-life effects. Philosophical Transactions of the Royal Society B, 374(1770), 20180110.
13. Fuentes, M., Chaudhuri, A., & Holland, D. M., (2007). Bayesian entropy for spatial sampling design of environmental data. Environmental and Ecological Statistics, 14(3), 323–340.
14. Gallyamov, M. O., Tartsch, B., Mela, P., Potemkin, I. I., Sheiko, S. S., Börner, H., & Möller, M., (2007). Vapor-induced spreading dynamics of adsorbed linear and brush-like macromolecules as observed by environmental SFM: Polymer chain statistics and scaling exponents. Journal of Polymer Science Part B: Polymer Physics, 45(17), 2368–2379.
15. Giraldo, R., Delicado, P., & Mateu, J., (2010). Continuous time-varying kriging for spatial prediction of functional data: An environmental application. Journal of Agricultural, Biological, and Environmental Statistics, 15(1), 66–82.
16. Girshick, A. R., Landy, M. S., & Simoncelli, E. P., (2011). Cardinal rules: Visual orientation perception reflects knowledge of environmental statistics. Nature Neuroscience, 14(7), 926–932.
17. Gotelli, N. J., Ellison, A. M., & Ballif, B. A., (2012). Environmental proteomics, biodiversity statistics and food-web structure. Trends in Ecology & Evolution, 27(8), 436–442. 18. Han, F., & Li, J., (2020). Assessing impacts and determinants of China’s environmental protection tax on improving air quality at provincial level based on Bayesian statistics. Journal of Environmental Management, 271, 111017. 19. Higuchi, Y., & Inoue, K. T., (2019). Environmental effects on halo abundance and weak lensing peak statistics towards large underdense regions. Monthly Notices of the Royal Astronomical Society, 488(4), 5811–5822. 20. Homma, N. Y., Hullett, P. W., Atencio, C. A., & Schreiner, C. E., (2020). Auditory cortical plasticity dependent on environmental noise statistics. Cell Reports, 30(13), 4445–4458. 21. Iwai, K., Mizuno, S., Miyasaka, Y., & Mori, T., (2005). Correlation between suspended particles in the environmental air and causes of disease among inhabitants: Cross-sectional studies using the vital statistics and air pollution data in Japan. Environmental Research, 99(1), 106–117. 22. John, K., Afu, S. M., Isong, I. A., Aki, E. E., Kebonye, N. M., Ayito, E. O., & Penížek, V., (2021). Mapping soil properties with soilenvironmental covariates using geostatistics and multivariate statistics. International Journal of Environmental Science and Technology, 18(11), 3327–3342. 23. Kandlikar, G. S., Gold, Z. J., Cowen, M. C., Meyer, R. S., Freise, A. C., Kraft, N. J., & Curd, E. E., (2018). Ranacapa: An R package and shiny web app to explore environmental DNA data with exploratory statistics and interactive visualizations. F1000Research, 7. 24. Kessler, D., Suweis, S., Formentin, M., & Shnerb, N. M., (2015). Neutral dynamics with environmental noise: Age-size statistics and species lifetimes. Physical Review E, 92(2), 022722. 25. Khormi, H. M., & Kumar, L., (2011). Examples of using spatial information technologies for mapping and modeling mosquito-borne diseases based on environmental, climatic, socioeconomic factors and different spatial statistics, temporal risk indices and spatial analysis: A review. J. Food Agr. Environ., 9, 41–49.
26. Kobayashi, T., & Ye, J., (2014). Acoustic feature extraction by statistics based local binary pattern for environmental sound classification. In: 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. 3052–3056). IEEE. 27. Köhl, M., Traub, B., & Päivinen, R., (2000). Harmonization and standardization in multi-national environmental statistics – mission impossible? Environmental Monitoring and Assessment, 63(2), 361– 380. 28. Madavan, R., & Balaraman, S., (2016). Failure analysis of transformer liquid—Solid insulation system under selective environmental conditions using Weibull statistics method. Engineering Failure Analysis, 65, 26–38. 29. Marohasy, J., (2003). How useful are Australia’s official environmental statistics? IPA Review, 55(4), 8–10. 30. Ng, C. H. J., & Vecchi, G. A., (2020). Large-scale environmental controls on the seasonal statistics of rapidly intensifying North Atlantic tropical cyclones. Climate Dynamics, 54(9), 3907–3925. 31. Notaro, G., Van, Z. W., Altman, M., Melcher, D., & Hasson, U., (2019). Predictions as a window into learning: Anticipatory fixation offsets carry more information about environmental statistics than reactive stimulus-responses. Journal of Vision, 19(2), 5–8. 32. Parise, C. V., Knorre, K., & Ernst, M. O., (2014). Natural auditory scene statistics shapes human spatial hearing. Proceedings of the National Academy of Sciences, 111(16), 6104–6108. 33. Patil, G. P., & Taillie, C., (2003). Geographic and network surveillance via scan statistics for critical area detection. Statistical Science, 18(4), 457–465. 34. Piegorsch, W. W., & Edwards, D., (2002). What shall we teach in environmental statistics? Environmental and Ecological Statistics, 9(2), 125–150. 35. Pleil, J. D., Sobus, J. R., Stiegel, M. A., Hu, D., Oliver, K. D., Olenick, C., & Funk, W. E., (2014). Estimating common parameters of lognormally distributed environmental and biomonitoring data: Harmonizing disparate statistics from publications. Journal of Toxicology and Environmental Health, Part B, 17(6), 341–368.
36. Rovetta, A., & Castaldo, L., (2020). Relationships between demographic, geographic, and environmental statistics and the spread of novel coronavirus disease (COVID-19) in Italy. Cureus, 12(11). 37. Seid, A., Gadisa, E., Tsegaw, T., Abera, A., Teshome, A., Mulugeta, A., & Aseffa, A., (2014). Risk map for cutaneous leishmaniasis in Ethiopia based on environmental factors as revealed by geographical information systems and statistics. Geospatial Health, 8(2), 377–387. 38. Simoncelli, E. P., & Olshausen, B. A., (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24(1), 1193–1216. 39. Sivak, D. A., & Thomson, M., (2014). Environmental statistics and optimal regulation. PLoS Computational Biology, 10(9), e1003826. 40. Toivonen, H. T., Mannila, H., Korhola, A., & Olander, H., (2001). Applying Bayesian statistics to organism‐based environmental reconstruction. Ecological Applications, 11(2), 618–630. 41. Wagner, H. H., & Fortin, M. J., (2005). Spatial analysis of landscapes: Concepts and statistics. Ecology, 86(8), 1975–1987. 42. Wieland, R., Mirschel, W., Zbell, B., Groth, K., Pechenick, A., & Fukuda, K., (2010). A new library to combine artificial neural networks and support vector machines with statistics and a database engine for application in environmental modeling. Environmental Modeling & Software, 25(4), 412–420. 43. Xia, X., Yang, Z., Yu, T., Hou, Q., & Mutelo, A. M., (2017). Detecting changes of soil environmental parameters by statistics and GIS: A case from the lower Changjiang plain, China. Journal of Geochemical Exploration, 181, 116–128. 44. Ying, A. N., & Sheng-Cai, L. I., (2012). Statistics of environmental events in China during the period from March to April in 2012. Journal of Safety and Environment, 12(3), 269–272. 45. Zio, S. D., Fontanella, L., & Ippoliti, L., (2004). Optimal spatial sampling schemes for environmental surveys. Environmental and Ecological Statistics, 11(4), 397–414.
INDEX
A Active chemical environmental sampling 138 administrative records 29, 30 Airborne contaminants 138 air quality 65, 90 Air Sampling 137 American Statistical Association (ASA) 2 application programming interfaces (APIs) 42 atmosphere 36, 38, 44 Attribute query 204 B bacteria 136, 143, 148, 154 best linear unbiased predictor (BLUP) 231 Big Data 39, 40 C carbon dioxide 36, 38 Cartography 198 Census Data Sources 100
central tendency 9, 13, 15, 16 Chemical sampling 137 civilization 64 climate computing 40 closed information systems 35 cluster sampling 18 cognitive abilities 8 Completely Randomized Design (CRD) 101 Complex spatiotemporal 226 computational environment 242 computer code 8 computer networks 44 computer programming languages 156 computer system 156, 185 computer technology 35 conservation 65, 94 Customer information 98 customer satisfaction 98 cutting-edge technology 242 D data analysis 37, 44, 54 database 98, 114, 129, 131, 132 database architecture 157, 163, 172, 191, 192
database management system (DBMS) 160 data collection 64, 65, 89, 90 Data models 156, 159, 174 data source 98, 99, 101, 127, 131, 132 data structure 156, 157, 159, 163, 170, 184, 185 descriptive statistics 9, 10, 17 Descriptive statistics approaches 9 digital films 159 digital images 159 digital music 159 dimethyl phenyl sulfone (DMPS) 141 diphenylurea (DPU) 141 Direct contact sampling 137 distance relations analysis 197 Dunaliella salina microalgae 235 E ecological complex systems 226, 227 Ecological model 26 economic statistics 25 educational settings 3 electrical conductivity (EC) 141 electrostatic precipitator (ESP) 134 email addresses 157, 166 entity relationship diagram (ERD) 165 Entity-relationship (E-R) model 164 Entity relationships (E-R) 160 environmental contaminants 134 environmental data 25, 28, 30 Environmental epidemiology 237 Environmental intelligence 40 Environmental media-based framework 26
environmental monitoring 39 environmental protection 247 environmental protection agency (EPA) 139 Environmental sampling 134, 135, 136, 142, 154 environmental science 24 environmental standards 65 environmental statistics 24, 25, 26, 28 Exploratory data analysis (EDA) 15 External data sources 101 F Factorial Designs (FD) 102 financial data 157 fisheries 65, 66 flash storage 244 fluid circulation system 98 food 65 forestry 65 Framework for Development of Environmental Statistics (FDES) 25 G geographic information science (GISc) 199 geographic information systems (GIS) 199 geometric approach 197 geostatistical analysis 195, 197 global warming 64 grade point average (GPA) 10 graph 29 greenhouse gases 64 Ground-based stations 226
H hazardous materials 134, 135, 139 Hierarchical model 164 humidity 136 HVAC (heating, ventilation, and air conditioning) 41 Hydrological traces 140 hypothesis testing 7, 18, 19 I Indirect Contact Sampling 137 inferential statistics 17, 18, 19, 65, 67 information framework 157 Internal data sources 100 L Latin Square Design (LSD) 102 Linear regression 20 logistics 98, 116 London Special Service (LSS) 7 M machine learning (ML) 21 management 65, 88, 92, 95 mathematics 6, 7 matrix 36 methylene blue 141 multivariate descriptive statistics 9 Multivariate statistical data 197 N National Oceanic and Atmospheric Administration (NOAA) 43 National Weather Service (NWS) 46 natural occurrences 30
network analysis 197 Network model 164 noise 65 non-governmental organizations (NGOs) 29 P Passive chemical environmental sampling 137 phone numbers 157, 166 physical data models (PDM) 160 Physical sampling 137 plants 134, 135, 138, 152 point distribution analysis 197 pollutant emissions 30 pollution 42, 48, 56, 57, 58, 59, 60 Population dynamics 234 Predictive analytics 243 pressure regulation system 98 Q qualitative data 100 quality control program 136 quantitative data 100 R radiation 65 radiation detector 134 Randomized Block Design (RBD) 102 regression analysis 18, 20, 21, 22 Regression modeling 195 relational database management systems (RDBMSs) 98 Relational model 164 remote-sensing 226 Resource accounting model 26
S sales 98, 100 sampling 65, 66, 81, 82, 88 satellites 34, 44, 46, 60 Sediment 136 semi-volatile organic compounds (SVOCs) 140 sensor networks equipment 226 sensors 34, 44, 46, 56, 59, 60 shape analysis 197 simple random sampling 18 Smell 139 socioeconomic data 30 software program 98 soil conditions 65 Soil sampling 135 solid-phase extraction (SPE) 138 solid-state storage devices 244 Spatial Analysis 196 spatial correlation analysis 197 Spatial query 204 Spatiotemporal (ST) 228 state policy 5 statistical data analysis 4 Statistical data sources 100 statistical education 3 statistical graphics 6
Statistical physics 226 statistical research 8 Statistical surveys 30 statistical theory 2 stochastic signals 226 stratified sampling 18 Stream gaging 143, 144, 145 Stress-response model 26 Surface trend analysis 195 systematic sampling 18 T total annual precipitation (TAP) 46 U United Nations Statistics Division (UNSD) 64 univariate descriptive statistics 9 V Visualization 197, 219, 222 volatile organic compounds (VOCs) 138 W water quality 65, 88 Water Sampling 137