Applied Data Analysis and Modeling for Energy Engineers and Scientists [2 ed.] 3031348680, 9783031348686

Now in a thoroughly revised and expanded second edition, this classroom-tested text demonstrates and illustrates how to …


English Pages 630 [622] Year 2023



Table of contents:
Preface (Second Edition)
Acknowledgments
Preface (First Edition)
A Third Need in Engineering Education
Intent
Approach and Scope
Assumed Background of Reader
Acknowledgements
Contents
1: Mathematical Models and Data Analysis
1.1 Forward and Inverse Approaches
1.1.1 Preamble
1.1.2 The Energy Problem and Importance of Buildings
1.1.3 Forward or Simulation Approach
1.1.4 Inverse or Data Analysis Approach
1.1.5 Discussion of Both Approaches
1.2 System Models
1.2.1 What Is a System Model?
1.2.2 Types of Models
1.3 Types of Data
1.3.1 Classification
1.3.2 Types of Uncertainty in Data
1.4 Mathematical Models
1.4.1 Basic Terminology
1.4.2 Block Diagrams
1.4.3 Mathematical Representation
1.4.4 Classification
1.4.5 Steady-State and Dynamic Models
1.5 Mathematical Modeling Approaches
1.5.1 Broad Categorization
1.5.2 Simulation or Forward Modeling
1.5.3 Inverse Modeling
1.5.4 Calibrated Simulation
1.6 Data Analytic Approaches
1.6.1 Data Mining or Knowledge Discovery
1.6.2 Machine Learning or Algorithmic Models
1.6.3 Introduction to Big Data
1.7 Data Analysis
1.7.1 Introduction
1.7.2 Basic Stages
1.7.3 Example of a Data Collection and Analysis System
1.8 Topics Covered in Book
Problems
References
2: Probability Concepts and Probability Distributions
2.1 Introduction
2.1.1 Classical Concept of Probability
2.1.2 Bayesian Viewpoint of Probability
2.1.3 Distinction Between Probability and Statistics
2.2 Classical Probability
2.2.1 Basic Terminology
2.2.2 Basic Set Theory Notation and Axioms of Probability
2.2.3 Axioms of Probability
2.2.4 Joint, Marginal, and Conditional Probabilities
2.2.5 Permutations and Combinations
2.3 Probability Distribution Functions
2.3.1 Density Functions
2.3.2 Expectations and Moments
2.3.3 Function of Random Variables
2.3.4 Chebyshev's Theorem
2.4 Important Probability Distributions
2.4.1 Background
2.4.2 Distributions for Discrete Variables
2.4.3 Distributions for Continuous Variables
2.5 Bayesian Probability
2.5.1 Bayes' Theorem
2.5.2 Application to Discrete Probability Variables
2.5.3 Application to Continuous Probability Variables
2.6 Three Kinds of Probabilities
Problems
References
3: Data Collection and Preliminary Analysis
3.1 Sensors and Their Characteristics
3.2 Data Collection Systems
3.2.1 Generalized Measurement System
3.2.2 Types and Categories of Measurements
3.2.3 Data Recording Systems
3.3 Raw Data Validation and Preparation
3.3.1 Definitions
3.3.2 Limit Checks
3.3.3 Consistency Checks Involving Conservation Balances
3.3.4 Outlier Rejection by Visual Means
3.3.5 Handling Missing Data
3.4 Statistical Measures of Sample Data
3.4.1 Summary Descriptive Measures
3.4.2 Covariance and Pearson Correlation Coefficient
3.5 Exploratory Data Analysis (EDA)
3.5.1 What Is EDA?
3.5.2 Purpose of Data Visualization
3.5.3 Static Univariate Graphical Plots
3.5.4 Static Bi- and Multivariate Graphical Plots
3.5.5 Interactive and Dynamic Graphics
3.5.6 Basic Data Transformations
3.6 Overall Measurement Uncertainty
3.6.1 Need for Uncertainty Analysis
3.6.2 Basic Uncertainty Concepts: Random and Bias Errors
3.6.3 Random Uncertainty
3.6.4 Bias Uncertainty
3.6.5 Overall Uncertainty
3.6.6 Chauvenet's Statistical Criterion of Data Rejection
3.7 Propagation of Errors
3.7.1 Taylor Series Method for Cross-Sectional Data
3.7.2 Monte Carlo Method for Error Propagation Problems
3.8 Planning a Non-Intrusive Field Experiment
Problems
References
4: Making Statistical Inferences from Samples
4.1 Introduction
4.2 Basic Univariate Inferential Statistics
4.2.1 Sampling Distribution and Confidence Interval of the Mean
4.2.2 Hypothesis Test for Single Sample Mean
4.2.3 Two Independent Sample and Paired Difference Tests on Means
4.2.4 Single and Two Sample Tests for Proportions
4.2.5 Single and Two Sample Tests of Variance
4.2.6 Tests for Distributions
4.2.7 Test on the Pearson Correlation Coefficient
4.3 ANOVA Test for Multi-Samples
4.3.1 Single-Factor ANOVA
4.3.2 Tukey's Multiple Comparison Test
4.4 Tests of Significance of Multivariate Data
4.4.1 Introduction to Multivariate Methods
4.4.2 Hotelling T² Test
4.5 Non-Parametric Tests
4.5.1 Signed and Rank Tests for Medians
4.5.2 Kruskal-Wallis Multiple Samples Test for Medians
4.5.3 Test on Spearman Rank Correlation Coefficient
4.6 Bayesian Inferences
4.6.1 Background
4.6.2 Estimating Population Parameter from a Sample
4.6.3 Hypothesis Testing
4.7 Some Considerations About Sampling
4.7.1 Random and Non-Random Sampling Methods
4.7.2 Desirable Properties of Estimators
4.7.3 Determining Sample Size During Random Surveys
4.7.4 Stratified Sampling for Variance Reduction
4.8 Resampling Methods
4.8.1 Basic Concept
4.8.2 Application to Probability Problems
4.8.3 Different Methods of Resampling
4.8.4 Application of Bootstrap to Statistical Inference Problems
4.8.5 Closing Remarks
Problems
References
5: Linear Regression Analysis Using Least Squares
5.1 Introduction
5.2 Regression Analysis
5.2.1 Objective of Regression Analysis
5.2.2 Ordinary Least Squares
5.3 Simple OLS Regression
5.3.1 Estimation of Model Parameters
5.3.2 Statistical Criteria for Model Evaluation
5.3.3 Inferences on Regression Coefficients and Model Significance
5.3.4 Model Prediction Uncertainty
5.4 Multiple OLS Regression
5.4.1 Higher Order Linear Models
5.4.2 Matrix Formulation
5.4.3 Point and Interval Estimation
5.4.4 Beta Coefficients and Elasticity
5.4.5 Partial Correlation Coefficients
5.4.6 Assuring Model Parsimony: Stepwise Regression
5.5 Applicability of OLS Parameter Estimation
5.5.1 Assumptions
5.5.2 Sources of Errors During Regression
5.6 Model Residual Analysis and Regularization
5.6.1 Detection of Ill-Conditioned Behavior
5.6.2 Leverage and Influence Data Points
5.6.3 Remedies for Nonuniform Residuals
5.6.4 Serially Correlated Residuals
5.6.5 Dealing with Misspecified Models
5.7 Other Useful OLS Regression Models
5.7.1 Zero-Intercept Models
5.7.2 Indicator Variables for Local Piecewise Models: Linear Splines
5.7.3 Indicator Variables for Categorical Regressor Models
5.8 Resampling Methods Applied to Regression
5.8.1 Basic Approach
5.8.2 Jackknife and k-Fold Cross-Validation
5.8.3 Bootstrap Method
5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance
5.10 Parting Comments on Regression Analysis and OLS
Problems
References
6: Design of Physical and Simulation Experiments
6.1 Introduction
6.1.1 Types of Data Collection
6.1.2 Purpose of DOE
6.1.3 DOE Terminology
6.2 Overview of Different Statistical Methods
6.2.1 Different Types of ANOVA Tests
6.2.2 Link Between ANOVA and Regression
6.2.3 Recap of Basic Model Functional Forms
6.3 Basic Concepts
6.3.1 Levels, Discretization, and Experimental Combinations
6.3.2 Blocking
6.3.3 Unrestricted and Restricted Randomization
6.4 Factorial Designs
6.4.1 Full Factorial Design
6.4.2 2k Factorial Designs
6.4.3 Concept of Orthogonality
6.4.4 Fractional Factorial Designs
6.5 Block Designs
6.5.1 Complete Block Design
6.5.2 Latin Squares
6.6 Response Surface Designs
6.6.1 Applications
6.6.2 Methodology
6.6.3 First- and Second-Order Models
6.6.4 Central Composite Design and the Concept of Rotation
6.7 Simulation Experiments
6.7.1 Background
6.7.2 Similarities and Differences Between Physical and Simulation Experiments
6.7.3 Monte Carlo and Allied Sampling Methods
6.7.4 Sensitivity Analysis for Screening
6.7.5 Surrogate Modeling
6.7.6 Summary
Problems
References
7: Optimization Methods
7.1 Introduction
7.1.1 What Is Optimization?
7.1.2 Simple Example
7.2 Terminology and Classification
7.2.1 Definition of Terms
7.2.2 Categorization of Methods
7.2.3 Types of Objective Functions and Constraints
7.2.4 Sensitivity Analysis and Post-Optimality Analysis
7.3 Analytical Methods
7.3.1 Unconstrained Problems
7.3.2 Direct Substitution Method for Equality Constrained Problems
7.3.3 Lagrange Multiplier Method for Equality Constrained Problems
7.3.4 Problems with Inequality Constraints
7.3.5 Penalty Function Method
7.4 Numerical Unconstrained Search Methods
7.4.1 Univariate Methods
7.4.2 Multivariate Methods
7.5 Linear Programming (LP)
7.5.1 Standard Form
7.5.2 Example of a LP Problem
7.5.3 Linear Network Models
7.5.4 Example of Maximizing Flow in a Transportation Network
7.5.5 Mixed Integer Linear Programming (MILP)
7.5.6 Example of Reliability Analysis of a Power Network
7.6 Nonlinear Programming
7.6.1 Standard Form
7.6.2 Quadratic Programming
7.6.3 Popular Numerical Multivariate Search Algorithms
7.7 Illustrative Example: Integrated Energy System (IES) for a Campus
7.8 Introduction to Global Optimization
7.9 Examples of Dynamic Programming
Problems
References
8: Analysis of Time Series Data
8.1 Basic Concepts
8.1.1 Introduction
8.1.2 Terminology
8.1.3 Basic Behavior Patterns
8.1.4 Illustrative Data Set
8.2 General Model Formulations
8.3 Smoothing Methods
8.3.1 Arithmetic Moving Average (AMA)
8.3.2 Exponentially Weighted Moving Average (EWMA)
8.3.3 Determining Structure by Cross-Validation
8.4 OLS Regression Models
8.4.1 Trend Modeling
8.4.2 Trend and Seasonal Models
8.4.3 Forecast Intervals
8.4.4 Fourier Series Models for Periodic Behavior
8.4.5 Interrupted Time Series
8.4.5.1 Abrupt One-Time Change in Time
8.4.5.2 Gradual Change Over Time
8.5 Stochastic Time Series Models
8.5.1 Introduction
8.5.2 ACF, PACF, and Data Detrending
8.5.2.1 Autocorrelation Function (ACF)
8.5.2.2 Partial Autocorrelation Function (PACF)
8.5.2.3 Detrending Data by Differencing
8.5.3 ARIMA Class of Models
8.5.3.1 Overview
8.5.3.2 ARMA Models
8.5.3.3 MA Models
8.5.3.4 AR Models
8.5.3.5 Identification and Forecasting
8.5.4 Recommendations on Model Identification
8.6 ARMAX or Transfer Function Models
8.6.1 Conceptual Approach and Benefit
8.6.2 Transfer Function Modeling of Linear Dynamic Systems
8.7 Quality Control and Process Monitoring Using Control Chart Methods
8.7.1 Background and Approach
8.7.2 Shewhart Control Charts for Variables
8.7.2.1 Mean Charts
8.7.2.2 Range Charts
8.7.3 Shewhart Control Charts for Attributes
8.7.4 Practical Implementation Issues of Control Charts
8.7.5 Time-Weighted Monitoring
8.7.5.1 Cusum Charts
8.7.5.2 EWMA Process
8.7.6 Concluding Remarks
Problems
References
9: Parametric and Non-Parametric Regression Methods
9.1 Introduction
9.2 Important Concepts in Parameter Estimation
9.2.1 Structural Identifiability
9.2.2 Ill-Conditioning
9.2.3 Numerical Identifiability
9.3 Dealing with Collinear Regressors: Variable Selection and Shrinkage
9.3.1 Problematic Issues
9.3.2 Principal Component Analysis and Regression
9.3.3 Ridge and Lasso Regression
9.3.4 Chiller Case Study Involving Collinear Regressors
9.3.5 Other Multivariate Methods
9.4 Going Beyond OLS
9.4.1 Background
9.4.2 Maximum Likelihood Estimation (MLE)
9.4.3 Generalized Linear Models (GLM)
9.4.4 Box-Cox Transformation
9.4.5 Logistic Functions
9.4.6 Error in Variables (EIV) and Corrected Least Squares
9.5 Non-Linear Parametric Regression
9.5.1 Detecting Non-Linear Correlation
9.5.2 Different Non-Linear Search Methods
9.5.3 Overview of Various Parametric Regression Methods
9.6 Non-Parametric Regression
9.6.1 Background
9.6.2 Extensions to Linear Models
9.6.3 Basis Functions
9.6.4 Polynomial Regression and Smoothing Splines
9.7 Local Regression: LOWESS Smoothing Method
9.8 Neural Networks: Multi-Layer Perceptron (MLP)
9.9 Robust Regression
Problems
References
10: Inverse Methods for Mechanistic Models
10.1 Fundamental Concepts
10.1.1 Applicability
10.1.2 Approaches and Their Characteristics
10.1.3 Mechanistic Models
10.1.4 Scope of Chapter
10.2 Gray-Box Static Models
10.2.1 Basic Notions
10.2.2 Performance Models for Solar Photovoltaic Systems
10.2.2.1 Weather-Based Models
10.2.2.2 Weather and Cell Temperature-Based Models
10.2.3 Gray-Box and Black-Box Models for Water-Cooled Chillers
10.2.4 Sequential Stagewise Regression and Selection of Data Windows
10.2.5 Case Study of Non-Intrusive Sequential Parameter Estimation for Building Energy Flows
10.2.6 Application to Policy: Dose-Response
10.3 Certain Aspects of Data Collection
10.3.1 Types of Data Collection
10.3.2 Measures of Information Content
10.3.3 Functional Testing and Data Fusion
10.4 Gray-Box Models for Dynamic Systems
10.4.1 Introduction
10.4.2 Sequential Estimation of Thermal Network Model Parameters from Controlled Tests
10.4.3 Non-Intrusive Identification of Thermal Network Models and Parameters
10.4.4 State Space Representation and Compartmental Models
10.4.5 Example of a Compartmental Model
10.4.6 Practical Issues During Identification
10.5 Bayesian Regression and Parameter Estimation: Case Study
10.6 Calibration of Detailed Simulation Programs
10.6.1 Purpose
10.6.2 The Basic Issue
10.6.3 Detailed Simulation Models for Energy Use in Buildings
10.6.4 Uses of Calibrated Simulation
10.6.5 Causes of Differences
10.6.6 Definition of Terms
10.6.7 Raw Input Tuning (RIT)
10.6.8 Semi-Analytical Methods (SAM)
10.6.9 Physical Parameter Estimation (PPE)
10.6.10 Thoughts on Statistical Criteria for Goodness-of-Fit
Problems
References
11: Statistical Learning Through Data Analytics
11.1 Introduction
11.2 Distance as a Similarity Measure
11.3 Unsupervised Learning: Clustering Approaches
11.3.1 Types of Clustering Methods
11.3.2 Centroid-Based Partitional Clustering by K-Means
11.3.3 Density-Based Partitional Clustering Using DBSCAN
11.3.4 Agglomerative Hierarchical Clustering Methods
11.4 Supervised Learning: Statistical-Based Classification Approaches
11.4.1 Different Types of Approaches
11.4.2 Distance-Based Classification: k-Nearest Neighbors
11.4.3 Naive Bayesian Classification
11.4.4 Classical Regression-Based Classification
11.4.5 Discriminant Function Analysis
11.4.6 Neural Networks: Radial Basis Function (RBF)
11.4.7 Support Vector Machines (SVM)
11.5 Decision Tree-Based Classification Methods
11.5.1 Rule-Based Method and Decision-Tree Representation
11.5.2 Criteria for Tree Splitting
11.5.3 Classification and Regression Trees (CART)
11.5.4 Ensemble Method: Random Forest
11.6 Anomaly Detection Methods
11.6.1 Introduction
11.6.2 Graphical and Statistical Methods
11.6.3 Model-Based Methods
11.6.4 Data Mining Methods
11.7 Applications to Reducing Energy Use in Buildings
Problems
References
12: Decision-Making, Risk Analysis, and Sustainability Assessments
12.1 Introduction
12.1.1 Types of Decision-Making Problems and Applications
12.1.2 Purview of Reliability, Risk Analysis, and Decision-Making
12.1.3 Example of Discrete Decision-Making
12.1.4 Example of Chiller FDD
12.2 Single Criterion Decision-Making
12.2.1 General Framework
12.2.2 Representing Problem Structure: Influence Diagrams and Decision Trees
12.2.3 Single and Multi-Stage Decision Problems
12.2.4 Value of Perfect Information
12.2.5 Different Criteria for Outcome Evaluation
12.2.6 Discretizing Probability Distributions
12.2.7 Utility Value Functions for Modeling Risk Attitudes
12.2.8 Monte Carlo Simulation for First-Order and Nested Uncertainties
12.3 Risk Analysis
12.3.1 The Three Aspects
12.3.2 The Empirical Approach
12.3.3 Context of Environmental Risk to Humans
12.3.4 Other Areas of Application
12.4 Case Study: Risk Assessment of an Existing Building
12.5 Multi-Criteria Decision-Making (MCDM) Methods
12.5.1 Introduction and Description of Terms
12.5.2 Classification of Methods
12.5.3 Basic Mathematical Operations
12.6 Single Discipline MCDM Methods: Techno-Economic Analysis
12.6.1 Review
12.6.2 Consistent Attribute Scales
12.6.3 Inconsistent Attribute Scales: Dominance and Pareto Frontier
12.6.4 Case Study of Conflicting Criteria: Supervisory Control of an Engineered System
12.7 Sustainability Assessments: MCDM with Multi-Discipline Attribute Scales
12.7.1 Definitions and Scope
12.7.2 Indicators and Metrics
12.7.3 Sustainability Assessment Frameworks
12.7.4 Examples of Non-, Semi-, and Fully-Aggregated Assessments
12.7.5 Two Case Studies: Structure-Based and Performance-Based
12.7.6 Closure
Problems
References
Appendices
Appendix A: Statistical Tables (Tables A.1–A.13)
Appendix B: Large Data Sets
Appendix C: Data Sets in Textbook
Appendix D: Solved Examples and Problems with Practical Relevance
Index

T. Agami Reddy Gregor P. Henze

Applied Data Analysis and Modeling for Energy Engineers and Scientists Second Edition


T. Agami Reddy
The Design School and the School of Sustainable Engineering and the Built Environment
Arizona State University
Tempe, AZ, USA

Gregor P. Henze
Department of Civil, Environmental and Architectural Engineering
University of Colorado
Boulder, CO, USA

ISBN 978-3-031-34868-6    ISBN 978-3-031-34869-3 (eBook)
https://doi.org/10.1007/978-3-031-34869-3

# The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Dedicated to the men who have had a profound and significant impact on my life (TAR): my grandfather Satyakarma, my father Dayakar, my brother Shantikar, and my son Satyajit. Dedicated to the strong, loving, and inspiring women in my life (GPH): my mother Uta Birgit, my wife Martha Marie, and our daughters Sophia Miriam and Josephine Charlotte.

Preface (Second Edition)

This second edition has been undertaken over a dozen years after the first edition and is a complete revision meant to modernize, update, and expand the topic coverage and case study examples. The general intent remains the same, i.e., a practical textbook on applied data analysis and modeling targeting students and professionals in engineering and applied science working on energy and environmental issues and systems in general, and in the building energy domain in particular. Statistical textbooks often tend to be opaque, making it difficult to acquire an intuitive understanding of how (and when not) to apply the numerous analysis techniques to one's chosen field. The style of this book is to present a simple, clear,[1] and logically laid-out structure of the various aspects of statistical theory[2] and practice, along with suggestions, discussion, and case study examples meant to enhance comprehension and act as a catalyst for self-discovery by the reader. The book remains modular, and several chapters can be studied as standalone. The structure of the first edition has been retained, but important enhancements have been made. The first six chapters deal with basic topics covered in most statistical textbooks (and could serve as a first course if needed), while the remaining six chapters deal with more advanced topics with domain-relevant discussion and case study examples. The latter chapters have been revised extensively with new statistical methods and subject matter, along with numerous examples from recently published technical papers meant to nurture and stimulate a more research-focused mindset in the reader. The chapter on "classification and clustering" in the first edition has been renamed "statistical learning through data analytics," given the enormous advances in data science, and has been greatly expanded in scope and treatment. The chapters on inverse methods as applied to black-box, gray-box, and white-box models (Chaps. 9 and 10) have been better structured, thoroughly revised, and improved. A new section on sustainability assessments has been added to the last chapter on decision-making. It tries to dispel the currently confusing nomenclature in the sustainability literature by defining and scoping various terms and clearly distinguishing between the different assessment frameworks and their application areas. An attempt is made to combine traditional decision-making with the broader domain of "sustainability assessments."

In recent years, "science-based data analysis" is a term often used by certain sections of society to lend credence to whatever opinion they wish to promote, based on some sort of analysis of some sort of data—it is close to assuming the aura of a "new religion." Data analysis and modeling is an art with principles grounded in science. The same data set can often be analyzed in different ways by different analysts, thereby affording a great deal of methodological freedom. Unfortunately, a lack of rigor and excessive reliance on freely available software packages undermine the attitude of humility and cautious scrutiny historically expected of the scientific method. Therefore, we hold that the significant body of statistical analysis methods is something those wishing to become competent and trustworthy analysts are encouraged to acquire as a foundation, but more important is the need to develop a mindset and a skill set built on years of hands-on experience and self-evaluation. With the advent of powerful computing and convenient-to-use statistical packages, analysts can perform numerous different types of analysis before reaching a conclusion, a convenience not available to analysts merely a few decades ago. As alluded to above, this has also led to cases of misuse and even error. Numerous technical papers report results of statistical analyses that are incomplete and deficient, which leads to an unfortunate erosion of confidence in scientific findings among the research and policy community, not to mention the public. Further, there are several cases where the original approach to a specific topic went down an analysis path later found to be inappropriate, but subsequent researchers (and funding agencies) kept pushing that pathway just because of historic inertia. Hence, it is imperative that one have the courage to be impartial about one's results and research findings. There are several instances where science, or at least applied science, is not self-correcting, which can be attributed to the tendency to belittle confirmatory research, to limited funding, to constant shifting of research focus by funding agencies, academics, and researchers, and to the mindset that changing course will open one's previous research to criticism.

The original author (TAR) is very pleased to have his colleague, Prof. Gregor Henze, join the authorship of this edition. One of the most noticeable differences in the second edition of this book is the inclusion of electronic resources. This will enhance senior-level and graduate instruction while also serving as a self-learning aid to professionals in this domain area.

For readers to gain exposure to (and perhaps become proficient in) performing hands-on analysis, the open-source Python and R programming languages have been adopted in the form of Jupyter notebooks and R Markdown files, which can be downloaded from https://github.com/henze-research-group/adam. This repository contains numerous data sets and sample computer code reflective of real-world problems, and it will continue to grow as new examples are developed and graciously contributed by readers of this book. The link also allows the large data tables in Appendix B and various chapters to be downloaded conveniently for analysis.

[1] "If you cannot explain the concepts simply, you probably do not understand them properly"—Richard Feynman.
[2] Theory is something with principles which are not obvious initially but from which surprising consequences can be deduced and even the principles confirmed.

Acknowledgments

In addition to the numerous talented and dedicated colleagues who contributed in various ways to the first edition of this book, we would like to acknowledge, in particular, the following:

• Colleagues: Bass Abushakra, Marlin Addison, Brad Allenby, Juan-Carlos Baltazar, David Claridge, Daniel Feuermann, Srinivas Katipamula, George Runger, Kris Subbarao, Frank Vignola, and Radu Zmeureanu.
• Former students: Thomaz Carvalhaes, Srijan Didwania, Ranajoy Dutta, Phillip Howard, Alireza Inanlouganji, Mushfiq Islam, Saurabh Jalori, Salim Moslehi, Emmanuel Omere, Travis Sabatino, and Steve Snyder.

TAR would like to acknowledge the love and encouragement from his wife Shobha, their children Agaja and Satyajit, and grand-daughters Maya and Mikaella. GPH would like to acknowledge the encouragement and support of his wife Martha and their children Sophia and Josephine.

Tempe, AZ, USA    T. Agami Reddy
Boulder, CO, USA    Gregor P. Henze

Preface (First Edition)

A Third Need in Engineering Education

At its inception, engineering education was predominantly process oriented, while engineering practice tended to be predominantly system oriented.[3] While it was invaluable to have a strong fundamental knowledge of the processes, educators realized the need for courses where this knowledge translated into an ability to design systems; therefore, most universities, starting in the 1970s, mandated that seniors take at least one design/capstone course. However, a third aspect is acquiring increasing importance: the need to analyze, interpret, and model data. Such a skill set is proving to be crucial in all scientific activities, none more so than in engineering and the physical sciences. How can data collected from a piece of equipment be used to assess the claims of the manufacturer? How can performance data from a natural or a man-made system be used, respectively, to maintain it more sustainably or to operate it more efficiently? Such needs are driven by the fact that system performance data are easily available in our present-day digital age, where sensor and data acquisition systems have become reliable, cheap, and part of the system design itself. This applies both to experimental data (gathered from experiments performed according to some predetermined strategy) and to observational data (where one can neither intrude on system functioning nor control the experiment, such as in astronomy). Techniques for data analysis also differ depending on the size of the data; smaller data sets may require the use of "prior" knowledge of how the system is expected to behave or how similar systems have been known to behave in the past.

Let us consider a specific instance of observational data: once a system is designed and built, how to evaluate its condition in terms of design intent and, if possible, operate it in an "optimal" manner under variable operating conditions (say, based on cost, on minimal environmental impact such as carbon footprint, or on any appropriate pre-specified objective). Thus, data analysis and data-driven modeling methods as applied to this instance can be meant to achieve certain practical ends—for example:

(a) Verifying the stated claims of a manufacturer;
(b) Product improvement or product characterization from performance data of a prototype;
(c) Health monitoring of a system, i.e., how does one use quantitative approaches to reach sound decisions on the state or "health" of the system based on its monitored data?
(d) Controlling a system, i.e., how best to operate and control it on a day-to-day basis?
(e) Identifying measures to improve system performance, and assessing the impact of these measures;
(f) Verifying the performance of implemented measures, i.e., are the remedial measures impacting system performance as intended?

[3] Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.


Intent

Data analysis and modeling is not an end in itself; it is a well-proven and often indispensable aid for subsequent decision-making, such as allowing realistic assessments and predictions to be made concerning expected behavior, the current operational state of the system, and/or the impact of any intended structural or operational changes. It has its roots in statistics, probability, regression, mathematics (linear algebra, differential equations, numerical methods, . . .), modeling, and decision-making. Engineering and science graduates are somewhat comfortable with mathematics, while they do not usually get any exposure to decision analysis at all. Statistics, probability, and regression analysis are usually squeezed into a sophomore term, leaving them "a shadowy mathematical nightmare, and . . . a weakness forever"[4] even for academically good graduates. Further, many of these concepts, tools, and procedures are taught as disparate courses not only in the physical sciences and engineering but also in life sciences, statistics, and econometrics departments. This has led many in the physical sciences and engineering communities to a pervasive "mental block," apprehensiveness, or lack of appreciation of this discipline altogether. Though these analysis skills can be learned over several years by some (while others never learn them well enough to be comfortable even after several years of practice), what is needed is a textbook which provides:

1. A review of classical statistics and probability concepts,
2. A basic and unified perspective of the various techniques of data-based mathematical modeling and analysis,
3. An understanding of the "process" along with the tools,
4. A proper combination of classical methods with the more recent machine learning and automated tools which the widespread use of computers has spawned, and
5. Well-conceived examples and problems involving real-world data that illustrate these concepts within the purview of specific areas of application.

Such a text is likely to dispel the current sense of unease and provide readers with the necessary measure of practical understanding and confidence in being able to interpret their numbers rather than merely generating them. This would also have the added benefit of advancing the current state of knowledge and practice, in that the professional and research community would better appreciate, absorb, and even contribute to the numerous research publications in this area.

Approach and Scope Forward models needed for system simulation and design have been addressed in numerous textbooks and have been well-inculcated into the undergraduate engineering and science curriculum for several decades. It is the issue of data-driven methods, which I feel is inadequately reinforced in undergraduate and first-year graduate curricula, and hence the basic rationale for this book. Further, this book is not meant to be a monograph or a compilation of information on papers i.e., not a literature review. It is meant to serve as a textbook for senior undergraduate or first-year graduate students or for continuing education professional courses, as well as a self-study reference book for working professionals with adequate background. Applied statistics and data based analysis methods find applications in various engineering, business, medical, and physical, natural and social sciences. Though the basic concepts are the same, the diversity in these disciplines results in rather different focus and differing emphasis of the analysis methods. This diversity may be in the process itself, in the type and quantity of data, and in the intended purpose of the analysis. For example, many engineering systems have 4

4 Keller, D.K., 2006. The Tao of Statistics, Sage Publications, London, UK.

Preface (First Edition)


low “epistemic” uncertainty, i.e., low uncertainty associated with the process itself, and also allow easy gathering of adequate performance data. Such systems are typically characterized by strong relationships between variables, which can be formulated in mechanistic terms and accurately modeled. This is in stark contrast to fields such as economics and the social sciences, where even qualitative causal behavior is often speculative, data are scarce, and their uncertainty high. In fact, even different types of engineered and natural systems require widely different analysis tools. For example, electrical and certain mechanical engineering disciplines (e.g., those involving rotary equipment) rely largely on frequency-domain analysis methods, while time-domain methods are more suitable for most thermal and environmental systems. This consideration has led me to limit the scope of the analysis techniques described in this book to thermal, energy-related, environmental, and industrial systems.

There are students for whom a mathematical treatment and justification aids comprehension of the underlying concepts. However, my personal experience has been that the great majority of engineers do not fall into this category, and hence a more pragmatic approach is adopted. I am not particularly concerned with proofs, deductions, and statistical rigor, which tend to overwhelm the average engineering student. The intent is, rather, to impart a broad conceptual and theoretical understanding as well as a solid working familiarity (by means of case studies) with the various facets of data-driven modeling and analysis as applied to thermal and environmental systems. On the other hand, this is neither a cookbook nor a reference work listing models of the numerous pieces of equipment and systems that comprise thermal systems; rather, it stresses the underlying scientific, engineering, statistical, and analysis concepts. It should not be considered a substitute for such specialized books, nor should their importance be trivialized. A good general professional needs to be familiar, if not proficient, with a number of different analysis tools and how they “map” onto each other, so as to be able to select the most appropriate tools for the occasion. Though nothing can replace hands-on experience in design and data analysis, familiarity with the appropriate theoretical concepts not only shortens modeling and analysis time but also enables better engineering analysis. Further, those who have gone through this book will gain the basic understanding required to tackle the more advanced topics dealt with in the literature at large, and hence elevate the profession as a whole.

This book has been written with a certain amount of zeal in the hope that it will give this field some impetus and lead to its gradual emergence as an identifiable and important discipline (a status already enjoyed by courses on modeling, simulation, and design of systems), ultimately becoming a required senior-level or first-year graduate course in most engineering and science curricula. The book has been intentionally structured so that the same topics (namely, statistics, parameter estimation, and data collection) are treated first at a “basic” level, primarily by reviewing the essentials, and then at an “intermediate” level. This gives the book broader appeal and allows a gentler absorption of the material by certain students and practicing professionals.

As pointed out by Asimov,5 the Greeks demonstrated that abstraction (or simplification) in physics allowed a simple and generalized mathematical structure to be formulated, one which led to greater understanding than would otherwise have been possible, along with the ability to subsequently restore some of the real-world complicating factors that had initially been ignored. Most textbooks implicitly follow this premise by presenting “simplistic” illustrative examples and problems. I strongly believe that a book on data analysis should also expose the student to the “messiness” present in real-world data. To that end, examples and problems dealing with case studies involving actual (either raw or marginally cleaned) data have been included. The hope is that this will provide the student with the necessary training and confidence to tackle real-world analysis situations.

5 Asimov, I., 1966. Understanding Physics: Light, Magnetism and Electricity, Walker Publications.


Assumed Background of Reader

This book is written for two sets of audiences: a basic treatment meant for the general engineering and science senior as well as the general practicing engineer, on the one hand, and for the general graduate student and the more advanced professional entering the fields of thermal and environmental sciences, on the other. The exponential expansion of scientific and engineering knowledge, as well as its cross-fertilization with allied emerging fields such as computer science, nanotechnology, and bio-engineering, has created the need for a major reevaluation of the thermal science undergraduate and graduate engineering curricula. The relatively few professional and free elective slots available to students require that traditional subject matter be combined into fewer classes, whereby the associated loss in depth and rigor is compensated for by a better understanding of the connections among different topics within a given discipline as well as between traditional topics and newer ones.

It is presumed that the reader has the necessary academic background (at the undergraduate level) in traditional topics such as physics, mathematics (linear algebra and calculus), fluids, thermodynamics, and heat transfer, as well as some exposure to experimental methods, probability, statistics, and regression analysis (taught in lab courses at the freshman or sophomore level). Further, it is assumed that the reader has some basic familiarity with the important energy and environmental issues facing society today. However, special effort has been made to provide pertinent reviews of such material so as to make this a sufficiently self-contained book.

Most students and professionals are familiar with the uses and capabilities of the ubiquitous spreadsheet program. Though many of the problems can be solved with the existing (or add-on) capabilities of such spreadsheet programs, the instructor or reader is urged to select an appropriate statistical program for the statistical computing work because of the added sophistication it provides. This book does not delve into how to use these programs; rather, its focus is education-based, intended to provide the knowledge and skill sets necessary for the judgment and confidence to use them well, as opposed to training-based, where the focus would be on teaching facts and specialized software.

Acknowledgements

Numerous talented and dedicated colleagues contributed in various ways over the several years of my professional career; some by direct association, others indirectly through their textbooks and papers, both of which were immensely edifying and stimulating to me personally. The list of acknowledgements of such meritorious individuals would be very long indeed, and so I have limited myself to those who have either provided direct valuable suggestions on the overview and scope of this book or have generously given their time in reviewing certain of its chapters. In the former category, I would like to gratefully mention Drs. David Claridge, Jeff Gordon, Gregor Henze, John Mitchell, and Robert Sonderegger; in the latter, Drs. James Braun, Patrick Gurian, John House, Ari Rabl, and Balaji Rajagopalan. I am also appreciative of interactions with several exceptional graduate students, and would like to especially thank the following, whose work has been adopted in case study examples in this book: Klaus Andersen, Song Deng, Jason Fierko, Wei Jiang, Itzhak Maor, Steven Snyder, and Jian Sun. Writing a book is a tedious and long process; the encouragement and understanding of my wife, Shobha, and our children, Agaja and Satyajit, were sources of strength and motivation.

Tempe, AZ, USA
December 2010

T. Agami Reddy

Contents

1 Mathematical Models and Data Analysis
  1.1 Forward and Inverse Approaches
    1.1.1 Preamble
    1.1.2 The Energy Problem and Importance of Buildings
    1.1.3 Forward or Simulation Approach
    1.1.4 Inverse or Data Analysis Approach
    1.1.5 Discussion of Both Approaches
  1.2 System Models
    1.2.1 What Is a System Model?
    1.2.2 Types of Models
  1.3 Types of Data
    1.3.1 Classification
    1.3.2 Types of Uncertainty in Data
  1.4 Mathematical Models
    1.4.1 Basic Terminology
    1.4.2 Block Diagrams
    1.4.3 Mathematical Representation
    1.4.4 Classification
    1.4.5 Steady-State and Dynamic Models
  1.5 Mathematical Modeling Approaches
    1.5.1 Broad Categorization
    1.5.2 Simulation or Forward Modeling
    1.5.3 Inverse Modeling
    1.5.4 Calibrated Simulation
  1.6 Data Analytic Approaches
    1.6.1 Data Mining or Knowledge Discovery
    1.6.2 Machine Learning or Algorithmic Models
    1.6.3 Introduction to Big Data
  1.7 Data Analysis
    1.7.1 Introduction
    1.7.2 Basic Stages
    1.7.3 Example of a Data Collection and Analysis System
  1.8 Topics Covered in Book
  Problems
  References

2 Probability Concepts and Probability Distributions
  2.1 Introduction
    2.1.1 Classical Concept of Probability
    2.1.2 Bayesian Viewpoint of Probability
    2.1.3 Distinction Between Probability and Statistics
  2.2 Classical Probability
    2.2.1 Basic Terminology
    2.2.2 Basic Set Theory Notation and Axioms of Probability
    2.2.3 Axioms of Probability
    2.2.4 Joint, Marginal, and Conditional Probabilities
    2.2.5 Permutations and Combinations
  2.3 Probability Distribution Functions
    2.3.1 Density Functions
    2.3.2 Expectations and Moments
    2.3.3 Function of Random Variables
    2.3.4 Chebyshev’s Theorem
  2.4 Important Probability Distributions
    2.4.1 Background
    2.4.2 Distributions for Discrete Variables
    2.4.3 Distributions for Continuous Variables
  2.5 Bayesian Probability
    2.5.1 Bayes’ Theorem
    2.5.2 Application to Discrete Probability Variables
    2.5.3 Application to Continuous Probability Variables
  2.6 Three Kinds of Probabilities
  Problems
  References

3 Data Collection and Preliminary Analysis
  3.1 Sensors and Their Characteristics
  3.2 Data Collection Systems
    3.2.1 Generalized Measurement System
    3.2.2 Types and Categories of Measurements
    3.2.3 Data Recording Systems
  3.3 Raw Data Validation and Preparation
    3.3.1 Definitions
    3.3.2 Limit Checks
    3.3.3 Consistency Checks Involving Conservation Balances
    3.3.4 Outlier Rejection by Visual Means
    3.3.5 Handling Missing Data
  3.4 Statistical Measures of Sample Data
    3.4.1 Summary Descriptive Measures
    3.4.2 Covariance and Pearson Correlation Coefficient
  3.5 Exploratory Data Analysis (EDA)
    3.5.1 What Is EDA?
    3.5.2 Purpose of Data Visualization
    3.5.3 Static Univariate Graphical Plots
    3.5.4 Static Bi- and Multivariate Graphical Plots
    3.5.5 Interactive and Dynamic Graphics
    3.5.6 Basic Data Transformations
  3.6 Overall Measurement Uncertainty
    3.6.1 Need for Uncertainty Analysis
    3.6.2 Basic Uncertainty Concepts: Random and Bias Errors
    3.6.3 Random Uncertainty
    3.6.4 Bias Uncertainty
    3.6.5 Overall Uncertainty
    3.6.6 Chauvenet’s Statistical Criterion of Data Rejection
  3.7 Propagation of Errors
    3.7.1 Taylor Series Method for Cross-Sectional Data
    3.7.2 Monte Carlo Method for Error Propagation Problems
  3.8 Planning a Non-Intrusive Field Experiment
  Problems
  References

4 Making Statistical Inferences from Samples
  4.1 Introduction
  4.2 Basic Univariate Inferential Statistics
    4.2.1 Sampling Distribution and Confidence Interval of the Mean
    4.2.2 Hypothesis Test for Single Sample Mean
    4.2.3 Two Independent Sample and Paired Difference Tests on Means
    4.2.4 Single and Two Sample Tests for Proportions
    4.2.5 Single and Two Sample Tests of Variance
    4.2.6 Tests for Distributions
    4.2.7 Test on the Pearson Correlation Coefficient
  4.3 ANOVA Test for Multi-Samples
    4.3.1 Single-Factor ANOVA
    4.3.2 Tukey’s Multiple Comparison Test
  4.4 Tests of Significance of Multivariate Data
    4.4.1 Introduction to Multivariate Methods
    4.4.2 Hotelling T² Test
  4.5 Non-Parametric Tests
    4.5.1 Signed and Rank Tests for Medians
    4.5.2 Kruskal–Wallis Multiple Samples Test for Medians
    4.5.3 Test on Spearman Rank Correlation Coefficient
  4.6 Bayesian Inferences
    4.6.1 Background
    4.6.2 Estimating Population Parameter from a Sample
    4.6.3 Hypothesis Testing
  4.7 Some Considerations About Sampling
    4.7.1 Random and Non-Random Sampling Methods
    4.7.2 Desirable Properties of Estimators
    4.7.3 Determining Sample Size During Random Surveys
    4.7.4 Stratified Sampling for Variance Reduction
  4.8 Resampling Methods
    4.8.1 Basic Concept
    4.8.2 Application to Probability Problems
    4.8.3 Different Methods of Resampling
    4.8.4 Application of Bootstrap to Statistical Inference Problems
    4.8.5 Closing Remarks
  Problems
  References

5 Linear Regression Analysis Using Least Squares
  5.1 Introduction
  5.2 Regression Analysis
    5.2.1 Objective of Regression Analysis
    5.2.2 Ordinary Least Squares
  5.3 Simple OLS Regression
    5.3.1 Estimation of Model Parameters
    5.3.2 Statistical Criteria for Model Evaluation
    5.3.3 Inferences on Regression Coefficients and Model Significance
    5.3.4 Model Prediction Uncertainty
  5.4 Multiple OLS Regression
    5.4.1 Higher Order Linear Models
    5.4.2 Matrix Formulation
    5.4.3 Point and Interval Estimation
    5.4.4 Beta Coefficients and Elasticity
    5.4.5 Partial Correlation Coefficients
    5.4.6 Assuring Model Parsimony—Stepwise Regression
  5.5 Applicability of OLS Parameter Estimation
    5.5.1 Assumptions
    5.5.2 Sources of Errors During Regression
  5.6 Model Residual Analysis and Regularization
    5.6.1 Detection of Ill-Conditioned Behavior
    5.6.2 Leverage and Influence Data Points
    5.6.3 Remedies for Nonuniform Residuals
    5.6.4 Serially Correlated Residuals
    5.6.5 Dealing with Misspecified Models
  5.7 Other Useful OLS Regression Models
    5.7.1 Zero-Intercept Models
    5.7.2 Indicator Variables for Local Piecewise Models—Linear Splines
    5.7.3 Indicator Variables for Categorical Regressor Models
  5.8 Resampling Methods Applied to Regression
    5.8.1 Basic Approach
    5.8.2 Jackknife and k-Fold Cross-Validation
    5.8.3 Bootstrap Method
  5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance
  5.10 Parting Comments on Regression Analysis and OLS
  Problems
  References

6 Design of Physical and Simulation Experiments
  6.1 Introduction
    6.1.1 Types of Data Collection
    6.1.2 Purpose of DOE
    6.1.3 DOE Terminology
  6.2 Overview of Different Statistical Methods
    6.2.1 Different Types of ANOVA Tests
    6.2.2 Link Between ANOVA and Regression
    6.2.3 Recap of Basic Model Functional Forms
  6.3 Basic Concepts
    6.3.1 Levels, Discretization, and Experimental Combinations
    6.3.2 Blocking
    6.3.3 Unrestricted and Restricted Randomization
  6.4 Factorial Designs
    6.4.1 Full Factorial Design
    6.4.2 2^k Factorial Designs
    6.4.3 Concept of Orthogonality
    6.4.4 Fractional Factorial Designs
  6.5 Block Designs
    6.5.1 Complete Block Design
    6.5.2 Latin Squares
  6.6 Response Surface Designs
    6.6.1 Applications
    6.6.2 Methodology
    6.6.3 First- and Second-Order Models
    6.6.4 Central Composite Design and the Concept of Rotation
  6.7 Simulation Experiments
    6.7.1 Background
    6.7.2 Similarities and Differences Between Physical and Simulation Experiments
    6.7.3 Monte Carlo and Allied Sampling Methods
    6.7.4 Sensitivity Analysis for Screening
    6.7.5 Surrogate Modeling
    6.7.6 Summary
  Problems
  References

7 Optimization Methods
  7.1 Introduction
    7.1.1 What Is Optimization?
    7.1.2 Simple Example
  7.2 Terminology and Classification
    7.2.1 Definition of Terms
    7.2.2 Categorization of Methods
    7.2.3 Types of Objective Functions and Constraints
    7.2.4 Sensitivity Analysis and Post-Optimality Analysis
  7.3 Analytical Methods
    7.3.1 Unconstrained Problems
    7.3.2 Direct Substitution Method for Equality Constrained Problems
    7.3.3 Lagrange Multiplier Method for Equality Constrained Problems
    7.3.4 Problems with Inequality Constraints
    7.3.5 Penalty Function Method
  7.4 Numerical Unconstrained Search Methods
    7.4.1 Univariate Methods
    7.4.2 Multivariate Methods
  7.5 Linear Programming (LP)
    7.5.1 Standard Form
    7.5.2 Example of a LP Problem
    7.5.3 Linear Network Models
    7.5.4 Example of Maximizing Flow in a Transportation Network
    7.5.5 Mixed Integer Linear Programming (MILP)
    7.5.6 Example of Reliability Analysis of a Power Network
  7.6 Nonlinear Programming
    7.6.1 Standard Form
    7.6.2 Quadratic Programming
    7.6.3 Popular Numerical Multivariate Search Algorithms
  7.7 Illustrative Example: Integrated Energy System (IES) for a Campus
  7.8 Examples of Dynamic Programming
  Problems
  References

8 Analysis of Time Series Data
  8.1 Basic Concepts
    8.1.1 Introduction
    8.1.2 Terminology
    8.1.3 Basic Behavior Patterns
    8.1.4 Illustrative Data Set
  8.2 General Model Formulations
  8.3 Smoothing Methods
    8.3.1 Arithmetic Moving Average (AMA)
    8.3.2 Exponentially Weighted Moving Average (EWA)
    8.3.3 Determining Structure by Cross-Validation
  8.4 OLS Regression Models
    8.4.1 Trend Modeling
    8.4.2 Trend and Seasonal Models
    8.4.3 Forecast Intervals
    8.4.4 Fourier Series Models for Periodic Behavior
    8.4.5 Interrupted Time Series
  8.5 Stochastic Time Series Models
    8.5.1 Introduction
    8.5.2 ACF, PACF, and Data Detrending
    8.5.3 ARIMA Class of Models
    8.5.4 Recommendations on Model Identification
  8.6 ARMAX or Transfer Function Models
    8.6.1 Conceptual Approach and Benefit
    8.6.2 Transfer Function Modeling of Linear Dynamic Systems
  8.7 Quality Control and Process Monitoring Using Control Chart Methods
    8.7.1 Background and Approach
    8.7.2 Shewhart Control Charts for Variables
    8.7.3 Shewhart Control Charts for Attributes
    8.7.4 Practical Implementation Issues of Control Charts
    8.7.5 Time-Weighted Monitoring
    8.7.6 Concluding Remarks
  Problems
  References

9 Parametric and Non-Parametric Regression Methods
  9.1 Introduction
  9.2 Important Concepts in Parameter Estimation
    9.2.1 Structural Identifiability
    9.2.2 Ill-Conditioning
    9.2.3 Numerical Identifiability
  9.3 Dealing with Collinear Regressors: Variable Selection and Shrinkage
    9.3.1 Problematic Issues
    9.3.2 Principal Component Analysis and Regression
    9.3.3 Ridge and Lasso Regression
    9.3.4 Chiller Case Study Involving Collinear Regressors
    9.3.5 Other Multivariate Methods

355 355 357 357 359 360

341 341 342 344 346 347 349 350 353

361 361 362 367 368 372

Contents

xix

10

9.4

Going Beyond OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.2 Maximum Likelihood Estimation (MLE) . . . . . . . . . . . . . . . . . . 9.4.3 Generalized Linear Models (GLM) . . . . . . . . . . . . . . . . . . . . . . 9.4.4 Box-Cox Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.5 Logistic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.6 Error in Variables (EIV) and Corrected Least Squares . . . . . . . . . 9.5 Non-Linear Parametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5.1 Detecting Non-Linear Correlation . . . . . . . . . . . . . . . . . . . . . . . 9.5.2 Different Non-Linear Search Methods . . . . . . . . . . . . . . . . . . . . 9.5.3 Overview of Various Parametric Regression Methods . . . . . . . . . 9.6 Non-Parametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.2 Extensions to Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.3 Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.4 Polynomial Regression and Smoothing Splines . . . . . . . . . . . . . 9.7 Local Regression: LOWESS Smoothing Method . . . . . . . . . . . . . . . . . . . 9.8 Neural Networks: Multi-Layer Perceptron (MLP) . . . . . . . . . . . . . . . . . . . 9.9 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

373 373 375 377 378 381 384 386 386 388 390 390 390 391 393 393 394 394 399 401 406

Inverse Methods for Mechanistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.1 Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.2 Approaches and Their Characteristics . . . . . . . . . . . . . . . . . . . . 10.1.3 Mechanistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.4 Scope of Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Gray-Box Static Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.1 Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.2 Performance Models for Solar Photovoltaic Systems . . . . . . . . . 10.2.3 Gray-Box and Black-Box Models for Water-Cooled Chillers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.4 Sequential Stagewise Regression and Selection of Data Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.5 Case Study of Non-Intrusive Sequential Parameter Estimation for Building Energy Flows . . . . . . . . . . . . . . . . . . . . 10.2.6 Application to Policy: Dose-Response . . . . . . . . . . . . . . . . . . . . 10.3 Certain Aspects of Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3.1 Types of Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3.2 Measures of Information Content . . . . . . . . . . . . . . . . . . . . . . . 10.3.3 Functional Testing and Data Fusion . . . . . . . . . . . . . . . . . . . . . . 10.4 Gray-Box Models for Dynamic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.2 Sequential Estimation of Thermal Network Model Parameters from Controlled Tests . . . . . . 
. . . . . . . . . . . . . . . . . 10.4.3 Non-Intrusive Identification of Thermal Network Models and Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.4 State Space Representation and Compartmental Models . . . . . . . 10.4.5 Example of a Compartmental Model . . . . . . . . . . . . . . . . . . . . . 10.4.6 Practical Issues During Identification . . . . . . . . . . . . . . . . . . . . . 10.5 Bayesian Regression and Parameter Estimation: Case Study . . . . . . . . . . . 10.6 Calibration of Detailed Simulation Programs . . . . . . . . . . . . . . . . . . . . . .

409 410 410 411 413 413 413 413 414 417 419 420 424 426 426 426 431 432 432 433 434 437 437 439 441 446

xx

Contents

10.6.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.2 The Basic Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.3 Detailed Simulation Models for Energy Use in Buildings . . . . . . 10.6.4 Uses of Calibrated Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.5 Causes of Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.6 Definition of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.7 Raw Input Tuning (RIT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.8 Semi-Analytical Methods (SAM) . . . . . . . . . . . . . . . . . . . . . . . 10.6.9 Physical Parameter Estimation (PPE) . . . . . . . . . . . . . . . . . . . . . 10.6.10 Thoughts on Statistical Criteria for Goodness-of-Fit . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

446 446 447 449 450 451 451 453 456 456 459 464

11

Statistical Learning Through Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Distance as a Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Unsupervised Learning: Clustering Approaches . . . . . . . . . . . . . . . . . . . . 11.3.1 Types of Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.2 Centroid-Based Partitional Clustering by K-Means . . . . . . . . . . . 11.3.3 Density-Based Partitional Clustering Using DBSCAN . . . . . . . . 11.3.4 Agglomerative Hierarchical Clustering Methods . . . . . . . . . . . . . 11.4 Supervised Learning: Statistical-Based Classification Approaches . . . . . . . 11.4.1 Different Types of Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 11.4.2 Distance-Based Classification: k-Nearest Neighbors . . . . . . . . . . 11.4.3 Naive Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4.4 Classical Regression-Based Classification . . . . . . . . . . . . . . . . . 11.4.5 Discriminant Function Analysis . . . . . . . . . . . . . . . . . . . . . . . . 11.4.6 Neural Networks: Radial Basis Function (RBF) . . . . . . . . . . . . . 11.4.7 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . 11.5 Decision Tree–Based Classification Methods . . . . . . . . . . . . . . . . . . . . . . 11.5.1 Rule-Based Method and Decision-Tree Representation . . . . . . . . 11.5.2 Criteria for Tree Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5.3 Classification and Regression Trees (CART) . . . . . . . . . . . . . . . 11.5.4 Ensemble Method: Random Forest . . . . . . . . . . . . . . . . . . . . . . 11.6 Anomaly Detection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
11.6.2 Graphical and Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . 11.6.3 Model-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6.4 Data Mining Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.7 Applications to Reducing Energy Use in Buildings . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

467 467 469 470 470 471 475 478 482 482 482 486 486 490 492 493 495 495 496 497 499 504 504 505 505 505 506 510 512

12

Decision-Making, Risk Analysis, and Sustainability Assessments . . . . . . . . . . 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.1 Types of Decision-Making Problems and Applications . . . . . . . . 12.1.2 Purview of Reliability, Risk Analysis, and Decision-Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.3 Example of Discrete Decision-Making . . . . . . . . . . . . . . . . . . . . 12.1.4 Example of Chiller FDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Single Criterion Decision-Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.1 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.2 Representing Problem Structure: Influence Diagrams and Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

515 515 515 518 518 519 521 521 522

Contents

xxi

12.2.3 12.2.4 12.2.5 12.2.6 12.2.7 12.2.8

Single and Multi-Stage Decision Problems . . . . . . . . . . . . . . . . . Value of Perfect Information . . . . . . . . . . . . . . . . . . . . . . . . . . . Different Criteria for Outcome Evaluation . . . . . . . . . . . . . . . . . Discretizing Probability Distributions . . . . . . . . . . . . . . . . . . . . Utility Value Functions for Modeling Risk Attitudes . . . . . . . . . Monte Carlo Simulation for First-Order and Nested Uncertainties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Risk Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.1 The Three Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.2 The Empirical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3.3 Context of Environmental Risk to Humans . . . . . . . . . . . . . . . . 12.3.4 Other Areas of Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Case Study: Risk Assessment of an Existing Building . . . . . . . . . . . . . . . . 12.5 Multi-Criteria Decision-Making (MCDM) Methods . . . . . . . . . . . . . . . . . 12.5.1 Introduction and Description of Terms . . . . . . . . . . . . . . . . . . . . 12.5.2 Classification of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.5.3 Basic Mathematical Operations . . . . . . . . . . . . . . . . . . . . . . . . . 12.6 Single Discipline MCDM Methods: Techno-Economic Analysis . . . . . . . . 12.6.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.6.2 Consistent Attribute Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.6.3 Inconsistent Attribute Scales: Dominance and Pareto Frontier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.6.4 Case Study of Conflicting Criteria: Supervisory Control of an Engineered System . . . . . . . . . . . . . . . . . . . . . . . 
12.7 Sustainability Assessments: MCDM with Multi-Discipline Attribute Scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.7.1 Definitions and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.7.2 Indicators and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.7.3 Sustainability Assessment Frameworks . . . . . . . . . . . . . . . . . . . 12.7.4 Examples of Non, Semi-, and Fully-Aggregated Assessments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.7.5 Two Case Studies: Structure-Based and Performance-Based . . . . 12.7.6 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

524 526 527 528 529 532 533 533 535 536 538 540 546 546 547 549 549 549 550 551 553 557 557 559 560 562 564 568 569 573

Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607

1 Mathematical Models and Data Analysis

Abstract

This chapter starts with an introduction to forward and inverse models and provides a practical context for their distinct usefulness, capabilities, and scopes in terms of a major societal concern, namely the high energy use in buildings. This is followed by a description of the various types of models generally encountered, namely conceptual, physical, and mathematical, with the last type being the sole focus of this book. Next, different types of data collection schemes and the different types of uncertainty encountered are discussed. This is followed by introducing the elements of mathematical models and the different ways to classify them, such as linear and nonlinear, lumped and distributed, dynamic and steady-state, etc. Subsequently, how algebraic and first-order differential equations capture different characteristics of the response of sensors is illustrated. Next, the distinction between simulation or forward (or well-defined or well-specified) problems and inverse (or data-driven or ill-defined) problems is highlighted. This chapter introduces analysis approaches relevant to the latter, which include calibrated forward models and statistical models identified primarily from data; the latter can be black-box or gray-box. Gray-box models can again be separated into (i) partial gray-box models, i.e., those inspired by only a partial understanding of system functioning, and (ii) reduced-order mechanistic gray-box models. More recently, a new class of analysis methods has evolved, namely the data analytic approaches, which include data mining or knowledge discovery, machine learning, and big data analysis. These methods, which have been largely driven by increasing computational power and sensing capabilities, are briefly discussed. Next, the various steps involved in a statistical analysis study are discussed, followed by an example of data collection and analysis of field-monitored data from an engineering system. Finally, the various topics covered in each chapter of this book are outlined.

1.1 Forward and Inverse Approaches

1.1.1 Preamble

Applied data analysis and modeling of system performance is historically older than simulation modeling. The ancients, starting as far back as 12,000 years ago, observed the movements of the sun, moon, and stars in order to predict their behavior and initiate certain tasks such as planting crops or readying for winter. Theirs was a necessity compelled by survival, one that surprisingly remains relevant today. The threat of climate change and its dire consequences are being studied by scientists using essentially similar types of analysis tools: tools that use measured data to refine and calibrate models, extrapolate, and evaluate the effect of different scenarios and mitigation measures. These tools fall under the general purview of inverse data analysis and modeling methods, and it is expedient to illustrate their potential and relevance with a case study application to which the reader can relate practically.

1.1.2 The Energy Problem and Importance of Buildings

One of the current major societal problems facing mankind is the issue of energy, not only because of the gradual depletion of fossil fuels but also because of the adverse climatic and health effects that burning them creates. According to the U.S. Department of Energy (USDOE), total worldwide primary energy consumption in 2021 was about 580 exajoules (= 580 × 10^18 J). The average annual growth rate is about 2%, which suggests a doubling time of 35 years. The United States accounts for 17% of worldwide energy use (with only 5% of the world's population), while the building sector alone (residential plus commercial buildings) in the United States consumes about 40% of the total primary energy use, over 76% of the electricity generated, and is responsible for 40% of the CO2 emitted.

Improvement in energy efficiency in all sectors of the economy has rightly been identified as a major and pressing need, and aggressive programs and measures are being implemented worldwide. By 2030, USDOE estimates that building energy use in the United States could be cut by more than 20% using technologies known to be cost effective today, and by more than 35% if research goals are met. Much higher savings are technically possible. Building efficiency must be viewed as improving the performance of a complex system designed to provide occupants with a comfortable, safe, and attractive living and work environment. This requires superior architecture and engineering designs, quality construction practices, and intelligent operation and maintenance of the structures. Identifying energy conservation and efficiency opportunities, verifying by monitoring whether anticipated benefits are in fact realized when such measures/systems are implemented, operating buildings optimally, and similar tasks all require skills in data analysis and modeling.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_1
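The quoted doubling time follows directly from compound growth. A minimal sketch (the "rule of 70" shorthand and the specific variable names are illustrative, not from the text):

```python
import math

# Worldwide primary energy use is growing at roughly 2% per year
# (figure quoted in the text).
growth_rate = 0.02

# Exact doubling time for compound growth: ln(2) / ln(1 + r)
doubling_exact = math.log(2) / math.log(1 + growth_rate)

# The familiar "rule of 70" approximation: 70 / (r in percent)
doubling_rule_of_70 = 70 / (growth_rate * 100)

print(f"Exact doubling time: {doubling_exact:.1f} years")       # ~35.0
print(f"Rule-of-70 estimate: {doubling_rule_of_70:.1f} years")  # 35.0
```

Both estimates agree with the 35-year doubling time stated above.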

1.1.3 Forward or Simulation Approach

Building energy simulation models (or forward models) are mechanistic (i.e., based on a mathematical formulation of the physical behavior) and deterministic (i.e., there is no randomness in the inputs or outputs).1 They require as inputs the hourly climatic data of the selected location; the layout, orientation, and physical description of the building (such as wall material, thickness, glazing type and fraction, type of shading overhangs, etc.); the type of mechanical and electrical systems available inside the building in terms of air distribution secondary systems; performance specifications of primary equipment (chillers, boilers, etc.); and the hourly operating and occupant schedules of the building. The simulation predicts hourly or sub-hourly energy use over the entire year; the monthly total energy use and peak use, combined with utility rates, then provide an estimate of the operating costs of the building. The primary strength of such a forward simulation model is that it is based on sound engineering principles usually taught in colleges and universities, and consequently it has gained widespread acceptance by the design and professional community. Major public-domain simulation codes (e.g., EnergyPlus 2009) have been developed, with hundreds of person-years invested in their development by very competent professionals. This modeling approach is generally useful for design purposes, where different design options are to be evaluated before the actual system is built.

1 These terms will be described more fully in Sect. 1.5.2.

1.1.4 Inverse or Data Analysis Approach

Inverse modeling methods, on the other hand, are used when performance data of the system is available, and one uses this data for certain specific purposes, such as predicting or controlling the behavior of the system under different operating conditions, identifying energy conservation opportunities, verifying the effect of energy conservation measures and commissioning practices once implemented, or verifying that the system is performing as intended (called condition monitoring). Consider the case of an existing building whose energy consumption is known (either utility bill data or monitored data). The following are some of the tasks for which knowledge of data analysis methods may be advantageous to a building energy specialist:

(a) Commissioning tests: How can one evaluate whether a component or a system is installed and commissioned properly?
(b) Comparison with design intent: How does the consumption compare with design predictions? In case of discrepancies, are they due to anomalous weather, to unintended building operation, to improper equipment operation, or to other causes?
(c) Demand side management (DSM): How would the energy consumption decrease if certain operational changes are made, such as lowering thermostat settings, ventilation rates, or indoor lighting levels?
(d) Operation and maintenance (O&M): How much energy could be saved by retrofits to the building shell, by changing air handler operation from constant air volume to variable air volume, by changing various control settings, or by replacing an old chiller with a new and more energy-efficient one?
(e) Monitoring and verification (M&V): If retrofits are implemented in the system, can one verify that the savings are due to the retrofit, and not to other confounding causes, e.g., the weather or changes in building occupancy?
(f) Automated fault detection, diagnosis, and evaluation (AFDDE): How can one automatically detect faults in heating, ventilating, air-conditioning, and refrigerating (HVAC&R) equipment that reduce operating life and/or increase energy use? What are the financial implications of this degradation? Should the fault be rectified immediately or at a later time? What specific measures need to be taken?
(g) Optimal supervisory operation: How can one characterize HVAC&R equipment (such as chillers, boilers, fans, pumps, etc.) in its installed state and optimize the control and operation of the entire system?
(h) Smart-grid interactions: How can one best facilitate dynamic energy interactions between building energy systems and the smart grid, with its advanced communication and load control capability and high solar/wind energy penetration?

1.1.5 Discussion of Both Approaches

All the above questions are better addressed by data analysis methods. The forward approach could also be used by, say, (i) going back to the blueprints of the building and of the HVAC system and repeating the analysis performed at the design stage while using actual building schedules and operating modes, and (ii) performing a calibration or tuning of the simulation model (i.e., varying the inputs in some fashion), since the simulated performance is unlikely to match observed performance. This process is, however, tedious, and much effort has been invested by the building professional community in this regard with only limited success (Reddy 2006). A critical limitation of the calibrated simulation approach is that the data being used to tune the forward simulation model must meet certain criteria, and even then, all the numerous inputs required by the forward simulation model cannot be mathematically identified (this is referred to as an "over-parameterized problem"). Though awkward, labor intensive, and not entirely satisfactory in its current state of development, the calibrated building energy simulation model is still an attractive option and has its place in the toolkit of data analysis methods (discussed at length in Sect. 10.6).

The fundamental difficulty is that there is no general and widely used model or software for dealing with data-driven applications as they apply to building energy; only specialized software programs have been developed, which allow certain types of narrow analysis to be performed. In fact, given the wide diversity in applications of data-driven models, it is unlikely that any one methodology or software program will ever suffice.
This leads to the basic premise of this book: there exists a crucial need for building energy professionals to be familiar and competent with a wide range of data analysis methods and tools, so that they can select the one that best meets their purpose, with the end result that buildings will be operated and managed in a much more energy-efficient manner than at present.

Building design simulation tools have played a significant role in lowering energy use in buildings. These are necessary tools, and their importance should not be understated. Historically, most of the business revenue in architectural engineering and HVAC&R firms was generated from design/build contracts, which required extensive use of simulation and design software programs. Hence, the professional community is fairly knowledgeable in this area, and several universities teach classes geared toward the use of building energy modeling (BEM) simulation programs.


The last 40 years or so have seen a dramatic increase in building energy services, as evidenced by the number of firms that offer services in this area. The understanding, skills, and tools required for this work differ from those required for building design. Other market forces are also at play. The recent interest in "green" and "sustainable" products has resulted in a plethora of products and practices aggressively marketed by numerous companies. Often, claims that one product saves far more energy than another, or that a given device is more environmentally friendly than others, prove unfounded under closer scrutiny. Unbiased evaluations and independent verification are imperative; otherwise, the whole "green" movement may degrade into mere "green-washing" rather than help overcome a dire societal challenge. A sound understanding of applied data analysis is imperative for this purpose, and future science and engineering graduates have an important role to play. Thus, the raison d'être of this book is to provide a general introduction and a broad foundation to the mathematical, statistical, and modeling aspects of data analysis methods.

1.2 System Models

1.2.1 What Is a System Model?

A system is the object under study, which could be as simple or as complex as one may wish to consider. It is any ordered, inter-related set of things, and their attributes. A model is a construct that allows one to represent the real-life system so that it can be used to predict the future behavior of the system under various “what–if” scenarios. The construct could be a scaled down physical version of the actual system (widely adopted historically in engineering) or a mental construct. The development of a model is not the ultimate objective; in other words, it is not an end by itself. It is a means to an end, the end being a credible means to make decisions that could involve system-specific issues (such as gaining insights about influential drivers and system dynamics, or predicting system behavior, or determining optimal control conditions) as well as those involving a broader context (such as operation management, deciding on policy measures and planning, etc.).

1.2.2 Types of Models

One differentiates between different types of models (Fig. 1.1):

Fig. 1.1 Different types of models. (The figure shows a tree: Models divide into Physical models [Scale, Analogue] and Abstract models [Conceptual, Mathematical], with Mathematical models further divided into Physics-based and Empirical, each of which may be Analytical or Numerical.)

(a) Abstract models can be:
(i) Conceptual (or qualitative or descriptive) models, where the system's behavior is summarized in non-analytical ways because only general qualitative trends of the system are known. They are mental/intuitive abstract constructs that capture subjective/qualitative behavior or expectations of how something works based on prior experience. Such models are primarily used as an aid to thought or communication.
(ii) Mathematical models, which capture system response using mathematical equations; these are further discussed below.
(b) Physical models can be either:
(i) Scaled-down (or scaled-up) physical constructs, whose characteristics resemble those of the physical system being studied. They often supported and guided the work of earlier scientists and engineers and are still extensively used for validating mathematical models (such as architectural daylight experiments in test rooms), or
(ii) Analogue models, which are actual physical setups meant to reproduce the physics of the system and which involve secondary measurements of the system (flow, energy, etc.).
(c) Mathematical models can be further subdivided into (see Fig. 1.1):
(i) Empirical models, which are abstract models based on observed/monitored data with a loose mathematical structure. They capture general qualitative trends of the system based on data describing properties of the system, summarized in a graph, a table, or a curve fit to observation points. Such models presume some knowledge of the fundamental quantitative trends but lack accurate understanding. Examples: econometric, medical, sociological, and anthropological behavior models.
(ii) Physics-based mechanistic models (or structural models), which use metric or count data and are based on mathematical relationships derived from physical laws such as Newton's laws, the laws of thermodynamics and heat transfer, etc. Such models can be used for prediction (during system design) or for proper system operation and control (involving data analysis). This group of models can be classified into:
• Analytical models, for which closed-form mathematical solutions exist for the equation or set of equations.
• Numerical models, which require numerical procedures to solve the equation(s).
Alternatively, mathematical models can be considered to be:
• Exact structural models, where the equation is thought to apply rigorously, i.e., the relationship between variables and parameters in the model is exact, or as close to exact as the current state of scientific understanding permits.
• Inexact structural models, where the equation applies only approximately, either because the process is not fully known or because one chose to simplify the exact model so as to make it more usable. A typical example is the dose-response model, which characterizes the relation between the amount of a toxic agent imbibed by an individual and the incidence of adverse health effects.

1.3 Types of Data

1.3.1 Classification

Data² can be classified in different ways. One classification scheme is to distinguish between experimental data, gathered under controlled test conditions where the observer can perform tests in the manner or sequence he intends, and observational data, collected while the system is under normal operation or when the system cannot be controlled (as in astronomy).

² Several authors make a strict distinction between "data," which is plural, and "datum," which is singular and implies a single data point. No such distinction is made throughout this book, and the word "data" is used to designate either.

Fig. 1.2 The rolling of a die is an example of discrete data where the data can only assume whole numbers. Even if the die is fair, one would not expect, out of 60 throws, the numbers 1 through 6 to appear exactly 10 times, but only approximately so

Another classification scheme is based on the type of data:

(a) Categorical/qualitative data, which involve non-numerical descriptive measures or attributes, such as belonging to one of several categories. One can further distinguish between:
(i) Nominal (or unordered) data, consisting of attribute data with no rank, such as male/female, yes/no, married/unmarried, eye color, engineering major, etc.
(ii) Ordinal data, i.e., data that has some order or rank, such as a building envelope that is leaky, medium, or tight, or a day that is hot, mild, or cold. Such data can be converted into an arbitrary rank order by a min–max scaling (e.g., hot/mild/cold days can be ranked as 1/2/3). Such rank-ordered data can be manipulated arithmetically to some extent.

(b) Numerical/quantitative data, i.e., data obtained from measurements of such quantities as time, weight, and height. Further, there are two different kinds:
(i) Count or discrete data, which can take on only a finite or countable number of values. An example is the data series one would expect by rolling a die 60 times (Fig. 1.2).
(ii) Continuous or metric data, involving measurements of time, weight, height, energy, or others. Such data may take on any value in an interval (most metric data is continuous, and hence is not countable); for example, the daily average outdoor dry-bulb temperature in Philadelphia, PA, over a year (Fig. 1.3). Further, one can distinguish between:
– Data measured on an interval scale, which has an arbitrary zero point (such as the Celsius scale); only differences between values are meaningful.
– Data measured on a ratio scale, which has a zero point that cannot be arbitrarily changed (such as mass, volume, or temperature in Kelvin); both differences and ratios are meaningful.
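The point made in the caption of Fig. 1.2 is easy to check numerically: each face of a fair die is *expected* 10 times out of 60 throws, but the observed counts only fluctuate around that value. A minimal sketch in Python (the seed is an arbitrary illustrative choice):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is reproducible (illustrative choice)

# Roll a fair six-sided die 60 times; the outcomes are discrete (countable) data
rolls = [random.randint(1, 6) for _ in range(60)]
counts = Counter(rolls)

# Each face is expected 10 times on average, but observed counts scatter around 10
for face in range(1, 7):
    print(face, counts[face])
```

Re-running with different seeds gives different count patterns, but rarely the exact 10/10/10/10/10/10 split.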

Fig. 1.3 Continuous data separated into a large number of bins (in this case, 300) resulted in the above histogram of the hourly outdoor dry-bulb temperature (in °F) in Philadelphia, PA, over a year. A smoother distribution would have resulted if a smaller number of bins had been selected
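The effect of bin count noted in the caption of Fig. 1.3 can be explored with synthetic data standing in for the hourly temperature record (the seasonal-plus-noise signal below is hypothetical, not the Philadelphia data):

```python
import math
import random

random.seed(3)  # illustrative seed

# Synthetic stand-in for 8760 hourly dry-bulb temperatures (deg F):
# a seasonal sinusoid plus random weather noise
temps = [55 + 25 * math.sin(2 * math.pi * h / 8760) + random.gauss(0, 7)
         for h in range(8760)]

def histogram(data, nbins):
    """Bin continuous data into nbins equal-width bins and return the counts."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for x in data:
        i = min(int((x - lo) / width), nbins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    return counts

coarse = histogram(temps, 20)   # fewer bins: smoother distribution
fine = histogram(temps, 300)    # many bins, as in Fig. 1.3: raggeder shape
print(len(coarse), len(fine))
```

Both histograms contain the same 8760 observations; only the bin width (and hence the smoothness) differs.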

For data analysis purposes, it is often important to view data based on their dimensionality, i.e., the number of axes needed to graphically present the data. A univariate data set consists of observations based on a single variable, bivariate those based on two variables, and multivariate those based on more than two variables.

A fourth type of distinction between data types is by the source or origin of the data:

(a) Population is the collection or set of all individuals (or items, or characteristics) representing the same quantity, with a connotation of completeness, i.e., the entire group of items being studied, whether they be the freshman student body of a university, instrument readings of a test quantity, or points on a curve.

(b) Sample is a portion or limited number of items from a population from which information or readings are collected. There are again two types of samples:
– Single-sample: a single reading or a succession of readings taken at the same time, or at different times but under identical conditions.
– Multi-sample: a repeated measurement of a fixed quantity using altered test conditions, such as different observers or different instruments or both.

Many experiments may appear to yield multi-sample data but are actually single-sample data. For example, if the same instrument is used for data collection during different times, the data should be regarded as single-sample, not multi-sample.

One can differentiate between different types of multi-sample data. Consider the case of solar thermal collector testing (as described in Pr. 5.7 of Chap. 5). In essence, the collector is subjected to different inlet fluid temperature levels


Fig. 1.4 Example of multisample data in the framework of a “round-robin” experiment of testing the same solar thermal collector in six different test facilities (shown by different symbols) following the same testing methodology. The test data is used to determine and plot the collector efficiency versus the reduced temperature along with uncertainty bands (see Pr. 5.7 for nomenclature). (Streed et al. 1979)

under different values of incident solar radiation and ambient air temperature, using an experimental facility with instrumentation of pre-specified accuracy levels. The test results are processed according to certain performance models, and the data are plotted as collector efficiency versus reduced temperature level. The test protocol would involve performing replicate tests under similar reduced temperature levels; this is one type of multi-sample data. Another type of multi-sample data arises when the same collector is tested at different test facilities nation-wide. The results of such a "round-robin" test are shown in Fig. 1.4, where one detects variations around the trend line given by the performance model that can be attributed to differences in test facility and instrumentation, and to slight variations in how the test protocols were implemented in the different facilities.

(c) Two-stage experiments are successive staged experiments where the chance results of the first stage determine the conditions under which the next stage will be carried out. For example, when checking the quality of a lot of mass-produced articles, it is frequently possible to decrease the average sample size by carrying out the inspection in two stages. One may first take a small sample and accept the lot if all articles in the sample are satisfactory; otherwise, a larger second sample is inspected.

Finally, one needs to distinguish between: (i) a duplicate, which is a separate specimen taken from the same source as the first specimen and tested at the same time and in the same manner, and (ii) a replicate, which is the same specimen tested again at a different time. Thus, while duplication allows one to test samples till they are destroyed (such as tensile strength testing of an iron specimen), replicate testing stops short of doing permanent damage to the samples.
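The two-stage inspection described in (c) can be sketched as a simulation. The sample sizes, acceptance rule, and defect rate below are hypothetical, chosen only to show how the chance outcome of the first stage determines whether the second stage is carried out:

```python
import random

random.seed(42)  # illustrative seed

def two_stage_inspect(lot, n1=5, n2=20):
    """Accept the lot if a small first sample is all good; otherwise
    inspect a larger second sample (hypothetical acceptance rule)."""
    first = random.sample(lot, n1)
    if all(first):
        return "accept", n1          # lot accepted after inspecting only n1 items
    second = random.sample(lot, n2)
    defectives = sum(1 for ok in second if not ok)
    # Hypothetical rule: tolerate at most 1 defective in the larger sample
    return ("accept" if defectives <= 1 else "reject"), n1 + n2

# A lot of 1000 articles, each good (True) with 98% probability (hypothetical)
lot = [random.random() > 0.02 for _ in range(1000)]
decision, inspected = two_stage_inspect(lot)
print(decision, inspected)
```

Averaged over many lots, the scheme inspects far fewer items than always drawing the large sample, which is the point made in the text.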

1.3.2 Types of Uncertainty in Data

If the same results are obtained when an experiment is repeated under the same conditions, one says that the experiment is deterministic. It is this deterministic nature of science that allows theories or models to be formulated and permits the use of scientific theory for prediction (Hodges and Lehman 1970). However, all observational or experimental data invariably have a certain amount of inherent noise or randomness, which introduces a certain degree of uncertainty in the results or conclusions. Due to limitations of the instrument or measurement technique, improper understanding of all influential factors, or the inability to measure some of the driving parameters, random and/or bias types of errors usually infect the deterministic data. However, there are also experiments whose results vary due to the very nature of the experiment; for example, gambling outcomes (throwing of dice, card games, etc.). These are called random experiments.

Without uncertainty or randomness, there would have been little need for statistics. Probability theory and inferential statistics have been largely developed to deal with random experiments, and the same approach has also been adapted to deterministic experimental data analysis. Both inferential statistics and stochastic model building have to deal with the random nature of observational or experimental data, and thus require knowledge of probability.

There are several types of uncertainty in data, and all of them have to do with the inability to determine the true state of affairs of a system (Haimes 1998). A succinct classification involves the following sources of uncertainty:

(a) Purely stochastic variability (or aleatory uncertainty), where the ambiguity in outcome is inherent in the nature of the process, and no amount of additional measurements can reduce the inherent randomness.


Common examples involve coin tossing or card games. These processes are inherently random (either on a temporal or spatial basis), and their outcome, while uncertain, can be anticipated on a statistical basis.

(b) Epistemic uncertainty, or ignorance, or lack of complete knowledge of the process, which results in certain influential variables not being considered (and, thus, not measured).

(c) Inaccurate measurement of numerical data due to instrument or sampling errors.

(d) Cognitive vagueness involving human linguistic description. For example, people use words like tall/short or very important/not important, which cannot be quantified exactly. This type of uncertainty is generally associated with qualitative and ordinal data where subjective elements come into play.

The traditional approach is to use probability theory along with statistical techniques to address the (a), (b), and (c) types of uncertainty. The variability due to sources (b) and (c) can be diminished by taking additional measurements, by using more accurate instrumentation, by better experimental design, and by acquiring better insight into specific behavior with which to develop more accurate models. Several authors apply the term "uncertainty" to only these two sources. Finally, source (d) can be modeled using probability approaches, though some authors argue that it would be more convenient and appropriate to use fuzzy logic to model such vagueness in human speech.
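The practical difference between these sources can be illustrated with synthetic data: averaging many repeated readings shrinks measurement error (source (c)), while no number of observations shrinks the inherent spread of a random process (source (a)). A sketch (values and seed are illustrative):

```python
import random
import statistics

random.seed(0)  # illustrative seed

true_value = 10.0  # hypothetical fixed quantity being measured

# Source (c): measurement error -- reducible; the mean of many noisy
# readings of the same fixed quantity settles close to the true value
readings = [true_value + random.gauss(0, 0.5) for _ in range(10000)]

# Source (a): aleatory variability -- irreducible; the spread of die
# throws stays near the die's inherent standard deviation (about 1.71)
rolls = [random.randint(1, 6) for _ in range(10000)]

print(round(statistics.mean(readings), 3))  # close to 10.0
print(round(statistics.stdev(rolls), 3))    # stays near 1.7 however many rolls
```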

1.4 Mathematical Models

1.4.1 Basic Terminology

One can envision two different types of systems: open systems, in which energy and/or matter flow into and out of the system, and closed systems, in which matter is not exchanged with the environment but energy flows can be present. A system model is a description of the system. Empirical and mechanistic models are made up of three components:

(i) Input variables (also referred to as regressor, forcing, exciting, covariate, exogenous, or independent variables in the engineering, statistical, and econometric literature), which act on the system. Note that there are two types of such variables: those controllable by the experimenter, and uncontrollable or extraneous variables, such as climatic variables.

(ii) System structure and parameters/properties, which provide the necessary mathematical description of the system in terms of physical and material constants; for example, thermal mass, overall heat transfer coefficients, and mechanical properties of elements.

(iii) Output variables (also called response, state, endogenous, or dependent variables), which describe the system response to the input variables.

A structural model of a system is a mathematical relationship between one or several input variables and parameters and one or several output variables. Its primary purpose is to allow better physical understanding of the phenomenon or process or, alternatively, to allow accurate prediction of system response. This is useful for several purposes; for example, preventing adverse phenomena from occurring, proper system design (or optimization), or improving system performance by evaluating modifications to the system. A satisfactory mathematical model is subject to two contradictory requirements (Edwards and Penney 1996): it must be sufficiently detailed to represent the phenomenon it is attempting to explain or capture, yet it must be sufficiently simple to make the mathematical analysis practical. This requires judgment and experience of the modeler, backed by experimentation and validation.³

1.4.2 Block Diagrams

An information flow or block diagram⁴ is a standard shorthand manner of schematically representing the input and output quantities of an element or a system, as well as the computational sequence of variables. It is a concept widely used in the context of system modeling and simulation, since a block implies that its output can be calculated provided the inputs are known. Block diagrams are useful for setting up the set of model equations to be solved in order to simulate or analyze systems or components. As illustrated in Fig. 1.5, a centrifugal pump could be represented by one of many possible block diagrams (as shown in Fig. 1.6), depending on which parameters are of interest. If the model equation is cast in a form such that the outlet pressure p2 is the response variable and the inlet pressure p1 and the volumetric fluid flow rate v are the forcing variables, then the associated block diagram is that shown in Fig. 1.6a. Another type of block diagram is shown in Fig. 1.6b, where the flow rate v is the response variable. The arrows indicate the direction of unilateral information or signal flow, which in turn can be viewed as a cause–effect relationship; this is why these models are termed causal. Thus, such diagrams depict the manner in which the simulation models of the various components of a system need to be formulated.

Fig. 1.5 Schematic of a centrifugal pump rotating at speed s (say, in rpm), which pumps a water flow rate v from lower pressure p1 to higher pressure p2

Fig. 1.6 Different block diagrams for modeling a pump depending on how the problem is formulated

In general, a system or process is subject to one or more inputs (or stimuli or excitation or forcing functions) to which it responds by producing one or more outputs (or system response). If the observer is unable to act on the system, i.e., to change some or any of the inputs so as to produce a desired output, the system is not amenable to control. If, however, the inputs can be varied, then control is feasible. Thus, a control system is defined as an arrangement of physical components connected or related in such a manner as to command, direct, or regulate itself or another system (Stubberud et al. 1994). One needs to distinguish between open and closed loops, and block diagrams provide a convenient way of doing so.

(a) An open loop control system is one in which the control action is independent of the output (see Fig. 1.7a). Two important features are: (i) their ability to perform accurately is determined by their calibration, i.e., by how accurately one is able to establish the input–output relationship; and (ii) they are generally not unstable. A practical example is an automatic toaster, which is simply controlled by a timer. If the behavior of an open loop system is not completely understood, or if unexpected disturbances act on it, then there may be considerable and unpredictable variations in the output.

(b) A closed loop control system, also referred to as a feedback control system, is one in which the control action is somehow dependent on the output (Fig. 1.7b). If the value of the response y(t) is too low or too high, then the control action modifies the manipulated variable (shown as u(t)) appropriately. Such systems are designed to cope with lack of exact knowledge of system behavior, inaccurate component models, and unexpected disturbances. Increased accuracy is achieved by reducing the sensitivity of the output-to-input ratio to variations in system characteristics (i.e., increased bandwidth, defined as the range of variation in the inputs over which the system will respond satisfactorily) or to random perturbations of the system by the environment. They have a serious disadvantage though: they can inadvertently develop unstable oscillations. This issue is an important one by itself and is treated extensively in control textbooks.

Fig. 1.7 Open and closed loop systems for a controlled output y(t). (a) Open loop. (b) Closed loop

Using the same example of a centrifugal pump but going one step further leads us to the control of the pump. For example, if the inlet pressure p1 is specified, and the pump needs to be operated or controlled (say, by varying its rotational speed s) under variable outlet pressure p2 so as to maintain a constant fluid flow rate v, then some sort of control mechanism or feedback is often used (shown in Fig. 1.6c). The small circle at the intersection of the signal s and the feedback represents a summing point that denotes the algebraic operation being carried out. For example, if the feedback signal is summed with the signal s, a "+" sign is placed just outside the summing point. Such graphical representations are called signal flow diagrams and are used in process or system control, which requires inverse modeling and parameter estimation.

³ Validation is defined as the process of bringing the user's confidence about the model to an acceptable level, either by comparing its performance to other more accepted models or by experimentation.
⁴ Block diagrams should not be confused with material flow diagrams, which for a given system configuration are unique. On the other hand, there can be numerous ways of assembling block diagrams depending on how the problem is framed.
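The feedback idea of Figs. 1.6c and 1.7b can be sketched with a simple proportional control rule. The linear pump model and the gain below are hypothetical, meant only to show the manipulated variable (speed s) being adjusted from the error in the controlled output (flow rate v):

```python
# Hypothetical linear pump model: flow rises with speed, falls with outlet pressure
def pump_flow(s, p2, p1=1.0):
    return 0.01 * s - 0.5 * (p2 - p1)

v_set = 2.0    # desired flow rate (arbitrary units)
s = 100.0      # initial pump speed
p2 = 3.0       # outlet pressure, held fixed in this sketch
Kp = 20.0      # proportional gain (hypothetical choice)

for _ in range(50):
    v = pump_flow(s, p2)   # observe the controlled output
    error = v_set - v      # feedback: compare output to the setpoint
    s += Kp * error        # control action on the manipulated variable (speed)

print(round(pump_flow(s, p2), 3))  # flow has converged toward the setpoint
```

With this gain the loop is stable and converges; a sufficiently large gain would instead produce the growing oscillations mentioned in the text.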

1.4.3 Mathematical Representation

Let us start with explaining the difference between parameters and variables in a model. A deterministic model is a mathematical relationship, derived from physical considerations, between variables and parameters. The quantities in a model that can be measured independently during an experiment are the "variables," which can be either input or output variables (as described earlier). To formulate the relationship among variables, one usually introduces "constants" that denote inherent properties of nature or of the engineering system, called parameters.

Consider the dynamic model of a component or system represented by the block diagram in Fig. 1.8. For simplicity, let us assume a linear model with no lagged terms in the forcing variables. Then, the model can be represented in matrix form as:

Yt = A Yt-1 + B Ut + C Wt   with   Y1 = d    (1.1)

where the output or state variable at time t is Yt. The forcing or input variables are of two types: vector U denoting observable and controllable input variables, and vector W indicating uncontrollable input variables or disturbing inputs that may or may not be observable. The parameter vectors of the model are {A, B, C}, while d represents the initial condition vector.

Fig. 1.8 Block diagram of a simple component with parameter vectors {A, B, C}. Vectors U and W are the controllable/observable and the uncontrollable/disturbing inputs, respectively, while Y is the state variable or system response

Examples of Simple Models

(a) Pressure drop Δp of a fluid flowing at velocity v through a pipe of hydraulic diameter Dh and length L:

Δp = f (L/Dh)(ρ v²/2)    (1.2)

where f is the friction factor and ρ is the density of the fluid. For a given system, v can be viewed as the independent or input variable, while the pressure drop is the state variable. The factors f, L, and Dh are the system or model parameters, and ρ is a property of the fluid. Note that the friction factor f is itself a function of the velocity, thus making the problem a bit more complex. Sometimes, the distinction between parameters and variables is ambiguous and depends on the context, i.e., the objective of the study and the manner in which the experiment is performed. For example, in Eq. 1.2, pipe length has been taken to be a fixed system parameter since the intention was to study the pressure drop against fluid velocity. However, if the objective is to determine the effect of pipe length on pressure drop for a fixed velocity, the length would then be viewed as the independent variable.

(b) Rate of heat transfer from a fluid to a surrounding solid:

Q̇ = UA (Tf − To)    (1.3)

where the parameter UA is the overall heat conductance, and Tf and To are the mean fluid and solid temperatures (which are the input variables).

(c) Rate of heat added to a flowing fluid:

Q̇ = ṁ cp (Tout − Tin)    (1.4)

where ṁ is the fluid mass flow rate, cp is its specific heat at constant pressure, and Tout and Tin are the exit and inlet fluid temperatures. It is left to the reader to identify the input variables, state variables, and the model parameters.

(d) Lumped model of the water temperature Ts in a storage tank with an immersed heating element and losing heat to the environment, given by the first-order ordinary differential equation (ODE):

Mcp (dTs/dt) = P − UA (Ts − Tenv)    (1.5)

where Mcp is the thermal heat capacitance of the tank (water plus tank material), Tenv is the environment temperature, and P is the auxiliary power (or heat rate) supplied to the tank. It is left to the reader to identify the input variables, state variables, and parameters.
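Model (d) can be integrated numerically to show how a dynamic model maps inputs and parameters onto the state variable. A sketch using explicit Euler integration of Eq. 1.5, with arbitrary illustrative parameter values:

```python
# Explicit Euler integration of  Mcp * dTs/dt = P - UA*(Ts - Tenv)
Mcp = 4.2e5   # tank heat capacitance, J/K (arbitrary illustrative value)
UA = 50.0     # loss conductance to the environment, W/K (arbitrary)
P = 2000.0    # heater power input, W (arbitrary)
Tenv = 20.0   # environment temperature, deg C
Ts = 20.0     # initial tank temperature, deg C (state variable)
dt = 60.0     # time step, s

for _ in range(24 * 60):  # simulate 24 hours in 1-minute steps
    Ts += dt * (P - UA * (Ts - Tenv)) / Mcp

# Steady state from Eq. 1.5 with dTs/dt = 0:  Ts -> Tenv + P/UA = 60 deg C
print(round(Ts, 2))
```

After 24 h (about ten time constants, since Mcp/UA ≈ 2.3 h) the state has essentially reached the steady-state value predicted analytically.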

1.4.4 Classification

Predicting the behavior of a system requires a mathematical representation of the system components. The process of deciding on the level of detail appropriate for the problem at hand is called abstraction (Cha et al. 2000). This process has to be undertaken with care: (i) over-simplification may result in loss of important system behavior predictability, while (ii) an overly detailed model may result in undue data collection effort and computational resources, as well as time spent in understanding the model assumptions and the results generated. There are different ways by which mathematical models can be classified. Some of these are shown in Table 1.1 and described below.

Table 1.1 Ways of classifying mathematical models (adapted from Eisen 1988)
1. Distributed vs lumped parameter
2. Dynamic vs static or steady-state
3. Deterministic vs stochastic
4. Continuous vs discrete
5. Linear vs nonlinear in the functional model
6. Linear vs nonlinear in the model parameters
7. Time invariant vs time variant
8. Homogeneous vs non-homogeneous
9. Simulation vs performance models
10. Physics based (white-box) vs data based (black-box) and a mix of both (gray-box)

(i) Distributed Versus Lumped Parameter

In a distributed parameter system, the elements of the system are continuously distributed along the system geometry, so that the variables they influence must be treated as differing not only in time but also in space, i.e., from point to point. Partial differential or difference equations are usually needed. Recall that a partial differential equation (PDE) is a differential equation involving partial derivatives of an unknown function of at least two independent variables. One distinguishes between two general cases:
• The independent variables are space variables only.
• The independent variables are both space and time variables.

Though partial derivatives of multivariable functions are ordinary derivatives with respect to one variable (the others being kept constant), the study of PDEs is not an easy extension of the theory for ordinary differential equations (ODEs). The solution of PDEs requires fundamentally different approaches. Recall that ODEs are solved by first finding general solutions and then using subsidiary conditions to determine arbitrary constants. However, such arbitrary constants in general solutions of ODEs are replaced by arbitrary functions in PDEs, and determination of these arbitrary functions using subsidiary conditions is usually impossible. In other words, general solutions of ODEs are of limited use in solving PDEs. In general, the solution of a PDE and its subsidiary conditions (called initial or boundary conditions) needs to be determined simultaneously. Hence, it is wise to try to simplify the PDE model as far as possible when dealing with data analysis situations.

In a lumped parameter system, the elements are small enough (or the objective of the analysis is such that simplification is warranted) that each element can be treated as if it were concentrated (i.e., lumped) at one particular spatial point in the system. The position of the point can change with time but not in space. Such systems are usually adequately modeled by ODEs or difference equations.

A heated billet cooling in air could be analyzed as either a distributed system or a lumped parameter system depending on whether the Biot number (Bi) is greater than or less than 0.1 (Fig. 1.9). Recall that the Biot number is proportional to the ratio of the internal to the external heat flow resistances of the sphere. So, a small Biot number implies that the resistance to heat flow attributed to the internal body temperature gradient is small enough that it can be neglected without biasing the analysis. Thus, a small body with high thermal conductivity and low convection coefficient can be adequately modeled as a lumped system.

Fig. 1.9 Cooling of a solid sphere in air can be modeled as a lumped model provided the Biot number Bi < 0.1. This number is proportional to the ratio of the heat conductive resistance (1/k) inside the sphere to the convective resistance (1/h) from the outer envelope of the sphere to the air

Another example of lumped model representation is the 1-D heat flow through the wall of a building (Fig. 1.10a) using the analogy between heat flow and electricity flow. The internal and external convective film heat transfer coefficients are represented by hi and ho, respectively, while k, ρ, and cp are the thermal conductivity, density, and specific heat of the wall material, respectively. In the lower limit, the wall can be discretized into one lumped layer of capacitance C with two resistors, as shown by the electric network of Fig. 1.10b (referred to as a 2R1C network). In the upper limit, the network can be represented by "n" nodes (see Fig. 1.10c). The 2R1C simplification does lead to some errors, which under certain circumstances are outweighed by the convenience it provides while yielding acceptable results.

Fig. 1.10 Thermal networks to model heat flow through a homogeneous plane wall of surface area A and wall thickness Δx. (a) Schematic of the wall with the indoor and outdoor temperatures and convective heat flow coefficients. (b) Lumped model with two resistances and one capacitance (2R1C model). (c) Higher nth order model with n layers of equal thickness (Δx/n). The numerical discretization assumes all capacitances to be equal, while only the (n − 2) internal resistances (excluding the two end resistances) are taken to be equal. (From Reddy et al. 2016)

(ii) Dynamic Versus Steady-State

Dynamic models are defined as those that allow transient system or equipment behavior to be captured, with explicit recognition of the time-varying behavior of both output and input variables. The steady-state or static or zero-order model is one that assumes no time variation in its input variables (and hence no change in the output variable as well). One can also distinguish an intermediate type, referred to as quasi-static models. Cases arise when the input variables (such as

incident solar radiation on a solar hot water panel) are constantly changing at a short time scale (say, at the minute scale) while the thermal output needs to be predicted at hourly intervals only. The dynamic behavior is poorly predicted by the solar collector model at such high-frequency time scales, and so the input variables can be "time-averaged" so as to make them constant during a specific hourly interval. This is akin to introducing a "low pass filter" for the inputs. Thus, the use of quasi-static models allows one to predict the system output(s) in discrete time-variant steps or intervals during a given day, with the system inputs averaged (or summed) over each of the time intervals fed into the model. These models can be either zero-order or low-order ODEs.

Dynamic models are usually represented by PDEs, or by ODEs in time when spatially lumped. One could solve them directly, and simple cases are illustrated in Sect. 1.4.5. Since solving these equations gets harder as the order of the model increases, it is often more convenient to recast the differential equations in a time-series formulation using response functions or transfer functions, which involve time-lagged values of the input variable(s) only, or of both the inputs and the response, respectively. This formulation is discussed in Chap. 8. The steady-state or static or zero-order model is one that assumes no time variation in its inputs or outputs. Its time-series formulation results in simple algebraic equations with no time-lagged values of the input variable(s) appearing in the function.
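The "low pass filter" idea behind quasi-static models amounts to averaging the fast-varying inputs over each prediction interval. A sketch with a synthetic minute-scale signal standing in for measured solar radiation (the signal shape is purely illustrative):

```python
import math

# Two hours of synthetic 1-minute data: a mean level with a fast oscillation
# standing in for short-time-scale fluctuations of solar radiation (W/m^2)
signal = [800 + 100 * math.sin(2 * math.pi * t / 15) for t in range(120)]

# Quasi-static treatment: average each hour's inputs into a single constant value
hourly = [sum(signal[h * 60:(h + 1) * 60]) / 60 for h in range(2)]
print([round(x, 1) for x in hourly])
```

The minute-scale oscillation averages out, leaving the hourly mean level that would be fed to the quasi-static model.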


(iii) Deterministic Versus Stochastic

A deterministic system is one whose response to specified inputs under specified conditions is completely predictable (to within a certain accuracy, of course) from physical laws. Thus, the response is precisely reproducible time and time again. A stochastic system is one where the specific output can be predicted only to within an uncertainty range, which could be due to two reasons: (i) the inputs themselves are random and vary unpredictably within a specified range of values (such as the electric power output of a wind turbine subject to gusting winds), and/or (ii) the models are not accurate (e.g., the dose–response of individuals when subject to asbestos inhalation). Concepts from probability theory are required to make predictions about the response.

The majority of observed data has some stochasticity in it, either due to measurement noise/errors or due to the random nature of the process itself. If the random element is so small as to be negligible, then the process or system can be treated in a purely deterministic framework. The orbits of the planets, though well described by Kepler's laws, have small disturbances due to other secondary effects, but Newton was able to treat them as deterministic and verify his law of gravitation. On the other hand, Brownian molecular motion is purely random and has to be treated by stochastic methods.

(iv) Continuous Versus Discrete

A continuous system is one in which all the essential variables are continuous in nature and the time over which the system operates is some interval (or intervals) of the real numbers. Usually such systems need differential equations to describe them. A discrete system is one in which all essential variables are discrete and the time over which the system operates is a finite subset of the real numbers. Such a system can be described by difference equations.
In most applications in engineering, the system or process being studied is fundamentally continuous. However, the continuous output signal from a system is usually converted into a discrete signal by sampling. Alternatively, the continuous system can be replaced by its discrete analog that, of course, has a discrete signal. Hence, analysis of discrete data is usually more widespread in data analysis applications.
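The replacement of a continuous system by its discrete analog can be sketched as follows; the first-order system, its time constant, and the sampling interval are illustrative assumptions, with the difference equation obtained by simple forward-Euler sampling:

```python
# Discrete analog of the continuous first-order system  tau*dy/dt + y = K*u(t),
# obtained by sampling with interval dt (a forward-Euler difference equation):
#   y[k+1] = y[k] + (dt/tau) * (K*u[k] - y[k])
# tau, K, and dt below are illustrative values, not from the text.

def simulate_discrete(u, y0=0.0, tau=5.0, K=2.0, dt=0.1):
    y = [y0]
    for uk in u:
        y.append(y[-1] + (dt / tau) * (K * uk - y[-1]))
    return y

# A constant unit input: the discrete system settles at K * u = 2.0,
# the same steady state as the underlying continuous system
response = simulate_discrete([1.0] * 1000)
print(round(response[-1], 3))  # -> 2.0
```

The sampled sequence u[k] plays the role of the discrete signal discussed above.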

1  Mathematical Models and Data Analysis

Fig. 1.11 Principle of superposition of a linear system

(v) Linear Versus Nonlinear

A system is said to be linear if, and only if, it has the following property: if an input x1(t) produces an output y1(t), and if an input x2(t) produces an output y2(t), then an input [c1 x1(t) + c2 x2(t)] produces an output [c1 y1(t) + c2 y2(t)] for all pairs of inputs x1(t) and x2(t) and all pairs of real number constants c1 and c2. This concept is illustrated in Fig. 1.11. An equivalent concept is the principle of superposition, which states that the response of a linear system due to several inputs acting simultaneously is equal to the sum of the responses of each input acting alone. This is an extremely important concept since it allows the response of a complex system to be determined more simply by decomposing the input driving function into simpler terms, solving the equation for each term separately, and then summing the individual responses to obtain the desired aggregated response. Such a strategy is common in detailed hour-by-hour building energy simulation programs (Reddy et al. 2016).

An important distinction needs to be made between a linear model and a model that is linear in its parameters. For example:

• y = ax1 + bx2 is linear in both model and parameters a and b.
• y = a sin x1 + bx2 is a nonlinear model but is linear in its parameters.
• y = a exp(bx1) is nonlinear in both model and parameters.

In all fields, linear differential or difference equations are by far more widely used than nonlinear equations. Even if the models are nonlinear, every attempt is made, due to the subsequent convenience it provides, to make them linear either by suitable transformation (such as a logarithmic transform) or by piece-wise linearization, i.e., linear approximation over a smaller range of variation. The advantages of linear systems over nonlinear systems are many:

• Linear systems are simpler to analyze.
• General theories are available to analyze them.
• They do not have singular solutions (simpler engineering problems rarely have them anyway).
• Well-established methods are available, such as the state space approach (see Sect. 10.4.4), for analyzing even relatively complex sets of equations. The practical advantage of this type of time domain transformation is that large systems of higher-order ODEs can be transformed into a first-order system of simultaneous equations that, in turn, can be solved rather easily by numerical methods.

(vi) Time Invariant Versus Time Variant

A system is time invariant or stationary if neither the form of the equations characterizing the system nor the model parameters vary with time under either constant or varying inputs; otherwise the system is time-variant or non-stationary. In some cases, when the model structure is poor and/or when the data are very noisy, time-variant models are used, requiring either on-line or off-line updating depending on the frequency of the input forcing functions and how quickly the system responds. Examples of such instances abound in electrical engineering applications. Usually, one tends to encounter time-invariant models in less complex thermal and environmental engineering applications.

(vii) Homogeneous Versus Non-Homogeneous

If there are no external inputs and the system behavior is determined entirely by its initial conditions, then the system is called homogeneous or unforced or autonomous; otherwise it is called non-homogeneous or forced. Consider the general form of an nth order time-invariant or stationary linear ODE:

Ay(n) + By(n−1) + . . . + My″ + Ny′ + Oy = P(x)    (1.6)

where y′, y″, and y(n) are the first, second, and nth derivatives of y with respect to x, and A, B, . . . M, N, and O are constants. The function P(x) frequently corresponds to some external influence on the system and is a function of the independent variable. Often, the independent variable is the time variable t. This is intentional since time comes into play when the dynamic behavior of most physical systems is modeled. However, the variable x can be assigned any other physical quantity as appropriate. To completely specify the problem, i.e., to obtain a unique solution y(x), one needs to specify two additional factors: (i) the interval of x over which a solution is desired, and (ii) a set of n initial conditions. If these conditions are such that y(x) and its (n − 1) derivatives are specified for x = 0, then the problem is called an initial value problem. Thus, one distinguishes between:

(a) The homogeneous form where P(x) = 0, i.e., there is no external driving force. The solution of the differential equation:

Ay(n) + By(n−1) + . . . + My″ + Ny′ + Oy = 0    (1.7)

yields the free response of the system. The homogeneous solution is a general solution whose arbitrary constants are then evaluated using the initial (or boundary) conditions, thus making it unique to the situation.

(b) The non-homogeneous form where P(x) ≠ 0 and Eq. 1.6 applies. The forced response of the system is associated with the case when all the initial conditions are identically zero, i.e., y(0), y′(0), . . . y(n−1)(0) are all zero. Thus, the implication is that the forced response is only dependent on the external forcing function P(x).

The total response of the linear time-invariant ODE is the sum of the free response and the forced response (thanks to the superposition principle). When system control is being studied, slightly different terms are often used to specify total dynamic system response: (a) the steady-state response is that part of the total response that does not approach zero as time approaches infinity, and (b) the transient response is that part of the total response that approaches zero as time approaches infinity.

(viii) Simulation Versus Performance-Based Models

The distinguishing trait between simulation and performance models is the basis on which the model structure is framed (this categorization is quite important). Simulation models are used to predict system performance during the design phase when no actual system exists, and design alternatives are being evaluated. A performance-based model (also referred to as a “stochastic data model”) relies on measured performance data of the actual system to provide insights into model structure and to estimate its parameters or to simply predict future performance (such models are referred to as “algorithmic models”). Both these model approaches are discussed in Sects. 1.5 and 1.6.
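The superposition property underlying the linear-system results above can be checked numerically; this sketch uses an illustrative first-order linear ODE integrated by forward Euler (the parameter a and the two inputs are arbitrary choices, not from the text):

```python
# Numerical check of the superposition principle for a linear system,
# here the illustrative first-order ODE  dy/dt = -a*y + u(t)  with y(0) = 0,
# integrated by forward Euler. The parameter a and the inputs are arbitrary.

def response(u, a=0.5, dt=0.01, steps=2000):
    """Forward-Euler solution of dy/dt = -a*y + u(t), starting from y(0) = 0."""
    y, out = 0.0, []
    for k in range(steps):
        y += dt * (-a * y + u(k * dt))
        out.append(y)
    return out

u1 = lambda t: 1.0                     # step input
u2 = lambda t: t                       # ramp input
c1, c2 = 2.0, 3.0                      # arbitrary constants

# Response to c1*u1 + c2*u2 versus c1*(response to u1) + c2*(response to u2)
combined = response(lambda t: c1 * u1(t) + c2 * u2(t))
superposed = [c1 * p + c2 * q for p, q in zip(response(u1), response(u2))]
max_err = max(abs(x - y) for x, y in zip(combined, superposed))
print(max_err < 1e-8)  # -> True
```

For a nonlinear system (e.g., with a y² term), the two trajectories would not agree, which is why superposition-based decompositions apply only to linear models.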

1.4.5 Steady-State and Dynamic Models

Let us illustrate steady-state and dynamic system responses using the example of measurement sensors. Steady-state models (also called zero-order models) apply when input variables (and hence, the output variables) are maintained constant. A zero-order model for the dynamic performance of measuring systems is used (i) when the variation in the quantity to be measured is very slow as compared to how quickly the instrument responds, or (ii) as a standard of comparison for other more sophisticated models. For a zero-order instrument, the output is directly proportional to the input (Doebelin 1995):

a0 qo = b0 qi    (1.8a)

Or

qo = K qi    (1.8b)

where a0 and b0 are the system parameters, assumed time invariant, qo and qi are the output and the input quantities, respectively, and K = b0/a0 is called the static sensitivity of the instrument. Hence, only K is required to completely specify the response of the instrument. Thus, the zero-order instrument is an ideal instrument: no matter how rapidly the measured variable changes, the output signal faithfully and instantaneously reproduces the input. The next step in complexity is the first-order model:

a1 (dqo/dt) + a0 qo = b0 qi    (1.9a)

Or

τ (dqo/dt) + qo = K qi    (1.9b)

where τ is the time constant of the instrument (τ = a1/a0), and K is the static sensitivity of the instrument. Thus, two numerical parameters are used to completely specify a first-order instrument. The solution to Eq. 1.9b for a step change in input is:

qo(t) = K qis [1 − e^(−t/τ)]    (1.10)

where qis is the value of the input quantity after the step change. After a step change in the input, the steady-state value of the output will be K times the input qis (just as in the zero-order instrument). This is shown as a dotted horizontal line in Fig. 1.12 with a numerical value of 20. The time constant characterizes the speed of response; the smaller its value, the faster the response, and vice versa, to any kind of input. Figure 1.12 illustrates the dynamic response and the associated time constants for two instruments when subject to a step change in the input. Numerically, the time constant represents the time taken for the response to reach 63.2% of its final change, or to reach a value within 36.8% of the final value. This follows from Eq. 1.10 by setting t = τ, i.e., qo(t)/(K qis) = (1 − e^(−1)) = 0.632. Another useful measure of response speed for any instrument is the 5% settling time, i.e., the time for the output signal to get to within 5% of the final value. For any first-order instrument, it is equal to about three times the time constant.

Fig. 1.12 Step-responses of two first-order instruments with different response times on a plot with instrument reading (y-axis) versus time (x-axis). The response is characterized by the time constant, which is the time for the instrument reading to reach 63.2% of the steady-state value
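The behavior summarized above (63.2% of the change at t = τ, 5% settling time near 3τ) can be reproduced directly from Eq. 1.10; K, qis, and τ below are illustrative values:

```python
import math

# Step response of a first-order instrument, Eq. 1.10:
#   qo(t) = K * qis * (1 - exp(-t/tau))
# K, qis, and tau are illustrative values, not from the text.
K, qis, tau = 2.0, 10.0, 4.0
steady_state = K * qis           # final value (the dotted line in Fig. 1.12)

def qo(t):
    return K * qis * (1.0 - math.exp(-t / tau))

# At t = tau the output has completed 63.2% of its total change:
print(round(qo(tau) / steady_state, 3))   # -> 0.632

# 5% settling time, found by scanning forward in small steps; it comes out
# at roughly three time constants, as stated above:
t = 0.0
while steady_state - qo(t) > 0.05 * steady_state:
    t += 0.01 * tau
print(round(t / tau, 2))                  # -> 3.0
```

The settling-time result is independent of K, qis, and τ, since it follows from exp(−3) ≈ 0.05.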

1.5 Mathematical Modeling Approaches

1.5.1 Broad Categorization

As shown in Fig. 1.13, one can differentiate between two broad types of mathematical approaches meant for: (a) Simulation or forward (or well-defined or well-specified) problems. (b) Inverse problems, which include calibrated forward models and statistical models identified primarily from data, which can be black-box or gray-box. The latter can again be designated as partial gray-box, i.e., inspired from partial understanding, or mechanistic gray-box or physics-based reduced-order structural models.

1.5.2 Simulation or Forward Modeling

Simulation or forward or white-box or detailed mechanistic models are based on the laws of physics and permit accurate and microscopic modeling of the various fluid flow, heat and mass transfer phenomena, etc. that occur within engineered systems. Students are generally familiar with the algebraic and differential equations used to represent the temporal (or transient or dynamic) and spatial performance of numerous equipment, subsystems, and systems in terms of algebraic, ordinary, and partial differential equations (ODE and PDE, respectively), and with how to solve them analytically or numerically (sequentially or iteratively) over different time periods and durations (from minutes, to hours, days, and years). A high level of physical understanding is necessary to develop these models, complemented with some


Fig. 1.13 Overview of various traditional mathematical modeling approaches with applications

expertise in numerical analysis. Consequently, these have found their niche in design studies prior to building a system where the effect of different system configurations needs to be evaluated under different operating conditions and scenarios. Adopting the model specified by Eq. 1.1 and Fig. 1.8, such problems are framed as:

Given {U, W} and {B, C}, determine Y    (1.11)

The objective is to predict the response or state variables of a specified model with known structure and known parameters when subject to specified input or forcing variables. This is also referred to as the “well-defined problem” since it has a unique solution if formulated properly (in other words, the degree of freedom is zero). Such models are implicitly studied in classical mathematics and also in system simulation design courses. For example, consider a simple steady-state problem wherein the operating point of a pump and piping network are represented by black-box models of the pressure drop (Δp) and volumetric flow rate (V), such as shown in Fig. 1.14:

Δp = a1 + b1 V + c1 V²    for the pump
Δp = a2 + b2 V + c2 V²    for the pipe network    (1.12)
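For concreteness, the operating point of Eq. 1.12 can be computed by equating the two curves and solving the resulting quadratic; the coefficients {ai, bi, ci} below are hypothetical, chosen only to illustrate the solution:

```python
import math

# Operating point of Eq. 1.12: intersection of the pump curve and the system
# curve. The coefficients below are hypothetical, for illustration only.
a1, b1, c1 = 60.0, 0.0, -0.4    # pump curve:    dp = 60 - 0.4*V^2
a2, b2, c2 = 10.0, 0.0, 0.6     # network curve: dp = 10 + 0.6*V^2

# Setting the two curves equal gives (c1-c2)*V^2 + (b1-b2)*V + (a1-a2) = 0
A, B, C = c1 - c2, b1 - b2, a1 - a2
disc = B * B - 4 * A * C
roots = [(-B + s * math.sqrt(disc)) / (2 * A) for s in (1.0, -1.0)]
V0 = max(r for r in roots if r > 0)       # physically meaningful (positive) flow
dp0 = a1 + b1 * V0 + c1 * V0 ** 2         # pressure drop at the operating point

print(round(V0, 3), round(dp0, 3))  # -> 7.071 40.0
```

Both curves return the same Δp at V0, which is exactly the condition defining the operating point (Δp0, V0).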

Solving the two equations simultaneously yields the performance condition of the operating point, i.e., pressure drop and flow rate (Δp0, V0).

Fig. 1.14 Example of a forward problem where solving two simultaneous equations, one representing the pump curve and the other the system curve, yields the operating point

Note that the numerical values of the model parameters {ai, bi, ci} are known, and that Δp and V are the two variables, while the two equations provide the two constraints. This simple example has obvious extensions to the solution of differential equations where the combined spatial and temporal response is sought. In order to ensure accuracy of prediction, the models have tended to become increasingly complex, especially with the advent of powerful and inexpensive computing power. The divide-and-conquer mind-set is prevalent in this approach, often with detailed mathematical equations based on scientific laws used to model micro-elements of the complete system. This approach presumes detailed knowledge of not only the various natural phenomena affecting system behavior but also of the magnitude of various interactions (e.g., heat and mass transfer coefficients, friction coefficients, etc.). The main advantage of this approach is that the system need not be physically built in order to predict its behavior. Thus, this approach is ideal in the preliminary design and analysis stage and is most often employed as such. Note that incorporating superfluous variables and needless modeling details does increase computing time and complexity in the numerical resolution. However, if done correctly, it does not usually compromise the accuracy of the solution obtained.

Example 1.5.1 Simulation of a chiller. Consider an example of simulating a chilled water cooling plant consisting of the condenser, compressor, and evaporator, as shown in Fig. 1.15.⁵ Simple black-box models are used for easier comprehension. The steady-state cooling capacity qe (in kWt⁶) and the compressor electric power draw P (in kWe) are functions of the refrigerant evaporator temperature te and the refrigerant condenser temperature tc in °C, and are supplied by the equipment manufacturer:

Fig. 1.15 Schematic of the cooling plant for Example 1.5.1

qe = 239.5 + 10.073 te − 0.109 te² − 3.41 tc − 0.00250 tc² − 0.2030 te tc + 0.00820 te² tc + 0.0013 te tc² − 0.000080005 te² tc²    (1.13)

and

P = −2.634 − 0.3081 te − 0.00301 te² + 1.066 tc − 0.00528 tc² − 0.0011 te tc − 0.000306 te² tc + 0.000567 te tc² + 0.0000031 te² tc²    (1.14)

Another equation needs to be introduced for the heat rejected at the condenser qc (in kWt). This is simply given by a heat balance of the system (i.e., from the first law of thermodynamics) as:

qc = qe + P    (1.15)

The forward problem would entail determining the unknown values of Y = {te, tc, qe, P, qc}. Since there are five unknowns, five equations are needed. In addition to the three equations above, two additional ones are required. They are the heat transfer equations at the evaporator and the condenser between the refrigerant (assumed to be changing phase and so at a constant temperature) and the circulating water:

qe = me cp (ta − te) [1 − exp(−UAe/(me cp))]    (1.16)

and

qc = mc cp (tc − tb) [1 − exp(−UAc/(mc cp))]    (1.17)

where cp is the specific heat of water = 4.186 kJ/kg K. Further, values of parameters are specified:

⁵ From Stoecker (1989), by permission of McGraw-Hill.
⁶ kWt denotes that the units correspond to thermal energy while kWe to energy in electric units.

• Water flow rate through the evaporator, me = 6.8 kg/s, and through the condenser, mc = 7.6 kg/s


• Thermal conductance of the evaporator, UAe = 30.6 kW/K, and that of the condenser, UAc = 26.5 kW/K
• Inlet water temperature to the evaporator, ta = 10 °C, and that to the condenser, tb = 25 °C

Solving Eqs. 1.13–1.17 results in:

te = 2.84 °C, tc = 34.05 °C, qe = 134.39 kW, and P = 28.34 kW

To summarize, the performance of the various equipment and their interaction have been represented by mathematical equations, which allow a single solution set to be determined. This is the case of the well-defined forward problem adopted in system simulation and design studies.

There are instances when the same system could be subject to the inverse model approach. Consider the case when a cooling plant similar to that assumed above exists, and the facility manager wishes to instrument the various components in order to: (i) verify that the system is performing adequately, and (ii) vary some of the operating variables so that the power consumed by the compressor is reduced. In such a case, the numerical model coefficients given in Eqs. 1.13 and 1.14 will be unavailable, and so will be the UA values, since either he is unable to find the manufacturer-provided models in his documents or the equipment has degraded somewhat such that the original models are no longer accurate. The model calibration will involve determining these values from experimental data gathered by appropriately sub-metering the evaporator, condenser, and compressor on both the refrigerant and the water coolant side. How best to make these measurements, how accurate the instrumentation should be, what the sampling frequency should be, for how long one should monitor, etc. are all issues that fall within the purview of the design of field monitoring.
Uncertainty in the measurements, as well as the fact that the assumed models are approximations of reality, will introduce model prediction errors, and so the verification of the actual system against measured performance will have to consider such aspects realistically.
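A minimal sketch of how Example 1.5.1 can be solved numerically is given below, using successive substitution with under-relaxation (the iteration scheme and starting guesses are choices of this sketch, not of the text; a library root-finder would serve equally well):

```python
import math

# Numerical solution of Example 1.5.1 (Eqs. 1.13-1.17) by successive
# substitution with under-relaxation.

cp = 4.186                # specific heat of water, kJ/kg K
me, mc = 6.8, 7.6         # evaporator and condenser water flow rates, kg/s
UAe, UAc = 30.6, 26.5     # thermal conductances, kW/K
ta, tb = 10.0, 25.0       # inlet water temperatures, deg C

def qe_model(te, tc):     # Eq. 1.13: cooling capacity, kWt
    return (239.5 + 10.073*te - 0.109*te**2 - 3.41*tc - 0.00250*tc**2
            - 0.2030*te*tc + 0.00820*te**2*tc + 0.0013*te*tc**2
            - 0.000080005*te**2*tc**2)

def p_model(te, tc):      # Eq. 1.14: compressor power, kWe
    return (-2.634 - 0.3081*te - 0.00301*te**2 + 1.066*tc - 0.00528*tc**2
            - 0.0011*te*tc - 0.000306*te**2*tc + 0.000567*te*tc**2
            + 0.0000031*te**2*tc**2)

# Heat-exchanger effectiveness factors from Eqs. 1.16 and 1.17
fe = 1.0 - math.exp(-UAe / (me * cp))
fc = 1.0 - math.exp(-UAc / (mc * cp))

te, tc = 5.0, 30.0        # initial guesses, deg C
for _ in range(200):
    qe = qe_model(te, tc)
    qc = qe + p_model(te, tc)           # Eq. 1.15: heat balance
    te_new = ta - qe / (me * cp * fe)   # Eq. 1.16 solved for te
    tc_new = tb + qc / (mc * cp * fc)   # Eq. 1.17 solved for tc
    te = 0.5 * te + 0.5 * te_new        # under-relax to damp oscillation
    tc = 0.5 * tc + 0.5 * tc_new

qe, P = qe_model(te, tc), p_model(te, tc)
print(round(te, 2), round(tc, 2), round(qe, 1), round(P, 2), round(qe + P, 1))
# should reproduce the text's solution closely:
# te = 2.84 C, tc = 34.05 C, qe = 134.39 kW, P = 28.34 kW
```

Under-relaxation (the 0.5 weighting) is used because the raw substitution map oscillates; with it, the iteration settles well within the 200 sweeps allotted.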


1.5.3 Inverse Modeling

It is rather difficult to succinctly define inverse problems since they apply to different classes of problems with applications in diverse areas, each with their own terminology and viewpoints (it is no wonder that the field suffers from the “blind men and the elephant” syndrome). Generally speaking, inverse problems are those that involve identification of model structure (system identification) and/or estimates of model parameters where the system under study already exists, and one uses measured or observed system behavior to aid in the model building and/or refinement. Different model forms may capture the data trend; this is why some argue that inverse problems can be referred to as “ill-defined” or “ill-posed.” Following Eq. 1.1 and Fig. 1.8, inverse problems can be conceptually framed as either:

– Parameter estimation problems:

given {Y, U, W, d}, determine {A, B, C}    (1.18)

– Control models:

given {Y″} and {A, B, C}, determine {U, W, d}    (1.19)

where Y″ is meant to denote that only limited measurements may be available for the state variable. Typically, one (i) takes measurements of the various parameters (or regressor variables) affecting the output (or response variables) of a device or a phenomenon, (ii) identifies a quantitative correlation between them by regression, and (iii) uses it to make predictions about system behavior under future operating conditions. As shown in Fig. 1.13, one can distinguish between black-box and gray-box approaches in terms of how the quantitative correlation is identified. Table 1.2 summarizes different characteristics of both these approaches and those of the simulation modeling approach.

(i) Black-box approach identifies a simple mathematical function between response and regressor variables assuming that either (i) nothing is known about the

Table 1.2 Characteristics of different types of modeling approaches

Approach              Time variation of          Model type                 Physical        Types of
                      system inputs/outputs                                 understanding   equations
--------------------  -------------------------  -------------------------  --------------  -----------
Simulation            Dynamic, quasi-static      White-box (detailed        High            PDE, ODE,
                                                 mechanistic)                               algebraic
Mechanistic-inverse   Dynamic, quasi-static,     Gray-box (semi-physical,   Medium          ODE,
                      steady-state               reduced-order)                             algebraic
Empirical-inverse     Static or steady-state     Black-box (curve-fit)      Low             Algebraic

ODE ordinary differential equation, PDE partial differential equation


innards or the inside workings of the system, or (ii) there is very little/partial understanding of system behavior. The system functioning is opaque to the user.

(ii) Gray-box approach uses the physics-based understanding of system structure and functioning to identify a mathematical model and deduce physically relevant system parameters from measured response and input variables. The resulting models are usually reduced-order, i.e., lumped models based on first-order ODEs or algebraic equations. Such system parameters (e.g., overall heat loss coefficient, time constant, etc.) can serve to improve our mechanistic understanding of the phenomenon or system behavior. The mechanistic model structure can make more accurate predictions and provide better control capability. The identification of these models that combine phenomenological plausibility with mathematical simplicity generally requires both a good understanding of the physical phenomenon or of the systems/equipment being modeled and a competence in statistical methods.

These analysis methods were initially proposed several hundred years back and have seen major improvements over the years (especially during the last hundred years), resulting in a rich literature in this area with great diversity of techniques and levels of sophistication. Traditionally, the distinction was made between parametric and non-parametric methods, and these are discussed in Chaps. 5 and 9. With the advent of computing power, resampling methods have been gaining increasing importance/popularity because they (i) provide additional flexibility in model building and in predictive accuracy, (ii) offer more robustness in estimating the errors in both model coefficients and in predictive ability, and (iii) are able to handle much larger sets of regressor variables than traditional parametric modeling. These are discussed in Chap. 5.
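As a concrete illustration of the black-box approach, the sketch below identifies a quadratic function between a response and a regressor by ordinary least squares; the data are synthetic and the "true" coefficients hypothetical:

```python
# Black-box curve fit: identify y = a + b*x + c*x^2 from observed (x, y) pairs
# by ordinary least squares (normal equations). The coefficients used to
# synthesize the data are hypothetical, for illustration only.

def fit_quadratic(xs, ys):
    """Solve the 3x3 normal equations for [a, b, c] by Gaussian elimination."""
    # Build X^T X and X^T y for the design matrix with columns [1, x, x^2]
    S = [sum(x**k for x in xs) for k in range(5)]          # power sums S[0..4]
    A = [[S[0], S[1], S[2]], [S[1], S[2], S[3]], [S[2], S[3], S[4]]]
    b = [sum(y * x**k for x, y in zip(xs, ys)) for k in range(3)]
    # Forward elimination to upper-triangular form
    for i in range(3):
        for j in range(i + 1, 3):
            f = A[j][i] / A[i][i]
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    # Back-substitution
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j] for j in range(i + 1, 3))) / A[i][i]
    return coeffs

# Noise-free synthetic observations from y = 60 - 2*x - 0.4*x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [60.0 - 2.0 * x - 0.4 * x**2 for x in xs]
a, b_, c = fit_quadratic(xs, ys)
print(round(a, 3), round(b_, 3), round(c, 3))  # -> 60.0 -2.0 -0.4
```

With noisy measured data, the same machinery returns the coefficients that minimize the sum of squared residuals rather than an exact recovery.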
The gray-box approach requires context-specific approximate numerical or analytical solutions for linear and nonlinear problems and often involves model selection and parameter estimation as well. Ill-conditioning, i.e., a solution that is extremely sensitive to the data (see Sect. 9.2), often arises due to the repetitive nature of the data collected while the system is under normal operation. There is a rich and diverse body of knowledge on inverse methods applied to physical systems, and numerous textbooks, monographs, and research papers are available on this subject. Chapter 10 addresses these problems at more length.

In summary, different models and parameter estimation techniques need to be adopted depending on whether: (i) the intent is to subsequently predict system behavior within the temporal and/or spatial range of input variables—in such cases, simple and well-known methods such as curve fitting


Fig. 1.16 Example of a parameter estimation problem where the model parameters of a presumed function of pressure drop versus volume flow rate are identified from discrete experimental data points

may suffice (Fig. 1.16); (ii) the intent is to subsequently understand/predict/control system behavior outside the temporal and/or spatial range of input variables—in such cases, physically based models are generally more appropriate.

Example 1.5.2 Dose–response models. An example of how model-driven methods differ from a straightforward curve fit is given below (the same example is treated in more depth in Sect. 10.2.6). Consider the case of models of risk to humans when exposed to toxins (or biological poisons), which are extremely deadly even in small doses. Dose is the total mass of toxin that the human body ingests. Response is the measurable physiological change in the body produced by the toxin, which can have many manifestations; here, the focus is on human cells becoming cancerous. There are several aspects to this problem relevant to inverse modeling: (i) Can the observed data of dose versus response provide some insights into the process that induces cancer in biological cells? (ii) How valid are these results extrapolated down to low doses? (iii) Since laboratory tests are performed on animal subjects, how valid are these results when extrapolated to humans?

The manner in which one chooses to extrapolate the dose–response curve downward depends on either the assumption one makes regarding the basic process itself or how one chooses to err (which has policy-making implications). For example, erring too conservatively in terms of risk would overstate the risk and prompt implementation of more precautionary measures, which some critics would fault as unjustified and improper use of limited resources. There are no simple answers to these queries (until the basic process itself is completely understood). There is yet another major issue. Since different humans (and test animals) react differently to the same dose, the response is often interpreted as a probability of cancer being induced, which can be framed as a risk. Probability is


bound to play an important role in the nature of the process, and hence the adoption by various agencies (such as the U.S. Environmental Protection Agency) of probabilistic methods toward risk assessment and modeling. Figure 1.17 illustrates three methods of extrapolating dose–response curves down to low doses (Heinsohn and Cimbala 2003). The dots represent observed laboratory tests performed at high doses. Three types of models are fit to the data. They all agree at high doses; however, they deviate substantially at low doses because the models are functionally different. While model I is a nonlinear model applicable to highly toxic agents, curve II is generally taken to apply to contaminants that are quite harmless at low doses (i.e., the body is able to metabolize the toxin at low doses). Curve III is an intermediate one between the other two curves.

Fig. 1.17 Three different inverse models depending on toxin type for extrapolating dose–response observations at high doses to the response at low doses. (From Heinsohn and Cimbala 2003, by permission of CRC Press)

The above models are somewhat empirical (or black-box) and provide little understanding of the basic process itself. Models based on simplified but phenomenological considerations of how biological cells become cancerous have also been developed, and these are described in Sect. 10.2.6.

1.5.4 Calibrated Simulation

The calibrated simulation approach can be viewed as a hybrid of the forward and inverse methods (refer to Fig. 1.13). Here one uses a mechanistic model originally developed for the purpose of system simulation, and modifies or “tunes” the numerous model parameters so that model predictions match observed system behavior as closely as possible. Often, only a subset or limited number of measurements of system states and forcing function values are available, resulting in a highly over-parameterized problem with more than one possible solution. Following Eq. 1.1, such inverse problems can be framed as:

given {Y″, U″, W″, d″}, determine {A″, B″, C″}    (1.20)

where the ″ notation is used to represent limited measurements or a reduced parameter set. Example 1.5.1 is a simple simulation or forward model with explicit algebraic equations for each component with no feedback loops. Detailed simulation programs are much more complex (with hundreds of variables, complex interactions and boundary conditions, etc.) involving ODEs or PDEs; one example is computational fluid dynamic (CFD) models for indoor air quality studies. Calibrating such models is extremely difficult given the lack of proper instrumentation to measure detailed spatial and temporal fields, and the inability to conveniently compartmentalize the problem so that inputs and outputs of sub-blocks could be framed and calibrated individually as done in the cooling plant example above. Thus, in view of such limitations in the data, developing a simpler system model consistent with the data available while retaining the underlying mechanistic considerations as far as possible is a more appealing approach, albeit a challenging one. Such an approach is called the “gray-box” approach, involving inverse models (see Fig. 1.13). An in-depth discussion of calibrated simulation approaches is provided in Sect. 10.6.
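A minimal sketch of the "tuning" idea: treat the evaporator heat transfer relation of Example 1.5.1 as the simulation model and calibrate its conductance UA against (synthetic) measured data; all numbers and the grid-search scheme are illustrative choices of this sketch:

```python
import math

# "Tuning" a simulation parameter to match measured data: the evaporator
# relation of Eq. 1.16 is treated as the simulation model and its conductance
# UA is calibrated against synthetic "measured" qe values by grid search.
# All numbers are illustrative.

cp, me = 4.186, 6.8    # specific heat (kJ/kg K) and water flow rate (kg/s)

def qe_sim(UA, ta, te):
    return me * cp * (ta - te) * (1.0 - math.exp(-UA / (me * cp)))

# Synthesize measurements with a "true" UA of 30.6 kW/K, then pretend unknown
operating_points = [(10.0, 2.0), (11.0, 3.0), (12.0, 2.5)]   # (ta, te) pairs
measured_qe = [qe_sim(30.6, ta, te) for ta, te in operating_points]

def sse(UA):
    """Sum of squared errors between simulated and measured qe."""
    return sum((qe_sim(UA, ta, te) - q) ** 2
               for (ta, te), q in zip(operating_points, measured_qe))

candidates = [20.0 + 0.1 * k for k in range(201)]            # 20.0 ... 40.0 kW/K
UA_best = min(candidates, key=sse)
print(round(UA_best, 1))  # -> 30.6
```

Real calibration problems tune many parameters at once against noisy data, which is precisely what makes them over-parameterized and non-unique, as noted above.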

1.6 Data Analytic Approaches

Several authors, for example Sprent (1998), use terms such as (i) data-driven models to imply those that are suggested by the data at hand and commensurate with knowledge about system behavior—this is somewhat akin to our definition of black-box models discussed above, and (ii) model-driven approaches as those that assume a pre-specified model and use the data to determine the model parameters; this is synonymous with gray-box inverse models. Data-driven or statistical methods have been traditionally separated into black-box and gray-box approaches. An alternate view (Fig. 1.18), reflective of current thinking, is to distinguish between traditional stochastic methods and data analytic methods (Breiman 2001 goes so far as to refer to these as different cultures). The traditional methods consisting of parametric, non-parametric, and resampling methods (which have been introduced earlier) fall under

Fig. 1.18 An overview of different approaches under the broad classification of “traditional stochastic” methods and “data analytics” methods

“stochastic data modeling” and were meant primarily to understand and predict system behavior. In the last few decades there has been an explosion of other analysis methods, loosely called data analytic methods, which are less based on statistics and probability and more on data exploration and computer-based learning algorithms. The superiority of such algorithmic⁷ methods is manifest in situations where patterns in the data are too complex for humans to grasp or learn. The algorithmic approach is better suited for learning or knowledge discovery or gaining insight into important relationships/correlations (i.e., finding patterns in data) while providing superior predictive capability and control of nonlinear systems under complex situations. Several statisticians (e.g., Breiman 2001) argue the need to emphasize this approach given its diversity in application and its ability to make better use of large data sets in the real world. This book deals largely with the traditional stochastic modeling methods, with only Chap. 11 devoted to data analytic methods.

1.6.1 Data Mining or Knowledge Discovery

Data mining (DM) is defined as the science of extracting useful information from large or enormous data sets; that is why it is also referred to as knowledge discovery. The associated suite of approaches was developed largely in fields outside statistics. Though DM is based on a range of techniques, from the very simple to the sophisticated (involving such methods as clustering, classification, anomaly detection, etc.), it has the distinguishing feature that it is concerned with sifting through large amounts of data with no clear aim in mind except to discern hidden information;

7 The terminology follows Breiman (2001), who argues that algorithmic modeling, rather than traditional stochastic approaches, is much better suited to tackling modern-day problems, which involve complex systems and decision-making behavior based on numerous factors, variables, and large sets of informational data.


discover patterns, associations, and trends; or summarize data behavior (Dunham 2003). Thus, its distinctiveness lies not only in the data management problems associated with storing and retrieving large amounts of data from perhaps multiple data sets, but also in its being much more exploratory and less formalized in nature than statistics and model building, where one analyzes a relatively small data set with some specific objective in mind. Data mining has borrowed concepts from several fields, such as multivariate statistics and Bayesian theory, as well as less formalized ones such as machine learning, artificial intelligence, pattern recognition, and data management, so as to bound its own area of study and define the specific elements and tools involved. It is the result of the digital age, where enormous digital databases abound, from the mundane (supermarket transactions, credit card records, telephone calls, Internet postings, etc.) to the very scientific (astronomical data, medical images, etc.). Thus, the purview of data mining is to explore such databases in order to find patterns or characteristics (called data discovery), or even in response to some very general research question not answerable by any previous mechanistic understanding of the social or engineering system, so that some action can be taken resulting in a benefit or value to the owner. Data mining techniques are briefly discussed in Chap. 11.
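As a concrete, if toy-scale, illustration of the kind of pattern discovery data mining performs, the following sketch applies a simple k-means clustering routine to hypothetical daily electricity-use readings; the data, the routine, and the two-cluster choice are all invented for illustration, not taken from this chapter:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D k-means: partition values around k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# hypothetical daily electricity use (kWh): weekdays near 30, weekends near 12
usage = [29, 31, 30, 28, 32, 12, 11, 30, 29, 33, 31, 30, 13, 10]
print(kmeans(usage, 2))  # two centroids, near 11.5 and 30.3
```

Without any prior labels, the routine separates the readings into a low-use (weekend) group and a high-use (weekday) group; actual data mining tools apply far more sophisticated versions of this idea to enormous databases.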

1.6.2 Machine Learning or Algorithmic Models

Machine learning (ML), or predictive algorithmic modeling, is the field of study that develops algorithms that computers follow in order to identify and extract patterns from data. The primary purpose is to develop accurate prediction models; understanding or explaining the data trends is of secondary concern (Kelleher and Tierney 2018). It has also been defined as a field of study that gives computers the ability to learn without being explicitly programmed. Thus, ML models and algorithms learn to map input to output in an iterative manner with the internal

Fig. 1.19 Examples of some infrastructure-related technological applications of big data from the perspective of different stakeholders (SMI smart metering infrastructure). (The figure shows panels for building operations, e.g., learning energy consumption in residences using ANN ensembles; electric utilities, e.g., data-mining SMI to cluster customers for rate plan modification; smart grid operations; routine city-level operations; extreme events management; and development planning for aspirational goals such as carbon neutrality and sustainable cities.)

structure changing continuously. It is especially useful when the patterns in the data are too complex for traditional statistical analysis methods to handle. The learning of ML algorithms such as neural network models improves with additional data and is adaptive to changes in environmental conditions. ML is one of the important fields in computer science. Chapter 11 presents some of the important ML algorithms such as neural networks.
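The notion of an algorithm that learns an input-to-output mapping iteratively, with its internal structure (here, just two parameters) adjusted continuously as examples are processed, can be sketched in a few lines. This is a deliberately simplified stand-in for neural-network training; the data and the underlying rule are hypothetical:

```python
# A toy "learner" for the mapping y = w*x + b: it repeatedly adjusts its
# internal parameters (w, b) to reduce the prediction error on each example,
# in the same spirit as gradient-descent training of a neural network.

def learn(samples, rate=0.01, epochs=2000):
    w, b = 0.0, 0.0                  # internal structure, updated iteratively
    for _ in range(epochs):
        for x, y in samples:
            err = (w * x + b) - y    # prediction error on this example
            w -= rate * err * x      # nudge parameters to reduce the error
            b -= rate * err
    return w, b

# examples generated from the rule y = 2x + 1 (unknown to the learner)
data = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = learn(data)
print(round(w, 3), round(b, 3))  # approaches w = 2, b = 1
```

With more data the estimates improve, and the scheme adapts if the underlying rule drifts, which is the adaptive character noted above.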

1.6.3 Introduction to Big Data

One of the major impacts on modern-day society comes from the discipline called big data or data science. It involves harnessing information in novel ways and extracting a certain level of quantification in terms of trends and behavior from large data sets. While the knowledge may not enhance fundamental understanding or insight or wisdom,8 it produces useful actionable insights on goods and services of significant value to businesses and society. "It is meant to inform rather than explain" (Mayer-Schonberger and Cukier 2013, p. 4); in that sense, DM and ML algorithms are at the core of its suite of analysis tools. The basic trait that distinguishes it from statistical learning methods is that it is based on processing huge amounts of heterogeneous multi-source data (sensors, videos, Internet searches, social media, surveys, government records, etc.), characterized by variety, large noise, huge volume, velocity (real-time streaming), data fusion from multiple disparate sources, and sensor fusion from multiple sensors. The thinking is that the size of the data set can compensate for the use of simple models and noisier non-curated data. However, there are some major concerns:

8 "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" T.S. Eliot (1934).


• Can obscure the long-term view/behavior of phenomena
• Increases the danger of false learning/beliefs and overconfidence
• Raises major issues with privacy and ethics

A whole new set of tools, procedures, and software has been developed for datafication, which is the process of transforming raw data (Internet searches and purchases, texts, visual and social media, information from phones, cars, etc.) into a quantifiable format so that it can be tabulated, stored, and analyzed. The value of the data shifts from its primary use to its potential future use. The new profession of the data scientist combines numerous traditional skills: statistician (data mining), computer scientist (machine learning), software programmer, machine learning expert, informatics specialist, planner, etc. Numerous textbooks have appeared recently on data science, for example, Kelleher and Tierney (2018). There is a debate on whether big data processing capabilities, and the trends and behavior extracted therefrom, diminish the value of the domain expert in applications that traditionally fell under the purview of statistical and engineering analysis. Nevertheless, big data offers great promise to reshape entire business sectors and to address several societal and global problems. Some examples from a technological perspective as applied to infrastructure applications are listed below and shown in Fig. 1.19:

– Improving current sensing and data analysis capabilities in individual buildings.
– Be able to analyze the behavior of large numbers of buildings/cities in terms of energy efficiency, and integrate higher renewable energy penetration with the smart grid.
– Be able to analyze the benefits and limitations of engineered infrastructures and routine operational practices on society and different entities.


– Be able to identify gaps in current societal needs, use this knowledge to develop strategies to enhance operational efficiency and decarbonization, track the implementation of these strategies, and assess the benefit once implemented.
– Be able to provide and enhance routine services as well as emergency response measures.

1.7 Data Analysis

1.7.1 Introduction

Data analysis is not performed just for its own sake; its usefulness lies in the support it provides to such objectives as gaining insight into system behavior, characterizing current system performance against a baseline, deciding whether retrofits and suggested operational changes to the system are warranted, quantifying the uncertainty in predicting future behavior of the present system, suggesting robust/cost-effective/risk-averse ways to operate an existing system, avoiding catastrophic system failure, etc. Analysis is often a precursor to decision-making in the real world.

There is another discipline with overlapping and complementary aims to those of data analysis and modeling. Risk analysis and decision-making provide both an overall paradigm and a set of tools with which decision makers can construct and analyze a model of a decision situation (Clemen and Reilly 2001). Even though it does not give specific answers to problems faced by a person, decision analysis provides structure, guidance, and analytical tools on how to logically and systematically tackle a problem, model uncertainty in different ways, and hopefully arrive at rational decisions in tune with the personal preferences of the individual who has to live with the choice(s) made. While it is applicable to problems without uncertainty but with multiple outcomes, its strength lies in being able to analyze complex multiple-outcome problems that are inherently uncertain or stochastic, compounded with the utility functions or risk preferences of the decision maker.

There are different sources of uncertainty in a decision-making process, but the one pertinent to data modeling and analysis in the context of this book is that associated with fairly well-behaved and well-understood engineering systems with relatively low uncertainty in their performance data. This is the reason why, historically, engineering students were not subjected to a class in decision analysis.
However, many engineering systems are operated wherein the attitudes and behavior of the people operating these systems assume importance; in such cases, there is a need to adapt many of the decision analysis tools and concepts and combine them with traditional data analysis and modeling techniques. This issue is addressed in Chap. 12.

1.7.2 Basic Stages

In view of the diversity of fields to which data analysis is applied, an all-encompassing definition would have to be general. One good definition is: "an evaluation of collected observations so as to extract information useful for a specific purpose." The evaluation relies on different mathematical and statistical tools depending on the intent of the investigation. In the area of science, the systematic organization of observational data, such as the orbital movement of the planets, provided a means for Newton to develop his laws of motion. Observational data from deep space allow scientists to develop/refine/verify theories and hypotheses about the structure, relationships, origins, and presence of certain phenomena (such as black holes) in the cosmos. At the other end of the spectrum, data analysis can also be viewed as simply: "the process of systematically applying statistical and logical techniques to describe, summarize, and compare data." From the perspective of an engineer/scientist, data analysis is a process that, when applied to system performance data, collected either intrusively or non-intrusively, allows certain conclusions about the state of the system to be drawn and, thereby, follow-up actions to be initiated. Studying a problem through the use of statistical data analysis usually involves four basic steps (Arsham 2008):

(a) Defining the Problem. The context of the problem and the exact definition of the problem being studied need to be framed. This allows one to design both the data collection system and the subsequent analysis procedures to be followed.

(b) Collecting the Data. In the past (say, 70 years back), collecting the data was the most difficult part and often the bottleneck of data analysis. Nowadays, one is overwhelmed by the large amounts of data resulting from the great strides in sensor and data collection technology; and data cleaning, handling, and summarizing have become major issues.
Paradoxically, the design of data collection systems has been marginalized by an apparent belief that extensive computation can make up for any deficiencies in the design of data collection. Gathering data without a clear definition of the problem often results in failure or limited success. Data can be collected from existing sources or obtained through observation and experimental studies designed to obtain new data. In an experimental study, the variable of interest is identified. Then, one or more factors in the study are controlled so that data can be obtained about how the factors influence the variables. In observational studies, no attempt is made to control or influence the variables of interest, either intentionally or due to the inability to do so (two examples are surveys and astronomical data).

(c) Analyzing the Data. There are various statistical and analysis approaches and tools that one can bring to bear depending on the type and complexity of the problem and the type, quality, and completeness of the data available. Several categories of problems encountered in data analysis are shown in Figs. 1.13 and 1.18. Probability is an important aspect of data analysis since it provides a mechanism for measuring, expressing, and analyzing the uncertainties associated with collected data and the mathematical models used. This, in turn, impacts the confidence in our analysis results: uncertainty in future system performance predictions, confidence level in our confirmatory conclusions, uncertainty in the validity of the action proposed, etc. The majority of the topics addressed in this book pertain to this category.

(d) Reporting the Results. The final step in any data analysis effort involves preparing a report. This is the written document that logically describes all the pertinent stages of the work, presents the data collected, discusses the analysis results, states the conclusions reached, and recommends further action specific to the issues of the problem identified at the onset. The final report and any technical papers resulting from it are the only documents that survive over time and are invaluable to other professionals. Unfortunately, the task of reporting is often cursory and not given its due importance.
The term "intelligent" data analysis has also come into use, with a connotation different from the traditional one (Berthold and Hand 2003). The term is used not in the sense that it involves increasing the knowledge/intelligence of the user or analyst in applying traditional tools, but that the statistical tools themselves have some measure of intelligence built into them. A simple example is when a regression model has to be identified from data. Software packages and programmable platforms are available that allow hundreds of built-in functional forms to be evaluated and a prioritized list of best-fitting models to be identified. The recent evolution of computer-intensive methods (such as bootstrapping and Monte Carlo methods) along with soft computing algorithms (such as artificial neural networks, genetic algorithms, etc.) enhances the


computational power and capability of traditional statistics, model estimation, and data analysis methods. Such added capabilities of modern-day computers and the sophisticated manner in which the software programs are written allow “intelligent” data analysis to be performed.
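A toy version of such "intelligent" model screening, in which several candidate functional forms are fitted to the same data by least squares and returned as a prioritized list, might look as follows (the candidate set and the data are hypothetical):

```python
import math

def fit_linear(xs, ys, f):
    """OLS fit of y = a + b*f(x); returns (a, b, rmse)."""
    fx = [f(x) for x in xs]
    n = len(xs)
    mx, my = sum(fx) / n, sum(ys) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(fx, ys))
         / sum((u - mx) ** 2 for u in fx))
    a = my - b * mx
    rmse = math.sqrt(sum((a + b * u - v) ** 2 for u, v in zip(fx, ys)) / n)
    return a, b, rmse

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0 * math.log(x) + 0.5 for x in xs]   # data actually follow a log law

# candidate functional forms to screen, ranked by goodness of fit (RMSE)
candidates = {"linear": lambda x: x,
              "square-root": math.sqrt,
              "logarithmic": math.log}
ranking = sorted(candidates,
                 key=lambda name: fit_linear(xs, ys, candidates[name])[2])
print(ranking[0])  # the best-fitting form: "logarithmic"
```

Commercial packages do this over hundreds of built-in forms; the principle, fit each candidate and rank by a fit statistic, is the same.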

1.7.3 Example of a Data Collection and Analysis System

Data can be separated into experimental or observational depending on whether the system operation can be modified by the observer or not. Consider a system where the initial phase of designing and installing the monitoring system is complete. Figure 1.20 is a flowchart depicting the various stages in the collection, analysis, and interpretation of data collected from an engineering thermal9 system while in operation. The various elements involved are:

(a) A measurement system consisting of various sensors of pre-specified types and accuracy. The proper location, commissioning, and maintenance of these sensors are important aspects of this element.

(b) The data sampling element, whereby the output of the various sensors is read at a pre-determined frequency. The low cost of automated data collection has led to increasingly higher sampling rates. Typical frequencies for thermal systems are in the range of 1 s–1 min.

(c) The cleaning of raw data for spikes, gross errors, mis-recordings, and missing or dead channels; the data samples are averaged (or summed) and, if necessary, stored in a dynamic fashion (i.e., online) in a central electronic database with an electronic time stamp.

(d) The averaging of raw data and storing in a database; typical periods are in the range of 1–30 min. One can also include finer checks for data quality by flagging data that exceed physically stipulated ranges. This process need not be done online but could be initiated automatically and periodically, say, every day. It is this data set that is queried as necessary for subsequent analysis.

(e) The above steps in the data collection process are performed on a routine basis. This data can be used to advantage provided one can frame the issues relevant to the client and determine which of these can be satisfied.
Examples of such routine uses are assessing overall time-averaged system efficiencies and preparing weekly performance reports, as well as subtler actions such as supervisory control and automated fault detection.

9 Electrical systems have different considerations since they mostly use very high-frequency sampling rates.


Fig. 1.20 Flowchart depicting various stages in data analysis and decision-making as applied to continuous monitoring of thermal systems. (Stages: measurement system design; system monitoring; data sampling (1 s–1 min); cleaning and storing of raw data (initial cleaning and flagging of missing, mis-recorded, and dead channels; gross error detection; removal of spikes); averaging and storing of data (1–30 min); defining the issue to be analyzed (formulating the client's intention as an engineering problem, determining the analysis approach and the data needed); extracting the data subset for the intended analyses (data transformation, filtering, outlier detection, validation); performing the engineering analysis (statistical inference, pattern identification, regression analysis, parameter estimation, system identification); performing the decision analysis (adequacy of data, correctness of prior presumptions, improving operation and/or efficiency, selecting a risk-averse strategy, reacting to catastrophic risk); and, after possible additional analyses or additional measurements, presenting the decision to the client.)


Table 1.3 Analysis methods covered in this book

Chapter  Topic
1        Introduction to mathematical models and description of different classes of analysis approaches
2        Probability and statistics, important probability distributions, Bayesian statistics
3        Data collection, exploratory data analysis, measurement uncertainty, and propagation of errors
4        Inferential statistics, non-parametric tests, Bayesian, sampling and resampling methods
5        Linear ordinary least squares (OLS) regression, residual analysis, point and interval estimation, resampling methods
6        Design of experiments (factorial and block, response surface, context of computer simulations)
7        Traditional optimization methods; linear, nonlinear, and dynamic programming
8        Time series analysis, trend and seasonal models, stochastic methods (ARIMA), quality control
9        Advanced regression (parametric, non-parametric, collinearity, nonlinear, and neural networks)
10       Parameter estimation of static and dynamic gray-box models, calibration of simulation models
11       Data analytics, unsupervised learning (clustering), supervised (classification)
12       Decision-making, risk analysis, and sustainability assessment methods

(f) Occasionally, the owner would like to evaluate major changes, such as equipment change-out or addition of new equipment, or would like to improve overall system performance or reliability without knowing exactly how to achieve this. Alternatively, one may wish to evaluate system performance under an exceptionally hot spell of several days. This is when specialized consultants are brought in to make recommendations to the owner. Historically, such analysis was done based on the professional expertise of the consultant with minimal or no measurements of the actual system. However, both the financial institutions that lend the money for implementing such changes and the upper management of the company owning the system are insisting on a more transparent engineering analysis based on actual data. Hence, the preliminary steps involving relevant data extraction and a more careful data proofing and validation are essential.

(g) Extracted data are then subjected to certain engineering analyses that can be collectively referred to as applied data modeling and analysis. These involve statistical inference, identifying patterns in the data, regression analysis, parameter estimation, performance extrapolation, classification or clustering, deterministic modeling, etc.

(h) Performing a decision analysis, in our context, involves using the results of the engineering analyses and adding a further layer of analysis that includes modeling uncertainties (involving, among other issues, a sensitivity analysis), modeling stakeholder preferences, and structuring decisions. Several iterations may be necessary between this element and the ones involving engineering analysis and data extraction.

(i) The presentation of the various choices suggested by the decision analysis to the owner or decision maker so that a final course of action may be determined. Sometimes,

it may be necessary to perform additional analyses or even modify or enhance the capabilities of the measurement system in order to satisfy client needs.
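Elements (c) and (d) of the flowchart, the cleaning and averaging of raw samples, can be sketched as follows; the physical limits, readings, and averaging block size are hypothetical:

```python
# Sketch of flowchart elements (c) and (d): clean raw sensor samples (drop
# missing readings and out-of-range spikes), then average them over fixed
# blocks before storage. Thresholds and data are invented for illustration.

def clean(samples, lo, hi):
    """Discard missing (None) readings and values outside physical limits."""
    return [s for s in samples if s is not None and lo <= s <= hi]

def average_blocks(samples, block):
    """Average consecutive groups of 'block' cleaned samples."""
    return [sum(samples[i:i + block]) / len(samples[i:i + block])
            for i in range(0, len(samples), block)]

# raw 1-second temperature readings (deg C) with a dead channel and a spike
raw = [20.1, 20.2, None, 20.0, 999.9, 20.3, 20.2, 20.1, 20.0, 20.2]
cleaned = clean(raw, lo=-40.0, hi=60.0)
print(average_blocks(cleaned, block=4))  # averaged values near 20.15 and 20.13
```

In practice the flagged values would be logged rather than silently discarded, and the averaging period (1–30 min) set by the application.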

1.8 Topics Covered in Book

The overall structure of the book is depicted in Table 1.3, along with a simple suggestion as to how the book could be used for two courses if necessary. This chapter has provided a general introduction to mathematical models and discussed the different types of problems and analysis tools available for data-driven modeling and analysis.

Chapter 2 reviews basic probability concepts (both classical and Bayesian) and covers various important probability distributions with emphasis on their practical usefulness.

Chapter 3 covers data collection and proofing, along with exploratory data analysis and descriptive statistics. The latter entails performing "numerical detective work" on the data and developing methods for screening, organizing, summarizing, and detecting basic trends in the data (such as graphs and tables), which help in information gathering and knowledge generation. Historically, formal statisticians shied away from exploratory data analysis, considering it either too simple to warrant serious discussion or too ad hoc in nature to be able to expound logical steps (McNeil 1977). This area had to await the pioneering work of John Tukey and others to obtain a formal structure. A brief overview is provided in this book; the interested reader can refer to Hoaglin et al. (1983) or Tukey (1988) for an excellent perspective. The concepts of measurement uncertainty and propagation of errors are also addressed, and relevant equations provided.

Chapter 4 covers statistical inference involving hypothesis testing with single-sample and multi-sample parametric tests (such as analysis of variance [ANOVA]) involving univariate and multivariate samples. Inferential problems are those that involve making uncertainty inferences or calculating confidence intervals of population estimates from selected samples. These methods are the backbone of classical statistics from which other approaches evolved (covered in Chaps. 5, 6, and 9). Non-parametric tests and Bayesian inference methods have also proven useful in certain cases, and these approaches are also covered. The advent of computers has led to the very popular sampling and resampling methods that reuse the available sample multiple times, resulting in more intuitive, versatile, and robust point and interval estimation.

Chapter 5 deals with inferential statistics applicable to linear regression situations, an application that is perhaps the most prevalent. In essence, the regression problem involves (i) taking measurements of the various parameters (or regressor variables) and of the output (or response variables) of a device/system or a phenomenon, (ii) identifying a causal quantitative correlation between them by regression, (iii) estimating the model coefficients/parameters, and (iv) using the model to make predictions about system behavior under future operating conditions. When a regression model is identified from data, the data cannot be considered to include the entire "population," i.e., all the observations one could possibly conceive. Hence, model parameters and model predictions suffer from uncertainty, which falls under the purview of inferential statistics. There is a rich literature in this area, called "model building," with great diversity of techniques and levels of sophistication. Traditional regression methods using ordinary least squares (OLS) for univariate and multivariate linear problems, along with advanced parametric and non-parametric methods, are covered.
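The four-step regression procedure described for Chap. 5 can be sketched on hypothetical data (the outdoor-temperature and heating-load values are invented; the closed-form OLS estimator itself is standard):

```python
# (i) take measurements, (ii) posit a linear correlation, (iii) estimate
# the coefficients by ordinary least squares, (iv) predict future behavior.

def ols(xs, ys):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

# (i) measured outdoor temperature (deg C) vs. building heating load (kW)
temp = [0, 2, 4, 6, 8, 10]
load = [50.2, 46.1, 41.8, 38.2, 33.9, 30.1]

b0, b1 = ols(temp, load)      # (ii)-(iii) identify and estimate the model
print(round(b0 + b1 * 5, 2))  # (iv) predicted load at 5 deg C, close to 40 kW
```

The uncertainty attached to b0, b1, and the prediction, which the sketch ignores, is precisely the inferential-statistics material that Chap. 5 develops.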
Residual analysis and the detection of leverage and influential points are also discussed, along with simple remedial measures one could take if the residual behavior is improper. Resampling methods applied in a regression context are also presented.

Chapter 6 covers experimental design methods and discusses factorial and response surface methods that allow extending hypothesis testing to multiple variables as well as identifying sound performance models. "Design of experiments" (DOE) is the process of prescribing the exact manner in which samples for testing need to be selected, and the conditions and sequence under which the testing needs to be performed, such that the relationship or model between a response variable and a set of regressor variables can be identified in a robust and accurate manner. The extension of traditional DOE approaches to computer simulation-based design of energy-efficient buildings involving numerous design variables is also discussed.

The material from these five chapters (Chaps. 2, 3, 4, 5, and 6) is generally covered in an undergraduate statistics and probability course and can be used for that purpose. It can also be used as review or refresher material (especially useful to the general practitioner) for a second course meant to cover more advanced concepts and statistical techniques and to better prepare graduate students and energy researchers.

Chapter 7 reviews various traditional optimization methods, separated into analytical methods (such as the Lagrange multiplier method) and univariate and multivariate numerical methods. Linear programming problems (network models and mixed integer problems) as well as nonlinear programming problems are covered. Optimization is at the heart of numerous data analysis situations (including regression model building) and is essential to current societal problems such as the design and future planning of energy-efficient and resilient infrastructure systems.

Time series analysis methods are treated in Chap. 8; these are a set of tools that include traditional model-building techniques as well as those that capture the sequential behavior of the data and its noise. They involve the analysis, interpretation, and manipulation of time series signals in either the time domain or the frequency domain. Several methods to smooth time series data in the time domain are presented. Forecasting models based on OLS modeling and the more sophisticated class of stochastic models (such as autoregressive and moving average methods) suitable for linear dynamic models are discussed. An overview is also provided of control chart techniques extensively used for process control and condition monitoring.

Chapter 9 deals with subtler and more advanced topics related to parametric and non-parametric regression analysis. The dangers of collinearity among regressors during multivariate regression are addressed, and ways to minimize such effects (such as principal component analysis, ridge regression, and shrinkage methods) are discussed.
An overarching class of models, namely generalized linear models (GLM), is introduced; these combine in a unified framework both strictly linear models and nonlinear models that can be transformed into linear ones. Next, parameter estimation of intrinsically nonlinear models, as well as non-parametric estimation methods involving smoothing and regression splines and the use of kernel functions, is covered. Finally, the multi-layer perceptron (MLP) neural network model is discussed.

Inverse modeling is an approach to data analysis that combines the basic physics of the process with statistical methods so as to achieve a better understanding of the system dynamics, and thereby use it to predict system performance. Chapter 10 presents an overview of the types of problems that fall under inverse estimation methods applied to structural models: (a) static gray-box models involving algebraic equations, (b) dynamic gray-box models involving differential equations, and (c) the calibration of white-box models involving detailed simulation programs. The concept of the information content of collected data and associated quantitative metrics are also introduced.

Chapter 11 deals with data analytic methods, which include data mining and machine learning. They are directly concerned with practical applications (discerning hidden information; discovering patterns, associations, and trends; or summarizing data behavior) through data exploration and computer-based learning algorithms. The problems can be broadly divided into two categories: unsupervised learning approaches (such as clustering methods) and supervised learning approaches (such as classification methods). Classification problems are those where one would like to develop a model to statistically distinguish or "discriminate" differences between two or more groups when one knows beforehand that such groupings exist in the data set provided, and to subsequently assign, allocate, or classify a future unclassified observation into a specific group with the smallest probability of error. In clustering, the number of clusters or groups is not known beforehand (thus, a more difficult problem), and the intent is to allocate a set of observations into groups that are similar or "close" to one another. Several important subtypes of both these categories are presented and discussed.

Decision theory is the study of methods for arriving at "rational" decisions under uncertainty. Chapter 12 covers this topic (including risk analysis) and further provides an introduction to sustainability assessment methods, which have assumed central importance in recent years. An overview of quantitative decision-making methods is followed by the suggestion that this area be divided into single-criterion and multi-criteria methods, and further separated into single-discipline and multi-discipline applications that usually involve non-consistent attribute scales.
How the decision-making process is actually applied to the selection of the most appropriate course of action for engineered systems is discussed, and various relevant concepts are introduced. A general introduction to sustainability and the importance of assessments in this area are provided. The two primary sustainability assessment frameworks, namely the structure-based and the performance-based, are described along with their various subcategories and analysis procedures; illustrative case study examples are also provided.
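As a minimal sketch of the supervised classification task described for Chap. 11, the following uses a simple nearest-centroid rule; the group names and observations are hypothetical, and the formal methods of Chap. 11 are considerably more powerful:

```python
# Supervised classification: group labels are known beforehand, so a rule
# is built from the labeled data and then used to assign a new, unclassified
# observation to a group. Data (peak power draw, kW) are invented.

def centroid(values):
    return sum(values) / len(values)

def classify(x, groups):
    """Assign x to the group whose centroid is closest."""
    return min(groups, key=lambda name: abs(x - centroid(groups[name])))

groups = {"residential": [3.1, 2.8, 3.5, 2.9],    # known groupings
          "commercial":  [41.0, 38.5, 44.2, 40.1]}

print(classify(4.2, groups))   # assigned to "residential"
print(classify(35.0, groups))  # assigned to "commercial"
```

In clustering, by contrast, the labels (and even the number of groups) would not be given, and the algorithm would have to discover the two groupings on its own, as in the k-means sketch of Sect. 1.6.1.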

Problems

Pr. 1.1 Identify which of the following equations are linear functional models, which are linear in their parameters (a, b, c), and which are both:
(a) y = a + bx + cx²
(b) y = a + bx + c/x²
(c) y = a + b(x − 1) + c(x − 1)²
(d) y = (a0 + b0·x1 + c0·x1²) + (a1 + b1·x1 + c1·x1²)·x2
(e) y = a + b·sin(c + x)
(f) y = a + b·sin(cx)
(g) y = a + b·x^c
(h) y = a + b·x^1.5
(i) y = a + b·e^x

Pr. 1.2 Consider the equation for pressure drop in a pipe given by Eq. 1.2.
(a) Recast the equation such that it expresses the fluid volume flow rate (rather than velocity) in terms of pressure drop and other quantities.
(b) Draw a block diagram to represent the case when a feedback control is used to control the flow rate from measured pressure drop.

Pr. 1.3 Consider Eq. 1.5, which is a lumped model of a fully mixed hot water storage tank. Assume the initial temperature is Ts,initial = 60 °C while the ambient temperature is constant at 20 °C.
(i) Deduce the expression for the time constant of the tank in terms of model parameters.
(ii) Compute its numerical value when Mcp = 9.0 MJ/°C and UA = 0.833 kW/°C.
(iii) What will be the storage tank temperature after 6 h under cool-down (with P = 0)?
(iv) How long will the tank temperature take to drop to 40 °C under cool-down?
(v) Derive the solution for the transient response of the storage tank under electric power input P.
(vi) If P = 50 kW, calculate and plot the response when the tank is initially at 30 °C (akin to Fig. 1.12).

Pr. 1.4 Consider Fig. 1.9 where a heated sphere is being cooled. The analysis simplifies considerably if the sphere can be modeled as a lumped one. This can be done if the Biot number Bi ≡ h·Le/k < 0.1. Assume that the external heat transfer coefficient is h = 10 W/m² °C and that the radius of the sphere is 15 cm. The equivalent length of the sphere is Le = Volume/Surface area. Determine whether the lumped model assumption is appropriate for spheres made of the following materials:
(a) Steel with thermal conductivity k = 34 W/m °C
(b) Copper with thermal conductivity k = 340 W/m °C
(c) Wood with thermal conductivity k = 0.15 W/m °C
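As a numerical sanity check for parts (ii)–(iv) of Pr. 1.3, note that the lumped tank model with P = 0 has the cool-down solution Ts(t) = Ta + (Ts,initial − Ta)·exp(−t/τ) with time constant τ = Mcp/UA. A minimal sketch (variable names are ours, not the book's):

```python
import math

# Tank parameters from Pr. 1.3 (cool-down case, P = 0)
Mcp = 9.0e3            # thermal capacitance, kJ/degC (9.0 MJ/degC)
UA = 0.833             # overall loss coefficient, kW/degC
Ta, Ts0 = 20.0, 60.0   # ambient and initial tank temperatures, degC

# (i)-(ii) time constant tau = Mcp/UA, converted to hours
tau_h = (Mcp / UA) / 3600.0

# (iii) tank temperature after 6 h of cool-down
T_6h = Ta + (Ts0 - Ta) * math.exp(-6.0 / tau_h)

# (iv) time for the tank temperature to drop to 40 degC
t_40 = tau_h * math.log((Ts0 - Ta) / (40.0 - Ta))

print(round(tau_h, 2), round(T_6h, 1), round(t_40, 2))
```

With these numbers the time constant comes out to about 3.0 h, the tank cools to roughly 25 °C after 6 h, and it takes τ·ln 2 ≈ 2.1 h to reach 40 °C.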


1 Mathematical Models and Data Analysis

Fig. 1.21 Steady-state heat flow through a composite wall of surface area A made up of three layers in series. (a) Sketch. (b) Electrical resistance analog (Pr. 1.6). (From Reddy et al. 2016)

Pr. 1.5 The thermal network representation of a homogeneous plane wall is illustrated in Fig. 1.10. Draw the 3R2C network representation and derive expressions for the three resistors and the two capacitors in terms of the two air film coefficients and the wall properties (Hint: follow the approach illustrated in Fig. 1.10 for the 2R1C network).

Pr. 1.6 Consider the composite wall of surface area A as shown in Fig. 1.21 with four interfaces designated by (1, 2, 3, 4). It consists of three different materials (A, B, C), each with its own thermal conductivity k and thickness Δx. The two outer surface temperatures are T1 and T4, respectively, while the steady-state heat flow rate is Q.
(a) Write the equation for steady-state heat flow through this wall. This is the forward application. When would one use this equation?
(b) This equation can also be used for inverse modeling applications. Give two practical instances when this applies. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.7 A building with a floor area of A = 1000 m² and inside wall height of h = 3 m has an air infiltration (i.e., leakage) rate of 0.4 air changes per hour (Note: one air change is equal to the volume of the room). The outdoor temperature is To = 2 °C and the indoor temperature is Ti = 22 °C.
(a) Write the equation for the rate of heat input Q that must be provided by the building heating system to warm the cold outside air, assuming the location to be at sea level.

(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.8 Consider a situation where a water pump is to deliver a quantity of water F = 500 L/s from a well of depth d = 60 m to the top of a building of height h = 30 m. The friction pressure drop through the pipe is Δp = 30 kPa. The pump-motor efficiency is ηp = 60%.
(a) Write the equation for electric power consumption in terms of the various quantities specified. This is the forward situation. Under what situations would this equation be used?
(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.9 Two pumps in parallel viewed from the forward and the inverse perspectives
Consider Fig. 1.22, which will be analyzed in both the forward and data-driven approaches.
(a) Forward problem¹⁰: Two pumps with parallel networks deliver F = 0.01 m³/s of water from a reservoir to the destination. The pressure drops in Pascals (Pa) of each network are given by:

Δp1 = (2.1 × 10¹⁰)·F1²  and  Δp2 = (3.6 × 10¹⁰)·F2²

where F1 and F2 are the flow rates through each branch in m³/s. Assume that

¹⁰ From Stoecker (1989), by permission of McGraw-Hill.


Fig. 1.22 Pumping system with two pumps in parallel (Pr. 1.9): total flow F = 0.01 m³/s split into branch flows F1 and F2, with ΔP1 = (2.1 × 10¹⁰)·F1² and ΔP2 = (3.6 × 10¹⁰)·F2²

Fig. 1.23 Perspective of the forward problem for the lake contamination situation (Pr. 1.10): incoming stream with Qs = 5.0 m³/s and Cs = 10.0 mg/L, plus a contaminated outfall

pumps and their motor assemblies have the same efficiency. Let P1 and P2 be the electric power in Watts (W) consumed by the two pump-motor assemblies.
(i) Sketch the block diagram for this system with total electric power as the output variable.
(ii) Frame the total power P as the objective function that needs to be minimized against total delivered water F.
(iii) Solve the problem for F1 and P.
(b) Inverse problem: Now consider the same system in the inverse framework, where one would instrument the existing system such that operational measurements of P for different F1 and F2 are available.
(i) Frame the function appropriately using insights into the functional form provided by the forward model.
(ii) The simplifying assumption of constant pump efficiency is unrealistic. How would the above function need to be reformulated if efficiency can be taken to be a quadratic polynomial (or black-box model) of flow rate, as shown below for the first piping branch (with a similar expression applying for the second branch): η1 = a1 + b1·F1 + c1·F1².
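For part (a) of Pr. 1.9, with equal pump-motor efficiencies the quantity to be minimized is proportional to the hydraulic power Δp1·F1 + Δp2·F2 = (2.1 × 10¹⁰)F1³ + (3.6 × 10¹⁰)F2³ subject to F1 + F2 = F. A brute-force one-dimensional search is one way to check the optimum (a sketch; the constant efficiency factor is left out since it only scales the objective):

```python
# Minimize hydraulic power W(F1) = 2.1e10*F1**3 + 3.6e10*(F - F1)**3
F = 0.01  # total delivered flow, m3/s

def hydraulic_power(F1):
    F2 = F - F1
    return 2.1e10 * F1**3 + 3.6e10 * F2**3   # Pa * (m3/s) = W

# Grid search over all feasible splits of the total flow
candidates = [i * F / 100000 for i in range(100001)]
F1_opt = min(candidates, key=hydraulic_power)
W_min = hydraulic_power(F1_opt)

print(round(F1_opt, 5), round(W_min))
```

The optimum split satisfies the first-order condition 3·(2.1 × 10¹⁰)F1² = 3·(3.6 × 10¹⁰)F2², i.e., F1/F2 = (3.6/2.1)^0.5, giving F1 ≈ 0.00567 m³/s; the actual electric power would divide this hydraulic power by the pump-motor efficiency.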

Fig. 1.23 (data, continued): outfall Qw = 0.5 m³/s, Cw = 100.0 mg/L; lake V = 10.0 × 10⁶ m³, k = 0.20/day, C = ?; outgoing stream Qm = ? m³/s, Cm = ? mg/L

Pr. 1.10 Lake contamination problem viewed from the forward and the inverse perspectives
A lake of volume V is fed by an incoming stream with volumetric flow rate Qs contaminated with concentration Cs¹¹ (Fig. 1.23). The outfall of another source (say, the sewage from a factory) also discharges a flow Qw of the same pollutant with concentration Cw. The wastes in the stream and sewage have a decay coefficient k.
(a) Let us consider the forward model approach. In order to simplify the problem, the lake will be considered to be a fully mixed compartment, and evaporation and seepage losses to the lake bottom will be neglected. In such a case, the concentration of the outflow is equal to that in the lake, i.e., Cm = C. Then, the steady-state concentration in the lake can be determined quite simply from: Input rate = Output rate + Decay rate, where Input rate = QsCs + QwCw, Output rate = QmCm = (Qs + Qw)Cm, and Decay rate = kCV.¹²
This results in:

C = (QsCs + QwCw)/(Qs + Qw + kV)

¹¹ From Masters and Ela (2008), by permission of Pearson Education.
¹² This term is the first-order Taylor series approximation of exp(kCVt) where t = time.


Verify the above-derived expression, and also check that C = 3.5 mg/L when the numerical values for the various quantities given in Fig. 1.23 are used.
(b) Now consider the inverse control problem when an actual situation can be generally represented by the model treated above. One can envision several scenarios; let us consider a simple one. Flora and fauna downstream of the lake have been found to be adversely affected, and an environmental agency would like to investigate this situation by installing appropriate instrumentation. The agency believes that the factory is polluting the lake, which the factory owner, on the other hand, disputes. Since it is rather difficult to get a good reading of spatially averaged concentrations in the lake, the experimental procedure involves measuring the cross-sectionally averaged concentrations and volumetric flow rates of the incoming, outgoing, and outfall streams.
(i) Using the above model, describe the agency's thought process whereby it would conclude that the factory is indeed the major cause of the pollution.
(ii) Identify arguments that the factory owner can raise to rebut the agency's findings.
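The numerical check in part (a) can be sketched directly; the only subtlety is converting the decay coefficient k from per-day to per-second units so that the product kV is consistent with the flow rates in m³/s:

```python
# Steady-state lake concentration, Pr. 1.10(a):
# C = (Qs*Cs + Qw*Cw) / (Qs + Qw + k*V)
Qs, Cs = 5.0, 10.0       # incoming stream: m3/s, mg/L
Qw, Cw = 0.5, 100.0      # contaminated outfall: m3/s, mg/L
V = 10.0e6               # lake volume, m3
k = 0.20 / 86400.0       # decay coefficient, converted from 1/day to 1/s

C = (Qs * Cs + Qw * Cw) / (Qs + Qw + k * V)
print(round(C, 1))   # -> 3.5 (mg/L)
```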

References

Arsham, http://home.ubalt.edu/ntsbarsh/stat-data/Topics.htm, downloaded August 2008.
Berthold, M. and D.J. Hand (eds.), 2003. Intelligent Data Analysis, 2nd Edition, Springer, Berlin.
Breiman, L., 2001. Statistical modeling: The two cultures, Statistical Science, vol. 16, no. 3, pp. 199–231.
Cha, P.D., J.J. Rosenberg and C.L. Dym, 2000. Fundamentals of Modeling and Analyzing Engineering Systems, 2nd Ed., Cambridge University Press, Cambridge, UK.
Clemen, R.T. and T. Reilly, 2001. Making Hard Decisions with Decision Tools, Brooks Cole, Duxbury, Pacific Grove, CA.
Edwards, C.H. and D.E. Penney, 1996. Differential Equations and Boundary Value Problems, Prentice Hall, Englewood Cliffs, NJ.


Doebelin, E.O., 1995. Engineering Experimentation: Planning, Execution and Reporting, McGraw-Hill, New York.
Dunham, M., 2003. Data Mining: Introductory and Advanced Topics, Pearson Education Inc.
Eisen, M., 1988. Mathematical Methods and Models in the Biological Sciences, Prentice Hall, Englewood Cliffs, NJ.
Energy Plus, 2009. Energy Plus Building Energy Simulation software, developed by the National Renewable Energy Laboratory (NREL) for the U.S. Department of Energy, under the Building Technologies program, Washington, DC. http://www.nrel.gov/buildings/energy_analysis.html#energyplus
Haimes, Y.Y., 1998. Risk Modeling, Assessment and Management, John Wiley and Sons, New York.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey, 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York.
Hodges, J.L. and E.L. Lehmann, 1970. Basic Concepts of Probability and Statistics, 2nd Edition, Holden-Day.
Kelleher, J.D. and B. Tierney, 2018. Data Science, MIT Press, Cambridge, MA.
Masters, G.M. and W.P. Ela, 2008. Introduction to Environmental Engineering and Science, 3rd Ed., Prentice Hall, Englewood Cliffs, NJ.
Mayer-Schonberger, V. and K. Cukier, 2013. Big Data, John Murray, London, UK.
McNeil, D.R., 1977. Interactive Data Analysis, John Wiley and Sons, New York.
Reddy, T.A., 2006. Literature review on calibration of building energy simulation programs: Uses, problems, procedures, uncertainty and tools, ASHRAE Transactions, 112(1), January.
Reddy, T.A., J.F. Kreider, P. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings: Principles and Practice of Energy Efficient Design, 3rd Edition, CRC Press, Boca Raton, FL.
Sprent, P., 1998. Data Driven Statistical Methods, Chapman and Hall, London.
Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Streed, E.R., J.E. Hill, W.C. Thomas, A.G. Dawson and B.D. Wood, 1979. Results and Analysis of a Round Robin Test Program for Liquid-Heating Flat-Plate Solar Collectors, Solar Energy, 22, p. 235.
Stubberud, A., I. Williams and J. DiStefano, 1994. Outline of Feedback and Control Systems, Schaum Series, McGraw-Hill.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (ed.), Wadsworth and Brookes/Cole Advanced Books and Software, Pacific Grove, CA.

2 Probability Concepts and Probability Distributions

Abstract

This chapter reviews basic notions of probability (or stochastic variability), which is the formal study of the laws of chance, i.e., where the ambiguity in outcome is inherent in the nature of the process itself. Probability theory allows idealized behavior of different types of systems to be modeled and provides the mathematical underpinning of statistical inference. In that respect, it can be viewed as pertaining to the forward modeling domain. Both the primary views of probability, namely the frequentist (or classical) and the Bayesian, are covered, and the difference between probability and statistics is discussed. The basic laws of probability are presented, followed by an introductory treatment of set theory nomenclature and algebra. Relevant concepts of random variables, namely density functions, moment generation attributes, and transformation of variables, are also covered. Next, some of the important discrete and continuous probability distributions are presented, along with a discussion of their genealogy, their mathematical form, and their application areas. Subsequently, Bayes' theorem is derived, and the framework it provides for including prior knowledge in multistage tests is illustrated using examples involving forward and reverse tree diagrams. Finally, the three different kinds of empirical probabilities, namely absolute, relative, and subjective, are described with illustrative examples.

2.1 Introduction

2.1.1 Classical Concept of Probability

Random data by its very nature is indeterminate. So how can a scientific theory attempt to deal with indeterminacy? Probability theory does just that: it is based on the fact that though the result of any particular trial or experiment or event cannot be predicted, a long sequence of performances taken together reveals a stability that can serve as the basis for fairly precise predictions. Consider the case when an experiment was carried out several times and the anticipated event E occurred in some of them. Relative frequency is the ratio denoting the fraction of trials in which event E occurred. It is usually estimated empirically after the event as:

p(E) = (number of times E occurred)/(number of times the experiment was carried out)    (2.1)

For certain simpler events, one can determine this proportion without actually carrying out the experiment; this is referred to as being wise before the event. For example, the relative frequency of getting heads (selected as the "success" event) when tossing a fair coin is 0.5. In any case, this a-priori proportion is interpreted as the long-run relative frequency and is referred to as the probability of event E occurring. This is the classical or frequentist or traditionalist definition of probability on which probability theory is founded. This interpretation arises from the strong law of large numbers (a well-known result in probability theory), which states that the average of a sequence of independent random variables having the same distribution will converge to the mean of that distribution. If a six-faced dice is rolled, the outcome of a single roll cannot be predicted, but the long-run relative frequency of a pre-selected number between 1 and 6 (say, 4) will tend toward 1/6. The classical probability concepts are often described or explained in terms of dice-tossing or coin-flipping or card-playing outcomes since they are intuitive and simple to comprehend, but their applicability is much wider and extends to all sorts of problems, as will become evident in this chapter.
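The long-run stability of the relative frequency (Eq. 2.1) is easy to demonstrate by simulation; in this sketch 100,000 simulated dice rolls drive the observed frequency of a pre-selected face toward 1/6 ≈ 0.167 (a fixed seed is used only so the run is repeatable):

```python
import random

random.seed(42)        # fixed seed so the demonstration is repeatable
n_rolls = 100_000

# Count how often the pre-selected face (say, 4) occurs
hits = sum(1 for _ in range(n_rolls) if random.randint(1, 6) == 4)
rel_freq = hits / n_rolls   # empirical estimate of p(E), Eq. 2.1

print(round(rel_freq, 3))   # close to 1/6
```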

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_2


2.1.2 Bayesian Viewpoint of Probability

The classical or traditional or objective probability concepts are associated with the frequentist view of probability, i.e., interpreting probability as the long-run frequency. This has a nice intuitive interpretation, hence its appeal. However, people have argued that many processes are unique events and do not occur repeatedly, thereby questioning the validity of the frequentist or objective probability viewpoint. Further, even when one may have some basic preliminary idea of the probability associated with a certain event, the classical view excludes such subjective insights in the determination of probability. The Bayesian approach, however, recognizes such issues by allowing one to update assessments of probability that integrate prior knowledge with observed events, thereby allowing better conclusions to be reached. It can thus be viewed as an approach combining a-priori probability (i.e., estimated ahead of the experiment) with post-priori knowledge gained after the experiment is over. Both the classical and the Bayesian approaches converge to the same results as increasingly more data (or information) is gathered. It is when the datasets are small that the additional insight offered by the Bayesian approach becomes advantageous. Thus, the Bayesian view is not an approach that is at odds with the frequentist approach, but rather adds (or allows the addition of) refinement to it. This can be a great benefit in many types of analysis, and therein lies its appeal. The Bayes’ theorem and its application to discrete and continuous probability variables are discussed in Sect. 2.5, while Sect. 4.6 presents its application to estimation and hypothesis testing problems.
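The coin example used later in this chapter gives a flavor of such updating. One standard way to operationalize the Bayesian view is the conjugate beta-binomial update: with a uniform Beta(1, 1) prior on the heads probability, observing 8 heads in 10 flips yields a Beta(9, 3) posterior, whose mean shifts from 0.5 toward the observed frequency. A minimal sketch (our illustration; the formal machinery is covered in Sects. 2.5 and 4.6):

```python
# Beta(a, b) prior on p = probability of heads; start uninformative
a, b = 1.0, 1.0

# Observed data: 8 heads out of 10 flips
heads, flips = 8, 10

# Conjugate update: posterior is Beta(a + heads, b + tails)
a_post = a + heads
b_post = b + (flips - heads)

prior_mean = a / (a + b)                 # 0.5 before seeing any data
post_mean = a_post / (a_post + b_post)   # (1+8)/(1+8 + 1+2) = 0.75

print(prior_mean, post_mean)   # -> 0.5 0.75
```

As more flips are gathered, the posterior mean converges to the long-run relative frequency, which is the sense in which the Bayesian and frequentist answers agree for large datasets.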

2.1.3 Distinction Between Probability and Statistics

The distinction between probability and statistics is often not clear cut, and sometimes the terminology adds to the confusion.¹ In its simplest sense, probability theory generally allows one to predict the behavior of the system "before" the event under the stipulated assumptions, while statistics refers to a body of post-priori knowledge whose application allows one to make sense out of the data collected. Thus, probability concepts provide the theoretical underpinnings of those aspects of statistical analysis that involve random behavior or noise in the actual data being analyzed. Recall that in Sect. 1.3.2, a distinction had been made between four types of uncertainty or unexpected variability in the data. The first was due to the stochastic or inherently random nature of the process itself, which no amount of experimentation, even if done perfectly, can overcome. The study of probability theory is mainly mathematical and applies to this type, i.e., to situations/processes/systems whose random nature is known to be of a certain type or can be modeled such that its behavior (i.e., certain events being produced by the system) can be predicted in the form of probability distributions. Thus, probability deals with the idealized behavior of a system under a known type of randomness. Unfortunately, most natural or engineered systems do not fit neatly into any one of these groups, and so when performance data is available for a system, the objective may be:

(i) To try to understand the overall nature of the system from its measured performance, i.e., to explain what caused the system to behave in the manner it did.
(ii) To try to make inferences about the general behavior of the system from a limited amount of data.

Consequently, some authors have suggested that the probability approach be viewed as a "deductive" science where the conclusion is drawn without any uncertainty, while statistics is an "inductive" science where only an imperfect conclusion can be reached, with the added problem that this conclusion hinges on the types of assumptions one makes about the random nature of the process and its forcing functions! Here is a simple example to illustrate the difference. Consider the flipping of a coin supposed to be fair. The probability of getting "heads" is 1/2. If, however, "heads" come up eight times out of the last ten trials, what is the probability the coin is not fair? Statistics allows an answer to this type of "inverse" enquiry, while probability is the approach for the "forward" type of questioning.

¹ For example, "statistical mechanics" in physics has nothing to do with statistics at all but is a type of problem studied under probability.

2.2 Classical Probability

2.2.1 Basic Terminology

A random (or stochastic) variable is one whose numerical value depends on the outcome of a random phenomenon or experiment or trial, i.e., one whose value depends on chance and thus not entirely predictable. For example, rolling a dice is a random experiment. There are two types of random variables: • Discrete random variables—those that can take on only a finite or countable number of values. • Continuous random variables—those that may take on any value in an interval. The following basic notions relevant to the study of probability apply primarily to discrete random variables:


(i) Simple event or trial of a random experiment is one that has a single outcome. It cannot be decomposed into anything simpler. For example, getting a {6} when a dice is rolled.
(ii) Sample space (some refer to it as the "universe") is the set of all possible outcomes of a single trial. For the rolling of a six-faced dice, the sample space is S = {1, 2, 3, 4, 5, 6}.
(iii) Compound or composite event is one involving a grouping of several simple events. For example, getting a pre-selected number, say A = {6}, from all the possible outcomes of rolling two six-faced dice together would constitute a composite event.
(iv) Complement of an event is the set of outcomes in the sample space not contained in the event. For example, Ā = {2, 3, 4, 5, 7, 8, 9, 10, 11, 12} is the complement of the event A above.

2.2.2 Basic Set Theory Notation and Axioms of Probability

A "set" in mathematics is a collection of well-defined distinct objects that can be considered as an object in its own right. Set theory has its own definitions, axioms, notation, and algebra, which are closely related to the properties of random variables. Familiarity with algebraic manipulation of sets can enhance understanding and manipulation of probability concepts. The outcomes of simple or compound events can also be considered to be elements of a set. For example, the sample space S of outcomes from rolling a dice is a collection of items or a set, and is usually indicated by S = {1, 2, 3, 4, 5, 6}. If set B were to denote B = {1, 2, 3}, it would be a subset of S or "belong to S," which is expressed as B ∈ S. The symbol ∉ represents "not belonging to." Another mathematical representation that B is a subset of S is B ⊂ S or S ⊃ B. Generally, if E is a set of numbers between 3 and 6 (inclusive), it can be stated as E = {x | 3 ≤ x ≤ 6}. Finally, a compound or joint event is one that arises from operations involving two or more events occurring at the same time. The concepts of complement, union, intersection, etc. discussed below in the context of probability also apply to set manipulation.

The Venn diagram is a pictorial representation wherein elements are shown as points in a plane and sets as closed regions within an enclosing rectangle denoting the universal set or sample space. It offers a convenient manner of illustrating set and subset interaction and allows intuitive understanding of compound events and the properties of their combined probabilities. Figure 2.1 illustrates the following concepts:

Fig. 2.1 Venn diagrams for a few simple cases. (a) Event A is denoted as a region in space S; the probability of event A is represented by the ratio of the area inside the circle to that inside the rectangle. (b) The intersection of events A and B is the common overlapping area (shown hatched). (c) Events A and B are mutually exclusive or disjoint events (A ∩ B = Ø). (d) Event B is a subset of event A

• The universe of outcomes or sample space S is a set denoted by the area enclosed within a rectangle, while the probability of a particular event (say, event A) is denoted by a region within it (Fig. 2.1a).
• Union of two events A and B (Fig. 2.1b) is represented by the set of outcomes in either A or B or both, and is denoted by A ∪ B (where the symbol ∪ is conveniently remembered as the "u" of "union"). This is akin to an addition, and the composite event is denoted mathematically as C = A ∪ B = B ∪ A. An example is the number of cards in a pack of 52 cards that are either hearts or spades (= 52 × (1/4 + 1/4) = 26).
• Intersection of two events A and B is represented by the set of outcomes that are in both A and B simultaneously. This is akin to a multiplication, and is denoted by D = A ∩ B =


B ∩ A. It is represented by the hatched area in Fig. 2.1b. An example is drawing a card from a deck and finding it to be a red jack (probability = (1/2) × (1/13) = 1/26). The figure also shows the areas denoted by the intersection of A and B with their complements Ā and B̄.
• Mutually exclusive events or disjoint events are those that have no outcomes in common (Fig. 2.1c). In other words, the two events cannot occur together during the same trial. If events A and B are disjoint, this can be expressed as A ∩ B = Ø, where Ø denotes a null or empty set. An example is drawing a red spade (nil).
• Event B is inclusive in event A when all outcomes of B are contained in those of A, i.e., B is a subset of A (Fig. 2.1d). This is expressed as B ⊂ A or A ⊃ B or B ∈ A. An example is the number of cards less than six (event B) among the red cards (event A). The figure also shows the area (A – B) representing the difference between events A and B.

2.2.3 Axioms of Probability

Let us now apply the above concepts to random variables and denote the sample space S as consisting of two events A and B with probabilities p(A) and p(B), respectively. Then:

(i) The probability of any event, say A, cannot be negative. This is expressed as:

p(A) ≥ 0    (2.2)

(ii) The probabilities of all events must sum to unity (i.e., be normalized):

p(S) = p(A) + p(B) = 1    (2.3)

(iii) The probabilities of mutually exclusive events A and B add up:

p(A ∪ B) = p(A) + p(B)    (2.4)

If a dice is rolled twice, the outcomes can be assumed to be mutually exclusive. If event A is the occurrence of 2 and event B that of 3, then p(A or B) = 1/6 + 1/6 = 1/3, i.e., the additive rule (Eq. 2.4) applies. The extension to more than two mutually exclusive events is straightforward. Some other inferred relations are:

(iv) The probability of the complement of event A (which follows from A ∪ Ā = S):

p(Ā) = 1 − p(A)    (2.5)

(v) The probability for either A or B to occur (when they are not mutually exclusive) is:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)    (2.6)

This is intuitively obvious from the Venn diagram (see Fig. 2.1b) since the hatched area (representing p(A ∩ B)) gets counted twice in the sum, and so needs to be deducted once. This equation can also be deduced from the axioms of probability. Note that if events A and B are mutually exclusive, then Eq. 2.6 reduces to Eq. 2.4.

Example 2.2.1
(a) Set theory approach. Consider two sets A and B defined by integers from 1 to 10: A = {1, 4, 5, 7, 8, 9} and B = {2, 3, 5, 6, 8, 10}. Then A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = sample space S. Also A ∩ B = {5, 8}, A – B = {1, 4, 7, 9}, and B – A = {2, 3, 6, 10}. The reader is urged to draw the corresponding Venn diagram for better conceptual understanding.
(b) Probability approach. Let the two sets be redefined based on the number of integers in each set. Then p(A) = 6/10 = 0.6 and p(B) = 0.6. Clearly the two sets overlap since their probabilities sum to greater than one. The union covers the entire space S, so p(A ∪ B) = 1. Rearranging Eq. 2.6 yields the intersection p(A ∩ B) = p(A) + p(B) − 1 = 0.2, which is consistent with (a) where the intersection consisted of two elements out of 10 in the set.

Example 2.2.2
For three non-mutually exclusive events, Eq. 2.6 can be extended to:

p(A ∪ B ∪ C) = p(A) + p(B) + p(C) − p(A ∩ B) − p(A ∩ C) − p(B ∩ C) + p(A ∩ B ∩ C)    (2.7)

This is clear from the corresponding Venn diagram shown in Fig. 2.2. The last term (A ∩ B ∩ C) denotes the intersection area of all three events occurring simultaneously; it is counted three times in the first three terms and deducted three times as part of (A ∩ B), (A ∩ C), and (B ∩ C), and so needs to be added back once.
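Example 2.2.1 can be verified mechanically with Python's built-in set operations, treating each integer in the sample space as an equally likely outcome:

```python
# Sets from Example 2.2.1 over the sample space {1, ..., 10}
S = set(range(1, 11))
A = {1, 4, 5, 7, 8, 9}
B = {2, 3, 5, 6, 8, 10}

pA = len(A) / len(S)            # 6/10 = 0.6
pB = len(B) / len(S)            # 6/10 = 0.6
p_union = len(A | B) / len(S)   # union covers S, so 1.0
p_inter = len(A & B) / len(S)   # {5, 8}, so 2/10 = 0.2

# Check the addition rule, Eq. 2.6: p(A u B) = p(A) + p(B) - p(A n B)
assert abs(p_union - (pA + pB - p_inter)) < 1e-12
print(p_union, p_inter)   # -> 1.0 0.2
```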

Fig. 2.2 Venn diagram for the union of three non-mutually exclusive events, showing the pairwise intersections A ∩ B, A ∩ C, and B ∩ C and the triple intersection A ∩ B ∩ C (Example 2.2.2)

Fig. 2.3 Two components connected in series and in parallel (Example 2.2.3)

2.2.4

Joint, Marginal, and Conditional Probabilities

RðA and BÞ = RðAÞ:RðBÞ = ð0:9 × 0:75Þ = 0:675:

(a) Joint probability of two independent events represents the case when both events occur together at the same point in time. Such events and more complex probability problems are not appropriate for Venn diagram representation. Then, following the multiplication law: pðA and BÞ = pðAÞ pðBÞ if A and B are independent

(a) Series connection: For the system to function, both components should be functioning. Then the joint probability of system functioning (or the reliability):

ð2:8Þ

These are called product models. The notations p(A \ B) and p(A and B) can be used interchangeably. Consider a six-faced dice-tossing experiment. If outcome of event A is the occurrence of an even number, then p (A) = 1/2. If outcome of event B is that the number is less than or equal to 4, then p(B) = 2/3. The probability that both outcomes occur when a dice is rolled is p(A and B) = 1/2 × 2/3 = 1/3. This is consistent with our intuition since outcomes {2,4} would satisfy both the events. Example 2.2.3 Probability concepts directly apply to reliability problems associated with engineered systems. Consider two electronic components A and B but with different rates of failure expressed as probabilities, say p(A) = 0.1 and p(B) = 0.25. What is the failure probability of a system made up of the two components if connected (a) in series and (b) in parallel. Assume independence, i.e., the failure of one component is independent of the other. The two cases are shown in Fig. 2.3 with the two components A and B connected in series and in parallel. Reliability is the probability of functioning properly and is the complement of probability of failure, i.e., Reliability R(A) = 1 – p(A) = 1 – 0.1 = 0.9, and R(B) = 1 – p(B) = 1 – 0.25 = 0.75.

(a) Series connection: For the system to function, both components must function. Since the events are independent, the system reliability is R(A and B) = R(A) · R(B) = 0.9 × 0.75 = 0.675. The failure probability of the system is p(system failure) = 1 − R(A and B) = 1 − 0.675 = 0.325. For the special case when both components have the same probability of failure p, it is left to the reader to verify that system reliability, or probability of system functioning, = (1 − p)².

(b) Parallel connection: For the system to fail, both components should fail. In this case, it is better to work with failure probabilities. Then, the joint probability of the system failing is p(A) · p(B) = 0.1 × 0.25 = 0.025, which is much lower than the 0.325 found for the components-in-series scenario. This result is consistent with the intuitive fact that components in parallel increase the reliability of the system. The corresponding probability of functioning is R(A and B) = 1 − 0.025 = 0.975. For the special case when both components have the same probability of failure p, it is left to the reader to verify that system reliability, or probability of system functioning, = (1 − p²). ■

(b) Marginal probability of an event A compared to another event B refers to its probability of occurrence irrespective of B. It is sometimes referred to as the "unconditional probability" of A on B. Let the sample space S contain only events A and B, i.e., events A and B are known to have occurred. Since S can be taken to be the sum of event space B and its complement B̄, the probability of A can be expressed in terms of the sum of the disjoint parts of B:

p(A) = p(A ∩ B) + p(A ∩ B̄) = p(A and B) + p(A and B̄)      (2.9)

This expression (the law of total probability) can be extended to the case of more than two joint events. This equation will be made use of in Sect. 2.5.
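The series/parallel reliability logic of Example 2.2.3 can be sketched in a few lines of Python. This is a minimal illustration assuming independent component failures; the function names are our own:

```python
def series_reliability(reliabilities):
    """System works only if every component works: R = product of R_i."""
    r = 1.0
    for ri in reliabilities:
        r *= ri
    return r

def parallel_reliability(reliabilities):
    """System fails only if every component fails: R = 1 - product of (1 - R_i)."""
    q = 1.0
    for ri in reliabilities:
        q *= (1.0 - ri)
    return 1.0 - q

# Components A and B of Example 2.2.3, with failure probabilities 0.1 and 0.25
rA, rB = 1 - 0.1, 1 - 0.25

print(round(series_reliability([rA, rB]), 3))    # 0.675, so p(system failure) = 0.325
print(round(parallel_reliability([rA, rB]), 3))  # 0.975, so p(system failure) = 0.025
```

The same two functions handle any number of components, which makes the special-case results (1 − p)ⁿ and 1 − pⁿ easy to check numerically.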


2 Probability Concepts and Probability Distributions

Example 2.2.4
The percentage data of annual income versus age has been gathered from a large population living in a certain region (see Table 2.1). Let X be the income and Y the age. The marginal probability of X for each class is simply the sum of the probabilities under each column, and that of Y the sum of those for each row. Thus, p(X ≤ $40,000) = 0.15 + 0.10 + 0.08 = 0.33, and so on. Also, verify that the marginal probabilities of X and of Y each sum to 1.00 (to satisfy the normalization condition). ■

(c) Conditional probability: There are several situations involving compound outcomes that are sequential or successive in nature. The chance result of the first stage determines the conditions under which the next stage occurs. Such events, called two-stage (or multistage) events, involve step-by-step outcomes that can be represented as a probability tree. This allows better visualization of how the probabilities progress from one stage to the next. If A and B are events, then the probability that event B occurs given that A has already occurred is given by the conditional probability of B given A:

p(B/A) = p(A ∩ B)/p(A)      (2.10)

An example of a conditional probability event is the drawing of a spade from a pack of cards from which a first card was already drawn. If it is known that the first card was not a spade, then the probability of drawing a spade the second time is 13/51. On the other hand, if the first card drawn was a spade, then the probability of getting a spade on the second draw is 12/51 = 4/17. A special but important case is when p(B/A) = p(B). In this case, B is said to be independent of A because the fact that event A has occurred does not affect the probability of B occurring. In this case, one gets back Eq. 2.8.

Mutually exclusive events and independent events are not to be confused. The former is a property of the events themselves (whether or not they can occur simultaneously), while the latter is a property of their probabilities (whether the occurrence of one affects the probability of the other in sequential or staged experiments). The distinction is clearer if one keeps in mind that:

– If events A and B are mutually exclusive, then p(A ∩ B) = p(AB) = 0.
– If events A and B are independent, then p(AB) = p(A) · p(B).
– If events A and B are both mutually exclusive and independent, then p(AB) = 0 = p(A) · p(B), and so at least one of the events must have zero probability of occurring.

Example 2.2.5
A single fair dice is rolled. Let event A = {even outcome} and event B = {outcome is divisible by 3}.

(a) List the various outcomes in the sample space: {1 2 3 4 5 6}
(b) List the outcomes in A and find p(A): {2 4 6}, p(A) = 1/2
(c) List the outcomes of B and find p(B): {3 6}, p(B) = 1/3
(d) List the outcomes in A ∩ B and find p(A ∩ B): {6}, p(A ∩ B) = 1/6
(e) Are the events A and B independent? Yes, since Eq. 2.8 holds ■

Example 2.2.6
Two defective bulbs have been mixed with eight good ones. Let event A = {first bulb is good} and event B = {second bulb is good}.

(a) If two bulbs are chosen at random with replacement, what is the probability that both are good? The two draws are then independent, with p(A) = 8/10 and p(B) = 8/10. From Eq. 2.8:

Table 2.1 Computing marginal probabilities from a probability table (Example 2.2.4)

Age (Y)                      Income (X)
                             ≤$40,000   $40,000–90,000   ≥$90,000   Marginal probability of Y
Under 25                     0.15       0.09             0.05       0.29
Between 25 and 40            0.10       0.16             0.12       0.38
Above 40                     0.08       0.20             0.05       0.33
Marginal probability of X    0.33       0.45             0.22       Should sum to 1.00 both ways
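The marginal sums of Table 2.1 can be reproduced with a short script. This is an illustrative sketch; the variable names and dictionary layout are our own:

```python
# Joint probabilities of Table 2.1 (Example 2.2.4).
# Rows are age classes, columns are the three income classes.
joint = {
    "under 25":          [0.15, 0.09, 0.05],
    "between 25 and 40": [0.10, 0.16, 0.12],
    "above 40":          [0.08, 0.20, 0.05],
}

# Marginal of Y (age): sum across each row
marginal_y = {age: round(sum(row), 2) for age, row in joint.items()}

# Marginal of X (income): sum down each column
marginal_x = [round(sum(row[j] for row in joint.values()), 2) for j in range(3)]

print(marginal_y)   # {'under 25': 0.29, 'between 25 and 40': 0.38, 'above 40': 0.33}
print(marginal_x)   # [0.33, 0.45, 0.22]
print(round(sum(marginal_x), 2))   # 1.0 (normalization check)
```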

p(A ∩ B) = (8/10) × (8/10) = 64/100 = 0.64

(b) What is the probability that two bulbs drawn in sequence (i.e., not replaced) are good, where the status of the bulb after the first draw is known to be good? From Eq. 2.8, p(both bulbs drawn are good):

p(A ∩ B) = p(A) · p(B/A) = (8/10) × (7/9) = 28/45 = 0.622 ■
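The with/without-replacement arithmetic of Example 2.2.6 can be verified exactly with the standard library's `Fraction` type (the variable names below are our own):

```python
from fractions import Fraction

# Example 2.2.6: 2 defective bulbs mixed with 8 good ones (10 in total)
good, total = 8, 10

# (a) With replacement: the two draws are independent (Eq. 2.8)
p_with = Fraction(good, total) * Fraction(good, total)

# (b) Without replacement: p(A and B) = p(A) * p(B/A)
p_without = Fraction(good, total) * Fraction(good - 1, total - 1)

print(p_with, float(p_with))                   # 16/25 0.64
print(p_without, round(float(p_without), 3))   # 28/45 0.622
```

Exact rational arithmetic sidesteps the rounding questions that arise when chaining many conditional probabilities in floating point.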


Example 2.2.7
Two events A and B have the following probabilities: p(A) = 0.3, p(B) = 0.4, and p(Ā ∩ B) = 0.28.

(a) Determine whether the events A and B are independent or not. From Eq. 2.5, p(Ā) = 1 − p(A) = 0.7. Next, one will verify whether Eq. 2.8 holds or not. In this case, one needs to verify whether p(Ā ∩ B) = p(Ā) · p(B), i.e., whether 0.28 is equal to (0.7 × 0.4). Since this is correct, one can state that events A and B are independent.

(b) Find p(A ∪ B). From Eqs. 2.6 and 2.8:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B) = p(A) + p(B) − p(A) · p(B) = 0.3 + 0.4 − (0.3)(0.4) = 0.58 ■

Example 2.2.8
Generating a probability tree for a residential air-conditioning (AC) system. Assume that the AC is slightly undersized for the house it serves. There are two possible outcomes (S, satisfactory; and NS, not satisfactory) depending on whether the AC is able to maintain the desired indoor temperature. The outcomes depend on the outdoor temperature, and, for simplicity, its annual variability is grouped into three categories: very hot (VH), hot (H), and not hot (NH). The probabilities for outcomes S and NS in each of the three day-type categories are shown in the conditional probability tree diagram (Fig. 2.4), while the joint probabilities computed following Eq. 2.8 are assembled in Table 2.2. Note that the probabilities of the three branches of the first stage add to unity, as do the two outcome branches within each day-type (e.g., for Very Hot, the S and NS outcomes add to 1.0, and so on). Further, note that the joint probabilities shown in the table must also sum to unity (it is advisable to perform such verification checks). The probability of the indoor conditions being satisfactory is determined as p(S) = 0.02 + 0.27 + 0.6 = 0.89, while p(NS) = 0.08 + 0.03 + 0 = 0.11. It is wise to verify that p(S) + p(NS) = 1.0. ■

Fig. 2.4 The conditional probability tree for the residential air-conditioner when two outcomes are possible (S, satisfactory; and NS, not satisfactory) for each of three day-types (VH, very hot; H, hot; and NH, not hot). (Example 2.2.8)

Table 2.2 Joint probabilities of various outcomes (Example 2.2.8)
p(VH ∩ S) = 0.1 × 0.2 = 0.02     p(VH ∩ NS) = 0.1 × 0.8 = 0.08
p(H ∩ S) = 0.3 × 0.9 = 0.27      p(H ∩ NS) = 0.3 × 0.1 = 0.03
p(NH ∩ S) = 0.6 × 1.0 = 0.6      p(NH ∩ NS) = 0.6 × 0 = 0
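The tree computation of Example 2.2.8 is a two-level multiplication followed by a sum, which can be sketched as follows (a minimal illustration; the dictionary names are our own):

```python
# First-stage (day-type) probabilities and the conditional p(S | day type)
p_daytype = {"VH": 0.1, "H": 0.3, "NH": 0.6}
p_S_given = {"VH": 0.2, "H": 0.9, "NH": 1.0}

# Joint probabilities via Eq. 2.8: p(day and S) = p(day) * p(S | day)
joint_S  = {d: p_daytype[d] * p_S_given[d] for d in p_daytype}
joint_NS = {d: p_daytype[d] * (1 - p_S_given[d]) for d in p_daytype}

# Marginal probabilities of the two outcomes
p_S = sum(joint_S.values())    # 0.02 + 0.27 + 0.6  = 0.89
p_NS = sum(joint_NS.values())  # 0.08 + 0.03 + 0.0  = 0.11

print(round(p_S, 2), round(p_NS, 2))   # 0.89 0.11
```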

Table 2.3 Probabilities of various outcomes (Example 2.2.9)
p(A ∩ R) = 1/2 × 1/2 = 1/4      p(B ∩ R) = 1/2 × 3/4 = 3/8
p(A ∩ W) = 1/2 × 1/2 = 1/4      p(B ∩ W) = 1/2 × 0 = 0
p(A ∩ G) = 1/2 × 0 = 0          p(B ∩ G) = 1/2 × 1/4 = 1/8

Example 2.2.9 Two-stage experiment Consider a problem where there are two boxes with marbles as specified: Box A : ð1 red and 1 whiteÞ; and Box B : ð3 red and 1 greenÞ: A box is chosen at random and a marble drawn from it. What is the probability of getting a red marble? One is tempted to say that since there are 4 red marbles in total out of 6 marbles, the probability is 2/3. However, this is incorrect, and the proper analysis approach requires that one frames this problem as a two-stage experiment. The first stage is the selection of the box, and the second the drawing of the marble. Let event A (or event B) denote choosing Box A (or Box B). Let R, W, and G represent red, white, and green marbles. The resulting probabilities are shown in Table 2.3.


Fig. 2.5 The first stage of the forward probability tree diagram involves selecting a box (either A or B) while the second stage involves drawing a marble that can be red (R), white (W ), or green (G) in color. The total probability of drawing a red marble is 5/8. (Example 2.2.9)

Thus, the probability of getting a red marble = 1/4 + 3/8 = 5/8. The above example is depicted in Fig. 2.5, where the reader can visually note how the probabilities propagate through the probability tree. This is called the "forward tree" to differentiate it from the "reverse tree" discussed in Sect. 2.5. The above example illustrates how a two-stage experiment must be approached. First, one selects a box, which by itself does not tell us whether the marble is red (since one has yet to pick a marble). Only after a box is selected can one use the prior probabilities regarding the color of the marbles inside the box in question to determine the probability of picking a red marble. These prior probabilities can be viewed as conditional probabilities; for example, p(A ∩ R) = p(R/A) · p(A). ■
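The two-stage reasoning of Example 2.2.9 (law of total probability over the box choice) can be enumerated exactly. This is an illustrative sketch; the data structure and function name are our own:

```python
from fractions import Fraction

# Example 2.2.9: pick a box at random, then draw a marble from it
boxes = {
    "A": {"R": 1, "W": 1},    # Box A: 1 red, 1 white
    "B": {"R": 3, "G": 1},    # Box B: 3 red, 1 green
}

def p_color(color):
    """p(color) = sum over boxes of p(box) * p(color | box)."""
    total = Fraction(0)
    for contents in boxes.values():
        n = sum(contents.values())
        total += Fraction(1, len(boxes)) * Fraction(contents.get(color, 0), n)
    return total

print(p_color("R"))   # 5/8
```

Note that the naive answer 4/6 = 2/3 never appears: conditioning on the box choice is what produces the correct 5/8.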

2.2.5 Permutations and Combinations

The study of probability requires a sound knowledge of combinatorial mathematics, which is concerned with developing rules for situations involving permutations and combinations.

(a) Permutation P(n, k) is the number of ways in which k objects can be selected from n objects with the order being important. It is given by:

P(n, k) = n!/(n − k)!      (2.11)

A special case is the number of permutations of n objects taken n at a time:

P(n, n) = n! = n(n − 1)(n − 2) . . . (2)(1)      (2.12)

(b) Combination C(n, k) is the number of ways in which k objects can be selected from n objects with the order not being important. It is given by:

C(n, k) = n!/[(n − k)! k!]      (2.13)

Note that the same equation also defines the binomial coefficients, since the expansion of (a + b)^n according to the binomial theorem is:

(a + b)^n = Σ_{k=0}^{n} C(n, k) a^(n−k) b^k      (2.14)

Example 2.2.10
(a) Calculate the number of ways in which three people from a group of seven people can be seated in a row. This is a case of permutation since the order is important. The number of possible ways is given by Eq. 2.11:

P(7, 3) = 7!/(7 − 3)! = (7) · (6) · (5) = 210

(b) Calculate the number of combinations in which three people can be selected from a group of seven. Here the order is not important, and the combination formula can be used (Eq. 2.13). Thus:

C(7, 3) = 7!/[(7 − 3)! 3!] = (7) · (6) · (5)/[(3) · (2)] = 35 ■
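Example 2.2.10 can be cross-checked with the standard library, which (from Python 3.8 on) provides both counting functions directly:

```python
import math

# Eq. 2.11: ordered selections (permutations)
print(math.perm(7, 3))   # 210 ways to seat 3 of 7 people in a row

# Eq. 2.13: unordered selections (combinations)
print(math.comb(7, 3))   # 35 ways to select 3 of 7 people

# Eq. 2.14 sanity check: binomial coefficients of (1 + 1)^n sum to 2**n
n = 7
print(sum(math.comb(n, k) for k in range(n + 1)))   # 128
```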


Table 2.4 Number of combinations for equipment scheduling in a large physical plant of a campus (status: 0, off; 1, on)

Case                                       Prime-movers         Boilers          Chillers (vapor compression)   Chillers (absorption)   Number of combinations
One of each                                0, 1                 0, 1             0, 1                           0, 1                    2^4 = 16
Two of each: assumed identical             0–0, 0–1, 1–1        0–0, 0–1, 1–1    0–0, 0–1, 1–1                  0–0, 0–1, 1–1           3^4 = 81
Two of each: non-identical except boilers  0–0, 0–1, 1–0, 1–1   0–0, 0–1, 1–1    0–0, 0–1, 1–0, 1–1             0–0, 0–1, 1–0, 1–1      4^3 × 3^1 = 192

Another type of combinatorial problem is the factorial problem, to be discussed in Chap. 6 while dealing with design of experiments. Consider a specific example involving equipment scheduling at a physical plant of a large campus that includes prime-movers (diesel engines or turbines that produce electricity), boilers, and chillers (vapor compression and absorption machines). Such equipment needs a certain amount of time to come online, and so operators typically keep some units "idling" so that they can start supplying electricity/heating/cooling at a moment's notice. Their operating states can be designated by a binary variable, say "1" for on-status and "0" for off-status. Extensions of this concept include cases where, instead of two states, one could have m states. An example of three states is when, say, two identical boilers are to be scheduled. One could have three states altogether: (i) when both are off (0–0), (ii) when both are on (1–1), and (iii) when only one is on (1–0). Since the boilers are identical, state (iii) is identical to 0–1. In case the two boilers are of different size, there would be four possible states. The number of combinations possible for "n" such equipment, where each one can assume "m" states, is given by m^n. Some simple cases for scheduling four different types of energy equipment in a physical plant are shown in Table 2.4.
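The m^n state counts of Table 2.4 can be confirmed by brute-force enumeration. The sketch below is illustrative only; the function name and list layout are our own:

```python
from itertools import product

# Each of n machines can be in one of m states; enumerate every joint schedule.
def count_states(states_per_machine):
    """Number of joint schedules = product of per-machine state counts (m**n when equal)."""
    return len(list(product(*[range(m) for m in states_per_machine])))

# Cases of Table 2.4 (prime-movers, boilers, vapor-compression chillers, absorption chillers):
print(count_states([2, 2, 2, 2]))   # 16: one of each, on/off only
print(count_states([3, 3, 3, 3]))   # 81: two identical units of each type
print(count_states([4, 3, 4, 4]))   # 192: non-identical pairs, except the two boilers
```

For large plants one would of course compute the product directly rather than materialize the list, but the enumeration makes the correspondence with Table 2.4 explicit.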

2.3 Probability Distribution Functions

2.3.1 Density Functions

The notions of discrete and continuous random variables were introduced in Sect. 2.2.1. The concepts relevant to discrete outcomes of events or discrete random variables were addressed in Sect. 2.2, and these will now be extended to continuous random variables. The distribution of a random variable represents the probability of it taking its various possible values. For example, if the y-axis in Fig. 1.2 of the six-faced dice-rolling experiment were to be changed into a relative frequency (= 1/6), the resulting histogram would graphically represent the corresponding probability density function (PDF) (Fig. 2.6a). Thus, the probability of getting a 2 in the rolling of a dice is 1/6. Since this is a discrete random variable, the function takes on specific values at discrete points of the x-axis (which represents the outcomes).

The same type of y-axis normalization done to the data shown in Fig. 1.3 would result in the PDF for the case of continuous random data. This is shown in Fig. 2.7a for the random variable taken to be the hourly outdoor dry-bulb temperature over the year at Philadelphia, PA. Notice that this is the envelope of the histogram of Fig. 1.3. Since the variable is continuous, it is meaningless to try to determine the probability of, say, a temperature outcome of exactly 57.5 °F. One would instead be interested in the probability of outcomes within a range, say 55–60 °F. The probability can then be determined as the area under the PDF, as shown in Fig. 2.7b. It is for such continuous random variables that the cumulative distribution function (CDF) is useful. It is simply the cumulative area under the curve starting from the lowest value of the random variable X to the current value (Fig. 2.8). The vertical scale directly gives the probability (or, in this case, the fractional time) that X is less than or equal to a certain value. Thus, the probability (X ≤ 60) is about 0.58. The concept of CDF also applies to discrete variables, but as a discontinuous stepped curve, as illustrated in Fig. 2.6b for the dice-rolling example. To restate, depending on whether the random variable is discrete or continuous, one gets discrete or continuous PDFs. Though most experimentally gathered data is discrete, the underlying probability theory is based on the data being continuous. Replacing the integration sign by the summation sign in the equations that follow allows extending the definitions to discrete distributions. Let f(x) be the PDF associated with a random variable X. This is a function that provides the probability that the random variable X takes on a particular value x among its various possible values. The axioms of probability (Eqs. 2.2 and 2.3) for the discrete case are expressed for the case of continuous random variables as:

• The PDF cannot be negative: f(x) ≥ 0
• The PDF must integrate to unity: ∫_{−∞}^{+∞} f(x) dx = 1

In problems where the normal distribution is used, it is more convenient to standardize the random variable into a

2.4 Important Probability Distributions


Fig. 2.16 The shaded areas are the physical representations of the tabulated standardized probability values in Table A3 (corresponds to Example 2.4.9ii): (a) lower limit, p(z ≤ −1.2) = 0.1151; (b) upper limit, p(z ≤ −0.8) = 0.2119

new random variable z ≡ (x − μ)/σ with mean zero and variance of unity. This results in the standard normal curve or z-curve:

N(z; 0, 1) = (1/√(2π)) exp(−z²/2)      (2.42b)

In actual problems, the standard normal distribution is used to determine the probability of the random variable having a value within a certain interval, say z between z1 and z2. Then Eq. 2.42b can be modified into:

N(z1 ≤ z ≤ z2) = ∫_{z1}^{z2} (1/√(2π)) exp(−z²/2) dz      (2.42c)

The shaded area in Table A3 permits evaluating the above integral, i.e., determining the associated probability assuming z1 = −∞. Note that for z = 0, the probability given by the shaded area is equal to 0.5. Since not all texts adopt the same format in which to present these tables, the user is urged to use caution in interpreting the values shown in such tables.

Example 2.4.9
Graphical interpretation of probability using the standard normal table
Resistors made by a certain manufacturer have a nominal value of 100 ohms, but their actual values are normally distributed with a mean of μ = 100.6 ohms and standard deviation σ = 3 ohms. Find the percentage of resistors that will have values:

(i) Higher than the nominal rating. The standard normal variable z(X = 100) = (100 − 100.6)/3 = −0.2. From

Table A3, this corresponds to a probability of (1 – 0.4207) = 0.5793 or 57.93%. (ii) Within 3 ohms of the nominal rating (i.e., between 97 and 103 ohms). The lower limit z1 = (97 - 100.6)/ 3 = - 1.2, and the tabulated probability from Table A3 is p(z = –1.2) = 0.1151 (as illustrated in Fig. 2.16a). The upper limit is: z2 = (103 - 100.6)/3 = 0.8. However, care should be taken in properly reading the corresponding value from Table A3, which only gives probability values of z < 0. One first determines the probability about the negative value symmetric about 0, i.e., p(z = –0.8) = 0.2119 (shown in Fig. 2.16b). Since the total area under the curve is 1.0, p (z = 0.8) = 1.0 – 0.2119 = 0.7881. Finally, the required probability p(–1.2 < z < 0.8) = (0.7881 – 0.1151) = 0.6730 or 67.3%. ■ Inspection of Table A3 allows the following statements to be made, which are often adopted during statistical inferencing: • The interval μ ± σ contains approximately [1 – 2 (0.1587)] = 0.683 or 68.3% of the observations. • The interval μ ± 2σ contains approximately 95.4% of the observations. • The interval μ ± 3σ contains approximately 99.7% of the observations. Another manner of using the standard normal table is for the “backward” problem. In such cases, instead of being specified the z value and having to deduce the probability, now the probability is specified, and the z value is to be deduced.
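The table lookups of Example 2.4.9 can be reproduced without tables, since the standard normal CDF is expressible through the error function available in Python's `math` module (the function name `phi` is our own):

```python
import math

def phi(z):
    """Standard normal CDF via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 2.4.9: resistor values ~ N(mu = 100.6, sigma = 3)
mu, sigma = 100.6, 3.0

# (i) fraction above the nominal 100 ohms
p_above = 1.0 - phi((100 - mu) / sigma)
print(round(p_above, 4))    # 0.5793, i.e., 57.93%

# (ii) fraction within 3 ohms of nominal (between 97 and 103 ohms)
p_within = phi((103 - mu) / sigma) - phi((97 - mu) / sigma)
print(round(p_within, 3))   # 0.673, i.e., 67.3%
```

The same `phi` helper also reproduces the μ ± σ, μ ± 2σ, and μ ± 3σ coverage figures quoted below.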


Example 2.4.10
Reinforced and pre-stressed concrete structures are designed so that the compressive stresses are carried mostly by the concrete itself. For this and other reasons, the main criterion by which the quality of concrete is assessed is its compressive strength. Specifications for concrete used in civil engineering jobs may require specimens of specified size and shape (usually cubes) to be cast and tested on site. One can assume the normal distribution to apply. If the mean and standard deviation of this distribution are μ and σ, and the civil engineer wishes to determine the "statistical minimum strength" x, specified as the strength below which only say 5% of the cubes are expected to fail, one searches Table A3 and determines the value of z for which the probability is 0.05, i.e., p(z = −1.645) = 0.05. This would correspond to x = μ − 1.645σ. ■

Fig. 2.17 Comparison of the normal (or Gaussian) z curve to two Student's t-curves with different degrees of freedom (d.f.). As the d.f. decreases, the PDF for the Student's t-distribution flattens out and deviates increasingly from the normal distribution

(b) Student's t-Distribution

One important application of the normal distribution is that it allows making statistical inferences about population means from random samples (see Sect. 4.2). In case the random samples are small (say, n < 30), then the Student's t-distribution, rather than the normal distribution, should be used. If one assumes that the sampled population is approximately normally distributed, then the random variable

t = (x̄ − μ)/(s/√n)

has the Student's t-distribution t(μ, s, ν), where x̄ is the sample mean, s is the sample standard deviation, and ν is the degrees of freedom = (n − 1). Thus, the number of degrees of freedom (d.f.) equals the number of data points minus the number of constraints or restrictions placed on the data. Table A4 (which is set up differently from the standard normal table) provides numerical values of the t-distribution for different degrees of freedom at different confidence levels. How to use these tables will be discussed in Sect. 4.2. Unlike the z curve, one has a family of t-distributions for different values of ν. Qualitatively, the t-distributions are similar to the standard normal distribution in that they are symmetric about a zero mean, while they are slightly wider than the corresponding normal distribution. In terms of probability values represented by areas under the curves, as in Example 2.4.11, the differences between the normal and the Student's t-distributions are large enough to warrant retaining this distinction (Fig. 2.17).

Example 2.4.11
Differences between normal and Student's t-inferences
Consider Example 2.4.10, where the distribution of the strength of concrete samples tested followed a normal distribution with mean and standard deviation μ and σ. The probability that the "minimum strength" x, specified as the strength below which only say 5% of the cubes are expected to fail, was determined from Table A3 to be p(z = −1.645) = 0.05. It was then inferred that the statistical strength would correspond to x = μ − 1.645σ. The Student's t-distribution allows one to investigate how this interval changes when different numbers of samples are taken. The mean and standard deviation now correspond to those of the sample. The critical values are the multipliers of the standard deviation, and the results are assembled in the table below for different numbers of samples tested (found from Table A4 for a single-tailed distribution of 95%). For an infinite number of samples, one gets back the critical value found for the normal distribution.

Number of samples     5       10      30      ∞
Degrees of freedom    4       9       29      ∞
Critical values       2.132   1.833   1.699   1.645  ■


(c) Lognormal Distribution

This distribution is appropriate for non-negative skewed data for which the symmetrical normal distribution is no longer appropriate. If a variate X is such that ln(X) is normally distributed, then the distribution of X is said to be lognormal. With X ranging from 0 to +∞, ln(X) would range from −∞ to +∞. It is characterized by two parameters, the mean and standard deviation (μ, σ), as follows:

L(x; μ, σ) = [1/(σ·x·√(2π))] exp[−(ln x − μ)²/(2σ²)]   when x > 0
           = 0                                          elsewhere      (2.43)

Note that the two parameters pertain to the logarithmic mean and standard deviation, i.e., to ln(X) and not to the random variable X itself. The lognormal curves are a family of skewed curves, as illustrated in Fig. 2.18. Lognormal failure laws apply when the degradation in lifetime of a system is proportional to the current state of degradation. Typical applications involve flood frequency in civil engineering, crack growth and mechanical wear in mechanical engineering, failure of electrical transformers or faults in electrical cables in electrical engineering, and pollutants produced by chemical plants and threshold values for drug dosage in environmental engineering. Lognormal distributions are also often used to characterize "fragility curves," which represent the probability of damage due to extreme natural events (hurricanes, earthquakes, etc.) to the built environment such as buildings and other infrastructure. For example, Winkler et al. (2010) adopted topological and terrain-specific (μ, σ) parameters to model failure of electric power transmission lines, sub-stations, and supporting towers/poles during hurricanes.

Fig. 2.18 Lognormal distributions for different mean and standard deviation values

Example 2.4.12
Using lognormal distributions for pollutant concentrations
Concentration of pollutants produced by chemical plants is often modeled by lognormal distributions and is used to evaluate compliance with government regulations. The concentration of a certain pollutant, in parts per million (ppm), is assumed lognormal with parameters μ = 4.6 and σ = 1.5. What is the probability that the concentration exceeds 10 ppm? One can use Eq. 2.43, or, simpler still, use the z tables (Table A3) by suitable transformations of the random variable:

L(X > 10) = 1 − N[ln(10); 4.6, 1.5] = 1 − N((ln(10) − 4.6)/1.5) = 1 − N(−1.531) = 1 − 0.0630 = 0.937 ■

(d) Gamma Distribution

The gamma distribution (also called the Erlang distribution) is a good candidate for modeling random phenomena that can only be positive and are unimodal (akin to the lognormal distribution). The gamma distribution is derived from the gamma function for positive values of α, which, one may recall from mathematics, is defined by the integral:
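The exceedance probability of Example 2.4.12 can be checked by standardizing ln(X) and using the error function, as sketched below (the helper name `phi` is our own):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 2.4.12: pollutant concentration lognormal with mu = 4.6, sigma = 1.5
mu, sigma = 4.6, 1.5

# p(X > 10) = 1 - Phi((ln 10 - mu) / sigma), since ln(X) ~ N(mu, sigma)
p_exceed = 1.0 - phi((math.log(10.0) - mu) / sigma)
print(round(p_exceed, 3))   # 0.937
```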

Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx,   α > 0      (2.44a)

Integration results in the following expression for non-negative integers k:

Γ(k + 1) = k!      (2.44b)

The continuous random variable X has a gamma distribution with positive parameters α and λ if its density function is given by:

G(x; α, λ) = λ^α e^(−λx) x^(α−1)/(α − 1)!   x > 0
           = 0                              elsewhere      (2.44c)

The mean and variance of the gamma distribution are:

μ = α/λ  and  σ² = α/λ²      (2.44d)
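Eq. 2.44b can be checked numerically with Python's `math.gamma`; the short sketch below (illustrative only, with our own function name) also evaluates the Erlang-form PDF of Eq. 2.44c for an integer shape factor:

```python
import math

# Eq. 2.44b: Gamma(k + 1) = k! for non-negative integers k
for k in range(6):
    assert math.isclose(math.gamma(k + 1), math.factorial(k))

# Gamma PDF of Eq. 2.44c for integer alpha (Erlang form); mean is alpha/lam
def gamma_pdf(x, alpha, lam):
    return lam**alpha * math.exp(-lam * x) * x**(alpha - 1) / math.factorial(alpha - 1)

# e.g., the G(3, 1) curve of Fig. 2.19 evaluated at its mean x = 3
print(round(gamma_pdf(3.0, 3, 1.0), 3))   # 0.224
```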

Variation of the parameter α (called the shape factor) and λ (called the scale parameter) allows a wide variety of shapes to be generated (see Fig. 2.19). From Fig. 2.11, one notes that the gamma distribution is the parent distribution for many of the other distributions discussed. As α → ∞, the gamma distribution approaches the normal (see Fig. 2.11). When α = 1, one gets the exponential distribution. When α = ν/2 and λ = 1/2, one gets the chi-square distribution (discussed below).

Fig. 2.19 Gamma distributions for different combinations of the shape parameter α and the scale parameter β = 1/λ

(e) Exponential Distribution

A special case of the gamma distribution for α = 1 is the exponential distribution. It is the continuous-distribution analogue to the geometric distribution, which is applicable to discrete random variables. Its PDF is given by:

E(x; λ) = λ e^(−λx)   if x ≥ 0
        = 0           otherwise      (2.45a)

where λ is the mean value per unit time or distance. The mean and variance of the exponential distribution are:

μ = 1/λ  and  σ² = 1/λ²      (2.45b)

The distribution is represented by a family of curves for different values of λ (see Fig. 2.20). Exponential failure laws apply to products whose current age does not have much effect on their remaining lifetimes. Hence, this distribution is said to be "memoryless." It is used to model such processes as the interval between two occurrences, e.g., the distance between consecutive faults in a cable, the time between chance failures of a component (such as a fuse) or a system, the time between consecutive emissions of α-particles, or the time between successive arrivals at a service facility. The exponential and the Poisson distributions are closely related. While the latter represents the number of failures per unit time, the exponential represents the time between successive failures. In the context of faults in a long cable, if the number of faults per unit length is Poisson distributed, then the cable length between faults is exponentially distributed. The CDF of the exponential distribution is given by:

CDF[E(a, λ)] = ∫_0^a λ e^(−λx) dx = 1 − e^(−λa)      (2.45c)

Fig. 2.20 Exponential distributions for three different values of the parameter λ

Example 2.4.13
Temporary disruptions to the power grid can occur due to random events such as lightning, transformer failures, forest fires, etc. The exponential distribution has been known to be a good function to model such failures. If these occur, on average, say, once every 2.5 years, then λ = 1/2.5 = 0.40 per year.

(a) What is the probability that there will be a disruption within the next year? From Eq. 2.45c:

CDF[E(X ≤ 1; λ)] = 1 − e^(−0.4×1) = 1 − 0.6703 = 0.3297

(b) What is the probability that the next disruption will occur only after two years? This is the complement of a disruption occurring within two years:

Probability = 1 − CDF[E(X ≤ 2; λ)] = 1 − [1 − e^(−0.4×2)] = 0.4493 ■

(f) Weibull Distribution

Another widely used distribution is the Weibull distribution, which has been found to be applicable to datasets from a wide variety of systems and natural phenomena. It has been used to model the time of failure or life of a component as well as engine emissions of various pollutants. Moreover, the Weibull distribution has been found to be very appropriate to model reliability of a system, i.e., the failure time of the weakest component of a system (bearing, pipe joint failure, etc.). The continuous random variable X has a Weibull distribution with parameters α and β (shape and scale factors, respectively) if its density function is given by:

W(x; α, β) = (α/β^α) x^(α−1) exp[−(x/β)^α]   for x ≥ 0
           = 0                               elsewhere      (2.46a)

with mean

μ = β·Γ(1 + 1/α)      (2.46b)

Figure 2.21 shows the versatility of this distribution for different sets of α and β values. Also shown is the special case of W(1,1), which is the exponential distribution. For α > 1, the curves become close to bell-shaped and somewhat resemble the normal distribution. The expression for the CDF is given by:

CDF[W(x; α, β)] = 1 − exp[−(x/β)^α]   for x ≥ 0
                = 0                   elsewhere      (2.46c)

Fig. 2.21 Weibull distributions for different values of the two parameters α and β (the shape and scale factors, respectively)

Example 2.4.14
Modeling wind distributions using the Weibull distribution
The Weibull distribution is widely used to model the hourly variability of wind velocity. The mean wind speed and its distribution on an annual basis, which are affected by local climate conditions, terrain, and height of the tower, are important in order to determine annual power output from a wind turbine of a certain design whose efficiency changes with wind speed. It has been found that the shape factor α varies between 1 and 3 (when α = 2, the distribution is called the Rayleigh distribution). The probability distribution shown in Fig. 2.22 has a mean wind speed of 7 m/s. In this case:

(a) The numerical value of the parameter β, assuming the shape factor α = 2, can be calculated from the gamma function Γ(1 + 1/2) = 0.8862, from which β = μ/Γ(1 + 1/2) = 7/0.8862 = 7.9.
(b) Using the PDF given by Eq. 2.46a, it is left to the reader to compute the value of the PDF at a wind speed of 10 m/s (and verify the solution against Fig. 2.22, which indicates a value of about 0.064). ■

Fig. 2.22 PDF of the Weibull distribution W(2, 7.9) (Example 2.4.14)
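The worked values of Examples 2.4.13 and 2.4.14 can be verified numerically with the standard library; the sketch below is illustrative, with our own function names:

```python
import math

def exp_cdf(x, lam):
    """Exponential CDF, Eq. 2.45c: 1 - exp(-lam * x)."""
    return 1.0 - math.exp(-lam * x)

def weibull_pdf(x, alpha, beta):
    """Weibull PDF, Eq. 2.46a: (alpha/beta**alpha) * x**(alpha-1) * exp(-(x/beta)**alpha)."""
    return (alpha / beta**alpha) * x**(alpha - 1) * math.exp(-(x / beta)**alpha)

# Example 2.4.13: grid disruptions, lambda = 0.4 per year
print(round(exp_cdf(1.0, 0.4), 4))         # 0.3297 (disruption within a year)
print(round(1.0 - exp_cdf(2.0, 0.4), 4))   # 0.4493 (none within two years)

# Example 2.4.14: wind speed with alpha = 2 and beta = 7 / Gamma(1.5)
beta = 7.0 / math.gamma(1.5)               # ≈ 7.9 (Eq. 2.46b inverted for beta)
print(round(weibull_pdf(10.0, 2.0, beta), 4))   # 0.0645, i.e., about 0.064 as in Fig. 2.22
```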

(g) Chi-square Distribution
A third special case of the gamma distribution arises when α = ν/2 and λ = 1/2, where ν is a positive integer called the degrees of freedom. This distribution, called the chi-square (χ²) distribution, plays an important role in inferential statistics, where it is used as a test of significance for hypothesis testing and analysis-of-variance types of problems. It is based on the standard normal distribution with a mean of 0 and standard deviation of 1. Just like the t-statistic, there is a family of distributions for different values of ν (Fig. 2.23). The somewhat complicated PDF is given by Eq. 2.47a, but its usefulness lies in the determination of a range of values from this distribution; more specifically, it provides the probability of observing a value of χ² from 0 to a specified value (Wolberg, 2006). Note that the distribution cannot assume negative values and that it is positively skewed. Table A5 assembles critical values of the chi-square distribution for different values of the degrees of freedom parameter ν and for different significance levels. The usefulness of these tables will be discussed in Sect. 4.2. The PDF of the chi-square distribution is:

    χ²(x; ν) = [1 / (2^(ν/2) · Γ(ν/2))] · x^(ν/2 − 1) · e^(−x/2)   for x > 0
             = 0   elsewhere      (2.47a)

while the mean and variance values are:

    μ = ν   and   σ² = 2ν      (2.47b)

Fig. 2.23 Chi-square distributions χ²(1), χ²(4), and χ²(6) for different values of the variable ν denoting the degrees of freedom

(h) F-Distribution
While the t-distribution allows comparison between two sample means, the F-distribution allows comparison between two or more sample variances. It is defined as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. The F-distribution is also represented by a family of plots (Fig. 2.24), where each plot is specific to a pair of numbers representing the degrees of freedom of the two random variables (ν1, ν2). Table A6 assembles critical values of the F-distributions for different combinations of these two parameters, and its use will be discussed in Sect. 4.2.

Fig. 2.24 Typical F-distributions F(6,24) and F(6,5) for two different combinations of the degrees of freedom (ν1 and ν2)

(i) Uniform Distribution
The uniform probability distribution is the simplest of all PDFs and applies to both continuous and discrete data whose outcomes are all equally likely, i.e., have equal probabilities. Flipping a coin for heads/tails or rolling a six-sided die for numbers between 1 and 6 are examples that come readily to mind. The probability density function for the discrete case, where X can assume values x1, x2, . . ., xk, is given by:

    U(x; k) = 1/k,   x = x1, x2, . . ., xk      (2.48a)

with mean and variance:

    μ = (1/k) Σ_{i=1}^{k} xi   and   σ² = (1/k) Σ_{i=1}^{k} (xi − μ)²      (2.48b)
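The statement that the chi-square distribution is based on the standard normal can be made concrete: a χ² variate with ν degrees of freedom is the sum of ν squared independent N(0,1) draws, so its sample mean and variance should approach ν and 2ν (Eq. 2.47b). A small simulation sketch (illustrative only):

```python
import random

random.seed(42)
nu = 4          # degrees of freedom
N = 100_000     # number of simulated chi-square draws

# each draw: sum of nu squared standard normal deviates
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(nu)) for _ in range(N)]
mean = sum(draws) / N
var = sum((d - mean) ** 2 for d in draws) / N

print(round(mean, 2), round(var, 2))   # close to nu = 4 and 2*nu = 8
```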

2.4 Important Probability Distributions


Fig. 2.25 The uniform distribution assumed continuous over the interval [c, d]

For random variables that are continuous over an interval (c, d) as shown in Fig. 2.25, the PDF is given by:

    U(x) = 1/(d − c)   when c < x < d
         = 0   otherwise      (2.48c)

The mean and variance of the uniform distribution (using the notation shown in Fig. 2.25) are given by:

    μ = (c + d)/2   and   σ² = (d − c)²/12      (2.48d)

The probability of the random variable X being between, say, x1 and x2 is:

    U(x1 ≤ X ≤ x2) = (x2 − x1)/(d − c)      (2.48e)

Example 2.4.15 A random variable X has a uniform distribution with c = −5 and d = 10 (Fig. 2.25).
(a) On average, what proportion will have a negative value? (Answer: 1/3)
(b) On average, what proportion will fall between −2 and 2? (Answer: 4/15) ■
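Both answers follow directly from Eq. 2.48e; a quick check (the helper name is ours):

```python
def uniform_prob(x1, x2, c=-5.0, d=10.0):
    """P(x1 <= X <= x2) for a uniform distribution on (c, d), per Eq. 2.48e."""
    lo, hi = max(x1, c), min(x2, d)   # clip the interval to the support
    return (hi - lo) / (d - c)

print(uniform_prob(-5, 0))    # (a) 5/15 = 1/3
print(uniform_prob(-2, 2))    # (b) 4/15
```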

Fig. 2.26 Various shapes assumed by the Beta distribution depending on the values of the two model parameters

Another versatile distribution is the Beta distribution, which is appropriate for continuous random variables bounded between 0 and 1 (such as those representing proportions). It is a two-parameter model given by:

    Beta(x; p, q) = [(p + q − 1)! / ((p − 1)!(q − 1)!)] · x^(p−1) (1 − x)^(q−1)      (2.49a)

Depending on the values of p and q, one can model a wide variety of curves from u-shaped ones to skewed distributions (Fig. 2.26). The distributions are symmetrical

when p and q are equal, with the curves becoming peakier as the numerical values of the two parameters increase. Skewed distributions are obtained when the parameters are unequal. The mean and variance of the Beta distribution are:

    μ = p/(p + q)   and   σ² = pq / [(p + q)² (p + q + 1)]      (2.49b)
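A short numerical check of Eqs. 2.49a and 2.49b for integer parameters, using p = 4 and q = 2 (the same prior later adopted in Example 2.5.6); the function name is ours:

```python
from math import factorial

def beta_pdf(x, p, q):
    """Beta PDF, Eq. 2.49a, for integer p and q."""
    coeff = factorial(p + q - 1) / (factorial(p - 1) * factorial(q - 1))
    return coeff * x ** (p - 1) * (1 - x) ** (q - 1)

p, q = 4, 2
mean = p / (p + q)                            # Eq. 2.49b
var = p * q / ((p + q) ** 2 * (p + q + 1))    # Eq. 2.49b

# crude Riemann-sum check that the PDF integrates to ~1 over [0, 1]
n = 10_000
area = sum(beta_pdf(i / n, p, q) for i in range(1, n)) / n
print(round(mean, 3), round(var, 4), round(area, 3))   # 0.667 0.0317 1.0
```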


This distribution originates from the Binomial distribution, and one can detect the obvious similarity of a two-outcome affair with specified probabilities. The usefulness of this distribution will become apparent in Sect. 2.5.3, dealing with the Bayesian approach to problems involving continuous probability distributions.

2.5 Bayesian Probability

2.5.1 Bayes' Theorem

It was stated in Sect. 2.1.2 that the Bayesian viewpoint can enhance the usefulness of the classical frequentist notion of probability.3 Its strength lies in the fact that it provides a framework to include prior information in a two-stage (or multistage) experiment. If one substitutes the term p(A) in Eq. 2.10 by that given by Eq. 2.9 (known as Bayes' Rule), one gets:

    p(B/A) = p(A ∩ B) / [p(A ∩ B) + p(A ∩ B̄)]      (2.50)

Also, one can re-arrange Eq. 2.10 into: p(A ∩ B) = p(A) · p(B/A) = p(B) · p(A/B). This allows expressing Eq. 2.50 in the following form, referred to as the law of total probability or Bayes' theorem:

    p(B/A) = p(A/B) p(B) / [p(A/B) p(B) + p(A/B̄) p(B̄)]      (2.51)

Bayes' theorem, superficially, appears to be simply a restatement of the conditional probability equation given by Eq. 2.10. The question is why is this reformulation so insightful or advantageous? First, the probability is now re-expressed in terms of its disjoint parts B and B̄, and second, the probabilities have been "flipped," i.e., p(B/A) is now expressed in terms of p(A/B). Consider the two events A and B. If event A is observed while event B is not, this expression allows one to infer the "flip" probability, i.e., the probability of occurrence of B from that of the observed event A. In Bayesian terminology, Eq. 2.51 can be written as:

    Posterior probability of event B given event A
        = (Likelihood of A given B) × (Prior probability of B) / (Prior probability of A)      (2.52)

Thus, the probability p(B) is called the prior probability (or unconditional probability) since it represents the opinion before any data were collected, while p(B/A) is said to be the posterior probability, which is reflective of the opinion revised in light of new data. The term "likelihood" is synonymous with "the conditional probability" of A given B, i.e., p(A/B). Equation 2.51 applies to the case when only one of two events is possible. It can be extended to the case of more than two events that partition the space S. Consider the case where one has n events, B1. . .Bn, which are disjoint and make up the entire sample space. Figure 2.27 shows a sample space of four events. Then, the law of total probability states that the probability of an event A is the sum of its disjoint parts:

    p(A) = Σ_{j=1}^{n} p(A ∩ Bj) = Σ_{j=1}^{n} p(A/Bj) p(Bj)      (2.53)

Then

    p(Bi/A) = p(A/Bi) p(Bi) / Σ_{j=1}^{n} p(A/Bj) p(Bj)      (2.54)

where p(A/Bi) is the likelihood, p(Bi) the prior, and p(Bi/A) the posterior probability.

3 There are several texts that deal only with Bayesian statistics; for example, Bolstad (2004).

This expression is known as Bayes’ theorem for multiple events. To restate, the marginal or prior probabilities p(Bi) for i = 1, . . ., n are assumed to be known in advance, and the intention is to update or revise our “belief” on the basis of the observed evidence of event A having occurred. This is captured by the probability p(Bi/A) for i = 1, . . ., n called the posterior probability. This is the weight one can attach to each event Bi after event A is known to have occurred.
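Equation 2.54 amounts to multiplying each prior by its likelihood and renormalizing. A minimal sketch in Python (the function name is ours, not the book's); the numbers reproduce the marble experiment of Example 2.5.1:

```python
def bayes_update(priors, likelihoods):
    """Posterior p(Bi/A) via Eq. 2.54: likelihood * prior, normalized by p(A)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)                 # law of total probability, Eq. 2.53
    return [j / total for j in joint]

# Two boxes picked with prior 1/2 each; p(red/A) = 1/2, p(red/B) = 3/4
posterior = bayes_update([0.5, 0.5], [0.5, 0.75])
print(posterior)   # Box A: 2/5, Box B: 3/5
```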

Fig. 2.27 Bayes' theorem for multiple events depicted on a Venn diagram. In this case, the sample space S is assumed to be partitioned into four discrete events B1. . .B4. If an observable event A shown by the circle has already occurred, the conditional probability of B3 is p(B3/A) = p(B3 ∩ A)/p(A). This is the ratio of the hatched area to the total area inside the ellipse


Example 2.5.1 Consider the two-stage experiment of Example 2.2.9 with six marbles of three colors in two boxes. Assume that the experiment has been performed and that a red marble has been obtained. One can use the information known beforehand, i.e., the prior probabilities of R, W, and G, to determine which box the marble came from. Note that the probability of the red marble having come from Box A, represented by p(A/R), is now the conditional probability of the "flip" problem. This is called the posterior probability of event A with event R having occurred. Thus, from the law of total probability, the posterior conditional probabilities (Eq. 2.51) are:

– For the red marble to be from Box B:

    p(B/R) = p(R/B) p(B) / [p(R/B) p(B) + p(R/B̄) p(B̄)]
           = (3/4)(1/2) / [(3/4)(1/2) + (1/2)(1/2)] = 3/5

– For the red marble to be from Box A:

    p(A/R) = (1/2)(1/2) / [(1/2)(1/2) + (3/4)(1/2)] = 2/5

The reverse probability tree for this experiment is shown in Fig. 2.28. The reader is urged to compare this with the forward tree diagram of Example 2.2.9. The probabilities of 1.0 for both W and G outcomes imply that there is no uncertainty at all in predicting where the marble came from. This is obvious since only Box A contains W, and only Box B contains G. However, for the red marble, one cannot be sure of its origin, and this is where a probability measure must be determined. ■ Example 2.5.2 Forward and reverse probability trees for fault detection of equipment A large piece of equipment is being continuously monitored by an add-on fault detection system developed by another vendor in order to detect faulty operation. The vendor of the fault detection system states that their product correctly identifies faulty operation when indeed it is faulty (this is referred to as sensitivity) 90% of the time. This implies that there is a probability p = 0.10 of a “false negative” occurring

Fig. 2.28 The probabilities of the reverse tree diagram at each stage are indicated. If a red marble (R) is picked, the probabilities that it came from either Box A or Box B are 2/5 and 3/5, respectively (Example 2.5.1)

(i.e., a missed opportunity of signaling a fault). Also, the vendor quoted that the correct status prediction rate or specificity of the detection system (i.e., the system identified as healthy when indeed it is so) is 0.95, implying that the "false positive" or false alarm rate is 0.05. Finally, historic data seem to indicate that the large piece of equipment tends to develop faults only 1% of the time. Figure 2.29 shows how this problem can be systematically represented by a forward tree diagram. State A is the fault-free state and state B is the faulty state. Further, each of these states can have two outcomes as shown. While outcomes A1 and B1 represent, respectively, correctly identified fault-free and faulty operations, the other two outcomes are errors arising from an imperfect fault detection system. Outcome A2 is the false positive event (or false alarm or type I error, which will be discussed at length in Sect. 4.2), while outcome B2 is the false negative event (or missed opportunity or type II error). The figure clearly illustrates that the probabilities of A and B occurring, along with the conditional probabilities p(A1/A) = 0.95 and p(B1/B) = 0.90, result in the probabilities of each of the four states as shown in the figure. The reverse tree situation, shown in Fig. 2.30, corresponds to the following situation. A fault has been signaled. What is the probability that this is a false alarm?


Fig. 2.29 The forward tree diagram showing the four events that may result when monitoring the performance of a piece of equipment (Example 2.5.2)


Fig. 2.30 Reverse tree diagram depicting two possibilities. If an alarm sounds, it could be either an erroneous one (outcome A from A2) or a valid one (B from B1). Further, if no alarm sounds, there is still the possibility of missed opportunity (outcome B from B2). The probability that it is a false alarm is 0.846, which is too high to be acceptable in practice. How to decrease this is discussed in the text

Using Eq. 2.51:

    p(A/A2) = p(A2/A) p(A) / [p(A2/A) p(A) + p(B1/B) p(B)]
            = (0.05)(0.99) / [(0.05)(0.99) + (0.90)(0.01)]
            = 0.0495 / (0.0495 + 0.009) = 0.846

Working backward using the forward tree diagram (Fig. 2.29) allows one to visually understand the basis of the quantities appearing in the expression above. The value of 0.846 for the false alarm probability is very high for practical situations and could well result in the operator disabling the fault detection system altogether. One way of reducing this false alarm rate, and thereby enhancing robustness, is to increase the sensitivity of the detection device from its current 90% to something higher by altering the detection threshold. This would result in a higher missed opportunity rate, which one must accept as the price of reduced false alarms. For example, the current missed opportunity rate is:

    p(B/B2) = p(B2/B) p(B) / [p(B2/B) p(B) + p(A1/A) p(A)]
            = (0.10)(0.01) / [(0.10)(0.01) + (0.95)(0.99)]
            = 0.001 / (0.001 + 0.9405) ≈ 0.001
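Both "flip" probabilities above follow the same total-probability pattern; a minimal sketch in Python (variable names are ours):

```python
p_fault = 0.01          # equipment faulty 1% of the time
sensitivity = 0.90      # p(alarm | fault)
specificity = 0.95      # p(no alarm | no fault)

p_ok = 1 - p_fault
false_alarm_rate = 1 - specificity   # p(alarm | no fault) = 0.05
miss_rate = 1 - sensitivity          # p(no alarm | fault) = 0.10

# Given an alarm, probability the equipment is actually healthy (false alarm):
p_false_alarm = (false_alarm_rate * p_ok) / (
    false_alarm_rate * p_ok + sensitivity * p_fault)
# Given no alarm, probability of a missed fault (missed opportunity):
p_missed = (miss_rate * p_fault) / (miss_rate * p_fault + specificity * p_ok)

print(round(p_false_alarm, 3), round(p_missed, 4))   # 0.846 and ~0.001
```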

This is probably lower than what is needed, and so the above-suggested remedy is one that can be considered. A practical way to reduce the false alarm rate is not to take action when a single alarm is sounded but to do so only when several faults are flagged. Such procedures are adopted by industrial process engineers using control chart techniques (see Sect. 8.7). Note that as the piece of machinery degrades, the percent of time when faults are likely to develop will increase from the current 1% to something higher. This will have the effect of lowering the false alarm rate (left to the reader to convince himself why). ■

Bayesian statistics provide the formal manner by which prior opinion expressed as probabilities can be revised in the light of new information (from additional data collected) to yield posterior probabilities. When combined with the relative consequences or costs of being right or wrong, it allows one to address decision-making problems, as pointed out in the example above. It has had some success in engineering (as well as in the social sciences) where subjective judgment, often referred to as intuition or experience gained in the field, is relied upon heavily. Bayes' theorem is a consequence of the probability laws and is accepted by all statisticians. It is the interpretation of probability that is controversial. The two approaches differ in how probability is defined:

• Classical viewpoint: long-run relative frequency of an event.
• Bayesian viewpoint: degree of belief held by a person about some hypothesis, event, or uncertain quantity (Phillips 1973).


Advocates of the classical approach argue that human judgment is fallible while dealing with complex situations, and this was the reason why formal statistical procedures were developed in the first place. Introducing the vagueness of human judgment as done in Bayesian statistics would dilute the “purity” of the entire mathematical approach. Advocates of the Bayesian approach, on the other hand, argue that the “personalist” definition of probability should not be interpreted as the “subjective” view. Granted that the prior probability varies from one individual to the other based on their own experience, but with additional data collection all these views get progressively closer. Thus, with enough data, the initial divergent opinions would become indistinguishable. Hence, they argue, the Bayesian method brings consistency to informal thinking when complemented with collected data, and should, thus, be viewed as a mathematically valid approach.

2.5.2 Application to Discrete Probability Variables

The following examples illustrate how the Bayesian approach can be applied to discrete data.

Example 2.5.3 (From Walpole et al. (2007), by permission of Pearson Education.) Consider a machine whose prior PDF of the proportion X of defectives is given by Table 2.7. If a random sample of size 2 is selected, and one defective is found, the Bayes' estimate of the proportion of defectives produced by the machine is determined as follows. Let y be the number of defectives in the sample. The probability that the random sample of size 2 yields one defective is given by the Binomial distribution since this is a two-outcome situation:

    f(y/x) = B(y; 2, x) = C(2, y) x^y (1 − x)^(2−y),   y = 0, 1, 2

If x = 0.1, then f(1/0.1) = B(1; 2, 0.1) = C(2, 1)(0.1)¹(0.9)¹ = 0.18. Similarly, for x = 0.2, f(1/0.2) = 0.32.

Table 2.7 Prior PDF of proportion of defectives (x) (Example 2.5.3)

    x      0.1    0.2
    f(x)   0.6    0.4

Thus, the total probability of finding one defective in a sample size of 2 is:

    f(y = 1) = (0.18)(0.6) + (0.32)(0.4) = 0.108 + 0.128 = 0.236

The posterior probability f(x/y = 1) is then given:
• for x = 0.1: 0.108/0.236 = 0.458
• for x = 0.2: 0.128/0.236 = 0.542

Finally, the Bayes' estimate of the proportion of defectives x is:

    x̄ = (0.1)(0.458) + (0.2)(0.542) = 0.1542

which is quite different from the value of 0.5 given by the classical method. ■
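The computations of Example 2.5.3 can be checked with a few lines of Python (a sketch; variable names are ours):

```python
from math import comb

priors = {0.1: 0.6, 0.2: 0.4}   # prior PDF of the proportion x (Table 2.7)
n, y = 2, 1                     # sample of 2 piles, one defective observed

# binomial likelihood f(y/x) for each candidate proportion x
lik = {x: comb(n, y) * x ** y * (1 - x) ** (n - y) for x in priors}
total = sum(lik[x] * priors[x] for x in priors)          # f(y=1) = 0.236
post = {x: lik[x] * priors[x] / total for x in priors}   # 0.458 and 0.542
estimate = sum(x * post[x] for x in priors)              # Bayes' estimate
print(round(total, 3), round(estimate, 4))               # 0.236 0.1542
```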

Example 2.5.4 (From Ang and Tang (2007), by permission of John Wiley and Sons.) Using the Bayesian approach to enhance the value of concrete pile testing
Concrete piles driven in the ground are used to provide bearing strength to the foundation of a structure (building, bridge, etc.). Hundreds of such piles are used in large construction projects. These piles should not develop defects such as cracks or voids in the concrete, which would lower compressive strength. Tests are performed by engineers on piles selected at random during the concrete pour process in order to assess overall foundation strength. Let the random discrete variable be the proportion of defective piles out of the entire lot, which is taken to assume five discrete values as shown in the first column of Table 2.8. Consider the case where the prior experience of an engineer as to the proportion of defective piles from similar sites is given in the second column of the table. Before any testing is done, the expected value of the probability of finding one pile to be defective is: p = (0.2)(0.30) + (0.4)(0.40) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05) = 0.44 (as shown in the last row under the second column). This is the prior probability. Had he drawn a conclusion from just a single test that turned out to be defective, without using his prior judgment, he would have concluded that all the piles were defective; clearly, an overstatement. Suppose the first pile tested is found to be defective. How should the engineer revise his prior probability of the proportion of piles likely to be defective? Bayes' theorem (Eq. 2.51) can be used. For proportion x = 0.2, the posterior conditional probability is:


    p(x = 0.2) = (0.2)(0.3) / [(0.2)(0.3) + (0.4)(0.4) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05)]
               = 0.06/0.44 = 0.136

This is the value that appears in the first row under the third column. Similarly, the posterior probabilities for different values of x can be determined, which yield the expected value E(x) = 0.55. Hence, a single inspection has led to the engineer revising his prior opinion upward from 0.44 to 0.55. The engineer would probably get a second pile tested, and if it also turns out to be defective, the associated probabilities are shown in the fourth column of Table 2.8. For example, for x = 0.2:

    p(x = 0.2) = (0.2)(0.136) / [(0.2)(0.136) + (0.4)(0.364) + (0.6)(0.204) + (0.8)(0.182) + (1.0)(0.114)]
               = 0.0272/0.55 = 0.049

The expected value in case the two piles tested turn out to be defective increases to 0.66. In the limit, if each successive pile tested turns out to be defective, one gets back the classical distribution, listed in the last column of the table.

Table 2.8 Illustration of how a prior PDF is revised with new data (Example 2.5.4)

    Proportion of        Probability of being defective
    defectives (x)       Prior PDF       After one pile       After two piles      ...   Limiting case of
                         of defectives   tested is found      tested are found           infinite defectives
                                         defective            defective
    0.2                  0.30            0.136                0.049                ...   0.0
    0.4                  0.40            0.364                0.262                ...   0.0
    0.6                  0.15            0.204                0.221                ...   0.0
    0.8                  0.10            0.182                0.262                ...   0.0
    1.0                  0.05            0.114                0.205                ...   1.0
    Expected probability 0.44            0.55                 0.66                 ...   1.0
    of defective piles

The progression of the PDF from the prior to the infinite case is illustrated in Fig. 2.31. Note that as more piles tested turn out to be defective, the evidence from the data gradually overwhelms the prior judgment of the engineer.

Fig. 2.31 Illustration of how the prior discrete PDF is affected by data collection following Bayes' theorem (Example 2.5.4)

However, it is only when collecting data is so expensive or time consuming that decisions must be made from limited data that the power of the Bayesian approach becomes evident. Of course, if one engineer's prior judgment is worse than that of another engineer, then his conclusion from the same data will be poorer than the other engineer's. It is this type of subjective disparity that antagonists of the Bayesian approach are uncomfortable with. On the other hand, proponents of the Bayesian approach would argue that experience (even if intangible) gained in the field is a critical asset in engineering applications and that discarding this type of heuristic knowledge entirely is naïve and short-sighted. ■

There are instances when no previous knowledge or information is available about the behavior of the random variable; this is sometimes referred to as a prior of pure ignorance. It can be shown that this assumption of the prior leads to results identical to those of the traditional probability approach (see Example 2.5.5).
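The successive columns of Table 2.8 can be reproduced by repeated application of Bayes' theorem, where the likelihood of observing a defective pile is simply x. A minimal sketch (names are ours):

```python
xs = [0.2, 0.4, 0.6, 0.8, 1.0]          # candidate proportions of defectives
pdf = [0.30, 0.40, 0.15, 0.10, 0.05]    # engineer's prior (Table 2.8)

def update_on_defective(pdf, xs):
    """One Bayes update: the likelihood of a defective pile is x itself."""
    joint = [x * p for x, p in zip(xs, pdf)]
    total = sum(joint)
    return [j / total for j in joint]

expected = [sum(x * p for x, p in zip(xs, pdf))]   # prior expectation 0.44
for _ in range(2):                                 # two defective piles tested
    pdf = update_on_defective(pdf, xs)
    expected.append(sum(x * p for x, p in zip(xs, pdf)))

print([round(e, 2) for e in expected])   # [0.44, 0.55, 0.66]
```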

2.5.3 Application to Continuous Probability Variables

The Bayes' theorem can also be extended to the case of continuous random variables (Ang and Tang 2007). Let X be the random variable with a prior PDF denoted by p(x). Though any appropriate distribution can be chosen, the Beta distribution is particularly convenient6 and is widely used to characterize the prior PDF. Another commonly used prior is the uniform distribution, called a diffuse prior. For consistency with convention, a slightly different nomenclature than that of Eq. 2.51 is adopted. Assume that the Beta distribution (Eq. 2.49a) can be rewritten to yield the prior:

    p(x) ∝ x^a (1 − x)^b      (2.55)

Recall that the higher the values of the exponents a and b, the peakier the distribution, indicative of the prior distribution being relatively well defined. Let L(x) represent the conditional probability or likelihood function of observing y "successes" out of n observations. Then, the posterior probability is given by:

    f(x/y) ∝ L(x) · p(x)      (2.56)

In the context of Fig. 2.27, the likelihood of the unobservable events B1. . .Bn is the conditional probability that A has occurred given Bi for i = 1, . . ., n, or p(A/Bi). The likelihood function can be gleaned from probability considerations in many cases. Consider Example 2.5.4 involving testing the foundation piles of buildings. The Binomial distribution gives the probability of x failures in n independent Bernoulli trials, provided the trials are independent and the probability of failure in any one trial is p. This applies to the case when one holds p constant and studies the behavior of the PDF of defectives x. If, instead, one holds x constant and lets p(x) vary over its possible values, one gets the likelihood function. Suppose n piles are tested and y piles are found to be defective or sub-par. In this case, the likelihood function is written as follows for the Binomial PDF:

    L(x) = C(n, y) x^y (1 − x)^(n−y),   0 ≤ x ≤ 1      (2.57)

6 Because of the corresponding mathematical simplicity that it provides as well as the ability to capture a wide variety of PDF shapes.

Notice that the Beta distribution has the same form as the likelihood function. Consequently, the posterior distribution given by Eq. 2.56 assumes the form:

    f(x/y) = k · x^(a+y) (1 − x)^(b+n−y)      (2.58)

where k is independent of x and is a normalization constant introduced to satisfy the probability law that the area under the PDF is unity. What is interesting is that the information contained in the prior has the net result of "artificially" augmenting the number of observations taken. While the classical approach would use the likelihood function with exponents y and (n − y) (see Eq. 2.57), these are inflated to (a + y) and (b + n − y) in Eq. 2.58 for the posterior distribution. This is akin to having taken more observations and supports the previous statement that the Bayesian approach is particularly advantageous when the number of observations is low. The examples below illustrate the use of Eq. 2.58.

Example 2.5.5 Let us consider the same situation as that treated in Example 2.5.4 for the concrete pile testing situation. However, the proportion of defectives X is now a continuous random variable for which no prior distribution can be assigned. This implies that the engineer has no prior information, and, in such cases, a uniform distribution (or a diffuse prior) is assumed:

    p(x) = 1.0   for 0 ≤ x ≤ 1

The likelihood function for the case of the single tested pile turning out to be defective is x, i.e., L(x) = x. From Eq. 2.58, the posterior distribution is then:

    f(x/y) = k · x · (1.0)

The normalizing constant is:

    k = [∫₀¹ x dx]⁻¹ = 2

Hence, the posterior probability distribution is:

    f(x/y) = 2x   for 0 ≤ x ≤ 1

The Bayesian estimate of the proportion of defectives, when one pile is tested and it turns out to be defective, is then:

    p = E(x/y) = ∫₀¹ x · 2x dx = 0.667

Fig. 2.32 Probability distributions of the prior, likelihood function, and the posterior (Example 2.5.5). (From Ang and Tang 2007, by permission of John Wiley and Sons)

■

Example 2.5.6 (From Ang and Tang (2007), by permission of John Wiley and Sons.) Enhancing historical records of wind velocity using the Bayesian approach
Buildings are designed to withstand a maximum wind speed, which depends on the location. The probability x that the wind speed will not exceed 120 km/h more than once in 5 years is to be determined. Past records of wind speeds of a nearby location indicated that the following Beta distribution would be an acceptable prior for the probability distribution (Eq. 2.49a):

    p(x) = 20x³(1 − x)   for 0 ≤ x ≤ 1

Further, the likelihood that the annual maximum wind speed will exceed 120 km/h in 1 out of 5 years is given by the Binomial distribution as:

    L(x) = C(5, 4) x⁴(1 − x) = 5x⁴(1 − x)

Hence, the posterior probability is deduced following Eq. 2.58:

    f(x/y) = k · 5x⁴(1 − x) · 20x³(1 − x) = 100k · x⁷ · (1 − x)²

where the constant k can be found from the normalization criterion:

    k = [∫₀¹ 100 x⁷(1 − x)² dx]⁻¹ = 3.6

Finally, the posterior PDF is given by:

    f(x/y) = 360 x⁷(1 − x)²   for 0 ≤ x ≤ 1

Plots of the prior, likelihood, and the posterior functions are shown in Fig. 2.32. Notice how the posterior distribution has become more peaked, reflective of the fact that the single test data point has provided the analyst with more information than that contained in either the prior or the likelihood function. ■
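The normalization in Example 2.5.6 can be verified with the exact result ∫₀¹ x^m (1 − x)^n dx = m! n!/(m + n + 1)!. The sketch below (names are ours) also computes the posterior mean, a quantity not evaluated in the text:

```python
from math import factorial

def beta_integral(m, n):
    """Exact integral of x**m * (1 - x)**n over [0, 1] = m! n! / (m + n + 1)!."""
    return factorial(m) * factorial(n) / factorial(m + n + 1)

# Prior p(x) = 20 x^3 (1-x) has exponents a = 3, b = 1; the data (4 of 5 years
# below 120 km/h) give y = 4, n = 5, so Eq. 2.58 yields exponents 7 and 2.
a_post, b_post = 3 + 4, 1 + 5 - 4
k = 1 / (100 * beta_integral(a_post, b_post))
print(round(k, 6))        # 3.6, so the posterior is 360 x^7 (1-x)^2

# posterior mean of x (probability of not exceeding 120 km/h)
post_mean = 360 * beta_integral(a_post + 1, b_post)
print(round(post_mean, 3))   # 8/11 = 0.727
```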

2.6 Three Kinds of Probabilities

The previous sections in this chapter presented basic notions of classical probability and how the Bayesian viewpoint is appropriate for certain types of problems. Both these viewpoints are still associated with the concept of probability as the relative frequency of an occurrence. At a broader context, one should distinguish between three kinds of probabilities: (i) Objective or absolute probability, which is the classical measure interpreted as the “long-run frequency” of the outcome of an event. It is an informed estimate of an event that in its simplest form is a constant; for example, historical records yield the probability of flood occurring this year or of the infant mortality rate in the United States. It would be unchanged for all individuals since it is empirical having been deduced from historic records. Table 2.9 assembles probability estimates for the occurrence of natural disasters with 10 and 1000 fatalities per event (indicative of the severity level) during different time spans (1, 10, and 20 years). Note that floods and tornados have relatively small return times for small events while

2.6 Three Kinds of Probabilities

65

Table 2.9 Estimates of absolute probabilities for different natural disasters in the United States. (Adapted from Barton and Nishenko 2008) Exposure times Disaster Earthquakes Hurricanes Floods Tornadoes

10 fatalities per event 1 year 10 years 0.11 0.67 0.39 0.99 0.86 >0.99 0.96 >0.99

20 years 0.89 >0.99 >0.99 >0.99

Return time (years) 9 2 0.5 0.3

1000 fatalities per event 1 year 10 years 0.01 0.14 0.06 0.46 0.004 0.04 0.006 0.06

20 years 0.26 0.71 0.08 0.11

Return time (years) 67 16 250 167

Table 2.10 Leading causes of death in the United States, 1992. (Adapted from Kolluru et al. 1996) Cause Cardiovascular or heart disease Cancer (malignant neoplasms) Cerebrovascular diseases (strokes) Pulmonary disease (bronchitis, asthma, etc.) Pneumonia and influenza Diabetes mellitus Non-motor vehicle accidents Motor vehicle accidents HIV/AIDS Suicides Homicides All other causes Total annual deaths (rounded)

earthquakes and hurricanes have relatively short times for large events. Such probability considerations can be determined at a finer geographical scale, and these play a key role in the development of codes and standards for designing large infrastructures (such as dams) as well as small systems (such as residential buildings). Note that the probabilities do not add up to one since one cannot define the entire population of possible events. (ii) Relative probability, where the chance of occurrence of one event is stated in terms of another. This is a way of comparing the effect or outcomes of different types of adverse events happening on a system or on a population when the absolute probabilities are difficult to quantify. For example, the relative risk for lung cancer is (approximately) 10 if a person has smoked before, compared to a non-smoker. This means that he is 10 times more likely to get lung cancer than a non-smoker. Table 2.10 shows leading causes of death in the United States in the year 1992. Here the observed values of the individual number of deaths due to various causes are used to determine a relative risk expressed as percent (%) in the last column. Thus, heart disease, which accounts for 33% of the total deaths, is 16 times riskier than motor vehicle deaths. However, as a note of caution, these are values aggregated across the whole population and during a specific time interval and need to be interpreted accordingly. State and

Annual deaths (× 1000) 720 521 144 91 76 50 48 42 34 30 27 394 2177

Percent (%) 33 24 7 4 3 2 2 2 1.6 1.4 1.2 18 100

government analysts separate such relative risks by location, age groups, gender, and race for public policy-making purposes. (iii) Subjective probability, which differs from one person to another, is an informed or best guess about an event that can change as our knowledge of the event increases. Subjective probabilities are those where the objective view of probability has been modified to treat two types of events: (i) when the occurrence is unique and is unlikely to repeat itself, or (ii) when an event has occurred but one is unsure of the final outcome. In such cases, one has still to assign some measure of likelihood of the event occurring and use this in their analysis. Thus, a subjective interpretation is adopted with the probability representing a degree of belief of the outcome selected as having actually occurred (which could be based on a scientific analysis subject to different assumptions or even by “gut-feeling”). There are no “correct answers,” simply a measure reflective of one’s subjective judgment. A good example of such subjective probability is one involving forecasting the probability of whether the impacts on gross world product of a 3°C global climate change by 2090 would be large or not. A survey was conducted involving 20 leading researchers working on global warming issues but with different technical backgrounds, such as scientists,

66

2 Probability Concepts and Probability Distributions

Fig. 2.33 Example illustrating large differences in subjective probability. A group of prominent economists, ecologists, and natural scientists were polled so as to get their estimates of the loss of gross world product due to doubling of atmospheric carbon dioxide (which is likely to occur by the end of the twenty-first century when mean global temperatures increase by 3°C). The two ecologists predicted the highest adverse impact while the lowest four individuals were economists. (From Nordhaus 1994)

engineers, economists, ecologists, and politicians, who were asked to assign a probability estimate (along with 10% and 90% confidence intervals). Though this was not a scientific study as such, since the whole area of expert opinion elicitation is still not fully mature, there was nevertheless a protocol in how the questioning was performed, which led to the results shown in Fig. 2.33. The medians and the 10% and 90% confidence intervals predicted by the different respondents show great scatter, with the ecologists estimating impacts to be 20–30 times higher (the two right-most bars in the figure), while the economists on average predicted only about a 0.4% loss in gross world product even under large consequences. An engineer or a scientist may be uncomfortable with such subjective probabilities, but there are certain types of problems where this is the best one can do with current knowledge. Thus, formal analysis methods must accommodate such information, and it is here that Bayesian techniques can play a key role.

Problems

Pr. 2.1 Three sets of integers between 1 and 12 are defined as: A = {1,3,5,6,8,10}, B = {4,5,7,8,11}, and C = {2,9,12}.

(a) Represent these sets in a Venn diagram.
(b) What are A ∪ B, A ∩ B, A ∪ C, A ∩ C?
(c) What are A′ ∪ B′, A′ ∩ B′, A′ ∪ C′, A′ ∩ C′ (where the prime denotes the complement)?
(d) What are AB, A − B, A + B?

Pr. 2.2 A county office determined that of the 1000 homes in their area, 400 were older than 20 years (event A), 500 were constructed of wood (event B), and 400 had central air-conditioning (AC) (event C). Further, it is found that events A and B occur in 300 homes, that events A or C occur in 625 homes, that all three events occur in 150 homes, and that no event occurs in 225 homes. If a single house is picked, determine the following probabilities (also draw the Venn diagrams):

(a) That it is older than 20 years and has central AC.
(b) That it is older than 20 years and does not have central AC.
(c) That it is older than 20 years and is not made of wood.
(d) That it has central AC and is made of wood.

Pr. 2.3 A university researcher has submitted three research proposals to three different agencies. Let E1, E2, and E3 be the outcomes that the first, second, and third bids are successful, with probabilities p(E1) = 0.15, p(E2) = 0.20, and p(E3) = 0.10. Assuming independence, find the following probabilities using set theory:

(a) That all three bids are successful.
(b) That at least two bids are successful.
(c) That at least one bid is successful.

Verify the above results using the probability tree approach.
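A short sketch (not part of the original problem set) for self-checking the independence calculations of Pr. 2.3 by enumerating all 2^3 success/failure outcomes, with the success probabilities as given in the problem:

```python
from itertools import product

p = [0.15, 0.20, 0.10]  # success probabilities of the three bids

# Enumerate all 2^3 success/failure outcomes; under independence each
# outcome's probability is the product of the per-bid probabilities.
prob_by_count = {k: 0.0 for k in range(4)}
for outcome in product([0, 1], repeat=3):
    pr = 1.0
    for success, pi in zip(outcome, p):
        pr *= pi if success else (1 - pi)
    prob_by_count[sum(outcome)] += pr

p_all = prob_by_count[3]                            # (a)
p_at_least_two = prob_by_count[2] + prob_by_count[3]  # (b)
p_at_least_one = 1 - prob_by_count[0]               # (c)
print(p_all, p_at_least_two, p_at_least_one)
```

The enumeration mirrors the probability tree approach mentioned in the problem, with each branch weighted by its product of per-bid probabilities.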


Fig. 2.34 Components in parallel and in series (Problem 2.4)

Pr. 2.4 Example 2.2.3 illustrated how to compute the reliability of a system made up of two components A and B. As an extension, it is insightful to determine whether system reliability is better enhanced by (i) duplicating the whole system in parallel, or (ii) duplicating the individual components in parallel. Consider a system made up of two components A and B. Figure 2.34a represents case (i), while Fig. 2.34b represents case (ii).

(a) If p(A) = 0.1 and p(B) = 0.8 are the failure probabilities of the two components, what are the probabilities of the system functioning properly for both configurations?
(b) Prove, without assuming any specific numerical values, that the functional probability of configuration (ii) is greater than that of configuration (i). Derive the algebraic expressions for proper system functioning for both configurations from which this result can be deduced.

Pr. 2.5 Consider the two system schematics shown in Fig. 2.35. At least one pump must operate when one chiller is operational, and both pumps must operate when both chillers are on. Assume that both chillers have identical reliabilities of 0.90 and that both pumps have identical reliabilities of 0.95.

(a) Without any computation, make an educated guess as to which system would be more reliable overall when (i) one chiller operates, and (ii) both chillers operate.
(b) Compute the overall system reliability for each configuration separately under cases (i) and (ii) defined above.

Pr. 2.6⁸ An automatic sprinkler system for a high-rise apartment has two different types of activation devices for each sprinkler

8. From McClave and Benson (1988) with permission of Pearson Education.

Fig. 2.35 Two possible system configurations (for Pr. 2.5)
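The reliability comparison posed in Pr. 2.4 can be checked numerically. This is a minimal sketch (not from the original text), assuming case (i) is the whole-system duplication and case (ii) is the component-level duplication:

```python
# Failure probabilities given in Pr. 2.4; component reliabilities are r = 1 - q
qA, qB = 0.1, 0.8
rA, rB = 1 - qA, 1 - qB

# Case (i): the whole series system A-B is duplicated in parallel.
# The duplicated system works if at least one of the two A-B branches works.
r_system_dup = 1 - (1 - rA * rB) ** 2

# Case (ii): each component is duplicated in parallel; the pairs are in series.
# Each pair works unless both copies of that component fail.
r_component_dup = (1 - qA ** 2) * (1 - qB ** 2)

print(r_system_dup, r_component_dup)
```

Algebraically, r_component_dup - r_system_dup = 2·rA·rB·(1 - rA)·(1 - rB) ≥ 0, which is the identity behind the proof asked for in part (b).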

head. Reliability of such devices is a measure of the probability of success, i.e., that the device will activate when called upon to do so. Type A and Type B devices have reliability values of 0.90 and 0.85, respectively. In case a fire does start, calculate:

(a) The probability that the sprinkler head will be activated (i.e., at least one of the devices works).
(b) The probability that the sprinkler will not be activated at all.
(c) The probability that both activation devices will work properly.
(d) Verify the above results using the probability tree approach.

Pr. 2.7 Consider the following probability distribution of a random variable X:

f(x) = (1 + b)x^b   for 0 ≤ x ≤ 1
     = 0            elsewhere

Use the method of moments:
(a) To find the estimate of the parameter b.
(b) To find the expected value of X.
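For Pr. 2.7, the first moment of f(x) = (1 + b)x^b on [0, 1] is E[X] = (1 + b)/(2 + b), so equating it to a sample mean m gives the moment estimator b_hat = (2m − 1)/(1 − m). The following sketch (the quadrature step count is an arbitrary choice) verifies the moment formula numerically:

```python
def expected_value(b, n=100_000):
    # Midpoint quadrature of E[X] = integral of x * (1+b) * x**b over [0, 1]
    h = 1.0 / n
    return sum((i + 0.5) * h * (1 + b) * ((i + 0.5) * h) ** b
               for i in range(n)) * h

def b_from_mean(m):
    # Method-of-moments estimator obtained by solving (1+b)/(2+b) = m for b
    return (2 * m - 1) / (1 - m)

b = 1.5
m = expected_value(b)   # numerically close to (1 + b)/(2 + b)
print(m, b_from_mean(m))
```

Running the sketch shows that inverting the analytical first moment recovers the value of b used to generate it.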


Pr. 2.8 Consider the following cumulative distribution function (CDF):

F(x) = 1 − exp(−2x)   for x > 0
     = 0              for x ≤ 0

(a) Construct and plot the cumulative distribution function.
(b) What is the probability of X < 2?
(c) What is the probability of 3 < X < 5?

Pr. 2.9 The joint density for the random variables (X, Y) is given by:

f(x, y) = 10xy²   for 0 < x < y < 1
        = 0       elsewhere

(Rule of thumb: values of abs(r) close to 0 indicate weak linear correlation.)
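Parts (b) and (c) of Pr. 2.8 reduce to evaluating the CDF, since P(X < 2) = F(2) and P(3 < X < 5) = F(5) − F(3). A minimal sketch:

```python
from math import exp

def F(x):
    # CDF of Pr. 2.8: F(x) = 1 - exp(-2x) for x > 0, else 0
    return 1 - exp(-2 * x) if x > 0 else 0.0

p_lt_2 = F(2)             # P(X < 2)
p_3_to_5 = F(5) - F(3)    # P(3 < X < 5)
print(p_lt_2, p_3_to_5)
```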

2. A more statistically sound procedure is described in Sect. 4.2.7, which allows one to ascertain whether observed correlation coefficients are significant or not depending on the number of data points.

It is very important to note that inferring non-association of two variables x and y from their correlation coefficient is misleading since it only indicates the strength of a linear relationship. Hence, a poor correlation does not mean that no relationship exists between them (e.g., a second-order relation may exist between x and y; see Fig. 3.13f). Detection of non-linear correlations is addressed in Sect. 9.5.1. Note also that correlation analysis does not indicate whether the relationship is causal, i.e., whether the variation in the y-variable is directly caused by that in the x-variable. Finally, keep in mind that correlation analysis does not provide an equation for predicting the value of a variable; this is done under model building (see Chaps. 5 and 9).

Example 3.4.2 The following observations are taken of the extension of a spring under different loads (Table 3.4). Using Eq. 3.8, the standard deviations of load and extension are 3.7417 and 18.2978, respectively, while the correlation coefficient r = 0.9979 (following Eq. 3.12). This indicates a very strong positive linear correlation between the two variables, as one would expect. ■
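The summary statistics quoted in Example 3.4.2 can be reproduced directly from the standard definitions of the sample standard deviation and the Pearson correlation coefficient (a sketch of the textbook formulas, not code from the original text):

```python
from math import sqrt

load = [2, 4, 6, 8, 10, 12]                       # Newtons (Table 3.4)
ext = [10.4, 19.6, 29.9, 42.2, 49.2, 58.5]        # mm (Table 3.4)

n = len(load)
mx, my = sum(load) / n, sum(ext) / n
# Sample standard deviations (divisor n - 1)
sx = sqrt(sum((x - mx) ** 2 for x in load) / (n - 1))
sy = sqrt(sum((y - my) ** 2 for y in ext) / (n - 1))
# Sample covariance and Pearson correlation coefficient
cov = sum((x - mx) * (y - my) for x, y in zip(load, ext)) / (n - 1)
r = cov / (sx * sy)
print(sx, sy, r)
```

The printed values match those quoted in the example (3.7417, 18.2978, and 0.9979).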

3.5 Exploratory Data Analysis (EDA)

3.5.1 What Is EDA?

EDA is an analysis process (some use the phrase “attitude toward data analysis”) that has been developed and championed by John Tukey (1970) and considerably expanded and popularized by subsequent statisticians (e.g., Hoaglin et al. 1983). Rather than directly proceeding to perform the then-traditional confirmatory analysis (such as hypothesis testing) or stochastic model building (such as regression), Tukey suggested that the analyst should start by “looking at the data and see what it seems to say.” Such


3 Data Collection and Preliminary Analysis

Fig. 3.13 Illustration of various plots with different correlation strengths. (a) Moderate linear positive correlation. (b) Perfect linear positive correlation. (c) Moderate linear negative correlation. (d) Perfect linear negative correlation. (e) No correlation at all. (f) Correlation exists but it is not linear. (From Wonnacott and Wonnacott 1985, by permission of John Wiley and Sons)

Table 3.4 Extension of a spring with applied load (Example 3.4.2)

Load (Newtons)    2     4     6     8     10    12
Extension (mm)    10.4  19.6  29.9  42.2  49.2  58.5

an investigation of data visualization and exploration, it was argued, will be more effective since it would indicate otherwise hidden/unexpected behavior in the dataset, allow evaluating the assumptions in data behavior presumed by traditional statistical analyses, and provide better guidance in the selection and use of new/appropriate statistical techniques. The timely advances in computer technology and software sophistication allowed new graphical techniques to be developed along with ways to transform/normalize variables to correct for unwanted shape/spread of their distributions. In short, EDA can be summarized as a process of data guiding the analysis! At first glance, EDA techniques may seem to be ad hoc and not follow any unifying structure; however, the interested reader can refer

to Hoaglin et al. (1983) for a rationale for developing the EDA techniques and for an explanation and illustration of the connections between EDA and classical statistical theory. Heiberger and Holland (2015) provide in-depth coverage and illustrate how different types of graphical data displays can be used to aid in exploratory data analysis and enhance various types of statistical computation and analysis, such as inference, hypothesis testing, regression, time-series, and experimental design (topics that are also covered in different chapters of this book). The book also provides software code in the open-source R statistical environment for generating such visual displays (https://github.com/henze-research-group/adam/).


3.5.2 Purpose of Data Visualization

Data visualization is done with graphs and serves two purposes. First, during exploration of the data, it provides a better means of assimilating broad qualitative trends in the data than tabular data can. Second, it provides an excellent means of communicating to the reader what the author wishes to state or illustrate (recall the adage "a picture is worth a thousand words"). Hence, data visualization can serve as a medium to communicate information, not just to explore data trends (an excellent reference is Tufte 2001). However, it is important to be clear as to the intended message or purpose of the graph, and also to tailor it to the intended audience's background and understanding. A pretty graph may be visually appealing but may obfuscate rather than clarify or highlight the aspects being communicated. For example, unless one is experienced, it is difficult to read numerical values off 3-D graphs. Thus, graphs should present data clearly and accurately without hiding or distorting the underlying intent. Table 3.5 provides a succinct summary of graph formats appropriate for different applications. Graphical methods are usually more insightful than numerical screening in identifying data errors in smaller datasets with few variables. For large datasets with several variables, they become onerous if basic software, such as the ubiquitous spreadsheet, is used. Additionally, a strength of graphical analysis during EDA is to visually point out to the analyst relationships (linear or non-linear) between two or more variables in instances when a sound physical understanding is lacking, thereby aiding in the selection of an appropriate regression model. Present-day graphical visualization tools allow much more than this simple objective,


some of which will become apparent below. There is a very large number of graphical ways of presenting data, and it is impossible to cover them all. Only a small set of representative and commonly used plots will be discussed below, while other types of plots will be presented in later chapters as relevant. Nowadays, several high-end graphical software programs allow complex, and sometimes esoteric, plots to be generated. Graphical representations of data are the backbone of exploratory data analysis. They are usually limited to one-, two-, and three-dimensional (1-D, 2-D, and 3-D) data. In the last few decades, there has been a dramatic increase in the types of graphical displays, largely due to the seminal contributions of Tukey (1970), Cleveland (1985), and Tufte (1990, 2001). A particular graph is selected based on its ability to emphasize certain characteristics or behavior of one-dimensional data, or to indicate relations between two- and three-dimensional data. A simple manner of separating these characteristics is to view them as being:

(i) Cross-sectional (i.e., data collected at one point in time or when time is not a factor)
(ii) Time-series data
(iii) Hybrid or combined
(iv) Relational (i.e., emphasizing the joint variation of two or more variables)

An emphasis on visualizing data to be analyzed has resulted in statistical software programs becoming increasingly convenient to use and powerful. Any data analysis effort involving univariate and bivariate data should start by looking at basic plots (higher-dimension data require more elaborate plots discussed later).

Table 3.5 Type and function of appropriate graph formats

Type of message    Function                                                   Typical format
Component          Shows relative size of various parts of a whole            Pie chart (for one or two important components); Bar chart; Dot chart
Relative amounts   Ranks items according to size, impact, degree, etc.        Line chart; Bar chart
Time series        Shows variation over time                                  Line chart; Dot chart; Bar chart (for few intervals)
Frequency          Shows frequency of distribution among certain intervals    Line chart; Histogram
Correlation        Shows how changes in one set of data are related to another set of data    Line chart; Box-and-Whisker; Paired bar; Scatter diagram

Downloaded from the Energy Information Agency (EIA) website in 2009, which was since removed (http://www.eia.doe.gov/neic/graphs/introduc.htm)


3.5.3 Static Univariate Graphical Plots

Fig. 3.14 Box and whisker plot and its association with a standard normal distribution. The box represents the 50th percentile range while the whiskers extend 1.5 times the inter-quartile range (IQR) on either side. (From Wikipedia website)

Commonly used graphics for cross-sectional representation are mean and standard deviation plots, stem-and-leaf diagrams, dot plots, histograms, box-whisker-mean plots, distribution plots, bar charts, pie charts, area charts, and quantile plots. Mean and standard deviation plots summarize the data distribution using the two most basic measures; however, this summary is of limited use (and even misleading) when the distribution is skewed. (a) Histograms For univariate discrete or continuous data, plotting of histograms is very straightforward while providing a compact visual representation of the spread and shape (such as unimodal or bimodal) of the relative frequency distribution. There are no hard and fast rules regarding how to select the number of bins (Nbins) or classes in case of continuous data, probably because there is no proper theoretical basis. The shape of the underlying distribution is better captured with a larger number of bins, but then each bin will contain fewer observations and may exhibit jagged behavior (see Fig. 1.3 for the variation of outdoor air temperature in Philadelphia). Generally, the larger the number of observations n, the more


classes can be used, though as a guide it should be between 5 and 20. Devore and Farnum (2005) suggest:

Number of bins or classes = Nbins = (n)^(1/2)   (3.14a)

which would suggest that if n = 100, Nbins = 10. Doebelin (1995) proposes another equation:

Nbins = 1.87 · (n − 1)^(0.4)   (3.14b)
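The two bin-count heuristics, Eqs. 3.14a and 3.14b, can be sketched as follows (rounding to the nearest integer is an assumption, since neither source specifies a rounding convention):

```python
def bins_devore_farnum(n):
    # Eq. 3.14a: Nbins = n^(1/2), rounded to the nearest integer
    return round(n ** 0.5)

def bins_doebelin(n):
    # Eq. 3.14b: Nbins = 1.87 * (n - 1)^0.4, rounded to the nearest integer
    return round(1.87 * (n - 1) ** 0.4)

print(bins_devore_farnum(100), bins_doebelin(100))
```

For n = 100 these return 10 and 12 bins, respectively, matching the worked values in the text.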

which would suggest that if n = 100, Nbins = 12. (b) Box-and-whisker plots The box-and-whisker plots better summarize the distribution, but this is done using percentiles. Figure 3.14 depicts its shape for univariate continuous data, identifies various important quantities, and illustrates how these can be associated with the Gaussian or standard normal distribution. The lower and upper box values Q1 and Q3 (called hinges) correspond to the 25th and 75th percentiles (recall the interquartile range [IQR] defined in Sect. 3.4.1) while the whiskers commonly extend to 1.5 times the IQR on either side. From Fig. 3.14, it is obvious that the IQR would contain 50% of the data observations while the range bounded by the

Table 3.6 Values of time taken (in minutes) for 20 students to complete an exam (see Example 3.5.1)

37.0  37.5  38.1  40.0  40.2  40.8  41.0
42.0  43.1  43.9  44.1  44.6  45.0  46.1
47.0  62.0  64.3  68.8  70.1  74.5

low and high whiskers for a normally distributed dataset would contain 99.3% of the total data (close to the 99.7% value corresponding to ± (3 × standard deviation)). Such a representation reveals the skewness in the data, and also indicates outlier points. Any observation between (1.5 × IQR) and (3.0 × IQR) from the closest quartile is considered to be a mild outlier, and shown as a closed or filled point, while one falling outside (3.0 × IQR) from the closest quartile is taken to be an extreme outlier and shown as an open circle point. However, if the outlier falls within the (1.5 × IQR) spread, then the whisker line should be terminated at this outlier point (Tukey 1970). (c) Q-Q plots Though plotting a box-and-whisker plot or a plot of the distribution itself can suggest the shape of the underlying distribution, a better visual manner of ascertaining whether a presumed or specified theoretical probability distribution is consistent with the dataset being analyzed is by means of quantile plots. Recall that quantiles are points dividing the range of a probability distribution (or any sample univariate data) into continuous intervals with equal probabilities; hence quartiles are a good way of segmenting a distribution. Further, percentiles are a subset of quantiles that divide the data into 100 equally sized groups. In essence, a quantile plot is one where the quantiles of the dataset are plotted as a cumulative distribution. A quantile–quantile or Q-Q plot is one where the quantiles of the data are plotted against the quantiles of a specified standardized theoretical distribution.3 How they align or deviate from the 45° reference line allows one to visually determine whether the two distributions are in agreement, and, if not, may indicate likely causes for this difference. The Q-Q plot can also be generated with quantiles from one sample data against those of another. 
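The mild/extreme outlier rules described above can be sketched as a small routine. The quartile convention is an assumption (different software compute the hinges slightly differently), and the sample data is hypothetical:

```python
from statistics import quantiles

def classify_outliers(data):
    # Hinges via the statistics module's default 'exclusive' quantile
    # method; other conventions give slightly different hinge values.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for x in data:
        d = max(q1 - x, x - q3, 0)   # distance outside the box (0 if inside)
        if d > 3.0 * iqr:
            extreme.append(x)        # beyond 3.0 x IQR: extreme outlier
        elif d > 1.5 * iqr:
            mild.append(x)           # between 1.5 and 3.0 x IQR: mild outlier
    return mild, extreme

sample = [2, 3, 4, 5, 5, 6, 7, 8, 30, 90]   # hypothetical data
print(classify_outliers(sample))
```

For the hypothetical sample, the value 30 falls in the mild band and 90 beyond three inter-quartile ranges from the upper hinge.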
A special case of the more general Q-Q plot is the normal probability plot where the comparison is done against the normal distribution plotted on the y-axis. There is some variation between different statistical software in the terminology adopted and how these plots are displayed (an issue to be kept in mind). Q-Q plots are better interpreted with larger datasets, but the following example is meant for illustrative purposes.

3. Another similar type of plot for comparing two distributions is the P-P plot, which is based on the cumulative probability distributions. However, the Q-Q plot, based on quantiles, is more widely used.

Fig. 3.15 Q-Q plot of the data in Table 3.6 against a standardized Gaussian distribution

Example 3.5.1 An instructor wishes to ascertain whether the time taken by her students to complete the final exam follows a normal or Gaussian distribution. The values in minutes shown in Table 3.6 have been recorded. The normal quantile plot for this dataset against a Gaussian (or standard normal) distribution is shown in Fig. 3.15. The pattern is obviously non-linear, and so a Gaussian distribution is improper for this data. The apparent break appearing in the data on the right side of the graph indicates the presence of outliers (in this case, caused by five students taking much longer to complete the exam). ■

Example 3.5.2 Consider the same dataset as for Example 3.4.1. The following plots have been generated (shown in Fig. 3.16):

(a) Box-and-whisker plot (note that the two whiskers are unequal, indicating a slight skew).
(b) Histogram of the data (assuming 9 bins) shown as relative frequencies; the y-axis values of the individual bars should sum to 100%.
(c) Normal probability plot, which allows evaluating whether the distribution is close to a normal distribution. The tails deviate from the straight line, indicating departure from a normal distribution. Note that the normal quantile is now plotted on the y-axis and rescaled compared to the Q-Q plot of Fig. 3.15. Different software programs generate such plots differently. One also comes across Q-Q plots where both axes represent quantiles, one from the normal distribution and one from the sample distribution.


Fig. 3.16 Various exploratory plots for the dataset in Table 3.2

Fig. 3.17 Common components of the box plot and the violin plot for an arbitrary continuous variable (shown on the y-axis)

(Fig. 3.17 annotations: outside points; upper adjacent value; third quartile; median; first quartile; lower adjacent value.)

(d) Run chart (or time-series plot), meant to retain the time-series nature of the data, which the other graphics do not. As generated, the run chart is meaningless since the data was entered into the spreadsheet in the wrong sequence (column-wise instead of row-wise). Had the data been entered correctly, the run chart would have resulted in a monotonically increasing curve and would have been more meaningful. ■
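The visual judgment in Example 3.5.1 can be complemented numerically: the correlation between the sorted data and the corresponding standard-normal quantiles is close to 1 for normally distributed data and falls well below 1 otherwise. A sketch, with the plotting-position convention (i − 0.5)/n as an assumption:

```python
from statistics import NormalDist, mean
from math import sqrt

# Exam completion times from Table 3.6
times = [37.0, 37.5, 38.1, 40.0, 40.2, 40.8, 41.0, 42.0, 43.1, 43.9,
         44.1, 44.6, 45.0, 46.1, 47.0, 62.0, 64.3, 68.8, 70.1, 74.5]

n = len(times)
s = sorted(times)
# Theoretical standard-normal quantiles at plotting positions (i + 0.5)/n
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Pearson correlation between sample and theoretical quantiles
mz, ms = mean(z), mean(s)
num = sum((a - mz) * (b - ms) for a, b in zip(z, s))
den = sqrt(sum((a - mz) ** 2 for a in z) * sum((b - ms) ** 2 for b in s))
r_qq = num / den
print(round(r_qq, 3))
```

The correlation comes out well below what normally distributed data of this size would give, consistent with the non-linear pattern seen in Fig. 3.15.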


(d) Violin plots Another visually appealing plot that conveys essentially the same information as the box plot (but more completely) is the violin plot (Fig. 3.17). It reveals the probability density of the data at different values (each half of the violin shows the same distribution; this convention has been adopted for symmetry and visual aesthetics). The plot is usually smoothed by a kernel density estimator. The dot in the middle of the box inside the violin is the median, and the first and third quartiles are drawn (the box represents inter-quartile range).


3.5.4 Static Bi- and Multivariate Graphical Plots

Fig. 3.18 Two different ways of plotting stationary data. Data corresponds to worldwide percentages of total primary energy supply in 2003. (a) Pie chart. (b) Bar chart. (From IEA, World Energy Outlook, IEA, Paris, France, 2004)

There are numerous graphical representations that fall in this category, and only an overview of the more common plots will be provided here. The box plot representation, discussed earlier, also allows a convenient visual comparison of the similarities and differences between the spread and shape of two or more datasets when multiple box plots are plotted side by side. (a) Pie Charts Multivariate stationary data of worldwide percentages of total primary energy sources can be represented by


the widely used pie chart (Fig. 3.18a), which allows the relative aggregate amounts of the variables to be clearly visualized. The same information can also be plotted as a bar chart (Fig. 3.18b), which is not quite as revealing. (b) Elaborate Bar Charts More elaborate bar charts (such as those shown in Fig. 3.19) allow numerical values of more than one variable to be plotted such that their absolute and relative amounts are clearly highlighted. The plots depict differences between the electricity sales during each of the four different quarters of the year over 6 years. Such plots can be drawn as compounded plots to allow better visual inter-comparisons (Fig. 3.19a). Column charts or stacked charts (Fig. 3.19b, c) show the same information


Fig. 3.19 Different types of bar and area plots to illustrate year-by-year variation (over 6 years) in quarterly electricity sales (in GigaWatt-hours) for a certain city

as that in Fig. 3.19a but are stacked one above another instead of showing the numerical values side by side. One plot shows the stacked values normalized such that the sum adds to 100%, while another stacks them so as to retain their numerical values. Finally, the same information can be plotted as an area chart (Fig. 3.19d) wherein both the time-series trend and the relative magnitudes are clearly highlighted. (c) Scatter and Time-Series Plots Time-series plots or relational plots or scatter plots (such as x–y plots) between two variables are the most widely used types of graphical displays. Scatter plots allow visual determination of the trend line between two variables and the extent to which the data scatter around the trend line (Fig. 3.20). An important issue is that the manner of selecting the range of the variables can be misleading to the eye. The same data is plotted in Fig. 3.21 on two different scales, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a). This is

referred to as the lie factor, defined as the ratio of the apparent size of effect in the graph to the actual size of effect in the data (Tufte 2001). The data at hand and the intent of the analysis should dictate the scale of the two axes, but it is difficult in practice to determine this heuristically.4 It is in such instances that statistical measures can be used to provide an indication of the magnitude of the graphical scales. (d) Bubble Plots Bubble plots allow observations with three attributes to be plotted. The 2-D version of such plots is the well-known x–y scatter plot. An additional variable is represented by enlarging the dot, i.e., the bubble size. Figure 3.22 is illustrative of such a representation for the commute patterns in major US cities in 2008.

4. Generally, it is wise, at least at the onset, to adopt scales starting from zero, view the resulting graphs, and then make adjustments to the scales as appropriate.


Fig. 3.20 Scatter plot (or x–y plot) of worldwide population growth over time showing past values and projected values with 2010 as the current year. In this case, a second-order quadratic regression model has been selected to plot the trend line. The actual population in 2021 was 7.84 billion (quite close to the projected value)

Fig. 3.21 Figure to illustrate how the effect of resolution can mislead visually. The same data is plotted in the two plots, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a)

Fig. 3.22 Bubble plot showing the commute patterns in major US cities in 2008. The size of the bubble represents the number of commuters. (From Wikipedia website, downloaded 2010)


Fig. 3.24 Scatter plot combined with box-whisker-mean (BWM) plot of the same data as shown in Fig. 3.10. (From Haberl and Abbas 1998, by permission of Haberl)

Fig. 3.23 Several types of combination charts are possible. The plots shown allow visual comparison of the standardized (subtracted by the mean and divided by the standard deviation) daily whole-house electricity use in 29 similar residences against the overlaid standard normal distribution. (From Reddy 1990)

(e) Combination Charts Combination charts can take numerous forms, but, in essence, are those where two basic but different ways of representing data are combined together in one graph. One example is Fig. 3.23 where the histogram depicts actual data spread, the distribution of which can be visually evaluated against the standard normal line overlaid on the histogram. The dataset corresponds to the daily energy use of 29 residences with similar diurnal energy use (classified as Stratum 5) during the summer and winter days. Occupant vagaries can be likened to randomness/noise in the electricity use data. Possible causes for the closeness or deviation from normality for each season can provide physical insights to the analyst and also allow different electricity curtailment measures to be modeled in a probabilistic framework. For purposes of data checking, x–y plots are perhaps most appropriate as discussed in Sect. 3.3.4. The x–y scatter plot

(Fig. 3.10) of hourly cooling energy use of a large institutional building versus outdoor temperature allowed outliers to be detected. The same data can be summarized by combined box-and-whisker plots (first suggested by Tukey 1988) as shown in Fig. 3.24. Here the x-axis range is subdivided into discrete bins (in this case, 5 °F bins), showing the median values (joined by a continuous line) along with the 25th and 75th percentiles (shown boxed), the 10th and 90th percentiles indicated by the vertical whiskers from the box, and the values less than the 10th percentile and greater than the 90th percentile shown as individual points.5 Such a representation is clearly a useful tool for data quality checking, for detecting underlying patterns in data at different sub-ranges of the independent variable, and also for ascertaining the shape of the data spread around this pattern. Note that the cooling energy use line seems to plateau at outside temperature values above 80 °F. What does this indicate about the installed capacity of the chiller? Probably that the chiller was undersized at the onset, has degraded over time, or that the building loads have increased over time. (f) Component-Effect Plots In case the functional relationship between the independent and dependent variables changes due to known causes, it is advisable to plot these in different frames. For example, hourly energy use in a commercial building is known to change with time of day, and, moreover, the functional relationship is quite different depending on the season (time of year). Component-effect plots are multiple plots between the variables for cold, mild, and hot

5. Note that the whisker end points are different from those described earlier in Sect. 3.5.3. Different textbooks and studies adopt slightly different selection criteria.


Fig. 3.25 Example of a combined box-whisker-component plot depicting how hourly energy use varies with hour of day during a year for different outdoor temperature bins for a large commercial building.

Fig. 3.26 Contour plot characterizing the sensitivity of total power consumption (condenser water pump power plus tower fan power) to condenser water-loop controls for a single chiller load, ambient wet-bulb temperature, and chilled water supply temperature. (From Braun et al. 1989, # American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

periods of the year, combined with different box-and-whisker plots for different hours of the day. They provide more clarity in underlying trends and scatter, as illustrated in Fig. 3.25, where the time of year is broken up into three temperature bins. (g) Contour Plots A contour plot depicts the relationship between the system response (or dependent variable) and two independent variables plotted on the two axes. This relationship is


(From ASHRAE 2002, # American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

captured by a series of lines drawn for different pre-selected values of the response variable. This is illustrated by Fig. 3.26, where the total power of the condenser loop of a cooling system is the sum of the pump power and the cooling tower fan power, both of which are functions of their operating speeds (since they can be modulated). The two axes of the figure are the normalized fan and pump speeds relative to their rated values. The minimum total power is shown by a cross at the center of the innermost contour circle; one notes that this minimum is quite broad, and different combinations of the two control variables are admissible. Such plots clearly indicate the degree of latitude allowable in the operating control settings of the pump and the fan, and reveal the non-linear sensitivity of total power as these control settings stray from the optimal point. Insights into how different combinations of the two independent variables impact total system power are useful to system operators. (h) Scatter Plot Matrix Figure 3.27, called a scatter plot matrix, is another useful representation for visualizing multivariate data. Here the various permutations of the variables are shown as individual scatter plots. The idea, though not novel, has merit because of the way the graphs are organized and presented. The graphs are arranged in rows and columns such that each row or column has all the graphs relating a certain variable to all the others; thus, the variables have shared axes. Though there are twice as many graphs as needed minimally (since each graph has another one with the axes interchanged), the redundancy is sometimes useful to the analyst in better detecting underlying trends.

Fig. 3.27 Scatter plot matrix or carpet plots for multivariable graphical data visualization. The data corresponds to hourly climatic data for Phoenix, AZ, for January 1990. The bottom left-hand corner frame indicates how solar radiation in Btu/hr-ft2 (x-axis) varies with dry-bulb temperature (in °F) and is a flipped and rotated image of that at the top right-hand corner. The HR variable represents humidity ratio (in lbm/lba). Points that fall distinctively outside the general scatter can be flagged as outliers

3 Data Collection and Preliminary Analysis


(i) 3-D Plots Three-dimensional (or 3-D) plots have been increasingly used over the past several decades. They allow one to plot the variation of a variable influenced by two independent factors. They also allow trends to be gauged and are visually appealing, but the numerical values of the variables are difficult to read. The data plotted in Fig. 3.28 corresponds to the mean hourly energy use of 29 residences with similar diurnal energy use (classified as Stratum 5). The day has been broken up into eight segments of 3 h each to reduce clutter in the graph. Such a graph can reveal probabilistic trends in how occupants consume electricity so that different demand curtailment measures can be evaluated by electric utilities in a probabilistic modeling framework. Another benefit of such 3-D plots is their ability to aid in the identification of oversights. For example, energy use data collected from a large commercial building could be improperly time-stamped, such as overlooking the daylight savings shift or misalignment of 24-h holiday profiles (Fig. 3.29). One drawback associated with these graphs is the difficulty of viewing exact details, such as the specific hour or specific day on which a misalignment occurs. Some analysts complain that 3-D surface plots obscure data that is behind “hills” or in “valleys.” Clever use of color or dotted lines has been suggested to make such graphs easier to interpret.

Fig. 3.28 Three-dimensional (3-D) surface chart of mean hourly whole-house electricity during different hourly segments of the day across several residences. (From Reddy 1990)

(j) Domain-Specific Charts Different disciplines have developed different types of plots and graphical representations. Tornado diagrams are commonly used to illustrate variable sensitivity during risk analysis, and spider plots are common in multicriteria decision-making studies (Chap. 12). One example in HVAC studies is the well-known psychrometric chart (Reddy et al. 2016), which allows one to determine (for a given location characterized by its elevation above sea level) the various properties of air–water mixtures, such as dry-bulb temperature, absolute humidity, relative humidity, specific volume, enthalpy, and wet-bulb temperature. Solar scientists and architects have developed the sun-path diagram, which allows one to determine the position of the sun in the sky (defined by the solar altitude and solar azimuth angles) at different times of the day and year for a given location (Fig. 3.30 shows the diagram for latitude 40° N). Such a representation has also been used to determine periods of the year when shading occurs from neighboring obstructions. Such considerations are important when siting solar systems or designing buildings.
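The computation behind a sun-path diagram can be sketched as below, using the standard solar geometry relations (Cooper's approximation for declination and the usual altitude/azimuth formulas). The function name and sign conventions are illustrative choices, not taken from the text:

```python
import math

def solar_position(latitude_deg, day_of_year, solar_hour):
    """Solar altitude and azimuth (in degrees) from standard solar
    geometry relations. Azimuth is measured from due south, taken
    negative before solar noon (east) and positive after (west)."""
    phi = math.radians(latitude_deg)
    # Cooper's approximation for the solar declination angle
    delta = math.radians(
        23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0)))
    # Hour angle: 15 degrees per hour away from solar noon
    h = math.radians(15.0 * (solar_hour - 12.0))
    # Solar altitude angle
    sin_alt = (math.sin(phi) * math.sin(delta)
               + math.cos(phi) * math.cos(delta) * math.cos(h))
    alt = math.asin(sin_alt)
    # Solar azimuth angle (magnitude from south), clamped for round-off
    cos_az = ((math.sin(alt) * math.sin(phi) - math.sin(delta))
              / (math.cos(alt) * math.cos(phi)))
    az = math.degrees(math.acos(max(-1.0, min(1.0, cos_az))))
    if solar_hour < 12.0:
        az = -az
    return math.degrees(alt), az

# Spring equinox (day 81, declination ~0) at latitude 40 N, solar noon:
# the sun should stand (90 - 40) = 50 degrees above the horizon, due south.
alt, az = solar_position(40.0, 81, 12.0)
print(f"altitude = {alt:.1f} deg, azimuth = {az:.1f} deg")
```

Sweeping `solar_hour` over the day and `day_of_year` over the year and plotting azimuth versus altitude traces out the sun-path curves of Fig. 3.30.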

3.5.5 Interactive and Dynamic Graphics

The above types of plots can be generated by most present-day data analysis software programs. More specialized software programs allow interactive data visualization, which provides much greater insight and intuitive understanding into data trends, correlations, outliers, and local behavior, especially when large amounts of data are being analyzed. Animation has also been used to advantage in understanding time-series system behavior from monitored data, since effects such as diurnal and seasonal differences in building energy use can be conveniently investigated. Animated scatter plots of the x and y variables; use of filtering, zooming, brushing, and distortion lenses; and judicious use of color can provide even better visual insights to the professional and enhance classroom learning as well. The interested reader can refer to Keim and Ward (2003) for a more in-depth classification and treatment of advanced data visualization techniques most appropriate for large datasets. Different domains have developed specialized visualization tools. Glaser and Ubbelohde (2001) describe novel high-performance visualization techniques for viewing time-dependent data common to building energy simulation program output. Some of these techniques include: (i) brushing and linking, where the user can investigate the behavior during a few days of the year; (ii) tessellating a 2-D chart into multiple smaller 2-D charts, giving a four-dimensional (4-D) view of the data such that a single value of a representative sensor can be evenly divided into smaller spatial plots arranged by time of day; (iii) magic lenses that can zoom into a certain portion of the room; and (iv) magic brushes. These techniques enable rapid inspection of trends and singularities that cannot be gleaned from conventional viewing methods.

Fig. 3.29 Example of a three-dimensional plot of measured hourly electricity use in a commercial building over 9 months. (From ASHRAE 2002, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.30 Figure illustrating an overlay plot for shading calculations. The sun-path diagram is generated by computing the solar altitude and azimuth angles for a given latitude (40° N in this case) during different times of the day and times of the year. Trees and other objects can obstruct the observer. Such periods are conveniently determined by drawing the contours of these objects, characterized by the angles φp and βp computed from basic geometry, and overlaying them on the sun-path diagram. (From Reddy et al. 2016, by permission of CRC Press)

3.5.6 Basic Data Transformations

Another important aspect of EDA is data transformation. Such transformations are meant to simplify the analysis by removing effects such as strong asymmetry, many outliers in one tail, or batches of data with different spreads, and by promoting linear relationships between regressor and response variables; in short, to yield more effective insights into the dataset being analyzed (Hoaglin et al. 1983). Typical examples include converting into appropriate units, taking ratios, rescaling, and applying mathematical corrections to the data, such as taking logarithms or square roots. Such transformations are discussed in various chapters of this book (e.g., Chap. 5 in the context of regression model building, Chap. 11 for data mining, and Chap. 12 for sustainability assessments). Only the very basic rescaling or normalization methods are described below:

(a) Decimal scaling moves the decimal point but still preserves most of the original data. The observations of a given variable are divided by 10^x, where x is the smallest integer such that all the scaled observations lie between −1 and 1. For example, say the largest value is 289 and the smallest value is −150. Then, since x = 3, all observations are divided by 1000 so as to lie between [−0.150 and 0.289].

(b) Min-max scaling allows for better distribution of observations over the range of variation than does decimal scaling. It does this by redistributing the values to lie between [0 and 1]. Hence, each observation is normalized as:

z_i = (x_i − x_min) / (x_max − x_min)    (3.15)

where x_max and x_min are the maximum and minimum numerical values, respectively, of the x variable. Sometimes, x_min may be close or equal to zero, and then Eq. 3.15 simplifies down to:

z_i = x_i / x_max    (3.16)

Note that though this transformation may look very appealing, the scaling relies largely on the minimum and maximum values, which are generally not very robust and may be error prone. Min-max scaling is a linear transformation. One could also scale some of the variables using non-linear functions such as power or logarithmic functions (as discussed in Sect. 12.5.3).

(c) Standard deviation scaling is widely used for distance measures (such as in multivariate statistical analysis) but transforms data into a form unrecognizable from the original data. Here, each observation is transformed as follows:

z_i = (x_i − x̄) / s_x    (3.17)

where x̄ and s_x are the mean and standard deviation, respectively, of the x variable. The dataset is said to be converted into one with zero mean and unit standard deviation (a transformation adopted in several statistical tests and during multivariate regression).

3.6 Overall Measurement Uncertainty

3.6.1 Need for Uncertainty Analysis

Any measurement exhibits some difference between the measured value and the true value and, therefore, has an associated uncertainty. A statement of measured value without an accompanying uncertainty statement has limited meaning. Uncertainty is the interval around the measured value within which the true value is expected to fall at some stated confidence level (CL). “Good data” is not characterized by point values only; the data should be within an acceptable uncertainty interval or, in other words, should provide an acceptable degree of confidence in the result. Measurements made in the field are especially subject to errors. In contrast to measurements taken under the controlled conditions of a laboratory setting, field measurements are typically made under less predictable circumstances and with less accurate and less expensive instrumentation. Furthermore, field measurements are vulnerable to errors arising from:

(a) Variable measurement conditions, so that the method employed may not be the best choice for all operating conditions.
(b) Limited instrument field calibration, because it is typically more complex and expensive than laboratory calibration.
(c) Limitations in the ability to adjust instruments in the field.

With appropriate care, many of these sources of error can be addressed: (i) through optimization of the measurement system to provide maximum benefit for the chosen budget, and (ii) through the systematic development of a procedure by which an uncertainty statement can be ascribed to the result. The results of a practitioner who does not consider sources of error are likely to be questioned by others, especially since the engineering community is becoming increasingly sophisticated and mature about the proper reporting of measured data and associated uncertainties.

3.6.2 Basic Uncertainty Concepts: Random and Bias Errors

There are several standard documents for evaluating and reporting uncertainties in measurements, parameters, and methods, and for propagation of those uncertainties to the test result. The International Organization for Standardization (ISO) standards are the basis from which professional organizations have developed standards specific to their purpose (e.g., ASME PTC19.1-2018). The following material is largely drawn from Guideline 2 (ASHRAE G-2 2005), which deals with engineering analysis of experimental data. Uncertainty sources may be classified as either systematic/bias or random, and both are treated as random variables, with, however, different multipliers applied to them. The end result of an uncertainty analysis is a numerical estimate of the test uncertainty with an appropriate CL.

(a) Bias or systematic error (or fixed error) is analogous to sensor accuracy (see Sect. 3.1). In certain cases, the fixed offset (bias) errors can be determined. For example, a bias is present if a temperature sensor always reads 1 °C higher than the true value from a certified calibration procedure, and this miscalibration error can be corrected. However, there are other causes, such as improper placement of the sensor, degradation, or the particular measurement technique, that cause perturbations in the sensor reading (akin to Fig. 3.1b). These perturbations are also treated as a random variable but are characterized by a fixed and unchanging uncertainty value that does not reduce even with multi-sampling (this aspect is elaborated below in Sect. 3.6.4). Thus, in such situations, a simple bias correction to the measurements cannot be applied.

(b) Random error (or precision error) is an error due to the unpredictable and unknown variations in the experiment that cause readings to take random values on either side of some mean value. Measurements may be precise or imprecise, depending on how well an instrument can reproduce subsequent readings of an unchanged input (Fig. 3.31). Only random errors can be treated by statistical methods. There are two types of random errors: (i) additive errors that are independent of the magnitude of the observations, and (ii) multiplicative errors that are dependent on the magnitude of the observations (Fig. 3.32). Usually instrument accuracy is stated in terms of percent of full scale, and in such cases the uncertainty of a reading is taken to be additive, i.e., irrespective of the magnitude of the reading.

Random errors are differences from one observation to the next due to both sensor noise and extraneous conditions affecting the sensor. The random error changes from one observation to the next, but its mean (average value) over a very large number of observations is taken to approach zero. Random error generally has a well-defined probability distribution that can be used to bound its variability in statistical terms, as described in the next two sub-sections, when a finite number of observations is made of the same variable.

Fig. 3.31 The four general manifestations of measurement bias and precision errors of the population average estimated from sample measurements: (a) unbiased and precise, (b) biased and precise, (c) unbiased and imprecise, (d) biased and imprecise

Fig. 3.32 Conceptual figures illustrating how additive and multiplicative errors affect the uncertainty bands around the trend line
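The distinction between the two error types can be illustrated with a small simulated sensor (all numbers below are hypothetical): averaging many readings drives the random component toward zero, but the fixed bias survives intact, which is why multi-sampling cannot remove a miscalibration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

TRUE_VALUE = 20.0   # hypothetical true temperature, deg C
BIAS = 1.0          # fixed miscalibration offset, deg C
SIGMA = 0.5         # standard deviation of random sensor noise, deg C

# Simulate repeated readings: true value + fixed bias + random noise
readings = [TRUE_VALUE + BIAS + random.gauss(0.0, SIGMA)
            for _ in range(10000)]

mean_reading = statistics.mean(readings)
# The mean of many readings converges to (true value + bias):
# averaging removes the random error but leaves the bias untouched.
print(f"mean of readings : {mean_reading:.3f}")
print(f"residual vs true : {mean_reading - TRUE_VALUE:.3f} (close to the bias of {BIAS})")
```

The residual of about 1 °C persists no matter how many readings are averaged, whereas the scatter of individual readings (±0.5 °C) shrinks in the mean by the factor √n discussed in the next sub-section.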

3.6.3 Random Uncertainty

Based on measurements of a random variable X, the true value of X can be specified to lie in the interval (Xbest ± Ux) where Xbest is usually the mean value of the measurements taken and Ux is the uncertainty in X that corresponds to the estimate of the effects of combining fixed and random errors.


The uncertainty being reported is specific to a confidence level (CL),6 which can be directly interpreted as a probability. The confidence interval (CI) defines the range of values, or the bounds/limits, that can be expected to include the true value with a stated probability. For example, a statement that the CI at the 95% CL is 5.1–8.2 implies that the true value will be contained within the interval bounded by {5.1, 8.2} in 19 out of 20 predictions (95% probability), or that one is 95% confident that the true value lies between 5.1 and 8.2. This is a loose interpretation (but easier for practitioners to understand); the more accurate one is that the CI applies to the difference between the sample and the population means and not to the population mean itself. An uncertainty statement with a low CL is usually of little use. For example, in the previous example, if a CL of 40% is used instead of 95%, the interval becomes a tight {7.6, 7.7}. However, only 8 out of 20 predictions will likely lie between 7.6 and 7.7. Conversely, it is useless to seek a 100% CL since then the true value of some quantity would lie between plus and minus infinity. Multi-sample data (repeated measurements of a fixed quantity using altered test conditions, such as different observers or different instrumentation or both) provide greater reliability and accuracy than single-sample data (measurements by one person using a single instrument). For the majority of engineering cases, it is impractical and too costly to perform a true multi-sample experiment. Strictly speaking, merely taking repeated readings with the same procedure and equipment does not provide multi-sample results; however, such a procedure is often accepted by the engineering community as a fair approximation of a multi-sample experiment. Depending upon the sample size of the data (greater or less than about 30 samples), different statistical considerations and equations apply.

The issue of estimating CI is further discussed in Chap. 4, while operational equations are presented below. These levels or limits are directly based on the Gaussian and the Student-t distributions presented in Sect. 2.4.3.

(a) Random uncertainty in large samples (n > about 30): The best estimate of a variable x is usually its sample mean value x̄. The limits of the CI are determined from the sample standard deviation s_x. The typical procedure is to assume that the individual data values are scattered about the mean following a certain probability distribution function, within (±z·s_x) of the mean, where z is a multiplier described below. Usually a normal probability curve (Gaussian distribution) is assumed to represent the dispersion in experimental data, unless the process is known to follow one of the other standard distributions (discussed in Sect. 2.4). For a normal distribution, the standard deviation indicates the degree of dispersion of the values about the mean. From Table A3, for z = 1.96,7 the area shown shaded is 0.025, which translates to 0.05 for a two-tailed distribution, implying that 95% of the data will be within (±1.96 s_x) of the mean. Thus, the z-multiplier has a direct relationship with the CL selected (assuming a known probability distribution). The CI for the mean of n multi-sample random data points, with no fixed error, is:

x_min = x̄ − z·s_x/√n  and  x_max = x̄ + z·s_x/√n    (3.18a)

(b) Random uncertainty in small samples (n < about 30): In many circumstances, the analyst will not be able to collect a large number of data points and may be limited to a dataset of fewer than 30 values. Under such conditions, the mean value and the standard deviation are computed as before, but the z-value applicable for the normal distribution cannot be used for small samples. The new values, called t-values, are tabulated for different degrees of freedom d.f. (ν = n − 1) and for the acceptable degree of confidence (see Table A4).8 The CI for the mean value of x, when no fixed (bias) errors are present in the measurements, is given by:

x_min = x̄ − t·s_x/√n  and  x_max = x̄ + t·s_x/√n    (3.18b)

6 Several publications cite uncertainty intervals without specifying a corresponding CL; such practice should be avoided.
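Equations 3.18a and 3.18b differ only in the multiplier, so both can be sketched with one small function; the z- or t-multiplier is looked up in Tables A3/A4 by the caller, and the function name is an illustrative choice:

```python
import math

def ci_mean(xbar, s, n, multiplier):
    """Confidence interval for the mean (Eqs. 3.18a/b). The multiplier is
    the z-value for large samples (n > ~30) or the t-value for d.f. = n - 1
    otherwise, both read from tables at the desired confidence level."""
    half_width = multiplier * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Large sample: n = 50, z = 1.96 at the 95% CL
lo, hi = ci_mean(30.0, 3.0, 50, 1.96)
print(f"95% CI (n=50): [{lo:.2f}, {hi:.2f}]")

# Small sample: n = 21, t = 2.086 for d.f. = 20 at the 95% CL
lo, hi = ci_mean(30.0, 3.0, 21, 2.086)
print(f"95% CI (n=21): [{lo:.2f}, {hi:.2f}]")
```

Note how the small-sample interval is wider even though the mean and standard deviation are identical: the t-multiplier penalizes the poorer estimate of s_x.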

For example, consider the case of d.f. = 10 and a two-tailed 95% CL. One finds from Table A4 that t = 2.228. Note that this reduces to t = 2.086 for d.f. = 20 and approaches the z-value of 1.96 as d.f. → ∞.

Example 3.6.1 Estimating confidence intervals (CI)

(a) The length of a field is measured 50 times. The mean is 30 with a standard deviation of 3. Determine the 95% CI assuming no fixed error.

This is a large-sample case, for which the z-multiplier is 1.96. Hence, from Eq. 3.18a, the 95% CI = 30 ± (1.96)(3)/(50)^(1/2) = 30 ± 0.83 = {29.17, 30.83}.

(b) Only 21 measurements are taken, and the same mean and standard deviation as in (a) are found. Determine the 95% CI assuming no fixed error.

This is a small-sample case, for which the t-value = 2.086 for d.f. = 20. Then, from Eq. 3.18b, the 95% CI will turn out to be wider: 30 ± (2.086)(3)/(21)^(1/2) = 30 ± 1.37 = {28.63, 31.37} ■

7 Note that the value of 1.96 corresponds to very large samples (>120 or so). For a sample size of 30, the z-value can be read off Table A4, which shows a value of 2.045 for degrees of freedom = 30 − 1 = 29 for a two-tailed 95% CL. However, it is common practice to simply assume the z-values for samples greater than 30, even though this does introduce some error.

8 Table A4 assembles critical values for both the one-tailed and two-tailed distributions, while most of the discussion here applies to the latter. See Sect. 4.2.1 for the distinction between both.

3.6.4 Bias Uncertainty

Estimating the bias or fixed error of a random variable at a specified confidence level (commonly, 95% CL) is described below. The fixed error BX for a given value x is assumed to be a single value drawn from some larger distribution of possible fixed errors. The treatment is similar to that of random errors with the major difference that only one value is considered even though several observations may be taken. When further knowledge is lacking, a normal distribution is usually assumed. Hence, if a manufacturer specifies that the fixed uncertainty BX = ±1.0 °C with 95% CL (compared to some standard reference device), then one assumes that the fixed error belongs to a larger distribution (taken to be Gaussian) with a standard deviation SB = 0.5 °C (since the corresponding z-value ≃2.0).

3.6.5 Overall Uncertainty

The overall uncertainty of a measured variable x combines the random and bias uncertainty estimates. Though several forms of this expression appear in different texts, a convenient working formulation is as follows:

U_x = [ B_x² + (t·s_x/√n)² ]^(1/2)    (3.19)

where:
U_x = overall uncertainty in the value x at a specified CL
B_x = uncertainty in the bias or fixed component at the specified CL
s_x = standard deviation estimate for the random component
n = sample size
t = t-value at the specified CL for the appropriate degrees of freedom
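For the common case where the bias and the single-reading random uncertainty are both already stated at the same CL (so the t-multiplier is effectively folded into the random figure), Eq. 3.19 can be sketched as below; the function name and the hypothetical flow-meter numbers are illustrative:

```python
import math

def overall_uncertainty(bias, random_single, n=1):
    """Sketch of Eq. 3.19 for the case where `bias` and `random_single`
    (the random uncertainty of a single reading) are stated at the same
    confidence level: the random component shrinks with sqrt(n) repeated
    readings, while the bias component does not."""
    return math.sqrt(bias**2 + (random_single / math.sqrt(n))**2)

# Hypothetical flow meter: full scale 150 L/s, actual reading 125 L/s,
# random error 6% and bias 4% of full scale, both at the 95% CL
bias = 0.04 * 150.0            # +/- 6 L/s
rand = 0.06 * 150.0            # +/- 9 L/s

u1 = overall_uncertainty(bias, rand, n=1)    # single reading
u25 = overall_uncertainty(bias, rand, n=25)  # mean of 25 readings
print(f"n=1 : +/-{u1:.2f} L/s ({u1 / 125.0:.1%} of reading)")
print(f"n=25: +/-{u25:.2f} L/s ({u25 / 125.0:.1%} of reading)")
```

Because the bias term enters unreduced, the overall uncertainty levels off near ±6 L/s no matter how many readings are averaged, which is exactly the behavior worked out in the examples that follow.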

Example 3.6.2 For a single measurement, the statistical concept of standard deviation does not apply. Nonetheless, one could estimate it from the manufacturer's specifications, if available. One wishes to estimate the overall uncertainty at 95% CL in an individual measurement of water flow rate in a pipe under the following conditions:

(a) Full-scale meter reading 150 L/s
(b) Actual flow reading 125 L/s
(c) Random error of instrument is ±6% of full-scale reading at 95% CL
(d) Fixed (bias) error of instrument is ±4% of full-scale reading at 95% CL

The solution is rather simple since all stated uncertainties are at 95% CL. It is implicitly assumed that the normal distribution applies. The random error = 150 × 0.06 = ±9 L/s. The fixed error = 150 × 0.04 = ±6 L/s. The overall uncertainty can be estimated from Eq. 3.19 with n = 1:

U_x = (6² + 9²)^(1/2) = ±10.82 L/s

The fractional overall uncertainty at 95% CL = U_x/x = 10.82/125 = 0.087 = 8.7% ■

Example 3.6.3 Consider Example 3.6.2. In an effort to reduce the overall uncertainty, 25 readings of the flow are taken instead of only one reading. The resulting uncertainty in this case is determined as follows:

• The bias error remains unchanged at ±6 L/s
• The random error decreases by a factor of √n to 9/(25)^(1/2) = ±1.8 L/s
• Then, from Eq. 3.19, the overall uncertainty is: U_x = (6² + 1.8²)^(1/2) = ±6.26 L/s
• The fractional overall uncertainty at 95% CL = U_x/x = 6.26/125 = 0.05 = 5.0%

Increasing the number of readings from 1 to 25 reduces the absolute uncertainty in the flow measurement from ±10.82 L/s to ±6.26 L/s and the relative uncertainty from ±8.7% to ±5.0%. Because of the large fixed error, further increase in the number of readings would result in only a small reduction in the overall uncertainty. ■

Example 3.6.4 A flow meter manufacturer stipulates a random error of 5% for its meter at 95.5% CL (i.e., at z = 2). Once installed, the engineer estimates that the bias error due to the placement of


Table 3.7 Table for Chauvenet's criterion of rejecting outliers

Number of readings n:       5     6     7     10    15    20    25    30    50    100   300   500   1000
Deviation ratio dmax/sx:    1.65  1.73  1.80  1.96  2.13  2.24  2.33  2.51  2.57  2.81  3.14  3.29  3.48

the meter in the flow circuit is 2% at 95.5% CL. The flow meter takes a reading every minute, but only the mean value of 15 such measurements is recorded once every 15 min. Estimate the overall uncertainty of the mean of the recorded values at 99% CL.

The bias uncertainty can be associated with the normal tables. From Table A3, z = 2.575 has an associated two-tailed probability of 0.01, which corresponds to the 99% CL. Given that the bias error at 95.5% CL (z = 2) is 2%, the bias uncertainty at 99% CL (z = 2.575) would be 2.575%. Next, the random error at z = 1 is half of that at z = 2, i.e., 2.5%. However, the number of observations is less than 30, and so the Student-t distribution has to be used for the random uncertainty component. From Table A4, the critical t-value = 2.977 for d.f. = 15 − 1 = 14 and two-tailed CL = 99%. Hence, from Eq. 3.19, the overall uncertainty of the recorded values at 99% CL is:

U_x = [ 2.575² + ((2.977)(2.5)/(15)^(1/2))² ]^(1/2) = 3.22% ■

3.6.6 Chauvenet's Statistical Criterion of Data Rejection

The statistical considerations described above can lead to analytical screening methods that point out data errors not flagged by graphical methods alone. Though several types of rejection criteria have been proposed, perhaps the best known is Chauvenet's criterion, which is said to provide an objective and quantitative method for data rejection. This criterion, which presumes that the errors are normally distributed and have constant variance, specifies that any reading out of a series of n readings shall be rejected if the magnitude of its deviation d_max from the mean value of the series (= abs(x_i − x̄)) is such that the two-sided probability of occurrence of such a deviation exceeds (1/2n). This criterion should not be applied to small datasets since the Gaussian distribution does not apply in such cases. Chauvenet's criterion is approximately given by the following regression model:

d_max/s_x = 0.8478 + 0.5375 ln(n) − 0.02309 (ln n)²    (3.20)

where s_x is the standard deviation of the series and n is the number of data points.9 The deviation ratio for different numbers of readings is given more accurately in Table 3.7. For example, if one takes 15 observations, an observation shall be discarded if its deviation from the mean exceeds d_max = (2.13)s_x. This data rejection should be done only once; more than one round of elimination using Chauvenet's criterion is not advised. Note that Chauvenet's criterion has inherent assumptions that may not be justified. For example, the underlying distribution may not be normal but could have a longer tail; in such cases, one may be throwing out good data. A more scientific manner of dealing with outliers is not to reject data points but to use either weighted regression or robust regression, where observations farther from the mean are given less weight than those near the center (see Sects. 5.6 and 9.9).
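A minimal sketch of the criterion using the regression fit of Eq. 3.20 is shown below; the readings are made up for illustration, and the single-pass screening follows the advice above:

```python
import math
import statistics

def chauvenet_ratio(n):
    """Maximum allowable deviation ratio d_max/s_x from the regression
    fit of Eq. 3.20 to the values of Table 3.7."""
    ln_n = math.log(n)
    return 0.8478 + 0.5375 * ln_n - 0.02309 * ln_n**2

def chauvenet_filter(data):
    """Single-pass Chauvenet screening: flag readings whose deviation
    from the series mean exceeds d_max. Applied once only."""
    mean = statistics.mean(data)
    d_max = chauvenet_ratio(len(data)) * statistics.stdev(data)
    kept = [x for x in data if abs(x - mean) <= d_max]
    rejected = [x for x in data if abs(x - mean) > d_max]
    return kept, rejected

# 15 hypothetical readings with one suspect value (12.4)
readings = [10.1, 10.2, 9.9, 10.0, 10.3, 9.8, 10.1, 10.0,
            9.9, 10.2, 10.1, 9.7, 10.0, 10.1, 12.4]
kept, rejected = chauvenet_filter(readings)
print(f"d_max/s_x for n=15: {chauvenet_ratio(15):.2f}")  # matches Table 3.7
print("rejected:", rejected)
```

For n = 15 the function returns a ratio of about 2.13, in agreement with Table 3.7, and the lone reading of 12.4 is flagged while the remaining fourteen are retained.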

9 The regression fit has an R-square of 99.6% and root mean square error (RMSE) = 0.0358; these goodness-of-fit statistical criteria are explained in Sect. 5.3.2.

3.7 Propagation of Errors

In many cases, the variable used for data analysis is not directly measured; rather, the values of several associated variables are measured, which are then combined using a functional relationship. The objective of this section is to present the methodology to estimate the overall error/uncertainty10 of a functional value y knowing the uncertainties in the individual input variables x_i. The random and fixed components, which together constitute the overall error, must be estimated separately. The treatment that follows, though limited to random errors, could also apply to bias errors. It is recommended that the Taylor series method be applied when the errors are relatively small compared to the measurement values (say, 5% or so). When the errors are large (say, over 15%), the Monte Carlo (MC) method (Sect. 3.7.2) is preferable.

3.7.1 Taylor Series Method for Cross-Sectional Data

The general approach to estimating the error of a function y = y(x1, x2, . . ., xn), whose independently measured variables are all specified at the same confidence level, is to use the first-order Taylor series expansion (often referred to as the Kline-McClintock propagation of errors equation):

ε_y = [ Σ_{i=1}^{n} ((∂y/∂x_i)·ε_{x,i})² ]^(1/2)    (3.21)

where:
ε_y = error in the function value
ε_{x,i} = error in measured quantity x_i (equivalent to U_x in the previous section)

Neglecting terms higher than the first order (implied by a first-order Taylor series expansion), the propagation of error expressions for the basic arithmetic operations are given below. Let x1 and x2 have errors ε_x1 and ε_x2. Then, for the basic arithmetic operations, Eq. 3.21 simplifies to:

Addition or subtraction: For y = x1 ± x2,

ε_y = (ε_x1² + ε_x2²)^(1/2)    (3.22)

Multiplication: For y = x1·x2,

ε_y = (x1·x2)·[(ε_x1/x1)² + (ε_x2/x2)²]^(1/2)    (3.23)

Division: For y = x1/x2,

ε_y = (x1/x2)·[(ε_x1/x1)² + (ε_x2/x2)²]^(1/2)    (3.24)

For functions involving multiplication and division only, it is much simpler to use the following expression than the more general Eq. 3.21. If y = (x1·x2)/x3, then the fractional standard deviation is given by:

ε_y/y = [(ε_x1/x1)² + (ε_x2/x2)² + (ε_x3/x3)²]^(1/2)    (3.25)

10 This chapter uses the terms "uncertainty" and "error" interchangeably. There is, however, a distinction. The word "error" is usually used in the context of sensors and derived measurements when the bias and random influences are small compared to the magnitude of the observation (say, 5% or less). When these are large, the term "uncertainty" is more appropriate; for example, when population statistics are derived from sample data, or when the error propagation analysis involves very large uncertainties in the individual variables, warranting a stochastic approach.

The error or uncertainty in the result depends on the squares of the uncertainties in the independent variables. This means that if the uncertainty in one variable is larger than the uncertainties in the other variables, then it is the largest uncertainty that dominates. To illustrate, suppose there are three variables with an uncertainty of magnitude 1 and one variable with an uncertainty of magnitude 5. The uncertainty in the result would be (5² + 1² + 1² + 1²)^0.5 = (28)^0.5 = 5.29. Clearly, the effect of the uncertainty in the single largest variable dominates the others. An analysis involving the relative magnitudes of uncertainties plays an important role during the design of an experiment and the procurement of instrumentation. Very little is gained by trying to reduce the "small" uncertainties since it is the "large" ones that dominate. Any improvement in the overall experimental result must be achieved by improving the instrumentation or experimental technique responsible for these relatively large uncertainties. This concept is illustrated below.

Example 3.7.1¹¹ Relative error in Reynolds number for flow in a pipe

Water is flowing in a pipe at a certain measured rate. The temperature of the water is measured, and the viscosity and density are then found from tables of water properties. Determine the probable errors of the Reynolds number (Re) at the low- and high-flow conditions given the information assembled in Table 3.8.

Recall that Re = ρVd/μ. Since the function involves multiplication and division only, it is easier to work with Eq. 3.25.

11 Adapted from Schenck (1969), by permission of McGraw-Hill.


Table 3.8 Error table of the four quantities that define the Reynolds number (Example 3.7.1)

Quantity                  Minimum flow   Maximum flow   Random error at full flow (95% CL)   % Error (min)a   % Error (max)a
Velocity, m/s (V)         1              20             0.1                                  10               0.5
Pipe diameter, m (d)      0.2            0.2            0                                    0                0
Density, kg/m3 (ρ)        1000           1000           1                                    0.1              0.1
Viscosity, kg/m-s (μ)     1.12 × 10-3    1.12 × 10-3    0.45 × 10-5                          0.4              0.4

a Note that the last two columns under "% Error" are computed from the previous three columns of data

Fig. 3.33 Expected variation in experimental relative error (at 95% CL) with magnitude of Reynolds number (Example 3.7.1). [Scatter plot: relative error in Re on the y-axis decreasing with increasing Reynolds number (Re) on the x-axis]

At the minimum flow condition, the relative error in Re (assuming no error in the pipe diameter value) is:

ε(Re)/Re = [(0.1/1)² + (1/1000)² + (0.45 × 10⁻⁵/1.12 × 10⁻³)²]^(1/2)
         = [0.1² + 0.001² + 0.004²]^(1/2) = 0.100 or 10%

On the other hand, at the maximum flow condition, the percentage error is:

ε(Re)/Re = [0.005² + 0.001² + 0.004²]^(1/2) = 0.0065 or 0.65%

The above example reveals that (i) at low-flow conditions the error is 10%, which reduces to 0.65% at high-flow conditions, and (ii) at low-flow conditions the other sources of error are absolutely dwarfed by the 10% error due to flow measurement uncertainty. Thus, the only way to improve the experiment is to improve flow measurement accuracy. If the experiment is run without changes, one can confidently expect the data at the low-flow end to show a broad scatter becoming smaller as the velocity is increased. This phenomenon is captured by the 95% CI shown as relative errors in Fig. 3.33. ■
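The fractional-error combination of Eq. 3.25 is easy to script. The short sketch below (the function and variable names are illustrative, not from the text) reproduces the two Re error estimates of Example 3.7.1:

```python
import math

def fractional_error(pairs):
    """Eq. 3.25: combined fractional error for a product/quotient function.
    pairs is a list of (absolute_error, value) tuples, one per variable."""
    return math.sqrt(sum((err / val) ** 2 for err, val in pairs))

# Example 3.7.1: Re = rho*V*d/mu, with the pipe diameter assumed error-free
rho, mu = 1000.0, 1.12e-3           # kg/m3 and kg/m-s
err_rho, err_mu = 1.0, 0.45e-5      # random errors at 95% CL (Table 3.8)
err_V = 0.1                         # velocity error, m/s

low = fractional_error([(err_V, 1.0), (err_rho, rho), (err_mu, mu)])
high = fractional_error([(err_V, 20.0), (err_rho, rho), (err_mu, mu)])
print(f"{low:.3f}  {high:.4f}")     # about 0.100 (10%) and 0.0065 (0.65%)
```

Note how the velocity term alone fixes the low-flow answer, mirroring the dominance argument made above.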

Equation 3.21 applies when the measured variables are uncorrelated. If they are correlated, their interdependence can be quantified by the covariance (defined by Eq. 3.11). If two variables x1 and x2 are correlated, then the error of their sum is given by:

εy = [εx1² + εx2² + 2 cov(x1, x2)]^(1/2)    (3.26)

Note that the covariance term can assume positive or negative values, and so the combined error can be higher or lower than that for uncorrelated independent variables.

Example 3.7.2 Uncertainty in overall heat transfer coefficient
The overall heat transfer coefficient U of a heat exchanger consisting of a fluid flowing inside and another fluid flowing outside a steel pipe of negligible thermal resistance is given by:

U = (1/h1 + 1/h2)⁻¹ = h1 h2/(h1 + h2)    (3.27)

where h1 and h2 are the individual coefficients of the two fluids on either side of the pipe. If h1 = 15 W/m² °C with a fractional error of 5% at 95% CL and h2 = 20 W/m² °C with a fractional error of 3%, also at 95% CL, what will be the fractional random uncertainty of the U coefficient at 95% CL assuming the bias error to be zero?

In this case, because of the addition term, one has to use the fundamental equation given by Eq. 3.21. In order to use the propagation of error equation, the partial derivatives need to be computed. One could proceed to do so analytically using basic calculus. Then:

∂U/∂h1 = [h2(h1 + h2) − h1 h2]/(h1 + h2)² = h2²/(h1 + h2)²    (3.28)

and

∂U/∂h2 = [h1(h1 + h2) − h1 h2]/(h1 + h2)² = h1²/(h1 + h2)²    (3.29)

The absolute uncertainty εU in the overall heat transfer coefficient U is given by Eq. 3.21:

εU = [(∂U/∂h1 · εh1)² + (∂U/∂h2 · εh2)²]^(1/2)    (3.30)

where εh1 and εh2 are the errors of the coefficients h1 and h2, respectively, and are determined as:

εh1 = 0.05 × 15 = 0.75 and εh2 = 0.03 × 20 = 0.80

Plugging numerical values into the expression for U given by Eq. 3.27, one gets U = 8.571, while the partial derivatives given by Eqs. 3.28 and 3.29 are computed as:

∂U/∂h1 = 0.3265 and ∂U/∂h2 = 0.1837

Finally, from Eq. 3.30, the absolute error in the overall heat transfer coefficient U:

εU = [(0.3265 × 0.75)² + (0.1837 × 0.80)²]^(1/2) = 0.288 at 95% CL

The fractional error in U = (0.288/8.571) = 3.3%. ■

Another method of determining partial derivatives is to adopt a perturbation approach, which allows the local sensitivity or slope of the function to be evaluated numerically. A computer routine can be written to perform this task. One method is based on approximating partial derivatives by a central finite-difference approach. If y = y(x1, x2, . . ., xn), then:

∂y/∂x1 = [y(x1 + Δx1, x2, . . .) − y(x1 − Δx1, x2, . . .)]/(2Δx1)
∂y/∂x2 = [y(x1, x2 + Δx2, . . .) − y(x1, x2 − Δx2, . . .)]/(2Δx2)  etc.    (3.31)

No strict rules for the size of the perturbation or step size Δx can be framed since it depends on the underlying shape of the function. Perturbations in the range of 1–4% of the value are reasonable choices, and one should evaluate the stability of the partial derivative computed numerically by repeating the calculations for a few different step sizes. In cases involving complex experiments with extended debugging phases, one should update the uncertainty analysis whenever a change is made in the data reduction program. Commercial software programs are also available with in-built uncertainty propagation formulae. This procedure is illustrated in Example 3.7.3.

Example 3.7.3 Uncertainty in exponential growth models
Exponential growth models are used to describe several commonly encountered phenomena from population growth to consumption of resources. The amount of resource consumed over time Q(t) can be modeled as:

Q(t) = ∫₀ᵗ P0 e^(rt) dt = (P0/r)(e^(rt) − 1)    (3.32)

where P0 = initial consumption rate, and r = exponential rate of growth. The world coal consumption in 1986 was equal to 5.0 billion (short) tons and the estimated recoverable reserves of coal were estimated at 1000 billion tons.

(a) If the growth rate is assumed to be 2.7% per year, how many years will it take for the total coal reserves to be depleted?

Rearranging Eq. 3.32 results in:

t = (1/r) ln(1 + Q·r/P0)    (3.33)

or t = (1/0.027) × ln[1 + (1000)(0.027)/5] = 68.75 years

(b) Assume that the growth rate r and the recoverable reserves are subject to random uncertainty. If the uncertainties of both quantities are taken to be normal with standard deviation values of 0.2% (absolute) and 10% (relative), respectively, determine the lower and upper estimates of the years to depletion at the 95% CL.

While the partial derivatives can be derived analytically in this case, the use of Eq. 3.21 following a numerical approach is adopted for illustration. The pertinent results using Eq. 3.31 with a perturbation multiplier of 1% applied to both the base values of r (= 0.027) and of Q (= 1000) are assembled in Table 3.9. From here:

∂t/∂r = (68.37917 − 69.12924)/(0.02727 − 0.02673) = −1389, and
∂t/∂Q = (69.06297 − 68.43795)/(1010 − 990) = 0.03125
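The central-difference scheme of Eq. 3.31 and the derivative estimates above take only a few lines of code. The sketch below (function names are illustrative, not from the text) applies a 1% perturbation to each base value:

```python
import math

def t_depletion(Q, r, P0=5.0):
    """Eq. 3.33: years to deplete reserves Q given initial rate P0 and growth r."""
    return (1.0 / r) * math.log(1.0 + Q * r / P0)

def central_diff(f, x, dx):
    """Eq. 3.31: central finite-difference approximation of df/dx at x."""
    return (f(x + dx) - f(x - dx)) / (2.0 * dx)

# 1% perturbation multiplier about the base values r = 0.027 and Q = 1000
dt_dr = central_diff(lambda r: t_depletion(1000.0, r), 0.027, 0.01 * 0.027)
dt_dQ = central_diff(lambda Q: t_depletion(Q, 0.027), 1000.0, 0.01 * 1000.0)
print(round(dt_dr), round(dt_dQ, 5))   # approximately -1389 and 0.03125
```

Repeating the calls with a 2% step (0.02 in place of 0.01) is an easy way to carry out the stability check suggested above.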

Table 3.9 Numerical computation of the partial derivatives of t with Q and r (Example 3.7.3)

             Assuming Q = 1000              Assuming r = 0.027
Multiplier   r         t (from Eq. 3.32)    Q       t (from Eq. 3.32)
0.99         0.02673   69.12924             990     68.43795
1.00         0.02700   68.75178             1000    68.75178
1.01         0.02727   68.37917             1010    69.06297

Then:

εt = [(∂t/∂r · εr)² + (∂t/∂Q · εQ)²]^(1/2)
   = [((−1389)(0.002))² + ((0.03125)(0.1)(1000))²]^(1/2)
   = [2.778² + 3.125²]^(1/2) = 4.181

Thus, the lower and upper limits at the 95% CL (with z = 1.96) are:

68.75 ± (1.96)(4.181) = {60.55, 76.94} years

The analyst should repeat the above procedure with, say, a perturbation multiplier of 2% in order to evaluate the stability of the numerically derived partial derivatives. If these differ substantially, it is urged that the function be plotted and scrutinized for irregular behavior around the point of interest. ■

Example 3.7.4 Selecting instrumentation during the experimental design phase
A general uncertainty analysis is recommended at the planning phase for the purposes of proper instrument selection. This analysis should intentionally be kept simple. It is meant to identify the primary sources of uncertainty, to evaluate the relative weights of the different sources, and to perform a sensitivity analysis.

An experimental program is being considered involving continuous monitoring of a large chiller under field conditions. The objective of the monitoring is to determine the chiller coefficient-of-performance (COP) on an hourly basis. The fractional uncertainty in the COP should not be greater than 5% at 95% CL. The rated full load is 450 tons of cooling (1 ton = 12,000 Btu/h). The chiller is operated under constant chilled water and condenser water flow rates. Only random errors are to be considered.

The COP of a chiller is defined as the ratio of the amount of cooling at the evaporator (Qch) to the electric power (E) consumed:

COP = Qch/E    (3.34)

While power E can be measured directly, the amount of cooling Qch has to be determined by individual measurements of the chilled water volumetric flow rate and the difference between the supply and return chilled water temperatures along with water properties:

Qch = ρVcΔT    (3.35)

where:
ρ = density of water
V = chilled water volumetric flow rate, assumed constant during operation (rated flow rate = 1080 gpm)
c = specific heat of water
ΔT = temperature difference between the entering and leaving chilled water at the evaporator (a quantity that changes during operation)

From Eq. 3.25, the fractional uncertainty in COP (neglecting the small effect of uncertainties in the density and specific heat terms) is:

U_COP/COP = [(U_V/V)² + (U_ΔT/ΔT)² + (U_E/E)²]^(1/2)    (3.36)

Note that since this is a preliminary uncertainty analysis, only random errors are considered. The maximum flow reading of the selected meter is 1500 gpm with 4% uncertainty at 95% CL. This leads to an absolute uncertainty of (1500 × 0.04) = 60 gpm. The first term U_V/V is constant and does not depend on the chiller load since the flow through the evaporator is maintained constant at 1080 gpm. Thus:

(U_V/V)² = (60/1080)² = 0.0031 and U_V/V = ±0.056

The random error at 95% CL for the type of commercial-grade sensor to be used for temperature measurement is 0.2 °F. Consequently, the error in the measurement of the temperature difference is ΔT = (0.2² + 0.2²)^(1/2) = 0.28 °F. From manufacturer catalogs, the temperature difference between supply and return chilled water temperatures at full load can be assumed to be 10 °F. The fractional uncertainty at full load is then:

(U_ΔT/ΔT)² = (0.28/10)² = 0.00078 and U_ΔT/ΔT = ±0.028

The power instrument has a full-scale value of 400 kW with an error of 1% at 95% CL, i.e., an error of 4.0 kW. The chiller rated capacity is 450 tons of cooling, with an assumed realistic lower bound of 0.8 kW per ton of cooling. The anticipated electric draw at full load of the chiller = 0.8 × 450 = 360 kW. The fractional uncertainty at full load is then:

(U_E/E)² = (4.0/360)² = 0.00012 and U_E/E = ±0.011

Thus, the fractional uncertainty in the power measurement is about one-fifth of that in the flow rate. Propagation of the above errors yields the fractional uncertainty at 95% CL at full chiller load of the measured COP (using Eq. 3.36):

U_COP/COP = (0.0031 + 0.00078 + 0.00012)^(1/2) = 0.063 = 6.3%

It is clear that the fractional uncertainty of the proposed instrumentation is not satisfactory for the intended purpose. Looking at the fractional uncertainties of the individual contributions, the logical remedy is to select a more accurate flow meter or one with a lower maximum flow reading. ■
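The uncertainty budget of Example 3.7.4 can be tabulated with a short script (a sketch with illustrative names; the instrument values are those used in the example):

```python
import math

# Squared fractional-uncertainty contributions at full load, per Eq. 3.36
terms = {
    "flow":  (60.0 / 1080.0) ** 2,   # 4% of 1500 gpm meter span vs 1080 gpm
    "dT":    (0.28 / 10.0) ** 2,     # 0.28 F sensor-pair error vs 10 F design dT
    "power": (4.0 / 360.0) ** 2,     # 1% of 400 kW span vs 360 kW full-load draw
}
u_cop = math.sqrt(sum(terms.values()))

# Rank the contributors to see which instrument to improve first
for name, sq in sorted(terms.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} contributes {sq / u_cop**2:5.1%} of the variance")
print(f"COP fractional uncertainty = {u_cop:.1%} (target: 5.0%)")
```

Ranking the variance contributions makes the conclusion immediate: the flow meter dominates the budget, so it is the instrument worth upgrading.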

3.7.2 Monte Carlo Method for Error Propagation Problems

The previous method of ascertaining errors/uncertainty, based on a first-order Taylor series expansion, is widely used, but it has limitations. If the relative uncertainty is large, the method may be inaccurate for non-linear functions since it assumes first-order derivatives based on local functional behavior. Further, for the CI of the functional variable to have a statistical interpretation, the errors have to be normally distributed. Finally, deriving partial derivatives of complex and interlinked analytical functions (as is common for models involving system simulation of various individual components) is a tedious and error-prone affair.

A more general manner of dealing with uncertainty propagation is to use Monte Carlo (MC) methods, which are widely used for a number of complex applications (and treated at more length in Sects. 6.7.3, 10.6.7, and 12.2.8). These are numerical methods for solving problems that involve random numbers and require considerations of probability. MC, in essence, is a numerical process of repeatedly calculating a mathematical function in which the input variables and/or the function parameters are random or contain uncertainty with prescribed probability distributions. Specifically, the individual inputs and/or parameters are sampled randomly from their prescribed probability distributions to form one repetition (or run or trial). The corresponding numerical solution is one possible outcome of the function. This process of generating runs is repeated a large number of times, resulting in a distribution of the functional values that

can then be represented as probability distributions, as histograms, by summary statistics, or by CIs for any percentile threshold chosen. Such insights cannot be gained from the traditional Taylor series error propagation approach.

The MC process is computer intensive and requires thousands of runs to be performed. However, the entire process is simple and easily implemented even on spreadsheet programs (which have in-built functions for generating pseudo-random numbers of selected distributions). Specialized engineering software programs are also available. There is a certain amount of arbitrariness associated with the process because MC simulation is a numerical method. Several authors propose approximate formulae for determining the number of trials, but a simple method is as follows. Start with a large number of trials (say, 1000), and generate pseudo-random numbers with the assumed probability distribution. Since they are pseudo-random, the mean and the distribution (say, the standard deviation) may deviate somewhat from the desired ones (depending on the accuracy of the algorithm used). Generate a few such sets and pick the one whose mean and standard deviation are closest to the desired quantities. Use this set to simulate the corresponding values of the function. This can be repeated a few times until the mean and standard deviation stabilize around some average values, which can be taken to be the answer. It is also urged that the analyst evaluate the effect of a different number of trials, say 3000, and ascertain that the results of the 1000-trial and 3000-trial runs are similar. If they are not, sets with increasingly large numbers of trials should be used until the results converge. The approach is best understood by means of a simple example.

Example 3.7.5 Using the Monte Carlo (MC) method to determine uncertainty in exponential growth models
Let us solve the problem given in Example 3.7.3 by the MC method. The approach involves setting up a spreadsheet table as shown in Table 3.10. Since only two variables (namely Q and r) have uncertainty, one only needs to assign two columns to these and a third column to the desired quantity, i.e., the time t over which the total coal reserves will be depleted. The first row shows the calculation using the mean values, and one sees that the value of t = 68.75 found in part (a) of Example 3.7.3 is obtained (this is done to verify the spreadsheet cell formula). The analyst then generates random numbers of Q and r with the corresponding means and standard deviations as specified and shown in the first row of the table. MC methods, being numerical, require that a large sample be generated in order to obtain reliable results.

Table 3.10 The first few and last few calculations used to determine uncertainty in variable t using the Monte Carlo (MC) method. The last two rows indicate the mean and standard deviation (SD) of the individual columns (Example 3.7.5)

Run #   Q (1000, 100)   r (0.027, 0.002)   t (years)
1       1000.0000       0.0270             68.7518
2       1050.8152       0.0287             72.2582
3       1171.6544       0.0269             73.6445
4       1098.2454       0.0284             73.2772
5       1047.5003       0.0261             69.0848
6       1058.0283       0.0247             67.7451
7       946.8644        0.0283             68.5256
8       1075.5269       0.0277             71.8072
9       967.9137        0.0278             68.6323
10      1194.7164       0.0262             73.3758
...     ...             ...                ...
990     1133.6639       0.0278             73.6712
991     997.0123        0.0252             66.5173
992     896.6957        0.0257             63.8175
993     1056.2361       0.0283             71.9108
994     1033.8229       0.0298             72.8905
995     1078.6051       0.0295             73.9569
996     1137.8546       0.0276             73.4855
997     950.8749        0.0263             66.3670
998     1023.7800       0.0264             68.7452
999     950.2093        0.0248             64.5692
1000    849.0252        0.0247             61.0231
Mean    1005.0          0.0272             68.91
SD      101.82          0.00199            3.919

Fig. 3.34 Pertinent plots of the Monte Carlo (MC) analysis for the time variable t (years). (a) Histogram. (b) Normal probability plot with 95% limits

Often, 1000 normally distributed samples are selected, but it is advisable to repeat the analysis a few times for more robust results. For example, instead of having (1000, 100) for the mean and standard deviation of Q, the 1000 samples have (1005.0, 101.82). The corresponding mean and standard deviation of t are found to be (68.91, 3.919) compared to the previously estimated values of (68.75, 4.181). Further, even though normal distributions were assumed for variables Q and r, the functional form for time t results in a non-normal distribution, as can be seen from the histogram and the normal probability plot of Fig. 3.34. Hence it is more meaningful to look at the percentiles rather than simply the mean and standard deviation (shown in Table 3.10). These are easily deduced from the MC runs and shown in Table 3.11. Such additional insights into the distribution of the variable t provided by the MC-generated data are certainly advantageous. ■
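The spreadsheet procedure of Example 3.7.5 translates directly into a few lines of code. The sketch below (variable names are illustrative) repeats the 1000-trial MC run; because the draws are pseudo-random, the summary statistics will differ slightly from those tabulated:

```python
import math
import random

random.seed(1)  # fix the pseudo-random stream so the run is repeatable

P0 = 5.0  # initial consumption rate, billion tons/year
trials = [
    (1.0 / r) * math.log(1.0 + Q * r / P0)            # Eq. 3.33
    for Q, r in (
        (random.gauss(1000.0, 100.0), random.gauss(0.027, 0.002))
        for _ in range(1000)
    )
]
trials.sort()
mean = sum(trials) / len(trials)
sd = math.sqrt(sum((t - mean) ** 2 for t in trials) / (len(trials) - 1))
p025, p975 = trials[24], trials[974]                  # approximate 95% interval
print(f"mean={mean:.2f}  sd={sd:.2f}  95% interval=({p025:.2f}, {p975:.2f})")
```

Sorting the trials makes any percentile available by indexing, which is exactly the advantage over the Taylor series approach noted above.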


Table 3.11 Percentiles for the time variable t (years) determined from the 1000 MC runs of Table 3.10

%       Percentile
1.0     58.509
5.0     62.331
10.0    63.8257
25.0    66.3155
50.0    68.968
75.0    71.6905
90.0    73.818
95.0    74.9445
99.0    77.051

3.8 Planning a Non-Intrusive Field Experiment

One needs to differentiate between two conditions under which data can be collected. On the one hand, one can have a controlled setting where the various variables of interest can be altered by the experimenter. In such a case, referred to as intrusive testing, one can plan an “optimal” experiment where one can adjust the inputs and boundary or initial conditions as well as choose the number and location of the sensors so as to minimize the effect of errors on estimated values of the parameters (treated in Chap. 6). On the other hand, one may be in a situation where one is a mere “spectator,” i.e., the system or phenomenon cannot be controlled, and the data is collected under non-experimental conditions (as is the case of astronomical observations). Such an experimental protocol, known as non-intrusive identification, is usually not the best approach. In certain cases, the driving forces may be so weak or repetitive that even when a “long” dataset is used for identification, a strong enough or varied output signal cannot be elicited for proper statistical treatment (see Sect. 10.3). An intrusive or controlled experimental protocol, wherein the system is artificially stressed to elicit a strong response, is more likely to yield robust and accurate models and their parameter estimates. However, in some cases, the type and operation of the system may not allow such intrusive experiments to be performed. Further, one should appreciate differences between measurements made in a laboratory setting and in the field. The potential for errors, both bias and random, is usually much greater in the latter. Not only can measurements made on a piece of laboratory equipment be better designed and closely controlled, but they will be more accurate as well because more expensive sensors can be selected and placed correctly in the system. 
For example, proper flow measurement requires that the flow meter be placed 30 pipe diameters after a bend to ensure the flow profile is well established. A laboratory setup can be designed accordingly, while field

conditions may not allow such conditions to be met satisfactorily. Further, systems being operated in the field may not allow controlled tests to be performed, and one has to develop a model or make decisions based on what one can observe.

Any experiment should be well planned, involving several rational steps (e.g., ascertaining that the right sensors and equipment are chosen, that the right data collection protocol and scheme are followed, and that the appropriate data analysis procedures are selected). It is advisable to explicitly adhere to the following steps (ASHRAE 2005):

(i) Identify experimental goals and the acceptable accuracy that can be achieved within the time and budget available for the experiment.

(ii) Identify the entire list of measurable variables and relationships. If some are difficult to measure, find alternative variables.

(iii) Establish measured variables and limits (theoretical limits and expected bounds) to match the selected instrument limits. Also, determine instrument limits: all sensor and measurement instruments have physical limits that restrict their ability to accurately measure quantities of interest.

(iv) Make a preliminary instrumentation selection based on accuracy, repeatability, and features of the instrument, as well as cost. Regardless of the instrument chosen, it should have been calibrated within the last 12 months or within the interval required by the manufacturer, whichever is less. The required accuracy of the instrument will depend upon the acceptable level of uncertainty for the experiment.

(v) Document the uncertainty of each measured variable using information gathered from manufacturers or past experience with specific instrumentation. This information will then be used in estimating the overall uncertainty of results using propagation of error methods.
(vi) Perform a preliminary uncertainty analysis of the proposed measurement procedures and experimental methodology. This should be completed before the procedures and methodology are finalized in order to estimate the uncertainty in the final results. The higher the accuracy required of the measurements, the higher the accuracy of the sensors needed to obtain the raw data. The uncertainty analysis is the basis for selecting a measurement system that provides acceptable uncertainty at least cost. How to perform such a preliminary uncertainty analysis was illustrated by Example 3.7.4.

(vii) Make the final instrument and method selection based on the results of the preliminary uncertainty analysis. Revise the selection, if necessary, to achieve the acceptable uncertainty in the experiment results.

(viii) Install instrumentation in accordance with the manufacturer's recommendations. Any deviation in the installation from the manufacturer's recommendations should be documented and the effects of the deviation on instrument performance evaluated. A change in instrumentation or location may be required if the in situ uncertainty exceeds the acceptable limits determined by the preliminary uncertainty analysis.

(ix) Perform an initial data quality verification to ensure that the measurements taken are not too uncertain and represent reality. Instrument calibration and independent checks of the data are recommended. Independent checks can include sensor validation, energy balances, and material balances (see Sect. 3.3).

(x) Collect data. The challenge for data acquisition in any experiment is to collect the required amount of information while avoiding collection of superfluous information. Superfluous information can overwhelm simple measures taken to follow the progress of an experiment and can complicate data analysis and report generation. The relationship between the desired result (either static, periodic stationary, or transient) and time is the determining factor for how much information is required. A static, non-changing result requires only the steady-state result and proof that all transients have died out. A periodic stationary result, the simplest dynamic result, requires information for one period and proof that the one selected is one of three consecutive periods with identical results within acceptable uncertainty. Transient or non-repetitive results, whether a single pulse or a continuing random result, require the most information. Regardless of the result, the dynamic characteristics of the measuring system and the full transient nature of the result must be documented for some relatively short interval of time.

Identifying good models requires a certain amount of diversity in the data, i.e., the data should cover the spatial domain of variation of the independent variables. Some basic suggestions pertinent to controlled experiments are summarized below; they are also pertinent for non-intrusive data collection.

Fig. 3.35 A possible XYZ envelope with X and Z as the independent variables. The dashed lines enclose the total family of points over the feasible domain space

(a) Range of variability: The most obvious way in which an experimental plan can be made compact and efficient is to space the variables in a predetermined manner. If a functional relationship between an independent variable X and a dependent variable Y is sought, the most obvious way is to select end points or limits of the test,

thus covering the test envelope or domain that encloses the complete family of data. For a model of the type Z = f(X,Y), a plane area or map is formed (see Fig. 3.35). Functions involving more variables are usually broken down into a series of maps.

The above discussion relates to controllable regressor variables. Extraneous variables, by their very nature, cannot be varied at will. An example is phenomena driven by climatic variables. The energy use of a building is affected by, among others, outdoor dry-bulb temperature, humidity, and solar radiation. Since these cannot be varied at will, a proper experimental data collection plan would entail collecting data during different seasons of the year.

(b) Grid spacing considerations: Once the domains or ranges of variation of the variables are defined, the next step is to select the grid spacing. Being able to anticipate the system behavior from theory or from prior publications leads to a better experimental design. For a relationship between X and Y that is known to be non-linear, the optimal grid is to space the points at the two extremities. However, if a linear relationship between X and Y is sought for a phenomenon that can be approximated as linear, then it would be best to space the x points evenly. For non-linear or polynomial functions, an equally spaced test sequence in X is clearly not optimal. Consider the pressure drop through a new fitting as a function of flow. It is known that the relationship is quadratic. Choosing an experiment with equally spaced X values would result in a plot such as that shown in Fig. 3.36a. One would have more observations in the low-pressure-drop region and fewer in the higher range. One may argue that an optimal spacing would be to select the velocity values such that the pressure drop readings are more or less evenly spaced (see Fig. 3.36b). Which of the two is better depends on the instrument precision. If the pressure drop instrument has constant uncertainty over the entire range of variation of the experiment, then the test spacing shown in Fig. 3.36b is better because it is uniform. But if the fractional uncertainty of the instrument decreases with increasing pressure drop values, then the points at the lower end have higher uncertainty. In that case it is better to take more readings at the lower end, as in the spacing sequence shown in Fig. 3.36a.

Fig. 3.36 Two different experimental designs for proper identification of the parameter (k) appearing in the model for pressure drop versus velocity of a fluid flowing through a pipe, assuming ΔP = kV². The grid spacing shown in (a) is the more common one based on equal increments in the regressor variable, while that in (b) is likely to yield more robust estimation but would require guess-estimating the range of variation for the pressure drop

(xi) Accomplish data reduction and analysis, which involves the distillation of raw data into a form that is usable for further analysis. It may involve averaging multiple measurements, quantifying necessary conditions (e.g., steady state), comparing with physical limits or expected ranges, and rejecting outlying measurements.

(xii) Perform a final uncertainty analysis, which is done after the entire experiment has been completed and when the results of the experiments are to be documented or reported. This will take into account unknown field effects and variances in instrument accuracy during the experiment. A final uncertainty analysis involves the following steps: (i) estimate the fixed (bias) error based upon instrumentation calibration results, and (ii) document the random error due to the instrumentation based upon instrumentation calibration results. The fixed errors needed for the detailed uncertainty analysis are usually more difficult to estimate with a high degree of certainty. Minimizing fixed errors can be accomplished by careful calibration with referenced standards.

(xiii) Reporting results is the primary means of communication. Different audiences require different reports with various levels of detail and background information. The report should be structured to clearly explain the goals of the experiment and the evidence gathered to achieve the goals. It should describe the data reduction, data analysis, and uncertainty analysis performed. Graphical and mathematical representations are often used. On graphs, error bars placed vertically and horizontally on representative points are a very clear way to present expected uncertainty. A data analysis section and a conclusion are critical sections and should be prepared with great care while being succinct and clear.
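The grid-spacing trade-off of Fig. 3.36 can be illustrated numerically. The sketch below is this edition's illustration (the value k = 2 and the 1–10 m/s range are assumed, not from the text); it generates the two candidate velocity schedules for ΔP = kV²:

```python
# Two experimental designs for DP = k*V^2 over V in [1, 10] m/s; k assumed = 2
k, n = 2.0, 6
v_lo, v_hi = 1.0, 10.0

# Design (a): equal increments in the regressor V (as in Fig. 3.36a)
design_a = [v_lo + i * (v_hi - v_lo) / (n - 1) for i in range(n)]

# Design (b): velocities chosen so the pressure-drop readings come out
# equally spaced (as in Fig. 3.36b); requires guessing the DP range first
dp_lo, dp_hi = k * v_lo**2, k * v_hi**2
design_b = [((dp_lo + i * (dp_hi - dp_lo) / (n - 1)) / k) ** 0.5 for i in range(n)]

for label, vs in (("a", design_a), ("b", design_b)):
    dps = [k * v**2 for v in vs]
    print(label, [round(v, 2) for v in vs], "->", [round(dp, 1) for dp in dps])
```

Running it shows design (a) bunching the pressure-drop readings at the low end while design (b) spaces them uniformly, which is exactly the choice discussed under (b) above.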

Problems

Pr. 3.1 Consider the data given in Table 3.2. Determine:
(a) The 10% trimmed mean value.
(b) Which observations can be considered to be "mild" outliers (>1.5 × IQR)?
(c) Which observations can be considered to be "extreme" outliers (>3.0 × IQR)?
(d) Identify outliers using Chauvenet's criterion given by Eq. 3.20. Compare them with those obtained by using Table 3.7.
(e) Compare the results from (b), (c), and (d).
(f) Generate a few plots (such as Fig. 3.16) using your statistical software. Other types of plots can be generated, but they should be relevant.


Pr. 3.2 Consider the data given in Table 3.6. Perform an exploratory data analysis involving pertinent statistical summary measures and generate at least three pertinent graphical plots along with a discussion of findings. Pr. 3.312 A nuclear power facility produces a vast amount of heat that is usually discharged into the aquatic system. This heat raises the temperature of the aquatic system resulting in a greater concentration of chlorophyll that in turn extends the growing season. To study this effect, water samples were collected monthly at three stations for one year. Station A is located closest to the hot water discharge, and Station C the farthest (Table 3.12). You are asked to perform the following tasks with the time-series data and annotate with pertinent comments: (a) Flag any outlier points: (i) visually, (ii) using box-andwhisker plots, and (iii) following the Chauvenet’s criterion. (b) Compute pertinent statistical descriptive measures (after removing outliers).

Table 3.12 Data table for Problem 3.3

Month       Station A  Station B  Station C
January     9.867      3.723      4.410
February    14.035     8.416      11.100
March       10.700     12.723     4.470
April       13.853     9.168      8.010
May         7.067      4.778      14.080
June        11.670     9.145      8.990
July        7.357      8.463      3.350
August      3.358      4.086      4.500
September   4.210      4.233      6.830
October     3.630      2.320      5.800
November    2.953      3.843      3.480
December    2.640      3.610      3.020

(c) Generate at least two pertinent graphical plots and discuss the relevance of these plots in terms of the insights they provide.
(d) Compute the covariance and correlation coefficients between the three stations and draw relevant conclusions (do this both without and with outlier rejection).

¹² Data available electronically on book website.

Pr. 3.4¹³ Consider a basic indirect heat exchanger where the heat exchange rates associated with the cold and hot fluid streams are given by:

Q_actual = m_c · c_pc · (T_c,o − T_c,i)   (cold-side heating)    (3.37)
Q_actual = m_h · c_ph · (T_h,i − T_h,o)   (hot-side cooling)

where m, T, and c_p are the mass flow rate, temperature, and specific heat, respectively. The subscripts o and i stand for outlet and inlet, and c and h denote the cold and hot streams, respectively. Assume the values and uncertainties of the various parameters shown in Table 3.13.

Table 3.13 Parameters and uncertainties to be assumed (Pr. 3.4)

Parameter  Nominal value   95% Uncertainty
c_pc       1 Btu/lb°F      ±5%
m_c        475,800 lb/h    ±10%
T_c,i      34 °F           ±1 °F
T_c,o      46 °F           ±1 °F
c_ph       0.9 Btu/lb°F    ±5%
m_h        450,000 lb/h    ±10%
T_h,i      55 °F           ±1 °F
T_h,o      40 °F           ±1 °F

¹³ From ASHRAE-G2 (2005) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org.


(a) Compute the heat exchanger loads and the uncertainty ranges for the hot and cold sides, assuming all variables to be uncorrelated.
(b) What would you conclude regarding the uncertainty around the heat balance checks?
(c) It has been found that the inlet hot fluid temperature is correlated with the hot fluid mass flow rate with a correlation coefficient of −0.6 (i.e., when the hot fluid flow rate increases, the corresponding inlet temperature decreases). How would your results for (a) and (b) change? Provide a short discussion.

Table 3.14 Data table for Problem 3.6

Quantity                       Units   When installed  Current  95% Uncertainty
Entering air enthalpy (hai)    Btu/lb  38.7            36.8     5%
Leaving air enthalpy (hao)     Btu/lb  27.2            28.2     5%
Entering water enthalpy (hci)  Btu/lb  23.2            21.5     2.5%

Pr. 3.6 Determining cooling coil degradation based on effectiveness
The thermal performance of a cooling coil can also be characterized by the concept of effectiveness widely used for thermal modeling of traditional heat exchangers. In such coils, a stream of humid air flows across a coil supplied with chilled water and is cooled and dehumidified as a result. In this case, the effectiveness can be determined as:

ε = (actual heat transfer rate)/(maximum possible heat transfer rate) = (h_ai − h_ao)/(h_ai − h_ci)    (3.38)

Hint: The standard deviations can be taken to be half of the 95% uncertainty values listed in Table 3.13.

Pr. 3.5 Consider Example 3.7.4, where the uncertainty analysis on chiller COP was done at full-load conditions. What about part-load conditions, especially since there is no collected data? One could use data from chiller manufacturer catalogs for a similar type of chiller, or one could assume that part-load operation affects the inlet minus outlet chilled water temperature difference (ΔT) in a proportional manner, as stated below.
(a) Compute the 95% CL uncertainty in the COP at 70% and 40% full load, assuming the evaporator water flow rate to be constant. At part load, the evaporator temperature difference is reduced proportionately to the chiller load, while the electric power drawn is assumed to increase from a full-load value of 0.8 kW/t to 1.0 kW/t at 70% full load and to 1.2 kW/t at 40% full load.
(b) Would the instrumentation be adequate, or would it be prudent to consider better instrumentation if the fractional COP uncertainty at 95% CL is to be less than 10%?
(c) Note that fixed (bias) errors have been omitted from the analysis, and some of the assumptions in predicting part-load chiller performance can be questioned. A similar exercise with slight variations in some of the assumptions, called a sensitivity study, would be prudent at this stage. How would you conduct such an investigation?
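For problems such as Pr. 3.4, the first-order (Taylor-series) propagation of errors can be automated with numerical partial derivatives. The helper below is a sketch of one way to do this (not the book's solution; the function and variable names are mine). It is applied to the cold-side load of Eq. 3.37 with the Table 3.13 values, taking standard deviations as half the 95% uncertainties per the hint above:

```python
from math import sqrt

def propagate(f, x, sx, h=1e-6):
    """First-order uncertainty propagation for y = f(x).
    f: function of a list of variables; x: nominal values;
    sx: standard deviations of each variable (uncorrelated).
    Returns (nominal value, standard deviation of the result)."""
    y0 = f(x)
    var = 0.0
    for i in range(len(x)):
        xp = list(x)
        dx = h * max(abs(x[i]), 1.0)
        xp[i] = x[i] + dx
        dfdx = (f(xp) - y0) / dx  # forward-difference partial derivative
        var += (dfdx * sx[i]) ** 2
    return y0, sqrt(var)

# Cold-side load Q = m * cp * (Tout - Tin), Table 3.13 values.
def q_cold(v):
    m, cp, t_in, t_out = v
    return m * cp * (t_out - t_in)

x = [475_800.0, 1.0, 34.0, 46.0]                     # lb/h, Btu/lb°F, °F, °F
sx = [0.10 * x[0] / 2, 0.05 * x[1] / 2, 0.5, 0.5]    # half of ±10%, ±5%, ±1°F, ±1°F
q, sq = propagate(q_cold, x, sx)
print(f"Q = {q:.3e} Btu/h, std = {sq:.3e} ({100 * sq / q:.1f}%)")
```

The same `propagate` helper applies directly to the hot side and to the "numerical partial derivative method" asked for in Pr. 3.14.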

where h_ai and h_ao are the enthalpies of the air stream at the inlet and outlet, respectively, and h_ci is the enthalpy of the entering chilled water. The effectiveness is independent of the operating conditions provided the mass flow rates of air and chilled water remain constant. An HVAC engineer would like to determine whether the coil has degraded after it has been in service for a few years. For this purpose, she measures the current coil performance at air and water flow rates identical to those when originally installed, as shown in Table 3.14. Note that the uncertainty in determining the air enthalpies is relatively large due to the uncertainty associated with measuring bulk air stream temperatures and humidities. However, the uncertainty in the enthalpy of the chilled water is only half that of the air.
(a) Assess, at 95% CL, whether the cooling coil has degraded or not. Clearly state any assumptions you make during the evaluation.
(b) What are the relative contributions of the uncertainties in the three enthalpy quantities to the uncertainty in the effectiveness value? Do these differ from the installed period to the time when the current tests were performed?

Pr. 3.7 Table 3.15 assembles values of the total electricity generated by five different types of primary energy sources and their associated total emissions (EIA 1999). Clearly, coal and oil generate a lot of emissions of pollutants, which are harmful not only to the environment but also to public health.


Table 3.15 Data table for Problem 3.7: US power generation mix and associated pollutant emissions

             Electricity                Short tons (= 2000 lb/t)
Fuel         kWh (1999)  % Total   SO2        NOx        CO2
Coal         1.77E+12    55.7      1.13E+07   6.55E+06   1.90E+09
Oil          8.69E+10    2.7       6.70E+05   1.23E+05   9.18E+07
Natural gas  2.96E+11    9.3       2.00E+03   3.76E+05   1.99E+08
Nuclear      7.25E+11    22.8      0.00E+00   0.00E+00   0.00E+00
Hydro/Wind   3.00E+11    9.4       0.00E+00   0.00E+00   0.00E+00
Totals       3.18E+12    100.0     1.20E+07   7.05E+06   2.19E+09

Data available electronically on book website

Table 3.16 Data table for Problem 3.8

Symbol  Description                            Value  95% Uncertainty
HP      Horsepower of the end-use device       40     5%
Hours   Number of operating hours in the year  6500   10%
ηold    Efficiency of the old motor            0.85   4%
ηnew    Efficiency of the new motor            0.92   2%

France, on the other hand, has a mix of 21% coal and 79% nuclear.
(a) Calculate the total and percentage reductions in the three pollutants should the United States change its power generation mix to mimic that of France (Hint: first normalize the emissions per kWh for all three pollutants).
(b) The relative uncertainties of the three pollutants SO2, NOx, and CO2 are 10%, 12%, and 5%, respectively. Assuming log-normal distributions for all quantities, compute the uncertainty in the total reductions of the three pollutants estimated in (a) above.

Pr. 3.8 Uncertainty in savings from energy conservation retrofits
There is great interest in implementing retrofit measures meant to conserve energy in individual devices as well as in buildings. These measures must be justified economically, and including uncertainty in the estimated energy savings is an important element of the analysis. Consider the rather simple problem of replacing an existing electric motor with a more energy-efficient one. The annual energy savings Esave in kWh/year are given by:

Esave = (0.746)(HP)(Hours)(1/ηold − 1/ηnew)    (3.39)

with the symbols described in Table 3.16 along with their numerical values.
(a) Determine the absolute and relative uncertainties in Esave under these conditions.
(b) If this uncertainty had to be reduced, which variable would you target for further refinement?
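A direct propagation of the Table 3.16 uncertainties through Eq. 3.39 can be sketched as follows. This is my own illustration, not the book's solution; as in the hint to Pr. 3.4, standard deviations are assumed to be half the 95% uncertainty values, and analytical partial derivatives of Eq. 3.39 are used:

```python
from math import sqrt

# Table 3.16 values; std devs assumed to be half the 95% uncertainties
hp, s_hp = 40.0, 0.05 * 40.0 / 2
hours, s_hours = 6500.0, 0.10 * 6500.0 / 2
eta_old, s_old = 0.85, 0.04 * 0.85 / 2
eta_new, s_new = 0.92, 0.02 * 0.92 / 2

def esave(hp, hours, eo, en):
    """Eq. 3.39: annual savings (kWh/yr) from a motor replacement."""
    return 0.746 * hp * hours * (1.0 / eo - 1.0 / en)

e0 = esave(hp, hours, eta_old, eta_new)

# analytical partial derivatives of Eq. 3.39
d = 1.0 / eta_old - 1.0 / eta_new
de_dhp = 0.746 * hours * d
de_dhours = 0.746 * hp * d
de_deo = -0.746 * hp * hours / eta_old**2
de_den = 0.746 * hp * hours / eta_new**2

s_e = sqrt((de_dhp * s_hp) ** 2 + (de_dhours * s_hours) ** 2
           + (de_deo * s_old) ** 2 + (de_den * s_new) ** 2)
print(f"Esave = {e0:.0f} kWh/yr, std = {s_e:.0f} ({100 * s_e / e0:.1f}%)")
```

Because the savings depend on a small difference of reciprocal efficiencies, even modest uncertainties in ηold and ηnew are strongly amplified, which is the point of part (b).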

Fig. 3.37 Sketch of an all-air HVAC system supplying conditioned air to indoor zones/rooms of a building (Problem 3.9). [Labeled elements: outdoor air (OA), mixed air (MA), return air (RA), air-handler unit, supply to building zones]

(c) What is the minimum value of ηnew at which the lower bound of the 95% CL interval of Esave is greater than zero?

Pr. 3.9 Uncertainty in estimating outdoor air fraction in HVAC systems
Ducts in heating, ventilating, and air-conditioning (HVAC) systems supply conditioned air (SA) to the various spaces in a building and also exhaust the air from these spaces, called return air (RA). A sketch of an all-air HVAC system is shown in Fig. 3.37. Occupant health requires that a certain amount of outdoor air (OA) be brought into the HVAC system while an equal amount of return air is exhausted to the outdoors. The OA and the RA mix at a point just before the air-handler unit. Outdoor air ducts have dampers installed to control the OA, since excess OA leads to unnecessary energy wastage. One of the causes of recent complaints from occupants has been identified as inadequate OA, and sensors installed inside the ducts could modulate the dampers accordingly. Flow measurement on a continuous basis is always problematic. Hence, OA flow is inferred from measurements of the air temperature TR inside the RA stream, TO inside the OA stream, and TM inside the mixed air (MA) stream. The supply air flow is deduced by measuring the fan speed with a tachometer, using a differential pressure gauge to measure the static pressure rise, and using the manufacturer's equation for the fan curve. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error.
(a) From a sensible heat balance (with changes in specific heat with temperature neglected), derive the following expression for the outdoor air fraction (ratio of outdoor air to mixed air flow rates): OAf = (TR − TM)/(TR − TO).
(b) Derive the expression for the uncertainty in OAf and calculate the 95% CL in OAf if TR = 70 °F, TO = 90 °F, and TM = 75 °F.

Pr. 3.10 Sensor placement in HVAC ducts with consideration of flow non-uniformity
Consider the same situation as in Pr. 3.9. Usually, the air ducts have large cross-sections. The problem with inferring outdoor air flow from temperature measurements is the large thermal non-uniformity usually present in these ducts due to both stream separation and turbulence effects. Moreover, temperature (and, hence, density) differences between the OA and MA streams result in poor mixing. Table 3.17 gives the results of a traverse in the mixed air duct with nine measurements (using an equally spaced 3 × 3 grid, designated S#1–S#9). The measurements were replicated four times under the same outdoor conditions. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error. Determine:
(a) The worst and best grid locations for placing a single sensor (to be determined by analyzing the recordings at each of the nine grid locations over all four time periods).
(b) The maximum and minimum errors at 95% CL one could expect in the average temperature across the duct cross-section, if the best grid location for the single sensor were adopted.

Pr. 3.11 Consider the uncertainty in the heat transfer coefficient illustrated in Example 3.7.2. The example was solved

Table 3.17 Temperature readings (in °F) at the nine grid locations (S#1–S#9) of the mixed air (MA) duct, four replicate readings per location (Pr. 3.10)

S#1: 55.6, 54.6, 55.8, 54.2   S#2: 56.3, 58.5, 57.6, 63.8   S#3: 53.7, 50.2, 59.0, 49.4
S#4: 66.4, 67.8, 68.7, 67.6   S#5: 58.0, 62.4, 62.3, 65.8   S#6: 61.2, 56.3, 64.7, 58.8
S#7: 63.5, 65.0, 63.6, 64.8   S#8: 67.4, 67.4, 66.8, 65.7   S#9: 63.9, 61.4, 62.4, 60.6

Data available electronically on book website

analytically using the Taylor series approach. You are asked to solve the same example using the Monte Carlo (MC) method:
(a) Using 100 data points
(b) Using 1000 data points
Compare the results from this approach with those in the solved example. Also determine the 2.5th and 97.5th percentile bounds and compare results.

Pr. 3.12 You will repeat Example 3.7.3, involving uncertainty in exponential growth, using the Monte Carlo (MC) method.
(a) Instead of computing the standard deviation, plot the distribution of the time variable t to evaluate its shape for 100 trials. Generate probability plots (such as q-q plots) against a couple of the promising distributions.
(b) Determine the mean, median, 25th percentile, 75th percentile, and the 2.5th and 97.5th percentiles.
(c) Compare the 2.5th and 97.5th percentile values to the corresponding values in Example 3.7.3.
(d) Generate the box-and-whisker plot and compare it with the results of (b).

Pr. 3.13 In 2015, the United States had about 1000 GW of installed electricity generation capacity.
(a) Assuming an exponential electric growth rate of 1% per year, what would be the needed electricity generation capacity in 2050?
(b) If the penetration target set for renewables and energy conservation is 75% of the electricity capacity in 2050, what should be their needed annual growth rate (taken to be exponential)? Assume an initial value of 14% renewable capacity in 2015 (which includes hydropower).
(c) If the electric growth rate assumed in (a) has an uncertainty of 15% (taken to be normally distributed), calculate the uncertainty associated with the growth rate computed in (b).

Pr. 3.14 Uncertainty in the estimation of biological dose over time for an individual
Consider an occupant inside a building in which a biological agent has been accidentally released. The dose (D) is the cumulative amount of the agent to which the human body is subjected, while the response is the measurable physiological change produced by the agent.
The widely accepted approach for quantifying dose is to assume functional forms based on first-order kinetics. For biological and radiological agents


where the process of harm being done is cumulative, one can use Haber's law (Heinsohn and Cimbala 2003):

D(t) = k ∫_{t1}^{t2} C(t) dt    (3.40)

where C(t) is the indoor concentration at a given time t; k is a constant that includes effects such as the occupant breathing rate, the absorption efficiency of the agent or species, etc.; and t1 and t2 are the start and end times. This relationship is often used to determine health-related exposure guidelines for toxic substances. For a simple one-zone building, the free response, i.e., the temporal decay given in terms of the initial concentration C(t1), is determined by:

C(t) = C(t1) · exp[−a(t − t1)]    (3.41)

where the model parameter a is a function of the volume of the space and the outdoor and supply air flow rates. The above equation is easy to integrate over any time period from t1 to t2, thus providing a convenient means of computing the total occupant inhaled dose when occupants enter or leave the contaminated zones at arbitrary times. Let a = 0.017186 with 11.7% uncertainty, while C(t1) = 7000 cfu/m³ (cfu: colony forming units) with 15% uncertainty. Assume k = 1.
(a) Determine the total dose to which the individual is exposed at the end of 15 min.
(b) Compute the uncertainty of the corresponding total dose in absolute and relative terms.
Solve this problem following both the numerical partial derivative method and the Monte Carlo (MC) method and compare results.

Pr. 3.15 Propagation of optical and tracking errors in solar concentrators
Solar concentrators are optical devices meant to increase the incident solar radiation flux density (power per unit area) on a receiver. Separating the solar collection component (viz., the reflector) and the receiver allows heat losses per unit collection area to be reduced, resulting in higher fluid operating temperatures at the receiver. However, there are several sources of errors that lead to optical losses:
(i) Non-specular or diffuse reflection from the reflector, due either to improper curvature of the reflector surface during manufacture (shown in Fig. 3.38) or to progressive dust accumulation over the surface as the system operates in the field.
(ii) Tracking errors arising from improper tracking mechanisms as a result of improper alignment sensors or non-uniformity in drive mechanisms (usually, the tracking is not continuous; a sensor activates a motor every few minutes, which re-aligns the reflector to the solar disk as it moves across the sky). The result is a spread in the reflected radiation, as illustrated in Fig. 3.38b.
(iii) Improper reflector and receiver alignment during the initial mounting of the structure or due to small ground/pedestal settling over time.

The above errors are characterized by root mean square (RMS) random errors, and their combined effect can be determined statistically following the basic propagation of errors formula. Bias errors, such as those arising from structural

Fig. 3.38 Different types of optical and tracking errors (Problem 3.15). (a) Micro-roughness in the solar concentrator surface leads to a spread in the reflected radiation; the ideal reflector surface is illustrated as a dotted line and the actual surface as a solid line. (b) Tracking errors lead to a spread in the incoming solar radiation, shown as a normal distribution. Note that a tracker error of σ_track results in a reflection error σ_reflec = 2σ_track from Snell's law. The factor of 2 also applies to other sources, based on the error occurring as light both enters and leaves the optical device (see Eq. 3.42). [Panel labels: incoming ray, incident ray, reflected rays, tracker, reflector]


Table 3.18 Data table for Problem 3.15

Component   Source of error        RMSE (fixed value)  RMSE (variation over time)
Solar disk  Finite angular size    9.6 mrad            –
Reflector   Curvature manufacture  1.0 mrad            –
Reflector   Dust buildup           –                   0–2 mrad
Tracker     Sensor misalignment    2.0 mrad            –
Tracker     Drive non-uniformity   –                   0–10 mrad
Receiver    Misalignment           2.0 mrad            –

mismatch can be partially corrected by one-time or regular corrections and are not considered here. Note that these errors need not be normally distributed, but such an assumption is often made in practice. Thus, RMSE values representing the standard deviations of these errors are often used for this type of analysis.
(a) Analyze the absolute and relative effects of each source of radiation spread at the receiver, considering the various optical errors described above and using the numerical values shown in Table 3.18:

σ_total,spread = [(σ_solardisk)² + (2σ_manuf)² + (2σ_dustbuild)² + (2σ_sensor)² + (2σ_drive)² + (σ_rec-misalign)²]^(1/2)    (3.42)

(b) Plot the variation of the total error as a function of the tracker drive non-uniformity error for three discrete values of dust buildup (0, 1, and 2 mrad).
Note that the finite angular size of the solar disk results in incident solar rays that are not parallel but subtend an angle of about 33 minutes of arc, or 9.6 mrad.
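Since Eq. 3.42 is a root-sum-square (RSS) combination, parts (a) and (b) reduce to a few lines of code. The sketch below uses the Table 3.18 values (function and variable names are mine; the 0–10 mrad drive range and 0/1/2 mrad dust levels follow the problem statement):

```python
from math import sqrt

def total_spread(solar_disk, manuf, dust, sensor, drive, rec_misalign):
    """Eq. 3.42: RSS combination of optical/tracking errors (all in mrad).
    Reflector, sensor, and drive errors are doubled because the error
    acts as light both enters and leaves the optical device."""
    return sqrt(solar_disk**2 + (2 * manuf)**2 + (2 * dust)**2
                + (2 * sensor)**2 + (2 * drive)**2 + rec_misalign**2)

# Fixed values from Table 3.18 (mrad)
base = dict(solar_disk=9.6, manuf=1.0, sensor=2.0, rec_misalign=2.0)

# Part (b): total error vs. drive non-uniformity for three dust levels
for dust in (0.0, 1.0, 2.0):
    row = [round(total_spread(dust=dust, drive=drive, **base), 2)
           for drive in (0.0, 2.0, 4.0, 6.0, 8.0, 10.0)]
    print(f"dust = {dust} mrad: {row}")
```

The printed rows are the data one would plot for part (b): the drive non-uniformity term dominates at the high end because it enters with the factor of 2 and can reach 10 mrad.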

References

Abbas, M., and J.S. Haberl, 1994. Development of indices for browsing large building energy databases, Proc. Ninth Symp. Improving Building Systems in Hot and Humid Climates, pp. 166–181, Dallas, TX, May.
ASHRAE G-14, 2002. Guideline 14-2002: Measurement of Energy and Demand Savings, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASHRAE G-2, 2005. Guideline 2-2005: Engineering Analysis of Experimental Data, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASME PTC 19.1, 2018. Test Uncertainty, American Society of Mechanical Engineers, New York, NY.
Ayyub, B.M. and R.H. McCuen, 1996. Numerical Methods for Engineers, Prentice-Hall, Upper Saddle River, NJ.
Baltazar, J.C., D.E. Claridge, J. Ji, H. Masuda and S. Deng, 2012. Use of first law energy balance as a screening tool for building energy data, Part 2: Experiences on its implementation as a data quality control tool, ASHRAE Trans., Vol. 118, Pt. 1, Conf. Paper CH-12-C021, pp. 167–174, January.
Braun, J.E., S.A. Klein, J.W. Mitchell and W.A. Beckman, 1989. Methodologies for optimal control of chilled water systems without storage, ASHRAE Trans., 95(1), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
Cleveland, W.S., 1985. The Elements of Graphing Data, Wadsworth and Brooks/Cole, Pacific Grove, CA.
Devore, J., and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
Doebelin, E.O., 1995. Measurement Systems: Application and Design, 4th Ed., McGraw-Hill, New York.
EIA, 1999. Electric Power Annual 1999, Vol. II, October 2000, DOE/EIA-0348(99)/2, Energy Information Administration, US DOE, Washington, D.C. http://www.eia.doe.gov/eneaf/electricity/epav2/epav2.pdf.
Glaser, D. and S. Ubbelohde, 2001. Visualization for time dependent building simulation, 7th IBPSA Conference, pp. 423–429, Rio de Janeiro, Brazil, Aug. 13–15.
Haberl, J.S. and M. Abbas, 1998. Development of graphical indices for viewing building energy data: Part I and Part II, ASME J. Solar Energy Engg., Vol. 120, pp. 156–167.
Hawkins, D., 1980. Identification of Outliers, Chapman and Hall, Kluwer Academic Publishers, Boston/Dordrecht/London.
Heiberger, R.M. and B. Holland, 2015. Statistical Analysis and Data Display, 2nd Ed., Springer, New York.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey (Eds.), 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York.
Holman, J.P. and W.J. Gajda, 1984. Experimental Methods for Engineers, 5th Ed., McGraw-Hill, New York.
Keim, D. and M. Ward, 2003. Chap. 11: Visualization, in Intelligent Data Analysis, M. Berthold and D.J. Hand (Editors), 2nd Ed., Springer-Verlag, Berlin, Germany.
Reddy, T.A., J.K. Kreider, P.S. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings, 3rd Ed., CRC Press, Boca Raton, FL.
Reddy, T.A., 1990. Statistical analyses of electricity use during the hottest and coolest days of summer for groups of residences with and without air-conditioning, Energy, Vol. 15(1), pp. 45–61.
Schenck, H., 1969. Theories of Engineering Experimentation, 2nd Ed., McGraw-Hill, New York.
Tufte, E.R., 1990. Envisioning Information, Graphics Press, Cheshire, CT.
Tufte, E.R., 2001. The Visual Display of Quantitative Information, 2nd Ed., Graphics Press, Cheshire, CT.
Tukey, J.W., 1970. Exploratory Data Analysis, Vol. 1, Addison-Wesley, Reading, MA.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (Editor), Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
Wang, Z., T. Parkinson, P. Li, B. Lin and T. Hong, 2019. The squeaky wheel: Machine learning for anomaly detection in subjective thermal comfort votes, Building and Environment, Vol. 151, pp. 219–227, Elsevier.
Wonnacott, R.J. and T.H. Wonnacott, 1985. Introductory Statistics, 4th Ed., John Wiley & Sons, New York.
Statistical analyses of electricity use during the hottest and coolest days of summer for groups of residences with and without air-conditioning. Energy, vol. 15(1): pp. 45–61. Schenck, H., 1969. Theories of Engineering Experimentation, 2nd Edition, McGraw-Hill, New York. Tufte, E.R., 1990. Envisioning Information, Graphic Press, Cheshire, CT. Tufte, E.R., 2001. The Visual Display of Quantitative Information, 2nd Edition, Graphic Press, Cheshire, CT Tukey, J.W., 1970. Exploratory Data Analysis, , Vol.1, Reading MA, Addison-Wesley Tukey, J.W., 1988. The Collected Works of John W. Tukey,W. Cleveland (Editor), Wadsworth and Brookes/Cole Advanced Books and Software, Pacific Grove, CA Wang, Z., T. Parkinson, P. Li, B. Lin and T. Hong, 2019. The squeaky wheel: Machine learning for anomaly detection in subjective thermal comfort votes, Building and Environment, 151 (2019), pp. 219-227, Elsevier. Wonnacutt, R.J. and T.H. Wonnacutt, 1985. Introductory Statistics, 4th Ed., John Wiley & Sons, New York.

4 Making Statistical Inferences from Samples

Abstract

This chapter covers various concepts and statistical methods on inferring reliable parameter estimates about a population from sample data using knowledge of probability and probability distributions. More specifically, such statistical inferences involve point estimation, confidence interval estimation, hypothesis testing of means and variances from two or more samples, analysis of variance methods, goodness of fit tests, and correlation analysis. The basic principle of inferential statistics is that a random sample (or subset) drawn from a population tends to exhibit the same properties as those of the entire population. Traditional single and multiple parameter estimation techniques for sample means, variances, correlation coefficients, and empirical distributions are presented. This chapter covers the popular single-factor ANOVA technique which allows one to test whether the mean values of data taken from several different groups are essentially equal or not, i.e., whether the samples emanate from different populations or whether they are essentially from the same population. Also treated are non-parametric statistical procedures, best suited for ordinal data or for noisy data, that do not assume a specific probability distribution from which the sample(s) is taken. They are based on relatively simple heuristic ideas and are generally more robust; but, on the other hand, are generally less powerful and efficient. How prior information on the population (i.e., the Bayesian approach) can be used to make sounder statistical inferences from samples and for hypothesis testing problems is also discussed. Further, various types of sampling methods are described which is followed by a discussion on estimators and their desirable properties. 
Finally, resampling methods, which reuse an available sample (drawn once from a population) multiple times to make statistical inferences of parameter estimates, are treated; though computationally expensive, they are intuitive, conceptually simple, and versatile, and allow robust point and interval estimation. Not surprisingly, they have become indispensable techniques in modern-day statistical analyses.

4.1 Introduction

The primary reason for resorting to sampling, as against measuring the whole population, is to reduce expense, to make quick decisions (say, in the case of a real-time production process), or because it is often simply impossible to do otherwise. Random sampling, the most common form of sampling, involves selecting samples (or subsets) from the population in a random and independent manner. If done correctly, it reduces or eliminates bias while enabling inferences to be made about the population from the sample. Such inferences are usually made on distributional characteristics or parameters such as the mean value or the standard deviation. Estimators are mathematical formulae or expressions applied to sample data to deduce the estimate of the true parameter (i.e., its numerical value). For example, Eqs. (3.4a) and (3.8) in Chap. 3 are the estimators for deducing the mean and standard deviation of a data set. Given the uncertainty involved, the point estimates must be specified along with statistical confidence intervals (CI), albeit under the presumption that the samples are random. Methods of doing so are treated in Sects. 4.2 and 4.3 for single and multiple samples involving single parameters, and in Sect. 4.4 for multiple parameters. Unfortunately, certain unavoidable, or even undetected, biases may creep into the supposedly random sample, and this could lead to improper or biased inferences. This issue, as well as a more complete discussion of sampling and sampling design, is covered in Sect. 4.7. Resampling methods, which are techniques that involve reanalyzing an already drawn sample, are discussed in Sect. 4.8. Parametric tests on population estimates assume that the sample data are random and independently drawn, and thus treated as random variables. The sampling fraction, in the case of finite populations, is very much smaller than the

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_4


population size (often by about one to three orders of magnitude). Further, the data of the random variable are assumed to be close to normally distributed. There is an entire field of inferential statistics based on nonparametric or distribution-free tests, which can be applied to population data with unknown probability distributions. Though nonparametric tests are encumbered by fewer restrictive assumptions and are easier to apply and understand, they are less efficient than parametric tests (in that their uncertainty intervals are wider). These are briefly discussed in Sect. 4.5, while Bayesian inferencing, whereby one uses prior information to enhance the inference-making process, is addressed in Sect. 4.6.

4.2 Basic Univariate Inferential Statistics

4.2.1 Sampling Distribution and Confidence Interval of the Mean

(a) Sampling distribution of the mean
Consider a huge box filled with N round balls of unknown but similar diameter (the population). Let μ be the population mean and σ the population standard deviation of the ball diameter. If a sample of n balls is drawn, and recognizing that the mean diameter X̄ of the sample will vary from sample to sample, what can one say about the distribution of X̄? It can be shown that the sample mean X̄ behaves like a normally distributed random variable such that its expected value is:

E(X̄) = μ    (4.1)

and the standard error (SE) of X̄¹ is:

SE(X̄) = σ/(n)^(1/2)    (4.2)

Since the population standard deviation σ is commonly not known, the sample standard deviation s_x (given by Eq. 3.8) can be used instead. In case the population is small, and sampling is done without replacement, Eq. (4.2) is modified to:

SE(X̄) = [s_x/(n)^(1/2)] · [(N − n)/(N − 1)]^(1/2)    (4.3)

where N is the population size and n the sample size. Note that if N ≫ n, one effectively recovers Eq. (4.2).

¹ The standard deviation is a measure of the variability within a sample, while the SE is a measure of the variability of the mean between samples.

The parameters (say, e.g., the mean) of a single sample are unlikely to be equal to those of the population. Since the sample mean values vary from sample to sample, this pattern of variation is referred to as the sampling distribution of the mean. Can the information contained in the sampling distribution be reliably extended to provide estimates of the population mean? The answer is yes. However, since randomness is involved from one sample to another, the answer can only be framed in terms of probabilities or confidence levels. In summary, the confidence interval (CI) of the population parameter estimates depends on sample size (the larger the better) and on the variance in outcomes of different samples or trials: the larger the variance from sample to sample, the larger the sample size needs to be for the same degree of confidence.
The distribution of the sample data is presumed to be Gaussian or normal. Such an assumption is based on the Central Limit Theorem (one of the most important theorems in statistical theory), which states that if independent random samples of n observations are selected with replacement from a population with any arbitrary distribution, then the distribution of the sample means X̄ will approximate a Gaussian distribution provided n is sufficiently large (n > 30). The larger the sample size n, the closer the sampling distribution approximates the Gaussian (Fig. 4.1).² A consequence of the theorem is that it leads to a simple method of computing approximate probabilities of sums of independent random variables. It explains the remarkable fact that the empirical frequencies of so many natural "populations" exhibit bell-shaped (i.e., normal or Gaussian) curves.
Let X1, X2, . . ., Xn be a sequence of independent identically distributed random variables with population mean μ and variance σ². Then the distribution of the random variable Z (Sect. 2.4.3):

Z = (X̄ − μ)/(σ/√n)    (4.4)

approaches the standard normal as n tends toward infinity. Note that this theorem is valid for any distribution of X; therein lies its power. Probabilities for random quantities can be found by determining areas under the standard normal curve, as described in Sect. 2.4.3. Suppose one takes a random sample of size n from a population of mean μ and standard deviation σ. Then the random variable Z has (i) approximately the standard normal distribution if n > 30 regardless of the

² That the sum of two Gaussian distributions from a population would be another Gaussian variable (a property called invariant under addition) is somewhat intuitive. Why the sum of two non-Gaussian distributions should gradually converge to a Gaussian is less so, and hence the importance of this theorem.


Fig. 4.1 Illustration of the central limit theorem. The sampling distribution of X̄ contrasted with the parent population distribution for three cases. The first case (left column of figures) shows sampling from a normal population: as sample size n increases, the standard error of X̄ decreases. The next two cases show that even though the populations are not normal, the sampling distribution still becomes approximately normal as n increases. (From Wonnacott and Wonnacott 1985, by permission of John Wiley and Sons)

distribution of the population, and (ii) exactly the standard normal distribution if the population itself is normally distributed regardless of the sample size (Fig. 4.1).

Note that when sample sizes are small (n < 30) and the underlying distribution is unknown, the Student t-distribution, which has wider uncertainty bands (Sect. 2.4.3), should be used with (n − 1) degrees of freedom instead of the Gaussian (Fig. 2.15 and Table A.4). Unlike the z-curve, there are several t-curves depending on the degrees of freedom (d.f.). In the limit of infinite d.f., the t-curve collapses into the z-curve.

Fig. 4.2 Illustration of critical cutoff values for one-tailed and two-tailed tests of the standard normal distribution. The shaded areas, representing the probability values (p) corresponding to the 95% CL or significance level α = 0.05, illustrate how these two types of tests differ in terms of their critical values (which are determined from Table A.3)

(b) One-tailed and two-tailed tests
An important concept needs to be clarified, namely: when does one use a one-tailed as against a two-tailed test? In the two-tailed test, one is testing whether the sample parameter is different (i.e., smaller or larger) than that of the stipulated population. In cases where one wishes to test whether the sample parameter is specifically larger (or specifically smaller) than that of the stipulated population, the one-tailed test is used. The tests are set up and addressed in like manner, the difference being in how the significance level, expressed as a probability (p) value, is finally determined. The smaller the p-value, the stronger the statistical evidence that the observed data did not arise by chance. The shaded areas of the normal distributions shown in Fig. 4.2 illustrate the difference in how the one-tailed and two-tailed types of tests are performed. For a significance level α of 0.05, i.e., probability p = 0.05 that the observed value is lower than the mean value, the cutoff value is −1.645 for the one-tailed test and ±1.96 for the two-tailed test, as indicated in Fig. 4.2. The nomenclature adopted to denote the critical cutoff values for the two-tailed and one-tailed cases at the 95% CL is z_(α/2) = 1.96 and z_α = −1.645 respectively (values determined from Table A.3).

(c) Confidence interval for the mean
In the sub-section above, the behavior of many samples, all taken from one population, was considered. Here, only one large random sample from a population is selected and analyzed to make an educated guess of the parameter estimates of the population, such as its mean and standard deviation. This process is called inductive reasoning, or arguing backwards from a set of observations to reach a reasonable hypothesis. However, the convenience of having to select only one sample of the population comes at a price: one must necessarily accept some uncertainty in the estimates. Based on a sample taken from a population:

(a) One can test whether the sample mean differs from a known population mean (covered in this sub-section).
(b) One can deduce interval bounds of the population mean at a specified probability or confidence level, expressed as a confidence interval CI (covered in Sect. 4.2.2).

The concept of CI was introduced in Sect. 3.6.3 in reference to instrument errors. This concept, pertinent to random variables in general, is equally applicable to sampling. A 95% CI is traditionally interpreted as implying that there is a 95% chance that the difference between the sample and population mean values is contained within this interval (and that this conclusion will be erroneous 5% of the time).³ The range is obtained from the z-curve by finding the value at which the area under the curve (i.e., the probability) is equal to 0.95 (Fig. 4.2). From Table A.3, the corresponding two-tailed cutoff value z_(α/2) is 1.96 (corresponding to a probability value of [(1 − 0.95)/2] = 0.025). This implies that the probability is:

P[ −1.96 < (X̄ − μ)/(s_x/n^(1/2)) < 1.96 ] ≈ 0.95    (4.5a)

or

X̄ − 1.96·s_x/n^(1/2) < μ < X̄ + 1.96·s_x/n^(1/2)    (4.5b)

valid for large samples (n > 30). The half-width of the 95% CI about the mean value, (1.96·s_x/n^(1/2)), is called the bound of the error of estimation. For small samples, the Student-t variable is used instead of the random variable z. Note that Eq. (4.5a) refers to long-run bounds, i.e., in the long run, roughly 95% of such intervals will contain μ. If one is interested in predicting a single X value that has yet to be observed, the following equation is used, and the term prediction interval (PI) applies (Devore and Farnum 2005):

PI(X) = X̄ ± t_(α/2) · s_x · (1 + 1/n)^(1/2)    (4.6)

where t_(α/2) is the two-tailed cutoff value determined from the Student t-distribution at d.f. = (n − 1) at the desired confidence level, and s_x is the sample standard deviation. It is apparent that prediction intervals are much wider than CIs because the quantity "1" within the brackets of Eq. (4.6) will generally dominate (1/n). This means that there is much more uncertainty in predicting the value of a single observation X than in estimating the mean value μ.

Example 4.2.1 Evaluating manufacturer-quoted lifetime of light bulbs from sample data
A manufacturer of xenon light bulbs for street lighting claims that the distribution of the lifetimes of his best model has a mean μ = 16 years and a standard deviation σ = 2.0 years when the bulbs are lit for 12 h every day. Suppose that a city official wants to check the claim by purchasing a sample of 36 of these bulbs and subjecting them to tests that determine their lifetimes.

(i) Assuming the manufacturer's claim to be true, describe the sampling distribution of the mean lifetime of a sample of 36 bulbs. Even though the shape of the distribution is unknown, the Central Limit Theorem suggests that the normal distribution can be used. Thus μ_x̄ = μ = 16 years and SE(x̄) = 2.0/36^(1/2) = 0.33 years.⁴

(ii) What is the probability that the sample purchased by the city official has a mean lifetime of 15 years or less? The normal distribution N(16, 0.33) is drawn, and the darker shaded area to the left of x̄ = 15 shown in Fig. 4.3 provides the probability of the city official

⁴ The nomenclature of using X̄ or x̄ is somewhat confusing. The symbol X̄ is usually used to denote the random variable, while x̄ is used to denote one of its values.
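The reasoning of Example 4.2.1 is easy to verify numerically. The sketch below (not from the text) computes the standard error from Eq. (4.2) and the tail probability asked for in part (ii), using the standard normal CDF expressed through the error function:

```python
import math

def phi(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 16.0, 2.0, 36        # manufacturer's claim and sample size
se = sigma / math.sqrt(n)           # Eq. 4.2: SE of the sample mean, about 0.33
z = (15.0 - mu) / se                # part (ii): sample mean of 15 years or less
p = phi(z)                          # probability of such an unlucky sample
print(round(se, 2), round(p, 4))    # prints: 0.33 0.0013
```

A probability of roughly 0.1% would make the manufacturer's claim highly suspect if the tested sample indeed averaged only 15 years.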


Fig. 4.3 Sampling distribution of X̄ for a normal distribution N(16, 0.33). The shaded area represents the probability of the mean life of the sample of bulbs being less than 15 years.

… 1200 h, but the convention is to frame the alternative hypothesis as "different from" the null hypothesis.
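The contrast between the confidence interval of Eq. (4.5b) and the prediction interval of Eq. (4.6) can also be sketched in code. The sample data and the t cutoff used below are illustrative assumptions, not values from the text:

```python
import math
import statistics

def mean_ci_95(x):
    """Eq. 4.5b: large-sample 95% CI for the population mean."""
    n = len(x)
    xbar = statistics.mean(x)
    half = 1.96 * statistics.stdev(x) / math.sqrt(n)   # z cutoff from Table A.3
    return xbar - half, xbar + half

def single_obs_pi(x, t_half):
    """Eq. 4.6: prediction interval for a single future observation,
    given the two-tailed t cutoff t_half at d.f. = n - 1."""
    n = len(x)
    xbar = statistics.mean(x)
    half = t_half * statistics.stdev(x) * math.sqrt(1 + 1.0 / n)
    return xbar - half, xbar + half
```

Because of the "1 +" term inside the square root, the PI stays wide no matter how large the sample, whereas the CI half-width shrinks as n grows.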


Assume a sample size of n = 100 bulbs manufactured by the new process, and set the significance or error level of the test at α = 0.05. This is clearly a one-tailed test, since the new bulb manufacturing process should have a longer life, not merely one different from that of the traditional process, and the critical value must be selected accordingly. The mean life x̄ of the sample of 100 bulbs is assumed to be normally distributed with mean value 1200 and standard error σ/n^(1/2) = 300/100^(1/2) = 30. From the standard normal table (Table A.3), the one-tailed critical z-value is z_α = 1.64. Recalling that the critical value is defined as z_α = (x̄_c − μ0)/(σ/n^(1/2))

leads to x̄_c = 1200 + 1.64 × 300/(100)^(1/2) = 1249, or about 1250. Suppose testing of the 100 bulbs yields a value of x̄ = 1260. As x̄ > x̄_c, one would reject the null hypothesis at the 0.05 significance (or error) level. This is akin to jury trials, where the null hypothesis is taken to be that the accused is innocent, and the burden of proof during hypothesis testing is on the alternative hypothesis, i.e., on the prosecutor to show overwhelming evidence of the culpability of the accused. If such overwhelming evidence is absent, the null hypothesis is preferentially favored. ■

There is another way of looking at this testing procedure (Devore and Farnum 2005):
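The critical-value arithmetic of Example 4.2.2 takes only a few lines (a sketch; the one-tailed cutoff is taken from Table A.3):

```python
import math

mu0, sigma, n = 1200.0, 300.0, 100   # claimed mean life, population sigma, sample size
z_alpha = 1.645                      # one-tailed cutoff at alpha = 0.05 (Table A.3)
se = sigma / math.sqrt(n)            # standard error = 30
x_crit = mu0 + z_alpha * se          # critical mean life, about 1250
x_sample = 1260.0                    # observed sample mean
reject = x_sample > x_crit           # True: reject H0 at the 0.05 level
print(round(x_crit, 1), reject)
```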

(a) The null hypothesis H0 is true, but one has been exceedingly unlucky and got a very improbable sample with mean x̄. In other words, the observed difference turned out to be significant when, in fact, there is no real difference. Thus, the null hypothesis has been rejected erroneously: the innocent man has been falsely convicted.
(b) H0 is not true after all. Thus, it is no surprise that the observed x̄ value was so high, or that the accused is indeed culpable.

The second explanation is likely to be more plausible, but there is always some doubt because statistical decisions inherently contain probabilistic elements. In other words, statistical tests of hypothesis do not always yield conclusions with absolute certainty: they have built-in margins of error, just as jury trials are known to hand down wrong verdicts. Specifically, two types of errors can be distinguished:

(i) Concluding that the null hypothesis is false when in fact it is true is called a Type I error; it represents the probability α (i.e., the pre-selected significance level) of erroneously rejecting the null hypothesis. This is also called the false positive or "false alarm" rate. The upper normal distribution shown in Fig. 4.4 has a mean value of 1200 (equal to the population or claimed mean value) with a standard error of 30. The area to the right of

Fig. 4.4 The two kinds of error that can occur in a classical test. (a) False positive: if H0 is true, then the significance level α = probability of erring (rejecting the true hypothesis H0); in the upper plot, N(1200, 30), drawn from the population, the area to the right of the critical value represents the probability of falsely rejecting the null hypothesis (Type I error). (b) False negative: if Ha is true, then β = probability of erring (judging that the false hypothesis H0 is acceptable); in the lower plot, N(1260, 30), drawn from the sample, the area to the left of the critical value represents the probability of falsely accepting the null hypothesis (Type II error). The vertical line through both plots marks the critical value (about 1250) separating the "accept H0" and "reject H0" regions. The numerical values correspond to data from Example 4.2.2

the critical value of 1250 represents the probability of a Type I error occurring.

(ii) The flip side, i.e., concluding that the null hypothesis is true when in fact it is false, is called a Type II error; it represents the probability β of erroneously rejecting the alternative hypothesis, and is also called the false negative rate. The lower plot of the normal distribution shown in Fig. 4.4 now has a mean of 1260 (the mean value of the sample) with a standard error of 30, and the area to the left of the critical value x̄_c indicates the probability β of committing a Type II error.

The two types of error are inversely related, as is clear from the vertical line in Fig. 4.4 drawn through both plots. A decrease in the probability of one type of error is likely to result in an increase in the probability of the other. Unfortunately, one cannot simultaneously reduce both by selecting a smaller value of α. The analyst should select the significance level depending on the tolerance for, or seriousness of the consequences of, either type of error in the specific circumstance. Recall that the probability of making a Type I error is called the significance level of the test. The probability of correctly rejecting a false null hypothesis, equal to (1 − β), is referred to as the statistical power of the test. The only way of reducing both types of


errors is to increase the sample size, with the expectation that the standard error will decrease and the sample mean will get closer to the population mean.

One final issue relates to the selection of the test statistic. One needs to distinguish between the following two instances: (i) if the population standard deviation σ is known and the sample size n > 30, then the z-statistic is selected for performing the test along with the standard normal tables (as done for Example 4.2.2 above); (ii) if the population standard deviation is unknown or if the sample size n < 30, then the t-statistic is selected (using the sample standard deviation s_x instead of σ), and the test is performed using Student-t tables with the appropriate degrees of freedom.
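For Example 4.2.2, the two error probabilities illustrated in Fig. 4.4 can be computed directly (a sketch using the values quoted above):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

se = 30.0
x_crit = 1249.2                             # critical value from Example 4.2.2
alpha = 1.0 - phi((x_crit - 1200.0) / se)   # Type I error: H0 true, N(1200, 30)
beta = phi((x_crit - 1260.0) / se)          # Type II error: sample dist N(1260, 30)
power = 1.0 - beta                          # probability of correctly rejecting a false H0
print(round(alpha, 3), round(beta, 3), round(power, 3))
```

Raising the critical value lowers alpha but raises beta, which is the inverse relationship described in the text.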

4.2.3 Two Independent Sample and Paired Difference Tests on Means

As opposed to hypothesis tests for a single population mean, there are hypothesis tests that allow one to compare two population mean values from samples taken from each population. Two basic presumptions for the tests described below to be valid are that the standard deviations of the populations are reasonably close, and that the populations are approximately normally distributed.

(a) Two independent samples test
The test is based on information (namely, the mean and the standard deviation) obtained from taking two independent random samples from the two populations under consideration, whose variances are unknown and unequal. Using the same notation as before for population and sample, and using subscripts 1 and 2 to denote the two samples, the random variable

z = [(x̄1 − x̄2) − (μ1 − μ2)] / [s1²/n1 + s2²/n2]^(1/2)    (4.7)

is said to approximate the standard normal distribution for large samples (n1 > 30 and n2 > 30), where s1 and s2 are the standard deviations of the two samples. The denominator is called the standard error (SE) and is a measure of the total variability of both samples combined. Notice the similarity between the expression for SE and that for the combined error of two independent measurements given by the quadrature sum of their squared values (Eq. 3.22). The confidence interval CI of the difference in the population means can be determined as:

μ1 − μ2 = (x̄1 − x̄2) ± z_α · SE(x̄1, x̄2)  where  SE(x̄1, x̄2) = [s1²/n1 + s2²/n2]^(1/2)    (4.8)

where z_α is the critical value at the selected significance level. Thus, the testing of the two samples involves a single random variable combining the properties of both. For smaller sample sizes, Eq. (4.8) still applies, but the z-standardized variable is replaced with the Student-t variable. The critical values are found from the Student t-tables with degrees of freedom d.f. = n1 + n2 − 2. If the variances of the populations are known, then these should be used instead of the sample variances. When the samples are small, and only when the variances of both populations are close, some textbooks suggest that the two samples be combined and the parameter treated as a single random variable. Here, instead of the individual standard deviation values s1 and s2, a combined quantity called the pooled variance s_p² is used:

s_p² = [(n1 − 1)·s1² + (n2 − 1)·s2²] / (n1 + n2 − 2), with d.f. = n1 + n2 − 2    (4.9)

Note that the pooled variance is simply the weighted average of the two sample variances. The use of the pooled variance approach is said to result in tighter confidence intervals, and hence its appeal. The random variable approximates the t-distribution, and the confidence interval CI of the difference in the population means is:

μ1 − μ2 = (x̄1 − x̄2) ± t_α · SE(x̄1, x̄2)  where  SE(x̄1, x̄2) = [s_p²·(1/n1 + 1/n2)]^(1/2)    (4.10)

Note that the above equation is said to apply when the variances of both samples are close. However, Devore and Farnum (2005) strongly discourage the use of the pooled variance approach as a general rule, so the better approach, when in doubt, is to use Eq. (4.8) so as to be conservative. Manly (2005), on the other hand, states that the independent random sample test is fairly robust to the assumptions of normality and equal population variance, especially when the sample size exceeds 20 or so. The assumption of equal population variances is said not to be an issue if the ratio of the two variances is within 0.4–2.5.

Example 4.2.3 Verifying savings from energy conservation measures in homes
Certain electric utilities with limited generation capacities fund contractors to weather-strip residences in an effort to reduce infiltration losses, thereby reducing building loads, which in turn lowers electricity needs.⁵ Suppose an electric utility wishes to determine the cost-effectiveness of its weather-stripping program by comparing the annual electric energy use of 200 similar residences in a given community, half of which were weather-stripped while the other half were not. Samples collected from both types of residences yield:

Control sample: x̄1 = 18,750; s1 = 3200 and n1 = 100.
Weather-stripped sample: x̄2 = 15,150; s2 = 2700 and n2 = 100.

The mean difference (x̄1 − x̄2) = 18,750 − 15,150 = 3600, i.e., the mean savings fraction or percentage in each weather-stripped residence is 19.2% (= 3600/18,750) of the mean baseline or control home. However, there is an uncertainty associated with this mean value since only a sample has been analyzed. This uncertainty is characterized as a

⁵ Such energy conservation programs result in a revenue loss in electricity sales, but often this is more cost-effective to utilities in terms of deferred generation capacity expansion costs. Another reason for implementing conservation programs is the mandatory requirement set by Public Utility Commissions (PUCs), which regulate electric utilities in the United States.


bounded range for the mean difference. The one-tailed critical value at the 95% CL corresponding to a significance level α = 0.05 is z_α = 1.645 from Table A.3. Then, from Eq. (4.8):

μ1 − μ2 = (x̄1 − x̄2) ± 1.645 · [s1²/100 + s2²/100]^(1/2)

To complete the calculation of the confidence interval (CI), it is assumed, given that the sample sizes are large, that the sample variances are reasonably close to the population variances. Thus, the CI is approximately:

(18,750 − 15,150) ± 1.645 · [3200²/100 + 2700²/100]^(1/2) = 3600 ± 689 = (2911 and 4289)

These intervals represent the lower and upper values of saved energy at the 95% CL. To conclude, one can state that the savings are positive, i.e., one can be 95% confident that there is an energy benefit in weather-stripping the homes. More specifically, the mean saving percentage is 19.2% [= (18,750 − 15,150)/18,750] of the baseline value, with an uncertainty of 19.1% (= 689/3600) in the savings at the 95% CL. Thus, the uncertainty in the savings estimate is quite large. In practice, the energy savings fraction is usually (much) smaller than what was assumed above, and this could result in the uncertainty in the savings being as large as the savings amount itself. Such a concern reflects realistic situations where the efficacy of energy conservation programs in homes is often difficult to verify accurately. One could try increasing the sample size or resorting to stratified sampling (see Sect. 4.7.4), but these may not necessarily be more conclusive. Another option is to adopt a less stringent confidence level such as the 90% CL. ■

(b) Paired difference test
The previous section dealt with independent samples from two populations with close-to-normal probability distributions. There are instances when the samples are somewhat correlated, and such interdependent samples are called paired samples. This interdependence can also arise when the samples are taken at the same time and are affected by a time-varying variable that is not explicitly considered in the analysis. Rather than the individual values, the difference is taken as the only random sample, since it is likely to exhibit much less variability than those of the two samples. Thus, the confidence intervals calculated from paired data will be narrower than those calculated from two independent samples. Let d_i be the difference between individual readings of two small paired samples (n < 30), and d̄ their mean value. Then, the t-statistic is taken to be:

t = d̄ / SE  where  SE = s_d / n^(1/2)    (4.11a)

and the CI around d̄ is:

μ_d = d̄ ± t_α · s_d / n^(1/2)    (4.11b)
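The two-sample CI of Example 4.2.3 above (Eq. 4.8) can be reproduced in a few lines (a sketch using the sample statistics quoted in that example):

```python
import math

z_alpha = 1.645                       # one-tailed cutoff at the 95% CL (Table A.3)
x1, s1, n1 = 18750.0, 3200.0, 100     # control residences
x2, s2, n2 = 15150.0, 2700.0, 100     # weather-stripped residences
se = math.sqrt(s1**2 / n1 + s2**2 / n2)    # Eq. 4.8 standard error
diff = x1 - x2                             # mean savings = 3600
lo, hi = diff - z_alpha * se, diff + z_alpha * se
print(round(se, 1), round(lo), round(hi))  # prints: 418.7 2911 4289
```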

Hypothesis testing of means for paired samples is done the same way as for a single independent mean, and is usually (but not always) superior to an independent sample test. Paired difference tests are used for comparing "before and after" or "with and without" types of experiments performed on the same group in turn, say, to assess the effect of an action. For example, the effect of a gasoline additive meant to improve gas mileage can be evaluated statistically by considering a set of data representing the difference in the gas mileage of n cars which have each been subjected to tests involving "no additive" and "with additive." The usefulness of the method is illustrated by the following example, which is a more direct application of paired difference tests.

Example 4.2.4 Comparing energy use of two similar buildings based on utility bills—the wrong way
Buildings designed according to certain performance standards are eligible for recognition as energy-efficient buildings by federal and certification agencies. A recently completed building (B2) was awarded such an honor. The federal inspector, however, denied the request of the owner of another identical building (B1) located close by, who claimed that the differences in energy use between the two buildings were within statistical error. An energy consultant was hired by the owner to prove that B1 is as energy efficient as B2. He chose to compare the monthly mean utility bills over a year between the two commercial buildings, using data recorded over the same 12 months and listed in Table 4.1 (the analysis would be more conclusive if bill data over several years were used, but such data may be hard to come by).

This problem can be addressed using the two-sample test method. The null hypothesis is that the mean monthly utility charges μ1 and μ2 for the two buildings are equal, against the alternative hypothesis that Building B2 is more energy efficient than B1 (thus, a one-tailed test is appropriate).
Since the sample sizes are less than 30, the t-statistic has to be used instead of the standard normal z-statistic. The pooled variance approach given by Eq. (4.9) is appropriate in this instance. It is computed as:


Table 4.1 Monthly utility bills and the corresponding outdoor temperature for the two buildings being compared (Example 4.2.4)

Month      | B1 utility cost ($) | B2 utility cost ($) | Difference (B1 − B2) | Outdoor temp (°C)
1          | 693                 | 639                 | 54                   | 3.5
2          | 759                 | 678                 | 81                   | 4.7
3          | 1005                | 918                 | 87                   | 9.2
4          | 1074                | 999                 | 75                   | 10.4
5          | 1449                | 1302                | 147                  | 17.3
6          | 1932                | 1827                | 105                  | 26.0
7          | 2106                | 2049                | 57                   | 29.2
8          | 2073                | 1971                | 102                  | 28.6
9          | 1905                | 1782                | 123                  | 25.5
10         | 1338                | 1281                | 57                   | 15.2
11         | 981                 | 933                 | 48                   | 8.7
12         | 873                 | 825                 | 48                   | 6.8
Mean       | 1,349               | 1,267               | 82                   | —
Std. dev.  | 530.07              | 516.03              | 32.00                | —

Fig. 4.5 Month-by-month variation of the utility bills for the two buildings B1 and B2 (Example 4.2.5)


s_p² = [(12 − 1)·(530.07)² + (12 − 1)·(516.03)²] / (12 + 12 − 2) = 273,630.6

while the t-statistic can be deduced by rearranging Eq. (4.10):

t = [(1349 − 1267) − 0] / [273,630.6 · (1/12 + 1/12)]^(1/2) = 82/213.54 = 0.38

The t-value is very small and will not lead to rejection of the null hypothesis even at a significance level of α = 0.10 (from Table A.4, the one-tailed critical value is 1.321 for CL = 90% and d.f. = 12 + 12 − 2 = 22). Thus, the consultant would report that insufficient statistical evidence exists to state that the two buildings differ in their energy consumption. As explained next, this example demonstrates a faulty analysis.
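The (faulty) pooled two-sample computation of Example 4.2.4 (Eqs. 4.9 and 4.10) can be checked as follows (a sketch using the summary statistics from Table 4.1):

```python
import math

x1bar, s1, n1 = 1349.0, 530.07, 12    # building B1 monthly bills
x2bar, s2, n2 = 1267.0, 516.03, 12    # building B2 monthly bills
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # Eq. 4.9 pooled variance
se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))                   # Eq. 4.10 standard error
t = (x1bar - x2bar) / se              # about 0.38, far below the critical value 1.321
print(round(sp2, 1), round(t, 2))
```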

Example 4.2.5 Comparing energy use of two similar buildings based on utility bills—the right way
There is, however, a problem with the way the energy consultant performed the test in the previous example. Close observation of the data as plotted in Fig. 4.5 would lead one not only to suspect that this conclusion is erroneous, but also to observe that the utility bills of the two buildings tend to rise and fall together because of seasonal variations in the outdoor temperature. Hence the condition that the two samples be independent is violated. It is in such circumstances that a paired test is relevant. Here, the test is meant to determine whether the monthly mean of the differences in utility bills between buildings B1 and B2 (x̄_D) is zero (null hypothesis) or is positive. In this case:

t-statistic = (x̄_D − 0)/(s_D/n_D^(1/2)) = (82 − 0)/(32/12^(1/2)) = 8.88, with d.f. = 12 − 1 = 11
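Recomputing the paired statistic from the raw monthly differences of Table 4.1 confirms the hand calculation (a sketch; the t cutoff 1.796 is from Table A.4):

```python
import math
import statistics

# Monthly cost differences (B1 - B2) from Table 4.1
d = [54, 81, 87, 75, 147, 105, 57, 102, 123, 57, 48, 48]
n = len(d)
dbar = statistics.mean(d)             # 82
sd = statistics.stdev(d)              # about 32
t = dbar / (sd / math.sqrt(n))        # Eq. 4.11a, about 8.88
t_crit = 1.796                        # one-tailed, alpha = 0.05, d.f. = 11 (Table A.4)
print(round(t, 2), t > t_crit)
```

The paired statistic is more than twenty times larger than the pooled two-sample one, because differencing removes the common seasonal variation.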


Fig. 4.6 Conceptual illustration of three characteristic cases that may arise during two-sample testing of medians. The box and whisker plots provide some indication as to the variability in the results of the tests. Case (a) very clearly indicates that the samples are very much different; case (b) also suggests the same, but with a little less certitude. Finally, it is more difficult to draw conclusions from case (c), and it is in such cases that statistical tests are useful

where the values of 82 and 32 are found from Table 4.1. For a significance level of α = 0.05 and d.f. = 11, the one-tailed test (Table A.4) gives a critical value t_0.05 = 1.796. Because 8.88 is much higher than this critical value, one can safely reject the null hypothesis. In fact, Bldg. B1 is less energy efficient than Bldg. B2 even at a significance level of 0.0005 (or CL = 99.95%), and the owner of B1 does not have a valid case at all! This illustrates how misleading results can be obtained if inferential tests are misused, or if the analyst ignores the underlying assumptions behind a particular test. It is important to keep in mind the premise that the random variables should follow a normal distribution, so some preliminary exploratory data analysis using multiple years of utility bills is advisable (Sect. 3.5). ■

Figure 4.6 illustrates, in a simple conceptual manner, the three characteristic cases which can arise when comparing the means of two populations based on sampled data. Recall that the box and whisker plot is a type of graphical display of the shape of the distribution, where the solid line denotes the median and the upper and lower hinges of the box indicate the interquartile range values (25th and 75th percentiles), with the whiskers extending to 1.5 times this range. Case (a) corresponds to the situation where the whisker of one box plot extends only to the lower quartile of the second box plot: one does not need to perform a statistical test to conclude that the two population means are different. Case (b) also suggests a difference between population means, but with a little less certitude. Case (c) illustrates the case when the two whisker bands are quite close, and the value of statistical tests becomes apparent. As a rough rule of thumb, if the 25th percentile for one sample exceeds the median line of the other sample, one could conclude that the means are likely to be different (Walpole et al. 2007).

4.2.4 Single and Two Sample Tests for Proportions

There are several cases where surveys are performed to determine the fractions or proportions of populations who either have preferences of some sort or have purchased a certain type of equipment. For example, a gas company may wish to determine what fraction of its customer base has natural gas heating as against another source (e.g., electric heat pumps). The company performs a survey on a random sample, from which it would like to extrapolate and ascertain confidence limits on this fraction. It is in a case such as this, where each response can be interpreted as either a "success" (the customer has gas heat) or a "failure"—in short, a binomial experiment (see Sect. 2.4.2b)—that the following test is useful.

(a) Single sample test
Let p be the population proportion one wishes to estimate from the sample proportion p̂, which can be determined as: p̂ = (number of successes in sample)/(total number of trials) = x/n. Then, provided the sample is large (n ≥ 30), the proportion p̂ is an unbiased estimator of p with an approximately normal distribution. Dividing the expression for the variance of the Bernoulli trials (Eq. 2.37b) by n² yields the standard error of the sampling distribution of p̂:

SE(p̂) = [p̂(1 − p̂)/n]^(1/2)    (4.12)

Thus, the large-sample CI for p for the two-tailed case at a significance level α is given by:

CI = p̂ ± z_(α/2)·[p̂(1 − p̂)/n]^(1/2)    (4.13)


Example 4.2.6
In a random sample of n = 100 new residences in Scottsdale, AZ, it was found that 63 had swimming pools. Find the 95% CI for the fraction of buildings which have pools.

In this case, n = 100, while p̂ = 63/100 = 0.63. From Table A.3, the two-tailed critical value z_(0.05/2) = 1.96, and hence from Eq. (4.13), the two-tailed 95% CI for p is:

0.63 − 1.96·[0.63(1 − 0.63)/100]^(1/2) < p < 0.63 + 1.96·[0.63(1 − 0.63)/100]^(1/2)

or 0.5354 < p < 0.7246. ■
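The interval of Example 4.2.6 can be reproduced directly from Eqs. (4.12) and (4.13) (a sketch; the two-tailed 95% cutoff 1.96 is from Table A.3):

```python
import math

n, successes = 100, 63
p_hat = successes / n                         # sample proportion 0.63
se = math.sqrt(p_hat * (1 - p_hat) / n)       # Eq. 4.12 standard error
z = 1.96                                      # two-tailed cutoff at the 95% CL
lo, hi = p_hat - z * se, p_hat + z * se       # Eq. 4.13
print(round(lo, 4), round(hi, 4))             # prints: 0.5354 0.7246
```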

Example 4.2.7
The same equations can also be used to determine a sample size such that the error in p̂ does not exceed a certain range e. For instance, one would like to determine, from the Example 4.2.6 data, the sample size which will yield an estimate of p within 0.02 or less at the 95% CL. Then, recasting Eq. (4.13) results in a sample size:

n = z²_(α/2)·p̂(1 − p̂)/e² = (1.96)²·(0.63)·(1 − 0.63)/(0.02)² = 2239

It must be pointed out that the above example is somewhat misleading, since one does not know the value of p̂ beforehand. One may have a preliminary idea, in which case the sample size n would be an approximate estimate, and this may have to be revised once some data are collected. ■

(b) Two sample tests
The intent here is to estimate whether statistically significant differences exist between the proportions of two populations based on one sample drawn from each population. Assume that the two samples are large and independent. Let p̂1 and p̂2 be the sampling proportions. Then, the sampling distribution of (p̂1 − p̂2) is approximately normal, with (p̂1 − p̂2) being an unbiased estimator of (p1 − p2) and the standard error given by:

SE(p̂1 − p̂2) = [p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2]^(1/2)    (4.14)

The following example illustrates the procedure.

Example 4.2.8 Hypothesis testing of increased incidence of lung ailments due to radon in homes
The Environmental Protection Agency (EPA) would like to determine whether the fraction of residents with health problems living in an area where the subsoil is known to have elevated radon concentrations is statistically higher than the fraction in locations where sub-soil radon concentrations are negligible. Specifically, the agency wishes to test the hypothesis at the 95% CL that the fraction of residents p1 with lung ailments in radon-prone areas is higher than the fraction p2 corresponding to low radon level locations. The following data are collected:

High radon level area: n1 = 100, p̂1 = 0.38
Low radon area: n2 = 225, p̂2 = 0.22
Null hypothesis H0: (p1 − p2) = 0
Alternative hypothesis H1: (p1 − p2) > 0

One calculates the random variable, using Eq. (4.14) to compute the SE:

z = (p̂1 − p̂2) / [p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2]^(1/2)
  = (0.38 − 0.22) / [(0.38)(0.62)/100 + (0.22)(0.78)/225]^(1/2) = 2.865

A one-tailed test is appropriate, and from Table A.3 the critical value is z_0.05 = 1.65 for the 95% CL. Since the calculated z value > z_α, the null hypothesis can be rejected. Thus, one would conclude that those living in areas of high radon levels have a statistically higher incidence of lung ailments than those who do not. Further inspection of Table A.3 reveals that z = 2.865 corresponds to a one-tailed probability value of about 0.002, i.e., the null hypothesis could have been rejected at close to the 99.8% CL. Should the EPA require mandatory testing of all homes at some expense to all homeowners, or should some other policy measure be adopted? These types of considerations fall under the purview of decision-making, discussed in Chap. 12. ■
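The two-proportion test of Example 4.2.8 in code (a sketch; Eq. 4.14 gives the SE, and the one-tailed cutoff is from Table A.3):

```python
import math

n1, p1 = 100, 0.38    # high radon level area
n2, p2 = 225, 0.22    # low radon area
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # Eq. 4.14
z = (p1 - p2) / se                                        # about 2.865
z_crit = 1.645                                            # one-tailed, 95% CL (Table A.3)
print(round(z, 3), z > z_crit)
```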

4.2.5 Single and Two Sample Tests of Variance

Recall that when a sample mean is used to provide an estimate of the population mean μ, it is more informative to give a confidence interval CI for μ instead of simply stating the value x̄. A similar approach can be adopted for estimating the population variance from that of a sample.

4.2 Basic Univariate Inferential Statistics


(a) Single sample test
The CI for a population variance σ² based on the sample variance s² is to be determined. To construct such a CI, one will use the fact that if a random sample of size n is taken from a population that is normally distributed with variance σ², then the random variable

χ² = (n − 1) s² / σ²    (4.15)

has the Pearson chi-square distribution with ν = (n − 1) degrees of freedom (described in Sect. 2.4.3). The advantage of using χ² instead of s² is akin to standardizing a variable to a normal random variable. Such a transformation allows standard tables (such as Table A.5) to be used for determining probabilities irrespective of the magnitude of s². The basis of these probability tables is again akin to finding the areas under the chi-square curves.

Example 4.2.9
A company that makes boxes wishes to determine whether its automated production line requires major servicing or not. The decision will be based on whether the box-to-box weight variance is significantly different from a maximum permissible population variance of σ² = 0.12 kg². A sample of 10 boxes is selected, and their variance is found to be s² = 0.24 kg². Is this difference significant at the 95% CL?
From Eq. (4.15), the observed chi-square value is χ² = (10 − 1)(0.24)/0.12 = 18. Inspection of Table A.5 for ν = 9 degrees of freedom reveals that for a significance level α = 0.05 the critical chi-square value is χ²α = 16.92, while for α = 0.025 it is χ²α = 19.02. Thus, the result is significant at α = 0.05 (95% CL) but not at the 97.5% CL. Whether to service the automated production line based on these statistical tests would involve weighing the cost of service against the associated benefits in product quality. ■

(b) Two sample tests
This instance applies to the case when two independent random samples are taken from two populations that are normally distributed, and one needs to determine whether the variances of the two populations are different or not. Such tests find application prior to conducting t-tests on two means, which presume equal variances. Let σ1 and σ2 be the standard deviations of the two populations, and s1 and s2 be the sample standard deviations. If σ1 = σ2, then the random variable

F = s1² / s2²    (4.16)

has the F-distribution (described in Sect. 2.4.3) with degrees of freedom (d.f.) = (ν1, ν2) where ν1 = (n1 − 1) and ν2 = (n2 − 1). Note that the distributions are different for different combinations of ν1 and ν2. The probabilities for F can be determined using areas under the F curves or from tabulated values as in Table A.6. Note that the F-test applies to independent samples and, unfortunately, is known to be rather sensitive to the assumption of normality. Hence, some argue against its use altogether for two-sample testing (for example, Manly 2005).

Example 4.2.10 Comparing variability in daily productivity of two workers
It is generally acknowledged that worker productivity increases if the environment is properly conditioned to meet stipulated human comfort requirements. One is interested in comparing the mean productivity of two office workers under the same conditions. However, before undertaking that evaluation, one is unsure about the assumption of equal variances in the productivity of the workers (i.e., how consistent the workers are from one day to another). This test can be used to check the validity of this assumption. Suppose the following data have been collected for two workers under the same environment and performing similar tasks. An initial analysis of the data suggests that the normality condition is met for both workers:

Worker A: n1 = 13 days, mean x̄1 = 26.3 production units, standard deviation s1 = 8.2 production units.
Worker B: n2 = 18 days, mean x̄2 = 19.7 production units, standard deviation s2 = 6.0 production units.

The intent here is to compare not the means but the standard deviations. The F-statistic is determined by always choosing the larger variance as the numerator. Then F = (8.2/6.0)² = 1.87. From Table A.6, the critical F value is Fα = 2.38 for (13 − 1) = 12 and (18 − 1) = 17 degrees of freedom at a significance level α = 0.05. Thus, as illustrated in Fig. 4.7, one is forced to accept the null hypothesis since the calculated F-value < Fα, and to conclude that the data do not provide enough evidence to indicate that the population variances of the two workers are statistically different at α = 0.05. Hence, one can now proceed to use the two-sample t-test with some confidence to determine whether the difference in the means of the two workers is statistically significant or not.
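Both variance tests can be checked with SciPy's distribution objects. The sketch below uses the numbers from Examples 4.2.9 and 4.2.10:

```python
from scipy import stats

# Single-sample variance test (Example 4.2.9)
n, s2, sigma2 = 10, 0.24, 0.12
chi2_stat = (n - 1) * s2 / sigma2            # Eq. (4.15): 18.0
chi2_crit = stats.chi2.ppf(0.95, df=n - 1)   # 16.92 -> significant at alpha = 0.05

# Two-sample variance test (Example 4.2.10), larger variance in the numerator
sA, nA = 8.2, 13
sB, nB = 6.0, 18
F = (sA / sB) ** 2                                   # Eq. (4.16): about 1.87
F_crit = stats.f.ppf(0.95, dfn=nA - 1, dfd=nB - 1)   # about 2.38 -> cannot reject

print(chi2_stat, chi2_crit, F, F_crit)
```

Using `ppf` (the inverse CDF) replaces the lookups in Tables A.5 and A.6.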

Table 4.2 Expected number of homes for different numbers of non-code compliance values if the process is assumed to be a Poisson distribution with sample mean of 0.5 (Example 4.2.11)

X = number of non-code compliance values | P(x) × n        | Expected number
0                                        | (0.6065) × 380  | 230.470
1                                        | (0.3033) × 380  | 115.254
2                                        | (0.0758) × 380  | 28.804
3                                        | (0.0126) × 380  | 4.788
4                                        | (0.0016) × 380  | 0.608
5 or more                                | (0.0002) × 380  | 0.076
Total                                    | (1.000) × 380   | 380

Fig. 4.7 F-distribution with d.f. = (12, 17), showing the rejection region beyond the critical value Fα = 2.38 for α = 0.05. Since the calculated F-value of 1.87 is lower than the critical value, one is forced to accept the null hypothesis (Example 4.2.10)

4.2.6 Tests for Distributions

Recall from Sect. 2.4.3 that the chi-square (χ²) statistic applies to discrete data. It is used to statistically test the hypothesis that a set of empirical or sample data does not differ significantly from that expected from a specified theoretical distribution. In other words, it is a goodness-of-fit test to ascertain whether the distribution of proportions of one group differs from another or not. The chi-square statistic is computed as:

χ² = Σ_{k} (f_obs − f_exp)² / f_exp    (4.17)

where f_obs is the observed frequency of each class or interval, f_exp is the expected frequency for each class predicted by the theoretical distribution, and k is the number of classes or intervals. If χ² = 0, the observed and theoretical frequencies agree exactly. If not, the larger the value of χ², the greater the discrepancy. Tabulated values of χ² are used to determine significance for different values of degrees of freedom ν = k − 1 (Table A.5). Certain restrictions apply for proper use of this test. The sample size should be greater than 30, and none of the expected frequencies should be less than 5 (Walpole et al. 2007). In other words, a long tail of the probability curve at the lower end is not appropriate, and in that sense some power is lost when the test is adopted. The following two examples serve to illustrate the process of applying the chi-square test.

Example 4.2.11 Ascertaining whether non-code compliance infringements in residences are random or not

A county official was asked to analyze the frequency of cases when home inspectors found new homes built by one specific builder to be non-code compliant, and to determine whether the violations were random or not. The following data for 380 homes were collected:

No. of code infringements | 0   | 1  | 2  | 3 | 4
Number of homes           | 242 | 94 | 38 | 4 | 2

The underlying random process can be characterized by the Poisson distribution (see Sect. 2.4.2):

P(x) = λ^x exp(−λ) / x!

The null hypothesis, namely that the sample is drawn from a population that is Poisson distributed, is to be tested at the 0.05 significance level.
The sample mean is λ = [0(242) + 1(94) + 2(38) + 3(4) + 4(2)]/380 = 0.5 infringements per home. For a Poisson distribution with λ = 0.5, the expected values for different values of x are found as shown in Table 4.2. The last three categories have expected frequencies less than 5, which does not meet one of the requirements for using the test (as stated above). Hence, these will be combined into a new category called "3 or more cases," which has an expected frequency of (4.788 + 0.608 + 0.076) = 5.472 and an observed frequency of (4 + 2) = 6. The following statistic is calculated first:

χ² = (242 − 230.470)²/230.470 + (94 − 115.254)²/115.254 + (38 − 28.804)²/28.804 + (6 − 5.472)²/5.472 = 7.483

Since there are only 4 groups, the degrees of freedom ν = 4 − 1 = 3, and from Table A.5 the critical value at the 0.05 significance level is χ²α = 7.815. This would suggest that the null hypothesis cannot be rejected at the 0.05 significance level; however, the calculated and critical values are very close, and some further analysis may be warranted. ■
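The goodness-of-fit computation can be sketched as below. Note that the book's value of 7.483 uses expected counts rounded to three decimals; carrying the Poisson probabilities at full precision gives a value slightly below that, with the same conclusion:

```python
import numpy as np
from scipy import stats

observed = np.array([242, 94, 38, 6])   # counts for x = 0, 1, 2, and "3 or more"
n, lam = 380, 0.5

# Expected frequencies under Poisson(0.5), lumping the tail into "3 or more"
probs = [stats.poisson.pmf(k, lam) for k in (0, 1, 2)]
probs.append(1 - sum(probs))
expected = n * np.array(probs)

chi2_stat = ((observed - expected) ** 2 / expected).sum()   # Eq. (4.17)
chi2_crit = stats.chi2.ppf(0.95, df=len(observed) - 1)      # 7.815 for 3 d.f.
print(round(chi2_stat, 3), round(chi2_crit, 3))
```

As in the example, the statistic falls just short of the critical value, so the Poisson hypothesis is not rejected at α = 0.05.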


Table 4.3 Observed and computed (assuming gender independence) number of accidents in different circumstances (Example 4.2.12)

Circumstance | Male Observed | Male Expected | Female Observed | Female Expected | Total Observed
At work      | 40            | 26.3          | 5               | 18.7            | 45
At home      | 49            | 62.6          | 58              | 44.4            | 107
Other        | 18            | 18.1          | 13              | 12.9            | 31
Total        | 107           |               | 76              |                 | 183 = n

Example 4.2.12⁶ Evaluating whether injuries in males and females are independent of circumstance
Chi-square tests are also widely used as tests of independence using contingency tables. In 1975, more than 59 million Americans suffered injuries. More males (33.6 million) were injured than females (25.6 million). These statistics do not distinguish whether males and females tend to be injured in similar circumstances. A safety survey of n = 183 accident reports, selected at random in a large city, was used to study this issue, and the results are summarized in Table 4.3.
The null hypothesis is that the circumstance of an accident (whether at work, at home, or elsewhere) is independent of the gender of the victim. This hypothesis is to be verified at a significance level of α = 0.01. The degrees of freedom d.f. = (r − 1) × (c − 1), where r is the number of rows and c the number of columns. Hence, d.f. = (3 − 1) × (2 − 1) = 2. From Table A.5, the critical value is χ²α = 9.21 at α = 0.01 for d.f. = 2.
The expected values for the different joint occurrences (male/work, male/home, male/other, female/work, female/home, female/other) are shown in the Expected columns of Table 4.3 and correspond to the case when the occurrences are independent. Recall from basic probability (Eq. 2.10) that if events A and B are independent, then p(A ∩ B) = p(A) · p(B), where p indicates the probability. In our case, if being "male" and "being involved in an accident at work" were truly independent, then p(work ∩ male) = p(work) · p(male). Consider the cell corresponding to male/at work. Its expected value is

n · p(work ∩ male) = n · p(work) · p(male) = 183 × (45/183) × (107/183) = (45)(107)/183 = 26.3

(as shown in Table 4.3). Expected values for the other joint occurrences shown in the table have been computed in like manner. Thus, the chi-square statistic is

χ² = (40 − 26.3)²/26.3 + (5 − 18.7)²/18.7 + ... + (13 − 12.9)²/12.9 = 24.3

Since χ²α = 9.21, which is much lower than 24.3, the null hypothesis can be safely rejected at a significance level of 0.01. Hence, one would conclude that gender has a bearing on the circumstance in which accidents occur. ■

⁶ From Weiss (1987) by permission of Pearson Education.

Another widely used test to determine whether a set of empirical or sample data comes from a specified theoretical distribution is the Kolmogorov-Smirnov (KS) test (Shannon 1975). It is analogous to the chi-square test and is especially useful for small sample sizes since it does not require that adjacent categories with expected frequencies less than 5 be combined. The chi-square test is very powerful for large samples (about n > 100, with some authors even suggesting n > 30). The KS test is based on binning the cumulative probability distribution of the empirical data into a specific number of classes or intervals and is best used for 10 < n < 100. In general, the greater the number of intervals, the more discriminating the test. While the chi-square test requires a minimum of 5 data points in each class, the KS test can accommodate even a single observation in a class.
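As a sketch of the mechanics (with synthetic data, since no worked example accompanies this passage), the KS test compares the empirical CDF of a sample against a fully specified theoretical CDF:

```python
import numpy as np
from scipy import stats

# Synthetic small sample, used only to illustrate the mechanics of the KS test
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=30)

# Compare the empirical CDF of the sample against the hypothesized normal CDF
D, p_value = stats.kstest(sample, stats.norm(loc=10.0, scale=2.0).cdf)
print(D, p_value)  # a p-value above 0.05 would mean no evidence against the fit
```

Note that the KS statistic D is the largest vertical distance between the empirical and theoretical CDFs, so no binning decisions are required at all in this implementation.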

4.2.7 Test on the Pearson Correlation Coefficient

Recall that the Pearson correlation coefficient was presented in Sect. 3.4.2 as a means of quantifying the linear relationship between samples of two variables. One can also define a population correlation coefficient ρ for two variables. Section 4.2.1 presented methods by which the uncertainty around the population mean could be ascertained from the sample mean by determining confidence limits. Similarly, one can make inferences about the population correlation coefficient ρ from knowledge of the sample correlation coefficient r. Provided both variables are normally distributed (called a bivariate normal population), Fig. 4.8 provides a convenient way of ascertaining the 95% CL of the population correlation coefficient for different sample sizes. Say r = 0.6 for a sample of n = 10 pairs of observations; then the 95% CL interval limits for the population correlation coefficient are (−0.05 < ρ < 0.87), which are very wide. Notice how increasing the sample size shrinks these bounds. For example, when n = 100, the interval limits are (0.47 < ρ < 0.71).

The accurate manner of determining whether the sample correlation coefficient r found from analyzing a data set with two variables is significant or not is to conduct a standard hypothesis test involving null and alternative hypotheses.


Fig. 4.8 Plot depicting 95% CI for the population correlation in a bivariate normal population for various sample sizes n. The bold vertical line defines the lower and upper limits of ρ when r = 0.6 from a data set of 10 pairs of observations. (From Wonnacott and Wonnacott 1985 by permission of John Wiley and Sons)

Table A.7 lists the critical values of the sample correlation coefficient r for testing the null hypothesis that the population correlation coefficient is zero (rejection of which implies ρ ≠ 0) at the 0.05 and 0.01 significance levels for one- and two-tailed tests. The interpretation of these values is of some importance in many cases, especially when dealing with small data sets. Say, analysis of the 12 monthly bills of a residence revealed a linear correlation of r = 0.6 with degree-days at the location (see Pr. 2.28 for a physical explanation of the degree-day concept). Assume that a one-tailed test applies. The sample correlation suggests the presence of a correlation at a significance level α = 0.05 (the critical value from Table A.7 is ρc = 0.497), while there is none at α = 0.01 (for which ρc = 0.658). Note that certain simplified suggestions on interpreting values of r in terms of whether they are strong, moderate, or weak for general engineering analysis were given by Eq. (3.13); these are to be used with caution and were meant as rules of thumb only.

Instances do arise when correlation coefficients determined on the same two variables, but from two different samples, are to be statistically compared. Fisher's r-to-z transformation method can be adopted, whereby each sample correlation coefficient r (computed following Eq. 3.12) is converted into the approximately normally distributed quantity z_r = ½ ln[(1 + r)/(1 − r)]. Let zr1 and zr2 be the transformed values for the two data sets (x1, y1) and (x2, y2). Then calculate the random variable:

z_obs = (zr1 − zr2) / [1/(n1 − 3) + 1/(n2 − 3)]^{1/2}    (4.18)

Standard statistical significance tests can then be performed on the random variable z_obs. For example, say the observed value was z_obs = 2.15. If the level of significance is set at 0.05, the critical value for a two-tailed test would be z_critical = 1.96. Since z_obs > z_critical, the null hypothesis that the two population correlation coefficients are equal can be rejected.
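The comparison can be sketched as follows. The sample correlations and sizes below are hypothetical, chosen only to illustrate Eq. (4.18):

```python
import math

from scipy import stats

def fisher_z(r):
    """Fisher r-to-z transformation: z_r = 0.5 * ln[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Hypothetical sample correlations and sizes, for illustration only
r1, n1 = 0.75, 40
r2, n2 = 0.45, 50

# Eq. (4.18): difference of transformed correlations over its standard error
z_obs = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z_obs)))
print(round(z_obs, 3), round(p_two_tailed, 4))
```

Here z_obs exceeds 1.96, so these two hypothetical correlations would be judged significantly different at the 0.05 level.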

4.3 ANOVA Test for Multi-Samples

The statistical methods known as ANOVA (analysis of variance) are a broad set of widely used and powerful techniques meant to identify and measure sources of variation within a data set. This is done by partitioning the total variation in the data into its component parts. Specifically, ANOVA uses variance information from several samples to make inferences about the means of the populations from which these samples were drawn (and, hence, the appellation).

Fig. 4.9 Conceptual explanation of the basis of a single-factor ANOVA test: when H0 is true, the variation between sample means is small relative to the variation within samples; when H0 is false, it is large. (From Devore and Farnum 2005 with permission from Thomson Brooks/Cole)

Recall that the z-tests and t-tests described previously are used to test for differences in one parameter treated as a random variable (such as a mean value) between two independent groups, depending on whether the sample sizes are greater than or less than about 30, respectively. The two groups differ in some respect; they could be, say, two samples of 10 marbles each selected from a production line during different "time periods." The "time period" is the experimental variable, which is referred to as a factor in designed experiments and hypothesis testing. It is obvious that the cases treated in Sect. 4.2 are single-factor hypothesis tests (on mean and variance) involving single and two groups or samples; the extension to multiple groups is the single-factor (or one-way) ANOVA method. This section deals with the single-factor ANOVA method, which is a logical lead-in to multivariate techniques (discussed in Sect. 4.4) and to experimental design methods involving multiple factors or experimental variables (discussed in Chap. 6).

4.3.1 Single-Factor ANOVA

The ANOVA procedure uses just one test for comparing k sample means, just like that followed by the two-sample test. The following example allows a conceptual understanding of the approach. Say, four random samples of the same physical quantity have been selected, each from a different source. Whether the sample means differ enough to suggest different parent populations for the sources can be ascertained by comparing the variation between the four sample means to the within-sample variation. The more the sample means differ, the larger will be the between-samples variation, as shown in Fig. 4.9b, and the less likely is the probability that the samples arise from the same population. The reverse is true if the ratio of between-samples variation to within-samples variation is small (Fig. 4.9a). Single-factor ANOVA methods test a null hypothesis of the form:

H0: μ1 = μ2 = ... = μk
Ha: at least two of the μi's are different    (4.19)

Adopting the following notation:

Sample sizes: n1, n2, ..., nk
Sample means: x̄1, x̄2, ..., x̄k
Sample standard deviations: s1, s2, ..., sk
Total sample size: n = n1 + n2 + ... + nk
Grand average: ⟨x⟩ = weighted average of all n responses

one defines the between-sample variation, called the "treatment sum of squares⁷" (SSTr), as:

SSTr = Σ_{i=1}^{k} ni (x̄i − ⟨x⟩)²    with d.f. = k − 1    (4.20)

and the within-samples variation, or "error sum of squares" (SSE), as:

SSE = Σ_{i=1}^{k} (ni − 1) si²    with d.f. = n − k    (4.21)

Together these two sources of variation comprise the "total sum of squares" (SST):

SST = SSTr + SSE = Σ_{i=1}^{k} Σ_{j=1}^{ni} (xij − ⟨x⟩)²    with d.f. = n − 1    (4.22)

⁷ The term "treatment" was originally coined for historic reasons where one was interested in evaluating the effect of treatments or specific changes in material mix and processing of a product during its development.

Table 4.4 Amount of vibration (values in microns) for five brands of bearings tested on six motor samples (Example 4.3.1)ᵃ

Sample    | Brand 1 | Brand 2 | Brand 3 | Brand 4 | Brand 5
1         | 13.1    | 16.3    | 13.7    | 15.7    | 13.5
2         | 15.0    | 15.7    | 13.9    | 13.7    | 13.4
3         | 14.0    | 17.2    | 12.4    | 14.4    | 13.2
4         | 14.4    | 14.9    | 13.8    | 16.0    | 12.7
5         | 14.0    | 14.4    | 14.9    | 13.9    | 13.4
6         | 11.6    | 17.2    | 13.3    | 14.7    | 12.3
Mean      | 13.68   | 15.95   | 13.67   | 14.73   | 13.08
Std. dev. | 1.194   | 1.167   | 0.816   | 0.940   | 0.479

ᵃ Data available electronically on book website

Table 4.5 ANOVA table for Example 4.3.1

Source | d.f.        | Sum of squares | Mean square   | F-value
Factor | 5 − 1 = 4   | SSTr = 30.855  | MSTr = 7.714  | 8.44
Error  | 30 − 5 = 25 | SSE = 22.838   | MSE = 0.9135  |
Total  | 30 − 1 = 29 | SST = 53.694   |               |

SST is simply (n − 1)s², where s is the standard deviation of the combined set of all n data points. The statistic defined below as the ratio of two variances follows the F-distribution:

F = MSTr / MSE    (4.23)

where the mean between-sample variation is

MSTr = SSTr / (k − 1)    (4.24)

and the mean within-sample square error is

MSE = SSE / (n − k)    (4.25)

Recall that the p-value is the area of the F curve for (k − 1, n − k) degrees of freedom to the right of the F-value. If p-value ≤ α (the selected significance level), then the null hypothesis can be rejected. Note that the F-test is meant to be used for normal populations with equal population variances.

Example 4.3.1⁸ Comparing mean life of five motor bearings
A motor manufacturer wishes to evaluate five different motor bearings for motor vibration (which adversely results in reduced life). Each type of bearing is installed on a different random sample of six motors. The amount of vibration (in microns) is recorded when each of the 30 motors is running. The data obtained are assembled in Table 4.4. Determine from an F-test whether the bearing brands influence motor vibration at the α = 0.05 significance level.
In this example, the grand average ⟨x⟩ = 14.22, k = 5, and n = 30. The one-way ANOVA table is first generated as shown in Table 4.5 using Eqs. (4.20)-(4.25). For example,

SSTr = 6 × [(13.68 − 14.22)² + (15.95 − 14.22)² + (13.67 − 14.22)² + (14.73 − 14.22)² + (13.08 − 14.22)²] = 30.855

SSE = 5 × [1.194² + 1.167² + 0.816² + 0.940² + 0.479²] = 22.838

From the F-tables (Table A.6) and for α = 0.05, the critical F-value for d.f. = (4, 25) is Fc = 2.76, which is less than the F = 8.44 computed from the data. Hence, one is compelled to reject the null hypothesis that all five means are equal and conclude that the type of motor bearing does have a significant effect on motor vibration. In fact, this conclusion can be reached even at the more stringent significance level of α = 0.001.
The results of the ANOVA analysis can be conveniently illustrated by generating an effects plot, as shown in Fig. 4.10a. This clearly illustrates the relationship between the mean values of the response variable, i.e., the vibration level, for the five different motor bearing brands. Brand 5 gives the lowest average vibration, while Brand 2 has the highest. Note that such plots, though providing useful insights, are not generally a substitute for an ANOVA analysis. Another way of plotting the data is a means plot (Fig. 4.10b), which includes 95% CL intervals in addition to the information provided in Fig. 4.10a. Thus, a sense of the variation within samples can be gleaned. ■

⁸ From Devore and Farnum (2005) by permission of Cengage Learning.
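The single-factor ANOVA above can be verified with `scipy.stats.f_oneway`, and the same F-ratio can be assembled by hand from the sum-of-squares definitions:

```python
import numpy as np
from scipy import stats

# Vibration data (microns) from Table 4.4, one list per bearing brand
brands = [
    [13.1, 15.0, 14.0, 14.4, 14.0, 11.6],   # Brand 1
    [16.3, 15.7, 17.2, 14.9, 14.4, 17.2],   # Brand 2
    [13.7, 13.9, 12.4, 13.8, 14.9, 13.3],   # Brand 3
    [15.7, 13.7, 14.4, 16.0, 13.9, 14.7],   # Brand 4
    [13.5, 13.4, 13.2, 12.7, 13.4, 12.3],   # Brand 5
]

F, p_value = stats.f_oneway(*brands)        # F is about 8.44, p well below 0.001

# The same F assembled from Eqs. (4.20)-(4.25)
k, n = len(brands), sum(len(b) for b in brands)
grand = np.mean(np.concatenate(brands))
sstr = sum(len(b) * (np.mean(b) - grand) ** 2 for b in brands)
sse = sum((len(b) - 1) * np.var(b, ddof=1) for b in brands)
F_manual = (sstr / (k - 1)) / (sse / (n - k))
print(F, F_manual)
```

Both routes give the same F-ratio, confirming the entries of Table 4.5 to rounding.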


Fig. 4.10 (a) Effect plot. (b) Means plot showing the 95% CL intervals around the mean values of the 5 brands (Example 4.3.1)

Table 4.6 Pairwise analysis of the five samples following Tukey's HSD procedure (Example 4.3.2)

Samples | Distance               | Conclusionᵃ
1,2     | |13.68 − 15.95| = 2.27 | μi ≠ μj
1,3     | |13.68 − 13.67| = 0.01 |
1,4     | |13.68 − 14.73| = 1.05 |
1,5     | |13.68 − 13.08| = 0.60 |
2,3     | |15.95 − 13.67| = 2.28 | μi ≠ μj
2,4     | |15.95 − 14.73| = 1.22 |
2,5     | |15.95 − 13.08| = 2.87 | μi ≠ μj
3,4     | |13.67 − 14.73| = 1.06 |
3,5     | |13.67 − 13.08| = 0.59 |
4,5     | |14.73 − 13.08| = 1.65 | μi ≠ μj

ᵃ Indicated only if distance > critical value of 1.62

4.3.2 Tukey's Multiple Comparison Test

A limitation of the ANOVA test is that, in case the null hypothesis is rejected, one is unable to determine the exact cause. For example, one poor motor bearing brand could have been the cause of this rejection in the example above even though there is no significant difference among the other four brands. Thus, one needs to be able to pinpoint the culprit. One could, of course, perform paired comparisons of two brands at a time. In the case of 5 sets, one would then make 10 such tests. Apart from the tediousness of such a procedure, making independent paired comparisons leads to a decrease in sensitivity, i.e., type I errors are magnified.⁹ Rigorous classical procedures that allow multiple comparisons to be made simultaneously have been proposed for this purpose (see Manly 2005). One such method is discussed in Sect. 4.4.2. In this section, Tukey's HSD (honestly significant difference) procedure is described, which is used to test the differences among multiple sample means by pairwise testing. It is limited to cases of equal sample sizes. This procedure allows the simultaneous formation of prespecified confidence intervals for all paired comparisons using the Student t-distribution. Separate tests are conducted to determine whether μi = μj for each pair (i, j) of means in an ANOVA study of k population means. Tukey's procedure is based on comparing the distance (or absolute value) between any two sample means |x̄i − x̄j| to a threshold value T that depends on the significance level α as well as on the mean square error (MSE) from the ANOVA test. The T-value is calculated as:

T = qα (MSE/ni)^{1/2}    (4.26a)

where ni is the size of the sample drawn from each brand population, and the qα values, called the Studentized range distribution values, are given in Table A.8 for α = 0.05 for d.f. = (k, n − k). If |x̄i − x̄j| > T, then one concludes that μi ≠ μj at the corresponding significance level. Otherwise, one concludes that there is no difference between the two means. Tukey also suggested a convenient visual representation to keep track of the results of all these pairwise tests. Tukey's HSD procedure and this representation are illustrated in the following example.

⁹ This can be shown mathematically but is beyond the scope of this text.

Example 4.3.2¹⁰ Using the same data as in Example 4.3.1, conduct a multiple comparison procedure to distinguish which of the motor bearing brands are superior to the rest.
Following Tukey's HSD procedure given by Eq. (4.26a), the critical distance between sample means at α = 0.05 is:

¹⁰ From Devore and Farnum (2005) by permission of Cengage Learning.


Fig. 4.11 Graphical depiction summarizing the ten pairwise comparisons following Tukey's HSD procedure, with the five sample means arranged in ascending order: Brand 5 (13.08), Brand 3 (13.67), Brand 1 (13.68), Brand 4 (14.73), Brand 2 (15.95). Brand 2 is significantly different from Brands 1, 3, and 5, and so is Brand 4 from Brand 5 (Example 4.3.2)

T = qα (MSE/ni)^{1/2} = 4.15 × (0.913/6)^{1/2} = 1.62

where qα is found by interpolation from Table A.8 based on d.f. = (k, n − k) = (5, 25). The pairwise distances between the five sample means shown in Table 4.6 can then be determined and appropriate inferences made. Note that the number of pairwise comparisons is [k(k − 1)/2], which in this case of k = 5 equals the 10 comparisons shown in the table. The distance between the following pairs is less than T = 1.62: {1,3}, {1,4}, {1,5}, {2,4}, {3,4}, {3,5}. This information is visually summarized in Fig. 4.11 by arranging the five sample means in ascending order and then drawing rows of bars connecting the pairs whose distances do not exceed T = 1.62. It is now clear that though Brand 5 has the lowest mean value, it is not significantly different from Brands 1 and 3. Hence, the final selection of a motor bearing with low vibration can be made from these three brands only (Brands 1, 3, and 5). ■

The Tukey method is said to be too conservative when the intent is to compare the means of (k − 1) samples against a single sample taken to be the "control" or reference. An alternative multi-comparison test suitable in such instances is Dunnett's method (see, for example, Devore and Farnum 2005). Instead of performing pairwise comparisons among all the samples, this method computes the critical T-value following:

T = tα [MSE (1/ni + 1/nc)]^{1/2}    (4.26b)

where ni and nc are the sizes of the samples drawn from the individual groups and the control group respectively, and tα is the critical value found from Table A.9 based on d.f. = (k − 1, n − k) degrees of freedom. Thus, the number of pairwise comparisons reduces to (k − 1) tests as against [k(k − 1)/2] for the Tukey method. For the example above, the number of tests is only 4 instead of 10 for the Tukey approach.
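Tukey's procedure from Example 4.3.2 can be sketched as below. The quantile qα is taken from SciPy's `studentized_range` distribution (available in SciPy 1.7 and later) in place of the Table A.8 lookup:

```python
from itertools import combinations

from scipy.stats import studentized_range

# Sample means from Table 4.4 and MSE from the ANOVA table (Table 4.5)
means = {1: 13.68, 2: 15.95, 3: 13.67, 4: 14.73, 5: 13.08}
mse, n_i, k, n = 0.9135, 6, 5, 30

# Eq. (4.26a): critical distance using the studentized range quantile q_alpha
q = studentized_range.ppf(0.95, k, n - k)   # replaces interpolation in Table A.8
T = q * (mse / n_i) ** 0.5                  # about 1.62

for i, j in combinations(sorted(means), 2):
    d = abs(means[i] - means[j])
    verdict = "mu_i != mu_j" if d > T else "no significant difference"
    print(f"brands {i},{j}: |diff| = {d:.2f} -> {verdict}")
```

The loop reproduces the ten comparisons of Table 4.6, flagging pairs {1,2}, {2,3}, {2,5}, and {4,5} as significantly different.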

4.4 Tests of Significance of Multivariate Data

4.4.1 Introduction to Multivariate Methods

Multivariate statistical analysis (also called multifactor analysis) deals with statistical inference as applied to multiple parameters (each considered to behave like a random variable) deduced from one or several samples taken from one or several populations. Multivariate methods can be used to make inferences about parameters such as sample means and variances. Rather than treating each parameter or variable separately, as done in t-tests and single-factor ANOVA, multivariate inferential methods allow the analysis of multiple variables simultaneously as a system of measurements. This generally results in sounder inferences, a point elaborated below.
The univariate probability distributions presented in Sect. 2.4 can also be extended to bivariate and multivariate distributions. Let x1 and x2 be two variables of the same type, say both discrete (for continuous variables, the summations in the equations below need to be replaced with integrals). Their joint distribution satisfies:

f(x1, x2) ≥ 0  and  Σ_{all (x1, x2)} f(x1, x2) = 1    (4.27)

Consider two sets of multivariate data, each consisting of p variables. However, the sets could be different in size, i.e., the number of observations in each set may be different, say n1 and n2. Let X̄1 and X̄2 be the sample mean vectors of dimension p. For example,

X̄1 = (x̄11, x̄12, ..., x̄1i, ..., x̄1p)    (4.28)

where x̄1i is the sample average over the n1 observations of parameter i for the first set. Further, let C1 and C2 be the sample covariance matrices of size (p × p) for the two sets respectively (the basic concepts of covariance and correlation were presented in Sect. 3.4.2). Then, the sample matrix of variances and covariances for the first data set is given by:


Fig. 4.12 Two bivariate normal distributions and associated 50% and 90% contours assuming equal standard deviations for both variables. However, the left-hand plot (a) presumes the two variables to be uncorrelated, while in the right-hand plot (b) the two variables have a correlation coefficient of 0.75, which results in elliptical contours. (From Johnson and Wichern 1988 by permission of Pearson Education)

C1 = [ c11  c12  ⋯  c1p
       c21  c22  ⋯  c2p
        ⋮    ⋮   ⋱   ⋮
       cp1  cp2  ⋯  cpp ]    (4.29)

where cii is the variance for parameter i and cik the covariance for parameters i and k. Similarly, the sample correlation matrix, whose diagonal elements are equal to unity and whose other terms are scaled appropriately, is given by:

R1 = [ 1    r12  ⋯  r1p
       r21  1    ⋯  r2p
        ⋮    ⋮   ⋱   ⋮
       rp1  rp2  ⋯  1   ]    (4.30)

Both matrices describe the degree of association between each pair of variables, and they are symmetric about the diagonal since, say, r12 = r21, and so on. This redundancy is simply meant to allow easier reading. These matrices provide a convenient visual representation of the extent to which the different sets of variables are correlated with each other, thereby allowing strongly correlated sets to be easily identified. Note that correlations are not affected by shifting and scaling the data. Thus, standardizing the variables by subtracting the mean from each observation and dividing by the standard deviation will still retain the correlation structure

of the original data set while providing certain convenient interpretations of the results. Underlying assumptions for multivariate tests of significance are that the two samples have close to multivariate normal distributions with equal population covariance matrices. The multivariate normal distribution is a generalization of the univariate normal distribution when p ≥ 2 where p is the number of dimensions or parameters. Figure 4.12 illustrates how the bivariate normal distribution is distorted in the presence of correlated variables. The contour lines are circles for uncorrelated variables and ellipses for correlated ones.
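The claim that shifting and scaling leave the correlation structure unchanged is easily checked numerically. The sketch below (an illustrative check with synthetic data, not from the text) compares the correlation matrix of raw data with that of the standardized data:

```python
import numpy as np

# Illustrative check: standardizing variables (subtract mean, divide by
# standard deviation) leaves the correlation matrix unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                 # 100 observations of p = 3 variables
z = (x - x.mean(axis=0)) / x.std(axis=0)      # standardized variables

R_x = np.corrcoef(x, rowvar=False)            # correlation matrix of raw data
R_z = np.corrcoef(z, rowvar=False)            # correlation matrix of standardized data
print(np.allclose(R_x, R_z))                  # True
```

The diagonal elements of both matrices equal unity, as in Eq. (4.30).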

4.4.2 Hotelling T² Test

The simplest extension of univariate statistical tests is to the situation when two or more samples are evaluated to determine whether they originate from populations with: (i) different means, and (ii) different variances/covariances. One can distinguish between the following types of multivariate inference tests involving more than one parameter (Manly 2005): (a) comparison of several mean values from two samples, which is best done using the Hotelling T²-test; (b) comparison of variances for two samples (several procedures have been proposed; the best known are Box's M-test, Levene's test based on the T²-test, and the Van Valen test); (c) comparison of mean values for several samples (several tests are available; the best known are the Wilks' lambda statistic test, Roy's largest root test, and Pillai's trace statistic test); and (d) comparison of variances for several samples (using Box's M-test). Only case (a) will be described here; the others are treated in texts such as Manly (2005).

Consider two samples with sample sizes n1 and n2. One wishes to compare differences in p random variables between the two samples. Let X̄1 and X̄2 be the mean vectors of the two samples. A pooled estimate of the covariance matrix is:

$$C = \frac{(n_1 - 1)C_1 + (n_2 - 1)C_2}{n_1 + n_2 - 2} \qquad (4.31)$$

where C1 and C2 are the covariance matrices given by Eq. (4.29). Then, the Hotelling T²-statistic is defined as:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{X}_1 - \bar{X}_2)'\,C^{-1}\,(\bar{X}_1 - \bar{X}_2) \qquad (4.32)$$

A large numerical value of this statistic suggests that the two population mean vectors are different. The null hypothesis test uses the transformed statistic:

$$F = \frac{(n_1 + n_2 - p - 1)\,T^2}{(n_1 + n_2 - 2)\,p} \qquad (4.33)$$

which follows the F-distribution with p and (n1 + n2 - p - 1) degrees of freedom. Since the T²-statistic is quadratic, it can also be written in double-sum notation as:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \sum_{i=1}^{p} \sum_{k=1}^{p} (\bar{x}_{1i} - \bar{x}_{2i})\,c^{ik}\,(\bar{x}_{1k} - \bar{x}_{2k}) \qquad (4.34)$$

where c^{ik} is the (i, k) element of the inverse matrix C⁻¹. The following solved example serves to illustrate the use of the above equations.

Example 4.4.1¹¹ Comparing mean values of two samples by pairwise and by Hotelling T² procedures
Consider two samples of 5 parameters (p = 5) with paired samples. Sample 1 has 21 observations and sample 2 has 28. The mean vectors and covariance matrices of both samples have been calculated and are shown below:

$$\bar{X}_1 = \begin{pmatrix} 157.381 \\ 241.000 \\ 31.433 \\ 18.500 \\ 20.810 \end{pmatrix} \quad \text{and} \quad
C_1 = \begin{pmatrix}
11.048 & 9.100 & 1.557 & 0.870 & 1.286 \\
9.100 & 17.500 & 1.910 & 1.310 & 0.880 \\
1.557 & 1.910 & 0.531 & 0.189 & 0.240 \\
0.870 & 1.310 & 0.189 & 0.176 & 0.133 \\
1.286 & 0.880 & 0.240 & 0.133 & 0.575
\end{pmatrix}$$

$$\bar{X}_2 = \begin{pmatrix} 158.429 \\ 241.571 \\ 31.479 \\ 18.446 \\ 20.839 \end{pmatrix} \quad \text{and} \quad
C_2 = \begin{pmatrix}
15.069 & 17.190 & 2.243 & 1.746 & 2.931 \\
17.190 & 32.550 & 3.398 & 2.950 & 4.066 \\
2.243 & 3.398 & 0.728 & 0.470 & 0.559 \\
1.746 & 2.950 & 0.470 & 0.434 & 0.506 \\
2.931 & 4.066 & 0.559 & 0.506 & 1.321
\end{pmatrix}$$

If one performed paired t-tests with each parameter taken one at a time (as described in Sect. 4.2.3), one would compute the pooled variance for the first parameter as:

$$s_1^2 = \frac{(21 - 1)(11.048) + (28 - 1)(15.069)}{21 + 28 - 2} = 13.36$$

and the t-statistic as:

$$t = \frac{157.381 - 158.429}{\sqrt{13.36\left(\dfrac{1}{21} + \dfrac{1}{28}\right)}} = -0.99$$

with (21 + 28 - 2 =) 47 degrees of freedom. This is not significantly different from zero, as one can note from the p-value indicated in Table A.4. Table 4.7 assembles similar results for all the other parameters. One would conclude that none of the five parameters in the two data sets are statistically different.

¹¹ From Manly (2005) by permission of CRC Press.
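As a cross-check, the pairwise computation for the first parameter can be reproduced in a few lines. This is an illustrative sketch using the values of Example 4.4.1:

```python
from math import sqrt

# Pooled-variance t-test for the first parameter of Example 4.4.1.
n1, n2 = 21, 28
m1, m2 = 157.381, 158.429    # sample means of parameter 1
v1, v2 = 11.048, 15.069      # sample variances of parameter 1

s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
t = (m1 - m2) / sqrt(s2 * (1 / n1 + 1 / n2))           # t-statistic, 47 d.f.
print(round(s2, 2), round(t, 2))  # 13.36 -0.99
```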


Table 4.7 Paired t-tests for each of the five parameters taken one at a time (Example 4.4.1)

Parameter   First data set        Second data set       t-value        p-value
            Mean     Variance     Mean     Variance     (d.f. = 47)
1           157.38   11.05        158.43   15.07        -0.99          0.327
2           241.00   17.50        241.57   32.55        -0.39          0.698
3           31.43    0.53         31.48    0.73         -0.20          0.842
4           18.50    0.18         18.45    0.43          0.33          0.743
5           20.81    0.58         20.84    1.32         -0.10          0.921

To perform the multivariate test, one first calculates the pooled sample covariance matrix (Eq. 4.31):

$$C = \frac{20\,C_1 + 27\,C_2}{47} = \begin{pmatrix}
13.358 & 13.748 & 1.951 & 1.373 & 2.231 \\
13.748 & 26.146 & 2.765 & 2.252 & 2.710 \\
1.951 & 2.765 & 0.645 & 0.350 & 0.423 \\
1.373 & 2.252 & 0.350 & 0.324 & 0.347 \\
2.231 & 2.710 & 0.423 & 0.347 & 1.004
\end{pmatrix}$$

where, for example, the first entry is (20 × 11.048 + 27 × 15.069)/47 = 13.358. The inverse of the matrix C is:

$$C^{-1} = \begin{pmatrix}
0.2061 & -0.0694 & -0.2395 & 0.0785 & -0.1969 \\
-0.0694 & 0.1234 & -0.0376 & -0.5517 & 0.0277 \\
-0.2395 & -0.0376 & 4.2219 & -3.2624 & -0.0181 \\
0.0785 & -0.5517 & -3.2624 & 11.4610 & -1.2720 \\
-0.1969 & 0.0277 & -0.0181 & -1.2720 & 1.8068
\end{pmatrix}$$

Substituting the elements of this matrix in Eq. (4.34) leads to:

$$T^2 = \frac{(21)(28)}{21 + 28}\big[(157.381 - 158.429)(0.2061)(157.381 - 158.429) + (157.381 - 158.429)(-0.0694)(241.000 - 241.571) + \cdots + (20.810 - 20.839)(1.8068)(20.810 - 20.839)\big] = 2.824$$

which, from Eq. (4.33), results in an F-statistic:

$$F = \frac{(21 + 28 - 5 - 1)(2.824)}{(21 + 28 - 2)(5)} = 0.517$$

with d.f. = (5, 43).

This is clearly not significant since Fcritical = 2.4 (from Table A.6), and so there is no evidence to support the claim that the population means of the two groups are statistically different when all five parameters are considered simultaneously. In this case one could have drawn such a conclusion directly from Table 4.7 by looking at the pairwise p-values, but this may not always be so. ■

Other than the elegance provided, there are two distinct advantages of performing a single multivariate test as against a series of univariate tests: the probability of finding a type-I (false positive) result purely by accident increases as the number of variables increases, and the multivariate test takes proper account of the correlation between variables. The above example illustrated the case where no significant differences in population means could be discerned either from univariate tests performed individually or from an overall multivariate test. However, there are instances when the latter test turns out to be significant as a result of the cumulative effects of all parameters even though no single parameter is significantly different. The converse may also hold: the evidence provided by one significantly different parameter may be swamped by the lack of differences among the other parameters. Hence, it is advisable to perform both types of tests, as illustrated in the example above.

Sections 4.2, 4.3, and 4.4 treated several cases of hypothesis testing. An overview of these cases is provided in Fig. 4.13 for greater clarity, with the specific sub-section of each case indicated. The ANOVA case corresponds to the lower right box, namely testing for differences in the means of a single factor or variable sampled from several populations, while the Hotelling T²-test corresponds to the case when the means of several variables from two samples are evaluated.
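The multivariate computation of Example 4.4.1 can also be reproduced numerically. The following sketch (illustrative, not part of the original text) assembles Eqs. (4.31) through (4.33) with NumPy, using the sample statistics given in the example:

```python
import numpy as np

# Hotelling T^2 test of Eqs. (4.31)-(4.33) for the Example 4.4.1 data.
n1, n2, p = 21, 28, 5

xbar1 = np.array([157.381, 241.000, 31.433, 18.500, 20.810])
xbar2 = np.array([158.429, 241.571, 31.479, 18.446, 20.839])

C1 = np.array([[11.048,  9.100, 1.557, 0.870, 1.286],
               [ 9.100, 17.500, 1.910, 1.310, 0.880],
               [ 1.557,  1.910, 0.531, 0.189, 0.240],
               [ 0.870,  1.310, 0.189, 0.176, 0.133],
               [ 1.286,  0.880, 0.240, 0.133, 0.575]])
C2 = np.array([[15.069, 17.190, 2.243, 1.746, 2.931],
               [17.190, 32.550, 3.398, 2.950, 4.066],
               [ 2.243,  3.398, 0.728, 0.470, 0.559],
               [ 1.746,  2.950, 0.470, 0.434, 0.506],
               [ 2.931,  4.066, 0.559, 0.506, 1.321]])

C = ((n1 - 1) * C1 + (n2 - 1) * C2) / (n1 + n2 - 2)      # Eq. (4.31)
d = xbar1 - xbar2
T2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(C, d)   # Eq. (4.32)
F = (n1 + n2 - p - 1) * T2 / ((n1 + n2 - 2) * p)         # Eq. (4.33)
print(round(T2, 3), round(F, 3))  # close to the text's values of 2.824 and 0.517
```

Note that `np.linalg.solve` is used instead of explicitly forming C⁻¹, which is numerically preferable for computing the quadratic form.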
As noted above, the formal use of statistical methods can become very demanding mathematically and computationally when multivariate and multiple samples are considered; hence the advantage of numerically based resampling methods (discussed in Sect. 4.8).


(Figure: a tree diagram classifying the hypothesis tests of this chapter. One sample, one variable: mean/proportion (4.2.2/4.2.4(a)), variance (4.2.5(a)), probability distribution (4.2.6); one sample, two variables: correlation coefficient (4.2.7), non-parametric (4.5.3). Two samples, one variable: mean/proportion (4.2.3(a), 4.2.3(b)/4.2.4(b)), variance (4.2.5(b)), non-parametric (4.5.1); two samples, multivariate mean: Hotelling T² (4.4.2). Multiple samples, one variable mean: ANOVA (4.3), non-parametric (4.5.2).)

Fig. 4.13 Overview of the various types of parametric hypothesis tests treated in this chapter, along with section numbers. Non-parametric tests are also covered briefly, as indicated. The term "variable" is used instead of "parameter" since the latter are treated as random variables

4.5 Non-Parametric Tests

The parametric tests described above have implicit built-in assumptions regarding the distributions from which the samples are drawn. Comparison of samples from populations using the t-test and F-test can yield misleading results when the random variables being measured are not normally distributed and do not have equal variances. Obviously, the fewer the assumptions, the broader the potential applications of a test. One would like the significance tests used to lead to sound conclusions, or at least for the risk of reaching misleading/wrong conclusions to be minimized. Two concepts relate to this aspect. The robustness of a test is the degree to which it remains insensitive to violations of its underlying assumptions. The power of a test, on the other hand, is the probability that it correctly rejects a false null hypothesis; a more powerful test extracts more information from a given sample and thereby reduces the cost of experimentation.

Non-parametric tests can be applied to continuous numerical measurements but tend to be more widely used with categorical or ordinal data, i.e., data that can only be ranked in order of magnitude. Such data occur frequently in management, the social sciences, and psychology. For example, a consumer survey respondent may rate one product as better than another but be unable to assign quantitative values to each product. Data involving such "preferences" cannot be subjected to the t- and F-tests; it is in such cases that one must resort to non-parametric statistics. Rather than use the actual numbers, non-parametric tests usually sort the data by rank (or score) and discard the specific numerical values. Because non-parametric tests do not use all the information contained in the data, they are generally less powerful than parametric ones (i.e., they tend to lead to wider confidence intervals). On the other hand, they are more intuitive, simpler to perform, more robust, and less sensitive to outlier points when the data are noisy or when the underlying empirical distribution of the parameters is non-Gaussian. Much of the material discussed below can be found in statistical textbooks; for example, McClave and Benson (1988), Devore and Farnum (2005), Walpole et al. (2007), and Heiberger and Holland (2015).

The non-parametric tests described in this section are meant to evaluate whether the medians of drawn samples/groups are close enough to conclude that they have emanated from the same population. The tests are predicated on the homogeneity-of-variance assumption, i.e., that the population variances of the dependent variable are equal within all groups, with the same Gaussian or non-Gaussian distribution shape. Whether this is credible could be evaluated by visually inspecting the sample distributions. A better approach is to use Levene's variance test, which is based on the fact that the absolute differences between all scores and their (group) means should be roughly equal over groups. A larger variance implies that, on average, the data values are "further away" from their mean, and vice versa. Thus, this test is very similar to a one-factor ANOVA test based, not on the actual data, but on the absolute difference scores.

4.5.1 Signed and Rank Tests for Medians

Some of the common non-parametric tests for the medians of one-sample and two-sample data are described below. No restriction is placed on the empirical distributions other than that they be continuous and symmetric (so that the mean and median are approximately equal).

(a) The Sign Test for medians

The median is the value about which one expects 50% of the observations to lie above and 50% below. Each observation is converted to either a + sign or a - sign depending on whether its value is above or below the hypothesized median. One would expect half of the signs to be positive (+) and half to be negative (-). A statistical test is performed on the actual number of + and - values to draw an inference. The procedure is best illustrated by means of an example.

Example 4.5.1 A building maintenance technician claims that he can repair most non-major mechanical breakdowns of the HVAC system within a median time of 6 h. The maintenance supervisor records the following response times for the last seven events (assumed independent):

Response time (h)          5.2    6.7    5.5    5.8    6.3    5.6
Difference from median    -0.8   +0.7   -0.5   -0.2   +0.3   -0.4
Sign of difference          -      +      -      -      +      -

In the observed sample of 7 events, there are 2 instances when the response time is greater than the median value of 6 h, i.e., a proportion of (2/7 = 0.286). The probability that two "successes" occurred purely by chance can be ascertained by using the binomial distribution (Eq. 2.37a) with p = 0.5, since only one of two outcomes is possible (the response time is either greater or less than the stated median):

$$B(2; 7, 0.5) = \binom{7}{2}\left(\frac{1}{2}\right)^2\left(1 - \frac{1}{2}\right)^{7-2} = \frac{7 \times 6}{2} \times \left(\frac{1}{2}\right)^7 = 0.164, \; \text{i.e., } 16.4\%$$

Similarly, B(0; 7, 0.5) = 0.5⁷ = 0.0078 and B(1; 7, 0.5) = 0.0547. Finally, the probability that two or fewer successes occurred purely by chance is 0.0078 + 0.0547 + 0.164 = 0.227. Since this value of 0.227 is less than 0.286, the null hypothesis that the median value of the response times is less than 6 h should be rejected. ■

When the number of observations is large, the calculations may become tedious. In such cases, it is convenient to use the binomial distribution tables (Appendix A1); alternatively, the normal distribution is a good approximation for ease of computation.
(b) Mann–Whitney or Wilcoxon rank sum test for medians of two samples

The sign test can be said to discard too much of the information in the data. A stronger test is the Mann–Whitney test (also referred to as the Wilcoxon rank sum test), used to evaluate whether the medians of two separate samples have been drawn from the same population. It is thus the non-parametric version of the two-sample t-test for sample means (Sect. 4.2.3). The sampled data must be continuous, and the sampled populations should be close to symmetric and can be either Gaussian or non-Gaussian (if Gaussian, parametric tests are preferable). Strictly speaking, the Mann–Whitney test involves ranking the individual observations of both samples combined, and then summing the ranks of the two groups separately. A test is performed on the two sums to deduce whether the two samples come from the same population. While simple and intuitive, the test is nonetheless grounded in statistical theory. The following example illustrates the approach.

Example 4.5.2 Ascertaining whether oil company researchers and academics differ in their predictions of future atmospheric carbon dioxide levels
The intent is to compare the predictions of the change in atmospheric carbon dioxide levels between researchers who are employed by oil companies and those who are in academia. The data gathered, shown in Table 4.8, are the percentage increases in carbon dioxide from the current level over the next 10 years as predicted by six oil company researchers and seven academics. Perform a statistical test at the 0.05 significance level to evaluate the following hypotheses: (i) predictions made by oil company researchers differ from those made by academics; (ii) predictions made by oil company researchers tend to be lower than those made by academics.

Table 4.8 Wilcoxon rank sum test calculation for two independent samples (Example 4.5.2)

      Oil company researchers         Academics
      Prediction (%)    Rank          Prediction (%)    Rank
1     3.5               4             4.7               6
2     5.2               7             5.8               9
3     2.5               2             3.6               5
4     5.6               8             6.2               11
5     2.0               1             6.1               10
6     3.0               3             6.3               12
7     –                 –             6.5               13
Sum                     25                              66

(i) First, both groups are combined into a single group, and ranks are assigned to the combined group starting from rank 1 for the lowest value and continuing in ascending order. Next, the ranks of each group are separately tabulated and summed. Since there are 13 predictions, the ranks run from 1 through 13, as shown in the table. The test statistic is based on the rank sums of each group separately (hence the name of the test). If the sums are close, the implication is that there is no evidence that the probability distributions of the two groups are different, and vice versa. Let TA and TB be the rank sums of the two groups. Then, the sum of all the individual ranks is:

$$T_A + T_B = \frac{n(n+1)}{2} = \frac{13(13+1)}{2} = 91 \qquad (4.35)$$

where n = n1 + n2 with n1 = 6 and n2 = 7. Note that n1 should be taken as the sample with fewer observations. Since (TA + TB) is fixed, a small value of TA implies a large value of TB, and vice versa. Hence, the greater the difference between the two rank sums, the greater the evidence that the samples come from different populations. Since one is testing whether the predictions of the two groups differ, the two-tailed significance test is appropriate. Table A.11 provides the lower and upper cutoff values for different values of n1 and n2 for both one-tailed and two-tailed tests. The lower and upper cutoff values are (28, 56) at the 0.05 significance level for the two-tailed test. The computed statistics TA = 25 and TB = 66 fall outside this range; the null hypothesis is therefore rejected, and one would conclude that the predictions of the two groups are different.

(ii) Here one wishes to test the hypothesis that the predictions made by oil company researchers are lower than those made by academics. One then uses a one-tailed test whose cutoff values are given in part (b) of Table A.11. These cutoff values at 0.05 significance
level are (30, 54), but only the lower value of 30 is used for the problem as specified. The null hypothesis will be rejected only if TA < 30. Since this is so, the above data suggest that the null hypothesis can be rejected at a significance level of 0.05. ■

(c) The Wilcoxon signed rank sum test for medians of two paired samples

This test is meant for paired data where the samples taken are not independent. It is analogous to the two-sample paired difference test treated in Sect. 4.2.3(b), where the paired differences in the data are converted to a single random variable. Again, the sampled data must be continuous, and the sampled population should be close to symmetric. The Wilcoxon signed rank sum test involves calculating the differences between the paired data, ranking their absolute values, and then summing the ranks of the positive and negative differences separately. A statistical test is finally applied to decide whether the two distributions are significantly different. This is illustrated by the following example.

Example 4.5.3 Evaluating the predictive accuracy of two climate change models from expert elicitation
A policy maker wishes to evaluate the predictive accuracy of two different climate change models for predicting short-term (e.g., 30 years) carbon dioxide changes in the atmosphere. He consults 10 experts and asks them to grade these models on a scale from 1 to 10, with 10 being extremely accurate. Clearly, these data are not independent since the same expert is asked to make two value judgments about the models being evaluated. The data shown in Table 4.9 are obtained (note that, except for the last column, these are not ranked values but grades from 1 to 10 assigned by the experts). The tests are on the medians, which are approximately equal to the means for symmetric distributions.


Null hypothesis H0: μ1 - μ2 = 0 (there is no significant difference between the distributions)
Alternative hypothesis Ha: μ1 - μ2 ≠ 0 (there is a significant difference between the distributions)

The paired differences are first computed (as shown in Table 4.9), the ranks are then generated based on the absolute differences, and finally the sums of the positive and negative ranks are computed. Note how the ranking has been assigned, since there are repeats in the absolute-difference values. There are three "1"s in the absolute-difference column, so the mean rank "2" has been assigned to all three. Similarly, the three absolute differences of "2" are each given the rank "5," and so on. The highest absolute difference of "6" is assigned the rank "10." The values shown in the last two rows of the table are also simple to deduce: one adds up the rank values corresponding to the cases when the difference (A - B) is positive, and likewise when they are negative. These sums are found to be 46 and 9, respectively.

The test statistic for the null hypothesis is T = min(T-, T+); in our case, T = 9. The smaller the value of T, the stronger the evidence that the difference between the two distributions is important. The rejection region for T is determined from Table A.12. The two-tailed critical value for n = 10 at the 0.05 significance level is 8. Since the computed value of T is higher, one cannot reject the null hypothesis, and one would conclude that there is not enough evidence to suggest that one model is more accurate than the other at the 0.05 significance level. Note that if a significance level of 0.10 had been selected, the null hypothesis would have been rejected.

Looking at the ratings shown in Table 4.9, one notices that they seem generally higher for model A than for model B. If one wished to test the hypothesis, at a significance level of 0.05, that the experts deem model B to be less accurate than model A, one would use T- as the test statistic and compare it to the one-tailed critical value in Table A.12. Since the critical value is 11 for n = 10, which is greater than 9, one would reject the null hypothesis. This example illustrates the fact that it is important to frame the problem correctly in terms of whether a one-tailed or a two-tailed test is more appropriate. ■
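The rank sums T+ and T- of Table 4.9 can be reproduced in a few lines of code. The sketch below (illustrative, using the expert grades from the example) assigns mid-ranks to tied absolute differences, as done in the table:

```python
# Wilcoxon signed rank sums for the paired grades of Example 4.5.3.
a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]   # model A grades
b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]   # model B grades
d = [x - y for x, y in zip(a, b)]    # paired differences (A - B)

# Rank the absolute differences, assigning mid-ranks to ties.
abs_sorted = sorted(abs(x) for x in d)
def midrank(v):
    lo = abs_sorted.index(v) + 1            # first 1-based position of v
    hi = lo + abs_sorted.count(v) - 1       # last 1-based position of v
    return (lo + hi) / 2

t_plus = sum(midrank(abs(x)) for x in d if x > 0)
t_minus = sum(midrank(abs(x)) for x in d if x < 0)
print(t_plus, t_minus)  # 46.0 9.0
```

The two sums total n(n + 1)/2 = 55, and the test statistic is T = min(T-, T+) = 9.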

4.5.2 Kruskal–Wallis Multiple Samples Test for Medians

Recall that the single-factor ANOVA test was described in Sect. 4.3.1 for inferring whether mean values from several samples emanate from the same population, under the necessary assumption of normal distributions. The Kruskal–Wallis H test (Kruskal and Wallis 1952) is the non-parametric or distribution-free equivalent of the F-test used in one-factor ANOVA but applies to the medians. It can also be viewed as the extension or generalization of the rank sum test to more than two groups. Hence, the test applies when one wishes to compare more than two groups whose distributions should be symmetric but need not be normal. Again, the evaluation is based on rank sums, where the ranking is made on the samples of all k groups combined. The test is framed as follows:

Null hypothesis H0: all populations have identical probability distributions
Alternative hypothesis Ha: the probability distributions of at least two populations are different

Let R1, R2, R3 denote the rank sums of, say, three samples. The H-test statistic measures the extent to which the samples differ with respect to their relative ranks, and is given by:

Table 4.9 Wilcoxon signed rank sum test calculation for paired non-independent samples (Example 4.5.3)

Expert   Model A   Model B   Difference (A - B)   Absolute difference   Rank
1        6         4          2                   2                     5
2        8         5          3                   3                     7.5
3        4         5         -1                   1                     2
4        9         8          1                   1                     2
5        4         1          3                   3                     7.5
6        7         9         -2                   2                     5
7        6         2          4                   4                     9
8        5         3          2                   2                     5
9        6         7         -1                   1                     2
10       8         2          6                   6                     10
                              Sum of positive ranks T+ = 46
                              Sum of negative ranks T- = 9


Table 4.10 Data table for Example 4.5.4ᵃ

      Agriculture           Manufacturing          Service
      # Employees   Rank    # Employees   Rank     # Employees   Rank
1     10            5       244           25       17            9.5
2     350           27      93            19       249           26
3     4             2       3532          30       38            15
4     26            13      17            9.5      5             3
5     15            8       526           29       101           20
6     106           21      133           22       1             1
7     18            11      14            7        12            6
8     23            12      192           23       233           24
9     62            17      443           28       31            14
10    8             4       69            18       39            16
Sum                 R1 = 120              R2 = 210.5             R3 = 134.5

ᵃ Data available electronically on book website

$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \qquad (4.36)$$

where k is the number of groups, nj is the number of observations in the jth sample, and n is the total sample size (n = n1 + n2 + ⋯ + nk). The factor 12 arises naturally from the expression for the sample variance of the ranks of the outcomes (for the mathematical derivation, see Kruskal and Wallis 1952). Thus, if the H-statistic is close to zero, one would conclude that all groups have the same mean rank, and vice versa. The distribution of the H-statistic is approximated by the chi-square distribution, which is used to make statistical inferences. The following example illustrates the approach.

Example 4.5.4¹² Evaluating the probability distributions of the number of employees in three different occupations using a non-parametric test
One wishes to compare, at a significance level of 0.05, the number of employees in companies representing each of three different business classifications, namely agriculture, manufacturing, and service. Samples from ten companies in each class were gathered, as shown in Table 4.10. Since the distributions are unlikely to be normal (for example, one notices some very large values in the first and third columns), a non-parametric test is appropriate. First, the ranks for all samples from the three classes combined are generated, as tabulated in the 2nd, 4th, and 6th columns. The rank sums Rj are also computed and shown in the last row. Note that n = 30, while nj = 10. The test statistic H is computed next:

$$H = \frac{12}{30(31)}\left(\frac{120^2}{10} + \frac{210.5^2}{10} + \frac{134.5^2}{10}\right) - 3(31) = 99.097 - 93 = 6.097$$

The degrees of freedom are the number of groups minus one, or d.f. = 3 - 1 = 2. From the chi-square tables (Table A.5), the two-tailed critical value at α = 0.05 is 5.991. Since the computed H value exceeds this threshold, one would reject the null hypothesis at 95% CL, and conclude that at least two of the three probability distributions describing the number of employees in the sectors are different. However, the verdict is marginal since the computed H statistic is close to the critical value. It would be wise to consider the practical implications of the statistical inference test and perform a decision analysis study. ■
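The H statistic of Example 4.5.4 is easily verified from the rank sums of Table 4.10. A minimal sketch of Eq. (4.36):

```python
# Kruskal-Wallis H statistic, Eq. (4.36), for the Table 4.10 rank sums.
n = 30                                        # total sample size
nj = 10                                       # observations per group
rank_sums = [120.0, 210.5, 134.5]             # R1, R2, R3

H = 12 / (n * (n + 1)) * sum(R**2 / nj for R in rank_sums) - 3 * (n + 1)
print(round(H, 3))  # 6.097
```

Since H = 6.097 exceeds the chi-square critical value of 5.991 at d.f. = 2, the null hypothesis is (marginally) rejected, as discussed above.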

4.5.3 Test on Spearman Rank Correlation Coefficient

The Pearson correlation coefficient (Sect. 3.4.2) is a parametric measure meant to quantify the correlation between two quantifiable variables. The Spearman rank correlation coefficient rSp is similar in definition to the Pearson correlation coefficient but uses the relative ranks of the data instead of the numerical values themselves. The same equation as Eq. (3.12) can be used to compute this measure, with its magnitude and sign interpreted in the same fashion. However, a simpler formula is often used to calculate the Spearman rank correlation coefficient (McClave and Benson 1988):

$$r_{Sp} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (4.37)$$

¹² From McClave and Benson (1988) by permission of Pearson Education.


Table 4.11 Data table for Example 4.5.5 showing how to conduct the non-parametric correlation test

Faculty   Research grants ($)   Teaching evaluation   Research    Teaching    Difference   Diff squared
                                (out of 10)           rank (ui)   rank (vi)   (di)         (di²)
1         1,480,000             7.05                  5           7           -2           4
2         890,000               7.87                  1           8           -7           49
3         3,360,000             3.90                  10          2            8           64
4         2,210,000             5.41                  8           5            3           9
5         1,820,000             9.02                  7           9           -2           4
6         1,370,000             6.07                  4           6           -2           4
7         3,180,000             3.20                  9           1            8           64
8         930,000               5.25                  2           4           -2           4
9         1,270,000             9.50                  3           10          -7           49
10        1,610,000             4.45                  6           3            3           9
                                                                  Total                    260

where n is the number of paired measurements, and the difference between the ranks of the ith measurement for the ranked variables u and v is di = ui - vi.

Example 4.5.5 Non-parametric testing of the correlation between the sizes of faculty research grants and teaching evaluations
The provost of a major university wants to determine whether a statistically significant correlation exists between the research grants and the teaching evaluation ratings of its senior faculty. Data over 3 years have been collected, as assembled in Table 4.11, which also shows the manner in which the ranks have been generated and the quantities di = (ui - vi) computed. Using Eq. (4.37) with n = 10, the Spearman rank correlation coefficient is:

$$r_{Sp} = 1 - \frac{6(260)}{10(100 - 1)} = -0.576$$

Thus, one notes that there exists a negative correlation in the sample data. However, whether this is significant at the population level requires that a statistical test be performed on the correlation coefficient rSp:

Null hypothesis H0: rSp = 0 (there is no significant population correlation)
Alternative hypothesis Ha: rSp ≠ 0 (there is a significant population correlation)

Table A.10 in Appendix A gives the absolute cutoff values for different significance levels of the Spearman rank correlation. For n = 10, the one-tailed absolute critical value for α = 0.05 is rSp,α = 0.564. This implies that there is a negative

correlation between research grants and teaching evaluations that differs statistically from 0 at a significance level of 0.05 (albeit barely). It is interesting to point out that, had a parametric analysis been undertaken, the corresponding Pearson correlation coefficient (Sect. 3.4.2) would have been r = -0.620 and deemed significant at α = 0.05 (the critical value from Table A.7 is 0.549). The correlation coefficients by the two methods are quite close (-0.576 and -0.620), with the parametric method indicating a stronger correlation. However, non-parametric tests are distribution-free and, in that sense, more robust. It is advisable, as far as possible, to perform both types of tests and then draw conclusions. ■

The aspect of how confidence intervals widen as n decreases was previously discussed in Sect. 4.2.7 for the Pearson correlation coefficient. The number of data points n also has a large effect on whether the Spearman correlation coefficient rSp determined from a data set is significant or not (Wolberg 2006). For values of n greater than about 10, the random variable z defined below is approximately a standard normal variable (assuming a Gaussian distribution):

$$z = r_{Sp}\sqrt{n - 1} \qquad (4.38)$$

From Table A.3 for a one-tailed distribution, the critical value is zα = 1.645 at the 5% significance level. From Eq. (4.38), for a sample of n = 101, the critical value is rSp,α = 1.645/√(101 - 1) = 0.1645. However, for a sample size of n = 10, the critical value is rSp,α = 1.645/√(10 - 1) = 0.548, which is 3.3 times greater than the previous estimate! This simple example serves to illustrate the importance of the number of data points on the significance test of a sample correlation coefficient.
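Equation (4.37) is easily evaluated from the rank columns of Table 4.11. A minimal sketch using the ranks from the example:

```python
# Spearman rank correlation, Eq. (4.37), for the Table 4.11 data.
u = [5, 1, 10, 8, 7, 4, 9, 2, 3, 6]   # research-grant ranks u_i
v = [7, 8, 2, 5, 9, 6, 1, 4, 10, 3]   # teaching-evaluation ranks v_i

n = len(u)
d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))   # sum of squared rank differences
r_sp = 1 - 6 * d2 / (n * (n**2 - 1))
print(d2, round(r_sp, 3))  # 260 -0.576
```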


4.6 Bayesian Inferences

4.6.1 Background

Bayes' theorem and how it can be used for probability-related problems was treated in Sect. 2.5. Its strength lies in the fact that it provides a framework for including prior information in a two-stage (or multi-stage) experiment, whereby one can draw stronger conclusions than from observational data alone. It is especially advantageous for small data sets, and it was shown that its predictions converge with those of the classical method in two cases: (i) as the data set of observations gets larger; and (ii) if the prior distribution is modeled as a uniform distribution. It was pointed out that advocates of the Bayesian approach view probability as a degree of belief held by a person about an uncertain issue, as compared to the objective view of long-run relative frequency held by traditionalists. This section will discuss how the Bayesian approach can be used to make statistical inferences from samples about an uncertain population parameter, and to address hypothesis-testing problems.

4.6.2 Estimating Population Parameter from a Sample

Consider the case when the population mean μ is to be estimated (point and interval estimates) from the sample mean x̄, with the population distribution assumed to be Gaussian with a known standard deviation σ. This case is covered by the sampling distribution of the mean x̄ treated in Sect. 4.2.1. The probability P of a two-tailed distribution at significance level α can be expressed as:

P( x̄ − z_{α/2} σ/n^{1/2} < μ < x̄ + z_{α/2} σ/n^{1/2} ) = 1 − α    (4.39)

where n is the sample size and z is the value from the standard normal tables. The traditional or frequentist interpretation is that one can be (1 − α) confident that the above interval contains the true population mean (see Sect. 4.2.1c). However, the interval itself should not be interpreted as a probability interval for the parameter. The Bayesian approach uses the same formula, but the mean is modified since the posterior distribution is now used, which includes the sample data as well as the prior belief. The confidence interval is referred to as the credible interval (also referred to as the Bayesian confidence interval). The Bayesian interpretation is that the value of the mean is fixed but has been chosen from some known (or assumed) prior probability distribution. The data collected allow one to recalculate the probability of different values of the mean (i.e., the posterior probability), from which the (1 − α) credible interval can be surmised. Thus, the traditional approach leads to a probability statement about the interval, while the Bayesian approach leads to one about the population mean parameter (Phillips 1973). The credible interval is usually narrower than the traditional confidence interval.

The relevant procedure to calculate the credible intervals for the case of a Gaussian population and a Gaussian prior is presented without proof below (Wonnacott and Wonnacott 1985). Let the prior distribution, assumed normal, be characterized by a mean μ0 and variance σ0², while the sample mean and standard deviation are x̄ and sx. Selecting a prior distribution is equivalent to having a quasi-sample of size n0 given by:

n0 = sx² / σ0²    (4.40)

The posterior mean and standard deviation μ* and σ* are then given by:

μ* = (n0 μ0 + n x̄) / (n0 + n)  and  σ* = sx / (n0 + n)^{1/2}    (4.41)

Note that the expression for the posterior mean is simply the weighted average of the sample and the prior means and is likely to be less biased than the sample mean alone. Similarly, the standard deviation is divided by the total effective sample size, which results in increased precision. However, had a different prior rather than the normal distribution been assumed above, a slightly different interval would have resulted, which is another reason why traditional statisticians (so-called frequentists) are uneasy about fully endorsing the Bayesian approach.

Example 4.6.1 Comparison of classical and Bayesian confidence intervals
A certain solar PV module is rated at 60 W with a standard deviation of 2 W. Since the rating varies somewhat from one shipment to the next, a sample of 12 modules has been selected from a shipment and tested to yield a mean of 65 W and a standard deviation of 2.8 W. Assuming a Gaussian distribution, determine the two-tailed 95% CI by both the traditional and the Bayesian approaches.

(a) Traditional approach:

μ = x̄ ± 1.96 sx / n^{1/2} = 65 ± 1.96 × 2.8 / 12^{1/2} = 65 ± 1.58

Note that a value of 1.96 is used from the z tables even though the sample is small, since the distribution is assumed to be Gaussian.

(b) Bayesian approach: Using Eq. (4.40) to calculate the quasi-sample size inherent in the prior:

n0 = 2.8² / 2² = 1.96 ≃ 2.0

i.e., the prior is equivalent to information from testing an additional 2 modules. Next, Eq. (4.41) is used to determine the posterior mean and standard deviation:

μ* = (2(60) + 12(65)) / (2 + 12) = 64.29  and  σ* = 2.8 / (2 + 12)^{1/2} = 0.748

The Bayesian 95% CI is then:

μ = μ* ± 1.96 σ* = 64.29 ± 1.96(0.748) = 64.29 ± 1.47

Since prior information has been used, the Bayesian interval is likely to be better centered and more precise (with a narrower interval) than the traditional or classical interval.
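Example 4.6.1 can be reproduced with a few lines of code. This is a sketch under the Gaussian-prior assumptions of Eqs. (4.40) and (4.41); the function name is ours, and n0 is carried at full precision rather than rounded to 2 as in the worked example:

```python
import math

def bayes_posterior(mu0, sigma0, xbar, sx, n):
    """Posterior mean and std. deviation for a Gaussian population with a
    Gaussian prior, per Eqs. (4.40) and (4.41)."""
    n0 = sx**2 / sigma0**2                      # quasi-sample size of the prior
    mu_star = (n0 * mu0 + n * xbar) / (n0 + n)  # weighted average of prior and sample
    sd_star = sx / math.sqrt(n0 + n)
    return mu_star, sd_star

# Prior N(60, 2^2); sample of 12 modules with mean 65 W and sd 2.8 W
mu_star, sd_star = bayes_posterior(60.0, 2.0, 65.0, 2.8, 12)
classical_hw = 1.96 * 2.8 / math.sqrt(12)   # classical 95% half-width, ~1.58
bayes_hw = 1.96 * sd_star                   # credible 95% half-width, ~1.47
```

Note that the credible half-width comes out narrower than the classical one, as stated in the text.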

4.6.3 Hypothesis Testing

Section 4.2 dealt with the traditional approach to hypothesis testing, where one frames the problem in terms of two competing claims. The application areas discussed involved testing for a single sample mean, testing for two sample means and paired differences, testing for single and two sample variances, testing for distributions, and testing on the Pearson correlation coefficient. In all these cases, one proceeds by defining two hypotheses:

• The null hypothesis (H0), which represents the status quo, i.e., the claim that will be accepted unless the data provide convincing evidence to the contrary.
• The research or alternative hypothesis (Ha), which is the premise that the variation observed in the data sample cannot be ascribed to random variability or chance alone, and that there must be some inherent structural or fundamental cause.

Thus, the traditional or frequentist approach is to divide the sample space into an acceptance region and a rejection region and posit that the null hypothesis can be rejected only if the test statistic falls in the rejection region, i.e., only if the observed variation is too improbable to be ascribed to chance or randomness at the preselected significance level α. Advocates of the Bayesian

approach have several objections to this line of thinking (Phillips 1973):

(i) The null hypothesis is rarely of much interest. The precise specification of, say, the population mean is of limited value; rather, ascertaining a range would be more useful.
(ii) The null hypothesis is only one of many possible values of the uncertain variable, and placing undue importance on this single value is unjustified.
(iii) As additional data are collected, the inherent randomness in the collection process would lead to the null hypothesis being rejected in most cases.
(iv) Erroneous inferences from a sample may result if prior knowledge is not considered.

The Bayesian approach to hypothesis testing is not to base the conclusions on a traditional significance level like α < 0.05. Instead, it makes use of the posterior credible interval for the population mean μ of the sample collected against a prior mean value μ0. The procedure is described in texts such as Bolstad (2004) and illustrated in the following example.

Example 4.6.2 Traditional and Bayesian approaches to determining confidence intervals
The life of a certain type of smoke detector battery is specified as having a mean of 32 months and a standard deviation of 0.5 months. A building owner decides to test this claim at a significance level of 0.05. He tests a sample of 9 batteries and finds a mean of 31 months and a sample standard deviation of 1 month. Note that this is a one-sided hypothesis test.

(a) The traditional approach would entail testing H0: μ = 32 versus Ha: μ < 32. The Student-t value is:

t = (31 − 32) / (1/9^{1/2}) = −3.0

From Table A.4, the critical value for d.f. = 8 is t0.05 = −1.86. Thus, he can reject the null hypothesis and state that the claim of the manufacturer is incorrect.

(b) The Bayesian approach, on the other hand, would require calculating the posterior probability of the null hypothesis. The prior distribution has a mean μ0 = 32 and variance σ0² = 0.5² = 0.25. First, use Eq. (4.40) to determine:

n0 = 1² / 0.5² = 4

i.e., the prior information is “equivalent” to increasing the sample size by 4. Next, use Eq. (4.41) to determine the posterior mean and standard deviation:

μ* = (4(32) + 9(31)) / (4 + 9) = 31.3  and  σ* = 1.0 / (4 + 9)^{1/2} = 0.277

From here: t = (31.3 − 32.0) / 0.277 = −2.53. This t-value lies beyond the critical value t0.05 = −1.86, so the building owner can again reject the null hypothesis. In this case, both approaches gave the same result, but this is not always true. ■
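The Bayesian step of Example 4.6.2 can be checked numerically (variable names are ours). Carrying full precision in μ* gives t ≈ −2.50 rather than the −2.53 obtained with the rounded intermediate values; the conclusion is unchanged:

```python
import math

# Prior N(32, 0.5^2); sample of n = 9 batteries, mean 31, sd 1 (months)
mu0, sigma0, xbar, sx, n = 32.0, 0.5, 31.0, 1.0, 9

n0 = sx**2 / sigma0**2                      # Eq. (4.40): quasi-sample size = 4
mu_star = (n0 * mu0 + n * xbar) / (n0 + n)  # Eq. (4.41): posterior mean
sd_star = sx / math.sqrt(n0 + n)            # Eq. (4.41): posterior std. deviation
t = (mu_star - mu0) / sd_star               # compare against t_crit = -1.86
```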

4.7 Some Considerations About Sampling

4.7.1 Random and Non-Random Sampling Methods

A sample is a limited portion, or a finite number of items/elements/members, drawn from a larger entity called the population, about which information and characteristic traits are sought. Point and interval estimation as well as the notions of inferential statistics covered in the previous sections involved the use of samples drawn from some underlying population. The premise was that finite samples would reduce the expense associated with the estimation, while the associated uncertainty which would consequently creep into the estimation process could be estimated and managed. It is quite clear that the sample drawn must be representative of the population, and that the sample size should be such that the uncertainty is within certain preset bounds. However, there are different ways by which one could draw samples; this aspect falls under the purview of sampling design. Since these methods have different implications, they are discussed in this section. There are three general rules of sampling design:

(i) The more representative the sample of the population, the better the results.
(ii) All else being equal, larger samples yield better results, i.e., more precise estimates with narrower uncertainty bands.
(iii) Larger samples cannot compensate for a poor sampling design plan or a poorly executed plan.

Some of the common sampling methods are described below:

(a) Random sampling (also called simple random sampling) is the simplest conceptually and is most widely used. It involves selecting the sample of n elements in such a way that all possible samples of n elements have the same chance of being selected. Two important strategies of random sampling involve:

(i) Sampling with replacement, in which the object selected is put back into the population pool and has the possibility of being selected again in subsequent picks, and
(ii) Sampling without replacement, where the object picked is not put back into the population pool prior to picking the next item.

Random sampling without replacement of n objects from a population N could be practically implemented in one of several ways. The most common is to order the objects of the population (e.g., 1, 2, 3, . . ., N), use a random number generator to generate n numbers from 1 to N without replication, and pick only the objects whose numbers have been generated. This approach is illustrated by means of the following example. A consumer group wishes to select a sample of 5 cars from a lot of 500 cars for crash testing. It assigns integers from 1 to 500 to every car on the lot, uses a random number generator to select a set of 5 integers, and then selects the 5 cars corresponding to the 5 integers picked randomly.

Dealing with random samples has several advantages: (i) any random sub-sample of a random sample or its complement is also a random sample; and (ii) after a random sample has been selected, any random sample from its complement can be added to it to form a larger random sample.

(b) Non-random sampling occurs when the selection of members from the population is done according to some method or pre-set process which is not random. Often it occurs unintentionally, with the experimenter thinking that he is dealing with random samples when he is not. In such cases, bias or skewness is introduced, and one obtains misleading confidence limits which may lead to erroneous inferences depending on the degree of non-randomness in the data set. However, in some cases the experimenter intentionally selects the samples in a non-random manner and analyzes the data accordingly. This can result in the required conclusions being reached with reduced sample sizes, thereby saving resources.
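The consumer-group scheme above (5 cars out of 500) maps directly onto the standard-library sampling routines. A sketch, with the seed fixed only for reproducibility:

```python
import random

random.seed(42)                        # illustrative; any seed will do
cars = range(1, 501)                   # cars labeled 1 through 500
chosen = random.sample(cars, k=5)      # random sampling WITHOUT replacement
with_repl = random.choices(cars, k=5)  # random sampling WITH replacement
print(sorted(chosen))
```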
There are different types of non-random sampling (ASTM E 1402 1996), and some of the important ones are listed below:

(i) Stratified sampling, in which the target population is amenable to partitioning into disjoint subsets or strata based on some criterion. Samples are selected independently from each stratum, possibly of different sizes. This improves the efficiency of the sampling process in some instances and is discussed at more length in Sect. 4.7.4.
(ii) Cluster sampling, in which naturally occurring strata or clusters are first identified, then random sampling is done to select a subset of clusters, and finally all the elements in the picked clusters are selected for analysis. For example, a state can be divided into districts and then into municipalities for final sample selection. This approach is often used in marketing research.
(iii) Sequential sampling, a quality control procedure where a decision on the acceptability of a batch of products is made from tests done on a sample of the batch. Tests are done on a preliminary sample and, depending on the results, either the batch is accepted or further sampling tests are performed. This procedure usually requires, on average, fewer samples to be tested to meet a pre-stipulated accuracy.
(iv) Composite sampling, where elements from different samples drawn over a designated time period are combined. An example is mixing water samples drawn hourly to form a composite sample over a day.
(v) Multistage or nested sampling, which involves selecting a sample in stages: a larger sample is selected first, and then successively smaller ones. For example, for testing indoor air quality in a population of office buildings, the design could involve selecting individual buildings during the first stage of sampling, choosing specific floors of the selected buildings in the second stage, and finally selecting specific rooms on the chosen floors during the third and final stage.
(vi) Convenience sampling, also called opportunity sampling, a method of choosing samples arbitrarily in the manner in which they are acquired. If a planned experimental design cannot be followed, the analyst must make do with samples collected in this manner. Though impossible to treat rigorously, it is commonly encountered in many practical situations.

4.7.2 Desirable Properties of Estimators

The parameters from a sample are random variables since different sets of samples will result in different values of the parameters. Recall the definition of two seemingly analogous, but distinct, terms: estimators are mathematical expressions applied to sample data which yield random variables, while an estimate is a specific number or value of this random variable. Commonly encountered estimators are the mean, median, standard deviation, etc. Since the search for estimators is the crux of the parameter estimation process, certain basic notions and desirable properties of estimators need to be explicitly recognized (a good discussion is provided by Pindyck and Rubinfeld 1981). Many of these concepts are logical extensions of the concepts applicable to errors and also apply to regression models treated in Chap. 5. For example, consider the case where inferences about the population mean parameter μ are to be made from the sample mean estimator x̄.

Fig. 4.14 Concept of biased and unbiased estimators

Fig. 4.15 Concept of efficiency of estimators

(a) Lack of bias: A very desirable property is for the distribution of the estimator to have the parameter as its mean value (see Fig. 4.14). Then, if the experiment were repeated many times, one would at least be assured of being right on average. In such a case, the bias E(x̄ − μ) = 0, where E represents the expected value. An example of bias in the estimator is when n is used rather than (n − 1) while calculating the standard deviation following Eq. (3.8).

(b) Efficiency: Lack of bias provides no indication regarding variability. Efficiency is a measure of how small the dispersion can possibly get. The mean x̄ is said to be an efficient unbiased estimator if, for a given sample size, the variance of x̄ is smaller than the variance of any other unbiased estimator (see Fig. 4.15), i.e., it attains the smallest limiting variance that can be achieved. More often a relative order of merit, called the relative efficiency, is used, defined as the ratio of the two variances. Efficiency is desirable since the greater the efficiency associated with an estimation process, the stronger the statistical or inferential statements one can make about the estimated parameters. Consider the following example (Wonnacott and Wonnacott 1985). If a population being sampled is

symmetric, its center can be estimated without bias by either the sample mean x̄ or its median x̃. For some populations x̄ is more efficient; for others x̃ is more efficient. In the case of a normal parent distribution, the standard error of the median is SE(x̃) = 1.25σ/n^{1/2}. Since SE(x̄) = σ/n^{1/2}, the efficiency of x̄ relative to x̃ is:

Efficiency = var(x̃) / var(x̄) = 1.25² = 1.56    (4.42)

Fig. 4.16 Concept of mean square error which includes bias and efficiency of estimators
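The efficiency ratio in Eq. (4.42) is easy to verify by simulation. The sketch below draws repeated Gaussian samples and compares the variance of the sample median to that of the sample mean; the sample size, repetition count, and seed are illustrative:

```python
import random
import statistics

random.seed(1)
n, reps = 101, 5000
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Relative efficiency of the mean over the median, cf. the 1.56 in Eq. (4.42)
rel_eff = statistics.pvariance(medians) / statistics.pvariance(means)
```

With these settings the ratio lands in the neighborhood of the 1.56 quoted in the text (Monte Carlo noise of a few percent is expected).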

(c) Mean square error: There are many circumstances in which one is forced to trade off bias and variance of estimators. When the goal of a model is to maximize the precision of predictions, for example, an estimator with very low variance and some bias may be more desirable than an unbiased estimator with high variance (see Fig. 4.16). One criterion which is useful in this regard is minimizing the mean square error (MSE), defined as:

MSE(x̄) = E(x̄ − μ)² = [Bias(x̄)]² + var(x̄)    (4.43)

where E denotes the expected value. Thus, when x̄ is unbiased, the mean square error and the variance of the estimator x̄ are equal. MSE may be regarded as a generalization of the variance concept. This leads to the generalized definition of the relative efficiency of two estimators, whether biased or unbiased: efficiency is the ratio of the two MSE values.

(d) Consistency: Consider the properties of estimators as the sample size increases. In such cases, one would like the estimator x̄ to converge to the true value or, stated differently, the probability limit of x̄ (plim x̄) should equal μ as the sample size n approaches infinity (see Fig. 4.17). This leads to the criterion of consistency: x̄ is a consistent estimator of μ if plim(x̄) = μ. In other words, as the sample size grows larger, a consistent estimator would tend to approximate the true parameter, i.e., the mean square error of the estimator approaches zero. Thus, one of the conditions that make an estimator consistent is that both its bias and variance approach zero in the limit. However, it does not necessarily follow that an unbiased estimator is a consistent estimator. Although consistency is an abstract concept, it often provides a useful preliminary criterion for sorting out estimators. Generally, one tends to be more concerned with consistency than with lack of bias as an estimation criterion. A biased yet consistent estimator may not equal the true parameter on average but will approximate the true parameter as the sample information grows larger. This is more reassuring practically than the alternative of finding a parameter estimate which is unbiased initially yet will continue to deviate substantially from the true parameter as the sample size gets larger. However, to finally settle on the best estimator, efficiency is a more powerful criterion. As discussed earlier, the sample mean is preferable to the median for estimating the center of a normal population because the former is more efficient, though both estimators are clearly consistent and unbiased.

Fig. 4.17 A consistent estimator is one whose distribution becomes gradually peaked as the sample size n is increased

4.7.3 Determining Sample Size During Random Surveys

Population censuses, market surveys, pharmaceutical field trials, etc. are examples of survey sampling. These can be done in one of two ways, which are discussed in this section and the next. The discussion and equations presented in the previous sub-sections pertain to random sampling. Survey sampling frames the problem using certain terms slightly different from those presented above. Here, a major issue is to determine the sample size which can meet a certain pre-stipulated precision at predefined confidence levels. The estimates from the sample should be close enough to the population characteristic to be useful for drawing conclusions and taking subsequent decisions. One generally assumes the underlying probability distribution to be normal (this may not be correct since lognormal distributions are also encountered often). Let RE be the relative error (often referred to as the margin of error) of the population mean μ at a confidence level (1 − α), where α is the significance level. For a two-tailed distribution, it is defined as:

RE_{1−α} = z_{α/2} SE(x̄) / μ    (4.44)

where SE(x̄) is the standard error of the sample mean given by Eq. (4.3). A measure of variability in the population needs to be introduced, and this is done through the coefficient of variation (CV), defined as:13

CV = std. dev. / true mean = sx / μ    (4.45)

where sx is the sample standard deviation (if the population standard deviation is known, it is better to use that value). From a practical point of view, one should ascertain that sx is lower than the maximum value sx,max at a confidence level (1 − α) given by:

sx,max,1−α = z_{α/2} · CV_{1−α} · μ    (4.46)

Let N be the population size. One can deduce the required sample size n from the above to reach the target RE_{1−α}. Replacing (N − 1) by N in Eq. (4.3), which is the expression for the standard error of the mean for small samples without replacement, results in:

SE(x̄)² = (sx²/n) · (N − n)/N = sx²/n − sx²/N    (4.47)

The required sample size n is found by rearranging terms and using Eqs. (4.44) and (4.46):

n = 1 / [ SE(x̄)²/sx² + 1/N ] = 1 / [ (RE_{1−α} / (z_{α/2} CV_{1−α}))² + 1/N ]    (4.48)

This is the functional form normally used in survey sampling to determine sample size, provided some prior estimates of the population mean and standard deviation are known. In summary, sample sizes relative to the population are determined from three considerations: the margin of error, the confidence level, and the expected variability.

13 Note that this definition is slightly different from that of the CV defined by Eq. (3.9a) since the population mean rather than the sample mean is used.

Example 4.7.1 Determination of random sample size needed to verify peak reduction in residences at preset confidence levels
An electric utility has provided financial incentives to many customers to replace their existing air-conditioners with high-efficiency ones. This rebate program was initiated to reduce the aggregated electric peak during hot summer afternoons, which is dangerously close to the peak generation capacity of the utility. The utility analyst would like to determine the sample size necessary to assess whether the program has reduced the peak as projected, such that the relative error RE ≤ 10% at 95% CL. The following information is given:

Total number of customers: N = 20,000
Estimate of the mean peak saving: μ = 2 kW (from engineering calculations)
Estimate of the standard deviation: sx = 1 kW (from engineering calculations)

This is a one-tailed distribution problem at 95% CL. From Table A.4, z0.05 = 1.645 for large sample sizes. Inserting the values RE = 0.1 and CV = sx/μ = 1/2 = 0.5 into Eq. (4.48), the required sample size is:

n = 1 / [ (0.1 / (1.645 × 0.5))² + 1/20,000 ] = 67.4 ≈ 70

It would be advisable to perform some sensitivity runs given that many of the assumed quantities are based on engineering calculations. It is simple to use the above approach to generate figures such as Fig. 4.18 for assessing tradeoff between reducing the margin of error versus increasing the cost of verification (instrumentation, installation, and monitoring) as sample size is increased. Note that accepting additional error reduces sample size in a hyperbolic manner. For example, lowering the requirement that RE ≤10% to ≤15% decreases n from 70 to about 30; while decreasing RE requirement to ≤5% would require a sample size of about 270 (outside the range of the figure). On the other hand, there is not much one could do about varying CV since this represents an inherent variability in the population if random sampling is adopted. However, non-random stratified sampling, described next, could be one approach to reduce sample sizes. ■
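The sensitivity runs suggested above amount to evaluating Eq. (4.48) over a grid of RE values. A sketch (the function name is ours):

```python
def survey_sample_size(re, cv, N, z=1.645):
    """Required sample size from Eq. (4.48); z defaults to the one-tailed
    95% CL value used in Example 4.7.1."""
    return 1.0 / ((re / (z * cv)) ** 2 + 1.0 / N)

n_base = survey_sample_size(re=0.10, cv=0.5, N=20_000)  # 67.4, rounded up to ~70
for re in (0.05, 0.10, 0.15):                           # sensitivity to margin of error
    print(re, round(survey_sample_size(re, 0.5, 20_000)))
```

The loop reproduces the hyperbolic behavior noted in the text: roughly 270 at RE = 5%, 67 at 10%, and 30 at 15%.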

Fig. 4.18 Size of random sample needed to satisfy different relative errors of the population mean for two different values of population variability (CV of 25% and 50%). Data correspond to Example 4.7.1 (population size = 20,000; one-tailed 95% CL; sample size n plotted against relative error in %)

4.7.4 Stratified Sampling for Variance Reduction

Variance reduction techniques are a special type of sample estimating procedure which rely on the principle that prior knowledge about the structure of the model and the properties of the input can be used to increase the precision of estimates for a fixed sample size or, conversely, to decrease the sample size required to obtain a fixed degree of precision. These techniques distort the original problem so that special techniques can be used to obtain the desired estimates at a lower cost. Variance can be decreased by considering a larger sample size, which involves more work. So, the effort with which a parameter is estimated can be evaluated as (Shannon, 1975):

efficiency = (variance × work)^{−1}    (4.49)

This implies that a reduction in variance is not worthwhile if the work needed to achieve it is excessive. A common recourse among social scientists to increase efficiency per unit cost in statistical surveys is to use stratified sampling, which counts as a variance reduction technique. In stratified sampling, the distribution function to be sampled is broken up into several pieces, each piece is then sampled separately, and the results are later combined into a single estimate. The specification of the strata to be used is based on prior knowledge about the characteristics of the population to be sampled. Often an order of magnitude variance reduction is achieved by stratified sampling as compared to the standard random sampling approach.
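The recombination step at the heart of stratified sampling, weighting each stratum's sample mean by its population fraction, can be sketched as follows (the function name and data layout are ours; the numbers anticipate the men's/women's expenditure example below):

```python
import statistics

def stratified_mean(strata):
    """Weighted population-mean estimate from stratified samples.
    `strata` maps stratum -> (population_fraction, sample_values)."""
    return sum(f * statistics.fmean(vals) for f, vals in strata.values())

est = stratified_mean({
    "men":   (0.80, [45, 50, 55, 40, 90]),
    "women": (0.20, [80, 50, 120, 80, 200, 180, 90, 500, 320, 75]),
})
print(round(est))  # 79
```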

Example 4.7.2¹⁴ Example of stratified sampling for variance reduction
Suppose a home improvement center wishes to estimate the mean annual expenditure of its local residents in the hardware section and the drapery section. It is known that the expenditures by women differ more widely than those by men. Men visit the store more frequently and spend annually approximately $50; expenditures of as much as $100 or as little as $25 per year are found occasionally. Annual expenditures by women can vary from nothing to over $500. The variance for expenditures by women is therefore much greater, and the mean expenditure more difficult to estimate. Assume that 80% of the customers are men and that a sample of size n = 15 is to be taken. If simple random sampling were employed, one would expect the sample to consist of approximately 12 men (original male fraction of the population f1 = 12/15 = 0.8) and 3 women (original female fraction f2 = 0.2). However, assume that a sample that included n1 = 5 men and n2 = 10 women was selected instead (more women have been preferentially selected because their expenditures are more variable). Suppose the annual expenditures of the members of the sample turned out to be:

Men: 45, 50, 55, 40, 90
Women: 80, 50, 120, 80, 200, 180, 90, 500, 320, 75

It is intuitively clear that such data will lead to a more accurate estimate of the overall average than would the expenditures of 12 men and 3 women. The appropriate sample weights must be applied to the original sample data if one wishes to deduce the overall mean. Thus, if Mi and Wi are used to designate the ith sample values for men and women, respectively:

X̄ = (1/n) [ (f1/(n1/n)) Σ_{i=1}^{5} Mi + (f2/(n2/n)) Σ_{i=1}^{10} Wi ]
  = (1/15) [ (0.80/0.33)(280) + (0.20/0.67)(1695) ] ≈ 79    (4.50)

where 0.80 and 0.20 are the original weights in the population, and 0.33 and 0.67 the sample weights, respectively. This value is likely to be a more realistic estimate than if the sampling had been done based purely on the gender percentages of the customers. The above example is a simple case of stratified sampling where the customer base was first stratified into the two genders, and these strata were then sampled disproportionately. There are statistical formulae which suggest near-optimal sizes for stratified samples, for which the interested reader can refer to Devore and Farnum (2005) and other texts.

14 From Shannon (1975).

4.8 Resampling Methods

4.8.1 Basic Concept

Resampling methods reuse a single available sample to make statistical inferences about the population. The precision of a population-related estimate can be improved by drawing multiple samples from the population and inferring the confidence limits from these samples rather than determining them from classical analytical estimation formulae based on a single sample only. However, this is infeasible in most cases because of the associated cost and time of assembling multiple samples. The basic rationale behind resampling methods is to draw one single sample, treat this original sample as a surrogate for the population, and generate numerous sub-samples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data-generating mechanism, to produce new samples from which the required estimates can be deduced numerically. It is obvious that the sample must be unbiased and reflective of the population (which it will be if the sample is drawn randomly); otherwise the precision of the method is severely compromised.

Efron and Tibshirani (1982) have argued that, given the available power of computing, one should move away from the constraints of traditional parametric theory, with its over-reliance on a small set of standard models for which theoretical solutions are available, and substitute computational power for theoretical analysis. This parallels the way numerical methods have, in large part, replaced closed-form solution techniques for differential equations in almost all fields of engineering mathematics. Thus, versatile numerical techniques allow one to overcome such problems as the lack of knowledge of the probability distribution of the errors of the variables, and even to determine sampling distributions of such quantities as the median, the inter-quartile range, or even the 5th and 95th percentiles, for which no traditional tests exist. The methods are conceptually simple, require low levels of mathematics, can be used to determine any estimate whatsoever of any parameter (not just the mean or variance), and even allow the empirical distribution of the parameter to be obtained. Thus, they have clear advantages when assumptions of traditional parametric tests (such as normal distributions) are not met. Note that the estimation must be done directly and uniquely from the samples, and none of the parametric equations related to standard error, etc., discussed in Sect. 4.2 (such as Eq. 4.2 or Eq. 4.5) should be used. Since the needed additional computing power is easily provided by present-day personal computers, resampling methods have become increasingly popular and have even supplanted classical/traditional parametric tests.

4.8.2 Application to Probability Problems

How resampling methods can be used for solving probability type of problems are illustrated below (Simon 1992). Consider a simple example, where one has six balls labeled 1–6. What is the probability that three balls will be picked such that they have 1, 2, 3 in that order if this is done with replacement? The traditional probability equation would yield (1/6)3. The same result can be determined by simulating the 3-ball selection a large number of times. This approach, though tedious, is more intuitive since this is exactly what the traditional probability of the event is meant to represent; namely, the long run frequency. One could repeat this 3-ball selection say a million times, and count the number of times one gets 1, 2, 3 in sequence, and from there infer the needed probability. The procedure rules or the sequence of operations of drawing samples must be written in computer code, after which the computer does the rest. Much more difficult problems can be simulated in this manner, and its advantages lie in its versatility, its low level of mathematics required, and most importantly, its direct bearing with the intuitive interpretation of probability as the long-run frequency.
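This long-run-frequency procedure is easy to code. The sketch below (Python; the trial count and the seed are arbitrary choices, not from the text) simulates the 3-ball selection with replacement and compares the estimated probability against the theoretical (1/6)³ ≈ 0.0046:

```python
import random

def estimate_prob(trials=200_000, seed=1):
    """Estimate the probability of drawing balls labeled 1, 2, 3
    in that exact order (with replacement) from six labeled balls."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draw = [rng.randint(1, 6) for _ in range(3)]
        if draw == [1, 2, 3]:
            hits += 1
    return hits / trials

estimate = estimate_prob()
exact = (1 / 6) ** 3   # the traditional probability calculation
print(estimate, exact)  # the simulated value should be close to 0.00463
```

Increasing `trials` tightens the agreement, which is precisely the long-run-frequency interpretation described above.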

4.8.3 Different Methods of Resampling

The creation of multiple sub-samples from the original sample can be done in several ways, and this distinguishes one method from the other (an important distinction is whether the sampling is done with or without replacement). The three most common resampling methods are:

(a) The permutation method (or randomization method, without replacement) is one where all possible subsets of r items (the sub-sample size) out of the total n items (the sample size) are generated and used to deduce the population estimate and its confidence levels or its percentiles. This may require some effort in many cases, and so an equivalent and less intensive variant of this method is to use only a sample of all possible subsets. The size of the sample is selected based on the accuracy needed, and about 1000 samples are usually adequate. The use of the permutation method when making inferences about the medians of two populations is illustrated below. The null hypothesis is that there is no difference between the two populations. First, both populations are sampled to create two independent random samples, and the difference in the medians of the two samples is computed. Next, two subsamples without replacement (say 10–20 cases per subgroup) are created from the two samples, and the difference in the medians of the two resampled subgroups is recalculated. This is done a large number of times, say 1000 times. The resulting distribution contains the necessary information regarding the statistical confidence in the null hypothesis for the parameter being evaluated. For example, if the difference in medians between the two original samples was exceeded in only 50 of the 1000 resampled subgroups, one concludes that the one-tailed probability of the original event arising by chance is only 0.05. It is clear that such a sampling distribution can be generated for any statistic of interest, not just the median. However, the number of possible randomizations grows very quickly, and so one must select the number of randomizations with some care.

(b) The jackknife method creates subsamples without replacement.
The jackknife method, introduced by Quenouille in 1949 and later extended by Tukey in 1958, is a technique of universal applicability and great flexibility that allows confidence intervals to be determined for an estimate calculated from sub-groups while reducing the bias of the estimator. There are several numerical schemes for implementing the jackknife. The original "leave one out" implementation is simply to create n subsamples of (n − 1) data points each, wherein a single different observation is omitted from each subgroup or subsample. There is no randomness in the results, since the same parameter estimates and confidence intervals will be obtained if the procedure is repeated several times. However, if n is large, this process may be time consuming. A more recent and popular version is the "k-fold cross-validation" method, where one (i) divides the random sample of n observations into k groups of equal size, (ii) omits one group at a time and determines what are called pseudo-estimates from the remaining (k − 1) groups, and (iii) estimates the actual confidence intervals of the parameters.¹⁵

(c) The bootstrap method (popularized by Efron in 1979) is similar but differs in that no groups are formed; instead, the different sets of data sequences are generated by repeated sampling with replacement from the observational data set (Davison and Hinkley 1997). Individual estimators deduced from such samples directly permit estimates and confidence intervals to be determined. The analyst must select the number of randomizations, while the sample size is taken equal to that of the original sample. The method would appear to be circular; i.e., how can one acquire more insight by resampling the same sample? The simple explanation is that "the population is to the sample as the sample is to the bootstrap sample." Though the jackknife is a viable method, it has been supplanted by the bootstrap, which has emerged as the most efficient of the resampling methods in that better estimates of parameters such as the mean, median, variance, percentiles, empirical distributions, and confidence limits are obtained. Several improvements to the naïve bootstrap have been proposed (such as the bootstrap-t method), especially for long-tailed distributions or for time series data. There is a possibility of confusion between the bootstrap method and the Monte Carlo approach (presented in Sect. 3.7.2). The tie between them is obvious: both are based on repetitive sampling and then direct examination of the results.
A key difference between the methods, however, is that bootstrapping uses the original or initial sample as the population from which to resample, whereas Monte Carlo simulation is based on setting up a sample data generation process for the inputs of the simulation or computational model.
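The "leave one out" jackknife of item (b) takes only a few lines of code. The sketch below (Python, with an illustrative 10-point sample that is an assumption, not from the text) computes the jackknife standard error of the sample mean; for this particular statistic, the jackknife answer coincides exactly with the classical s/√n, which provides a convenient check:

```python
import math

def jackknife_se_of_mean(data):
    """Leave-one-out jackknife standard error of the sample mean."""
    n = len(data)
    total = sum(data)
    # pseudo-estimates: mean of each subsample with one observation omitted
    leave_one_out = [(total - x) / (n - 1) for x in data]
    center = sum(leave_one_out) / n
    var_jack = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return math.sqrt(var_jack)

sample = [62, 59, 54, 46, 57, 53, 50, 64, 55, 55]   # illustrative data
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
print(jackknife_se_of_mean(sample), s / math.sqrt(n))  # identical values
```

Note the absence of randomness: rerunning the code always returns the same estimate, exactly as stated above.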

4.8.4 Application of Bootstrap to Statistical Inference Problems

The use of the bootstrap method is illustrated in this section for two types of problems: determining confidence intervals and correlation analysis. At its simplest, the algorithm of the bootstrap method consists of the following steps (Devore and Farnum 2005):

¹⁵ The k-fold cross-validation method is also used in regression modeling (Sect. 5.8) and in tree-based classification problems (Sect. 11.5).


1. Obtain a random sample of size n from the population.
2. Generate a random sample of size n with replacement from the original sample in step 1.
3. Calculate the statistic of interest for the sample in step 2.
4. Repeat steps 2 and 3 many times to form an approximate sampling distribution of the statistic.

It is important to note that bootstrapping requires that sampling be done with replacement, and about 1000 samples are often required. It is advisable that the analyst perform a few evaluations with different numbers of samples in order to be more confident about the results. The following example illustrates the implementation of the bootstrap method.

Example 4.8.1¹⁶ Using the bootstrap method for deducing confidence intervals
The data in Table 4.12 correspond to the breakdown voltage (in kV) of an insulating liquid, which is indicative of its dielectric strength. Determine the 95% CI.

Table 4.12 Data table for Example 4.8.1 (From Devore and Farnum 2005 by permission of Cengage Learning)
62 59 54 46 57 53
50 64 55 55 48 52
53 50 57 53 63 50
57 53 50 54 57 55
41 64 55 52 57 60
53 62 50 47 55 50
55 50 56 47 53 56
61 68 55 55 59 58

First, use the large-sample confidence interval formula to estimate the two-tailed 95% CI of the mean. The summary quantities are: sample size n = 48, Σxi = 2626, and Σxi² = 144,950, from which x̄ = 54.7 and the standard deviation s = 5.23. The 95% CI following the traditional parametric approach is then:

54.7 ± 1.96·(5.23/√48) = 54.7 ± 1.5 = (53.2, 56.2)

The confidence intervals using the bootstrap method are now recalculated to evaluate differences. A histogram of 1000 samples of n = 48 each, drawn with replacement, is shown in Fig. 4.19. The 95% CI corresponds to the two-tailed 0.05 significance level. Thus, one selects 1000·(0.05/2) = 25 units from each end of the distribution, i.e., the 25th smallest and the 975th largest of the ordered bootstrap means, which yields (53.33, 56.27); this is very close to the parametric range determined earlier. This example illustrates the fact that bootstrap intervals usually agree with traditional parametric ones when all the assumptions underlying the latter are met. It is when they do not that the power of the bootstrap stands out.

Fig. 4.19 Histogram of bootstrap sample means with 1000 samples (Example 4.8.1)

Further, the bootstrap resampling approach can also provide 95% CI for other quantities not shown in the figure, such as for the standard deviation (4.075, 6.238) and for the median (53.0, 56.0). ■

The following example illustrates the versatility of the bootstrap method for determining the correlation between two variables, a problem which is recast as one of comparing two sample means.

Example 4.8.2¹⁷ Using the bootstrap method with a nonparametric test to ascertain correlation of two variables
One wishes to determine whether there exists a correlation between the athletic ability and the intelligence level of teenage students. A sample of 10 high school athletes was obtained involving their athletic and IQ scores. The data are listed in descending order of athletic score in the first two columns of Table 4.13. A nonparametric approach is adopted to solve this problem; the parametric version would be the test on the Pearson correlation coefficient (Sect. 4.2.7). The athletic scores and the IQ scores are rank ordered from 1 to 10 as shown in the last two columns of the table.

Table 4.13 Data table for Example 4.8.2 along with ranks (From Simon 1992 by permission of Duxbury Press)
Athletic score: 97 94 93 90 87 86 86 85 81 76
IQ score: 114 120 107 113 118 101 109 110 100 99
Athletic rank: 1 2 3 4 5 6 7 8 9 10
IQ rank: 3 1 7 4 2 8 6 5 9 10

The two observations (athletic rank, IQ rank) are treated together since one would like to determine their joint behavior. The table is split into two groups of five "high" and five "low." An even split of the group is advocated since it uses the available information better and usually leads to greater "efficiency." The sum of the observed IQ ranks of the five top athletes = (3 + 1 + 7 + 4 + 2) = 17. The resampling scheme will involve numerous trials in which a subset of 5 numbers is drawn randomly from the set {1, ..., 10}. One then adds these five IQ ranks for each individual trial. If the observed sum across trials is consistently higher than 17, this will indicate that the best athletes have not earned the observed IQ scores purely by chance. The chance probability of the observed event can be directly estimated from the proportion of trials whose sum is 17 or lower. Figure 4.20 depicts the histogram of the IQ rank sum of 5 random observations using 100 trials (a rather low number of trials, meant for illustration purposes). Note that in only 2% of the trials was the sum 17 or lower. Hence, one can state, to within 98% CL, that there does exist a correlation between athletic ability and IQ score. It is instructive to compare this conclusion against one from a parametric method. The Pearson correlation coefficient (Sects. 3.4.2 and 4.2.7) between the raw athletic scores and the IQ scores has been determined to be r = 0.7093 for this sample, with a p-value of 2.2%, which is almost identical to the approximate p-value of 2.0% determined by the bootstrap method.

Fig. 4.20 Histogram based on 100 trials of the sum of 5 random IQ ranks from the sample of 10. Note that in only 2% of the trials was the sum equal to 17 or lower (Example 4.8.2)

¹⁶ Data available electronically on book website.
¹⁷ From Simon (1992) by permission of Duxbury Press.
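The four bootstrap steps listed above, applied to the breakdown-voltage data of Example 4.8.1, reduce to a few lines of code. A minimal sketch (Python; the number of bootstrap samples and the seed are arbitrary choices) that should yield an interval close to the parametric (53.2, 56.2):

```python
import random

# breakdown voltage data (kV) of Table 4.12, Example 4.8.1
data = [62, 59, 54, 46, 57, 53, 50, 64, 55, 55, 48, 52,
        53, 50, 57, 53, 63, 50, 57, 53, 50, 54, 57, 55,
        41, 64, 55, 52, 57, 60, 53, 62, 50, 47, 55, 50,
        55, 50, 56, 47, 53, 56, 61, 68, 55, 55, 59, 58]

def bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=7):
    """Percentile bootstrap CI for the mean (steps 1-4 of Sect. 4.8.4)."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]            # lower 2.5% cut
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]  # upper 2.5% cut
    return lo, hi

lo, hi = bootstrap_ci(data)
print(round(lo, 2), round(hi, 2))   # roughly (53.2, 56.2)
```

Replacing the statistic `sum(...)/n` with, say, the median or the standard deviation gives the other intervals quoted in the example with no change to the algorithm, which is the versatility argument made in the text.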

4.8.5 Closing Remarks

Resampling methods can be applied to diverse problems (Good 1999): (i) determining probability in complex situations, (ii) estimating confidence intervals (CI) of an estimate during univariate sampling of a population, (iii) hypothesis testing to compare estimates from two samples, (iv) estimating confidence bounds during regression, and (v) classification. These problems can all be addressed by classical methods, provided one makes certain assumptions regarding the probability distributions of the random variables. The analytic solutions can be daunting to those who use these statistical methods rarely, and one can even select the wrong formula by error. Resampling is much more intuitive and provides a way of simulating the physical process without having to deal with the, sometimes obfuscating, statistical constraints of the analytic methods. A big virtue of resampling methods is that they extend classical statistical evaluation to cases which cannot be dealt with mathematically. The downside to the use of these methods is that they require larger computing resources (by two or three orders of magnitude). This issue is no longer a constraint because of the computing power of modern-day personal computers. Resampling methods are also referred to as computer-intensive methods, although other techniques discussed in Sect. 10.6 are more often associated with this general appellation. It has been suggested that one should use a parametric test when the samples are large, say when the number of observations is greater than 40, or when they are small (…).

Problems

(…) 350 at the 0.05 significance level.

Pr. 4.15 Comparison of human comfort correlations between Caucasian and Chinese subjects
Human indoor comfort can be characterized by the occupants' feeling of well-being in the indoor environment. It depends on several interrelated and complex phenomena involving subjective as well as objective criteria. Research initiated over 50 years ago and subsequent chamber studies have helped define acceptable thermal comfort ranges for indoor occupants.
Perhaps the most widely used standard is ASHRAE Standard 55-2004 (ASHRAE 2004), which is described in several textbooks (e.g., Reddy et al. 2016). The basis of the standard is the thermal sensation scale determined from the votes of the occupants following the scale in Table 4.23. The individual votes of all the occupants are then averaged to yield the predicted mean vote (PMV). This is one of the two indices relevant to defining the acceptability of a large population of people exposed to a certain indoor environment. PMV = 0 is defined as the neutral state (neither cool nor warm), while positive values indicate that occupants feel warm, and vice versa. The mean scores from the chamber studies are then regressed against the influential environmental parameters to yield an empirical correlation which can be used as a means of prediction:

PMV = a*·Tdb + b*·Pv + c*    (4.51)

where Tdb is the indoor dry-bulb temperature (°C), Pv is the partial pressure of water vapor (kPa), and the numerical values of the coefficients a*, b*, and c* depend on such factors as sex, age, hours of exposure, clothing level, type of activity, etc. The values relevant to healthy adults in an office setting for a 3-h exposure period are given in Table 4.24.

Fig. 4.21 Predicted percentage of dissatisfied (PPD) people as function of predicted mean vote (PMV) following Eq. (4.52)

In general, the distribution of votes will always show considerable scatter. The second index is the predicted percentage of dissatisfied (PPD), defined as the percentage of people voting outside the range of −1 to +1 for a given value of PMV. When the PPD is plotted against the mean vote of a large group characterized by the PMV, one typically finds a distribution such as that shown in Fig. 4.21. This graph shows that even under optimal conditions (i.e., a mean vote of zero), at least 5% are dissatisfied with the thermal comfort. Hence, because of individual differences, it is impossible to specify a thermal environment that will satisfy everyone. A curve-fit expression between PPD and PMV has also been suggested:

PPD = 100 − 95·exp[−(0.03353·PMV⁴ + 0.2179·PMV²)]    (4.52)

Note that the overall approach is consistent with the statistical approach of approximating distributions by the two primary measures, the mean and the standard deviation. However, in this instance, the standard deviation (characterized by the PPD) has been empirically found to be related to the mean value, namely the PMV (Eq. 4.52). A research study was conducted in China by Jiang (2001) in order to evaluate whether the above types of correlations, developed using American and European subjects, are applicable to Chinese subjects as well. The environmental chamber test protocol was generally consistent with previous Western studies. The total number of Chinese subjects in the pool was about 200, and several tests were done with smaller batches (about 10–12 subjects per batch, evenly split between males and females). Each batch of subjects first spent some time in a pre-conditioning chamber, after which they were moved to the main chamber. The environmental conditions (dry-bulb temperature Tdb, relative humidity RH, and air velocity) of the main chamber were controlled such that: Tdb (±0.3 °C), RH (±5%), and air velocity (…). The chamber test conditions are better visualized if plotted on the psychrometric chart shown in Fig. 4.22. Based on this data, one would like to determine whether the psychological responses of Chinese people are different from those of American/European people.

(a) Formulate the various types of statistical tests one would perform, stating the intent of each test.
(b) Perform some or all of these tests and draw relevant conclusions.
(c) Prepare a short report describing your entire analysis.

Hint: One of the data points is suspect. Also use Eqs. (4.51) and (4.52) to generate the values pertinent to Western subjects prior to making the comparative evaluations.

Fig. 4.22 Chamber test conditions plotted on a psychrometric chart for Chinese subjects (Problem 4.15)

References

(…) CV, this would indicate that the model deviates more at the lower range, and vice versa. There is a third way in which the RMSE can be normalized, though it is not used as much in the statistical literature: simply divide the RMSE by the range of variation of the response variable y:

CV″ = RMSE/(ymax − ymin)    (5.10c)

This measure has a nice intuitive appeal, and its range is bounded between [0, 1].

⁶ Parsimony in the context of regression model building is a term meant to denote the most succinct model, i.e., one without any statistically superfluous regressors.
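For generating the Western-subject reference values that the hint of Problem 4.15 calls for, Eq. (4.52) is straightforward to code. A minimal sketch (Python):

```python
import math

def ppd(pmv):
    """Predicted percentage of dissatisfied as a function of PMV, Eq. (4.52)."""
    return 100.0 - 95.0 * math.exp(-(0.03353 * pmv**4 + 0.2179 * pmv**2))

# PPD is symmetric in PMV and reaches its minimum of 5% at PMV = 0,
# consistent with Fig. 4.21
for v in (-2, -1, 0, 1, 2):
    print(v, round(ppd(v), 1))
```

Note that Eq. (4.51) could be coded the same way, but its coefficients a*, b*, c* come from Table 4.24, which is not reproduced here, so they are left out of the sketch.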


(e) Mean Bias Error
The mean bias error (MBE) is defined as the mean difference between the actual data values and the model-predicted values:

MBE = [Σ_(i=1 to n) (yi − ŷi)] / (n − p)    (5.11a)

Note that when a model is identified by OLS using the original set of data, the MBE should be zero (to within round-off errors of the computer). Only when, say, the model identified from a first set of observations is used to predict the value of the response variable under a second set of conditions will the MBE be different from zero (see Sect. 5.8 for further discussion). Under the latter circumstances, the MBE is also called the mean simulation or prediction error. A normalized MBE (or NMBE) is often used and is defined as the MBE given by Eq. 5.11a divided by the mean value of the response variable ȳ:

NMBE = MBE/ȳ    (5.11b)

(f) Mean Absolute Deviation
The mean absolute deviation (MAD) is defined as the mean absolute difference between the actual data values and the model-predicted values:

MAD = [Σ_(i=1 to n) |yi − ŷi|] / (n − p)    (5.12)

This metric is also called the mean absolute error (MAE); unlike the MBE, it measures the average magnitude of the model errors without letting positive and negative deviations cancel.

(g) Skill Factor
Another measure is sometimes used to compare the relative improvement of one model over another when applied to the same data set. The relative score or skill factor (SF) allows one to quantify the improvement in predictive accuracy that, say, model B brings compared to another, say model A. It is common to use the RMSE of both models as the basis:

SF(%) = [(RMSE_model A − RMSE_model B)/RMSE_model A] × 100,  with −∞ < SF ≤ 100    (5.13)

Hence, SF < 0% would indicate that model B is poorer than model A. Conversely, SF > 0% would mean that model B is an improvement. If, say, SF = 35%, this could be interpreted as the predictive accuracy of model B being 35% better than that of model A. The limit SF = 100% is indicative of the upper limit of perfect predictions, that is, RMSE(model B) = 0. The SF metric is a conceptually appealing measure that allows several potential models to be directly evaluated and ranked against a baseline or reference model.

Example 5.3.2 Using the data from Example 5.3.1, repeat the exercise using a spreadsheet program. Calculate the R², RMSE, and CV values.
From Eq. 5.2, SSE = 323.3 and SSR = 3390.5. From this, SST = SSE + SSR = 3713.9. Then from Eq. 5.7a, R² = 91.3%, while from Eq. 5.8, RMSE = 3.2295, from which CV = 0.095 = 9.5%. ■

The Adj. R-square, the RMSE (or CV), the MBE (or NMBE), and the MAD are perhaps the most widely used metrics for evaluating competing regression model fits to data. Under certain circumstances, one model may be preferable to another in terms of one index but not the other. The analyst is then perplexed as to which index to pick as the primary one. In such cases, the specific intent of how the model is going to be subsequently applied should be considered, which may suggest the model selection criterion.

5.3.3 Inferences on Regression Coefficients and Model Significance

Once an overall regression model is identified, is the model statistically significant? If it is not, the entire identification process loses its value. The F-statistic, which tests for the significance of the overall regression model (not that of a particular regressor), is defined as:

F = (variance explained by the regression)/(variance not explained by the regression)
  = MSR/MSE = [SSR/(p − 1)] / [SSE/(n − p)]    (5.14a)

Note that the degrees of freedom for SSR = (p − 1) while that for SSE = (n − p). Thus, the smaller the value of F, the poorer the regression model. It will be noted that the F-statistic is directly related to R² as follows:

F = [(n − p)/(p − 1)] · [R²/(1 − R²)]    (5.14b)

Hence, the F-statistic can alternatively be viewed as a measure for testing the significance of R² itself. During univariate regression, the F-test is really the same as a Student t-test for

the significance of the slope coefficient. In the general case, the F-test allows one to test the joint hypothesis of whether all coefficients of the regressor variables are equal to zero or not.

Example 5.3.3 Calculate the F-statistic for the model identified in Example 5.3.1. What can you conclude about the significance of the fitted model?
From Eq. 5.14a,

F = [3390.5/(2 − 1)] / [323.3/(33 − 2)] = 325

which clearly indicates that the overall regression fit is significant. The reader can verify that Eq. 5.14b also yields an identical value of F. ■

Note that the values of the coefficients a and b based on the given sample of n observations are only estimates of the true model parameters α and β. If the experiment is repeated, the estimates of a and b are likely to vary from one set of experimental observations to another. OLS estimation assumes that the model residual ε is a random variable with zero mean. Further, the residuals εi at specific values of x are taken to be randomly distributed, which is akin to saying that the distributions shown in Fig. 5.3 at specific values of x are normal and have equal variance. After getting an overall picture of the regression model, it is useful to study the significance of each individual regressor on the overall statistical fit in the presence of all other regressors. The Student t-statistic is widely used for this purpose and is applied to each regression parameter. For the slope parameter b in Eq. 5.1, the Student t-value is:

t = (b − 0)/s_b    (5.15a)

The estimated standard deviation (also referred to as the "standard error of the sampling distribution") of the slope parameter b is given by:

s_b = RMSE/√Sxx    (5.15b)

with Sxx being the sum of squares given by Eq. 5.6b and RMSE by Eq. 5.8. For the intercept parameter a in Eq. 5.1, the Student t-value is:

t = (a − 0)/s_a    (5.16a)

where the estimated standard deviation of the intercept parameter a is:

s_a = RMSE·[Σ_(i=1 to n) xi² / (n·Sxx)]^(1/2)    (5.16b)

Basically, the t-test as applied to regression model building is a formal statistical test to determine how significantly different an individual coefficient is from zero in the presence of the remaining coefficients. Stated simply, it enables an answer to the following question: would the fit become poorer if the regressor variable in question were not used in the model at all? Recall that the confidence intervals (CI) refer to the limits for the mean response at specified values of the predictor variables for a specified confidence level (CL). Let β and α denote the hypothesized true values of the slope and intercept coefficients. The CI for the model parameters are determined as follows. For the slope term:

b − t(α/2)·RMSE/√Sxx  <  β  <  b + t(α/2)·RMSE/√Sxx

For the model of Example 5.3.1, the 95% CI for the mean response at x = 20 works out to:

21.9025 − (2.04)·(0.87793)  <  ⟨y20⟩  <  21.9025 + (2.04)·(0.87793)

or 20.112 < ⟨y20⟩ < 23.693 at 95% CL. ■

Example 5.3.9 Calculate the 95% PI for predicting the individual response for x = 20 using the linear model identified in Example 5.3.1.
Using Eq. 5.21,

[Var(y0)]^(1/2) = (3.2295)·[1 + 1/33 + (20 − 33.4545)²/4152.18]^(1/2) = 3.3467

Further, t(0.05/2) = 2.04. Using an expression analogous to Eq. 5.20 yields the PI for the individual response:

21.9025 − (2.04)·(3.3467)  <  y20  <  21.9025 + (2.04)·(3.3467)

or 15.075 < y20 < 28.730 at 95% CL. ■

[Figure: scatter of the Example 5.3.1 data versus Solids Reduction (x-axis 20–60); caption not recovered]

5.4 Multiple OLS Regression

Regression models can be classified as:
(i) Univariate or multivariate, depending on whether only one or several regressor variables are being considered.
(ii) Single-equation or multi-equation, depending on whether only one or several interconnected response variables are being considered.
(iii) Linear or nonlinear, depending on whether the model is linear or nonlinear in its parameters (and not in its functional form). Thus, a regression equation such as y = a + b·x + c·x² is said to be linear in its parameters {a, b, c} though it is nonlinear in its functional structure (see Sect. 1.4.4 for a discussion of the classification of mathematical models).

Certain simple univariate equation models are shown in Fig. 5.7. Frame (a) depicts simple linear models (one with a positive slope and another with a negative slope), while frames (b) and (c) are higher-order polynomial models which, though nonlinear in the function, are models linear in their parameters. The other frames depict nonlinear models. Analysts often adopt a linear model (especially over a limited range) even if the relationship in the data is not strictly linear. If a function such as that shown in frame (d) is globally nonlinear, and if the domain of the experiment is limited, say, to the right knee of the curve (bounded by points c and d), then a linear function in this region could be postulated. Models tend to be preferentially framed as linear ones largely due to the simplicity in subsequent model building and the prevalence of solution methods based on matrix algebra.
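The point made in item (iii), that a polynomial such as y = a + b·x + c·x² remains a linear estimation problem, can be demonstrated directly: the normal equations are linear in {a, b, c} and can be solved by elementary matrix algebra. A minimal sketch (Python; the data and coefficient values are hypothetical, generated noise-free so the fit recovers them exactly):

```python
def solve3(A, r):
    """Solve a 3x3 linear system by Gauss-Jordan elimination
    (no pivoting; adequate for this well-conditioned illustration)."""
    M = [row[:] + [v] for row, v in zip(A, r)]
    for i in range(3):
        M[i] = [v / M[i][i] for v in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [vj - M[j][i] * vi for vj, vi in zip(M[j], M[i])]
    return [M[i][3] for i in range(3)]

# hypothetical data generated from y = 2 + 0.5x + 0.25x^2 (no noise)
xs = [0, 1, 2, 3, 4, 5]
ys = [2 + 0.5 * x + 0.25 * x**2 for x in xs]

# normal equations X'X beta = X'y with the regressors (1, x, x^2)
X = [[1.0, x, x * x] for x in xs]
XtX = [[sum(X[k][i] * X[k][j] for k in range(len(xs))) for j in range(3)]
       for i in range(3)]
Xty = [sum(X[k][i] * ys[k] for k in range(len(xs))) for i in range(3)]
a, b, c = solve3(XtX, Xty)
print(round(a, 6), round(b, 6), round(c, 6))  # recovers 2.0, 0.5, 0.25
```

Nothing in the solution step depends on the regressors being powers of one variable, which is why the same machinery carries over to the multivariate models of Sect. 5.4.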

5.4.1 Higher Order Linear Models

Fig. 5.7 General shape of regression curves (From Shannon 1975 by permission of Pearson Education)

When more than one regressor variable is known to influence the response variable, a multivariate model will explain more of the variation and provide better predictions than a univariate model. The parameters of such a model can be identified using multiple regression techniques. This section will discuss certain important issues regarding multivariate, single-equation models linear in the parameters using the OLS approach. For now, the treatment is limited to regressors which are uncorrelated or independent. Consider a data set of n readings that includes k regressor variables. The number of model parameters p will then be (k + 1) because one of the parameters is the intercept or constant term. The corresponding form, called the additive multivariate linear regression (MLR) model, is:

y = β0 + β1·x1 + β2·x2 + ⋯ + βk·xk + ε    (5.22a)

where ε is the error or unexplained variation in y. Due to the lack of any interaction terms, the model is referred to as "additive." The simple interpretation of the numerical value of the model parameters is that βi represents the unit influence of xi on y (i.e., the slope ∂y/∂xi). Note that this is strictly valid only when the variables are independent or uncorrelated, which, often, is not true.
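When the regressors are uncorrelated, as assumed here, each slope of the additive model of Eq. 5.22a can be recovered from its own simple-regression formula (Sxy/Sxx applied one regressor at a time). A minimal sketch (Python; the orthogonal two-level design and the coefficient values are hypothetical, with noise-free data so the recovery is exact):

```python
def slope(x, y):
    """OLS slope Sxy/Sxx of y against a single regressor x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

# orthogonal (uncorrelated) regressors: a replicated two-level design
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1, -1, 1]
# noise-free response from the additive model y = 10 + 3*x1 - 1.5*x2
y = [10 + 3 * a - 1.5 * b for a, b in zip(x1, x2)]

b1, b2 = slope(x1, y), slope(x2, y)
b0 = sum(y) / len(y) - b1 * sum(x1) / len(x1) - b2 * sum(x2) / len(x2)
print(b0, b1, b2)   # recovers 10.0, 3.0, -1.5
```

If x1 and x2 were correlated, the one-regressor-at-a-time slopes would no longer equal the multivariate coefficients, which is exactly the caveat stated in the text.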

The same model formulation is equally valid for a kth-degree polynomial regression model, which is a special case of Eq. 5.22a with x1 = x, x2 = x², etc.:

y = β0 + β1·x + β2·x² + ⋯ + βk·x^k + ε    (5.23)

Polynomial models are commonly used to represent empirical behavior and can capture a variety of shapes. Usually, they are limited to second-order (quadratic) or third-order (cubic) functional forms. Let xij denote the ith observation of regressor j. Then Eq. 5.22a can be rewritten as:

yi = β0 + β1·xi1 + β2·xi2 + ⋯ + βk·xik + ε    (5.22b)

Often, it is most convenient to transform the regressor variables and express them as a difference from the mean


(this approach is used in Sect. 5.4.4 and also in Sect. 6.4 while dealing with experimental design methods). This transformation is also useful for reducing the ill-conditioning effects of multicollinearity, which introduces errors and large uncertainties in the model parameter estimates (discussed in Sect. 9.3). Specifically, Eq. 5.22a can be transformed into:

y = β0′ + β1·(x1 − x̄1) + β2·(x2 − x̄2) + ⋯ + βk·(xk − x̄k) + ε    (5.24)

An important special case is the second-order or quadratic regression model when (p = 3) in Fig. 5.7b. The straight line is now replaced by parabolic curves depending on the value of β (i.e., either positive or negative). Multivariate model development utilizes some of the same techniques as discussed in the univariate case. The first step is to identify all variables that can influence the response as predictor variables. It is the analyst's responsibility to identify these potential predictor variables based on his or her knowledge of the physical system. It is then possible to plot the response against all possible predictor variables to identify any obvious trends. The greatest single disadvantage of this approach is the sheer labor involved when the number of possible regressor variables is high. A situation that arises in multivariate regression is the concept of variable synergy, commonly called interaction between variables (this is a consideration in other problems as well; for example, when dealing with the design of experiments). This occurs when two or more variables interact and impact the system response to a degree greater than when the variables operate independently. In such a case, the first-order linear model with two interacting regressor variables takes the form:

y = β0 + β1·x1 + β2·x2 + β3·x1·x2 + ε    (5.25)

The term (β3x1  x2) is called the interaction term. How the interaction parameter affects the shape of the family of Fig. 5.8 Plots illustrating the effect of interaction among two regressor variables due to the presence of cross-product terms. (a) Non-interacting. (b) Interacting (From Neter et al. 1983)

Linear Regression Analysis Using Least Squares

curves is illustrated in Fig. 5.8. The origin of this model function is easy to derive. In the non-interacting case, the lines for different values of regressor x1 are essentially parallel, and so the slope terms for both models are equal. Let the model with the first regressor be y = a′ + b·x1, with the intercept given by a′ = f(x2) = a + c·x2. Combining both equations results in y = a + b·x1 + c·x2, which corresponds to Fig. 5.8a. For the interacting case, both the slope and the intercept terms are functions of x2. Hence, representing the intercept as a′ = a + b·x1 and the slope as b′ = c + d·x1, one obtains:

y = a + b·x1 + (c + d·x1)·x2 = a + b·x1 + c·x2 + d·x1·x2

which is identical in structure to Eq. 5.25.

Simple linear functions have been assumed above. It is straightforward to derive expressions for higher-order models by analogy. For example, the second-order (or quadratic) model without interacting variables is:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε   (5.26)

For a second-order model with interacting terms, the corresponding expression can be easily derived. Consider the polynomial model with one regressor (with the error term dropped):

y = b0 + b1x1 + b2x1²   (5.27)

If the parameters {b0, b1, b2} can themselves be expressed as second-order polynomials of another regressor x2, the full model will have nine regression parameters:

y = b00 + b10x1 + b01x2 + b11x1x2 + b20x1² + b02x2² + b21x1²x2 + b12x1x2² + b22x1²x2²   (5.28)

5.4 Multiple OLS Regression

Fig. 5.9 Mortality ratio of men as a function of age and percent of normal weight (Ezekiel and Fox 1959)

Fig. 5.10 Response contour diagrams (Neter et al. 1983). (a) Non-interacting independent variables: y = 20 + 0.95x1 − 0.50x2. (b) Interacting independent variables: y = 5x1 + 7x2 + 3x1x2

The functional dependence of a response variable (mortality ratio) on two independent variables (age and percent of normal weight) is perhaps better illustrated using 3-D plots, as shown in Fig. 5.9. Clearly, any model used to fit the shape of this curved surface would require higher-order functional models with interacting terms. For example, men who are either underweight or overweight at age 22 seem to have higher mortality rates than normal (a y-axis value of 100 is normal), but this is not so for the 52-year age group, where overweight is the only high-risk factor. This example also illustrates the fact that such polynomial models can fit simple ridges, peaks, valleys, and saddles. It is important to emphasize that the analyst should strive to identify the simplest model possible, with the model order as low as possible, in the case of multivariate regression.

3-D plots, such as in Fig. 5.9, are sometimes hard to read, and response contour plots for two-independent-variable situations are often more telling. Instead of plotting the

dependent or response variable on the z-axis, the two independent variables are shown on the x- and y-axes, and discrete values of the response variable are shown as contour lines. Fig. 5.10 illustrates this type of graphical presentation for two cases: without and with interaction between the two independent variables.

The synergistic behavior of independent variables can result in two or more variables working together to "overpower or usurp" another variable's prediction capability. As a result, it is necessary to always check the importance of each individual predictor variable while performing multivariate regression. Those variables with low absolute values of the t-statistic should be omitted from the model and the remaining predictors used to re-estimate the model parameters. The stepwise regression method described in Sect. 5.4.6 is based on this concept.
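The effect of an interaction term can be demonstrated numerically. The following sketch (illustrative only; the synthetic data and coefficient values are made up, not from the book) fits a first-order model with interaction of the Eq. 5.25 form by OLS and computes the t-statistic of each coefficient, the screening quantity recommended above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
# Synthetic response with a true interaction effect (coefficients arbitrary)
y = 5.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0, 1.0, n)

# Design matrix for y = b0 + b1*x1 + b2*x2 + b3*x1*x2 (Eq. 5.25 form)
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ b
s2 = resid @ resid / (n - X.shape[1])                # residual mean square
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # standard errors
t = b / se                                           # t-statistic of each coefficient

print(np.round(b, 2), np.round(t, 1))
```

A coefficient whose |t| falls below the critical value (roughly 2 at the 95% confidence level with ample data) would be a candidate for omission from the model.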

5.4.2 Matrix Formulation

When dealing with multiple regression, it is advantageous to resort to matrix algebra because of the compactness and ease of manipulation it offers without loss in clarity. Though the solution is conveniently provided by a computer, a basic understanding of the matrix formulation is nonetheless useful. In matrix notation (with Yᵀ denoting the transpose of Y), the linear model given by Eq. 5.22b can be expressed as follows for the n data points (with the matrix dimensions shown in subscripted brackets for better understanding):

Y(n,1) = X(n,p) β(p,1) + ε(n,1)   (5.29)

where p is the number of parameters in the model (= k + 1 for a linear model), k is the order of the model, and n is the number of data points, each consisting of one observation of the response and of each regressor. The individual terms can be expressed as:

Yᵀ = [y1 y2 . . . yn],   βᵀ = [β0 β1 . . . βk],   εᵀ = [ε1 ε2 . . . εn]   (5.30)

and

X =
| 1  x11  . . .  x1k |
| 1  x21  . . .  x2k |
| .   .           .  |
| 1  xn1  . . .  xnk |
  (5.31)

The first column of 1s is meant for the constant term; it is strictly not needed but is convenient for matrix manipulation. The interpretation of the matrix elements is simple. For example, x21 refers to the second observation of the first regressor x1, and so on. The descriptive measures applicable for a single variable can be extended to multiple variable models of order k and written in compact matrix notation.

5.4.3 Point and Interval Estimation

The approach involving the minimization of SSE for the univariate case (Sect. 5.3.1) can be generalized to multivariate linear regression. Here, the parameter set β is to be identified such that the sum of squares function L is minimized:

L = Σ(i=1 to n) εi² = εᵀε = (Y − Xβ)ᵀ(Y − Xβ)   (5.32)

Setting the derivative with respect to β to zero,

∂L/∂β = −2XᵀY + 2XᵀXβ = 0   (5.33)

which leads to the system of normal equations:

XᵀX b = XᵀY   (5.34)

with

XᵀX =
| n      Σxi1      . . .  Σxik     |
| Σxi1   Σxi1²     . . .  Σxi1·xik |
| .      .                .        |
| Σxik   Σxik·xi1  . . .  Σxik²    |
  (5.35)

where all sums run over i = 1 to n. The above matrix is a symmetric matrix with the main diagonal elements being the sums of squares of the elements in the columns of X and the off-diagonal elements being the sums of the cross-products. From here, the regression model coefficient vector b, the least squares estimator of β, is given by:

b = (XᵀX)⁻¹ XᵀY = C XᵀY   (5.36)

provided the matrix (XᵀX) is not singular. Note that the matrix C = (XᵀX)⁻¹, called the variance-covariance matrix of the estimated regression coefficients, is also a symmetric matrix with the main diagonal elements being the variances of the model coefficient estimators and the off-diagonal elements being the covariances. Under OLS regression, the variance of the model parameters is given by:

Var(b) = σ² (XᵀX)⁻¹ = σ² C   (5.37)

where σ² is the mean square error of the model error terms:

σ² = (sum of square errors)/(n − p)   (5.38)

An unbiased estimator of σ² is the sample s², or residual mean square:

s² = εᵀε/(n − p) = SSE/(n − p)   (5.39)

For predictions within the range of variation of the original data, the mean and individual response values are normally distributed with the variances given by the following:

(a) For the mean response at a specific set of x0 values, or the confidence interval CI, under OLS:

var(y0) = s² X0ᵀ (XᵀX)⁻¹ X0   (5.40)

(b) The variance of an individual prediction, or the prediction level, is:

var(y0) = s² [1 + X0ᵀ (XᵀX)⁻¹ X0]   (5.41)

where 1 is a column vector of unity. Two-tailed confidence intervals CI at a significance level α are:

y0 ± t(n − k, α/2) · var^(1/2)(y0)   (5.42)

Example 5.4.1 Part-load performance of fans (and pumps)
Part-load performance curves do not follow the idealized fan laws due to various irreversible losses. For example, decreasing the flow rate to half of the rated flow does not result in a (1/8)th decrease in the rated power consumption as predicted by the fan laws. Hence, actual tests are performed for such equipment under different levels of loading. The flow rate and the power consumed measured in these performance tests are then normalized by the rated or 100% load conditions, and called the part-load ratio (PLR) and the fractional full-load power (FFLP), respectively. Polynomial models can then be fit between these two quantities with PLR as the regressor and FFLP as the response variable. The data assembled in Table 5.2 were obtained from laboratory tests on a variable speed drive (VSD) control, a very energy-efficient device that is being increasingly installed.

(a) What is the matrix X in this case if a second-order polynomial model is to be identified of the form y = β0 + β1x1 + β2x1²?
(b) Using the data given in the table, identify the model and report relevant statistics on both parameters and overall model fit.
(c) Compute the confidence interval and the prediction interval at the 0.05 significance level for the response at values of PLR = 0.2 and 1.00 (i.e., the extreme points).

Table 5.2 Data table for Example 5.4.1

PLR:  0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
FFLP: 0.05  0.11  0.19  0.28  0.39  0.51  0.68  0.84  1.00

Solution
(a) The independent variable matrix X given by Eq. 5.26b is:

X =
| 1  0.2  0.04 |
| 1  0.3  0.09 |
| 1  0.4  0.16 |
| 1  0.5  0.25 |
| 1  0.6  0.36 |
| 1  0.7  0.49 |
| 1  0.8  0.64 |
| 1  0.9  0.81 |
| 1  1.0  1.00 |

(b) The results of the regression are shown below:

Parameter | Estimate    | Standard error | t-statistic | p-value
Constant  | −0.0204762  | 0.0173104      | −1.18288    | 0.2816
PLR       | 0.179221    | 0.0643413      | 2.78547     | 0.0318
PLR²      | 0.850649    | 0.0526868      | 16.1454     | 0.0000

Analysis of Variance

Source        | Sum of squares | Df | Mean square  | F-ratio | p-value
Model         | 0.886287       | 2  | 0.443144     | 5183.10 | 0.0000
Residual      | 0.000512987    | 6  | 0.0000854978 |         |
Total (Corr.) | 0.8868         | 8  |              |         |

Goodness-of-fit: R² = 99.9%, Adj-R² = 99.9%, RMSE = 0.009246, and mean absolute deviation (MAD) = 0.00584. The equation of the fitted model is (with appropriate rounding):

FFLP = −0.0205 + 0.1792 PLR + 0.8506 PLR²   (5.43)

Since the model p-value in the ANOVA table is less than 0.05, there is a statistically significant relationship between FFLP and PLR at the 95% CL. However, the p-value of the constant term is large (>0.05), and a model without an intercept term would be more appropriate, as physical considerations suggest, since the power consumed by the pump is zero if there is no flow. The values shown are those provided by the software package. Note that the standard errors and the Student t-values for the model coefficients shown in the table cannot be computed from Eqs. 5.14 and 5.15, which apply to the simple linear model. The equations for polynomial regression are rather complicated to solve by hand, and the interested reader can refer to texts such as Neter et al. (1983) for more details.

The 95% CI and PI are shown in Fig. 5.11. Because the fit is excellent, these are very narrow and close to each other. The predicted values as well as the 95% CI and PI for the two data points are given in the table below. Note that the uncertainty range is relatively larger at the lower value than at the higher range.

Fig. 5.11 Plot of fitted model along with 95% CI and 95% PI

x   | Predicted y | 95% prediction limits (Lower, Upper) | 95% confidence limits (Lower, Upper)
0.2 | 0.0493939   | 0.0202378, 0.0785501                 | 0.0310045, 0.0677834
1.0 | 1.00939     | 0.980238, 1.03855                    | 0.991005, 1.02778
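The calculations of Example 5.4.1 can be checked numerically. The following sketch (numpy; the critical value t(6, 0.025) ≈ 2.447 is hard-coded from standard tables rather than computed) solves the normal equations of Sect. 5.4.3 for the Table 5.2 data and recomputes the fitted coefficients, the RMSE, and the 95% confidence limits at PLR = 0.2; the results should agree with Eq. 5.43 and the tabulated limits to within rounding.

```python
import numpy as np

# Data of Table 5.2 (PLR = part-load ratio, FFLP = fractional full-load power)
plr  = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
fflp = np.array([0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00])

# Second-order polynomial design matrix (first column of 1s for the constant term)
X = np.column_stack([np.ones_like(plr), plr, plr**2])

# Solve the normal equations X'X b = X'Y (Eqs. 5.34 and 5.36)
b = np.linalg.solve(X.T @ X, X.T @ fflp)

n, p = X.shape
resid = fflp - X @ b
s2 = resid @ resid / (n - p)          # residual mean square, Eq. 5.39
C = np.linalg.inv(X.T @ X)            # variance-covariance factor, Eqs. 5.36-5.37

# Mean-response (CI) and individual-prediction (PI) variances at PLR = 0.2
x0 = np.array([1.0, 0.2, 0.2**2])
var_mean = s2 * (x0 @ C @ x0)         # Eq. 5.40
var_pred = s2 * (1 + x0 @ C @ x0)     # Eq. 5.41
t_crit = 2.447                        # t(df=6, alpha/2=0.025), from standard tables
y0 = x0 @ b
ci = (y0 - t_crit * var_mean**0.5, y0 + t_crit * var_mean**0.5)

print(np.round(b, 4), round(s2**0.5, 6), np.round(ci, 4))
```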

Example 5.4.2
Table 5.3 gives the solubility of oxygen in water in (mg/L) at 1 atm pressure for different temperatures and different chloride concentrations in (mg/L).

(a) Plot the data and formulate two different potential models for oxygen solubility (the response variable) against the two regressors.
(b) Evaluate both models and identify the better one. Give justification for your choice, and report pertinent statistics for the model parameters as well as for the overall model fit.

Solution
(a) The data set (28 data points in all) is plotted in Fig. 5.12a. One notes that the series of plots is slightly nonlinear but parallel, suggesting a higher-order model without interaction terms. Hence, second-order polynomial models without interaction are probably more logical, but let us investigate both the first-order and second-order linear models.
(b1) Analysis results of the first-order model (n = 28 and p = 3). Goodness-of-fit indicators: R² = 96.83%, Adj-R² = 96.57%, RMSE = 0.41318. All three model parameters are statistically significant as indicated by the p-values.

A value of Cp greater than p would indicate a biased model because of underfitting (Walpole et al. 1998). Measures other than Adj-R² and the Mallows Cp statistic have been proposed for subset selection. Two of the most widely adopted ones are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (see, e.g., James et al. 2013).

Another model selection approach, which can be automated to evaluate models with a large number of possible parameters, is the iterative approach, which comes in three variants (in many cases, the three may yield slightly different results).

(b-1) Backward Elimination Method
One begins by selecting an initial model that includes the full set of possible predictor variables from the candidate pool, and then successively drops one variable at a time⁸ based on its contribution to the reduction of SSE (or drops the variable which results in the smallest decrease in R²) until a sudden large drop in R² is noticed. The OLS method is used to estimate all model parameters along with t-values for each model parameter. If all model parameters are statistically significant, the model-building process stops. If some model parameters are not significant, the model parameter of least significance (lowest t-value) is omitted from the regression equation, and the reduced model is refit. This process continues until all parameters that remain in the model are statistically significant.

8 Actually, there is no "best" model since random variables are involved. A better term would be "most plausible," and the choice should include mechanistic considerations, if appropriate.

(b-2) Forward Selection Method
One begins with an equation containing no regressors (i.e., a constant model). The model is then augmented by including the regressor variable with the highest simple correlation with the response variable (or the one which will increase R² by the largest amount). If this regression coefficient is significantly different from zero, it is retained, and the search for a second variable is made. This process of adding regressors one by one is terminated when the last variable entering the equation is not statistically significant or when all the variables are included in the model. Clearly, this approach involves fitting many more models than the backward elimination method.

(b-3) Stepwise Regression Method
Prior to the advent of resampling methods (Sect. 5.8), stepwise regression was the preferred model-building approach. It combines both the procedures discussed above. Stepwise regression begins by computing correlation coefficients between the response and each predictor variable (like the forward selection method). The variable most highly correlated with the response is then allowed to "enter the regression equation." The parameter for the single-variable regression equation is then estimated along with a measure of the goodness of fit. The next most highly correlated predictor variable is then identified, given the current variable already in the regression equation. This variable is allowed to enter the equation, and the parameters are re-estimated along with the goodness of fit. Following each parameter estimation, t-values for each parameter are calculated and compared to the t-critical value to determine whether all parameters are still statistically significant. Any parameter that is not statistically significant is removed from the regression equation. This process continues until no more variables "enter" or "leave" the regression equation.

In general, it is best to select the model that yields a reasonably high goodness of fit for the fewest parameters in the model (referred to as model parsimony). The final decision on model selection requires the judgment of the model builder based on mechanistic insights into the problem. Again, one must guard against


the danger of overfitting by performing a cross-validation check (Sect. 5.8). When a black-box model containing several regressors is used, stepwise regression would improve the robustness of the identified model by reducing the number of regressors and, thus, hopefully reduce the adverse effects of multicollinearity between the remaining regressors. Many packages use the F-test, indicative of the overall model, instead of the t-test on individual parameters to perform the stepwise regression. It is suggested that stepwise regression not be used in case the regressors are highly correlated, since it may result in nonrobust models. However, the backward procedure is said to handle such situations better than the forward selection procedure.

A note of caution is warranted in using stepwise regression for engineering models based on mechanistic considerations. In certain cases, stepwise regression may omit a regressor which ought to be influential when using a particular data set, while the regressor is picked up when another data set is used. This may pose a dilemma when the model is to be used for subsequent predictions. In such cases, discretion based on physical considerations should trump purely statistical model building. Resampling methods, such as the cross-validation method (Sect. 5.8.2), can be used as the final judge to settle on the most appropriate model among a relatively small number of variable subsets found by automatic model selection. A sounder estimate of the model prediction error is also directly provided.

Example 5.7.3⁹ Proper model identification with multivariate regression models
An example of multivariate regression is the development of model equations to characterize the performance of refrigeration compressors. It is possible to regress the compressor manufacturer's tabular data of compressor performance using the following simple bi-quadratic formulation (see Fig. 5.14 for nomenclature):

y = b0 + b1·Tcho + b2·Tcdi + b3·Tcho² + b4·Tcdi² + b5·Tcho·Tcdi   (5.48a)

where y represents either the compressor power (Pcomp) or the cooling capacity (Qch). OLS is then used to develop estimates of the six model parameters, b0 to b5, based on the compressor manufacturer's data.

The biquadratic model was used to estimate the parameters for compressor cooling capacity (in refrigeration tons)¹⁰ for a screw compressor. The model and its corresponding parameter estimates are given below. Although the overall curve fit for the data was excellent (R² = 99.96%), the t-values of b2 and b4 were clearly insignificant, indicating that the first-order term of Tcdi and the second-order term Tcdi² should be dropped.

As an aside, it has been suggested by some authors that one should "maintain hierarchy," that is, include in the model lower-order terms of a specific variable even if found to be statistically insignificant by stepwise analysis or an analysis akin to the one done above. This practice is not universally accepted, though, and is better left to the discretion of the analyst.

A second-stage regression is done by omitting these regressors, resulting in the following model, with coefficients and t-values shown in Table 5.5:

y = b0 + b1·Tcho + b3·Tcho² + b5·Tcho·Tcdi   (5.48b)

Table 5.5 Results of the first- and second-stage model building (Example 5.7.3)

Coefficient | With all parameters: Value | t-value | With significant parameters only: Value | t-value
b0          | 152.50    | 6.27   | 114.80   | 73.91
b1          | 3.71      | 36.14  | 3.91     | 11.17
b2          | −0.335    | −0.62  | –        | –
b3          | 0.0279    | 52.35  | 0.027    | 14.82
b4          | −0.000940 | −0.32  | –        | –
b5          | −0.00683  | −6.13  | −0.00892 | −2.34

All the parameters in the simplified model are significant, and the overall model fit remains excellent: R² = 99.5%. ■

9 From ASHRAE (2005) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org
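The two-stage pruning of Example 5.7.3 is essentially the backward elimination method in miniature. A generic sketch of the loop is given below (toy synthetic data, not the ASHRAE compressor data; a fixed cutoff of 2.0 stands in for the exact t-critical value). Only one of the three candidate regressors truly drives the response, and the loop drops the least significant regressor at each pass until all survivors are significant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)   # irrelevant candidate regressor
x3 = rng.uniform(0, 10, n)   # irrelevant candidate regressor
y = 2.0 + 3.0 * x1 + rng.normal(0, 0.5, n)

names = ["x1", "x2", "x3"]
cols = [x1, x2, x3]

def fit(cols):
    """OLS fit; returns coefficients and their t-values."""
    X = np.column_stack([np.ones(n)] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, b / se

T_CRIT = 2.0   # stand-in for the t-critical value at ~95% confidence
while True:
    b, t = fit(cols)
    tvals = np.abs(t[1:])                 # skip the intercept
    worst = int(np.argmin(tvals))
    if tvals[worst] >= T_CRIT or len(cols) == 1:
        break
    names.pop(worst)                      # drop the least significant regressor
    cols.pop(worst)                       # and refit the reduced model

print(names, np.round(b, 2))
```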

5.5 Applicability of OLS Parameter Estimation

5.5.1 Assumptions

The term "least squares" regression is generally applied to linear models, although the concept can be extended to nonlinear functions as well. The ordinary least squares (OLS) regression method is a special and important sub-class whose parameter estimates are best only when a number of conditions regarding the functional form and the model residuals or errors are met (discussed below). It enables simple (univariate) or multivariate linear regression models to be identified from data, which can then be used for future prediction of the response variable along with its uncertainty intervals. It also allows statistical statements to be made about the estimated model parameters, a process known as "inference."

10 1 Ton of refrigeration = 12,000 Btu/h.

No statistical assumptions are used to obtain the OLS estimators for the model coefficients. When nothing is known regarding the measurement errors, OLS is often the best choice for estimating the parameters. However, to make statistical statements about these estimators and the model predictions, it is necessary to acquire information regarding the measurement errors. Ideally, one would like the error terms (or residuals) εi to be normally distributed, without serial correlation, with mean zero and constant variance. The implications of each of these four assumptions, as well as a few additional ones, are briefly addressed below, since violations may lead to biased coefficient estimates, to distorted estimates of the standard errors and confidence intervals, and to improper conclusions from statistical tests.

(a) Errors should have zero mean: If this is not true, the OLS estimator of the intercept will be biased. The impact of this assumption not being correct is generally viewed as the least critical among the various assumptions. Mathematically, this implies that the expected error values E(εi) = 0.

(b) Errors should be normally distributed: If this is not true, statistical tests and confidence intervals are incorrect for small samples, though the OLS coefficient estimates are unbiased. Fig. 5.3, which illustrates this behavior, has already been discussed. This problem can be avoided by having larger samples and verifying that the model is properly specified.

(c) Errors should have constant variance: var(εi) = σ². Violation of this basic OLS assumption results in increased standard errors of the estimates and widened model prediction confidence intervals (though the OLS estimates themselves are unbiased). In this sense, there is a loss in statistical power. This condition, in which the variance of the residuals or error terms is not constant, is called heteroscedasticity and is discussed further in Sect. 5.6.3.

(d) Errors should not be serially correlated: This violation is equivalent to having fewer independent data points and also results in a loss of statistical power, with the same consequences as (c) above. Serial correlations may occur due to the manner in which the experiment is carried out. Extraneous factors, that is, factors beyond our control (such as the weather), may leave little or no choice as to how the experiments are executed. An example of a reversible experiment is the classic pipe-friction experiment, where the flow through a pipe is varied to cover both laminar and turbulent flows, and the associated friction drops are observed. Gradually increasing the flow one way (or decreasing it the other


way) may introduce biases in the data, which will subsequently also bias the model parameter estimates. In other circumstances, certain experiments are irreversible. For example, the loading on a steel sample to produce a stress–strain plot must be performed by gradually increasing the loading till the sample breaks; one cannot proceed in the other direction. Usually, the biases brought about by the test sequence are small, and this may not be crucial. In mathematical terms, this condition, for a first-order case, can be written as the expected value of the product of two consecutive errors: E(εi · εi+1) = 0. This assumption, which is said to be the hardest to verify, is further discussed in Sect. 5.6.4.

(e) Errors should be uncorrelated with the regressors: The consequences of this violation are that OLS coefficient estimates are biased and the predicted OLS confidence intervals understated, that is, narrower. This violation is a very important one and is often due to "mis-specification error" or underfitting. Omission of influential regressor variables and improper model formulation (assuming a linear relationship when it is not) are likely causes. This issue is discussed at more length in Sect. 5.6.5.

(f) Regressors should not have any measurement error: Violation of this assumption in some (or all) regressors will result in biased OLS coefficient estimates for those (or all) regressors. The model can be used for prediction, but the confidence intervals will be understated. Strictly speaking, this assumption is hardly ever satisfied since there is always some measurement error. However, in most engineering studies, measurement errors in the regressors are not large compared to the random errors in the response, and so this violation may not have important consequences. As a rough rule of thumb, this violation becomes important when the errors in x reach about a fifth of the random errors in y, and when multicollinearity is present. If the errors in x are known, there are procedures that allow unbiased coefficient estimates to be determined (see Sect. 9.4.6). Mathematically, this condition is expressed as Var(xi) = 0.

(g) Regressor variables should be independent of each other: This violation applies to models identified by multiple regression when the regressor variables are correlated with each other (called multicollinearity). This is true even if the model provides an excellent fit to the data. Estimated regression coefficients, though unbiased, will tend to be unstable (their values tend to change greatly when a data point is dropped or added), and the OLS standard errors and the prediction intervals will be understated. Multicollinearity is likely to be a problem only when one (or more) of the correlation coefficients among the regressors exceeds 0.8–0.85 or so. Sect. 9.3 deals with this issue at more length.
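Point (g) can be illustrated with a small numeric sketch (toy data, illustrative only): the diagonal of (XᵀX)⁻¹, which scales the coefficient variances through Var(b) = σ²(XᵀX)⁻¹, inflates dramatically when a second regressor is nearly collinear with the first.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.uniform(0, 1, n)

def coef_var_factor(x2):
    """Diagonal of (X'X)^(-1): Var(b) = sigma^2 times these factors."""
    X = np.column_stack([np.ones(n), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

x2_indep = rng.uniform(0, 1, n)             # essentially uncorrelated with x1
x2_coll  = x1 + rng.normal(0, 0.01, n)      # nearly collinear with x1

v_indep = coef_var_factor(x2_indep)
v_coll  = coef_var_factor(x2_coll)
print(v_indep[1], v_coll[1])                # variance factor for b1 in each case
```

The variance factors for b1 and b2 in the collinear case come out far larger than in the uncorrelated case, even though the fit to the data itself may look excellent.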

5.5.2 Sources of Errors During Regression

Perhaps the most crucial issue during parameter identification is the type of measurement inaccuracy present. This has a direct influence on the estimation method to be used. Though statistical theory has neatly classified this behavior into a finite number of groups, the data analyst is often stymied by data which do not fit into any one category, and the advocated remedial action does not always entirely remove the adverse data conditioning. A certain amount of experience is required to surmount this type of adversity, which, further, is circumstance-specific. As discussed earlier, there can be two types of errors:

(a) Measurement error. The following sub-cases can be identified depending on where the error occurs:

(i) In the dependent variable, in which case the model form is:

yi + δi = β0 + β1·xi   (5.49a)

or in the regressor variable, in which case the model form is:

yi = β0 + β1·(xi + γi)   (5.49b)

or in both the dependent and regressor variables:

yi + δi = β0 + β1·(xi + γi)   (5.49c)

Further, the errors δ and γ (which will be jointly represented by ε) can have an additive error, in which case εi ≠ f(yi, xi), or a multiplicative error, εi = f(yi, xi), or, worse still, a combination of both. Section 9.4.1 discusses this issue further.

(b) Model misspecification error: How this would affect the model residuals εi is difficult to predict and is extremely circumstance-specific. Misspecification could be due to several factors, for example:
(i) one or more important variables have been left out of the model, or
(ii) the functional form of the model is incorrect.

Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model (see Sect. 9.2). This is likely to introduce both bias and random noise in the parameter estimation process except when the model R² is very high (R² > 0.9). This issue is further discussed in Sect. 5.6. Formal statistical procedures do not explicitly treat this case but limit themselves to type (a) errors and, more specifically, to case (i), assuming purely additive or multiplicative errors. The implicit assumptions in OLS and their implications, if violated, are described in Sect. 5.5.1.

5.6 Model Residual Analysis and Regularization¹¹

5.6.1 Detection of Ill-Conditioned Behavior

The availability of statistical software has resulted in routine and easy application of OLS to multiple linear models. However, there are several underlying assumptions that affect the individual parameter estimates of the model as well as the overall model itself. Once a model has been identified, the general tendency of the analyst is to hasten to use the model for whatever purpose intended. However, it is extremely important (and this phase is often overlooked) that an assessment of the model be done to determine whether the OLS assumptions are met; otherwise, the model is likely to be deficient or misspecified and yield misleading results. In the last 50 years or so, much progress has been made on how to screen model residual behavior to gain diagnostic insight into model deficiency or misspecification, take remedial action, or adopt more advanced regression techniques.¹² Some of the simple methods to screen and correct for ill-behaved model residuals are presented in this chapter, while more advanced statistical concepts and regression methods are addressed in Chap. 9.

11 Herschel: ". . . almost all of the greatest discoveries in astronomy have resulted from the consideration of what . . . (was) termed residual phenomena."
12 Unfortunately, many of these techniques are not widely used by those involved in energy-related data analysis.

A few idealized plots illustrate some basic patterns of improper residual behavior, which are addressed in more detail in the later sections of this chapter. Fig. 5.15 illustrates the effect of omitting an important dependence, which suggests that an additional variable distinguishing between the two groups is to be introduced in the model. The presence of outliers and the need for more robust regression schemes which are immune to such outliers are illustrated in Fig. 5.16. The presence of nonconstant variance (or heteroscedasticity) in the residuals is a very common violation, and one of several possible manifestations is shown in Fig. 5.17. This particular residual behavior is likely to be remedied by using a log transform of the response variable instead of the variable itself. Another approach is to use the weighted least squares estimation procedures described later in this section. Though nonconstant variance is easy to detect visually, its cause is difficult to identify. Fig. 5.18 illustrates a typical behavior that arises when a

Fig. 5.15 The residuals can be separated into two distinct groups (shown as crosses and dots), which suggests that the response variable is related to another regressor not considered in the regression model. This improper residual pattern can be rectified by reformulating the model to include this additional variable. One example of such a time-based event system change is shown in Fig. 8.17

Fig. 5.16 Outliers indicated by crosses suggest that the data should be checked and/or robust regression used instead of OLS

Fig. 5.17 Residuals with a bow shape and increased variability (i.e., the error increases as the response variable y increases) can often be rectified by a log transformation of y

Fig. 5.18 Bow-shaped residuals can often be rectified by evaluating higher-order linear models

Fig. 5.19 Serial correlation is indicated by a pattern in the residuals when plotted in the sequence the data was collected, that is, when plotted against time, even though time may not be a regressor in the model
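A numerical companion to the visual check of Fig. 5.19 is sketched below (synthetic residuals; the AR(1) coefficient of 0.8 is arbitrary). It computes the lag-1 autocorrelation of a residual sequence, which is essentially the quantity that the Durbin-Watson test discussed later formalizes, and contrasts serially correlated residuals against white noise.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Synthetic serially correlated residuals: e[i] = 0.8*e[i-1] + white noise
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal(0, 1.0)

white = rng.normal(0, 1.0, n)   # uncorrelated residuals, for comparison

def lag1_corr(r):
    """Lag-1 autocorrelation of a residual sequence."""
    r = r - r.mean()
    return (r[1:] @ r[:-1]) / (r @ r)

print(round(lag1_corr(e), 2), round(lag1_corr(white), 2))
```

A lag-1 autocorrelation well away from zero flags the kind of patterned residuals described in the text, while white-noise residuals give a value close to zero.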

linear function is used to model a quadratic variation. The proper corrective action will increase the predictive accuracy of the model (RMSE will be lower), result in the estimated parameters being more efficient (i.e., lower standard errors), and most importantly, allow more sound and realistic interpretation of the model prediction uncertainty bounds. Figure 5.19 illustrates the occurrence of serial correlations in time series data which arises when the error terms are not independent. Such patterned residuals occur commonly during model development and provide useful insights into

model deficiency. Serial correlation (or autocorrelation) has special pertinence to time series data (or data ordered in time) collected from in-situ performance of mechanical and thermal systems and equipment. Autocorrelation is present if adjacent residuals show a trend or a pattern of clusters above or below the zero value that can be discerned visually. Such correlations can either suggest that additional variables have been left out of the model (model-misspecification error) or could be due to the nature of the process itself (dynamic nature of the process—further treated in Chap. 8 on time series analysis). The latter is due to the fact that equipment loading over a day would follow an overall cyclic curve (as against random jumps from say full load to half load) consistent with the diurnal cycle and the way the system is operated. In such cases, positive residuals would tend to be followed by positive residuals, and vice versa. Problems associated with model underfitting and overfitting are usually the result of a failure to identify the non-random pattern in time series data. Underfitting does not capture enough of the variation in the response variable which the corresponding set of regressor variables can possibly explain. For example, all four models fit to their respective sets of data as shown in Fig. 5.20 have identical R2 values and t-statistics but are distinctly different in how they capture the data variation. Only plot (a) can be described by a linear model. The data in (b) need to be fitted by a higher-order model, while one data point in (c) and (d) distorts the entire

5.6 Model Residual Analysis and Regularization


Fig. 5.20 Plot of the data (x, y) with the fitted lines for four data sets. The models have identical R2 and t-statistics but only the first model is a realistic model (From Chatterjee and Price 1991, with permission from John Wiley and Sons)

model. Blind model fitting (i.e., relying only on model statistics) is, thus, inadvisable. This aspect is further discussed in Sect. 5.6.5. Overfitting implies capturing randomness in the model, that is, attempting to fit the noise in the data. A rather extreme example is when one attempts to fit a model with six parameters to six data points which have some inherent experimental error. The model has zero degrees of freedom and the set of six equations can be solved without error (i.e., RMSE = 0). This is clearly unphysical because the model parameters have also "explained" the random noise in the observations in a deterministic manner. Both underfitting and overfitting can be detected by performing certain statistical tests on the residuals. The most commonly used test for white noise (i.e., uncorrelated residuals) involving model residuals is the Durbin-Watson (DW) statistic defined by:

DW = Σ_{i=2}^{n} (ε_i - ε_{i-1})² / Σ_{i=1}^{n} ε_i²   (5.50a)

where ε_i is the residual at time interval i, defined as ε_i = y_i - ŷ_i. An approximate relationship can also be used (Chatterjee and Price 1991):

DW ≈ 2(1 - r)   (5.50b)

where r is the correlation coefficient (Eq. 3.12) between time-lagged residuals. If there is no serial or autocorrelation present, the expected value of DW is 2 (the limiting range being 0–4). The closer DW is to 2, the stronger the evidence that there is no autocorrelation in the data. If the model underfits, DW < 2, while DW > 2 indicates an overfitted model. Tables are available for approximate significance tests with different numbers of regressor variables and data points. Table A.13 assembles lower and upper critical values of the DW statistic to test autocorrelation. These apply to positive DW values; if, however, a test is to be conducted with negative DW values, the quantity (4 - DW) should be used. For example, if n = 20 and the model has three variables (p = 3), the null hypothesis that the correlation coefficient is equal to zero can be rejected at the 0.05 significance level if its value is either below 1.00 or above 1.68. Note that the critical values in the table are one-sided, that is, they apply to a one-tailed distribution. It is important to note that the DW statistic is only sensitive to correlated errors in adjacent observations, that is, when only first-order autocorrelation is present. For example, if the time series has seasonal patterns, then higher-order autocorrelations may be present which the DW statistic will be unable to detect. More advanced concepts and modeling are discussed in Sect. 8.5.3 while treating stochastic time series data.
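The DW statistic of Eq. 5.50a is straightforward to compute directly from a set of model residuals. The sketch below uses numpy; the white-noise and AR(1) residual series are made-up illustrations, not data from the book:

```python
import numpy as np

# Durbin-Watson statistic (Eq. 5.50a) computed directly from model residuals
def durbin_watson(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
n = 100

# White-noise residuals: DW should fall close to 2
white = rng.normal(0.0, 1.0, n)

# Positively autocorrelated AR(1) residuals: DW should fall well below 2
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + rng.normal(0.0, 1.0)

print(durbin_watson(white))  # near 2
print(durbin_watson(ar1))    # well below 2
```

This mirrors the approximate relation DW ≈ 2(1 - r) of Eq. 5.50b: white noise has r ≈ 0, while the AR(1) series has r ≈ 0.8.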

5.6.2 Leverage and Influence Data Points

Most of the aspects discussed above relate to identifying general patterns in the residuals of the entire data set. Another issue is the ability to identify subsets of data that have an unusual or disproportionate influence on the estimated model in terms of parameter estimation. Being able to flag such influential subsets of individual points allows one to investigate their validity, or to glean insights for better experimental design since they may contain the most interesting system behavioral information. Note that such points are not necessarily "bad" data points which should be omitted; rather, they are to be viewed as "distinctive" observations in the overall data set. It is useful to provide a geometrical understanding of outlier points and their potential impact on the model parameter estimates. No matter how carefully an experiment is designed and performed, there always exists the possibility of serious errors. These errors could be due to momentary instrument malfunction (say, dirt sticking onto a paddle wheel of a flow meter), power surges (which may cause data-logging errors), or the engineering system deviating from its intended operation due to random disturbances. Usually, it is difficult to pinpoint the cause of the anomalies. The experimenter is often not fully sure whether the outlier is anomalous, or whether it is a valid or legitimate data point which does not conform to what the experimenter "thinks" it should. In such cases, throwing out a data point may amount to data "tampering" or fudging of results. Usually, data that exhibit such anomalous tendencies are a minority. Even then, if the data analyst retains these questionable observations, they can bias the results of the entire analysis since they exert an undue influence and can dominate a computed relationship between two variables.

Fig. 5.21 Illustrating different types of outliers. Point A is very probably a doubtful point; point B might be bad but could potentially be a very important point in terms of revealing unexpected behavior; point C is close enough to the general trend and should be retained until more data is collected


Consider the case of outliers during regression for the univariate case. Data points are said to be outliers when their model residuals are large relative to those of the other points. A visual investigation can help one distinguish between endpoints and center points (this is the intent of exploratory data analysis, Sect. 3.5). For example, point A of Fig. 5.21 is quite obviously an outlier, and if the rejection criterion orders its removal, one should proceed to do so. On the other hand, point B, which is near the end of the data domain, may not be a bad point at all, but merely the beginning of a new portion of the curve (say, the onset of turbulence in an experiment involving laminar flow). Similarly, even point C may be valid and important. Hence, the only way to remove this ambiguity is to take more observations at the lower end. Thus, a simple heuristic is to reject points only when they are center points. Several advanced books present formal statistical treatments of outliers in a regression context. One can diagnose whether the data set is ill-conditioned or not, as well as identify and reject, if needed, the outliers that cause ill-conditioning during the model-building process (e.g., Belsley et al. 1980). Consider Fig. 5.22a. The outlier point will have little or no influence on the regression parameters identified, and in fact retaining it would be beneficial since it would lead to a reduction in model parameter variance. The behavior shown in Fig. 5.22b is more troublesome because the estimated slope is almost wholly determined by the extreme point. One may view this situation as a data set with only two data points, or one may view the single point as a spurious point and remove it from the analysis. Gathering more data in that range would be advisable but may not be feasible; this is where the judgment of the analyst or prior information about the underlying trend line is useful.
How and to what extent each of the data points affects the outcome of the regression line will


Fig. 5.22 Two other examples of outlier points. While the outlier point in (a) is most probably a valid point, it is not clear for the outlier point in (b). Either more data must be collected or, failing that, it is advisable to delete this point from any subsequent analysis (From Belsley et al. (1980) by permission of John Wiley and Sons)

determine whether that particular point is an influence point or not. Scatter plots often reveal such outliers easily for single-regressor situations but are inappropriate for multivariate cases. Hence, several statistical measures have been proposed to deal with multivariate situations, the influence and leverage indices being widely used (Belsley et al. 1980; Cook and Weisberg 1982; Chatterjee and Price 1991). The leverage of a datum point quantifies the extent to which that point is "isolated" in the x-space, that is, its distinctiveness in terms of the regressor variables. Consider the following symmetric matrix (called the hat matrix):

H = X (X^T X)^{-1} X^T = (p_ij)   (5.51)

where X is a data matrix with n rows (n is the number of observations) and p columns (given by Eq. 5.31). The order of the H matrix would be (n × n), that is, equal to the number of observations. The diagonal element p_ii is defined as the leverage of the ith data point. Since the diagonal elements can be related to the distance between X_i and x̄, with values between 0 and 1, their average value is equal to (p/n). Points with p_ii > [3(p/n)] are regarded as points with high leverage (sometimes the threshold is taken as [2(p/n)]). Large residuals are traditionally used to highlight suspect data points or data points unduly affecting the regression model. Instead of looking at the residuals ε_i, it is more meaningful to study a normalized or scaled value, namely the R-student residuals:

R-student_i = ε_i / [RMSE · (1 - p_ii)^{1/2}]   (5.52)

Thus, studentized residuals measure how many standard deviations each observed value deviates from a model fitted using all of the data except that observation. Points with | R-student| > 3 can be said to be influence points that

correspond to a significance level of 0.01. Sometimes a less conservative value of 2 is used, corresponding to the 0.05 significance level, with the underlying assumption that the residuals or errors are Gaussian. A data point is said to be influential if its deletion, singly or in combination with a relatively few others, causes statistically significant changes in the fitted model coefficients. There are several measures used to describe influence; a common one is DFITS:

DFITS_i = [ε_i (p_ii)^{1/2}] / [s_i (1 - p_ii)^{1/2}]   (5.53)

where ε_i is the residual error of observation i, and s_i is the standard deviation of the residuals without considering the ith residual. Points with |DFITS_i| ≥ 2[p/(n - p)]^{1/2} are flagged as influential points. It is advisable to identify points with high leverage, and then examine them in terms of the R-student statistic and the DFITS index for a final determination. Influential observations can impact the final regression model in different ways (Hair et al. 1998). For example, in Fig. 5.23a, the model residuals are not significant, and the two influential observations shown as filled dots reinforce the general pattern in the model and lower the standard error of the parameters and of the model prediction. Thus, the two points would be considered leverage points that are beneficial to model building. Influential points that adversely impact model building are illustrated in Fig. 5.23b, c. In the former, the two influential points almost totally account for the observed relationship but would not have been identified as outlier points. In Fig. 5.23c, the two influential points have totally altered the model identified, and the actual data points would have shown up as points with large residuals which the analyst would probably have identified as spurious. The next frame (d) illustrates the instance when an influential point changes the intercept of the model but leaves the


Fig. 5.23 (a–f) Common patterns of influential observations (From Hair et al. 1998 by permission of Pearson Education)

slope unaltered. The two final frames, Fig. 5.23e, f, illustrate two cases, hard to identify and rectify, in which two influential points reinforce each other in altering both the slope and the intercept of the model, though their relative positions are very different. Note that data points which satisfy both these statistical criteria, that is, are both influential and have high leverage, are the ones worthy of closer scrutiny. Most statistical programs have the ability to flag such points, and hence performing this analysis is fairly straightforward. Thus, in summary, individual data points can be outlier, leverage, or influential points. The leverage of a point is a measure of how unusually the point lies in the x-space. As mentioned above, just because a point has high leverage does not make it influential. An influence point is one that has an important effect on the regression model if that particular point were to be removed from the data set. Influential points are the ones that need particular attention since they provide insights about the robustness of the fit. In any case, all three measures (leverage p_ii, DFITS, and R-student) provide indications as to the role played by different observations toward the overall model fit. Ultimately, the decision to either retain or reject such points is somewhat based on judgment.
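The leverage diagnostics of Eq. 5.51 and the scaled R-student residuals of Eq. 5.52 can be sketched directly with numpy, as below. The data set, sample size, and corruption of the last observation are made-up assumptions for illustration; computing DFITS (Eq. 5.53) would additionally require the deleted-residual standard deviation s_i, which is omitted here for brevity:

```python
import numpy as np

# Made-up linear data with one corrupted endpoint observation
rng = np.random.default_rng(1)
n = 30
x = np.linspace(1, 10, n)
y = 10 + 1.5 * x + rng.normal(0, 1, n)
y[-1] += 20.0  # corrupt the last observation

X = np.column_stack([np.ones(n), x])  # data matrix with p = 2 columns
p = X.shape[1]

# Hat matrix H = X (X'X)^-1 X' (Eq. 5.51); its diagonal gives the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# OLS fit and residuals
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
rmse = np.sqrt(np.sum(e ** 2) / (n - p))

# Scaled R-student-type residuals (Eq. 5.52)
r_student = e / (rmse * np.sqrt(1 - leverage))

high_leverage = leverage > 3 * p / n   # leverage threshold from the text
influential = np.abs(r_student) > 3    # 0.01 significance criterion

print(np.where(influential)[0])        # the corrupted point should be flagged
```

Here the corrupted endpoint should be flagged by the R-student criterion while remaining below the 3(p/n) leverage threshold, mirroring the influence-versus-leverage distinction drawn in the text.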

Table 5.6 Data table for Example 5.6.1a

x    y1         y1 (last value corrupted)
1    11.69977   11.69977
2    12.72232   12.72232
3    16.24426   16.24426
4    19.27647   19.27647
5    21.19835   21.19835
6    23.73313   23.73313
7    21.81641   21.81641
8    25.76582   25.76582
9    29.09502   29.09502
10   28.9133    50

a Data available electronically on book website

Example 5.6.1 Example highlighting different characteristics of residuals versus influence points
Consider the following made-up data (Table 5.6) where x ranges from 1 to 10, and the model is y = 10 + 1.5x, to which random normal noise ε = N[0, σ = 1] has been added to give y1 (second column). The response of the last observation has been intentionally corrupted to a value of 50 as shown (say, due to a momentary spike in power supply to the instrument).


Fig. 5.24 (a) Observed versus predicted plot. (b) Residual versus regressor plot

How well a linear model fits the data is depicted in Fig. 5.24. Not surprisingly, the table of unusual residuals shown below does include the last observation since its Studentized absolute residual value is greater than 3.0 (99% CL). It has been flagged as an influential point since it has a major impact on the model coefficients. However, the same point has not been flagged as a leverage point since it is not "isolated" in the x-space. This example serves to highlight the different impacts of leverage versus influence points. ■

Influential points flagged by the statistical package:

Row   x      y      Predicted y   Residual   Studentized residual
10    10.0   50.0   37.2572       12.743     11.43

5.6.3 Remedies for Nonuniform Residuals

Nonuniform model residuals or heteroscedasticity can be due to: (i) the nature of the process investigated, (ii) noise in the data, or (iii) the method of data collection from samples that are known to have different variances. Three possible generic remedies for nonconstant variance are to (Chatterjee and Price 1991):

(a) Introduce additional variables into the model and collect new data. The physics of the problem along with model residual behavior can shed light on whether certain key variables, left out of the original fit, need to be introduced or not. This aspect is further discussed in Sect. 5.6.5.

(b) Transform the dependent variable. This is appropriate when the errors in measuring the dependent variable follow a probability distribution whose variance is a function of the mean of the distribution. In such cases, the model residuals are likely to exhibit heteroscedasticity that can be removed by using exponential, Poisson, or binomial transformations. For example, a variable that is distributed binomially with parameters n and p has mean (np) and variance [np(1 - p)] (Sect. 2.4.2). For a Poisson variable, the mean and variance are equal. The transformations shown in Table 5.7 will stabilize variance, and the distribution of the transformed variable will be closer to the normal distribution. The logarithmic transformation is also widely used in certain cases to transform a nonlinear model into a linear

Table 5.7 Transformations in dependent variable y likely to stabilize nonuniform model variance

Distribution   Variance of y in terms of its mean μ   Transformation
Poisson        μ                                      y^(1/2)
Binomial       μ(1 - μ)/n                             sin^{-1}(y^(1/2))

Table 5.8 Data table for Example 5.6.2 Obs # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 a

x 294 247 267 358 423 311 450 534 438 697 688 630 709 627

Y 30 32 37 44 47 49 56 62 68 78 80 84 88 97

Obs # 15 16 17 18 19 20 21 22 23 24 25 26 27

x 615 999 1022 1015 700 850 980 1025 1021 1200 1250 1500 1650

y 100 109 114 117 106 128 130 160 97 180 112 210 135

Data available electronically on book website

one (see Sect. 9.4.3). When the variables have a large standard deviation compared to the mean, working with the data on a log scale often has the effect of dampening variability and reducing asymmetry. This is often an effective means of removing heteroscedasticity as well. However, this approach is valid only when the magnitude of the residuals increases (or decreases) with that of one of the variables.

Example 5.6.2 Example of variable transformation to remedy improper residual behavior
The following example serves to illustrate the use of variable transformation. Table 5.8 shows data/observations from 27 departments in a university with y as the number of faculty and staff and x the number of students. A simple linear regression yields a model with R² = 77.6% and RMSE = 21.73. However, the residuals reveal unacceptable behavior with a strong funnel shape (see Fig. 5.25a). Instead of a linear model in y, a linear model in ln(y) is investigated. In this case, the model R² = 76.1% and RMSE = 0.25. However, these statistics should NOT be compared directly with the previous indices since the response variable is no longer the same (in one case it is y; in the other, ln(y)). Leaving this issue aside for now, notice that a first-order model does reduce some of the improper residual variance, but an inverted "U" shape can still be detected, indicating model misspecification (see Fig. 5.25b).
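The ln transformation with a quadratic in x discussed in this example can be reproduced with an ordinary polynomial least squares fit. The sketch below uses the Table 5.8 observations and numpy.polyfit; the resulting coefficients should be close to those of the quadratic model quoted in the text:

```python
import numpy as np

# Table 5.8 observations: x = number of students, y = number of faculty and staff
x = np.array([294, 247, 267, 358, 423, 311, 450, 534, 438, 697, 688, 630, 709,
              627, 615, 999, 1022, 1015, 700, 850, 980, 1025, 1021, 1200, 1250,
              1500, 1650], dtype=float)
y = np.array([30, 32, 37, 44, 47, 49, 56, 62, 68, 78, 80, 84, 88, 97, 100, 109,
              114, 117, 106, 128, 130, 160, 97, 180, 112, 210, 135], dtype=float)

# Quadratic model in x for ln(y); polyfit returns highest-degree coefficient first
c2, c1, c0 = np.polyfit(x, np.log(y), deg=2)
print(c0, c1, c2)  # intercept, linear, and quadratic coefficients

# Residuals of the transformed model (compare with the well-behaved Fig. 5.25c)
resid = np.log(y) - np.polyval([c2, c1, c0], x)
```

Note that the goodness-of-fit of this model must be judged on the ln(y) scale, for the reason stated in the example.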


Finally, using a quadratic model along with the ln transformation results in the model:

ln(y) = 2.8516 + 0.00311267x - 0.00000110226x²   (5.54)

The residuals shown in Fig. 5.25c are now quite well behaved as a result of such a transformation. ■

(c) Perform weighted least squares. This approach is more flexible, and several variants exist (Chatterjee and Price 1991). As described earlier, OLS model residual behavior can exhibit nonuniform variance (called heteroscedasticity) even if the model is structurally complete, that is, not misspecified. This violates one of the standard OLS assumptions. In a multiple regression model, the detection of heteroscedasticity may not be very straightforward since only one or two variables may be the culprits. Examination of the residuals versus each variable in turn, along with intuition and understanding of the physical phenomenon being modeled, can be of great help. Otherwise, the OLS estimates will lack precision, and the estimated standard errors of the model parameters will be wider. If this phenomenon occurs, the model identification should be redone with explicit recognition of this fact. During OLS, the sum of the squared model residuals of all points is minimized with no regard to the values of the individual points or to points from different domains of the range of variability of the regressors. The basic concept of weighted least squares (WLS) is to simply assign different weights to different points according to a certain (rational) statistical scheme. The magnitude of the weight of an observation indicates the importance to be given to that observation. A note of caution is that the weights are not known exactly, and assigning values to them is somewhat circumstance-specific. The general formulation of WLS is that the following function should be minimized:

WLS function = Σ_i w_i (y_i - β_0 - β_1 x_1i - ⋯ - β_p x_pi)²   (5.55)

where w_i are the weights of individual points. These are formulated differently depending on the weighting scheme selected.

(c-i) Errors Are Proportional to x Resulting in Funnel-Shaped Residuals
Consider the simple model y = α + βx + ε whose residuals ε have a standard deviation that increases with the regressor variable (resulting in the funnel-like shape in Fig. 5.26). Dividing the terms of the model by x results in:


Fig. 5.25 (a) Residual plot of linear model. (b) Residual plot of log-transformed linear model. (c) Residual plot of log-transformed quadratic model (Eq. 5.54)

y/x = α/x + β + ε/x   or   y′ = α x′ + β + ε′   (5.56)

Fig. 5.26 Type of heteroscedastic model residual behavior which arises when errors are proportional to the magnitude of the x variable

with the variance of ε′ becoming constant and equal to k². This is akin to weighting different vertical slices of the regressor variable by var(ε_i) = k² x_i². If the assumption about the weighting scheme is correct, the transformed model will be homoscedastic, and the model parameters α and β will be efficiently estimated by OLS (i.e., the standard errors of the estimates will be optimal). The above transformation is only valid when the model residuals behave as shown in Fig. 5.26. If the residuals behave differently, then different transformations or weighting schemes should be explored. Whether a particular transformation is adequate or not can only be gauged by the behavior of the variance of the residuals. Note that the analyst must


Table 5.9 Measured x and y variables, OLS residuals deduced from Eq. 5.58a, and the weights calculated from Eq. 5.58b (Example 5.6.3)a

x       y       OLS residual ε_i   w_i
1.15    0.99     0.26329           0.9882
1.90    0.98    -0.59826           1.7083
3.00    2.60    -0.2272            6.1489
3.00    2.67    -0.1572            6.1489
3.00    2.66    -0.1672            6.1489
3.00    2.78    -0.0472            6.1489
3.00    2.80    -0.0272            6.1489
5.34    5.92     0.435964         15.2439
5.38    5.35    -0.17945          13.6185
5.40    4.33    -1.22216          12.9092
5.40    4.89    -0.66216          12.9092
5.45    5.21    -0.39893          11.3767
7.70    7.68    -0.48358           0.9318
7.80    9.81     1.53288           0.8768
7.81    6.52    -1.76847           0.8716
7.85    9.71     1.37611           0.8512
7.87    9.82     1.463402          0.8413
7.91    9.81     1.407986          0.8219
7.94    8.50     0.063924          0.8078
9.03    9.47    -0.20366           0.4694
9.07   11.45     1.730922          0.4614
9.11   12.14     2.375506          0.4535
9.14   11.50     1.701444          0.4477
9.16   10.65     0.828736          0.4440
9.37   10.64     0.580302          0.4070
10.17   9.78    -1.18802           0.3015
10.18  12.39     1.410628          0.3004
10.22  11.03     0.005212          0.2963
10.22   8.00    -3.02479           0.2963
10.22  11.90     0.875212          0.2963
10.18   8.68    -2.29937           0.3004
10.50   7.25    -4.0927            0.2696
10.23  13.46     2.423858          0.2953
10.03  10.19    -0.61906           0.3167
10.23   9.93    -1.10614           0.2953

a Data available electronically on book website

perform two separate regressions: first, an OLS regression to determine the residuals of the individual data points, and then a WLS regression for final parameter identification. This is often referred to as two-stage estimation.

(c-ii) Replicated Measurements with Different Variance
It could happen, especially with designed experiments involving only one regressor variable, that one obtains replicated measurements on the response variable corresponding to a set of fixed values of the explanatory variables. For example, consider the case when the regressor variable x takes several discrete values. If the physics of the phenomenon cannot provide any theoretical basis for selecting a particular weighting scheme, then this must be determined heuristically from studying the data. If there is an increasing pattern in the heteroscedasticity present in the data, this could be modeled either by a logarithmic transform (as illustrated in Example 5.6.2) or by a suitable variable transformation. Another, more versatile, approach that can be applied to any pattern of the residuals is illustrated next. Each observed residual ε_ij (where the index for discrete x values is i, and the number of observations at each discrete x value is j = 1, 2, . . ., n_i) is made up of two parts, that is, ε_ij = (y_ij - ȳ_i) + (ȳ_i - ŷ_ij). The first part is referred to as pure error, while the second part measures lack of fit. An assessment of heteroscedasticity is based on pure error. Thus, the WLS weight may be estimated as w_i = 1/s_i², where the mean square error is:

s_i² = Σ_j (y_ij - ȳ_i)² / (n_i - 1)   (5.57)
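The pure-error weighting scheme of Eq. 5.57 can be sketched as follows; the replicate levels, the underlying line, and the noise model are made-up assumptions for illustration:

```python
import numpy as np

# Made-up replicated data: 6 replicate y readings at each discrete x level,
# with noise standard deviation growing with x (heteroscedastic)
rng = np.random.default_rng(4)
x_levels = np.array([3.0, 5.4, 7.8, 9.1, 10.2])
data = {xv: 1.1 * xv - 0.5 + rng.normal(0, 0.1 * xv, 6) for xv in x_levels}

# Pure-error mean square per level (Eq. 5.57): s_i^2 = sum (y_ij - ybar_i)^2 / (n_i - 1)
s2 = {xv: np.sum((yv - yv.mean()) ** 2) / (yv.size - 1) for xv, yv in data.items()}

# WLS weights w_i = 1 / s_i^2 for the second-stage weighted regression
weights = {xv: 1.0 / s2[xv] for xv in x_levels}
print(weights)
```

These per-level weights would then be fed into the WLS function of Eq. 5.55, completing the two-stage estimation.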

Alternatively, a model can be fit to the mean values of x and the s_i² values in order to smoothen out the weighting function, and this function used instead. Thus, this approach would also qualify as a two-stage estimation process. The following example illustrates this approach.

Example 5.6.3 (13) Example of two-stage weighted regression for replicate measurements
Consider the x-y data given in Table 5.9, noting that replicate measurements of y are taken at different values of x (which vary slightly).

Step 1: An OLS model is identified from the data:

y = -0.578954 + 1.1354x  with R² = 0.841 and RMSE = 1.4566   (5.58a)

From the summary tables of the regression analysis, one notes that the intercept term in the model is not statistically

13. From Draper and Smith (1981), with permission from John Wiley and Sons.


Fig. 5.27 (a) Data set and OLS regression line of observations with nonconstant variance and replicated observations in x. (b) Residuals of a simple linear OLS model fit (Eq. 5.58a) during step 1. (c) Residuals and regression line of a second-order polynomial OLS fit to the mean x and mean square error (MSE) of the replicate values during step 2 (Eq. 5.58b). (d) Residuals of the weighted regression model identified during step 3 (Eq. 5.58c)

significant (p-value = 0.4 for the t-statistic), while the overall model fit given by the F-ratio is significant. A scatter plot of these data and the simple OLS linear model are shown in Fig. 5.27a. The residuals of a simple linear OLS model shown in Fig. 5.27b reveal, as expected, marked

heteroscedasticity. Hence, the OLS model is bound to lead to misleading uncertainty bands even if the model predictions themselves are not biased. The residuals from the above OLS model are also shown in the 3rd column of Table 5.9.

Step 1: OLS model coefficients

Parameter   Least squares estimate   Standard error   t-statistic   p-value
Intercept   -0.578954                0.679186         0.852423      0.4001
Slope        1.1354                  0.086218         13.169        0.0000

Step 1: OLS analysis of variance

Source          Sum of squares   D.f.   Mean square   F-ratio   p-value
Model           367.948          1      367.948       173.42    0.0000
Residual        70.0157          33     2.12169
Total (Corr.)   437.964          34

Step 2: The residuals of the OLS model are heteroscedastic. One needs to identify a regression model for the OLS model residuals. The data range of the regressor variable can be partitioned into five ranges (these are the discrete values of x for which observations were taken if the first two rows are omitted from the analysis). These five values of x̄ and the corresponding averages of the mean square error s_i² following Eq. 5.58a are shown in the Step 2 table.

x̄       s_i²
3       0.0072
5.39    0.373
7.84    1.6482
9.15    0.8802
10.22   4.1152

The data pattern exhibits a quadratic shape (see Fig. 5.27c), and so a second-order polynomial model is regressed to this data to yield:

s_i² = 1.887 - 0.8727 x̄ + 0.09967 x̄²  with R² = 0.743   (5.58b)

Step 3: The regression weights w_i can thus be deduced by using individual values of x_i instead of x̄ in the above equation. The values of the weights are also shown in Table 5.9 under the 4th column.

Step 4: Finally, a weighted regression is performed following the functional form given by Eq. 5.55 (most statistical packages have this capability) using the data under the 1st, 2nd, and 4th columns:

y = -0.942228 + 1.16252x  with R² = 0.896 and RMSE = 1.2725   (5.58c)

The residual plot is shown as Fig. 5.27d. Though the goodness of fit is only slightly better than that of the OLS model, the real advantage is that this model will have better prediction accuracy and more realistic (unbiased) prediction errors than Eq. 5.58a. ■

(c-iii) Nonpatterned Variance in the Residuals

A third type of nonconstant residual variance is one where no pattern can be discerned with respect to the regressors, which can be discrete or vary continuously. In this case, a practical approach is to look at a plot of the model residuals against the response variable, divide the range of the response variable into as many regions as seem to have different variances, and calculate the standard deviation of the residuals for each of these regions. In that sense, the general approach parallels the one adopted in case (c-ii) when dealing with replicated values with nonconstant variance; however, no model such as Eq. 5.58b is now needed. The general approach would involve the following steps:

• First, fit an OLS model to the data.
• Next, discretize the domain of the regressor variables into a finite number of groups and determine ε_i², from which the weights w_i for each of these groups can be deduced.
• Finally, perform a WLS regression to estimate the efficient model parameters.

Though this two-stage estimation approach is conceptually easy and appealing for simple models, it may become rather complex for multivariate models, and moreover, there is no guarantee that heteroscedasticity will be removed entirely.
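The three steps just listed can be sketched as follows, with made-up heteroscedastic data; the number of groups, the underlying line, and the noise model are illustrative assumptions:

```python
import numpy as np

# Made-up data whose noise standard deviation grows with x (heteroscedastic)
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

# Step 1: ordinary least squares fit and its residuals
b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)

# Step 2: discretize the regressor range into 5 groups and estimate each
# group's residual variance, from which weights w_i = 1/s_i^2 are deduced
edges = np.linspace(x.min(), x.max() + 1e-9, 6)
group = np.digitize(x, edges) - 1
s2 = np.array([np.var(e[group == g], ddof=1) for g in range(5)])
w = 1.0 / s2[group]

# Step 3: weighted least squares with those weights; note that polyfit's w
# multiplies the unsquared residuals, so the square root of the WLS weight is passed
b1w, b0w = np.polyfit(x, y, 1, w=np.sqrt(w))
print(b0w, b1w)
```

As the text cautions, this two-stage recipe is simple for a single regressor but becomes more involved for multivariate models.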

5.6.4 Serially Correlated Residuals

Another manifestation of improper residual behavior is serial correlation. As stated earlier (Sect. 5.6.1), one should distinguish between two different types of autocorrelation, namely pure autocorrelation and model misspecification, although it is often difficult to distinguish between them. The latter is usually addressed using the weight matrix approach (Pindyck and Rubinfeld 1981), which is fairly formal and general, but somewhat demanding. Pure autocorrelation relates to the case of "pseudo" patterned residual behavior, which arises because the regressor variables have strong serial correlation. This serial correlation behavior is subsequently transferred over to the model, and hence to its residuals, even when the regression model functional form is close to "perfect." The remedial approach is to transform the original data set prior to regression itself. There are several techniques for doing so; the widely used Cochrane-Orcutt (CO) procedure is described here. It involves the use of generalized differencing to alter the linear model into one in which the errors are independent. The two-stage first-order CO procedure involves:


(i) Fitting an OLS model to the original variables.
(ii) Computing the first-order serial correlation coefficient r of the model residuals (Eq. 3.12).
(iii) Transforming the original variables y and x into a new set of pseudo-variables:

yt* = yt − r·yt−1 and xt* = xt − r·xt−1    (5.59)

(iv) Performing OLS regression on the pseudo-variables y* and x* to re-estimate the parameters (b0* and b1*) of the model.
(v) Finally, obtaining the fitted regression model in the original variables by a back-transformation of the pseudo regression coefficients:

b0 = b0*/(1 − r) and b1 = b1*    (5.60)

Though two estimation steps are involved, the entire process is simple to implement. This approach, when originally proposed, advocated that the process be continued till the residuals become random (say, based on the Durbin-Watson test). However, the current recommendation is that alternative estimation methods should be attempted if one iteration proves inadequate. This approach can be used during model parameter estimation of MLR models provided only one of the regressor variables is the cause of the pseudo-correlation. Also, a more sophisticated version of the CO procedure has been suggested by Hildreth and Lu (Chatterjee and Price 1991) involving only one estimation process where the optimal value of r is determined along with the parameters. This, however, requires nonlinear estimation methods.

Example 5.6.4 Using the Cochrane-Orcutt (CO) procedure to remove first-order autocorrelation

Consider the case when observed pre-retrofit data of either cooling or heating energy consumption in a commercial building support a linear regression model as follows:

Ei = b0 + b1 Ti    (5.61)

where
Ti = daily average outdoor dry-bulb temperature
Ei = daily total energy use predicted by the model
i = subscript representing a particular day
b0 and b1 are the least-squares regression coefficients

How the above transformation yields a regression model different from OLS estimation is illustrated in Fig. 5.28 with year-long daily cooling energy use from a large institutional building in central Texas. The first-order autocorrelation coefficients of cooling energy and average daily temperature were both equal to 0.92, while that of the OLS residuals was 0.60. The Durbin-Watson statistic for the OLS residuals (i.e., untransformed data) was DW = 3 indicating strong residual autocorrelation, while that of the CO transform was 1.89 indicating little or no autocorrelation. Note that the CO transform is inadequate in cases of model misspecification and/or seasonal operational changes. ■

Fig. 5.28 How serial correlation in the residuals affects model identification (Example 5.6.4)
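The five CO steps can be sketched as follows. This is a minimal illustration on synthetic data with AR(1) errors (the series, seed, and rho value are invented for demonstration), and the lag-1 autocorrelation is computed via the sample correlation of lagged residuals, a close stand-in for Eq. 3.12.

```python
# Cochrane-Orcutt first-order transform, a minimal numpy sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 300
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):                       # AR(1) errors with rho = 0.8
    e[t] = 0.8 * e[t - 1] + rng.normal(scale=0.5)
x = np.linspace(0, 30, n)
y = 10.0 + 2.0 * x + e

def ols(x, y):
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

# Steps (i)-(ii): OLS fit, then lag-1 autocorrelation r of the residuals
b_raw, resid = ols(x, y)
r = np.corrcoef(resid[1:], resid[:-1])[0, 1]

# Step (iii): generalized differencing into pseudo-variables (Eq. 5.59)
y_star = y[1:] - r * y[:-1]
x_star = x[1:] - r * x[:-1]

# Step (iv): OLS on the pseudo-variables
b_star, _ = ols(x_star, y_star)

# Step (v): back-transform the coefficients (Eq. 5.60)
b0, b1 = b_star[0] / (1.0 - r), b_star[1]
print("r =", r, "b0 =", b0, "b1 =", b1)
```

The slope estimate is essentially unchanged, but its standard error (not shown) is more honestly estimated once the serial correlation has been differenced out.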

5.6.5

Dealing with Misspecified Models

An important source of error during model identification is model misspecification error. This is unrelated to measurement error and arises when the functional form of the model is not appropriate. This can occur due to:

(i) Inclusion of irrelevant variables: Does not bias the estimation of the intercept and slope parameters, but generally reduces the efficiency of the slope parameters, that is, their variance will be larger. This source of error can be eliminated by, say, stepwise regression or simple tests such as t-tests.
(ii) Exclusion of an important variable: Will result in the slope parameters being both biased and inconsistent.
(iii) Assumption of linearity: When a linear model is erroneously assumed for behavior that is actually nonlinear.
(iv) Incorrect model order: When one assumes a lower- or higher-order model than what the data warrants.


Fig. 5.29 Improvement in residual behavior for a model of hourly energy use of a variable air volume HVAC system in a commercial building as influential regressors are incrementally added to the model. (From Katipamula et al. 1998)

The latter three sources of error are very likely to manifest themselves in improper residual behavior (the residuals will show serial correlation or non-constant variance behavior). The residual analysis may not identify the exact cause, and several attempts at model reformulation may be required to overcome this problem. Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, experimental or identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model. This could cause model misspecification.

Example 5.6.5 Example to illustrate how the inclusion of additional regressors can remedy improper model residual behavior

Energy use in commercial buildings accounts for about 19% of the total primary energy use in the United States, and consequently, it is a prime area of energy conservation efforts. For this purpose, the development of baseline models, that is, models of energy use for a specific end-use before energy conservation measures are implemented, is an important modeling activity for monitoring and verification studies. Let us illustrate the effect of improper selection of regressor variables or model misspecification for modeling measured thermal cooling energy use of a large commercial building operating 24 hours a day under a variable air volume HVAC system (Katipamula et al. 1998).

Figure 5.29 illustrates the residual pattern when hourly energy use is modeled with only the outdoor dry-bulb temperature (To). The residual pattern is blatantly poor, exhibiting both non-constant variance as well as systematic bias in the low range of the x-variable. Once the outdoor dew point temperature T+dp (see Footnote 14), the global horizontal solar radiation (qsol), and the internal building heat loads qi (such as lights and equipment) are introduced in the model, the residual behavior improves significantly, but the lower tail is still improper. Finally, when additional terms involving indicator variables I applied to both the intercept and To are introduced (described in Sect. 5.7.2), an acceptable residual behavior is achieved. ■

5.7

Other Useful OLS Regression Models

5.7.1

Zero-Intercept Models

Sometimes the physics of the system dictates that the regression line pass through the origin. For the linear case, the model assumes the form:

y = β1 x + ε    (5.62)

The interpretation of R2 under such a case is not the same as for the model with an intercept, and this statistic cannot be used to compare the two types of models directly. Recall that for linear models the R2 value indicates the percentage variation of the response variable about its mean explained by that of the regressor variable. For the no-intercept case, the R2 value relates to the percentage variation of the response variable about the origin explained by the regressor variable. Thus, when comparing both models, one should decide on which is the better model based on their RMSE values.
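The RMSE-based comparison just described can be illustrated numerically. The data below are made up for demonstration; the zero-intercept slope has the closed form b1 = Σxy/Σx², while the intercept model is fit by ordinary least squares.

```python
# Comparing a zero-intercept fit with an ordinary linear fit using RMSE,
# as suggested in the text; the data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])   # roughly y = 2x

# Zero-intercept OLS: b1 = sum(x*y) / sum(x^2)
b1_zero = np.sum(x * y) / np.sum(x * x)
rmse_zero = np.sqrt(np.mean((y - b1_zero * x) ** 2))

# Model with an intercept
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse_int = np.sqrt(np.mean((y - X @ b) ** 2))

print(b1_zero, rmse_zero, rmse_int)
```

Note that, for simplicity, the RMSE here divides by n rather than by the degrees of freedom (n − p); the relative comparison between the two models is what matters for model selection.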

5.7.2

Indicator Variables for Local Piecewise Models: Linear Splines

Spline functions are an important class of functions, described in numerical analysis textbooks in the framework of interpolation, which allow distinct functions to be used over different ranges while maintaining continuity in the overall function. They are extremely flexible functions in that they allow a wide range of locally different behavior to be captured within one elegant functional framework. In addition to interpolation, splines have been used in a regression context as well as for data smoothing (discussed below and in Sect. 9.6).

Fig. 5.30 Piece-wise linear model or first-order spline fit with hinge point at xc. Such models are referred to as change point models in building energy modeling terminology

Thus, a globally nonlinear function can be decomposed into simpler local patterns. Two common cases are discussed below.

(a) The simpler case is one where it is known which points lie on which trend, that is, when the physics of the system is such that the location of the structural break or "hinge point" xc of the regressor is known. One could represent the two regions by a piece-wise linear spline (as shown in Fig. 5.30); otherwise, the third-degree polynomial spline is often used to capture highly nonlinear trends (see Sect. 9.6.4). The objective here is to formulate a linear model and identify its parameters that best describe the data points in Fig. 5.30. One cannot simply divide the data into two regions and fit each region with a separate linear model since the two segments are unlikely to intersect at exactly the hinge point (a constraint that the model be continuous at the hinge point would be violated). A model of the following form would be acceptable:

y = b0 + b1 x + b2 (x − xc) I    (5.63a)

where the indicator (also called dummy or binary) variable

I = 1 if x > xc
  = 0 otherwise    (5.63b)

Hence, for the region x ≤ xc:

y = b0 + b1 x    (5.64)

and for the region x > xc:

y = (b0 − b2 xc) + (b1 + b2) x

Thus, the slope of the model is b1 before the break and (b1 + b2) afterward. The intercept term changes as well from

14 Actually, the outdoor humidity impacts energy use only when the dew point temperature Tdp exceeds a certain threshold which many studies have identified to be about 55°F (this is related to how the HVAC cooling coil is controlled to meet indoor occupant comfort). This conditional variable, indicated by a + superscript, is equal to (Tdp − 55) when the term is positive, and zero otherwise.


b0 before the break to (b0 − b2 xc) after the break. The logical extensions to linear spline models with two structural breaks or to higher-order splines involving quadratic and cubic terms are fairly straightforward and treated further in Sect. 9.6.4.

(b) The second case arises when the change point is not known. A simple approach is to look at the data, identify a "ball-park" range for the change point, perform numerous regression fits with the data set divided according to each possible value of the change point in this ball-park range, and pick that value which yields the best overall R2 or RMSE. Alternatively, the more accurate but more complex approach is to cast the problem as a nonlinear estimation problem with the change point variable as one of the parameters.

Example 5.7.1 Change point models for building utility bill analysis

The theoretical basis of modeling monthly energy use in buildings is discussed in several papers (e.g., Reddy et al. 1997, 2016). The interest in this particular time scale is obvious: such information is easily obtained from utility bills, which are usually available on a monthly time scale. The models suitable for this application are similar to linear spline models and are referred to as change point models by building energy analysts. A simple example is shown below to illustrate the above equations.

Electricity utility bills of a residence in Houston, TX have been normalized by the number of days in the month and assembled in Table 5.10 along with the corresponding month and monthly mean outdoor temperature values for Houston (the first three columns of the table). The intent is to use Eq. 5.63a to model this behavior. The scatter plot and the trend lines drawn in Fig. 5.31 suggest that the change point is in the range 17–19 °C. Let us


perform the calculation assuming a value of 17 °C. Defining an indicator variable:

I = 1 if x > 17 °C
  = 0 otherwise

Based on this assumption, the last two columns of the table have been generated to correspond to the two regressor variables in Eq. 5.63a. A linear multiple regression yields:

y = 0.1046 + 0.005904 x + 0.00905 (x − 17) I
with R2 = 0.996 and RMSE = 0.0055

ð5:65Þ

with all three parameters being statistically significant. The reader can repeat this analysis assuming a different value for the change point (say xc = 18 °C) in order to study the

Fig. 5.31 Piece-wise linear regression lines for building electric use with outdoor temperature. The change point is the point of intersection of the two lines. The combined model is called a change point model, which in this case, is a four-parameter model given by Eq. 5.65

Table 5.10 Measured monthly energy use data and calculation step for deducing the change point independent variable assuming a base value of 17 °C. Data for Example 5.7.1a

Month   Mean outdoor temperature (°C)   Monthly mean daily electric use (kWh/m2/day)   x (°C)   (x − 17 °C)I (°C)
Jan     11                              0.1669                                         11       0
Feb     13                              0.1866                                         13       0
Mar     16                              0.1988                                         16       0
Apr     21                              0.2575                                         21       4
May     24                              0.3152                                         24       7
Jun     27                              0.3518                                         27       10
Jul     29                              0.3898                                         29       12
Aug     29                              0.3872                                         29       12
Sept    26                              0.3315                                         26       9
Oct     22                              0.2789                                         22       5
Nov     16                              0.2051                                         16       0
Dec     13                              0.1790                                         13       0

a Data available electronically on book website


sensitivity of the model to the choice of the change point value. Though only three parameters are determined by regression, this is an example of a four-parameter (or 4-P) model in building science terminology. The fourth parameter is the change point xc, which also needs to be selected/determined. Specialized software programs have been developed to determine the optimal value of xc (i.e., that which results in the minimum RMSE among the different possible choices of xc) following a numerical search process akin to the one described in this example. ■
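The ball-park numerical search for the change point can be sketched as follows, using the Table 5.10 data. The candidate grid for xc is an arbitrary choice, and for simplicity the RMSE here divides by n rather than by the degrees of freedom, so it will differ slightly from the value reported in Eq. 5.65.

```python
# Four-parameter change point model of Example 5.7.1, fitted with a grid
# search over the change point xc (data from Table 5.10).
import numpy as np

x = np.array([11, 13, 16, 21, 24, 27, 29, 29, 26, 22, 16, 13], float)
y = np.array([0.1669, 0.1866, 0.1988, 0.2575, 0.3152, 0.3518,
              0.3898, 0.3872, 0.3315, 0.2789, 0.2051, 0.1790])

def fit(xc):
    """OLS fit of y = b0 + b1*x + b2*(x - xc)*I for a given change point xc."""
    I = (x > xc).astype(float)
    X = np.column_stack([np.ones_like(x), x, (x - xc) * I])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((y - X @ b) ** 2))
    return b, rmse

# Search a ball-park range for xc and keep the minimum-RMSE fit
candidates = np.arange(15.0, 20.5, 0.5)
best_xc = min(candidates, key=lambda xc: fit(xc)[1])
b, rmse = fit(best_xc)
print("xc =", best_xc, "coefficients =", b, "rmse =", rmse)
```

With xc fixed at 17 °C, the three regression coefficients should closely reproduce those of Eq. 5.65.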

5.7.3

Indicator Variables for Categorical Regressor Models

The use of indicator (also called dummy) variables has been illustrated in the previous section when dealing with spline models. Indicator variables are also used in cases when shifts in either the intercept or the slope are to be modeled, with the condition of continuity now being relaxed. Most variables encountered in mechanistic models are quantitative and continuous, that is, the variables are measured on a numerical scale. Variables which cannot be controlled are often called "covariates." Some examples are temperature, pressure, distance, energy use, and age. Occasionally, the analyst comes across models involving qualitative or categorical variables, that is, regressor data that belong in one of two (or more) possible categories. One would like to evaluate whether differences in intercept and slope between categories are significant enough to warrant two separate models or not. This concept is illustrated by the following example.

Whether the annual energy use of a regular commercial building is markedly higher than that of another certified as being energy efficient is to be determined. Data from several buildings which fall in each group are gathered to ascertain whether the presumption is supported by the actual data. Covariates that affect the normalized energy use (variable y) of both experimental groups are conditioned floor area (variable x1) and outdoor temperature (variable x2). Suppose that a linear relationship can be assumed with the same intercept for both groups. One approach would be to separate the data into two groups, one for regular buildings and one for efficient buildings, and develop regression models for each group separately. Subsequently, one could perform a t-test to determine whether the slope terms of the two models are significantly different or not. However, the assumption of a constant intercept term for both models may be erroneous, and this may confound the analysis.

A better approach is to use the entire data and adopt a modeling approach involving indicator variables.

Let model 1 be for regular buildings:

y = a1 + b1 x1 + c1 x2    (5.66a)

and model 2 be for energy-efficient buildings:

y = a2 + b2 x1 + c2 x2    (5.66b)

The complete model (or model 3) would be formulated as:

y = a1 + b1 x1 + c1 x2 + I (a2 + b2 x1 + c2 x2)    (5.67a)

where I is an indicator variable such that

I = 1 for energy-efficient buildings
  = 0 for regular buildings    (5.67b)
Note that a basic assumption in formulating this model is that all three model parameters are affected by the building group. Formally, one would like to test the null hypothesis H0: a2 = b2 = c2 = 0. The hypothesis is tested by constructing an F-statistic for the comparison of the two models. Note that model 3 is referred to as the full model (FM) or pooled model. Model 1, when the null hypothesis holds, is the reduced model (RM). The idea is to compare the goodness-of-fit of the FM and that of the RM using both data sets combined. If the RM provides as good a fit as the FM, then the null hypothesis is valid. Let SSE(FM) and SSE(RM) be the corresponding model sum of squared errors or squared model residuals. Then, the following F-test statistic is defined:

F = {[SSE(RM) − SSE(FM)]/(p − m)} / {SSE(FM)/(n − p)}    (5.68)

where n is the number of data points, p is the number of parameters of the FM, and m is the number of parameters of the RM. If the observed F-value is larger than the tabulated value of F with (p − m) and (n − p) degrees of freedom at the prespecified significance level (provided by Table A.6), the RM is unsatisfactory, and the full model has to be retained. As a cautionary note, this test is strictly valid only if the OLS assumptions for the model residuals hold.

Example 5.7.2 Combined modeling of energy use in regular and energy-efficient buildings

Consider the data assembled in Table 5.11. Let us designate the regular buildings by group (A) and the energy-efficient buildings by group (B), with the problem simplified by assuming both types of buildings to be located in the same geographic location. Hence, the model has only one regressor variable involving floor area. The complete model with the indicator variable term given by Eq. 5.67a is used to verify whether group B buildings consume less energy than group A buildings.


Table 5.11 Data table for Example 5.7.2a

Group A (regular)                  Group B (energy-efficient)
Energy use (y)   Floor area (x1)   Energy use (y)   Floor area (x1)
45.44            225               32.13            224
42.03            200               35.47            251
50.1             250               33.49            232
48.75            245               32.29            216
47.92            235               33.5             224
47.79            237               31.23            212
52.26            265               37.52            248
50.52            259               37.13            260
45.58            221               34.7             243
44.78            218               33.92            238

a Data available electronically on book website

The full model (FM) given by Eq. 5.67a reduces to the following form since only one regressor is involved: y = a + b1 x1 + b2 I x1, where the variable I is an indicator variable such that it is 0 for group A and 1 for group B. The null hypothesis is H0: b2 = 0. The reduced model (RM) is y = a + b x1. It is identified using the entire data set without distinguishing between the building types.

The estimated FM: y = 14.2762 + 0.14115 x1 − 13.2802 (I · x1)
while the RM: y = 5.7768 + 0.1491 x1

ð5:69Þ

The analysis of variance results in SSE(FM) = 7.7943 and SSE(RM) = 889.245. The F-statistic in this case is:

F = [(889.245 − 7.7943)/1] / [7.7943/(20 − 3)] = 1922.5

One can thus safely reject the null hypothesis, and state with confidence that buildings built as energy-efficient ones consume energy which is statistically lower than those of regular buildings. ■
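The F-test of Eq. 5.68 for this example can be reproduced numerically with the Table 5.11 data. A sketch is given below; the helper function name is illustrative.

```python
# F-test of Eq. 5.68 comparing the full model (FM) with the indicator term
# against the reduced model (RM), using the data of Table 5.11.
import numpy as np

yA = [45.44, 42.03, 50.1, 48.75, 47.92, 47.79, 52.26, 50.52, 45.58, 44.78]
xA = [225, 200, 250, 245, 235, 237, 265, 259, 221, 218]
yB = [32.13, 35.47, 33.49, 32.29, 33.5, 31.23, 37.52, 37.13, 34.7, 33.92]
xB = [224, 251, 232, 216, 224, 212, 248, 260, 243, 238]

y = np.array(yA + yB)
x1 = np.array(xA + xB, float)
I = np.array([0.0] * 10 + [1.0] * 10)       # 0 = group A, 1 = group B

def sse(X, y):
    """Sum of squared errors of an OLS fit of y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

ones = np.ones_like(x1)
sse_fm = sse(np.column_stack([ones, x1, I * x1]), y)   # p = 3 parameters
sse_rm = sse(np.column_stack([ones, x1]), y)           # m = 2 parameters

n, p, m = len(y), 3, 2
F = ((sse_rm - sse_fm) / (p - m)) / (sse_fm / (n - p))
print("SSE(FM) =", sse_fm, "SSE(RM) =", sse_rm, "F =", F)
```

The very large F-value relative to the tabulated critical value with (p − m) = 1 and (n − p) = 17 degrees of freedom confirms that the null hypothesis must be rejected.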

5.8

Resampling Methods Applied to Regression

5.8.1

Basic Approach

The fundamental tasks in regression involve model selection, that is, identifying a suitable model and estimating the values and the uncertainty intervals of the model parameters, and model assessment, that is, deducing the predictive accuracy of the model for subsequent use. The classical OLS equations presented in Sects. 5.3 and 5.4 can be used for this purpose in conjunction with model residual analysis along with the

Durbin-Watson statistic (Sect. 5.6.1) to guard against the dangers of model under-fitting and over-fitting. However, it should be recognized that the data set used is simply a sample of a much larger set of population data characterizing the behavior of the stochastic system under study. In that sense, the model selection and evaluation results are somewhat limited because they do not make full use of the variability inherent in samples drawn from a population. These limitations can be overcome by resampling methods (see Sect. 4.8), which offer distinct advantages in terms of better accuracy, robustness, versatility, and intuitive appeal in the context of regression modeling as well. They are widely regarded as being able to better perform the fundamental tasks involved in regression, and as a result are becoming increasingly popular.

Recall from Sect. 4.8.3 that the basic rationale behind resampling methods is to draw one single sample or experimental data set, treat this original sample as a surrogate for the population, and generate numerous sub-samples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data generating mechanism, to produce new samples from which the required estimates can be deduced numerically. Note, however, that the resampling methods cannot overcome some of the limitations inherent in the original sample. The resampling samples are to the sample what the sample is to the population. Hence, if the sample does not adequately cover the spatial range or if the sample is not truly random, then the resampling results will be inaccurate as well.

5.8.2

Jackknife and k-Fold Cross-Validation

In most practical situations, it is misleading to use the entire data set to identify the regression model and report the resulting RMSE as the predictive accuracy of the model. This is due to the possibility of model overfitting and associated under-estimation of the predictive RMSE error.


Overfitting refers to the situation in which the regression model is able to capture the training set with high accuracy but is poor at predicting new data. In other words, the model is over-trained on features specific to the training set, which may differ for new data. A better but somewhat empirical approach is to randomly partition the data set into two samples (say, in the proportion of 80/20), use the 80% portion of the data to train or develop the model, calculate the internal predictive error (say, the RMSE following Eq. 5.8), use the 20% portion of the data as the validation data set to predict or test the y values using the already identified model, and finally calculate the test or external or simulation error magnitude. The competing models can then be compared, and a selection made based on both the internal and external predictive errors pertinent to the training and testing data sets respectively. The test error indices will generally be greater than the training errors; larger discrepancies are suggestive of greater over-fitting, and vice versa. This general approach is the basis of the two most common resampling methods discussed below.

The jackknife method and its more recent version, the cross-validation method, were described in Sect. 4.8.3. The latter method of model evaluation, also referred to as holdout sample validation, can avoid model over-fitting. With the advent of powerful computers, a variant, namely the "k-fold cross-validation" method, has become popular. It involves:

(i) Dividing the random sample of n observations into k groups of equal size.
(ii) Omitting one group at a time and performing the regression with the other (k − 1) groups.
(iii) Determining and saving the internal or modeling errors as well as the external or simulation or predictive errors (say, in terms of the RMSE values) of both sets of sub-samples.
(iv) Using the saved parameter values to deduce the mean and uncertainty intervals for the model parameters, and computing the mean and confidence levels of the modeling and simulation prediction errors.

The model parameters and the test values determined in the last step are likely to be less biased, much more robust, and more representative of the actual model behavior than using classical methods.
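Steps (i)-(iv) can be sketched as follows. This is a minimal illustration on synthetic data (the model, sample size, seed, and fold count are invented for demonstration), reporting the fold-averaged parameters and the external predictive RMSE.

```python
# Minimal k-fold cross-validation for a simple OLS model; numpy only.
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 5
x = rng.uniform(0, 10, n)
y = 1.0 + 3.0 * x + rng.normal(scale=1.0, size=n)

idx = rng.permutation(n)                   # randomize before splitting
folds = np.array_split(idx, k)             # step (i): k groups of equal size

params, test_rmse = [], []
for j in range(k):                         # step (ii): omit one group at a time
    test = folds[j]
    train = np.concatenate([folds[i] for i in range(k) if i != j])
    X_tr = np.column_stack([np.ones(train.size), x[train]])
    b, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)
    params.append(b)
    # step (iii): external/predictive error on the omitted group
    pred = b[0] + b[1] * x[test]
    test_rmse.append(np.sqrt(np.mean((y[test] - pred) ** 2)))

# step (iv): pool the saved parameter values and prediction errors
b_mean = np.mean(params, axis=0)
cv_rmse = float(np.mean(test_rmse))
print("mean parameters:", b_mean, "cross-validated RMSE:", cv_rmse)
```

The spread of the k parameter vectors also gives a direct numerical sense of their sampling variability, which classical formulas only provide under stricter assumptions.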


Note, however, that though the same equations are used to compute the RMSE indices, the degrees of freedom (d.f.) are different. Let n be the total number of data points. Then d.f. = [(k − 1)(n − p)]/k while computing the internal errors for model building or selection, and d.f. = (n/k) while computing the RMSE of the external predictive errors. There is a trade-off between high bias error and high variance in the choice of the number k. It is recommended that k be selected as either k = 5 or k = 10 (James et al. 2013), colloquially referred to as the "magic" numbers of folds.

Example 5.8.1 k-fold cross-validation

Consider Example 5.3.1, which involved fitting the simple OLS regression with 33 data observations of solids reduction (x-variable) and oxygen demand (y-variable). The regression analysis will be redone to illustrate the insights provided by k-fold cross-validation; k = 3 has been assumed in this simplified illustration. The data set has been first randomized to remove the monotonic increase in the regressor variable and broken up into 3 sub-sets of 11 observations each. Three data samples with different combinations of two of the sub-sets (i.e., 22 data points) can now be created. The analysis is performed with the 22-point sets for training, that is, to identify an OLS regression model, with the remaining sub-set of 11 data points used for testing, that is, to compute the simulation or external prediction error. The results are summarized in Table 5.12.

One notes that the model parameters vary from one run to another, indicating their random variability as different samples of data are selected. If the k-fold analysis were done with a greater number of folds (say, k = 5), one could have deduced the variance of these estimates, which would most likely be less biased and more accurate than those determined from classical methods.

The final model determined from k-fold regression analysis is the same as the one using all data points, while the estimate of the predictive error (as indicated by the test RMSE = 3.447) is the average of the RMSE values from the three samples. Thus, the extra effort involved in creating k-fold samples, identifying regression models, and calculating the prediction errors was simply to get a better estimate of the prediction RMSE error. It is a more

Table 5.12 Summary of the OLS regression analysis results using threefold cross-validation using data from Example 5.3.1

Data                       OLS model              R2      Training RMSE   Test RMSE
All-data (Example 5.3.1)   y = 3.830 + 0.9036 x   0.913   3.229           –
Threefold sample 1         y = 1.797 + 0.9734 x   0.933   3.176           3.628
Threefold sample 2         y = 3.962 + 0.8941 x   0.949   3.442           2.826
Threefold sample 3         y = 6.079 + 0.8339 x   0.911   2.980           3.886
Final model                y = 3.830 + 0.9036 x   0.913   3.229           3.447


representative value which, as stated earlier, is usually greater than the training or internal RMSE. ■

5.8.3

Bootstrap Method

Recall that in Sect. 4.8.3, the use of the bootstrap method (one of the most powerful and popular methods currently in use) was illustrated to infer variance and confidence intervals of parametric statistical measures in a univariate context, and also in a situation involving a nonparametric approach, where the correlation coefficient between two variables was to be deduced. Bootstrap is a statistical method where random resampling with replacement is done repeatedly from an original or initial sample, and then each bootstrapped sample is used to compute a statistic (such as the mean, median, or the interquartile range). The resulting empirical distribution of the statistic is then examined and interpreted as an approximation to the true sampling distribution. Thus, bootstrapping is an ensemble training method, which "bags" the results from numerous data sets into an ensemble average. It is often used as a robust nonparametric alternative to inference-type problems when parametric assumptions are in doubt (e.g., knowledge of the probability distribution of the errors), or where parametric inference is impossible or requires very complicated formulas for the calculation of variance.

Say one has a data set of multivariate observations zi = {yi, x1i, x2i, . . .} with i = 1, . . ., n (this can be viewed as a sample with n observations in the bootstrap context taken from a population of possible observations). One distinguishes between two approaches:

(i) Case resampling, where the predictor and response observations i are random and change from sample to sample. One selects a certain number of bootstrap sub-samples (say, 1000) from zi, fits the model, and saves the model coefficients from each bootstrap sample. The generation of the confidence intervals for the regression coefficients is now similar to the univariate situation and is quite straightforward (Sects. 5.3.3 and 5.3.4). One of the benefits is that the correlation structure between the regressors is maintained.

(ii) Model-based resampling or fixed-X resampling, where the regressor data structure is already imposed or known with confidence. Here, the basic idea is to generate or resample the model residuals and not the observations themselves. This preserves the stochastic nature of the model structure, and so the variance is better representative of the model's own assumptions. The implementation involves attaching a random error to each yi, thereby producing a fixed-X bootstrap sample. The errors could be generated: (i) parametrically from a normal distribution with zero mean and variance equal to the estimated error variance in the regression if normal errors can be assumed (this is analogous to the concept behind the Monte Carlo approach), or (ii) nonparametrically, by resampling residuals from the original regression. One would then regress the bootstrapped values of the response variable on the fixed X matrix to obtain bootstrap replications of the regression coefficients. This approach is often adopted with data from designed experiments.

The reader can refer to Efron and Tibshirani (1985), Davison and Hinkley (1997) and other more advanced papers such as Freedman and Peters (1984) for a more complete treatment.
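Case resampling, approach (i) above, can be sketched as follows. The data set, seed, and number of replications are illustrative; the percentile interval shown is the simplest of several ways to form bootstrap confidence limits.

```python
# Case-resampling bootstrap for the slope of a simple OLS model; synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

B = 1000
boot = np.empty(B)
for j in range(B):
    i = rng.integers(0, n, n)              # resample cases with replacement
    boot[j] = slope(x[i], y[i])

# Percentile confidence interval for the slope from the bootstrap distribution
lo, hi = np.percentile(boot, [2.5, 97.5])
print("slope =", slope(x, y), "95% CI = (", lo, ",", hi, ")")
```

Because whole cases (yi, xi) are resampled together, the correlation structure among the regressors is preserved, which is the benefit noted under approach (i).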

5.9

Case Study Example: Effect of Refrigerant Additive on Chiller Performance (see Footnote 15)

The objective of this analysis is to verify the claim of a company which had developed a refrigerant additive to improve chiller COP. The performance of a chiller before (called the pre-retrofit period) and after (called the post-retrofit period) addition of this additive was monitored for several months to determine whether the additive results in an improvement in chiller performance, and if so, by how much. The same four variables described in Example 5.4.3, namely two temperatures (Tcho and Tcdi), the chiller thermal cooling load (Qch), and the electrical power consumed (Pcomp), were measured in intervals of 15 min. Note that the chiller COP can be deduced from the last two variables. Altogether, there were 4607 and 5078 data points for the pre- and post-periods respectively.

Step 1: Perform Exploratory Data Analysis

At the outset, an exploratory data analysis should be performed to determine the spread of the variables and their occurrence frequencies during the pre- and post-periods, that is, before and after the addition of the refrigerant additive. Further, it is important to ascertain whether the operating conditions during both periods are similar or not. The eight frames in Fig. 5.32 summarize the spread and frequency of the important variables. The chiller outlet water temperature ranges are similar during both periods. However, the condenser water temperature and the chiller cooling load show much larger variability during the post-period. Finally, the cooling load and power consumed are noticeably lower during the post-period. The histogram of Fig. 5.33 suggests

15 The monitored data were provided by Ken Gillespie, for which we are grateful.


Fig. 5.32 Histograms depicting the range of variation and frequency of the four important variables (Tcho, Tcdi, Qch, and Pcomp) before and after the retrofit (pre = 4607 data points, post = 5078 data points). The condenser water temperature and the chiller cooling load show much larger variability during the post-period. The cooling load and power consumed are noticeably lower during the post-period

212

5

that COPpost > COPpre. An ANOVA test, with results shown in Table 5.13 and Fig. 5.34, also indicates that the mean post-retrofit power use is statistically different from the pre-retrofit power use at the 95% CL.

Fig. 5.33 Histogram plots of the coefficient of performance (COP) of the chiller before and after the retrofit. Clearly, there are several instances when COPpost > COPpre, but that could be due to operating conditions. Hence, a regression modeling approach is clearly warranted.

t-test to compare means:
Null hypothesis: mean(COPpost) = mean(COPpre)
Alternative hypothesis: mean(COPpost) ≠ mean(COPpre)
Assuming equal variances: t = 38.8828, p-value = 0.0

The null hypothesis is rejected at α = 0.05. Of particular interest is the CI for the difference between the means, which extends from 0.678 to 0.750. Since this interval does not contain the value 0.0, there is a statistically significant difference between the means of the two samples at the 95.0% CL. However, it would be incorrect to infer that COPpost > COPpre, since the operating conditions are different, and one should not use the t-test alone to draw conclusions. Hence, a regression model-based approach is warranted.

Step 2: Use the Entire Pre-retrofit Data to Identify a Model
The GN chiller models (Gordon and Ng 2000) are described in Sect. 10.2.3. The monitored data are first used to compute the transformed variables y, x1, x2, and x3 (temperatures in Kelvin, cooling load and power in kW) of the model given by Eq. 10.14b. Then, a linear regression is performed using Eq. 10.14a; the fitted model, with standard errors of the coefficients shown within parentheses, is:

y = -0.00187 x1 + 261.2885 x2 + 0.022461 x3        (5.70)
    (0.00163)     (15.925)      (0.000111)

with adjusted R² = 0.998.

This model is then re-transformed into a model for power using Eq. 10.15, and the error statistics using the pre-retrofit data are found to be RMSE = 9.36 kW and CV = 2.24%. Figure 5.35 shows the x-y plot from which one can visually evaluate the goodness of fit of the model. Note that the mean power use is 418.7 kW, while the mean of the model residuals is 0.017 kW (very close to zero, as it should be; this step validates that the spreadsheet cells have been coded with the right formulas).

Step 3: Calculate Savings in Electrical Power
The above chiller model, representative of the thermal performance of the chiller without the refrigerant additive, is used to estimate savings by first predicting the power consumption for each 15-min interval using the two operating temperatures and the load corresponding to the 5078 post-retrofit data points. Subsequently, savings in chiller power are deduced for each of the 5078 data points:

Power savings = Model-predicted pre-retrofit use - Measured post-retrofit use        (5.71)
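The model identification of Step 2 (OLS without an intercept, with coefficient standard errors from s²(XᵀX)⁻¹) can be sketched as follows. The data below are synthetic stand-ins, not the monitored chiller data; the coefficient magnitudes are only loosely patterned on the fitted model reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the transformed GN-model variables;
# the actual monitored chiller data are not reproduced here
n = 200
X = rng.uniform(1.0, 2.0, size=(n, 3))
true_beta = np.array([-0.002, 260.0, 0.02])
y = X @ true_beta + rng.normal(0.0, 0.5, size=n)

# OLS fit without an intercept, as in the three-regressor model above
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient standard errors from s^2 * (X'X)^-1
resid = y - X @ beta
dof = n - X.shape[1]
s2 = resid @ resid / dof
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

print("estimates :", beta)
print("std errors:", se)
```

With real data, the residuals from this fit would then feed the re-transformation and savings steps described in the text.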

It is found that the mean power savings are -21.0 kW (i.e., an increase in power use) against a measured mean power use of 287.5 kW. Overlooking the few outliers, one can detect two distinct patterns in the x-y plot of Fig. 5.36: (i) the lower range of the data (chiller power below about 300 kW), where the differences between model predictions and post-retrofit measurements are minor (or nil), and (ii) the higher range of the data, for which the post-retrofit electricity power usage was

Table 5.13 Results of the ANOVA test of comparison of means at a significance level of 0.05
95.0% CI for mean of COPpost: 8.573 ± 0.03142 = [8.542, 8.605]
95.0% CI for mean of COPpre: 7.859 ± 0.01512 = [7.844, 7.874]
95.0% CI for the difference between the means assuming equal variances: 0.714 ± 0.03599 = [0.678, 0.750]

higher than that of the model identified from pre-retrofit data. This is the cause of the negative power savings determined above. The reason for the onset of two distinct patterns in operation is worthy of a subsequent investigation.

Step 4: Calculate Uncertainty in Savings and Draw Conclusions
The uncertainty arises from two sources: prediction model errors and power measurement errors. The latter are usually small, about 0.1% of the reading, which in this particular case is less than 1 kW; hence, this contribution can be neglected during an initial investigation such as this one. The model uncertainty is given by:

Absolute uncertainty in power use savings or reduction = (t-value × RMSE)        (5.72)

The t-value at 90% CL is 1.65 and the RMSE of the model (for the pre-retrofit period) is 9.36 kW. Hence, the calculated power savings due to the refrigerant additive are -21.0 kW ± 15.44 kW (i.e., an increase in power use) at the 90% CL. Thus, one would conclude that the refrigerant additive is actually penalizing chiller performance by 7.88%, since electric power use has increased.

Fig. 5.34 ANOVA test results in the form of box-and-whisker plots for chiller COP before and after addition of the refrigerant additive.
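The arithmetic of Step 4 (Eq. 5.72) is simple enough to verify directly. Note the relative penalty is computed here against the model-predicted pre-retrofit use, which appears to be how the 7.88% quoted above is obtained; that interpretation is ours.

```python
# Savings uncertainty per Eq. 5.72, using the values reported in the text
t_value = 1.65        # t-value at 90% CL
rmse = 9.36           # kW, RMSE of the pre-retrofit model
mean_savings = -21.0  # kW (negative: power use increased)

uncertainty = t_value * rmse    # Eq. 5.72
print(f"savings = {mean_savings} +/- {uncertainty:.2f} kW at 90% CL")

# Relative penalty against the model-predicted pre-retrofit use
# (savings = predicted - measured, so predicted = measured + savings)
predicted_mean = 287.5 + mean_savings   # 287.5 kW is the measured post-retrofit mean
penalty = abs(mean_savings) / predicted_mean
print(f"performance penalty = {penalty:.2%}")
```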

Fig. 5.35 Measured vs. modeled plot of chiller power during the pre-retrofit period. The overall fit is excellent (RMSE = 9.36 kW and CV = 2.24%), and except for a few data points, the data seem well behaved. Total number of data points = 4607.

Fig. 5.36 Difference between post-period measured and pre-retrofit model-predicted chiller power, indicating that post-retrofit values are higher than those during the pre-retrofit period (mean increase = 21 kW or 7.88%). One can clearly distinguish two operating patterns in the data, suggesting some intrinsic behavioral change in chiller operation. The entire data set for the post-period, consisting of 5078 observations, has been used in this analysis.

5.10 Parting Comments on Regression Analysis and OLS

Recall that OLS regression is an important sub-class of regression analysis methods. The approach adopted in OLS was to minimize an objective function (also referred to as the loss function) expressed as the sum of the squared residuals (given by Eq. 5.3). One was able to derive closed form solutions for the model parameters and their variance under certain simplifying assumptions as to how noise corrupts measured system performance. Such closed form solutions cannot be obtained for many situations where the function to be minimized has to be framed differently, and these require the adoption of search methods. Thus, parameter estimation problems are, in essence, optimization problems where the objective function is framed in accordance with what one knows about the errors in the measurements and in the model structure. OLS yields the best linear unbiased estimates provided the conditions (called the Gauss-Markov conditions) stated in Sect. 5.5.1 are met. These conditions, along with the stipulation that an additive linear relationship exists between the response and regressor variables, can be summarized as:


• Data are random, and the expected value of the model residuals/errors is zero: E{εi} = 0, i = 1, ..., N.
• Model residuals {ε1, ..., εN} and regressors {x1, ..., xN} are independent.
• Model residuals are uncorrelated: cov{εi, εj} = 0, i, j = 1, ..., N, i ≠ j.
• Model residuals have constant variance: var{εi} = σ², i = 1, ..., N.        (5.73)

Further, OLS applies when the measurement errors in the regressors are small compared to that of the response and when the response variable is normally distributed. These conditions are often not met in practice. This has led to the development of a unified approach called generalized linear models (GLM) which is treated in Sect. 9.4.
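A quick numerical check of these conditions on a set of residuals can be sketched as follows. Synthetic residuals are used here; in practice one would use the residuals of the fitted model. The Durbin-Watson statistic, discussed earlier in the chapter, is near 2 when successive residuals are uncorrelated.

```python
import random

random.seed(2)

# Synthetic residuals standing in for a fitted model's errors
resid = [random.gauss(0, 1.0) for _ in range(500)]

n = len(resid)
mean_resid = sum(resid) / n                  # should be near zero
var_resid = sum(e * e for e in resid) / n    # roughly constant sigma^2

# Durbin-Watson statistic: ~2 for uncorrelated residuals,
# ~0 for strong positive and ~4 for strong negative autocorrelation
dw = (sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n))
      / sum(e * e for e in resid))

print(round(mean_resid, 3), round(var_resid, 3), round(dw, 3))
```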

Problems

Pr. 5.1 Table 5.14 lists various properties of saturated water in the temperature range 0-100 °C.
(a) Investigate first-order and second-order polynomials that fit saturated vapor enthalpy to temperature in °C. Identify the better model by looking at R², RMSE, and CV values for both models. Predict the value of saturated vapor enthalpy at 30 °C along with 95% CI and 95% prediction intervals.
(b) Repeat the above analysis for specific volume, but investigate third-order polynomial fits as well. Predict the value of specific volume at 30 °C along with 95% CI and 95% prediction intervals.
(c) Calculate the skill factors for the second- and third-order models with the first-order model as the baseline.

Pr. 5.2 Regression of home size versus monthly energy use
It is natural to expect that monthly energy use in a home increases with the size of the home. Table 5.15 assembles 10 data points of home size (in square feet) versus energy use (kWh/month).


You will analyze this data as follows:
(a) Plot the data as is and visually determine linear or polynomial trends. Perform the regression and report the model goodness-of-fit and parameter values using (i) all 10 points, and (ii) a bootstrap analysis with 10 samples. Compare the results of both methods and draw pertinent conclusions.
(b) Repeat the above analysis taking the logarithm of home area versus energy use.
(c) Which of these two models would you recommend for future use? Provide justification.

Pr. 5.3 Tensile tests on a steel specimen yielded the results shown in Table 5.16.
(a) Assuming the regression of y on x to be linear, estimate the parameters of the regression line and determine the 95% CI for x = 4.5.
(b) Now regress x on y and estimate the parameters of that regression line. For the same value of y predicted in (a) above, determine the value of x. Compare this value with the value of 4.5 assumed in (a). If different, discuss why.
(c) Compare the R² and CV values of both models. Discuss differences in light of the results of part (b).
(d) Plot the residuals of both models and identify the one preferable for OLS.
(e) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of (a) and (b) above.

Pr. 5.4 The yield of a chemical process was measured at three temperatures (in °C), each with two concentrations of a particular reactant, as recorded in Table 5.17.
(a) Use OLS to find the best values of the coefficients a, b, and c assuming the equation y = a + b·t + c·x.
(b) Calculate the R², RMSE, and CV of the overall model as well as the SE of the parameters.
(c) Using the β coefficient concept described in Sect. 5.4.5, determine the relative importance of the two independent variables on the yield.

Table 5.14 Data table for Problem 5.1
Temperature t (°C) | Specific volume v (m³/kg) | Sat. vapor enthalpy (kJ/kg)
0    | 206.3  | 2501.6
10   | 106.4  | 2519.9
20   | 57.84  | 2538.2
30   | 32.93  | 2556.4
40   | 19.55  | 2574.4
50   | 12.05  | 2592.2
60   | 7.679  | 2609.7
70   | 5.046  | 2626.9
80   | 3.409  | 2643.8
90   | 2.361  | 2660.1
100  | 1.673  | 2676


Table 5.15 Data table for Problem 5.2
Home area (sq. ft):   1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930
Energy use (kWh/mo):  1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954

Table 5.16 Data table for Problem 5.3
Tensile force x:  1   2   3   4   5   6
Elongation y:     15  35  41  63  77  84

Table 5.17 Data table for Problem 5.4
Temperature, t | Concentration, x | Yield y
40 | 0.2 | 38
40 | 0.4 | 42
50 | 0.2 | 41
50 | 0.4 | 46
60 | 0.2 | 46
60 | 0.4 | 49

Table 5.18 Data table for Problem 5.5a
LF | CCoal | CEle
85 | 15 | 4.1
80 | 17 | 4.5
70 | 27 | 5.6
74 | 23 | 5.1
67 | 20 | 5.0
87 | 29 | 5.2
78 | 25 | 5.3
73 | 14 | 4.3
72 | 26 | 5.8
69 | 29 | 5.7
82 | 24 | 4.9
89 | 23 | 4.8
a Data available electronically on book website

Table 5.19 Data table of outlet water temperature Tco (°C) for Problem 5.6a
              Ambient wet-bulb temperature Twb (°C)
Range R (°C) | 20    | 21.5  | 23    | 23.5  | 26
10           | 25.54 | 26.47 | 27.31 | 27.29 | 29.08
13           | 26.51 | 27.30 | 27.69 | 28.18 | 29.84
16           | 27.34 | 27.86 | 28.14 | 29.16 | 29.88
19           | 27.68 | 28.40 | 29.24 | 29.29 | 30.98
22           | 27.89 | 29.15 | 29.19 | 29.34 | 30.83
a Data available electronically on book website

Pr. 5.5 Cost of electric power generation versus load factor and cost of coal
The cost to an electric utility of producing power (CEle) in mills per kilowatt-hour ($10⁻³/kWh) is a function of the load factor (LF) in % and the cost of coal (CCoal) in cents per million Btu. Relevant data are assembled in Table 5.18.
(a) Investigate different models (first order and second order, with and without interaction terms) and identify the best model for predicting CEle vs. LF and CCoal. Use stepwise regression if appropriate. (Hint: plot the data and look for trends first.)
(b) Perform residual analysis.
(c) Calculate the R², RMSE, CV, and DW of the overall model as well as the SE of the parameters. Is DW relevant?

(d) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of the best model identified earlier.

Pr. 5.6 Modeling of cooling tower performance
Manufacturers of cooling towers often present catalog data showing the outlet water temperature Tco as a function of the ambient air wet-bulb temperature (Twb) and the range R (the difference between the inlet and outlet water temperatures). Table 5.19 assembles data for a specific cooling tower.
(a) Identify an appropriate model (investigate first-order linear and second-order polynomial models without and with interaction terms for Tco) by looking at R², RMSE, and CV values, the individual t-values of the parameters, as well as the behavior of the overall model residuals.


(b) Calculate the skill factor of the final model compared to the baseline model (assumed to be the first-order model without interaction effects).
(c) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(d) Summarize the additional insights which the k-fold analysis has provided.

Pr. 5.7 Steady-state performance testing of a solar thermal flat-plate collector
Solar thermal collectors are devices that convert the radiant energy from the sun into useful thermal energy for heating, say, water for domestic or industrial applications. Because of low collector time constants, heat capacity effects are usually small compared to the hourly time step used to drive the model. The steady-state useful energy qC delivered by a solar flat-plate collector of surface area AC is given by the Hottel-Whillier-Bliss equation (see any textbook on solar thermal collectors, e.g., Reddy 1987):

qC = AC FR [IT ηn - UL (TCi - Ta)]⁺        (5.74)

where FR is called the heat removal factor and is a measure of the solar collector performance as a heat exchanger (since it can be interpreted as the ratio of actual heat transfer to the maximum possible heat transfer); ηn is the optical efficiency or the product of the transmittance and absorptance of the cover and absorber of the collector at normal solar incidence; UL is the overall heat loss coefficient of the collector, which is dependent on collector design only, IT is the radiation intensity on the plane of the collector, Tci is the temperature of the fluid entering the collector, and Ta is the ambient temperature. The + sign denotes that only positive values are to be used, which physically implies that the collector should not be operated if qC is negative, that is, when the collector loses more heat than it can collect (which can happen under low radiation and high Tci conditions).

Steady-state collector testing is the best manner for a manufacturer to rate his product. From an overall heat balance on the collector fluid and from Eq. 5.74, the expressions for the instantaneous collector efficiency ηC under normal solar incidence are:

ηC ≡ qC / (AC IT) = mC cpC (TCo - TCi) / (AC IT) = FR ηn - FR UL (TCi - Ta) / IT        (5.75a, b)

where mC is the total fluid flow rate through the collectors, cpC is the specific heat of the fluid flowing through the collector, and TCi and TCo are the inlet and exit temperatures of the fluid to the collector. Thus, measurements (done, of course, as per the standard protocol, ASHRAE 1978) of IT, TCi, and TCo are made under a pre-specified and controlled value of the fluid flow rate, from which ηC can be calculated using Eq. 5.75a. The test data are then plotted as ηC against the reduced temperature [(TCi - Ta)/IT], as shown in Fig. 5.37. A linear fit is made to these data points by regression using Eq. 5.75b, from which the values of FRηn and FRUL are deduced. If the same collector is tested during different days, slightly different numerical values are obtained for the two parameters FRηn and FRUL, which are often, but not always, within the uncertainty bands of the estimates. Model misspecification (i.e., the model is not perfect, which can occur, for example, if the collector heat losses are not strictly linear) is partly the cause of such variability. This is somewhat disconcerting to a manufacturer, since it introduces ambiguity as to which values of the parameters to present in his product specification sheet. The data points of Fig. 5.37 are assembled in Table 5.20. Assume that water is the working fluid.
(a) Perform OLS regression using Eq. 5.75b and identify the two parameters FRηn and FRUL along with their variance. Plot the model residuals and study their behavior.
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates with those identified earlier.
(c) Draw a straight line visually through the data points and determine the x-axis and y-axis intercepts. Estimate the FRηn and FRUL parameters and compare them with those determined from (a).
(d) Calculate the R², RMSE, and CV values of the model.

Fig. 5.37 Test data points of thermal efficiency of a double-glazed flat-plate liquid collector versus reduced temperature. The regression line of the model given by Eq. 5.75 is also shown. (From ASHRAE (1978) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)


(e) Calculate the F-statistic to test for overall model significance.
(f) Perform t-tests on the individual model parameters.
(g) Use the model to predict collector efficiency when IT = 800 W/m², TCi = 35 °C, and Ta = 10 °C.
(h) Determine the 95% CL intervals for the mean and individual responses for (g) above.
(i) The steady-state model of the solar thermal collector assumes the heat loss term, given by [UL(TCi - Ta)], to be linear in the temperature difference between the collector inlet temperature and the ambient temperature. One wishes to investigate whether the model improves if the loss term includes an additional second-order term:
(i) Derive the resulting expression for collector efficiency analogous to Eq. 5.75b. (Hint: start with the fundamental heat balance equation, Eq. 5.74.)
(ii) Does the data justify the use of such a model?
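Part (a) reduces to a simple linear regression of efficiency on reduced temperature. A sketch with made-up data points (not those of Table 5.20) shows how FRηn and FRUL fall out of the intercept and slope of Eq. 5.75b:

```python
# Fit eta = FR*eta_n - FR*UL * xr (Eq. 5.75b) by OLS.
# The (reduced temperature, efficiency) pairs below are illustrative only.
pts = [(0.01, 0.63), (0.03, 0.52), (0.05, 0.36), (0.07, 0.24), (0.09, 0.12)]

n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
intercept = my - slope * mx

FR_eta_n = intercept   # y-axis intercept gives FR*eta_n
FR_UL = -slope         # negative of the slope gives FR*UL
print(round(FR_eta_n, 3), round(FR_UL, 2))
```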

Table 5.20 Data table for Problem 5.7a
x:      0.009 0.011 0.025 0.025 0.025 0.025 0.050 0.051 0.052 0.053 0.056 0.056 0.061 0.062 0.064 0.065 0.065 0.069 0.071 0.071 0.075 0.077 0.080 0.083 0.086 0.091 0.094
y (%):  64    65    56    56    52.5  49    35    30    30    31    29    29    29    25    27    26    24    24    23    21    20    20    16    14    14    12    10
a Data available electronically on book website

Pr. 5.8¹⁶ Dimensionless model for fans or pumps
The performance of a fan or pump is characterized in terms of the head or pressure rise across the device and the flow rate for a given shaft power. The use of dimensionless variables simplifies and generalizes the model. Dimensional analysis (consistent with fan affinity laws for changes in speed, diameter, and air density) suggests that the performance of a centrifugal fan can be expressed as a function of two dimensionless groups representing the pressure head and flow coefficient, respectively:

Ψ = SP / (D² ω² ρ)   and   Φ = Q / (D³ ω)        (5.76)

where SP is the static pressure, Pa; D is the diameter of the wheel, m; ω is the rotative speed, rad/s; ρ is the density, kg/m³; and Q is the volume flow rate of air, m³/s. For a fan operating at constant density, it should be possible to plot one curve of Ψ vs. Φ that represents the performance at all speeds and diameters for this generic class of machines. The performance of a certain 0.3 m diameter fan is shown in Table 5.21.
(a) Convert the given data into the two dimensionless groups defined by Eq. 5.76.
(b) Next, plot the data and formulate two or three promising functions.
(c) Identify the best function by looking at the R², RMSE, CV, and DW values and at residual behavior.
(d) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(e) Summarize the additional insights which the k-fold analysis has provided.
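Converting a single catalog point to the two groups of Eq. 5.76 can be sketched as follows (the air density is the STP value of 1.204 kg/m³ noted with this problem set):

```python
# One catalog point from Table 5.21 converted per Eq. 5.76
D = 0.3        # wheel diameter, m
rho = 1.204    # air density at STP, kg/m^3

omega, Q, SP = 157.0, 1.42, 861.0   # rad/s, m^3/s, Pa

psi = SP / (D ** 2 * omega ** 2 * rho)   # pressure-head coefficient
phi = Q / (D ** 3 * omega)               # flow coefficient
print(round(psi, 3), round(phi, 3))
```

Repeating this over all rows of the table collapses the four speed curves onto (ideally) a single Ψ-Φ curve.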

Table 5.21 Data table for Problem 5.8a
Rotation ω (rad/s) | Flow rate Q (m³/s) | Static pressure SP (Pa)
157 | 1.42 | 861
157 | 1.89 | 861
157 | 2.36 | 796
157 | 2.83 | 694
157 | 3.02 | 635
157 | 3.30 | 525
126 | 1.42 | 548
126 | 1.79 | 530
126 | 2.17 | 473
126 | 2.36 | 428
126 | 2.60 | 351
126 | 3.30 | 114
94  | 0.94 | 304
94  | 1.27 | 299
94  | 1.89 | 219
94  | 2.22 | 134
94  | 2.36 | 100
63  | 0.80 | 134
63  | 1.04 | 122
63  | 1.42 | 70
63  | 1.51 | 55
a Data available electronically on book website

¹⁶ From Stoecker (1989), with permission from McGraw-Hill.


Table 5.22 Data table for Problem 5.10a
kT:      0.1   0.15  0.2   0.25  0.3   0.35  0.4   0.45  0.5   0.55  0.6   0.65  0.7   0.75  0.8   0.85  0.9
(Id/I):  0.991 0.987 0.982 0.978 0.947 0.903 0.839 0.756 0.658 0.55  0.439 0.333 0.244 0.183 0.164 0.166 0.165
a Data available electronically on book website

Table 5.23 Data table for Problem 5.11
Balance point temp. (°C):  25    20    15    10    5    0    -5
VBDD (°C-Days):            4750  3900  2000  1100  500  100  0

Assume the density of air at STP conditions to be 1.204 kg/m3.

Pr. 5.9 Consider the data used in Example 5.6.3, meant to illustrate the use of weighted regression for replicate measurements with non-constant variance. For the same data set, identify a model using the logarithmic transform approach similar to that shown in Example 5.6.2.

Pr. 5.10 Spline models for solar radiation
This problem involves using splines for functions with abrupt hinge points. Several studies have proposed correlations to predict different components of solar radiation from more routinely measured components. One such correlation relates the fraction of hourly diffuse solar radiation on a horizontal surface (Id) to the global radiation on a horizontal surface (I) through a quantity known as the hourly atmospheric clearness index (kT = I/I0), where I0 is the extraterrestrial hourly radiation on a horizontal surface at the same latitude, time, and day of the year (Reddy 1987). The latter is an astronomical quantity and can be predicted almost exactly. Data have been gathered (Table 5.22) from which a correlation (Id/I) = f(kT) needs to be identified.
(a) Plot the data and visually determine the likely locations of the hinge points. (Hint: there should be two points, one at either extreme.)
(b) Previous studies have suggested the following three functional forms: a constant model for the lower range, a second-order model for the middle range, and a constant model for the higher range. Evaluate with the data provided whether this functional form still holds, and report pertinent models and relevant goodness-of-fit indices. [Hint: Make sure that the function is continuous at the hinge points.]

Pr. 5.11 Modeling variable-base degree-days with balance point temperature at a specific location
Degree-day methods provide a simple means of determining annual energy use in envelope-dominated buildings operated constantly and with simple HVAC systems, which can be characterized by a constant efficiency. Such simple single-measure methods capture the severity of the climate in a particular location. The variable base degree day (VBDD) method is conceptually similar to the simple degree-day method but is an improvement since it is based on the actual balance point of the house instead of the outdated default value of 65 °F or 18.3 °C (Reddy et al. 2016). Table 5.23 assembles the VBDD values for New York City, NY, determined from actual climatic data over several years at this location.
(a) Identify a suitable regression curve for VBDD versus balance point temperature for this location and report all pertinent statistics (goodness-of-fit, model parameter estimates, and their CL).
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates and their CL with the results from (a).

Pr. 5.12 Consider Example 5.7.2, where two types of buildings were modeled following the full-model (FM) and reduced-model (RM) approaches using categorical variables. Only whether the model slope parameters of the two building types differ was evaluated there. Extend the analysis to test whether both the slope and intercept parameters are affected by the type of building.


Table 5.24 Data table for Pr. 5.13a
Year | Month | E (W/ft²) | To (°F) | foc
94 | Aug | 1.006 | 78.233 | 0.41
94 | Sep | 1.123 | 73.686 | 0.68
94 | Oct | 0.987 | 66.784 | 0.67
94 | Nov | 0.962 | 61.037 | 0.65
94 | Dec | 0.751 | 52.475 | 0.42
95 | Jan | 0.921 | 49.373 | 0.65
95 | Feb | 0.947 | 53.764 | 0.68
95 | Mar | 0.876 | 59.197 | 0.58
95 | Apr | 0.918 | 65.711 | 0.66
95 | May | 1.123 | 73.891 | 0.65
95 | Jun | 0.539 | 77.840 | 0
95 | Jul | 0.869 | 81.742 | 0
95 | Aug | 1.351 | 81.766 | 0.39
95 | Sep | 1.337 | 76.341 | 0.71
95 | Oct | 0.987 | 65.805 | 0.68
95 | Nov | 0.938 | 56.714 | 0.66
95 | Dec | 0.751 | 52.839 | 0.41
96 | Jan | 0.921 | 49.270 | 0.65
96 | Feb | 0.947 | 55.873 | 0.66
96 | Mar | 0.873 | 55.200 | 0.57
96 | Apr | 0.993 | 66.221 | 0.65
96 | May | 1.427 | 78.719 | 0.64
96 | Jun | 0.567 | 78.382 | 0.1
96 | Jul | 1.005 | 82.992 | 0.2
a Data available electronically on book website

Pr. 5.13 Change point models of utility bills in variable occupancy buildings
Example 5.7.1 illustrated the use of linear spline models for monthly energy use in a commercial building versus outdoor dry-bulb temperature. Such models are useful for several purposes, one of which is energy conservation. For example, the energy manager may wish to track the extent to which energy use has been increasing over the years, or the effect of a recently implemented energy conservation measure (such as a new chiller). For such purposes, one would like to correct, or normalize, for any changes in weather, since an abnormally hot summer could obscure the beneficial effects of a more efficient chiller. Hence, factors that change over the months or the years need to be considered explicitly in the model. Two common normalization factors are changes to the conditioned floor area (e.g., an extension to an existing building wing) and changes in the number of students in a school. A model regressing monthly utility energy use against outdoor temperature is appropriate for buildings with constant occupancy (such as residences) or even offices. However, buildings such as schools are practically closed during summer, and hence the occupancy rate needs to be included as a second regressor. The functional form of the model in such cases is a multivariate change point model given by:

y = β0,un + β0 foc + β1,un x + β1 foc x + β2,un (x - xc) I + β2 foc (x - xc) I        (5.77)

where x is the monthly mean outdoor temperature (To) and y is the electricity use per square foot of the school (E). Also, foc = (Noc/Ntotal) is the ratio of the number of days in the month when the school is in session (Noc) to the total number of days in that particular month (Ntotal). The factor foc can be determined from the school calendar. Clearly, the unoccupied fraction fun = (1 - foc).

The term I represents an indicator variable whose numerical value is given by Eq. 5.71b. Note that the change point temperatures for the occupied and unoccupied periods are assumed to be identical, since the monthly data do not allow this separation to be identified. Consider the monthly data assembled for an actual school (shown in Table 5.24).
(a) Plot the data and look for change points. Note that the model given by Eq. 5.77 has 7 parameters, of which xc (the change point temperature) is the one that makes the estimation nonlinear. By inspection of the scatter plot, assume a reasonable value for this variable, and proceed to perform a linear regression as illustrated in Example 5.7.1. The search for the best value of xc (the one with minimum RMSE) requires several OLS regressions assuming different values of the change point temperature.
(b) Identify the parsimonious model and estimate the appropriate parameters of the model. Note that of the six β parameters appearing in Eq. 5.77, some may be statistically insignificant, and appropriate care should be exercised in this regard. Report appropriate model and parameter statistics.
(c) Perform a residual analysis and discuss the results.
(d) Repeat the analysis using k-fold cross-validation with k = 4 and compare the model parameter estimates with those of (b) above.

Pr. 5.14 Determining energy savings from monitoring and verification (M&V) projects
A crucial element in any energy conservation program is the ability to verify savings from measured energy use data; this is referred to as monitoring and verification (M&V). Energy service companies (ESCOs) are required to perform this as part of their services. Figure 5.38 depicts how energy savings are estimated. A common M&V protocol involves measuring


Fig. 5.38 Schematic representation of energy use prior to and after installing energy conservation measures (ECM) and of the resulting energy savings

the monthly total energy use at the facility for the whole year before the retrofit (the baseline or "pre-retrofit" period) and for a whole year after the retrofit (the "post-retrofit" period). The time taken to implement the energy-saving measures (the "construction period") is neglected in this simple example. One first identifies a baseline regression model of energy use against ambient dry-bulb temperature To during the pre-retrofit period, Epre = f(To). This model is then used to predict energy use during each month of the post-retrofit period by using the corresponding ambient temperature values. The difference between model-predicted and measured monthly energy use is the energy savings during that month.
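The protocol just described can be sketched as follows. The baseline model E = a + b·To is made up for illustration and is not fitted to the Table 5.25 data, although the three post-retrofit points used are taken from that table:

```python
# Sketch of the M&V savings calculation of Eq. 5.78.
# Baseline coefficients are illustrative, not fitted values.
a, b = -1.5, 0.05   # W/ft^2 and W/ft^2 per degF (hypothetical)

# Three (To, measured WBe) post-retrofit points from Table 5.25
post = [(83.63, 2.362), (72.04, 1.524), (54.32, 1.015)]

# Eq. 5.78, month by month: predicted pre-retrofit use minus measured use
savings = [(a + b * To) - E for To, E in post]
annual = sum(savings)
print([round(s, 3) for s in savings], round(annual, 3))
```

With the real baseline model, one would sum all twelve post-retrofit months and attach the model uncertainty, as in the case study of Sect. 5.9.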

Energy savings = Model-predicted pre-retrofit use - Measured post-retrofit use        (5.78)

The determination of the annual savings resulting from the energy retrofit, and its uncertainty, is finally made. It is very important that the uncertainty associated with the savings estimates be determined as well, for meaningful conclusions to be reached regarding the impact of the retrofit on energy use. You are given monthly data of outdoor dry-bulb temperature (To) and area-normalized whole-building electricity use (WBe) for two years (Table 5.25). The first year is the pre-retrofit period before a new energy management and control system (EMCS) for the building is installed, and the second is the post-retrofit period. The construction period, that is, the period it takes to implement the conservation measures, is taken to be negligible.

Table 5.25 Data table for Problem 5.14a
Pre-retrofit period                      Post-retrofit period
Month     To (°F)  WBe (W/ft²)           Month     To (°F)  WBe (W/ft²)
1994-Jul  84.04    3.289                 1995-Jul  83.63    2.362
Aug       81.26    2.827                 Aug       83.69    2.732
Sep       77.98    2.675                 Sep       80.99    2.695
Oct       71.94    1.908                 Oct       72.04    1.524
Nov       66.80    1.514                 Nov       62.75    1.109
Dec       58.68    1.073                 Dec       57.81    0.937
1995-Jan  56.57    1.237                 1996-Jan  54.32    1.015
Feb       60.35    1.253                 Feb       59.53    1.119
Mar       62.70    1.318                 Mar       58.70    1.016
Apr       69.29    1.584                 Apr       68.28    1.364
May       77.14    2.474                 May       78.12    2.208
Jun       80.54    2.356                 Jun       80.91    2.070
a Data available electronically on book website

(a) Plot time series and x–y plots and see whether you can visually distinguish the change in energy use as a result of installing the EMCS (similar to Fig. 5.38).
(b) Evaluate at least two different models (with one of them being a model with indicator variables) for the pre-retrofit period, and select the better model.
(c) Repeat the analysis using bootstrapping (10 samples are adequate) and compare the model parameter estimates with those of (a) and (b) above.
(d) Use this baseline model to determine month-by-month energy use during the post-retrofit period representative of energy use had the conservation measure not been implemented (this is the “model-predicted pre-retrofit energy use” of Eq. 5.78).
(e) Determine the month-by-month as well as the annual energy savings.
(f) The ESCO which suggested and implemented the ECM claims a savings of 15%. You have been retained by the building owner as an independent M&V consultant to verify this claim. Prepare a short report describing your analysis methodology, results, and conclusions. (Note: you should also calculate the 90% uncertainty in the estimated savings assuming zero measurement uncertainty. Only the cumulative annual savings and their uncertainty are required, not month-by-month values.)

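The savings calculation of Eq. 5.78 can be sketched in a few lines using the Table 5.25 data. A simple straight-line baseline is fit here purely for illustration; parts (b) and (c) of the problem ask for better functional forms and for the associated uncertainty, which this sketch does not address.

```python
import numpy as np

# Pre-retrofit monthly data from Table 5.25 (To in deg F, WBe in W/ft2)
to_pre = np.array([84.04, 81.26, 77.98, 71.94, 66.80, 58.68,
                   56.57, 60.35, 62.70, 69.29, 77.14, 80.54])
wbe_pre = np.array([3.289, 2.827, 2.675, 1.908, 1.514, 1.073,
                    1.237, 1.253, 1.318, 1.584, 2.474, 2.356])
# Post-retrofit monthly data
to_post = np.array([83.63, 83.69, 80.99, 72.04, 62.75, 57.81,
                    54.32, 59.53, 58.70, 68.28, 78.12, 80.91])
wbe_post = np.array([2.362, 2.732, 2.695, 1.524, 1.109, 0.937,
                     1.015, 1.119, 1.016, 1.364, 2.208, 2.070])

# Baseline model Epre = f(To): a simple linear fit, for illustration only
slope, intercept = np.polyfit(to_pre, wbe_pre, 1)

# Eq. 5.78: savings = model-predicted pre-retrofit use - measured post-retrofit use
predicted = intercept + slope * to_post
monthly_savings = predicted - wbe_post
annual_savings = monthly_savings.sum()
```

Individual months can show negative "savings" (the model underpredicts), which is why the problem asks only for the cumulative annual savings and their uncertainty.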
References

ASHRAE, 1978. Standard 93-77: Methods of Testing to Determine the Thermal Performance of Solar Collectors, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASHRAE, 2005. Guideline 2-2005: Engineering Analysis of Experimental Data, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
Belsley, D.A., E. Kuh and R.E. Welsch, 1980. Regression Diagnostics, John Wiley & Sons, New York.
Chatfield, C., 1995. Problem Solving: A Statistician’s Guide, 2nd Ed., Chapman and Hall, London, U.K.
Chatterjee, S. and B. Price, 1991. Regression Analysis by Example, 2nd Ed., John Wiley & Sons, New York.
Cook, R.D. and S. Weisberg, 1982. Residuals and Influence in Regression, Chapman and Hall, New York.
Davison, A.C. and D. Hinkley, 1997. Bootstrap Methods and Their Application, Cambridge University Press, Cambridge, U.K.
Draper, N.R. and H. Smith, 1981. Applied Regression Analysis, 2nd Ed., John Wiley & Sons, New York.
Efron, B. and R. Tibshirani, 1985. The Bootstrap Method for Assessing Statistical Accuracy, Behaviormetrika, 12, 1–35.
Ezekiel, M. and K.A. Fox, 1959. Methods of Correlation and Regression Analysis, 3rd Ed., John Wiley & Sons, New York.
Freedman, D. and S. Peters, 1984. Bootstrapping an Econometric Model: Some Empirical Results, Journal of Business and Economic Statistics, 2, 150–158.
Gordon, J.M. and K.C. Ng, 2000. Cool Thermodynamics, Cambridge International Science Publishing, Cambridge, U.K.
Hair, J.F., R.E. Anderson, R.L. Tatham and W.C. Black, 1998. Multivariate Data Analysis, 5th Ed., Prentice Hall, Upper Saddle River, NJ.
James, G., D. Witten, T. Hastie and R. Tibshirani, 2013. An Introduction to Statistical Learning: with Applications in R, Springer, New York.
Katipamula, S., T.A. Reddy and D.E. Claridge, 1998. Multivariate regression modeling, ASME Journal of Solar Energy Engineering, 120, p. 177, August.
Neter, J., W. Wasserman and M.H. Kutner, 1983. Applied Linear Regression Models, Richard D. Irwin, Homewood, IL.
Pindyck, R.S. and D.L. Rubinfeld, 1981. Econometric Models and Economic Forecasts, 2nd Ed., McGraw-Hill, New York, NY.
Reddy, T.A., 1987. The Design and Sizing of Active Solar Thermal Systems, Clarendon Press, Oxford University Press, Oxford, U.K., September.
Reddy, T.A., N.F. Saman, D.E. Claridge, J.S. Haberl, W.D. Turner and A.T. Chalifoux, 1997. Baselining methodology for facility-level monthly energy use, Part 1: Theoretical aspects, ASHRAE Transactions, 103(2), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
Reddy, T.A., J.F. Kreider, P. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings: Principles and Practice of Energy Efficient Design, 3rd Ed., CRC Press, Boca Raton, FL.
Schenck, H., 1969. Theories of Engineering Experimentation, 2nd Ed., McGraw-Hill, New York.
Shannon, R.E., 1975. System Simulation: The Art and Science, Prentice-Hall, Englewood Cliffs, NJ.
Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Ed., McGraw-Hill, New York.
Walpole, R.E., R.H. Myers and S.L. Myers, 1998. Probability and Statistics for Engineers and Scientists, 6th Ed., Prentice Hall, Upper Saddle River, NJ.

6  Design of Physical and Simulation Experiments

Abstract

One of the objectives of performing engineering experiments is to assess performance/quality improvements (or system response variable) of a product under different changes/variations during the manufacturing process (called “treatments”). Experimental design is the term used to denote the series of planned experiments to be undertaken to compare the effect of one or more treatments or interventions on a response variable. The analysis of such data once collected entails methods which are a logical extension of the Student t-test and one-way ANOVA hypothesis tests meant to compare two or more population means of samples. Design of Experiments (DOE) is a broader term which includes defining the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact manner in which samples for testing need to be selected, specifying the conditions and executing the test sequence where one variable is varied at a time, analyzing the data collected to verify (or refute) statistical hypotheses, and then drawing meaningful conclusions. Selected experimental design methods are discussed such as full and fractional factorial designs, and complete block and Latin squares designs. The parallel between model building in a DOE framework and linear multiple regression is illustrated. Also discussed are response surface modeling (RSM) designs, which are meant to accelerate the search toward optimizing a process or finding the proper product mix by simultaneously varying more than one continuous treatment variable. It is a sequential approach where one starts with test conditions in a plausible area of the search space, analyzes test results to determine the optimal direction to move, performs a second set of test conditions, and so on till the required optimum is reached. Central composite design (CCD) is often used for RSM situations for

continuous treatment variables since it allows fitting a second-order response surface with greater efficiency. Computer simulations are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of factors one can consider. There are parallels between the traditional physical DOE approach and designs based on computer simulations. The last section of this chapter discusses similarities and the important considerations/differences between experimental design in both fields. It presents the various methods of sampling when the set of input design variables is very large (in some cases the RSM-CCD design can be used, but the Latin Hypercube Monte Carlo method and its variants such as the Morris method are much more efficient), for performing sensitivity analysis to identify important input variables (similar to screening), and for reducing the number of computer simulations by adopting space-filling interpolation methods (also called surrogate modeling).

6.1 Introduction

6.1.1 Types of Data Collection

All statistical data analyses are predicated on acquiring proper data, and the more “proper” the data, the sounder the statistical analysis. Basically, data can be collected in one of three ways (Montgomery 2017):

(i) A retrospective cohort study involving a control group and a test group. For example, in medical and psychological research, data are collected from a group of individuals exposed or vaccinated against a certain factor and compared to another control group;
(ii) An observational study where data are collected during normal operation of the system and the observer cannot intervene; relevant analysis methods have been discussed briefly in Sect. 3.8 and in Sect. 10.3;
(iii) A designed experiment where the analyst can control certain system inputs and has the leeway to frame/perform the sequence of experiments as desired. This is the essence of a body of knowledge referred to as design of experiments (DOE), the focus of this chapter.

One of the objectives of performing designed experiments in an engineering context is to improve some quality of products during their manufacture. Specifically, this involves evaluating process yield improvement or detecting changes in system performance, i.e., the response to different specific changes (called treatments). An example related to metallurgy is to study the influence of adding carbon to iron in different concentrations to increase the strength and toughness of steels. The word “treatment” is generic and is used to denote an intervention in a process such as, say, an additive nutrient in fertilizers, use of different machines during manufacture, design changes in manufacturing components/processes, etc. (Box et al. 1978).

Often data are collected without proper reflection on the intended purpose, and the analyst then tries to do the best they can. The two previous chapters dealt with statistical techniques for analyzing data which were already gathered. However, no amount of “creative” statistical data analysis can reveal information not available in the data itself. The richness of a data set is ascertained not by the amount of data but by the extent to which all possible states of the system are represented in it. This is especially true for observational data sets where data are collected while the system is under routine day-to-day operation without any external intervention by the observer.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_6
The process of proper planning and execution of experiments, intentionally designed to provide data rich in information especially suited for the intended objective with low/least effort and cost, is referred to as experimental design. Optimal experimental design is one which stipulates the conditions under which each observation should be taken to minimize/maximize the system response or inherent characteristics. Practical considerations and constraints often complicate the design of optimal experiments, and these factors need to be explicitly considered.

6.1.2 Purpose of DOE

Experimental design techniques were developed at the beginning of the twentieth century primarily in the context of agricultural research, subsequently migrating to industrial engineering, and then on to other fields. The historic reason for their development was to ascertain, by hypothesis testing, whether a certain treatment increased the yield or improved the strength of the product. The statistical techniques which stipulate how each of the independent variables or factors have to be varied so as to obtain the most information about system behavior quantified by a response variable and do so with a minimum of tests (and hence, low/least effort and expense) are the essence of the body of knowledge known as experimental design. Design of experiments (DOE) is broader in scope and includes additional aspects: defining the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact way samples for testing need to be selected, specifying the conditions and executing the test sequence, analyzing the data collected to verify (or refute) statistical hypotheses and then drawing meaningful conclusions. Often the terms experimental design and DOE are used interchangeably with the context dispelling any confusion. Experimental design involves one or more aspects where the intent is to (i) “screen” a large number of possible candidates or likely variables/factors and identify the dominant variables. These possible candidate factors are then subject to more extensive investigation; (ii) formulate the test conditions and sequence so that sources of unsuspecting and uncontrollable/extraneous errors can be minimized while eliciting the necessary “richness” in system behavior, and (iii) build a suitable mathematical model between the factors and the response variable using the data set acquired. 
This involves both hypothesis testing to identify the significant factors as well as model building and residual diagnostic checking. Many software packages have a DOE wizard which walks one through the various steps of defining the entire set of experimental combinations. Typical application goals of DOE are listed in Table 6.1. The relative importance of these goals depends on the

Table 6.1 Typical application goals of DOE

Hypothesis testing (Sects. 6.2, 6.4, 6.5): Determine whether there is a difference between the mean of the response variable for the different levels of a factor (i.e., to verify whether the new product or process is indeed an improvement over the status quo)
Factor screening (Sect. 6.4.2): Determine which factors greatly influence the response variable
Factor interaction (Sect. 6.4): Determine whether (or which) factors interact with each other
Model development (Sects. 6.4, 6.5): Determine the functional relationship between factors and the response variable
Response surface design (Sect. 6.6): Experimental design which allows determining the numerical values or settings of the factors that maximize/minimize the response variable

specific circumstance. For example, often and especially so in engineering model building, the dominant regressor or independent variable set is known beforehand, acquired either from mechanistic insights or prior experimentation; and so, the screening phase may be redundant. In the context of calibrating a detailed simulation program with monitored data (Sect. 6.7), the problem involves dozens of input parameters. Which parameters are influential can be determined from a sensitivity analysis which is directly based on the principles of screening tests in DOE.

6.1.3 DOE Terminology

Literature on DOE has its own somewhat unique terminology which needs to be well understood. Referring to Fig. 1.8, a component or system can be represented by a simple block diagram, which consists of controllable input variables, uncontrollable inputs, system response, and the model structure denoted by the mathematical model and its parameter vector. The description of important terms is assembled in Table 6.2. Of special importance are the three terms: treatment factors, nuisance factors, and random effects. These and other terms appearing in this table are discussed in this chapter.


6.2 Overview of Different Statistical Methods

6.2.1 Different Types of ANOVA Tests

Recall the two statistical tests covered in Chap. 4, both of which involve comparing the mean values of one factor or variable corresponding to samples drawn from different populations. The Student t-test is a hypothesis test meant to compare two population means using the t-distribution (Sect. 4.2). The one-factor or one-way ANOVA (Sect. 4.3) is a variance-based statistical F-test to concurrently compare three or more population means of samples gathered after the system has been subject to one or more treatments or interventions. Note that it could also be used for comparing two population means, but the t-test is simpler. These tests are not meant for cases where more than one factor influences the outcome.

Consider an example where an animal must be fattened (increase in weight to fetch a higher sale price), and the effect of two different animal feeds is being evaluated. If the tests are conducted without considering the confounding random effect of other disturbances which may influence the outcome, one obtains a large variance in the mean, and the effect of the feed type is difficult to isolate/identify statistically

Table 6.2 Description of important terms used in DOE

1. Component/system response: Usually a continuous variable (akin to the dependent variable in regression analysis discussed in Chap. 5)
2. Factors: Controllable component or system variables/inputs, usually discrete or qualitative/categorical. Continuous variables need to be discretized
2a. Treatment or primary factors: The interventions or controlled variables whose impact on the response is the primary purpose of adopting DOE
2b. Nuisance or secondary factors: Variables which are sources of variability and can be controlled by blocking. They are not of primary interest to the experimenter but need to be included in the model
2c. Covariates: Nuisance factors that cannot be controlled but can be measured prior to, or during, the experiment. They can be included in the model if they are of interest
3. Random effects: Extraneous or unspecified (or lurking) disturbances which the experimenter cannot control or measure, but which impact system response and obfuscate the statistical analysis
4. Factor levels: Discrete values of the treatment and nuisance factors selected for conducting experiments (e.g., low, medium, and high can be coded as -1, 0, and +1). For continuous variables, the range of variation is discretized into a small number of numerical value sets or levels
5. Blocking: Clamping down on the variability of known nuisance factors to minimize or remove their influence/impact on the variability of the main factor. Reduces experimental error and increases the power of the statistical analysis. By not blocking a nuisance factor, it is much more difficult to detect whether the primary factor is significant or not
6. Randomization: Executing test runs in a random order to average out the impact of random effects. Meant to minimize the effect of “noise” in the experiments and improve estimation of other factors using statistical methods
7. Replication: Repeating each factor/level combination independently of any previous run to reduce the effect of random error. Replication allows estimating the experimental error and provides a more precise estimate of the actual parameter values
8. Experimental units: Physical entities such as lab equipment setups, pilot plants, field parcels, etc. for conducting DOE experiments. Any practical situation will have a limitation on the number of units available
9. Block: A noun used to denote a set or group of experimental units where one of the treatment factors has been blocked


(Fig. 6.1). It is in such cases that DOE can determine, by adopting suitable block designs, whether (and by how much) a change in one input variable (or intended treatment) among several known and controllable inputs affects the mean behavior/response/output of the process/system.

Fig. 6.1 The variance in two samples of the response variable (weight) against two treatments (factors 1 and 2) can be large when several nuisance factors are present. In such cases, one-way ANOVA tests are inadequate. DOE strategies reduce the variance and bias due to secondary or nuisance factors and average out the influence of random or uncontrollable effects. This is achieved by restricted randomization, replication, and blocking, resulting in tighter limits as shown

Two-way ANOVA is used to compare mean differences on one dependent continuous variable between groups which have been subject to different interventions or treatments involving two independent discrete factors, with one of them being a nuisance/secondary factor/variable. Similarly, three-way ANOVA involves techniques where two nuisance factors are blocked. The design of experiments can be extended to include:

(i) Several factors, but only two to four factors will be discussed here for the sake of simplicity. The factors could be either controllable by the experimenter or not (random or extraneous factors). The source of variation of controllable variables can be reduced by fixing them at preselected levels, called blocking, and collecting different sets of experimental data. The analysis of these different sets or experimental units or blocks would allow the effect of treatment variables to be better discerned. If a model is being identified, the impact of cofactors can be included as well. The effect of the uncontrollable/unspecified disturbances requires suitable experimental procedures involving randomization and replication;
(ii) Several levels or treatments, but only two to four levels will be considered here for conceptual simplicity. The levels of a factor are the different values or categories which the factor can assume. This can be dictated by the type of variable (which can be continuous or discrete or qualitative) or selected by the experimenter. Note that the factors can be either continuous or qualitative. In the case of continuous variables, their range of variation is discretized into a small set of numerical values; as a result, the levels have a magnitude associated with them. This is not the case with qualitative/categorical variables where there is no magnitude involved; the grouping is done based on some classification criterion such as different types of treatments.

6.2.2 Link Between ANOVA and Regression

ANOVA and linear regression are equivalent when used to test the same hypotheses. The response variable is continuous in both cases while the factors or independent variables are discrete for ANOVA and continuous in the regression setting (Sect. 5.7 discusses the use of discrete or dummy regressors, but this is not the predominant situation). ANOVA can be viewed as a special case of regression analysis with all independent factors being qualitative. However, there is a difference in their application. ANOVA is an analysis method, widely used in experimental statistics, which addresses the question: what is the expected difference in the mean response between different groups/categories? Its main concern is to reduce the residual variance of a response variable in such a manner that the individual impact of specific factors can be better determined statistically. On the other hand, regression analysis is a mathematical modeling tool that aims to develop a predictive model for the change in the response when the predictor(s) changes by a given amount (or between different groups/categories). Achieving high model goodness-of-fit and predictive accuracy are the main concerns.
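The equivalence can be demonstrated numerically: the one-way ANOVA F-statistic computed from sums of squares matches the F-test of a regression on indicator (dummy) variables for the groups. The three-group data below are made up for illustration.

```python
import numpy as np

# Illustrative data: one factor with three levels (groups), continuous response
groups = {
    "A": np.array([10.0, 12.0, 11.0, 13.0]),
    "B": np.array([14.0, 15.0, 13.0, 16.0]),
    "C": np.array([18.0, 17.0, 19.0, 20.0]),
}
y = np.concatenate(list(groups.values()))

# One-way ANOVA F-statistic from between- and within-group sums of squares
grand = y.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
df_b, df_w = len(groups) - 1, len(y) - len(groups)
F_anova = (ss_between / df_b) / (ss_within / df_w)

# Same test via regression on indicator variables for groups B and C
X = np.column_stack([
    np.ones(len(y)),
    np.repeat([0.0, 1.0, 0.0], 4),   # indicator for group B
    np.repeat([0.0, 0.0, 1.0], 4),   # indicator for group C
])
beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = res[0]                      # residual sum of squares of the fit
ss_tot = ((y - grand) ** 2).sum()
F_reg = ((ss_tot - ss_res) / df_b) / (ss_res / df_w)
# F_anova and F_reg coincide: ANOVA is regression with qualitative regressors
```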

6.2.3 Recap of Basic Model Functional Forms

Section 5.4 discussed higher-order regression models involving more than one regressor. The basic additive linear model was given by Eq. 5.22a, while its “normal” transformation is one where the individual regressor variables are subtracted by their mean values:


Fig. 6.2 Points required for modeling linear and non-linear effects

y = β0′ + β1(x1 − x̄1) + β2(x2 − x̄2) + ⋯ + βk(xk − x̄k) + ε    (5.20)

where k is the number of regressors and ε is the error or unexplained variation in y. The intercept term β0′ can be interpreted as the mean response of y. The basic regression model can be made to capture nonlinear variation in the response variables. For example, as shown in Fig. 6.2, a third observation would allow quadratic behavior (either concave or convex) to be modeled. The linear models capture variation when there is no interaction between regressors. The first-order linear model with two interacting regressor variables could be stated as:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε    (5.21)

The term (β3x1x2) was called the interaction term. How the presence of interaction affects the shape of the family of curves was previously illustrated in Fig. 5.8. The second-order quadratic model with interacting terms (Eq. 5.28) was also presented. The number of runs/trials or data points must be greater than the number of terms in the model to (i) estimate the model parameters, (ii) determine, by hypothesis testing, whether they are statistically significant or not, and (iii) obtain an estimate of the random/pure error. Even with blocking and randomization, it is important to adopt a replication strategy to increase the power of the DOE design.
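As a sketch of how a model of the form of Eq. 5.21 is fit from factorial data, consider a two-factor, two-level design with one response value per corner; the response values are hypothetical. With four runs and four parameters the fit is saturated (exact), so replication or center points would be needed to estimate pure error.

```python
import numpy as np

# Hypothetical 2x2 factorial data with coded levels -1/+1
x1 = np.array([-1.0,  1.0, -1.0, 1.0])
x2 = np.array([-1.0, -1.0,  1.0, 1.0])
y  = np.array([10.0, 14.0, 18.0, 30.0])

# Design matrix for y = b0 + b1*x1 + b2*x2 + b3*x1*x2 (Eq. 5.21)
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta = [18, 4, 6, 2]: b0 is the mean response; b3 quantifies the interaction
```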

6.3 Basic Concepts

6.3.1 Levels, Discretization, and Experimental Combinations

A full factorial design is one where experiments are done for each and every combination of the factors and their levels. The number of combinations is then given by:

Number of experiments for full factorial = n1 × n2 × ⋯ × nk    (6.1)

where k is the number of factors and ni is the number of levels of factor i. For example, a factorial experiment with three factors involving one two-level factor, a three-level factor, and a four-level factor would have 2 × 3 × 4 = 24 runs or trials. For the special case when all factors have the same number of levels n, the number of experiments = nᵏ.

A widely used experimental design is the 2ᵏ factorial design since it is intuitive and easy to design. It derives its terminology from the fact that only two levels for each of the k factors are presumed, one indicative of the lower level of its range of variation (coded as “−”) and the other representing the higher level (coded as “+”). The factors can be qualitative or continuous; if the latter, they are discretized into categories or levels. Figure 6.3a illustrates how the continuous regressors x1 and x2 are discretized depending on their range of variation into four system states, while Fig. 6.3b depicts how these four observations would appear in a scatter plot should they exhibit no factor interaction (that is why the lines are parallel). For two factors, the number of experiments/trials (without any replication) would be 2² = 4; for three factors, this would be 2³ = 8, and so on. The formalism of coding the low and high levels of the factors as −1 and +1 respectively is most widespread; other ways of coding variables can also be adopted.

A full factorial design involving three factors (k = 3) at two levels each (n = 2) would involve 8 tests. These are shown conceptually in Fig. 6.4 as the 8 corner points of the cube. One could add center points to a two-level design at the center of the cube, and this would provide an estimate of residual error (the variation in the response not captured by the model) and allow one to test the goodness of fit of a linear model.
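The run count of Eq. 6.1 is simply the size of the Cartesian product of the factor levels; a minimal sketch (factor names and level labels are illustrative):

```python
from itertools import product

# Factors and their levels for the 2 x 3 x 4 example in the text
factors = {
    "A": ["-", "+"],               # two-level factor
    "B": ["low", "mid", "high"],   # three-level factor
    "C": [1, 2, 3, 4],             # four-level factor
}
runs = list(product(*factors.values()))   # one run per level combination

# Special case: 2^k design, e.g., the 8 coded corner points of a 2^3 cube
corners = list(product([-1, 1], repeat=3))
```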
It would also indicate whether quadratic effects are present; however, it would be unable to identify the specific factor causing this behavior. To capture the quadratic effect of individual factors, one

Fig. 6.3 Illustration of how models are built from factorial design data. A 2² factorial design is assumed. (a) Discretization of the range of variation of the regressors x1 and x2 into “low” and “high” ranges, and (b) regression of the system performance data as they appear on a scatter plot (there is no factor interaction since the two lines are shown parallel)

Fig. 6.4 Full factorial design for three factors at two levels (2³ design) involves 8 experiments mapped to each of the corners of the cube. The three factors (A, B, C) are coded as +1 or −1 to denote the high- and low-level settings of the factors. Section 6.4.2 discusses this representation in more detail

needs to add experimental points at the center of each of the 6 surfaces. This strategy is not used in the classic factorial designs but has been adopted in later designs discussed in


Sect. 6.6.4. Designs with more than 3 factors are often referred to as hyper-cube designs and cannot be drawn graphically. Figure 6.5 is a flowchart of some of the simpler cases encountered in practice and discussed in this chapter. One distinguishes between two broad categories of traditional designs:

(a) Full factorial and fractional factorial designs (Sect. 6.4): Full factorial designs are the most conservative of all design types, where it is assumed that all trials can be run randomly by varying one factor/variable at a time. For example, with 3 factors at 3 levels each, the number of combinations is 3³ = 27, while that for four factors is 3⁴ = 81, and so on. Full factorial designs apply to instances when all factors at all their individual levels are deemed equally important and one wishes to compare k treatment/factor means. They allow estimating all possible interactions. Of special importance in this category is the 2ᵏ design, which is limited to each factor having two levels only. It is often used at the early stages of an investigation for factor screening and involves an iterative approach, with each iteration providing incremental insight into factor dominance and into model building. The total number of runs required is much lower than that needed for the full factorial design. The number of experiments required for a full factorial design increases exponentially, and it is rare to adopt such designs. Moreover, not all the factors initially selected may be dominant, thereby warranting that the list of factors be screened. The fractional factorial design is appropriate in such instances. It only requires a subset of the trials of the full factorial design, but then some of the main effects and two-way interactions are confounded and cannot be separated from the effects of other higher-order interactions.

(b) Complete block and Latin squares (Sect. 6.5): Complete block design is used for instances when only the effect of one categorical treatment variable is being investigated over a range of different conditions when one or more categorical nuisance variables are present. The experimental units can be grouped in such a way that complete block designs require fewer experiments than the full factorial and allow the effect of interaction to be modeled. Latin squares are a special type of complete block design requiring an even smaller number of experiments or trials. However, they have limitations: interaction effects cannot be captured, and they are restricted to the case of an equal number of levels for all factors.

Fig. 6.5 Flow chart meant to provide an overview of the applicability of the various traditional DOE methods covered in different sections of this chapter: t-test (Sect. 4.2.3), one-factor ANOVA (Sect. 4.3.1), full factorial (Sect. 6.4.1), two-level 2ᵏ and three-level 3ᵏ designs (Sect. 6.4.2), fractional factorial (Sect. 6.4.4), complete block (Sect. 6.5.1), and Latin and Graeco-Latin squares (Sect. 6.5.2). Response surface design (treated in Sect. 6.6) is akin to a three-level design while requiring fewer experiments; this is not shown. Monte Carlo methods are also not indicated in the flowchart since they are not traditional and are adopted for different types of applications (such as computer simulation experiments)

6.3.2 Blocking

The general strategies or foundational principles adopted by DOE in order to reduce the effect of nuisance factors and random effects are block what factors you can and randomize and replicate (or repeat) what you cannot (Box et al. 1978). The concept of blocking is a form of stratified sampling whereby experimental tests or items in the sample of data are grouped into blocks according to some “matching” criterion so that the similarity of subjects within each block or group is maximized while that from block to block is minimized. Pharmaceutical companies wishing to test the effectiveness of a new drug adopt the above concept extensively. Since different people react differently, grouping of subjects is done according to some criteria (such as age, gender, body fat percentage, etc.). Such blocking would result in more uniformity among groups. Subsequently, a random administration of the drugs to half of the people within each group/block with a placebo to the other half would constitute randomization. Thus, any differences between each block taken separately would be more pronounced than if randomization was done without blocking.
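The drug-trial illustration above can be sketched in a few lines; the block criteria, block sizes, and treatment labels are hypothetical:

```python
import random

treatments = ["drug", "placebo"]
blocks = {"age < 40": 6, "age >= 40": 6}   # subjects per block (hypothetical)

random.seed(0)   # for a reproducible assignment
assignment = {}
for block, n in blocks.items():
    # blocking: a balanced assignment within each block, half drug, half placebo
    labels = treatments * (n // len(treatments))
    # randomization: the order of administration within the block is shuffled
    random.shuffle(labels)
    assignment[block] = labels
```

Because each block receives both treatments in equal numbers, differences within a block reflect the treatment rather than the blocked nuisance factor (age, in this sketch).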

Unrestricted and Restricted Randomization

After the experimental design is formulated, the sequence of the tests, i.e., the selection of the combinations of different levels and different factors, should be done in a random manner. This randomization would reduce the effect of random or extraneous factors beyond the control of the experimenter or arising from inherent experimental bias on the results of the statistical analysis. The simplest design is the randomized unrestricted block design, which involves selecting at random the combinations of the factors and levels under which to perform the experiments. This type of design, if done naively, is not very efficient and may require an unnecessarily large number of experiments to be performed. The concept is illustrated with a simple example, from the agricultural area from which DOE emerged. Say, one wishes to evaluate the yield of four newly developed varieties of wheat (labeled x1, x2, x3, x4). Since the yield is affected in addition by regional climatic and soil differences, one would like to perform the evaluation at four different locations. If one had four plots of land at each


Table 6.3 Example of unrestricted randomized block design (one of several possibilities) for one factor of interest at levels x1, x2, x3, x4 and one nuisance variable (Regions 1, 2, 3, 4)

Region 1: x1  x3  x1  x2
Region 2: x1  x2  x3  x1
Region 3: x2  x1  x3  x3
Region 4: x3  x4  x2  x4

Design of Physical and Simulation Experiments

Table 6.5 Standard method of assembling test results for a balanced (3 × 2) design with two replication levels

                     Factor A
                     Level 1   Level 2   Level 3   Average
Factor B  Level 1    10, 14    23, 21    31, 27    21
          Level 2    18, 14    16, 20    21, 25    19
          Average    14        20        26        20

This is not an efficient design since x1 appears twice under Region 1 and not at all in Region 4 (footnote to Table 6.3).

Table 6.4 Example of restricted randomized block design (one of several possibilities) for the same example as that of Table 6.3

Region 1: x1  x2  x3  x4
Region 2: x2  x3  x4  x1
Region 3: x3  x4  x1  x2
Region 4: x4  x1  x2  x3

individual region (i.e., 16 experimental units in total), all the tests could be completed over one period or cycle. Consider the more realistic case, when unfortunately, only one plot of land or field station is available at each of the four different geographic regions, i.e., only four experimental units in all. The total time duration of the evaluation process due to this limitation would now require four time periods. This simple case illustrates the importance of designing one’s DOE strategy keeping in mind physical constraints such as the number of experimental units available and the total time duration within which to complete the entire evaluation. The simplest way of assigning which station will be planted with which variety of wheat is to do so randomly; one such result (among many possible ones) is shown in Table 6.3. Such an unrestricted randomization leads to needless replication (e.g., wheat type x1 is tested twice in Region 1 and not at all in Region 4) and is not very efficient. Since the intention is to reduce variability in the uncontrolled variable, in this case the “region” variable, one can insist that each variety of wheat be tested in each region. There are again several possibilities, with one being shown in Table 6.4. This example illustrates the principle of restricted randomization by blocking the effect of the uncontrollable factor. This aspect is further discussed in Sect. 6.5.2. Note that the intent of this investigation is to determine the effect of wheat variety on total yield. The location of the field or station is a “nuisance” variable, but in this case can be controlled by suitable blocking. However, there may be other disturbances which are uncontrollable (say, excessive rainfall in one of the regions during the test) and even worse some of the disturbances may be unknown. Such effects can be partially compensated for by replication, i.e., repeating the tests more than once for each combination.
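The restricted randomization illustrated by Table 6.4 is easy to generate programmatically. The short sketch below (a hypothetical helper `restricted_block_design`, not code from the text) enforces only the within-block constraint discussed above, namely that every wheat variety appears exactly once in each region, while leaving the order within each region random:

```python
import random

def restricted_block_design(treatments, n_blocks, seed=None):
    """Randomized block design with restricted randomization:
    every treatment appears exactly once within each block,
    but in a random order (cf. Table 6.4)."""
    rng = random.Random(seed)
    design = []
    for _ in range(n_blocks):
        row = list(treatments)   # each block (region) tests all treatments
        rng.shuffle(row)         # random sequence within the block
        design.append(row)
    return design

# Four wheat varieties evaluated once in each of four regions
design = restricted_block_design(["x1", "x2", "x3", "x4"], 4, seed=1)
```

Unlike the unrestricted design of Table 6.3, no variety can be replicated or omitted within a region; a Latin-square layout (Sect. 6.5) would additionally balance the columns (time periods).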

6.4

Factorial Designs

6.4.1

Full Factorial Design

Consider two factors (labeled A and B) that are to be studied at a and b levels, respectively. This is often referred to as an (a × b) design, and the standard manner of representing the results of the test is by assembling them as shown in Table 6.5. Each combination of factor and level can be tested more than once to minimize the effect of random errors, and this is called replication. Though more tests are done, replication reduces experimental errors introduced by extraneous factors not explicitly controlled during the experiments that can bias the results. Often, for mathematical convenience, each combination is tested at the same replication level, and this is called a balanced design. Thus, Table 6.5 is an example of a (3 × 2) balanced design with replication r = 2.

The above terms are perhaps better understood in the context of regression analysis (treated in Chap. 5). Let Z be the response variable which is linear in the regressor variable X, and a model needs to be identified. Further, say, another variable Y is known to influence Z, which may corrupt the sought-after relation. Selecting three specific values of X is akin to selecting three levels for the factor X (say, x1, x2, x3). The nuisance effect of variable Y can be "blocked" by performing the tests at pre-selected fixed levels or values of Y (say, y1 and y2). The corresponding scatter plot is shown in Fig. 6.6. Repeat testing at each of the six combinations in order to reduce experimental errors is akin to replication; in this example, replication r = 3. Finally, if the 18 tests are performed in random sequence, the experimental design would qualify as a full factorial random design.

The averages shown in Table 6.5 correspond to those of either the associated row or the associated column. Thus, the average of the first column, i.e., {10, 14, 18, 14}, is shown as 14, and so on. Plots of the average response versus the levels of a factor yield a graph which depicts the trend, called the main effect of the factor. Thus, Fig. 6.7a suggests that the average response tends to increase linearly as factor A changes from A1 to A3, while that of factor B decreases a


little as factor B changes from B1 to B2. The effect of the factors on the response may not be purely additive, and an interaction or multiplicative term (see Eq. 5.25) may have to be included as well. In such cases, the two factors are said to interact with each other. Whether this interaction effect is statistically significant or not can be determined from the results shown in Table 6.6. The effect of going from A1 to A3 is 17 under B1 and only 7 under B2. This suggests interaction effects. A simpler and more direct approach is to graph the two-factor interaction plot as shown in Fig. 6.7b. Since the lines are not parallel (in this case they cross each other), one would infer interaction between the two factors. However, in many instances, such plots are not conclusive enough, and one needs to perform statistical tests to determine whether the main or the interaction effects are significant or not (illustrated below). Figure 6.8 shows the type of interaction plots one would obtain when the interaction effects are not at all significant.

ANOVA decompositions allow breaking up the observed total sum of squares variation (SST) into its various contributing causes (the one-factor ANOVA was described in Sect. 4.3.1). For a two-factor ANOVA decomposition (Devore and Farnum 2005):

SST = SSA + SSB + SS(AB) + SSE    (6.2a)

where the observed total sum of squares:

SST = Σ_{i=1..a} Σ_{j=1..b} Σ_{m=1..r} (y_ijm − <y>)² = (stdev)² · (abr − 1)    (6.2b)

Fig. 6.6 Correspondence between block design approach and multiple regression analysis

Fig. 6.7 Plots for the (3 × 2) balanced factorial design. (a) Main effects of factors A and B with mean and 95% intervals (data from Table 6.5). (b) Two-factor interaction plot. (Data from Table 6.6)

Table 6.6 Interaction effect calculations for the data in Table 6.5

Effect of changing A (B fixed at B1)     Effect of changing A (B fixed at B2)
A1 and B1: (10 + 14)/2 = 12              A1 and B2: (18 + 14)/2 = 16
A2 and B1: (23 + 21)/2 = 22              A2 and B2: (16 + 20)/2 = 18
A3 and B1: (31 + 27)/2 = 29              A3 and B2: (21 + 25)/2 = 23
Change: 29 − 12 = 17                     Change: 23 − 16 = 7


sum of squares associated with factor A:

SSA = br · Σ_{i=1..a} (Ai − <y>)²    (6.2c)

sum of squares associated with factor B:

SSB = ar · Σ_{j=1..b} (Bj − <y>)²    (6.2d)

error or residual sum of squares:

SSE = Σ_{i=1..a} Σ_{j=1..b} Σ_{m=1..r} (y_ijm − y̅_ij)²    (6.2e)

with
y_ijm = observation under the mth replication when A is at level i and B is at level j
a = number of levels of factor A
b = number of levels of factor B
r = number of replications per cell
Ai = average of all response values at the ith level of factor A
Bj = average of all response values at the jth level of factor B
y̅_ij = average of y for each cell (i.e., across replications)
<y> = grand average of y values
i = 1, . . ., a is the index for levels of factor A
j = 1, . . ., b is the index for levels of factor B
m = 1, . . ., r is the index for replicates.

The sum of squares associated with the AB interaction is SS(AB), and this is deduced from Eq. 6.2a since all other quantities can be calculated. A linear statistical model, referred to as a random effects model, between the response and the two factors which includes the interaction term between factors A and B can be deduced. More specifically, this is called a nonadditive two-factor model (it is nonadditive because the interaction term is present). It assumes the following form, given that one starts with the grand average and then adds the individual effects of the factors, the interaction terms, and the noise or error term:

y_ij = <y> + α_i + β_j + (αβ)_ij + ε_ij    (6.3)

where
α_i represents the main effect of factor A at the ith level = Ai − <y>, with Σ_{i=1..a} α_i = 0
β_j represents the main effect of factor B at the jth level = Bj − <y>, with Σ_{j=1..b} β_j = 0
(αβ)_ij represents the interaction between factors A and B = y̅_ij − (<y> + α_i + β_j), with Σ_{j=1..b} Σ_{i=1..a} (αβ)_ij = 0

and ε_ij is the error (or residual) term, assumed uncorrelated with mean zero and variance σ² = MSE.

Fig. 6.8 An example of a two-factor interaction plot when the factors have no interaction

The analysis of variance is done as described earlier, but care must be taken to use the correct degrees of freedom to calculate the mean squares (refer to Table 6.7). The analysis of the variance model (Eq. 6.3) can be viewed as a special case of multiple linear regression (or more specifically to one with indicator variables—see Sect. 5.7.3). This concept is illustrated in Example 6.4.1.

Table 6.7 Computational procedure for a two-factor ANOVA design

Source of variation  Sum of squares  Degrees of freedom  Mean square                       Computed F statistic  Degrees of freedom for p-value
Factor A             SSA             a − 1               MSA = SSA/(a − 1)                 MSA/MSE               a − 1, ab(r − 1)
Factor B             SSB             b − 1               MSB = SSB/(b − 1)                 MSB/MSE               b − 1, ab(r − 1)
AB interaction       SS(AB)          (a − 1)(b − 1)      MS(AB) = SS(AB)/[(a − 1)(b − 1)]  MS(AB)/MSE            (a − 1)(b − 1), ab(r − 1)
Error                SSE             ab(r − 1)           MSE = SSE/[ab(r − 1)]             –                     –
Total variation      SST             abr − 1             –                                 –                     –


Example 6.4.1 Two-factor ANOVA analysis and random effects model fitting
Using the data from Table 6.5, determine whether the main effect of factor A, the main effect of factor B, and the interaction effect AB are statistically significant at α = 0.05. Subsequently, identify the random effects model.

It is recommended that one start by generating the treatment and effect plots, as shown in Fig. 6.7a, b. The response increases with the increasing level of factor A while it decreases a little with factor B. The effect of factor A on the response looks more pronounced.

First, using all 12 observations, one computes the grand average <y> = 20 and the standard deviation stdev = 6.015. Then, following Eqs. 6.2b-e:

SST = stdev² · (abr − 1) = 6.015² · [(3)(2)(2) − 1] = 398
SSA = (2)(2) · [(14 − 20)² + (20 − 20)² + (26 − 20)²] = 288
SSB = (3)(2) · [(21 − 20)² + (19 − 20)²] = 12
SSE = [(10 − 12)² + (14 − 12)² + (18 − 16)² + (14 − 16)² + (23 − 22)² + (21 − 22)² + (16 − 18)² + (20 − 18)² + (31 − 29)² + (27 − 29)² + (21 − 23)² + (25 − 23)²] = 42

Then, from Eq. 6.2a:

SS(AB) = SST − SSA − SSB − SSE = 398 − 288 − 12 − 42 = 56

Next, the expressions shown in Table 6.7 result in:

MSA = SSA/(a − 1) = 288/(3 − 1) = 144
MSB = SSB/(b − 1) = 12/(2 − 1) = 12
MS(AB) = SS(AB)/[(a − 1)(b − 1)] = 56/[(2)(1)] = 28
MSE = SSE/[ab(r − 1)] = 42/[(3)(2)(1)] = 7

The statistical significance of the factors can now be evaluated by computing the F-values and comparing them with the corresponding critical values.

• Factor A: F-value = MSA/MSE = 144/7 = 20.57. Since the critical F-value for degrees of freedom (2, 6) at the 0.05 significance level is Fc(2, 6) = 5.14, and because the calculated F > Fc, one concludes that this factor is indeed significant at the 95% confidence level (CL).
• Factor B: F-value = MSB/MSE = 12/7 = 1.71. Since Fc(1, 6) at the 0.05 significance level = 5.99, this factor is not significant.
• Factor AB: F-value = MS(AB)/MSE = 28/7 = 4. Since Fc(2, 6) at the 0.05 significance level = 5.14, this factor is not significant.

These results are also assembled in Table 6.8 for easier comprehension.

The use of Eq. 6.3 can also be illustrated in terms of this example. The main effects of A and B are given by the differences between the marginal averages and the grand average = 20 (see Table 6.5):

α1 = (14 − 20) = −6;  α2 = (20 − 20) = 0;  α3 = (26 − 20) = 6
β1 = (21 − 20) = 1;   β2 = (19 − 20) = −1

and those of the interaction terms by (refer to Table 6.6):

(αβ)11 = 12 − (20 − 6 + 1) = −3;  (αβ)21 = 22 − (20 + 0 + 1) = 1;   (αβ)31 = 29 − (20 + 6 + 1) = 2
(αβ)12 = 16 − (20 − 6 − 1) = 3;   (αβ)22 = 18 − (20 + 0 − 1) = −1;  (αβ)32 = 23 − (20 + 6 − 1) = −2

Finally, following Eq. 6.3, the random effects model can be expressed as:

Table 6.8 Results of the ANOVA analysis for Example 6.4.1

Source              Sum of squares  d.f.  Mean square  F-ratio  P-value
Main effects
A: Factor A         288.0           2     144.0        20.57    0.0021
B: Factor B         12.0            1     12.0         1.71     0.2383
Interactions
AB                  56.0            2     28.0         4.00     0.0787
Residual            42.0            6     7.0          –        –
Total (corrected)   398.0           11    –            –        –

All F-ratios are based on the residual mean square error


y_ij = 20 + {−6, 0, 6}_i + {1, −1}_j + {−3, 1, 2, 3, −1, −2}_ij    (6.4a)

with i = 1, 2, 3 and j = 1, 2.

For example, the cell corresponding to (A1, B1) has a mean value of 12, which is predicted by the above model as y11 = 20 − 6 + 1 − 3 = 12, and so on. Finally, the prediction error of the model has a variance σ² = MSE = 7.

Recasting the above model (Eq. 6.4a) as a regression model with indicator variables may be insightful (though cumbersome) to those more familiar with regression analysis methods:

y_ij = 20 + (−6)I1 + (0)I2 + (6)I3 + (1)J1 + (−1)J2 + (−3)I1J1 + (3)I1J2 + (1)I2J1 + (−1)I2J2 + (2)I3J1 + (−2)I3J2    (6.4b)

where Ii and Jj are indicator variables corresponding to the α and β terms.
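The two-factor ANOVA decomposition of Example 6.4.1 can be verified numerically. The following sketch (plain Python written for this illustration, not code from the text) computes the sums of squares of Eqs. 6.2a-e and the F-values for the Table 6.5 data:

```python
# Replicates from Table 6.5 indexed by (level of A, level of B)
data = {(1, 1): [10, 14], (2, 1): [23, 21], (3, 1): [31, 27],
        (1, 2): [18, 14], (2, 2): [16, 20], (3, 2): [21, 25]}
a, b, r = 3, 2, 2
ys = [y for cell in data.values() for y in cell]
grand = sum(ys) / len(ys)                       # grand average <y> = 20

# Eq. 6.2b: total sum of squares
SST = sum((y - grand) ** 2 for y in ys)
# marginal averages for each level of A and of B
A_avg = {i: sum(sum(data[(i, j)]) for j in (1, 2)) / (b * r) for i in (1, 2, 3)}
B_avg = {j: sum(sum(data[(i, j)]) for i in (1, 2, 3)) / (a * r) for j in (1, 2)}
SSA = b * r * sum((v - grand) ** 2 for v in A_avg.values())   # Eq. 6.2c
SSB = a * r * sum((v - grand) ** 2 for v in B_avg.values())   # Eq. 6.2d
SSE = sum((y - sum(cell) / r) ** 2                            # Eq. 6.2e
          for cell in data.values() for y in cell)
SSAB = SST - SSA - SSB - SSE                                  # Eq. 6.2a

MSE = SSE / (a * b * (r - 1))
F_A = (SSA / (a - 1)) / MSE          # compare with Fc(2, 6) = 5.14
F_B = (SSB / (b - 1)) / MSE          # compare with Fc(1, 6) = 5.99
F_AB = (SSAB / ((a - 1) * (b - 1))) / MSE
```

Running this reproduces Table 6.8: SST = 398, SSA = 288, SSB = 12, SS(AB) = 56, SSE = 42, and F-values of about 20.57, 1.71, and 4.0.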

6.4.2

2k Factorial Designs

The above treatment of full factorial designs can lead to a prohibitive number of runs when numerous levels need to be considered. As pointed out by Box et al. (1978), it is wise to design a DOE investigation in stages, with each successive iteration providing incremental insight into influential factors, type of interaction, etc., while suggesting subsequent investigations. Factorial designs, primarily 2k and 3k, are of great value at the early stages of an investigation, where many possible factors are investigated with the intention of either narrowing down their number (screening) or getting a preliminary understanding of the mathematical relationship between factors and the response variable. They are, thus, viewed as a logical lead-in to the response surface method discussed in Sect. 6.6. The associated mathematics and interpretation of 2k designs are simple and can provide insights into the framing of more sophisticated and complete experimental designs called sequential designs, which allow for more precise parameter estimation (a practical example is given in Sect. 10.4.2). They are popular in R&D of products and processes and are used extensively. They can also be used during computer simulation experiments; see, for example, Hou et al. (1996) for evaluating the performance of building energy systems (discussed in Sect. 6.7.4).

Figure 6.4 depicts the full factorial design for three factors at two levels (a 23 design), which involves 8 experiments mapped to the corners of a cube. The three factors (A, B, C) are coded as +1 or −1 to denote the high- and low-level settings of the factors. Table 6.9 depicts a quick and easy way of setting up a two-level three-factor design following the standard form suggested by Yates. Notice that the last-but-one column has four (−) followed by four (+), the last-but-two column successive pairs of (−) and (+), and the second column alternating (−) and (+). The Yates algorithm is easily extended to a higher number of factors. However, the sequence in which the runs are to be performed should be randomized; a good way is to simply sample the set of trials {1, . . ., 8} in a random fashion without replacement. The approach can be modified to treat the case of parameter interaction: Table 6.9 is simply extended by including separate columns for the interaction terms, as shown in Table 6.10. The product of any two columns of the factors yields the column for the effect of the corresponding interaction term (e.g., the interaction of A and B is denoted by AB). The appropriate sign for the interactions is determined by multiplying the signs of each of the two corresponding terms. For example, AB for trial 1 would be coded as (−)(−) = (+), and so on.[1] Note that every column has an equal number of (−) and (+) signs. The orthogonality property (discussed in the next section) relates to the fact that the sum of the product of the signs in any two columns is zero.

Table 6.9 The standard form (suggested by Yates) for setting up the two-level three-factor (or 23) design

Trial   A   B   C   Response
1       −   −   −   y1
2       +   −   −   y2
3       −   +   −   y3
4       +   +   −   y4
5       −   −   +   y5
6       +   −   +   y6
7       −   +   +   y7
8       +   +   +   y8

Table 6.10 The standard form of the two-level three-factor (or 23) design with interactions

Trial   A   B   C   AB   AC   BC   ABC   Response
1       −   −   −   +    +    +    −     y1
2       +   −   −   −    −    +    +     y2
3       −   +   −   −    +    −    +     y3
4       +   +   −   +    −    −    −     y4
5       −   −   +   +    −    −    +     y5
6       +   −   +   −    +    −    −     y6
7       −   +   +   −    −    +    −     y7
8       +   +   +   +    +    +    +     y8

[1] The statistical basis of this simple coding process is given by Box et al. (1978).
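The Yates standard order and the interaction columns of Table 6.10 can be generated mechanically. A minimal sketch (assuming the usual −1/+1 coding; not code from the text):

```python
from itertools import product

# Yates standard order for a 2^3 design: factor A alternates fastest.
# product() varies its LAST element fastest, so each tuple is reversed.
runs = [tuple(reversed(t)) for t in product((-1, 1), repeat=3)]

# Append the interaction columns AB, AC, BC, ABC as elementwise products
table = [(A, B, C, A * B, A * C, B * C, A * B * C) for A, B, C in runs]

# Orthogonality: every column is balanced (sums to zero) and any two
# distinct columns have a zero dot product (sum of products of signs).
cols = list(zip(*table))
balanced = all(sum(col) == 0 for col in cols)
dots = [sum(u * v for u, v in zip(cols[p], cols[q]))
        for p in range(len(cols)) for q in range(p + 1, len(cols))]
```

The first rows come out as (−1, −1, −1), (+1, −1, −1), (−1, +1, −1), . . ., matching Table 6.9, and every pairwise dot product is zero, which is precisely the orthogonality property exploited in Sect. 6.4.3.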


Table 6.11 Response table representation for the 23 design with interactions (omitting the ABC term) generated by expanding Table 6.10

Column  Responses included     Average
A+      y2, y4, y6, y8         (A+)
A−      y1, y3, y5, y7         (A−)
B+      y3, y4, y7, y8         (B+)
B−      y1, y2, y5, y6         (B−)
C+      y5, y6, y7, y8         (C+)
C−      y1, y2, y3, y4         (C−)
AB+     y1, y4, y5, y8         (AB+)
AB−     y2, y3, y6, y7         (AB−)
AC+     y1, y3, y6, y8         (AC+)
AC−     y2, y4, y5, y7         (AC−)
BC+     y1, y2, y7, y8         (BC+)
BC−     y3, y4, y5, y6         (BC−)

Each column contains four responses; its average is the column sum divided by 4, and the effect associated with a factor or interaction is the difference between its (+) and (−) column averages, e.g., (A+) − (A−) for factor A.

The standard form shown in Table 6.10 can be rewritten as in Table 6.11 for the 23 design with interactions by expanding each of the four interaction columns into their (+) and (−) columns respectively. For example, AB in Table 6.10 is (+) for trials 1, 4, 5, and 8, and these are listed under the AB+ column in Table 6.11. This is referred to as the response table form and is advantageous in that it allows the analysis to be done in a clear and modular manner. The physical interpretation of the measure of interaction AB is that it is the difference between the average change in the response with factor A and that with factor B. Similarly, AB+ denotes the average effect of A on the response variable when B is held fixed (or blocked) at the B+ level. On the other hand, AB− denotes the effect of A when B is held fixed at the lower level B−. The main effect of A is simply:

Main effect of A = (A+) − (A−) = ¼ · [(y2 + y4 + y6 + y8) − (y1 + y3 + y5 + y7)]    (6.5)

where each parenthesized symbol denotes the column average. Statistical textbooks on DOE provide elaborate details of how to obtain estimates of all main and interaction effects when more factors are to be considered, and then how to use statistical procedures such as ANOVA to identify the significant ones.

The main effect of, say, factor C can be determined simply as:

Main effect of C = (C+) − (C−) = (y5 + y6 + y7 + y8)/4 − (y1 + y2 + y3 + y4)/4    (6.6a)

Similarly, the interaction effect of, say, BC can be determined as the average of the B effect when C is held constant at +1 minus the B effect when C is held constant at −1:

Interaction effect of BC = (BC+) − (BC−) = ¼ · [(y1 + y2 + y7 + y8) − (y3 + y4 + y5 + y6)]    (6.6b)

Thus, the individual and interaction effects directly provide a prediction model of the form:

y = b0 + b1·A + b2·B + b3·C + b12·AB + b13·AC + b23·BC + b123·ABC    (6.7a)

where the first terms are the intercept and main effects and the remaining ones are the interaction terms. The intercept term is given by the grand average of all the response values y. This model is analogous to Eq. 5.25, which is one form of the additive multiple linear models discussed in Chap. 5. Note that Eq. 6.7a has eight parameters, and with eight experimental runs the model fit will be perfect with no variance. A measure of the random error can only be deduced if the degrees of freedom (d.f.) > 0, and so replication (i.e., repeats of runs) is necessary. Another option, relevant when interaction effects are known to be negligible, is to adopt a model which includes main effects only:

y = b0 + b1·A + b2·B + b3·C    (6.7b)

In this case, d.f. = 4, and so a measure of the random error of the model can be determined.

Example 6.4.2 Deducing a prediction model for a 23 factorial design
Consider a problem where three factors {A, B, C} are presumed to influence a response variable y. The problem is to specify a DOE design, collect data, ascertain the statistical importance of the factors, and then identify a prediction model. The numerical values of the factors or regressors corresponding to the high and low levels are assembled in Table 6.12. It was decided to use two replicate tests for each of the 8 combinations to enhance accuracy. Thus, 16 runs were performed, and the results are tabulated in the standard form as suggested by Yates (Table 6.9) and shown in Table 6.13.


Table 6.12 Assumed low and high levels for the three factors (Example 6.4.2)

Factor   Low level   High level
A        0.9         1.1
B        1.20        1.30
C        20          30

Table 6.13 Standard table (Example 6.4.2)

Trial   A     B     C    Responses (two replicates)
1       0.9   1.2   20   34, 40
2       1.1   1.2   20   26, 29
3       0.9   1.3   20   33, 35
4       1.1   1.3   20   21, 22
5       0.9   1.2   30   24, 23
6       1.1   1.2   30   23, 22
7       0.9   1.3   30   19, 18
8       1.1   1.3   30   18, 18

Data available electronically on book website

(a) Identify statistically significant terms
This tabular data can be used to create a table similar to Table 6.11, which is left to the reader. Then, the main effects and interaction terms can be calculated following Eqs. 6.6a and 6.6b.

Main effect of factor A:

[1/((2)(4))] · [(26 + 29) + (21 + 22) + (23 + 22) + (18 + 18) − (34 + 40) − (33 + 35) − (24 + 23) − (19 + 18)] = −47/8 = −5.875

while the effect sum of squares SSA = (−47.0)²/16 = 138.063.


Similarly, the main effects of B = −4.625 and C = −9.375, while the interaction effects are AB = −0.625, AC = 5.125, BC = −0.125, and ABC = 0.875. The results of the ANOVA analysis are assembled in Table 6.14. One concludes that the main effects A, B, and C and the interaction effect AC are significant at the 0.01 level. The main effect and interaction effect plots are shown in Figs. 6.9 and 6.10. These plots do suggest that interaction effects are present only for factors A and C since the lines are clearly not parallel.

(b) Identify prediction model
Only four terms, namely A, B, C, and the AC interaction, are found to be statistically significant at the 0.05 level (see Table 6.14). In such a case, the functional form of the prediction model reduces to:

y = b0 + b1·xA + b2·xB + b3·xC + b4·xA·xC    (6.8a)

Substituting the values of the effect estimates determined earlier results in:

y = 25.313 − 2.938·xA − 2.313·xB − 4.688·xC + 2.563·xA·xC    (6.8b)

where coefficient b0 is the mean of all observations. Also, note that the values of the model coefficients are half the values of the main and interaction effects determined in part (a). For example, the main effect of factor A was calculated to be (−5.875), which is twice the (−2.938) coefficient for the xA factor shown in the equation above. The division by 2 is needed because of the way the factors were coded, i.e., the high and low levels, coded as +1 and −1, are separated by 2 units. The performance equation thus determined can be used for predictions. For example, when xA = +1, xB = −1, xC = −1, one gets y = 26.813, which agrees reasonably well with the average of the two replicates performed (26 and 29) despite dropping the insignificant interaction terms.

Table 6.14 Results of the ANOVA analysis

Source              Sum of squares  D.f.  Mean square  F-ratio  p-value
Main effects
Factor A            138.063         1     138.063      41.68    0.0002
Factor B            85.5625         1     85.5625      25.83    0.0010
Factor C            351.563         1     351.563      106.13   0.0000
Interactions
AB                  1.5625          1     1.5625       0.47     0.5116
AC                  105.063         1     105.063      31.72    0.0005
BC                  0.0625          1     0.0625       0.02     0.8941
ABC                 3.063           1     3.063        0.92     0.3640
Residual or error   26.5            8     3.3125       –        –
Total (corrected)   711.438         15    –            –        –

Interaction effects AB, BC, and ABC are not significant (Example 6.4.2a)
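The effect estimates in part (a) follow directly from column contrasts. A short sketch verifying them for the Table 6.13 data (plain Python written for this illustration, not code from the text; the coded ±1 levels follow the Yates order):

```python
# Coded (A, B, C) levels in Yates order and the two replicates per run
levels = [(-1, -1, -1), (1, -1, -1), (-1, 1, -1), (1, 1, -1),
          (-1, -1, 1), (1, -1, 1), (-1, 1, 1), (1, 1, 1)]
reps = [(34, 40), (26, 29), (33, 35), (21, 22),
        (24, 23), (23, 22), (19, 18), (18, 18)]
n = 16

def effect(col):
    """Average response at the +1 setting minus that at the -1 setting."""
    hi = [y for c, rr in zip(col, reps) for y in rr if c == 1]
    lo = [y for c, rr in zip(col, reps) for y in rr if c == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

A = [l[0] for l in levels]
B = [l[1] for l in levels]
C = [l[2] for l in levels]
AC = [a_ * c_ for a_, c_ in zip(A, C)]

eff_A, eff_B, eff_C, eff_AC = effect(A), effect(B), effect(C), effect(AC)
contrast_A = sum(c * y for c, rr in zip(A, reps) for y in rr)
SS_A = contrast_A ** 2 / n                 # effect sum of squares for A

# Prediction from Eq. 6.8b: grand mean plus half of each retained effect
grand = sum(y for rr in reps for y in rr) / n
pred = grand + eff_A/2 * 1 + eff_B/2 * (-1) + eff_C/2 * (-1) + eff_AC/2 * (1 * -1)
```

This gives effects A = −5.875, B = −4.625, C = −9.375, AC = 5.125, SSA ≈ 138.06, and a prediction of 26.8125 for (xA, xB, xC) = (+1, −1, −1), matching the worked values.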

6.4 Factorial Designs

237

Fig. 6.9 Main effect scatter plots for the three factors for Example 6.4.2

Fig. 6.10 Interaction plots for Example 6.4.2

(c) Comparison with the linear multiple regression approach
The parallel between this approach and regression modeling involving indicator variables was discussed previously (Sects. 5.7.3 and 6.2.2). For example, if one were to perform a multiple regression on the above data with the three regressors coded as −1 and +1 for low and high values respectively, one obtains the results shown in Table 6.15.

Note that the same four variables (A, B, C, and AC interaction) are statistically significant while the model coefficients are identical to the ones determined by the ANOVA analysis. If the regression were to be redone with only these four variables present, the model coefficients would be identical. This is a great advantage with factorial designs in that one could include additional variables incrementally in the model without impacting the model


Table 6.15 Results of performing a multiple linear regression to the same data with regressors coded as +1 and −1 (Example 6.4.2c)

Parameter                     Estimate   Standard error  t-statistic  p-value
Constant                      25.3125    0.455007        55.631       0.0000
Factor A                      −2.9375    0.455007        −6.45595     0.0002
Factor B                      −2.3125    0.455007        −5.08234     0.0010
Factor C                      −4.6875    0.455007        −10.302      0.0000
Factor A*Factor B             −0.3125    0.455007        −0.686803    0.5116
Factor A*Factor C             2.5625     0.455007        5.63178      0.0005
Factor B*Factor C             −0.0625    0.455007        −0.137361    0.8941
Factor A*Factor B*Factor C    0.4375     0.455007        0.961524     0.3644

Table 6.16 Goodness-of-fit statistics of different multiple linear regression models (Example 6.4.2)

Regression model                   Model R2   Adjusted R2   RMSE
With all terms                     0.963      0.930         1.820
With only four significant terms   0.956      0.940        1.684
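The goodness-of-fit comparison in Table 6.16 can be reproduced by exploiting orthogonality: with ±1 coding, each least-squares coefficient is simply the column's contrast sum divided by 16. A sketch (with a hypothetical helper `fit_stats`, not code from the text):

```python
# 16 observations of Example 6.4.2 with coded regressors (Yates order)
levels = [(-1, -1, -1), (1, -1, -1), (-1, 1, -1), (1, 1, -1),
          (-1, -1, 1), (1, -1, 1), (-1, 1, 1), (1, 1, 1)]
reps = [(34, 40), (26, 29), (33, 35), (21, 22),
        (24, 23), (23, 22), (19, 18), (18, 18)]
X, y = [], []
for (A, B, C), rr in zip(levels, reps):
    for obs in rr:
        X.append([1, A, B, C, A * B, A * C, B * C, A * B * C])
        y.append(obs)
n = len(y)

def fit_stats(keep):
    """R2, adjusted R2 and RMSE for the model using the columns in `keep`.
    Orthogonal +/-1 columns make each LS coefficient = contrast sum / n."""
    b = [sum(X[m][j] * y[m] for m in range(n)) / n for j in keep]
    pred = [sum(bj * X[m][j] for bj, j in zip(b, keep)) for m in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - ybar) ** 2 for yi in y)
    p = len(keep)                              # fitted parameters
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1))
    rmse = (sse / (n - p)) ** 0.5
    return r2, adj_r2, rmse

full = fit_stats([0, 1, 2, 3, 4, 5, 6, 7])     # all terms
reduced = fit_stats([0, 1, 2, 3, 5])           # intercept, A, B, C, AC
```

This reproduces Table 6.16: the full model has the higher R2 (about 0.963 vs. 0.956), but the reduced model wins on adjusted R2 (about 0.940 vs. 0.930) and RMSE (about 1.68 vs. 1.82).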

Fig. 6.12 Model residuals versus model predicted values highlight the larger scatter present at higher values indicative of non-additive errors (Example 6.4.2)

Fig. 6.11 Observed versus predicted values for the regression model indicate larger scatter at high values (Example 6.4.2)

coefficients of variables already identified. Why this is so is explained in Sect. 6.4.3. Table 6.16 assembles pertinent goodness-of-fit indices for the complete model and the model with the four significant regressors only. Note that while the R2 value of the former is higher (a misleading statistic to consider when dealing with multivariate model building), the adjusted R2 and the RMSE of the reduced model are superior. Finally, Figs. 6.11 and 6.12 are model predicted versus observed plots, which allow one to ascertain how well the model has fared; in this case, there seems to be a larger scatter at higher values, indicative of non-additive errors. This suggests that a linear additive model may not be the best choice, and the analyst may undertake further refinements if time permits. In summary, DOE involves the complete reasoning process of defining the structural framework, i.e., prescribing the exact manner in which samples for testing need to be selected, and the conditions and sequence under which the testing needs to be performed under specific restrictions

imposed by space, time and nature of the process (Mandel 1964). The applications of DOE have expanded to the area of model building as well. It is now used to identify which subsets among several possible variables influence the response variable, and to determine a quantitative relationship between them.

6.4.3

Concept of Orthogonality

An important concept in DOE is orthogonality by which it is implied that trials should be framed such that the data matrix X2 results in (XTX) = -1 where XT is the transpose of X. 3 In such a case, the off-diagonal terms of the matrix (XTX) will be zero, i.e., the regressors are uncorrelated. This would lead to the best designs since it would minimize the variance of the regression coefficients. For example, consider Table 6.10 where the standard form for the two-level threefactor design is shown. Replacing low and high values (i.e., 2

Refer to Sect. 5.4.2 for refresher. Recall from basic geometry that two straight lines are perpendicular when the product of their slopes is equal to -1. Orthogonality is an extension of this concept to multiple dimensions. 3

6.4 Factorial Designs

239

(- and +) by -1 and +1, and noting that an extra column of 1s needs to be introduced to take care of the constant term in the model (see Eq. 5.31), results in the regressor matrix being defined by:

          1  -1  -1  -1
          1  +1  -1  -1
          1  -1  +1  -1
    X  =  1  +1  +1  -1                                    (6.9)
          1  -1  -1  +1
          1  +1  -1  +1
          1  -1  +1  +1
          1  +1  +1  +1

Example 6.4.3 (from Beck and Arnold (1977) by permission of Beck) Matrix approach to inferring a prediction model for a 2^3 design

This example illustrates the analysis procedure for a complete 2^k factorial design with three factors. The model, assuming a linear form, is given by Eq. 5.28 and includes main and interaction effects. Denoting the three factors by x1, x2, and x3, the regressor matrix X will have four main effect parameters (the intercept term is the first column) as well as the four interaction terms, as shown in Fig. 6.13. For example, the 6th column (x1x3) is the product of the 2nd (x1) and 4th (x3) columns, and so on. Let us assume that a DOE has yielded the following eight values for the response variable:

    Y^T = [49  62  44  58  42  73  35  69]                 (6.10)

Fig. 6.13 The coded regression matrix with four main and four interaction parameters with eight experiments. The first four columns hold the intercept and main effects, the last four the interaction effects:

           1   x1   x2   x3   x1x2  x1x3  x2x3  x1x2x3
           1   -1   -1   -1    +1    +1    +1    -1
           1   +1   -1   -1    -1    -1    +1    +1
           1   -1   +1   -1    -1    +1    -1    +1
    X  =   1   +1   +1   -1    +1    -1    -1    -1
           1   -1   -1   +1    +1    -1    -1    +1
           1   +1   -1   +1    -1    +1    -1    -1
           1   -1   +1   +1    -1    -1    +1    -1
           1   +1   +1   +1    +1    +1    +1    +1

The reader can verify that the off-diagonal terms of the matrix (X^T X) are indeed zero. All n^k factorial designs are thus orthogonal, i.e., (X^T X)^(-1) is a diagonal matrix with nonzero diagonal components. This leads to the soundest parameter estimation (as discussed in Sect. 9.2.3). Another benefit of orthogonal designs is that parameters of regressors already identified remain unchanged as additional regressors are added to the model, thereby allowing the model to be developed incrementally. Thus, the effect of each term of the model can be examined independently. These are two great benefits when factorial designs are adopted for model identification.

The intention is to identify a parsimonious model, i.e., one in which only the statistically significant terms are retained in the model given by Eq. 5.28. Since (X^T X) = 8 I, its inverse is (1/8) I, and the (X^T Y) terms can be deduced by taking the sums of the yi values multiplied by either (+1) or (-1) as indicated in X^T. The coefficient b0 = 54 (the average of all eight values of y), while b1, following Eq. 6.5, is: b1 = [(62 + 58 + 73 + 69) - (49 + 44 + 42 + 35)]/(4 × 2) = 11.5, and so on. The resulting model is:

    yi = 54 + 11.5 x1i - 2.5 x2i + 0.75 x3i + 0.5 x1i x2i
         + 4.75 x1i x3i - 0.25 x2i x3i + 0.25 x1i x2i x3i          (6.11)

With eight parameters and eight observations (and no replication), the model fit will be perfect with zero degrees of freedom; this is referred to as a saturated model. This is not a prudent situation since the model variance cannot be computed, nor can the p-values of the various terms be inferred. Had a replication design been adopted, the variance in the model could have been conveniently estimated and some measure of the goodness-of-fit of the model deduced (as in Example 6.4.1). In this case, the simplest recourse is to drop one of the terms from the model (say, the x1x2x3 interaction term) and then perform the ANOVA analysis.
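The matrix calculation above can be reproduced numerically. The following sketch (using numpy; the code and variable names are illustrative, not from the book) builds the coded 2^3 regressor matrix in standard order and recovers the coefficients of Eq. 6.11 from b = (X^T X)^(-1) X^T y = X^T y / 8:

```python
import numpy as np

# Coded levels of the 2^3 design in standard order (x1 varies fastest)
x1 = np.tile([-1, 1], 4)                  # run levels of factor 1
x2 = np.tile(np.repeat([-1, 1], 2), 2)    # run levels of factor 2
x3 = np.repeat([-1, 1], 4)                # run levels of factor 3

# Regressor matrix of Fig. 6.13: intercept, main effects, interactions
X = np.column_stack([np.ones(8), x1, x2, x3,
                     x1*x2, x1*x3, x2*x3, x1*x2*x3])
y = np.array([49, 62, 44, 58, 42, 73, 35, 69])

# Orthogonality: X^T X = 8 I, so OLS collapses to b = X^T y / 8
assert np.allclose(X.T @ X, 8 * np.eye(8))
b = X.T @ y / 8
print(b)  # [54.  11.5 -2.5  0.75  0.5  4.75 -0.25  0.25]
```

The printed coefficients reproduce Eq. 6.11 exactly, with no matrix inversion needed beyond dividing by the number of runs.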

240

6

Design of Physical and Simulation Experiments

Table 6.17 Results of the ANOVA analysis (Example 6.4.3)

Source              Sum of squares   D.f.   Mean square   F-ratio   p-value
Main effects
  Factor x1             1058          1        1058         2116     0.0138
  Factor x2             50.0          1        50.0         100      0.0635
  Factor x3             4.50          1        4.50         9.00     0.2050
Interactions
  x1x2                  2.00          1        2.00         4.00     0.2950
  x1x3                  180.5         1        180.5        361      0.0335
  x2x3                  0.50          1        0.50         1.00     0.5000
Residual or error       0.50          1        0.50         –        –
Total (corrected)       1296          7        –            –        –

Because of the orthogonal behavior, the significance of the dropped term can be evaluated at a later stage without affecting the model terms already identified. The effect of individual terms is now investigated in a manner similar to the previous example. The ANOVA analysis shown in Table 6.17 suggests that only the terms x1 and x1x3 are statistically significant at the 0.05 level. However, the p-value for x2 is close, and so it would be advisable to keep this term. The parsimonious model is then directly stated as:

    yi = 54 + 11.5 x1i - 2.5 x2i + 4.75 x1i x3i          (6.12)

Fig. 6.14 Illustration of the differences between (a) full factorial and (b) fractional factorial designs for a 2^3 DOE experiment. Several different combinations of fractional factorial designs are possible; only one such combination is shown

The above example illustrates how data gathered within a DOE design and analyzed following the ANOVA method can yield an efficient functional predictive model of the data. It is left to the reader to repeat the analysis illustrated in Example 6.4.1 where an identical model was obtained by straightforward use of multiple linear regression. Note that orthogonality is maintained only if the analysis is done with the coded variables (-1 and +1), and not with the original ones. Recall that a 2^2 factorial design implies two regressors or factors, each at two levels, say "low" and "high." Since there are only two states, one can only frame a first-order functional model to the data such as Eq. 6.12. Thus, a 2^2 factorial design is inherently constrained to identifying a first-order linear model between the regressors and the response variable. If the mathematical relationship requires higher-order terms, multilevel factorial designs are required (to identify polynomial models such as Eq. 5.23). For example, the 3^k design requires the range of variation of the factors to be aggregated into three levels, such as "low," "medium," and "high." For a situation with three factors (i.e., k = 3), one needs to perform 27 experiments even if no replication tests are considered. This is more than three times the number of tests needed for the 2^3 design. Thus, the additional higher-order insight can only be gained at the expense of a larger number of runs which, for a higher number of factors, may

become prohibitive. In such instances, central composite designs are often advisable since they allow second-order effects to be modeled starting from 2^k designs; this design is discussed in Sect. 6.6.4.
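Because of orthogonality, the sums of squares in Table 6.17 can be recovered directly from the fitted coefficients: for ±1 coded columns with N runs, the sum of squares of a term equals N·b². A short sketch (illustrative Python, not from the book):

```python
# Fitted coefficients from Eq. 6.11 (intercept excluded); the x1x2x3
# term was dropped to free one degree of freedom for the error
effects = {"x1": 11.5, "x2": -2.5, "x3": 0.75,
           "x1x2": 0.5, "x1x3": 4.75, "x2x3": -0.25}
N = 8  # number of runs in the 2^3 design

# For +/-1 coded orthogonal columns, the sum of squares of a term is N*b^2
ss = {name: N * b**2 for name, b in effects.items()}

# The dropped x1x2x3 term supplies the error: SSE = 8 * 0.25^2 = 0.5 on 1 d.f.
mse = N * 0.25**2
f_ratios = {name: s / mse for name, s in ss.items()}
print(ss["x1"], f_ratios["x1"])  # 1058.0 2116.0
```

The computed sums of squares (1058, 50, 4.5, 2, 180.5, 0.5) and F-ratios match the entries of Table 6.17.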

6.4.4 Fractional Factorial Designs

One way of greatly reducing the number of runs, provided interaction effects are known to be negligible, is to adopt fractional factorial designs. The 27 tests needed for a full 3^3 factorial design can be reduced to 9 tests only. Thus, instead of 3^k tests, an incomplete block design would only require 3^(k-1) tests. A graphical interpretation of how a fractional factorial design differs from a full factorial one for a 2^3 instance is illustrated in Fig. 6.14. Three factors are involved (A, B, and C) at two levels each (-1, 1). While 8 test runs are performed corresponding to each of the 8 corners of the cube for the full factorial, only 4 runs are required for the fractional factorial as shown. The Latin squares design, discussed in Sect. 6.5.2, is a type of fractional factorial method. The interested reader can refer to Box et al. (1978) or Montgomery (2017) for a detailed treatment of fractional factorial design methods.


Table 6.18 Machining time (in minutes) for Example 6.5.1

                        Operator
Machine     1       2       3       4       5       6      Average
1         42.5    39.3    39.6    39.9    42.9    43.6     41.300
2         39.8    40.1    40.5    42.3    42.5    43.1     41.383
3         40.2    40.5    41.3    43.4    44.9    45.1     42.567
4         41.3    42.2    43.5    44.2    45.9    42.3     43.233
Average  40.950  40.525  41.225  42.450  44.050  43.525    42.121

Data available electronically on book website

Table 6.19 ANOVA table for Example 6.5.1

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
Machines                  15.92               3                5.31              3.34             0.048
Operators                 42.09               5                8.42
Error                     23.84              15                1.59
Total                     81.86              23                –

6.5 Block Designs

6.5.1 Complete Block Design

Complete block design pertains to the instance when one wishes to investigate the effect of only one primary or treatment factor/variable on the response when other secondary factors (or nuisance variables) are present. The effect of these nuisance variables is minimized/eliminated by blocking. The following example serves to illustrate the concept of randomized complete block design with one nuisance factor. (Since two factors are involved, such problems are often referred to as two-way ANOVA problems.)

Example 6.5.1 (from Walpole et al. (2007) by permission of Pearson Education) Evaluating performance of four machines while blocking effect of operator dexterity

The performance of four different machines M1, M2, M3, and M4 is to be evaluated in terms of the time needed to manufacture a widget. It is decided that the same widget will be made on these machines by six different machinists/operators in a randomized block experiment. The machines are assigned in a random order to each operator. Since dexterity is involved, there will be a difference among the operators in the time needed to machine the widget. Table 6.18 assembles the machining times in minutes after 24 tests have been completed. Here, the machine type is the primary treatment factor, while the nuisance factor is the operator (the intent of the study could have been the reverse). The effect of this nuisance factor is blocked or controlled by the randomized complete block design where all operators use all four machines. The analysis calls for testing the hypothesis at

the 0.05 level of significance that the performance of the machines is identical. Let Factor A correspond to the machine type and B to the operator. Thus a = 4 and b = 6, with replication r = 1. Then, Eq. 6.2a reduces to:

    SST = SSA + SSB + SSE

where

    SSA = (6)[(41.300 - 42.121)^2 + (41.383 - 42.121)^2 + ...] = 15.92
    SSB = (4)[(40.950 - 42.121)^2 + (40.525 - 42.121)^2 + ...] = 42.09

Total variation = SST = (abr - 1)·stdev^2 = (23)(1.8865)^2 = 81.86

Subsequently, SSE = 81.86 - 15.92 - 42.09 = 23.84

The ANOVA table can then be generated as depicted in Table 6.19. The F-statistic = (5.31/1.59) = 3.34 is significant at probability p = 0.048. One would conclude that the performance of the machines cannot be taken to be similar at the 0.05 significance level (though this is a close call and would merit further investigation!). What can one infer about differences in the dexterity of the machinists? As illustrated earlier, graphical display of data can provide useful diagnostic insights in ANOVA types of problems as well. For example, simply plotting the raw observations around each treatment mean can provide a feel for the variability between sample means and within samples. Figure 6.15 depicts all the data as well as the mean variation. One notices that two unusually different values stand out, and it may be wise to go back and study the experimental conditions which produced these results. Without these, the interaction effects seem small.
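The sums of squares above can be checked numerically. A minimal numpy sketch (illustrative, not from the book; the last digit of SSE differs slightly from Table 6.19 because the text carries rounded intermediate values):

```python
import numpy as np

# Machining times of Table 6.18 (rows = machines 1-4, columns = operators 1-6)
y = np.array([[42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
              [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
              [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
              [41.3, 42.2, 43.5, 44.2, 45.9, 42.3]])
a, b = y.shape          # a = 4 machines (treatments), b = 6 operators (blocks)
gm = y.mean()           # grand mean, about 42.121

ssa = b * ((y.mean(axis=1) - gm) ** 2).sum()   # machines
ssb = a * ((y.mean(axis=0) - gm) ** 2).sum()   # operators
sst = ((y - gm) ** 2).sum()
sse = sst - ssa - ssb

msa, mse = ssa / (a - 1), sse / ((a - 1) * (b - 1))
f_machines = msa / mse
print(round(ssa, 2), round(ssb, 2), round(sse, 2), round(f_machines, 2))
# 15.92 42.09 23.85 3.34  (Table 6.19 lists SSE = 23.84 from rounded values)
```

The computed F-statistic of 3.34 on (3, 15) degrees of freedom matches the table.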


Fig. 6.15 Factor mean plots of the two factors with six levels for the operator variable and four for the machine variable


Fig. 6.18 Normal probability plot of the residuals

A random effects model can also be identified. In this case, an additive linear model is perhaps adequate, such as:

    yij = <y> + αi + βj + εij                              (6.13)

Inspection of the residuals can provide diagnostic insights regarding violation of normality and non-uniform variance, akin to regression analysis. Since model predictions are given by:

    ŷij = <y> + (Ai - <y>) + (Bj - <y>) = Ai + Bj - <y>    (6.14a)

the residuals of the (i,j) observation are:

    εij ≡ yij - ŷij = yij - (Ai + Bj - <y>),
        i = 1, ..., 4 and j = 1, ..., 6                    (6.14b)

Fig. 6.16 Scatter plot of the residuals versus the six operators

Fig. 6.17 Scatter plot of residuals versus predicted values

Two different residual plots have been generated. Figures 6.16 and 6.17 reveal that the variance of the errors versus operators and versus model predicted values are fairly random except for two large residuals (as noted earlier). Further, a normal probability plot of the model residuals seems to show some departure from normality, and this issue may need further scrutiny (Fig. 6.18). An implicit and important assumption in the above model design is that the treatment and block effects are additive, i.e., negligible interaction effects. In the context of Example 6.5.1, it means that if, say, Operator 3 is on average 0.5 min faster than Operator 2 on machine 1, the same difference also holds for machines 2, 3, and 4. This pattern would be akin to that depicted in Fig. 6.3 where the mean responses of different blocks differ by the same amount from one treatment to the next. In many experiments, this assumption of additivity does not hold, and the treatment and block effects interact (as illustrated in Fig. 6.7b). For example, Operator 1 may be faster by 0.5 min on the average than Operator 2 when machine 1 is used, but he may be slower by, say, 0.3 min on the average than Operator 2 when machine 2 is used. In such a case, the operators and the machines are said to be interacting.


The above treatment of full factorial designs was limited to one nuisance factor. The treatment can be extended to a greater number of factors, but the analysis gets messier though the extension is quite straightforward; see, for example, Box et al. (1978) or Montgomery (2017).

6.5.2 Latin Squares

For the special case when all factors have the same number of levels, the number of experiments necessary for a complete factorial design which includes all main effects and interactions is n^k, where k is the number of factors and n the number of levels. If certain assumptions are made, this number can be reduced considerably (see Fig. 6.14). Such methods are referred to as fractional factorial designs. The Latin squares approach is one such special design meant for problems: (i) involving three factors with one treatment and two noninteracting nuisance factors (i.e., k = 3), (ii) that allows blocking in two directions, i.e., eliminating two sources of nuisance variability, (iii) where the number of levels for each factor is the same, and (iv) where interaction terms among factors are negligible (i.e., the interaction terms (αβ)ij in the statistical effects model given by Eq. 6.3 are dropped).

This results in a significant reduction in the number of experimental runs, especially when several levels are involved. However, replication is advisable to reduce random error, to estimate the experimental error, and to provide more precise estimates of the parameter values. A Latin square for n levels, denoted by (n × n), is a square of n rows and n columns with each of the n^2 cells containing one specific treatment that appears once, and only once, in each row and column. Consider a three-factor experiment at four different levels each. The number of experiments required for the full factorial, i.e., to map out the entire experimental space, would be 4^3 = 64. For incomplete factorials, the number of experiments reduces to 4^2 = 16.

A Latin square is said to be reduced (also, normalized or in standard form) if both its first row and its first column are in their natural order. The standard manner of specifying a (4 × 4) Latin square design with three factors is shown in Table 6.20a. One of the blocked factors is represented by levels (1, 2, 3, 4) and the other by (I, II, III, IV), with the primary or treatment factor denoted by (A, B, C, D). A randomized design (one of several) is shown in Table 6.20b. Note that the Latin square design shown in Table 6.20b is not unique. The number of possible combinations N grows exponentially with the number of levels n: for n = 3, N = 12; for n = 4, N = 576; and for n = 5, N = 161,280. For 3 factors at 3 levels, each Latin square design needs only 3^2 = 9 experiments as against the 3^3 = 27 required for the full factorial design. Thus, Latin square designs reduce the required number of experiments from n^3 to n^2 (where n is the number of levels), thereby saving cost and time. In general, the fractional factorial design requires n^(k-1) experiments, while the full factorial requires n^k. A simple way of generating Latin square designs for higher values of n is to write the levels in order in the first row, with each subsequent row generated by shifting the sequence of levels one space to the left. Then one needs to randomize (and perhaps include replicates as well) to average out the effect of random influences. Table 6.21 assembles the analysis of variance equations for a Latin square design, which will be illustrated in Example 6.5.2. Latin square designs usually have a small number of error degrees of freedom (e.g., 2 for a 3 × 3 and 6 for a 4 × 4 design), which allows a measure of model variance to be deduced.

In summary, while the randomized block design allows blocking of one source of variation, the Latin square design allows systematic blocking of two sources of variability for problems involving three factors (k = 3 with two nuisance factors). The restrictions of this design are that (i) all 3 factors must have the same number of levels, and (ii) no interaction effects are present. The concept, under the same assumptions as those for the Latin square design, can be extended to problems

Table 6.20 A (4 × 4) Latin square design with three factors and four levels. The treatment factor levels are (A, B, C, D) while those of the two nuisance factors are shown as (1, 2, 3, 4) and (I, II, III, IV). Note that each treatment occurs in every row and column. (a) The standard manner of specifying the design, called the reduced form. (b) One of several possible randomized designs

(a)
         1   2   3   4
    I    A   B   C   D
    II   B   C   D   A
    III  C   D   A   B
    IV   D   A   B   C

(b)
         1   2   3   4
    I    A   B   D   C
    II   D   C   A   B
    III  B   D   C   A
    IV   C   A   B   D
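The cyclic-shift construction of a reduced Latin square described above can be sketched in a few lines (illustrative Python, not from the book):

```python
def latin_square(symbols):
    """Reduced (standard-form) Latin square: the first row holds the
    levels in natural order, and each subsequent row is the previous
    one shifted one place to the left."""
    n = len(symbols)
    return [[symbols[(i + j) % n] for j in range(n)] for i in range(n)]

square = latin_square(["A", "B", "C", "D"])
# Defining property: each treatment appears once per row and per column
assert all(len(set(row)) == len(row) for row in square)
assert all(len(set(col)) == len(col) for col in zip(*square))
print(square[1])  # ['B', 'C', 'D', 'A'], i.e., row II of Table 6.20a
```

For an actual experiment one would then randomize, e.g., by shuffling rows and columns, to average out random influences as noted above.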


Table 6.21 The analysis of variance equations for an (n × n) Latin square design

Source of variation   Sum of squares   Degrees of freedom   Mean square             Computed F statistic
Row                        SSR              n - 1            SSR/(n - 1)             FR = (MSR/MSE)
Column                     SSC              n - 1            SSC/(n - 1)             FC = (MSC/MSE)
Treatment                  SSTr             n - 1            SSTr/(n - 1)            FTr = (MSTr/MSE)
Error                      SSE         (n - 1)(n - 2)        SSE/[(n - 1)(n - 2)]    –
Total                      SST             n^2 - 1           –                       –

with four factors (k = 4) where three sources of variability need to be blocked; this is done using Graeco-Latin square designs (see Box et al. 1978; Montgomery 2017).

Example 6.5.2 Evaluating impact of air filter type on breathing complaints with school vintage and season being nuisance factors (Since three factors are involved, such problems are often referred to as three-way ANOVA problems.)

To reduce breathing-related complaints from students, four different types of air cleaning filters (labeled A, B, C, and D, which are the treatment factors) are being considered for mandatory replacement of existing air filters in all schools in a school district. Since seasonal effects are important, tests are to be performed during each of the four seasons (and corrected for the days when the school is in session in each of these seasons). Further, it is decided that tests should be conducted in four schools representative of different vintages (labeled 1 through 4). Because of the potential for differences in the HVAC systems between old and new schools, it is logical to insist that each filter type be tested at each school during each season of the year. It would have been advisable to have replicates, but that would have increased the duration of the testing period from one year to two years, which was not acceptable to the school board.

(a) Develop a DOE design
This is a three-factor problem with four levels each. The total number of treatment combinations for a completely randomized design would be 4^3 = 64. The selection of the same number of categories for all three criteria of classification can be done following a Latin square design, with the analysis of variance performed using the results of only 16 treatment combinations. One such Latin square is given in Table 6.22. The rows and columns represent the two sources of variation one wishes to control. One notes that in this design, each treatment occurs exactly once in each row and in each column. Such a balanced arrangement allows the effect of the air-cleaning filter to be separated from that of the season variable. Note that if interaction between the sources of variation is present, the Latin square model cannot be used;

Table 6.22 Experimental design (Example 6.5.2)

School vintage   Fall   Winter   Spring   Summer
1                 A       B        C        D
2                 D       A        B        C
3                 C       D        A        B
4                 B       C        D        A

Table 6.23 Data table showing the number of breathing complaints

School vintage    Fall     Winter   Spring   Summer    Average
1                 A 70     B 75     C 68     D 81      73.50
2                 D 66     A 59     B 55     C 63      60.75
3                 C 59     D 66     A 39     B 42      51.50
4                 B 41     C 57     D 39     A 55      48.00
Average           59.00    64.25    50.25    60.25     58.4375

A, B, C, and D are four different types of air filters being evaluated (Example 6.5.2)
Data available electronically on book website

this assessment ought to be made based on previous studies or expert opinion.

(b) Perform an ANOVA analysis
Table 6.23 summarizes the data collected under such an experimental protocol, where the numerical values shown are the number of breathing-related complaints per season, corrected for the number of days when the school is in session and for changes in the student population. Assuming that the various sources of variation do not interact, the objective is to statistically determine whether filter type affects the number of breathing complaints. A secondary objective is to investigate whether any (and, if so, which) of the nuisance factors (school vintage and season) are influential. Generating scatter plots such as those shown in Fig. 6.19 for school vintage and filter type is a logical first step. The standard deviation is stdev = 12.91, while the averages of the four treatments or filter types are:


Fig. 6.19 Scatter plots of number of complaints vs (a) school vintage and (b) filter type

Table 6.24 ANOVA results following the equations shown in Table 6.21 (Example 6.5.2)

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
School vintage            1557.2              3               519.06            11.92             0.006
Season                    417.69              3               139.23             3.20             0.105
Filter type               263.69              3                87.90             2.02             0.213
Error                     261.37              6                43.56             –                –
Total                    2499.94             15                –                 –                –

A = 55.75, B = 53.25, C = 61.75, D = 63.00

In this example, one could make a fair guess based on the between and within variation that filter type is probably not an influential factor on the number of complaints, while school vintage may be. The analysis of variance approach, or ANOVA, is likely to be more convincing because of its statistical rigor. From the probability values in the last column of Table 6.24, it can be concluded that the number of complaints is strongly dependent on school vintage, statistically significant at roughly the 0.10 level for season, and not statistically significant for filter type.
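The ANOVA quantities of Table 6.24 can be reproduced from Tables 6.22 and 6.23 following the equations of Table 6.21. A numpy sketch (illustrative, not from the book):

```python
import numpy as np

# Complaint counts of Table 6.23 (rows = school vintage 1-4, cols = seasons)
y = np.array([[70., 75., 68., 81.],
              [66., 59., 55., 63.],
              [59., 66., 39., 42.],
              [41., 57., 39., 55.]])
# Filter type assigned to each cell per the Latin square of Table 6.22
filt = np.array([["A", "B", "C", "D"],
                 ["D", "A", "B", "C"],
                 ["C", "D", "A", "B"],
                 ["B", "C", "D", "A"]])
n, gm = 4, y.mean()   # grand mean = 58.4375

ss_row = n * ((y.mean(axis=1) - gm) ** 2).sum()                   # school vintage
ss_col = n * ((y.mean(axis=0) - gm) ** 2).sum()                   # season
ss_tr = n * sum((y[filt == f].mean() - gm) ** 2 for f in "ABCD")  # filter type
sst = ((y - gm) ** 2).sum()
sse = sst - ss_row - ss_col - ss_tr          # on (n-1)(n-2) = 6 d.f.

f_row = (ss_row / (n - 1)) / (sse / ((n - 1) * (n - 2)))
print(round(ss_row, 1), round(ss_col, 2), round(ss_tr, 2), round(f_row, 2))
# 1557.2 417.69 263.69 11.92
```

The printed sums of squares and the F-statistic for school vintage match Table 6.24.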

6.6 Response Surface Designs

6.6.1 Applications

Recall that the factorial methods described in the previous section can be applied to either continuous or discrete qualitative/categorical variables with only one variable changed at a time. The 2^k factorial method allows both screening to identify dominant factors and identification of a robust linear predictive model. Historically, these techniques were extended to optimizing a process or product by Box and Wilson in the early 1950s. A special class of mathematical and statistical techniques was developed to identify models and analyze data between a response and a set of continuous treatment variables, with the intent of determining the conditions under which a maximum (or a minimum) of the response variable is obtained when one or more of the variables are simultaneously changed (Box et al. 1978). For example, the optimal mix of two alloys which would result in the product having maximum strength can be deduced by fitting the data from factorial experiments with a model from which the optimum is determined either by calculus or by search methods (described in Chap. 7 under optimization methods). These models, called response surface (RS) models, can be framed as second-order models (sometimes the first order is adequate if the optimum is far from the initial search space) which are linear in the parameters. RS designs involve not just the modeling aspect, but also recommendations on how to perform the sequential search involving several DOE steps. The reader may wonder why most of the DOE models treated in this chapter assume empirical polynomial models. This is for historic reasons: the types of applications which triggered the development of DOE were not understood well enough to adopt mechanistic functional forms. Empirical polynomial models are linear in the parameters but can be nonlinear in their functional form due to interaction terms and higher-order terms in the variables (such as Eq. 6.7a).

6.6.2 Methodology

A typical RS experimental design involves three general phases (screening, optimizing, and confirming) performed with the specific intention of limiting the number of experiments required to achieve a rich data set. This will be illustrated using the following example. The R&D staff of a steel company wants to improve the strength of the metal sheets sold. They have identified a preliminary list of five factors that might impact the strength of their metal sheets: the concentrations of chemical A and chemical B, the annealing temperature, the time to anneal, and the thickness of the sheet casting. The first phase is to run a screening design to identify the main factors influencing the metal sheet strength. Thus, those factors that are not important contributors to the metal sheet strength are eliminated from further study. How to perform such screening tests involving the 2^k factorial design has been discussed in Sect. 6.4.2. It was concluded that the chemical concentrations A and B are the main treatment factors that survive the screening design. To optimize the mechanical strength of the metal sheets, one needs to know the relationship between the strength of the metal sheet and the concentrations of chemicals A and B in the mix; this is done in the second phase, which requires a sequential search process. The following steps are undertaken during the sequential search:

(i) Identify the levels of the amounts of chemicals A and B to study. Three distinct values for each factor are usually necessary to fit a quadratic function, so standard two-level designs are not appropriate for fitting curved surfaces.
(ii) Generate the experimental design using one of several factorial methods.
(iii) Run the experiments.
(iv) Analyze the data using ANOVA to identify the statistical significance of factors.
(v) Draw conclusions and develop a model for the response variable. The quadratic terms in these equations approximate the curvature in the underlying response function. If a maximum or minimum exists inside the design region, the point where that value occurs can be estimated. Unfortunately, this is unlikely to be the case: the approximate model identified is representative of the behavior of the metal in the local design space only, while the global optimum may lie outside the search space.
(vi) Using optimization methods (such as calculus-based methods or search methods such as steepest descent), move in the direction where the overall optimum is likely to lie (refer to Sect. 7.4.2).


(vii) Repeat steps (i) through (vi) until the global optimum is reached. Once the optimum has been identified, the R&D staff would want to confirm that the new, improved metal sheets have higher strength; this is the third phase. They would resort to hypothesis tests involving running experiments to support the alternate hypothesis that the strength of the new, improved metal sheet is greater than the strength of the existing metal sheet. In summary, the goals of the second and third phases of the RS design are to determine and then confirm, with the needed statistical confidence, the optimum levels of chemicals A and B that maximize the metal sheet strength.

6.6.3 First- and Second-Order Models

In most RS problems, the form of the relationship between the response and the regressors is unknown. Consider the case where the yield (Y) of a chemical process is to be maximized with temperature (T) and pressure (P) being the two independent variables (Montgomery 2017). The 3-D plot (called the response surface plot in DOE terminology) is shown in Fig. 6.20, along with its projection onto a 2-D plane, known as a contour plot. The maximum yield is achieved at T = 138 and P = 28, at which the maximum yield is Y = 70. If one did not know the shape of this curve, one simple approach would be to assume a starting point (say, T = 115 and P = 20, as shown) and repeatedly perform experiments in an effort to reach the maximum point. This is akin to a univariate optimization search (see Sect. 7.4), which is not very efficient. In this example involving a chemical process, varying one variable at a time may work because of the symmetry of the RS plot. However, in cases (and this is often so) when the RS plot is asymmetrical or when the search location is far away from the optimum, such a univariate search may erroneously indicate a nonoptimal maximum. A superior manner, and the one adopted in most numerical methods, is the steepest gradient method, which involves adjusting all the variables together (see Sect. 7.4). As shown in Fig. 6.21, if the responses Y at each of the four corners of the square are known by experimentation, a suitable model is identified (in the figure, a linear model is assumed and so the set of lines for different values of Y are parallel). The steepest gradient method involves moving along a direction perpendicular to the sets of lines (indicated by the "steepest descent" direction in the figure) to another point where the next set of experiments ought to be performed.
Repeated use of this testing, modeling, and stepping is likely to lead one close to the sought-after maximum or minimum (provided one is not caught in a local peak or valley or a saddle point).
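One step of this test-model-step cycle can be sketched numerically. The corner responses below are hypothetical (chosen only for illustration; the numpy-based code is not from the book):

```python
import numpy as np

# Hypothetical 2^2 corner responses around the current search center
# (coded units); the numbers are illustrative only
X = np.array([[1., -1., -1.],
              [1.,  1., -1.],
              [1., -1.,  1.],
              [1.,  1.,  1.]])
y = np.array([60., 64., 63., 67.])

# First-order fit y = b0 + b1*x1 + b2*x2; orthogonal design => b = X^T y / 4
b0, b1, b2 = X.T @ y / 4   # b0 is the estimated response at the center

# The steepest-ascent direction (for a maximum) is the gradient (b1, b2)
grad = np.array([b1, b2])
step = grad / np.linalg.norm(grad)   # unit step in coded units
print(b1, b2, np.round(step, 2))     # 2.0 1.5 [0.8 0.6]
```

The next set of experiments would then be centered some multiple of `step` away from the current center, and the fit-and-step cycle repeated.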


Fig. 6.20 A three-dimensional response surface between the response variable (the expected yield) and two regressors (temperature and pressure) with the associated contour plots indicating the optimal value. (From Montgomery 2017 by permission of John Wiley and Sons)

Fig. 6.21 Figure illustrating how the first-order response surface model (RSM) fit to a local region can progressively lead to the global optimum using the steepest descent search method

The following recommendations are noteworthy to minimize the number of experiments to be performed:

(a) During the initial stages of the investigation, a first-order polynomial model in some region of the range of variation of the regressors is usually adequate. Such models have been extensively covered in Chap. 5, with Eq. 5.29 being the linear first-order model form in vector notation. 2^k factorial designs, both full and fractional, are good choices at the preliminary stage of the RS investigation. As stated earlier, due to the benefit of orthogonality, these designs are recommended since they minimize the variance of the regression coefficients.

(b) Once close to the optimal region, polynomial models higher than first order are advised. This could be a second-order polynomial (involving just the main effects) or a higher-order polynomial which also includes quadratic effects and interactions between pairs of factors (two-factor interactions) to account for curvature. Quadratic models are usually sufficient for most engineering applications, though increasing the order of approximation could sometimes further reduce model errors. Of course, it is unlikely that a polynomial model will be a reasonable approximation of the true functional relationship over the entire space of the independent variables, but for a relatively small region they usually work well. Note that rarely would all the terms of the quadratic model be needed; how to identify a parsimonious model has been illustrated in Examples 6.4.2 and 6.4.3.

6.6.4 Central Composite Design and the Concept of Rotation

One must assume 3 levels for the factors in order to fit quadratic models. For a 3^k factorial design with the number of factors k = 3, one needs 27 experiments with no replication, which for k = 4 grows to 81 experiments. Thus, the


number of trials at each iteration point increases exponentially. Hence, 3^k designs become impractical for k > 3. A more efficient design requiring fewer experiments uses the concept of rotation, also referred to as axisymmetry. An experimental design is said to be rotatable if the trials are selected such that they are equidistant from the center. Since the location of the optimum point is unknown, such a design results in equal precision of estimation in all directions. In other words, the variance of the response variable at any point in the regressor space is a function of only the distance of the point from the design center. A central composite design (CCD) contains three components (Berger and Maurer 2002):

(a) A two-level (fractional) factorial design, which estimates the main and two-factor interaction terms;
(b) A "star" or "axial" design, which, in conjunction with the other two components, allows estimation of curvature by allowing quadratic terms to be introduced in the model function;
(c) A set of center points (which are essentially random repeats of the center point), which provides a measure of process stability by reducing model prediction error and allowing one to estimate the error. They provide a check for curvature, i.e., if the response surface is curved, the center points will be lower or higher than predicted by the design points (see Fig. 6.22).

The total number of experimental runs for a CCD with k factors is:

    2^k + 2k + c

where c is the number of center points.

Fig. 6.22 A central composite design (CCD) contains two sets of experiments: a fractional factorial or "cube" portion which serves as a preliminary stage where one can fit a first-order (linear) model, and a group of axial or "star" points that allow estimation of curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. In addition, a certain number of center points are also used to capture inherent random variability in the process or system behavior. (a) CCD for two factors. (b) CCD for three factors

This equation is confirmed by Fig. 6.22. For a two-factor experiment design, the CCD generates 4 factorial points and 4 axial points, i.e., 4 + 4 + 1 = 9 points (assuming only one center point). For a three-factor experiment design, the CCD generates 8 factorial points and 6 axial points, i.e., 8 + 6 + 1 = 15 points. The factorial or "cube" portion and center points (shown as circles in Fig. 6.22) may aid in fitting a first-order (linear) model during the preliminary stage while still providing evidence regarding the importance of a second-order contribution or curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. The star points represent new extreme values (low and high) for each factor in the design. The number of center points for some useful CCDs has also been suggested. Sometimes, more center points than the numbers suggested are introduced; nothing is lost by this except the cost of performing the additional runs. For a two-factor CCD, it is recommended that at least two center points be used, while many researchers routinely are said to use as many as 6–8 points. CCDs are most widely used during RSM experimental design since they inherently satisfy the desirable design properties of orthogonal blocking and rotatability and allow for efficient estimation of the quadratic terms in the second-order model. Central composite designs with two and three factors, along with the manner of coding the levels of the factors at which experimental tests must be conducted, are shown more clearly in Fig. 6.23a, b, respectively. If the distance from the center of the design space to a factorial point is ±1 unit for each factor, the distance from the center of the design space to a star point is ±α with |α| > 1. The precise value of α depends on certain properties desired for the design, like orthogonal blocking, and on the number of

ð6:15Þ

factors involved. To maintain rotatability, the value of α for a CCD is chosen such that:

    α = (n_f)^(1/4)    (6.16)

where n_f is the number of experimental runs in the factorial portion. For example, if the experiment has 2 factors, the full factorial portion contains 2^2 = 4 points, and the value of α for rotatability is α = (2^2)^(1/4) = 1.414. If the experiment has 3 factors, α = (2^3)^(1/4) = 1.682; if the experiment has 4 factors, α = (2^4)^(1/4) = 2; and so on. As shown in Fig. 6.23, CCDs usually have axial points outside the "cube" (unless one intentionally specifies α ≤ 1 due to, say, safety concerns in performing the experiments). Finally, since the design points describe a circle circumscribed about the factorial square, the optimum values must fall within this experimental region. If not, suitable constraints must be imposed on the function to be optimized (as illustrated in the example below). For further reading on CCD, the texts by Box et al. (1978) and Montgomery (2017) are recommended. Many software packages have a DOE wizard, which walks one through the various steps of defining the entire set of experimental/simulation combinations.

6.6 Response Surface Designs

Fig. 6.23 Rotatable central composite designs (CCD) for two factors and three factors during RSM. The black dots indicate locations of experimental runs

Example 6.6.1⁸ Optimizing the deposition rate for a tungsten film on silicon wafer. A two-factor rotatable central composite design (CCD) was run to optimize the deposition rate for a tungsten film on silicon wafer. The two factors are the process pressure (in kPa) and the ratio of hydrogen H2 to tungsten hexafluoride WF6 in the reaction atmosphere. The levels for these factors are given in Table 6.25. Let x1 be the pressure factor and x2 the ratio factor for the two coded factors. The rotatable CCD design with three center points was adopted, with the experimental results

⁸ From Buckner et al. (1993) with small modification.

Table 6.25 Assumed low and high levels for the two factors (Example 6.6.1)

Factor         Low level   High level
Pressure       0.4         8.0
Ratio H2/WF6   2           10

Table 6.26 Results of the CCD rotatable design for two coded factors with 3 center points (Example 6.6.1)

x1       x2       y
-1       -1       3663
1        -1       9393
-1       1        5602
1        1        12488
-1.414   0        1984
1.414    0        12603
0        -1.414   5007
0        1.414    10310
0        0        8979
0        0        8960
0        0        8979

Data available electronically on book website.
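The design in Table 6.26 can be generated programmatically; the following is a minimal sketch (not from the book) that reproduces the 11 runs of a two-factor rotatable CCD with three center points, with α from Eq. 6.16 and the run count of Eq. 6.15:

```python
# Sketch: generate the rotatable CCD of Table 6.26 (k = 2 factors,
# c = 3 center points); alpha = (2^k)^(1/4) per Eq. 6.16.
from itertools import product

k, c = 2, 3
alpha = round((2 ** k) ** 0.25, 3)    # 1.414 for k = 2

factorial = [tuple(pt) for pt in product([-1, 1], repeat=k)]   # 2^k "cube" points
axial = [tuple(a if j == i else 0 for j in range(k))           # 2k "star" points
         for i in range(k) for a in (-alpha, alpha)]
center = [(0,) * k] * c                                        # c center points

design = factorial + axial + center
print(alpha, len(design))   # 1.414 11  (Eq. 6.15: 2^k + 2k + c = 4 + 4 + 3)
```

Running the same sketch with k = 3 gives the 8 + 6 + c pattern noted in the text, with α = 1.682.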

assembled in Table 6.26. For example, for the pressure term, the low level of 0.4 is coded as -1, and the high level of 8.0 as +1. The numerical value of the response is left unaltered. A second-order linear regression with all 11 data points results in a model with Adj-R² = 0.969 and RMSE = 608.9. The model coefficients assembled in Table 6.27 indicate that the coefficients of (x1·x2) and (x1²·x2²) are not statistically significant. Dropping these terms results in a better model with Adj-R² = 0.983 and RMSE = 578.8. The corresponding values of the reduced model coefficients are shown in Table 6.28. In determining whether the model can be further simplified, one notes that the highest p-value on the independent variables is 0.0549, belonging to (x2²). Since this p-value is greater than or equal to 0.05, that term may not be statistically significant and one could consider removing it from the model; this, however, is a close call.
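The coding just described is the usual linear mapping between physical and coded units; a small sketch (assumed helper functions, not from the book):

```python
# Sketch of the linear factor coding used in Table 6.26: the low level
# maps to -1 and the high level to +1 (hypothetical helpers).
def code(actual, low, high):
    center, half = (low + high) / 2, (high - low) / 2
    return (actual - center) / half

def decode(coded, low, high):
    center, half = (low + high) / 2, (high - low) / 2
    return center + coded * half

print(round(code(0.4, 0.4, 8.0), 3))       # pressure low  -> -1.0
print(round(code(8.0, 0.4, 8.0), 3))       # pressure high -> 1.0
print(round(decode(1.414, 2.0, 10.0), 2))  # ratio at a star point -> 11.66
```

Note that decoding a star point (±1.414) yields a physical value outside the low/high levels of Table 6.25, consistent with the remark above that star points represent new extreme values for each factor.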


Table 6.27 Model coefficients for the second-order complete model with coded regressors (Example 6.6.1)

Parameter   Estimate   Standard error   t-statistic   p-value
Constant    8972.6     351.53           25.5246       0.0000
x1          3454.43    215.284          16.046        0.0001
x2          1566.79    215.284          7.27781       0.0019
x1*x2       289.0      304.434          0.949303      0.3962
x1^2        -839.837   277.993          -3.02107      0.0391
x2^2        -657.282   277.993          -2.36438      0.0773
x1^2*x2^2   310.952    430.60           0.722137      0.5102

Table 6.28 Model coefficients for the reduced model with coded regressors (Example 6.6.1)

Parameter   Estimate   Standard error   t-statistic   p-value
Constant    8972.6     334.19           26.8488       0.0000
x1          3454.43    204.664          16.8785       0.0000
x2          1566.79    204.664          7.65544       0.0003
x1^2        -762.044   243.63           -3.12787      0.0204
x2^2        -579.489   243.63           -2.37856      0.0549
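As a cross-check, the reduced model of Table 6.28 can be refit by ordinary least squares from the data of Table 6.26; a sketch (not from the book) using numpy, carrying the axial points as ±1.414 exactly as tabulated:

```python
# Sketch: refit y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 from Table 6.26.
import numpy as np

x1 = np.array([-1, 1, -1, 1, -1.414, 1.414, 0, 0, 0, 0, 0])
x2 = np.array([-1, -1, 1, 1, 0, 0, -1.414, 1.414, 0, 0, 0])
y = np.array([3663, 9393, 5602, 12488, 1984, 12603,
              5007, 10310, 8979, 8960, 8979], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 1))   # close to Table 6.28: 8972.6, 3454.4, 1566.8, -762.0, -579.5
```

The linear terms are estimated independently of the quadratic ones here because the x1 and x2 columns are orthogonal to the intercept and squared columns in a rotatable CCD.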

Thus, the final RSM model with coded regressors is:

    y = 8972.6 + 3454.4 x1 + 1566.8 x2 - 762.0 x1² - 579.5 x2²    (6.17)

It would be wise to look at the residuals. Figure 6.24 suggests quite good agreement between the model and the observations. However, Fig. 6.25 indicates that one of the points (the first row, i.e., y = 3663) is unusual since its studentized residual is high (recall that studentized residuals measure how many standard deviations each observed value of y deviates from the fitted model using all of the data except that observation; see Sect. 5.6.2). Points with studentized residuals greater than 3 in absolute value warrant a close look and, if necessary, may be removed prior to model fitting.

Fig. 6.24 Observed versus model predicted values (Example 6.6.1)

Fig. 6.25 Studentized residuals versus model predicted values (Example 6.6.1)

The optimal values of the two coded regressors associated with the maximum response are determined by taking partial derivatives of Eq. 6.17 and setting them to zero, resulting in x1 = 2.267 and x2 = 1.353. However, this optimum lies outside the experimental region used to identify the RSM model (see Fig. 6.26) and is unacceptable. A constrained optimization is warranted since the spherical constraint of a rotatable CCD must be satisfied, i.e., from Eq. 6.16, the design region has radius α = (2²)^(1/4) = 1.414. This guarantees that the optimal condition falls within the experimental region assumed. Resorting to a constrained optimization (see Sect. 7.3) results in the optimal values of the regressors x1* = 1.253 and x2* = 0.656, representing a maximum deposition rate y* = 12,883. The low and high values of the two regressors are shown in Table 6.25. These optimal values of the coded variables can be transformed back in terms of the original variables to yield pressure = 8.56 kPa and ratio H2/WF6 = 8.0. Finally, a confirmatory experiment would have to be conducted in the neighborhood of this optimum.
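The constrained optimum quoted above can be reproduced numerically. Since the fitted surface is concave and its unconstrained maximum lies outside the design region, the constrained maximum must lie on the boundary circle of radius 1.414, so a dense scan over that circle suffices; this is a sketch, not the book's method (a nonlinear solver, as in Sect. 7.3, would normally be used):

```python
# Sketch: maximize Eq. 6.17 subject to x1^2 + x2^2 <= 1.414^2 by scanning
# the boundary circle (valid here because the quadratic is concave and
# its unconstrained maximum lies outside the region).
import numpy as np

def deposition(x1, x2):
    return 8972.6 + 3454.4*x1 + 1566.8*x2 - 762.0*x1**2 - 579.5*x2**2

theta = np.linspace(0.0, 2.0*np.pi, 200001)
x1, x2 = 1.414*np.cos(theta), 1.414*np.sin(theta)
i = np.argmax(deposition(x1, x2))
# close to the book's x1* = 1.253, x2* = 0.656, y* = 12,883
print(round(x1[i], 2), round(x2[i], 2), round(deposition(x1[i], x2[i])))
```

Small differences from the book's figures arise from using the rounded coefficients of Eq. 6.17.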

6.7 Simulation Experiments

6.7.1 Background

Computer simulations of product behavior, processes, and systems ranging from simple (say a solar photovoltaic system) to extremely complex (such as design of a wide-area distributed energy system involving traditional and renewable power generation subunits or simulation of future climate change scenarios) have acquired great importance and a well-respected role in all branches of scientific and engineering endeavor. These virtual tools are based on “behavioral models” at a certain level of abstraction which allow prediction, assessment, and verification of the performance of products and systems under different design and operating


Fig. 6.26 Contour plot of the RSM model given by Eq. 6.17 in terms of the two coded factors (Example 6.6.1)

conditions. They are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of issues one can consider. Note that simulations are done assuming a set of discrete values for the design parameters/variables, which is akin to the selection of specific levels for treatment and secondary factors while conducting physical experiments. Thus, the primary purpose of performing simulation studies is to learn as much as possible about system behavior under different sets of design factors/variables/parameters with the lowest possible cost (Shannon 1975). The parallel between the traditional physical-experimental DOE methods and the selection of samples of input vectors of design variables for simulations is obvious. Unfortunately, the technical developments in the design of computer simulations have tended to be siloed, with rediscovery and duplication slowing down progress. It is common to have several competing commercial simulation programs meant for the same purpose but with differing degrees of accuracy, sophistication, and capabilities; however, such issues will not be considered here. It is simply assumed that a reasonably high-fidelity validated computer simulation program is available for the intended purpose. Further, this discussion is aimed at computationally intensive long-run simulations, i.e., instances where a single computer simulation run may require minutes or hours to complete. The material in this section is primarily focused on the use of detailed hourly or sub-hourly time-step computer simulation programs for the design of energy-efficient buildings, which predict the hourly energy consumption and the indoor comfort conditions for a whole year for any building

geometry and climatic conditions (see, for example, Clarke 1993). These programs include models for the dynamic thermal response of the building envelope and for the performance of the different types of HVAC systems, and they consider the specific manner in which the building is scheduled and operated (widely used ones are EnergyPlus, TRNSYS, Modelica, and eQuest; refer to, for example, de Witt 2003). Commonly, computer simulations can be used for one or more of the following applications:

(a) During design, i.e., while conceptualizing the system before it is built. This generally involves two different tasks:
(i) Sizing equipment or systems based on point conditions such as peak conditions during the normal course of operation. System reliability, safety, etc. must meet some code-specified criteria. Examples involve (i) a sloping roof that is structurally able to support code-specified conditions (say, location-specific 50-year maximum snow loading), (ii) extreme conditions of outdoor temperature and solar radiation for sizing heating and cooling equipment (often based on 1% and 99% annual probability criteria) for building HVAC equipment in a specific location, and (iii) mitigation measures to be activated to avoid environmental pollutants due to vehicles exceeding some prespecified threshold in a specific city neighborhood.
(ii) Long-term temporal simulation for system performance prediction under a pre-selected range of design conditions or during the normal range of operating conditions, assuming efficient or optimal control as per prevailing industry norms. Two examples are: (i) simulating the energy use in a building controlled in a standard energy-efficient manner on an hour-by-hour basis over a standard year, and (ii) predicting the annual electricity produced by a PV system assuming fault-free behavior. Such capability allows optimal or satisfactory options to be evaluated based on circumstance-specific constraints in addition to criteria such as energy or cost or both.
(b) During day-to-day operation, using model-based short-term forecasting and optimization techniques to control and operate the system as efficiently as possible. Such conditions may deviate from the idealized and standard conditions assumed during design. Typical applications are model-based supervisory control of energy systems (such as a distributed energy system in conjunction with building HVAC systems) and fault detection and diagnosis.
(c) System response/performance under extraordinary/rare events which result in full or partial component/system failure leading to large functionality loss and severe economic and social hardship (e.g., a hurricane knocking down power lines). Simulations are performed to study system robustness and the ability to recover quickly; this aspect falls under reliability (Sect. 7.5.6) and resilience analysis (complementary to sustainability, briefly mentioned in Sect. 12.7.6).

Note that the above applications may cover conditions wherein the model/simulation program inputs are either assumed to be deterministic or stochastic. The latter is an instance when some of the physical parameters of the model are not known with certainty and need to be expressed by a probability distribution. For example, during the construction of the wall assembly of a building, deviation from design specifications is common (referred to as specification uncertainty).
Another source of uncertainty is that several external drivers (such as weather) may not be known with the needed accuracy at the intended site. In addition, the conditions under which the system is operated may not be known properly and this introduces scenario uncertainty. For example, an architect during design may assume the building to be operated for 12 h/day while the owner may subsequently use it for 16 h/day. Finally, given the complexity of the actual system performance, the underlying models used for the simulation are simplifications of reality (modeling uncertainty). The extent to which all these uncertainties along with the uncertainty of the numerical method used to solve the set of modeling equations (called numerical uncertainty) would affect the final design should be evaluated.


Surprisingly, this aspect has yet to reach the necessary level of maturity for practicing architects and building energy analysts to adopt it routinely. The impact of uncertainties in building simulations, and how to address them realistically, has been addressed by de Witt (2003).

6.7.2 Similarities and Differences Between Physical and Simulation Experiments

DOE in conjunction with model building can be used to simplify the search for an optimum or a satisficing solution when detailed simulation programs of physical systems requiring long computer run times are to be used. The similarity of such problems to DOE experiments on physical processes is obvious since the former requires:

(a) Important design parameters (or independent variables) and their range of variability to be selected, which is identical to the selection of treatment and secondary factors.
(b) The number of levels for the factors to be decided based on some prior insight into the dependence between the response and the set of design parameters.
(c) Experimental design: the specific combinations of primary factors to be selected for which one predicts system responses to improve the design. This involves making multiple runs of the computer model using specific values and pairings of these input parameters.
(d) Finally, the data analysis phase, which allows insights into the following:
(i) Screening: performing a sensitivity analysis to determine a subset of dominant model input parameter combinations.
(ii) Model structure: gaining insights into the structure of the performance model, such as whether linear, second-order, or cross-product terms are needed.
(iii) Design optimization: finding an optimum (or near-optimum) either from an exhaustive search of the set of discrete simulation results or by fitting an appropriate mathematical model to the data, referred to as surrogate modeling (discussed in Sect. 6.7.5).

There are, however, some major differences. One major difference is that computer experiments are deterministic, i.e., one gets the same system response under the same set of inputs. Replication is not required, and only one center point is sufficient if a CCD design is adopted.
A second major difference is that, since the analyst selects the input variable set prior to each simulation run, blocking and control of the variables are inherent and no special consideration is required (Dean et al. 2017). Another difference is that the initial number of design variables tends to be much larger than that involving physical experiments. The resulting range of input combinations or design space is very large, requiring different approaches to generate the input combination samples, reduce the initial variable set by sensitivity analysis (similar to screening), and reduce the number of computer simulations during the search for a satisficing/optimal solution by adopting space-filling interpolation methods (also called "surrogate modeling"). These aspects are discussed below.

6.7.3 Monte Carlo and Allied Sampling Methods

The Monte Carlo (MC) method has been introduced earlier in the context of uncertainty analysis (Sect. 3.7.2). The MC approach, of which there are several variants, comprises that branch of computational mathematics that relies on experiments using random numbers to infer the response of a system (Hammersley and Handscomb 1964). Chance events are artificially recreated numerically (on a computer), the simulation is run many times, and the results provide the necessary insights. MC methods provide approximate solutions to a variety of deterministic and stochastic problems, hence their widespread appeal. The methods vary but tend to follow a particular pattern when applied to computer simulation situations:

(i) Define a domain of possible inputs.
(ii) Generate inputs randomly from an assumed probability distribution over the domain.
(iii) Perform a deterministic computation on the inputs.
(iv) Aggregate and analyze the results.

The many advantages of MC methods are conceptual simplicity, a low level of mathematics, applicability to a large number of different types of problems, the ability to account for correlations between inputs, and suitability to situations where model parameters have unknown distributions. MC methods are numerical methods in that all the uncertain inputs must be assigned a definite probability distribution. For each simulation, one value is selected at random for each input based on its probability of occurrence. Numerous such input sequences are generated, and simulations are performed. Provided the number of runs is large, the simulation output values will be normally distributed irrespective of the probability distributions of the inputs (this follows from the Central Limit theorem described in Sect. 4.2.1). Even though nonlinearities between the inputs and output are accounted for, the accuracy of the results depends on the number of runs.
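As an illustration of this four-step pattern, here is a toy sketch (hypothetical example, not from the book) that propagates input uncertainty through a simple steady-state heat-loss model:

```python
# Sketch: crude MC propagation through Q = UA*(Tin - Tout), with the
# conductance UA and outdoor temperature Tout treated as uncertain
# (assumed normal distributions; all numbers are illustrative only).
import random
import statistics

rng = random.Random(1)
samples = []
for _ in range(20000):                  # (ii) generate inputs randomly
    UA = rng.gauss(250.0, 20.0)         # W/K (specification uncertainty)
    Tout = rng.gauss(0.0, 3.0)          # deg C (scenario uncertainty)
    samples.append(UA * (20.0 - Tout))  # (iii) deterministic computation

# (iv) aggregate and analyze: mean close to 250*20 = 5000 W, with an
# approximately normal spread around it
print(round(statistics.mean(samples)), round(statistics.stdev(samples)))
```

Each run of the loop stands in for one full simulation; in a real study the one-line model would be replaced by a call to the simulation program.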
However, given the power of modern computers, the relatively large computational effort is no longer a major limitation except in very large simulation studies. The concept of "sampling efficiency" has been used to compare different schemes of implementing MC methods and was introduced earlier (refer to Eq. 4.49 in Sect. 4.7.4 dealing with stratified sampling). In the present context of computer simulations, this term assumes a different meaning and Eq. 4.49 needs to be modified. Here, a more efficient scheme of sampling a pre-chosen set of input variables along with their range of variation is one which results in greater spread or variance in the resulting simulation response set with fewer simulations n (thus requiring less computing time). Say two methods, 1 and 2, are to be compared. Method 1 created a sample of n1 simulation runs, while method 2 created a sample of n2 runs. The resulting two sets of the response variable outputs were found to have variances σ1² and σ2². Then, the sampling efficiency of method 1 with respect to method 2 is:

    ε1/ε2 = (σ1²/σ2²)/(n1/n2)    (6.18)

where (n1/n2) is called the labor ratio, and (σ1²/σ2²) the variance ratio. MC methods have emerged as a basic and widely used generic approach to quantify variabilities associated with model predictions, and for examining the relative importance of model parameters that affect model performance (Spears et al. 1994; de Wit 2003). There are different types of MC methods depending on the sampling algorithm for generating the trials (Helton and Davis 2003):

(a) Random sampling methods, which were the historic manner of explaining MC methods. They involve using random sampling for estimating integrals (i.e., for computing areas under a curve and solving differential equations). However, there is no assurance that a sample element will be generated from any subset of the sample space. Important subsets of the space with low probability but high consequences are likely to be missed.
(b) Crude MC, which uses traditional random sampling where each sample element is generated independently following a pre-specified distribution.
(c) Stratified MC (also called "importance sampling"), where the population is divided into groups or strata according to some pre-specified criterion, and sampling is done so that each stratum is guaranteed representation (unlike the crude MC method). Thus, it has the advantage of forcing the inclusion of specified subsets of the sampling space while maintaining the probabilistic character of random sampling. This method is said to be an order of magnitude more efficient than the crude MC method. A major problem is the necessity of defining the strata and calculating their probabilities, especially for high-dimension situations.


Fig. 6.27 The LHMC sampling method is based on quantiles, i.e., dividing the range of variation of each input variable into intervals of equal probability and then combining successive variables by sampling without replacement. LHMC is more efficient than basic MC sampling since it assures that each interval is sampled with the same density

(d) Latin hypercube, or LHMC, a stratified sampling-without-replacement technique for generating a set of input vectors from a multidimensional distribution. It is often used to construct computer experiments for performing sensitivity and uncertainty analysis on complex systems. It can be viewed as a compromise procedure combining many of the desirable features of random and stratified sampling. A Latin hypercube is the generalization of the Latin square to an arbitrary number of dimensions. It is most appropriate for design problems involving computer simulations because of its higher efficiency (McKay et al. 2000). LHMC is said to yield an unbiased estimator of the mean, but the estimator of the variance is biased (the bias is unknown but generally small).
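The stratified pairing idea behind item (d) can be sketched in a few lines (assumed helper, not from the book), here for k independent variables uniform on [0, 1):

```python
# Sketch of Latin hypercube sampling: split each variable's range into
# m equal-probability intervals, draw one value per interval, and
# shuffle each column (pairing without replacement) so every interval
# of every variable appears exactly once among the m k-tuples.
import random

def latin_hypercube(m, k, seed=0):
    rng = random.Random(seed)
    cols = []
    for _ in range(k):
        col = [(i + rng.random()) / m for i in range(m)]  # one draw per stratum
        rng.shuffle(col)                                  # random pairing
        cols.append(col)
    return list(zip(*cols))   # m sample points, each a k-tuple

pts = latin_hypercube(8, 2)
print(len(pts))   # 8 points; each of the 8 strata of each variable used once
```

For non-uniform inputs, each sampled value would be passed through the inverse cumulative distribution of the corresponding variable, preserving the equal-probability strata noted in item (d).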

LHMC sampling, which can be considered to be a factorial method, is conceptually easy to grasp. Say the input variable/factor vector is of dimension k and a sample of size m is to be generated from p = [p1, p2, p3, . . ., pk]. The range of each variable pj is divided into n disjoint or nonoverlapping intervals of equal probability⁹ (see Fig. 6.27, where n = 8 and m = 16), and values are selected randomly from each interval. The m values thus obtained for p1 are paired at random without replacement with similarly obtained m values for p2. These pairs are then combined in a random manner without replacement with the m values of p3 to form triples. This process is continued until m samples of k-tuples are formed. Note that this method of generating samples assures that each interval/subspace is sampled with the same density; this leads to greater efficiency compared to the basic MC method (as shown in Fig. 6.27). A more detailed discussion of how to construct LHMC samples and of why this method is advantageous in terms of computer-time efficiency is given by Dean et al. (2017). How to modify this method to deal with correlated variables has also been proposed and is briefly discussed in the next section.

6.7.4 Sensitivity Analysis for Screening

Sensitivity analysis is akin to factor screening in DOE and is different from an uncertainty analysis (treated in Sects. 3.6 and 3.7). The aim of sensitivity analysis is to explore/determine/identify the impact of input factors/variables/parameters on the predicted/simulated output variable/response, and then to quantify their relative importance.¹⁰ During many design studies, the performance of the system often depends on relatively few significant factors and to a much lesser degree on several insignificant ones (this is the Pareto principle, a rule of thumb often stated as 20/80!). The mapping between random input and the obtained model output can be explored with various techniques to determine the effects of the individual factors. The simplest manner of identifying parameter importance, appropriate for low-dimension input parameter vectors, is to generate scatterplots for each factor versus the response, which can visually reveal the relationships between them. Another possibility is to use the popular least squares techniques to construct a regression model that relates the parameters with the response. Several studies have proposed using partial correlation coefficients of linear regression (Sect. 5.4.5) or using stepwise regression (Sect. 5.4.6) for ranking

⁹ Recall that quantiles split sorted data or a probability distribution into equal parts.
¹⁰ Uncertainty analyses, on the other hand, use probabilistic values of model inputs to estimate probability distributions of model outputs.


Fig. 6.28 Linear and nonlinear sensitivity behavior of annual energy use of a building under different sets of design variables (from Lam and Hui 1996). (a) Linear effect of four different options of glazing design. (b) Nonlinear effect of three different external shading designs

parameter importance and for detecting parameter interaction (one such study is that by Wang et al. 2014). The use of linear regression is questionable; it may be suitable when system performance is linear (as exhibited by energy use in certain types of buildings and their HVAC system behavior, for certain design parameters, for narrow ranges of parameter variation, etc.) but may not be of general applicability (see Fig. 6.28). Data mining methods, such as the random forest algorithm (Sect. 11.5), are generally superior to regression-based methods in both screening and ranking parameters for building energy design (see, e.g., Dutta et al. 2016). There are several more formal sensitivity analysis methods (see Saltelli et al. 2000 and relevant technical papers). In such methods, the analyst is often faced with the difficult task of selecting the one method most appropriate for his application. A report by Iman and Helton (1985) compares different sensitivity analysis methods as applied to complex engineering systems and summarizes current knowledge in this area. Note that the modeling equations on which the simulation is based are often nonlinear in the parameters, and even if linear, the parameters may interact. In that case, the sensitivity of a parameter/factor may vary from point to point in the parameter space, and random samples in different regions will be required for proper analysis. Thus, one needs to distinguish between local sensitivity analysis (LSA) and global sensitivity analysis (GSA) (Heiselberg et al. 2009). There are two types of sensitivities: (i) LSA (or the one-parameter-at-a-time method), which describes the influences of individual design parameters on system response with all other parameters held constant at their standard or base values, i.e., at a specific local region in the parameter space. It is satisfactory for linear models and when the sensitivity of each individual input is independent of the value of the other inputs (often not true); (ii) GSA, which provides insight into the influence of a single design parameter on system response when all other parameters are varied together based on their individual ranges and probability distributions. These two types are discussed below.

(a) LSA or local sensitivity analysis

The general approach to determining individual sensitivity coefficients is summarized below:
(i) Formulate a base case reference and its description.
(ii) Study and break down the factors into basic parameters (parameterization).
(iii) Identify parameters of interest and determine their base case values.
(iv) Determine which simulation outputs are to be investigated and their practical implications.
(v) Introduce perturbations to the selected parameters about their base case values, one at a time. Factorial and fractional factorial designs are commonly used.
(vi) Study the corresponding effects of the variation/perturbation on the simulation outputs.
(vii) Determine the sensitivity coefficients for each selected parameter. The ones with the largest values are deemed more significant.

Sensitivity coefficients (also called elasticities in economics, as well as influence coefficients) are defined in various ways, as shown in Table 6.29. All these formulae involve discrete changes to the variables, denoted by step change Δ, as against partial derivatives. The first form is simply the


Table 6.29 Different forms of sensitivity coefficient Form 1 2a 2b 3a 3b

Formula1 ΔOP ΔIP ΔOP=OPBC ΔIP=IPBC ΔOP=OPBC ΔIP ðOP þOP Þ ΔOP= 1 2 2 ðIP þIP Þ ΔIP= 1 2 2 ΔOP ΔIP

=

- OP - IP

Dimension With dimension

Common name(s) Sensitivity coefficient, influence coefficient

% OP change % IP change

Influence coefficient, point elasticity

With dimension

Influence coefficient

% OP change % IP change

Arc mid-point elasticity, meant for two inputs

% OP change % IP change

(See note 2)

From Lam and Hui (1996) 1. ΔOP, ΔIP = changes in output and input respectively OPBC, IPBC = base case values of output and input respectively IP1, IP2 = two values of input OP1, OP2 = two values of the corresponding output - OP, - IP = mean values of output and input respectively 2. the slope of the linear regression line divided by the ratio of the mean output and mean input values

The first form is simply the derivative of the output variable (OP) with respect to the input parameter (IP). The second group uses the base case values to express the sensitivity as a percentage change, while the third group uses the mean values to express the percentage change (this is similar to the forward-differencing and central-differencing approaches used in numerical methods). Form (1) is the local sensitivity coefficient and is the simplest to interpret. Forms (2a), (3a), and (3b) have the advantage that the sensitivity coefficients are dimensionless. However, form (3a) can only be applied to a one-step change and cannot be used for multiple sets of parameters. In general, such methods are of limited use for simulation models with complex sets of nonlinear functions and a large set of input variables.

Figure 6.28 illustrates the linear and nonlinear behavior of two different sets of design strategies. The first set consists of four glazing design variables, while the second set consists of three external shading variables (all the variables are described in the figure). Variations in the parameters related to the four different window designs (Fig. 6.28a) are linear, and the local sensitivity method would provide the necessary insight. Further, it is clear that the shading coefficient (SC) has the largest impact on annual energy use. From Fig. 6.28b, the behavior of the sensitivity coefficients of the external shading designs likely to impact the energy use of the building is clearly nonlinear. The projection ratio of the egg-crate external shading design (shown as EG) has the largest impact on energy use. Further, all three external shading designs exhibit an exponential asymptotic behavior.

(b) GSA or global sensitivity analysis
Global sensitivity methods allow parameter sensitivity and ranking for nonlinear models over the entire design space.
The simplest approaches applied to the design of energy-efficient buildings include traditional randomized factorial two-level sampling designs (e.g., Hou et al. 1996), variations of the Latin squares, and CCD in conjunction with quadratic regression (e.g., Snyder et al. 2013). These approaches are suitable when the number of input variables is relatively low (say, up to 6–7 variables with 2 or 3 levels). For a larger number of variables, however, such approaches are usually infeasible since detailed simulation programs are computationally intensive and have long run-times. One cannot afford to perform separate simulation runs for sensitivity analysis and for acquiring the system response for different input vector combinations. While being an attractive way of generating multidimensional input vector samples for extensive computer experiments, LHMC is also the most popular method for performing sensitivity studies, i.e., for screening variable importance (Hofer 1999).

Once the LHMC simulation runs have been performed, one can identify the strong or influential parameters and/or the weak ones based on the results of the response variable. For example, the designer of an energy-efficient building would be most concerned with identifying the subset of input parameters that are more likely to lead to low annual energy consumption (such parameters can be referred to as "strong" or influential parameters). This is achieved by a process called regional sensitivity analysis. If the "weak" parameters can be fixed at their nominal values and removed from further consideration during design, the parameter space would be reduced enormously, somewhat alleviating the "curse of dimensionality." There are several approaches one could adopt; a rather simple and intuitive statistical method is described below.

Assume that m "candidate" input/design parameters were initially selected, with each parameter discretized into three levels or states (i.e., n = 3). The necessary number of LHMC runs (say, 1,000) are then conducted. Out of these 1,000 runs, suppose only 30 runs had response variable values in the acceptable range, corresponding to 30 specific vectors of input parameters. One would expect

6.7 Simulation Experiments


Table 6.30 Critical thresholds for the chi-square statistic with different significance levels for 2 degrees of freedom (d.f. = 2)

α: 0.001 | 0.005 | 0.01 | 0.05 | 0.2 | 0.3 | 0.5 | 0.9
Critical value: 13.815 | 10.597 | 9.210 | 5.991 | 3.219 | 2.408 | 1.386 | 0.211

the influential input parameters to appear more often in one level in this "acceptable subset" of input parameter vectors than the weak or non-influential ones. In fact, the latter are likely to be randomly distributed among the acceptable subset of vectors. Thus, the extent to which the number of occurrences of an individual parameter differs from 10 within each discrete state would indicate whether this parameter is strong or weak. This is a type of sensitivity test where the weak and strong parameters are identified using non-random pattern tests (Saltelli et al. 2000). The well-known chi-square (χ²) test for comparing distributions (see Sect. 2.4.3g) can be used to assess statistical independence for each and every parameter. First, the χ² statistic is computed for each of the m input variables as:

χ² = Σ (s = 1 to 3) [(p_obs,s − p_exp)² / p_exp]    (6.19)

where p_obs,s is the observed number of occurrences in state s, p_exp is the expected number (in the example above, this will be 10), and the subscript s refers to the index of the state (in this case, there are three states). If the observed numbers are close to the expected number, the χ² value will be small, indicating that the observed distribution fits the theoretical distribution closely. This would imply that the particular parameter is weak, since the corresponding distribution can be viewed as being random. Note that this test requires that the degrees of freedom (d.f.) be selected as (number of states − 1), i.e., in our case d.f. = 2. The critical values for the χ² distribution for different significance levels α are given in Table 6.30. If the χ² statistic for a particular parameter is greater than 9.210, one could assume it to be very strong since the associated statistical probability is greater than 99%. On the other hand, a parameter with a value below 1.386 (α = 0.5) could be considered weak, and those in between the two values as uncertain in influence.

(c) The Morris method
A superior method for performing GSA than the one described above is the variance-based approach, which depends on decomposing the variance of the response variable. One such variant has proven particularly attractive for many practical design problems. The elementary effects method, also referred to as the Morris method (Morris 1991), is an MC approach utilizing random sampling and a one-at-a-time (OAT) approach to generate vectors of input parameters. It


Table 6.31 Simple example of how the Morris method generates a trajectory of 5 simulation runs with four design parameters (k = 4)

Run | Parameter k1 | Parameter k2 | Parameter k3 | Parameter k4
Run 0(a) | k1,1 | k2,1 | k3,1 | k4,1
Run 1 | k1,2 | k2,1 | k3,1 | k4,1
Run 2 | k1,2 | k2,2 | k3,1 | k4,1
Run 3 | k1,2 | k2,2 | k3,2 | k4,1
Run 4 | k1,2 | k2,2 | k3,2 | k4,2

Note that once a parameter value has been resampled, it is frozen for the consecutive runs of the trajectory
(a) Baseline run

is an extension of the derivative-based methods (see Table 6.29 for the different working definitions). It includes variance as an additional sensitivity index for screening the global space and allows combining first- and second-order sensitivity analysis. Each of the k input variables is defined within a range of a continuous variable, normalized by min-max scaling (see Eq. 3.15) and discretized into sub-intervals (equal to the number of levels selected for each parameter) with equal probability (see Fig. 6.27). It uses an LHMC input vector generation method where the pairing of successive parameter combinations retains their correlation behavior. A trajectory is defined as a set of (k + 1) simulation runs, i.e., a vector/sequence of the k parameters with each successive point differing from the preceding one in one variable value only, by a multiple of a pre-defined step size Δ_i. Multiple trajectories need to be generated during the parametric design.

How the vector sequence is generated is illustrated using an example with four parameters in Table 6.31. The baseline run (Run 0) is generated by randomly selecting the discrete levels of each parameter (indicated as k1,1, k2,1, . . ., k4,1). For Run 1, one of the parameters, in this case k1, is resampled with the other three parameter values unchanged. Run 2 consists of randomly selecting one of the remaining parameters (the selection should not be sequential; here, k2 has been selected for easier comprehension) and resampling it (shown as k2,2). This is repeated for the remaining parameters k3 and k4. Hence, only one parameter is resampled for each run, and that value is frozen for the subsequent runs in the trajectory. The second simulation trajectory is created by randomly selecting a new set of combinations for Run 1 and using the OAT approach for subsequent runs, similar to the first trajectory.
It has been found that only a relatively small number of such trajectory simulation runs are needed. Once the


Fig. 6.29 Variation of the mean and standard deviation (μ* and σ) for the 23 building design variables chosen with annual energy use as the response variable (from Didwania et al. 2023). Only the seven design variables falling outside the envelope indicated are influential, the others are anonymously shown as dots

simulations are performed, the elementary effect (EE) of an individual parameter or factor i can be calculated for all points in the trajectory as follows (Sanchez et al. 2014):

EE_i = [y(k + e_i Δ_i) − y(k)] / Δ_i    (6.20)

where y is the response variable (design criterion or objective function) and Δ_i is the pre-defined step size. The term e_i is a vector of zeros except for its ith component, which takes on integer values by which different levels of the discretized parameter can be selected. Each trajectory with (k + 1) simulation runs provides an estimate of the elementary effect for each of the k parameters or variables. A set of r such different trajectories is defined, so the total number of simulation runs = r(k + 1). The average μ and standard deviation σ of the elementary effects are computed for each parameter i over the trajectories t:

μ_i = (1/r) Σ (t = 1 to r) EE_it    (6.21)

σ_i = { [1/(r − 1)] Σ (t = 1 to r) (EE_it − μ_i)² }^(1/2)    (6.22)
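The trajectory construction and the statistics of Eqs. 6.20–6.22 can be sketched as follows. The toy response function, the number of levels p, and the number of trajectories r are illustrative stand-ins for an actual simulation study; μ*, the mean of the absolute elementary effects, is used for screening.

```python
import random

# Sketch of the Morris elementary-effects method (Eqs. 6.20-6.22).
# The toy response function stands in for a detailed simulation program;
# k, p (number of levels), and r (number of trajectories) are illustrative.

def model(x):
    # hypothetical response in 4 normalized parameters, with an interaction
    return 10 * x[0] + 5 * x[1] + 20 * x[2] * x[3]

def morris(model, k=4, p=4, r=20, seed=1):
    delta = p / (2.0 * (p - 1))          # standard step size on a p-level grid
    grid = [i / (p - 1) for i in range(p)]
    ee = [[] for _ in range(k)]          # elementary effects per parameter
    rng = random.Random(seed)
    for _ in range(r):                   # one trajectory = k+1 simulation runs
        x = [rng.choice([g for g in grid if g + delta <= 1.0]) for _ in range(k)]
        order = rng.sample(range(k), k)  # randomized one-at-a-time (OAT) order
        y_prev = model(x)
        for i in order:
            x[i] += delta                # perturb one parameter, then freeze it
            y_new = model(x)
            ee[i].append((y_new - y_prev) / delta)   # Eq. 6.20
            y_prev = y_new
    stats = []
    for i in range(k):
        mu_star = sum(abs(e) for e in ee[i]) / r               # screening mean
        mu = sum(ee[i]) / r                                    # Eq. 6.21
        var = sum((e - mu) ** 2 for e in ee[i]) / (r - 1)      # Eq. 6.22
        stats.append((mu_star, var ** 0.5))
    return stats

if __name__ == "__main__":
    for i, (mu_star, sigma) in enumerate(morris(model)):
        print(f"x{i}: mu* = {mu_star:.2f}, sigma = {sigma:.2f}")
```

With this toy model, the two purely additive parameters come out with σ ≈ 0, while the two interacting ones show a high σ, mirroring the screening criteria listed below.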

The ensemble of trajectory runs is analyzed by computing and plotting two statistical indicators, the mean and standard deviation (μ* and σ) of the response to each design or input parameter (μ* denotes the mean of the absolute values of the elementary effects, which avoids effects of opposite sign canceling out). The relative importance of each parameter on the response variable, and of parameter interactions on the response, can be determined as11:
1. Negligible – low average (μ*) and low standard deviation (σ)
2. Linear and additive – high average (μ*) and low standard deviation (σ)

11 Note that this process is akin to the regional sensitivity approach using the chi-square test given by Eq. 6.19.

3. Nonlinear or presence of interactions – high standard deviation (σ)

How the final selection is done based on the above criteria is illustrated in Fig. 6.29. The points falling outside the zone of influence indicated by a curved line are deemed influential. One notes that out of the 23 parameters investigated, only 7 are significant and interactive. A lower μ* indicates that changing a variable will not have a substantial impact on the objective function; a lower σ suggests that its impact on the objective function is not affected by other parameters and is not nonlinear. If the model parameters have a significantly nonlinear effect, then it is suggested that an additional analysis involving second-order effects be performed, as described by Sanchez et al. (2014).

The Morris method is said to require far fewer simulation trajectory runs (i.e., it is more efficient) than the traditional LHMC method when a large number of parameters with possible interaction and second-order effects are being screened. The greater the number of trajectory simulation runs, the greater the accuracy, but studies have shown that 15–20 trajectory runs are adequate with 25 or so design variables. The Morris method has also been combined with the parallel coordinates graphical representation to enhance the ability of architects and designers to visually explore different combinations of sustainable building design which meet prestipulated ranges of variation of multi-criteria objective functions (Didwania et al. 2023).

(d) Closure
Helton and Davis (2003) cite over 150 references in the area of sensitivity analysis, discuss the clear advantages of LHMC for the analysis of complex systems, and enumerate the reasons for the popularity of such methods. Another popular GSA method similar to the Morris method, meant for complex mathematical models, is the Sobol method (Sobol 2001). It generates a sample that is uniformly distributed over the unit


hypercube and uses variance-based global sensitivity indices to determine first-order, second-order, and total effects of individual variables or groups of variables on the model output. It is said to be more stable than the Morris method but requires more simulation runs. The results of both methods have been reported to be similar by Didwania et al. (2023) for a case study involving the design of energy-efficient buildings. In addition to the LHMC method and the variance-based GSA method, variable screening and ranking of variables/ factors by importance can also be done by data mining methods (such as random forest algorithms discussed in Sect. 11.5). Dutta et al. (2016) illustrate this approach in the context of energy-efficient building design as a more attractive option to the CCD design approach which has been reported by Snyder et al. (2013) and described in Problem 11.14.
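The regional sensitivity screening described earlier, i.e., the χ² test of Eq. 6.19 applied to the acceptable subset of sampled runs, can be sketched as follows. The response model, the acceptance threshold, and the parameter counts are hypothetical stand-ins for a real simulation exercise.

```python
import random

# Sketch of regional sensitivity analysis using the chi-square test of Eq. 6.19.
# Parameters are discretized into 3 equally probable states; the response model
# and the acceptance criterion are hypothetical stand-ins.

def response(levels):
    # only parameter 0 really matters in this toy model (a "strong" parameter)
    return 10 * levels[0] + random.random()

def regional_sensitivity(m=5, n_states=3, n_runs=900, seed=7):
    random.seed(seed)
    runs = [[random.randrange(n_states) for _ in range(m)] for _ in range(n_runs)]
    # "acceptable" subset: runs with low simulated annual energy use
    accepted = [x for x in runs if response(x) < 1.0]
    p_exp = len(accepted) / n_states        # expected count per state if random
    chi2 = []
    for j in range(m):
        counts = [sum(1 for x in accepted if x[j] == s) for s in range(n_states)]
        chi2.append(sum((c - p_exp) ** 2 / p_exp for c in counts))  # Eq. 6.19
    return chi2

if __name__ == "__main__":
    for j, stat in enumerate(regional_sensitivity()):
        verdict = "strong" if stat > 9.21 else ("weak" if stat < 1.386 else "uncertain")
        print(f"parameter {j}: chi2 = {stat:.2f} ({verdict})")
```

The influential parameter produces a very large χ² (its acceptable runs all fall in one state), while the others stay near the random-distribution expectation.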

6.7.5 Surrogate Modeling

All the experimental design methods discussed above are referred to as static sampling approaches, since all the samples are defined prior to running the batch of simulations and are not adjusted depending on simulation outcomes. A more efficient alternative, which speeds up convergence while requiring fewer simulation runs, is the iterative sampling approach. This is akin to RSM which, as described in Sect. 6.6, is an iterative approach that accelerates the search toward the optimum condition by simultaneously varying more than one treatment variable. It proceeds by first identifying a model between a response and a set of several

Fig. 6.30 Simple example with two design variables to visually illustrate the surrogate modeling iterative approach as applicable to the design of an energy-efficient building. The dots indicate the preliminary computer simulation runs which would provide insights into how to narrow the solution space for subsequent simulation runs. (From Westermann and Evins 2019)


continuous treatment variables over an initial limited solution space, analyzes the test results to determine the optimal direction to move, performs a second set of test conditions, and so on until the desired optimum is reached. A similar methodology for iteratively reducing the number of simulations during the search for an optimum, and thereby speeding up convergence, can be adopted for detailed computer simulation programs. Specifically, the following steps are undertaken:

(i) Initially select a relatively small set of specific values and pairings of the input parameters (akin to performing factorial experiments).
(ii) Make multiple runs on the computer model and perform a sensitivity analysis to determine a subset of dominant model input parameter combinations (akin to screening).
(iii) Fit an appropriate mathematical model between the response and the set of independent variables (usually a second-order polynomial model with/without interactions).
(iv) Use this fitted response surface polynomial model as a surrogate (replacement or proxy) for the computer model to rapidly revise/shrink the original search space.
(v) Repeat steps (ii) to (iv) until the desired optimal/satisficing design solution is reached.

Figure 6.30 visually illustrates the general approach in a conceptually clear manner for a simple case involving only two design variables. An energy-efficient building is to be designed by varying two design variables: WWR (window-to-wall ratio) and SHGC (solar heat gain coefficient) of the window. The traditional factorial-type experimental design would require that the solution space be uniformly blanketed to find a satisfactory or optimal solution. The same insight


could be provided by fewer simulation runs: performing an initial set of limited runs (indicated as dots), selecting a finer grid in the sub-space of interest, and then iteratively zeroing in on the design solution. This approach is akin to the response surface modeling (RSM) approach described in Sect. 6.6. The significant benefit of the mathematical surrogate model approach is that it allows the solution to be reached more quickly using calculus methods than by computer simulations alone. However, a higher level of domain knowledge and analytical skill is demanded of the analyst. A good review of surrogate modeling techniques in general can be found in Dean et al. (2017), while their application to the design of sustainable buildings, along with an extensive literature review, is provided by Westermann and Evins (2019).
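One iteration of the surrogate approach can be sketched as follows, assuming a toy "simulation" with the two design variables of Fig. 6.30 (WWR and SHGC); the quadratic response-surface fit and its analytic minimum stand in for the model-fitting and search-space-shrinking steps, and all numerical values are hypothetical.

```python
import numpy as np

# Sketch of one iteration of surrogate (response-surface) modeling: sample the
# design space coarsely, fit a second-order polynomial, and use its analytic
# minimum to shrink the search region. The "simulation" is a toy stand-in.

def simulate(wwr, shgc):
    # toy annual-energy model with an interior minimum near (0.35, 0.30)
    return 100 + 80 * (wwr - 0.35) ** 2 + 60 * (shgc - 0.30) ** 2

def fit_quadratic(x1, x2, y):
    # least-squares fit of y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def surrogate_minimum(b):
    # stationary point: solve gradient = 0 for the fitted quadratic
    A = np.array([[2 * b[3], b[5]], [b[5], 2 * b[4]]])
    rhs = -np.array([b[1], b[2]])
    return np.linalg.solve(A, rhs)

# initial coarse grid of simulation runs (static sampling)
wwr = np.repeat(np.linspace(0.1, 0.9, 5), 5)
shgc = np.tile(np.linspace(0.1, 0.9, 5), 5)
y = simulate(wwr, shgc)

b = fit_quadratic(wwr, shgc, y)
opt = surrogate_minimum(b)
print("surrogate minimum near WWR = %.3f, SHGC = %.3f" % (opt[0], opt[1]))
# a next iteration would re-sample a finer grid centered on `opt`
```

Because the surrogate has a closed form, its optimum is found with simple linear algebra rather than further simulation runs, which is precisely the speed-up argued for above.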

6.7.6 Summary

Sensitivity analysis as pertinent to computer-based simulation design involves three major stages:

Stage 1: Pre-processing or selection of independent design variable combinations
The pre-processing stage involves selecting design variables of interest and identifying practical ranges based on the building type, project requirements, and owner specifications. If nonlinear relationships between predictors and response are suspected, then a minimum of three levels (which allow nonlinear and interaction effects to be explicitly captured) should be selected for each variable for the traditional factorial design methods, while two levels can be used for a rotatable CCD design. The number of evaluative combinations increases exponentially with the number of factors and levels. For example, 15 variables at three levels would lead to 3¹⁵ ≈ 14 × 10⁶ combinations, an impractical number of simulations. An experimental design technique is essential to select fewer runs while ensuring stratified (representative) sampling of the variable space. LHMC is popular since it is numerically efficient and its results are easy to interpret when performing sensitivity and uncertainty analyses of complex systems with many design parameters (Helton and Davis 2003; Heiselberg et al. 2009). Alternatively, a more traditional CCD design could also be adopted if the initial variable set is relatively small (say, less than 6–7 variables)12. Since the samples are defined prior to simulation and not adjusted

Design of Physical and Simulation Experiments

depending on simulation outcomes, this approach is called static sampling (more popular and easier to implement than the iterative approach).

Stage 2: Simulation-based generation of system responses
The selected variable combinations can now be input into an hourly building energy simulation program for batch processing. The responses could be direct outputs from the chosen simulation program, such as annual energy use/peak demand, or derived metrics like energy costs or environmental impacts. A database of such simulation outputs is created, which is then used in the next stage.

Stage 3: Post-processing of simulation results
Traditionally, least squares regression analysis has been the most popular technique for developing a model to predict energy consumption over the range of variation of the numerous design parameters. However, many studies have found that global regression techniques are questionable for this purpose and that nonparametric data mining methods such as the random forest algorithm (described in Sect. 11.5) are superior both in feature selection (identification of the design variables that are most influential) and as a global prediction model.
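The stratified sampling of Stage 1 (the core of the LHMC approach) can be sketched as follows; the variable names and ranges are illustrative placeholders and do not come from a specific project.

```python
import random

# Sketch of Stage 1: Latin hypercube sampling of design-variable combinations.
# Variable names and ranges below are hypothetical placeholders.

def latin_hypercube(ranges, n_runs, seed=42):
    """Generate n_runs stratified samples: each variable's range is cut into
    n_runs equal slices and every slice is used exactly once."""
    rng = random.Random(seed)
    samples = [{} for _ in range(n_runs)]
    for name, (lo, hi) in ranges.items():
        slices = list(range(n_runs))
        rng.shuffle(slices)                  # random pairing across variables
        width = (hi - lo) / n_runs
        for run, s in zip(samples, slices):
            run[name] = lo + (s + rng.random()) * width  # point within slice
    return samples

ranges = {"wwr": (0.1, 0.9), "shgc": (0.2, 0.8), "wall_U": (0.2, 0.6)}
runs = latin_hypercube(ranges, n_runs=10)
for r in runs[:3]:
    print({k: round(v, 3) for k, v in r.items()})
```

Each generated dictionary would then be fed to the simulation program in Stage 2; the stratification guarantees that even a small batch of runs covers the full range of every variable.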

Problems

Pr. 6.1(13) Full-factorial design for evaluating three different missile systems
A full-factorial experiment is conducted to determine which of three different missile systems is preferable. The propellant burning rate for 24 static firings was measured using four different propellant types. The experiment performed duplicate observations (replicate r = 2) of burning rates (in minutes) at each combination of the treatments. The data, after coding, are given in Table 6.32.

Table 6.32 Burning rates in minutes for the (3 × 4) case with two replicates (Problem 6.1)

Missile system | b1 | b2 | b3 | b4
A1 | 34.0, 32.7 | 30.1, 32.8 | 29.8, 26.7 | 29.0, 28.9
A2 | 32.0, 33.2 | 30.2, 29.8 | 28.7, 28.1 | 27.6, 27.8
A3 | 28.4, 29.3 | 27.3, 28.9 | 29.7, 27.3 | 28.8, 29.1

Data available electronically on book website

12 Refer to Problem 6.14 in the context of energy-efficient building design.

13 From Walpole et al. (2007) by permission of Pearson Education.

Table 6.34 Data table for Problem 6.3

The following hypothesis tests are to be studied:
(i) There is no difference in the mean propellant burning rates when different missile systems are used.
(ii) There is no difference in the mean propellant burning rates of the four propellant types.
(iii) There is no interaction between the different missile systems and the different propellant types.

Pr. 6.2 Random effects model for worker productivity
A full-factorial experiment was conducted to study the effect of indoor environment condition (depending on such factors as dry bulb temperature, relative humidity, . . .) on the productivity of workers manufacturing widgets. Four groups of workers, distinguished by such traits as age, gender, . . ., were selected and called G1, G2, G3, and G4. The number of widgets produced over a day by two members of each group under three different environmental conditions (E1, E2, and E3) was recorded. These results are assembled in Table 6.33. Using a 0.05 significance level, test the hypotheses that: (a) different environmental conditions have no effect on the number of widgets produced, (b) different worker groups have no effect on the number of widgets produced, and (c) there are no interaction effects between both factors. Subsequently, identify a suitable random effects model, study the model residual behavior, and draw relevant conclusions.

Pr. 6.3 Two-factor two-level (2²) factorial design (complete or balanced)
Consider a brand of variable-speed electric motor which is meant to operate at different ambient temperatures (factor X) and at different operating speeds (factor Y). The time to failure in hours (the response variable) has been measured for the four different treatment groups (conducted in a randomized manner) at a replication level of three (Table 6.34).

Table 6.33 Number of widgets produced daily using a replicate r = 2 (Problem 6.2)

Environmental conditions | G1 | G2 | G3 | G4
E1 | 227, 221 | 214, 259 | 225, 236 | 260, 229
E2 | 187, 208 | 181, 179 | 232, 198 | 246, 273
E3 | 174, 202 | 198, 194 | 178, 213 | 206, 219

Data available electronically on book website

Factor X | Factor Y: Low | Factor Y: High
Low | 82, 78, 86 | 64, 61, 57
High | 67, 74, 67 | 46, 54, 52

Table 6.35 Thermal efficiencies (%) of the two solar thermal collectors (Problem 6.4)

Mean operating temperature (°C) | 80 | 70 | 60 | 50
Without selective surface | 28, 29, 31, 32 | 34, 33, 35, 34 | 38, 39, 41, 38 | 40, 42, 41, 41
With selective surface | 33, 36, 33, 34 | 38, 38, 36, 35 | 41, 40, 43, 42 | 43, 45, 44, 45

Data available electronically on book website

(a) Code the data in the table using the standard form suggested by Yates (see Table 6.10).
(b) Generate the main effect and interaction plots and summarize observations.
(c) Repeat the analysis procedure described in Example 6.4.2 and identify a suitable prediction model for the response variable at the 0.05 significance level.
(d) Compare this model with one identified using linear OLS multiple regression.

Pr. 6.4 The thermal efficiency of solar thermal collectors decreases as their average operating temperature increases. One means of improving the thermal performance is to use selective surfaces for the absorber plates, which have the special property that the absorption coefficient is high for solar radiation and low for the infrared radiative heat losses. Two collectors, one without a selective surface and one with, were tested at four different operating temperatures with replication r = 4. The experimental results of thermal efficiency in % are tabulated in Table 6.35.
(i) Perform an analysis of variance to test for significant main and interaction effects.
(ii) Identify a suitable random effects model.
(iii) Identify a linear regression model and compare your results with those from part (ii).
(iv) Study the model residual behavior and draw relevant conclusions.

Pr. 6.5 Complete factorial design (3²) with replication
The carbon monoxide (CO) emissions in g/m³ from automobiles (the response variable) depend on the amount of ethanol added to a standard fuel (Factor A) and the air/fuel ratio (Factor B). A standard 3^k factorial design (i.e., three levels) with k = 2 and two replicates results in the values shown in


Table 6.36 Data table for Problem 6.5

Trial # | Factor A | Factor B | CO emissions (response Y)
1 | -1 | -1 | 66, 62
2 | 0 | -1 | 78, 81
3 | +1 | -1 | 90, 94
4 | -1 | 0 | 72, 67
5 | 0 | 0 | 80, 81
6 | +1 | 0 | 75, 78
7 | -1 | +1 | 68, 66
8 | 0 | +1 | 66, 69
9 | +1 | +1 | 60, 58

Data available electronically on book website

Table 6.37 Data table for Problem 6.6

X1 | X2 | Y
-1 | -1 | 2
+1 | -1 | 4
-1 | +1 | 3
+1 | +1 | 5
-2 | 0 | 1
+2 | 0 | 4
0 | -2 | 1
0 | +2 | 5
0 | 0 | 3


Data available electronically on book website

Table 6.36. Low, medium, and high levels are coded as -1, 0, and +1.
(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Confirm that this is an orthogonal array (by taking the inverse of (X^T X)).
(d) Using the matrix inverse approach (see Eq. 6.11), identify a factorial design model.
(e) Perform a multiple linear regression analysis with only the significant terms and identify a suitable prediction model for the response variable at the 0.05 significance level.

Pr. 6.6 Composite design
The following data were experimentally collected using a composite design (Table 6.37):
(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Identify a prediction model.
(d) Using least squares, evaluate first-order and second-order models. Compare them with the model identified from step (c).
(e) Determine the optimal value.
(f) Criticize the analysis and suggest improvements to the design procedure. (For example: more central points, replicates, more decimal points in the response variable, . . . Is this a rotatable design? . . .)

Pr. 6.7 The close similarity between a factorial design model and a multiple linear regression model was illustrated in Example 6.4.2. You will repeat this exercise with data from Example 6.4.3.
(a) Identify a multiple linear regression model and verify that the parameters of all regressors are identical to those of the factorial design model.
(b) Verify that the model coefficients do not change when the multiple linear regression is redone with the reduced model using variables coded as -1 and +1.
(c) Perform a forward stepwise linear regression and verify that you get back the same reduced model with the same coefficients.

Pr. 6.8 2³ factorial analysis for strength of concrete mix
A civil construction company wishes to maximize the strength of its concrete mix with three factors or variables: A - water content, B - coarse aggregate, and C - silica. A 2³ full-factorial set of experimental runs, consistent with the nomenclature of Table 6.10, was performed. The results are assembled in Table 6.38.
(a) You are asked to analyze these data, generate the ANOVA table, and identify statistically meaningful terms. (Hint: You will find that none are significant, which is probably due to d.f. = 1 for the residual error. It would have been more robust to do replicate testing.)
(b) Analyze the data using stepwise multiple linear regression and identify the statistically significant factors and interactions (at 0.005 significance).
(c) Identify the complete linear regression model using all main and interaction terms and verify that the model coefficients of the statistically significant terms are identical to those obtained using stepwise regression (step b). This is one of the major strengths of the 2^k factorial design.
(d) Develop a factorial design model for this problem using only the statistically significant terms (Table 6.38).


Pr. 6.9 Predictive model inferred from a 2^3 factorial design on a large laboratory chiller^14
Table 6.39 assembles steady-state data from a 2^3 factorial series of laboratory tests conducted on a 90-ton centrifugal chiller. There are three factors (Tcho = temperature of the chilled water leaving the evaporator, Tcdi = temperature of the cooling water entering the condenser, and Qch = chiller cooling load) with two levels each, thereby resulting in 8 data points without any replication. Note that there are small differences in the high and low levels of each of the factors because of operational control variability during testing. The chiller coefficient of performance (COP) is the response variable.
(a) Perform an ANOVA analysis, and check the importance of the main and interaction terms using the 8 data points indicated in the table.
(b) Identify the parsimonious predictive model from the above ANOVA analysis.
(c) Identify a least-squares regression model with coded variables and compare the model coefficients with those from the model identified in part (b).
(d) Generate model residuals and study their behavior (influential outliers, constant variance, and near-normal distribution).
(e) Reframe both models in terms of the original variables and compare the internal prediction errors.
(f) Using the four data sets indicated in the table as holdout points meant for cross-validation, compute the NMSE, RMSE, and CV values of both models. Draw relevant conclusions.

Table 6.38 Data table for Problem 6.8

Trial    A    B    C    Replication 1   Replication 2
1       -1   -1   -1       58.27           57.32
2        1   -1   -1       55.06           55.53
3       -1    1   -1       58.73           57.95
4        1    1   -1       52.55           53.09
5       -1   -1    1       54.88           55.20
6        1   -1    1       58.07           58.76
7       -1    1    1       56.60           56.16
8        1    1    1       59.57           58.87

Data available electronically on book website

Pr. 6.10 Blocking design for machining time
Consider Example 6.5.1, where the performance of four machines was analyzed in terms of machining time, with operator dexterity being a factor to be blocked. How to identify an additive linear model was also illustrated. It was pointed out that interaction effects may be important.
(a) Reanalyze the data to determine whether the interaction terms are statistically significant.
(b) It was noted from the residual plots that two of the extreme data points were suspect. Can you redo the analysis while accounting for this fact?

Pr. 6.11 Latin squares design with k = 3
Reduction in nitrogen oxides due to gasoline additives (the main treatment factor) is to be analyzed for different types of automobiles (factor X) under different drivers (factor Y). One does not expect interaction effects, and so a Latin squares series of experiments is conducted assuming five different levels for all three factors. The following table assembles the results for this design, with nitrogen oxide emissions shown as continuous numerical values. The different gasoline additives are the treatment, coded as A, B, C, D, and E (Table 6.40).
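The main and interaction effects of a 2^3 design such as that of Table 6.38 can be computed directly from the coded design matrix by contrasting the average response at the high and low levels of each factor. A minimal sketch (our own illustration, not the book's code) using the replication-averaged responses of Table 6.38:

```python
# Effects from a 2^3 factorial design (data of Table 6.38).
# Each effect = mean(response at level +1) - mean(response at level -1).
A = [-1, 1, -1, 1, -1, 1, -1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]
C = [-1, -1, -1, -1, 1, 1, 1, 1]
rep1 = [58.27, 55.06, 58.73, 52.55, 54.88, 58.07, 56.60, 59.57]
rep2 = [57.32, 55.53, 57.95, 53.09, 55.20, 58.76, 56.16, 58.87]
y = [(r1 + r2) / 2 for r1, r2 in zip(rep1, rep2)]  # average the replications

def effect(levels, y):
    hi = [yi for li, yi in zip(levels, y) if li == 1]
    lo = [yi for li, yi in zip(levels, y) if li == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

# An interaction contrast is the element-wise product of the coded columns
AB = [a * b for a, b in zip(A, B)]
for name, col in [("A", A), ("B", B), ("C", C), ("AB", AB)]:
    print(name, round(effect(col, y), 3))
```

The same contrasts, squared and scaled, yield the ANOVA sums of squares, so this is a quick cross-check on any ANOVA table.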

Table 6.39 Laboratory tests from a centrifugal chiller (Problem 6.9)

Data for model development:
Test #   Tcho (°C)   Tcdi (°C)   Qch (kW)   COP
1         10.940      29.816     315.011    3.765
2         10.403      29.559     103.140    2.425
3         10.038      21.537     289.625    4.748
4          9.967      18.086     122.884    3.503
5          4.930      27.056     292.052    3.763
6          4.541      26.783     109.822    2.526
7          4.793      21.523     354.936    4.411
8          4.426      18.666     114.394    3.151

Data for cross-validation:
Test #   Tcho (°C)   Tcdi (°C)   Qch (kW)   COP
1          7.940      29.628     286.284    3.593
2          7.528      24.403     348.387    4.274
3          6.699      24.288     188.940    3.678
4          7.306      24.202      93.798    2.517

Data available electronically on book website

14 Adapted from a more extensive table of data collected by Comstock and Braun (1999). We are thankful to James Braun for providing this data.


6 Design of Physical and Simulation Experiments

Table 6.40 Data table for Problem 6.11

Factor X   Y1      Y2      Y3      Y4      Y5
X1         A 24    B 20    C 19    D 24    E 24
X2         B 17    C 24    D 30    E 27    A 36
X3         C 18    D 38    E 26    A 27    B 21
X4         D 26    E 31    A 26    B 23    C 22
X5         E 22    A 30    B 20    C 29    D 31

(Each entry shows the gasoline additive treatment followed by the nitrogen oxide emission value.)

Data available electronically on book website

Table 6.41 Data for Problem 6.12 where the grades are out of 100

Time period   Course 1   Course 2   Course 3   Course 4
1             A 84       B 79       C 63       D 97
2             B 91       C 82       D 80       A 93
3             C 59       D 70       A 77       B 80
4             D 75       A 91       B 75       C 68

(Each entry shows the professor followed by the grade assigned.)

Data available electronically on book website
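The ANOVA decomposition for a Latin square partitions the total sum of squares into row, column, treatment, and error components, with (p − 1) degrees of freedom for each effect and (p − 1)(p − 2) for error. A minimal sketch of this bookkeeping (our own code, using the Table 6.41 grades as example data):

```python
# Sum-of-squares decomposition for a 4x4 Latin square (Table 6.41 grades).
# Rows = time periods, columns = courses, treatments = professors A-D.
letters = [["A", "B", "C", "D"],
           ["B", "C", "D", "A"],
           ["C", "D", "A", "B"],
           ["D", "A", "B", "C"]]
grades = [[84, 79, 63, 97],
          [91, 82, 80, 93],
          [59, 70, 77, 80],
          [75, 91, 75, 68]]

p = 4
N = p * p
flat = [g for row in grades for g in row]
grand = sum(flat) / N

row_means = [sum(r) / p for r in grades]
col_means = [sum(grades[i][j] for i in range(p)) / p for j in range(p)]
trt_means = {t: sum(grades[i][j] for i in range(p) for j in range(p)
                    if letters[i][j] == t) / p for t in "ABCD"}

ss_total = sum((g - grand) ** 2 for g in flat)
ss_rows = p * sum((m - grand) ** 2 for m in row_means)
ss_cols = p * sum((m - grand) ** 2 for m in col_means)
ss_trt = p * sum((m - grand) ** 2 for m in trt_means.values())
ss_err = ss_total - ss_rows - ss_cols - ss_trt

# Degrees of freedom: p-1 each for rows, columns, treatments; (p-1)(p-2) for error
df_err = (p - 1) * (p - 2)
F_trt = (ss_trt / (p - 1)) / (ss_err / df_err)
print(round(ss_trt, 2), round(ss_err, 2), round(F_trt, 2))
```

The computed F statistic for treatments would then be compared against the critical F value at the chosen significance level, which is the hypothesis test Pr. 6.12 asks for.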

(a) Briefly state the benefits and limitations of the Latin square design.
(b) State the relevant hypothesis tests one could perform.
(c) Generate the analysis of variance table and identify the significant factors at the 0.05 significance level.
(d) Perform a linear regression analysis and identify a parsimonious model.

Pr. 6.12 Latin squares for teaching evaluations
The College of Engineering of a large university wishes to evaluate the teaching capabilities of four professors. To eliminate any effects due to different courses offered during different times of the day, a Latin squares experiment was performed in which the letters A, B, C, and D represent the four professors. Each professor taught one section of each of the four different courses scheduled at each of four different times of the day. Table 6.41 shows the grades assigned by these professors to 16 students of approximately equal ability. At the 0.05 level of significance, test the hypothesis that different professors have no effect on the grades.

Pr. 6.13 As part of the first step of a response surface (RS) approach, the following linear model was identified from preliminary experimentation using two coded variables:

y = 55 − 2.5x1 + 1.2x2   with   −1 ≤ xi ≤ +1     (6.23)
Determine the path of the steepest ascent and draw this path on a contour plot.

Pr. 6.14 Building design involving simulations^15
This is a simplified example to illustrate how DOE can be used in conjunction with a detailed building energy

15 We thank Steve Snyder for this design problem, which is fully discussed in Snyder et al. (2013).

Table 6.42 Design parameters along with their range of variability, and the response variables (Pr. 6.14)

Design parameter                                  Variable name   Range
Lighting power density                            LPD             0.8–1.5 W/ft2
Window shading coefficient                        SC              0.2–0.7
Exterior wall R-value (total resistance)          EWR             7.8–27 h-ft2-°F/Btu
Window U-value (overall heat loss coefficient)    WU              0.26–1 Btu/h-ft2-°F
Window-to-wall ratio                              WWR             0.1–0.5

Response variable            Variable name   Units
Electricity use (annual)     E_elec          10^3 kWh
Natural gas use (annual)     E_gas           10^6 Btu

simulation program to find optimal values of the design parameters which minimize annual energy use. Deru et al. (2011) undertook a project to characterize the commercial building stock in the US and to develop reference models for it. Fifteen commercial building types and one multifamily residential building were determined to represent approximately two-thirds of the commercial building stock. The input parameters for the building models came from several sources; some were determined from ASHRAE standards, and the rest were determined from other studies of data and standard practices. National data from the 2003 CBECS (EIA 2003) were used to determine the appropriate, average mix of representative buildings, with the intention of representing 70% of US commercial building floor area. The building selected for this example is a medium office building of about 52,000 ft2, with three floors and a 1.5 aspect ratio. Building energy use depends on several variables, but only 5 variables are assumed here, as shown in Table 6.42: lighting power density (LPD), window shading coefficient (SC), exterior wall R-value (EWR), window U-value (WU), and window-to-wall ratio (WWR). All these design variables are continuous, and the range of variation set by the architect is


also shown. The study used the CCD design at two levels to generate an "optimal" set of 43 input combinations (edge + axial + center = 2^5 + 2 × 5 + 1 = 43). Each of these combinations was then simulated by the detailed building energy simulation program to yield the annual energy use of both electricity and natural gas, as assembled in the last two columns of Table B.3 in Appendix B and also available electronically. The location is Madison, WI, and the TMY2 climate data file was used.
(a) Analyze this study in terms of design approach. Start by identifying the edge points, the axial points, and the central point. Compare this design with the full factorial and the partial factorial design methods when the behavior is known to be non-linear (i.e., 3 levels for each factor). Generate the combinations and discuss benefits and limitations compared to CCD.
(b) Electricity use in kWh must be converted into the same units as that of natural gas. Assume a conversion factor of 33% (this is the efficiency of electricity generation and supply). Combine the two annual energy use quantities into a total thermal energy use, which is the aggregated variable to be considered below.
(c) Perform ANOVA analysis and identify significant terms and interactions of the aggregated energy use variable. Fit a response surface model and deduce the optimal design point (note: the design variables are bounded as indicated in Table 6.42). Discuss advantages and limitations.
(d) If the study were to be expanded to include more variables (say 15), how would you proceed? Lay out your experimental design procedure in logical successive steps. (Hint: consider a two-step process: sensitivity analysis as well as identification of the optimal combination of design variables.) (Table B.3)
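The 43-run count of the CCD can be verified by generating the coded points: 2^5 factorial corners, 2 × 5 axial points, and one center point. A minimal sketch (a generic coded design of our own, not the study's actual run order; the axial distance alpha is an illustrative value):

```python
from itertools import product

k = 5          # number of design factors
alpha = 2.0    # axial distance in coded units (illustrative value)

corners = [list(pt) for pt in product([-1.0, 1.0], repeat=k)]   # 2^k edge points
axial = []
for i in range(k):                                              # 2k axial points
    for a in (-alpha, alpha):
        pt = [0.0] * k
        pt[i] = a
        axial.append(pt)
center = [[0.0] * k]                                            # 1 center point

ccd = corners + axial + center
print(len(corners), len(axial), len(center), len(ccd))  # → 32 10 1 43
```

A full 3-level factorial for part (a) would instead require 3^5 = 243 runs, which makes the economy of the CCD apparent.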

Table B.3 Five factors at two-level CCD design combinations and the two response variable values (building energy use per year of electricity and natural gas) found by simulation. Only a few rows are shown for comprehension, while the entire data set is given in Appendix B.3. The units of the variables are specified in Table 6.42

Run #   LPD     SC      EWR      WU      WWR     E_elec    E_gas
1       0.800   0.450   17.400   0.630   0.300   444.718   117.445
2       1.003   0.555   13.364   0.786   0.216   463.986   116.974
3       1.003   0.555   21.436   0.786   0.216   463.374   117.064
...
43      1.500   0.450   17.400   0.630   0.300   523.427   116.534

Data for this problem are given in Appendix B.3 and also available electronically on book website

References

Beck, J.V. and K.J. Arnold, 1977. Parameter Estimation in Engineering and Science, John Wiley and Sons, New York.
Berger, P.D. and R.E. Maurer, 2002. Experimental Design with Applications in Management, Engineering, and the Sciences, Duxbury Press.
Box, G.E.P., W.G. Hunter and J.S. Hunter, 1978. Statistics for Experimenters, John Wiley & Sons, New York.
Buckner, J., D.J. Cammenga and A. Weber, 1993. Elimination of TiN peeling during exposure to CVD tungsten deposition process using designed experiments, Statistics in the Semiconductor Industry, Austin, Texas: SEMATECH, Technology Transfer No. 92051125A-GEN, vol. I, 4-445-3-71.
Clarke, J.A., 1993. Assessing building performance by simulation, Building and Environment, 28(4), pp. 419–427.
Comstock, M.C. and J.E. Braun, 1999. Development of Analysis Tools for the Evaluation of Fault Detection and Diagnostics in Chillers, ASHRAE Research Project 1043-RP; also, Ray W. Herrick Laboratories, Purdue University, HL 99-20: Report #4036-3, December.
Dean, A., D. Voss and D. Draguljk, 2017. Design and Analysis of Experiments, 2nd ed., Springer-Verlag, New York.
Deru, M., K. Field, D. Studer, K. Benne, B. Griffith, P. Torcellini, B. Liu, M. Halverson, D. Winiarski, M. Yazdanian, J. Huang and D. Crawley, 2011. U.S. Department of Energy Commercial Reference Building Models of the National Building Stock, National Renewable Energy Laboratory, NREL/TP-5500-46861, U.S. Department of Energy, February.
Devore, J. and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd ed., Thomson Brooks/Cole, Australia.
De Witt, S., 2003. Chap. 2, Uncertainty in Building Simulation, in Advanced Building Simulation, Eds. A.M. Malkawi and G. Augenbroe, Spon Press, Taylor and Francis, New York.
Didwania, K., T.A. Reddy and M. Addison, 2023. Synergizing Design of Building Energy Performance using Parametric Analysis, Dynamic Visualization and Neural Network Modeling, J. of Arch. Eng., American Society of Civil Engineers, Vol. 29, issue 4, Sept.
Dutta, R., T.A. Reddy and G. Runger, 2016. A Visual Analytics Based Methodology for Multi-Criteria Evaluation of Building Design Alternatives, ASHRAE Winter Conference paper, OR-16-C051, Orlando, FL, January.
EIA, 2003. Commercial Building Energy Consumption Survey (CBECS), www.eia.gov/consumption/commercial/data/2005/
Hammersley, J.M. and D.C. Handscomb, 1964. Monte Carlo Methods, Methuen and Co., London.
Heiselberg, P., H. Brohus, A. Hesselholt, H. Rasmussen, E. Seinre and S. Thomas, 2009. Application of sensitivity analysis in design of sustainable buildings, Renewable Energy, 34 (2009), pp. 2030–2036.
Helton, J.C. and F.J. Davis, 2003. Latin hypercube sampling and the propagation of uncertainty of complex systems, Reliability Engineering and System Safety, vol. 81, pp. 23–69.
Hofer, E., 1999. Sensitivity analysis in the context of uncertainty analysis for computationally intensive models, Computer Physics Communication, vol. 117, pp. 21–34.
Hou, D., J.W. Jones, B.D. Hunn and J.A. Banks, 1996. Development of HVAC system performance criteria using factorial design and DOE-2 simulation, Tenth Symposium on Improving Building Systems in Hot and Humid Climates, pp. 184–192, May 13–14, Fort Worth, TX.
Iman, R.L. and J.C. Helton, 1985. A Comparison of Uncertainty and Sensitivity Analysis Techniques for Computer Models, Sandia National Laboratories report NUREG/CR-3904, SAND 84-1461.
Lam, J.C. and S.C.M. Hui, 1996. Sensitivity analysis of energy performance of office buildings, Building and Environment, vol. 31, no. 1, pp. 27–39.
Mandel, J., 1964. The Statistical Analysis of Experimental Data, Dover Publications, New York.
McKay, M.D., R.J. Beckman and W.J. Conover, 2000. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 42(1, Special 40th Anniversary Issue), pp. 55–61.
Montgomery, D.C., 2017. Design and Analysis of Experiments, 9th ed., John Wiley & Sons, New York.
Morris, M.D., 1991. Factorial sampling plans for preliminary computational experiments, Technometrics, 33(2), pp. 161–174.
Saltelli, A., K. Chan and E.M. Scott (eds.), 2000. Sensitivity Analysis, John Wiley and Sons, Chichester.
Sanchez, D.G., B. Lacarriere, M. Musy and B. Bourges, 2014. Application of sensitivity analysis in building energy simulations: Combining first- and second-order elementary effects methods, Energy and Buildings, 68 (2014), pp. 741–750.
Shannon, R.E., 1975. Systems Simulation: The Art and Science, Prentice-Hall, Englewood Cliffs, New Jersey.


Snyder, S., T.A. Reddy and M. Addison, 2013. Automated Design of Buildings: Need, Conceptual Approach, and Illustrative Example, ASHRAE Conference paper #DA-13-C010, January.
Sobol, I.M., 2001. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Mathematics and Computers in Simulation, 55 (2001), pp. 271–280, Elsevier.
Spears, R., T. Grieb and N. Shiang, 1994. Parameter uncertainty and interaction in complex environmental models, Water Resources Research, vol. 30(11), pp. 3159–3169.
Walpole, R.E., R.H. Myers, S.L. Myers and K. Ye, 2007. Probability and Statistics for Engineers and Scientists, 8th ed., Prentice-Hall, Upper Saddle River, NJ.
Wang, M., J. Wright and A. Brownlee, 2014. A comparison of approaches to stepwise regression for the indication of variables sensitivities used with a multi-objective optimization problem, ASHRAE Annual Conference, paper SE-14-C060, Seattle, WA, June.
Westermann, P. and R. Evins, 2019. Surrogate modelling for sustainable building design – A review, Energy and Buildings, 198 (2019), pp. 170–186.

7 Optimization Methods

Abstract

This chapter provides a rather introductory overview and foundation of traditional optimization techniques along with pertinent engineering applications and illustrative examples. These techniques apply to situations where the impact of uncertainties is relatively minor and can be viewed as a subset of the broader domain of decision-making (treated in Chap. 12). This chapter starts by defining the various terms used in the optimization literature, such as the objective function and the different types of constraints, followed by a description of the various steps involved in an optimization problem, such as sensitivity and post-optimality analysis. Simple graphical methods are used to illustrate the fact that one may have problems with unique, no, or multiple solutions, and that one may encounter instances when not all the constraints are active, and some may even be redundant. Analytical methods involving calculus-based techniques (such as the Lagrange multiplier method) as well as numerical search methods, both for unconstrained and constrained problems as relevant to univariate and multivariate problems, are reviewed, and the usefulness of the slack-variable approach is explained. Subsequently, different solutions to problems which can be grouped as linear, quadratic, non-linear, or mixed integer programming are described, while highlighting the differences between them. Several simple examples illustrate the theoretical approaches, while more in-depth practical examples involving network models and supervisory control of an integrated energy system are also presented. How such optimization analysis approaches can be used for system reliability studies involving breakage of one or more components or links in a power grid is also illustrated. Methods that allow global solutions as against local ones, such as simulated annealing and genetic algorithms, are briefly described. Finally, the important topic of dynamic optimization is covered, which applies to optimizing a trajectory over time, i.e., to situations when a series of decisions have to be made to define or operate a system over a set of discrete time-sequenced stages. There is a vast amount of published material on the subject of optimization, and this chapter is simply meant to provide a good foundation and adequate working understanding for the reader to tackle the more complex and ever-evolving extensions and variants of optimization problems.

7.1 Introduction

7.1.1 What Is Optimization?

One of the most important tools for both design and operation of engineering systems is optimization, which corresponds to the case of finding optimal solutions under low uncertainty. This branch of applied mathematics, also studied under "operations research" (OR),1 is the use of specific methods where one tries to minimize or maximize a global characteristic (say, the cost or the benefit) whose variation is modeled by an "objective function." The setup of the optimization problem involves not only the mathematical formulation of the objective function but, just as importantly, the explicit and complete framing of a set of constraints. Optimization problems arise in almost all branches of industry or society, for example, in product and engineering process design, production scheduling, logistics, traffic control, and even strategic planning. Optimization in an engineering context involves certain basic aspects consisting of some or all of the following: (i) the framing of a situation or problem (for which a solution or a course of action is sought) in terms of a mathematical model often called the objective function; this could be a simple

1 "Operations Research" is the scientific/mathematical/quantitative discipline adopted by industrial and business organizations to better manage their complex business operations/systems with hundreds of variables for optimal operation/scheduling and for planning for future expansion growth.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_7


expression, or framed as a decision tree model in case of multiple outcomes (deterministic or probabilistic) or sequential decision-making stages; (ii) defining the range of constraints to the problem in terms of input parameters which may be dictated by physical considerations; (iii) placing bounds on the solution space of the output variables in terms of some practical or physical constraints; (iv) defining or introducing uncertainties in the input parameters and in the types of parameters appearing in the model; (v) mathematical techniques which allow solving such models efficiently (short execution times) and accurately (unbiased solutions); and (vi) sensitivity analysis to gauge the robustness of the optimal solution to various uncertainties. Framing of the mathematical model involves two types of uncertainties: (i) epistemic uncertainty, or lack of complete knowledge of the process or system, which can be reduced as more data is acquired, and (ii) aleatory uncertainty, which has to do with the stochasticity of the process and which cannot be reduced by collecting more data. These notions were introduced in Sect. 1.3.2 and are discussed at some length in Chap. 12 while dealing with decision analysis. This chapter deals with traditional optimization techniques as applied to engineering applications which are characterized by low aleatory and low epistemic uncertainty. Further, recall the concept of abstraction presented in Sect. 1.2.2 in the context of formulating models. It pertains to the process of deciding on the level of detail appropriate for the problem at hand: on the one hand, avoiding oversimplification, which may result in loss of important system behavior predictability, while on the other hand, avoiding the formulation of an overly detailed model, which may demand undue data and computational resources as well as time spent in understanding the model assumptions and the results generated. The same concept of abstraction also applies to the science of optimization.
One must set a level of abstraction commensurate with the complexity of the problem at hand and the accuracy of the solution sought. Consider a problem framed as finding the optimum of a continuous function. There could, of course, be the added complexity of considering several discrete options; but each option has one or more continuous variables requiring proper control to achieve a global optimum. The simple example given in Pr. 1.9 from Chap. 1 will be used to illustrate this case.

7.1.2 Simple Example

Example 7.1.1 Function minimization
Two pumps with parallel networks (Fig. 7.1) deliver a volumetric flow rate F = 0.01 m3/s of water from a reservoir to the destination. The pressure drops in Pascals (Pa) of each network are given by Δp1 = (2.1) × 10^10 × F1^2 and Δp2 = (3.6) × 10^10 × F2^2, where F1 and F2 are the flow rates through each branch in m3/s. Assume that both the pumps and their motor assemblies have equal efficiencies η1 = η2 = 0.9. Let P1 and P2 be the electric power in Watts (W) consumed by the two pump-motor assemblies. The total power draw must be minimized. Since power consumed is equal to volume flow rate times pressure drop, the objective function to be minimized is the sum of the power consumed by both pumps:

J = Δp1·F1/η1 + Δp2·F2/η2
or
J = [(2.1) × 10^10 × F1^3]/0.9 + [(3.6) × 10^10 × F2^3]/0.9     (7.1)

Fig. 7.1 Pumping system whose operational power consumption is to be minimized (Example 7.1.1)

The sum of both flows is equal to 0.01 m3/s, and so F2 can be eliminated in Eq. 7.1. Thus, the sought-after solution is the value of F1 which minimizes the objective function J:

J* = Min{J} = Min{[(2.1) × 10^10 × F1^3]/0.9 + [(3.6) × 10^10 × (0.01 − F1)^3]/0.9}     (7.2)

subject to the constraint that F1 > 0. From basic calculus, setting dJ/dF1 = 0 provides the optimum solution, from which F1 = 0.00567 m3/s and F2 = 0.00433 m3/s, and the total power of both pumps is P = 7501 W. The extent to which non-optimal performance is likely to lead to excess power can be gauged (referred to as post-optimality analysis) by simply plotting the function J vs. F1 (Fig. 7.2). In this case, the optimum is rather broad; the system can be operated such that F1 is in the range of 0.005–0.006 m3/s without much power penalty. On the other hand, sensitivity analysis would involve a study of how the optimum value is affected by certain parameters. For example, Fig. 7.3 shows that varying the efficiency of pump 1 in the range of 0.85–0.95 has negligible impact on the optimal result. However, this may not be the case for


some other variable. A systematic study of how various parameters impact the optimal value falls under "sensitivity analysis," and there exist formal methods for investigating this aspect of the problem. Finally, note that this is a very simple optimization problem with a single imposed constraint, which did not even need to be enforced during the optimization. ■
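The optimum of Eq. 7.2 can be verified numerically; a minimal sketch (our own code, using a golden-section search rather than the calculus solution of the example):

```python
# Numerical check of Example 7.1.1; constants are from the example.
def total_power(F1, F=0.01, eta=0.9, k1=2.1e10, k2=3.6e10):
    """Objective J of Eq. 7.2: electric power (W) drawn by both pumps."""
    F2 = F - F1
    return (k1 * F1**3 + k2 * F2**3) / eta

# Golden-section search on 0 < F1 < 0.01 (simple derivative-free minimizer)
lo, hi = 1e-6, 0.01 - 1e-6
phi = (5**0.5 - 1) / 2
while hi - lo > 1e-9:
    a = hi - phi * (hi - lo)
    b = lo + phi * (hi - lo)
    if total_power(a) < total_power(b):
        hi = b
    else:
        lo = a
F1_opt = (lo + hi) / 2
print(round(F1_opt, 5), round(total_power(F1_opt)))  # → 0.00567 7501
```

Evaluating total_power over a grid of F1 values reproduces the broad optimum shown in Fig. 7.2.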

7.2 Terminology and Classification

7.2.1 Definition of Terms

A mathematical formulation of an optimization problem involving control of an engineering system consists of the following terms:

Fig. 7.2 One type of post-optimality analysis involves plotting the objective function for Total Power against flow rate through pump 1 (F1) to evaluate the shape of the curve near the optimum. In this case, there is a broad optimum indicating that the system can be operated near-optimally over this range without much corresponding power penalty

Fig. 7.3 Sensitivity analysis with respect to efficiency of pump 1 on the overall optimum


(i) Decision variables or process variables, say (x1, x2, . . ., xn), whose respective values are to be determined. These can be either discrete or continuous variables. An example is the air flow rate in an evaporative cooling tower;
(ii) Control variables, which are the physical quantities that can be varied by hardware according to the numerical values of the decision variables sought. Determining the "best" numerical values of these variables is the basic intent of optimization. An example is the control of the fan speed to achieve the desired air flow rate through the evaporative cooling tower;
(iii) Objective function, an analytical formulation of an appropriate measure of performance of the system (or characteristic of the design problem) in terms of the decision variables. An example is the total electric power in Example 7.1.1;
(iv) Constraints or restrictions imposed on the values of the decision variables. These can be of two types: non-negativity constraints (for example, flow rates cannot be negative) and functional constraints (also called structural constraints), which can be equality, inequality, or range constraints that specify the range over which the decision variables can vary. These can be based on direct considerations (such as not exceeding the capacity of energy equipment, or limitations of temperature and pressure control values) or on indirect ones (when mass and energy balances have to be satisfied);
(v) Model parameters, the constants appearing in the constraints and objective equations.

7.2.2 Categorization of Methods

Optimization methods can be categorized in a number of ways.


Fig. 7.4 Examples of unimodal, bimodal and multimodal peaks for a univariate unconstrained function showing the single, dual and multi-critical points, respectively

Fig. 7.5 Two types of critical points for the unconstrained bivariate problem: (a) saddle point, (b) global minimum of a unimodal function

(i) Univariate/multivariate problems with uni/multi-modal critical points. Univariate problems are those where only one variable is involved in the objective function, which can be a unimodal, bimodal, or multi-modal function (Fig. 7.4). Critical points are those where the first derivative of the function is zero, and they can be either local or global maxima or minima depending on whether the second derivative is negative or positive, respectively. If the second derivative is zero, this indicates a saddle point. By extension, critical points of a function of two variables are those points at which both partial derivatives of the function are zero. The saddle point (Fig. 7.5) is one where the slopes (or function derivatives) in orthogonal directions are both zero. As the number of variables in the function increases, the solution of the problem becomes exponentially more difficult. Searching for optimum points of non-linear problems poses the great danger of getting stuck in a local minimum; software programs often avoid this situation by adopting a technique called "multistart," where several searches are undertaken from different starting points of the feasible search space. (ii) Analytical vs. numerical methods. Analytical methods apply to cases when the optimization of the objective function can be expressed as a mathematical
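The multistart idea can be sketched with a simple gradient descent on a bimodal univariate function; the function, starting points, and step size below are illustrative choices of ours, not from the text:

```python
# Multistart local search on the bimodal function f(x) = x^4 - 3x^2 + x.
# A single gradient descent can get stuck in the poorer local minimum;
# restarting from several points and keeping the best result avoids this.
def f(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, iters=2000):
    """Crude fixed-step gradient descent to the nearest local minimum."""
    for _ in range(iters):
        x -= lr * grad(x)
    return x

starts = [-2.0, 0.5, 2.0]
candidates = [descend(x0) for x0 in starts]
x_best = min(candidates, key=f)   # keep the best local minimum found
print(round(x_best, 3), round(f(x_best), 3))
```

Here the start at 0.5 descends into the shallower minimum near x ≈ 1.13, while the start at −2 finds the global minimum near x ≈ −1.30; multistart keeps the latter.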

relationship using subsidiary equations which can be solved directly or by calculus methods. This contrasts with numerical search methods, which require that the function or its gradient at successive locations be determined by an algorithm that allows homing in on the solution. The categorization of these methods is based on the algorithm used. Even though more demanding computationally, search methods are especially useful and often adopted for discontinuous or complex problems.
(iii) Linear vs. non-linear methods. Linear optimization problems (or linear programming, LP) involve linear models with a linear objective function and linear constraints. The theory is well developed, and solutions can be found quickly and robustly. There is an enormous amount of published literature on LP problems, and LP has found numerous practical applications involving up to several thousands of independent variables. There are several well-known techniques to solve them (the Simplex method in Operations Research, used to solve a large set of linear equations, being the best known). However, many problems from engineering to economics require the use of non-linear models or constraints, in which case non-linear programming2 (NLP) techniques must be used. In some cases, non-linear models (e.g., equipment models such as chillers, fans, pumps, and cooling towers) can be expressed as quadratic models, and algorithms more efficient than non-linear programming ones have been developed; this falls under quadratic programming (QP) methods. Calculus-based methods are best suited for finding the solution of simpler problems; for more involved problems resulting in complex non-linear

2 Programming does not refer to computer programming, but arises from the use of "program" by the United States military to refer to proposed training and logistics schedules, which were the problems studied by Dantzig (the primary developer of the Simplex method).

simultaneous equations, a search process would be required.
(iv) Continuous vs. discontinuous. When the objective functions are discontinuous, calculus-based methods can break down. In such cases, one could use non-gradient-based methods or even heuristic-based computational methods such as simulated annealing, particle swarm optimization, or genetic algorithms. The latter are very powerful in that they can overcome problems associated with local minima and discontinuous functions, but they need long computing times, have no guarantee of finding the global optimum, and require a certain amount of knowledge on the part of the analyst. Another form of discontinuity arises when one or more of the variables are discrete as against continuous. Such cases fall under the classification known as integer or discrete programming.
(v) Static vs. dynamic. If the optimization is done with time not being a factor, then the procedure is called static. However, if optimization is to be done over a time period where decisions can be made at several sub-intervals of that period, then a dynamic optimization method is warranted. Two such examples are when one needs to optimize the route taken by a salesman visiting different cities as part of his road trip, or when the operation of a thermal ice storage system supplying cooling to a building must be optimized over the several hours of the day when high electric demand charges prevail. Whenever possible, analysts make simplifying assumptions to make the optimization problem static.
(vi) Deterministic vs. probabilistic. This depends on whether one neglects or considers the uncertainty associated with various parameters of the objective function and the constraints. The need to treat these uncertainties together, and in a probabilistic manner, rather than one at a time (as is done in a sensitivity analysis), has led to the development of several numerical techniques, the Monte Carlo technique being the most widely used (Sect. 12.2.8).
(vii) Traditional vs. stochastic adaptive search. Most of the methods referred to above can be designated as traditional methods, in contrast to stochastic methods developed in the last three to four decades (also referred to as global optimization methods or metaheuristic methods). The latter are especially meant for very complex multivariate non-linear problems. Three of the best-known methods are simulated annealing, particle swarm optimization, and genetic algorithms. Each of these uses metaheuristic search techniques, that is, a mix of selective combination of intermediate results and randomness. These algorithms are patterned after the evolutionary and flocking behaviors which nature adopts for physical and biological processes to gradually and adaptively improve the search towards a near-optimal solution.

Fig. 7.6 An example of a constrained optimization problem with no feasible solution

7.2.3 Types of Objective Functions and Constraints

Single-criterion optimization is one where a single objective function can be formulated. For example, an industrialist is considering starting a factory to assemble photovoltaic (PV) cells into PV modules. Whether to invest or not, and if yes, at what capacity level, are issues which can both be framed as a single-criterion optimization problem. However, if maximizing the number of jobs created is another (altruistic) objective, then the problem must be treated as a multi-criteria decision problem. Such cases are discussed in Sects. 12.5–12.7. Establishing the objective function is often simple. The real challenge is usually in specifying the complete set of constraints. A feasible solution is one that satisfies all the stated constraints, while an infeasible solution is one where at least one constraint is violated. The optimal solution is a feasible solution that has the most favorable value (either maximum or minimum) of the objective function, and it is this solution that is sought. The optimal solution can be a single point or even several points. Also, some problems may have no optimal solution at all. Figure 7.6 shows a function to be maximized subject to several constraints (six in this case). Note that there is no feasible solution, and one of the constraints must be relaxed or the

7 Optimization Methods

Fig. 7.7 An example of a constrained optimization problem with more than one optimal solution

problem reframed. In some optimization problems, one can obtain several equivalent optimal solutions. This is illustrated in Fig. 7.7, where several combinations of the two variables, which define the line segment shown, are possible optima. Sometimes, an optimal solution may not necessarily be the one selected for implementation. A "satisficing" solution (a combination of the words "satisfactory" and "optimizing") may be the one selected for actual implementation; it reflects the difference between theory (which yields an optimal solution) and the situations faced in reality (actual implementation issues, heuristic constraints that cannot be expressed mathematically, the need to treat unpredictable occurrences, risk attitudes of the owner/operator, . . .). Some practitioners also refer to such solutions as "near-optimal," though this carries a somewhat negative connotation. A final issue with optimization problems is that the constraints differ in how strongly they influence the optimal solution; sometimes a constraint plays no role at all, in which case it is referred to as a non-binding constraint. Detecting such superfluous constraints is not simple; relaxing them would often simplify the solution search.

7.2.4 Sensitivity Analysis and Post-Optimality Analysis

Model parameters are often not known with certainty; they may be based on models identified from partial or incomplete observations, or they may even be guesstimates. The optimum is only correct insofar as the model is accurate and the model parameters and constraints reflect the actual situation. Hence, the optimal solution determined needs to be reevaluated in terms of how the various types of uncertainties affect it. This is done by sensitivity analysis, which determines a range of values:

(i) Of the model parameters over which the optimal solution will remain unchanged or vary within an allowable range so as to stay near-optimal. This flags critical parameters that may require closer investigation, refinement, and monitoring.
(ii) Over which the optimal solution will remain feasible with adjusted values for the decision variables (the allowable range to stay feasible, i.e., for the constraints to be satisfied). This helps identify influential constraints.

Further, the above evaluations can be performed by adopting (see Sect. 6.7.4):

(i) Individual parameter sensitivity, where one parameter at a time in the original model is varied (or perturbed) to check its effect on the optimal solution;
(ii) Total sensitivity (also called parametric programming), which involves studying how the optimal solution changes as many parameters change simultaneously over some range. This provides insight into "correlated" parameters and trade-offs in parameter values. Such evaluations are conveniently done using Monte Carlo methods (Sect. 12.2.8).
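To make the individual-parameter idea concrete, here is a minimal one-at-a-time (OAT) perturbation sketch in Python. The toy cost model, its parameters a and b, and the +10% perturbation size are illustrative assumptions, not from the text:

```python
def x_opt(a, b):
    # minimizer of the toy cost model cost(x) = a*x**2 + b/x,
    # obtained from d(cost)/dx = 2*a*x - b/x**2 = 0
    return (b / (2 * a)) ** (1 / 3)

a0, b0 = 2.0, 16.0              # illustrative base-case parameter values
x_base = x_opt(a0, b0)          # base-case optimum

# perturb one parameter at a time by +10% and record the shift in the optimum
shift_a = x_opt(1.1 * a0, b0) - x_base
shift_b = x_opt(a0, 1.1 * b0) - x_base
print(x_base, shift_a, shift_b)
```

Ranking the magnitudes of such shifts flags which parameters deserve closer monitoring; here a larger quadratic cost a pulls the optimum down while a larger b pushes it up.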

7.3 Analytical Methods

Calculus-based solution methods can be applied to both linear and non-linear problems and are the ones to which undergraduate students are most likely to have been exposed. They can be used for problems where the objective function and the constraints are differentiable. These methods are also referred to as classical or traditional optimization methods, as distinct from stochastic methods such as simulated annealing or evolutionary algorithms. A brief review of calculus-based analytical and search methods is presented below.

7.3.1 Unconstrained Problems

The basic calculus of the univariate unconstrained optimization problem can be extended to the multivariate case of dimension n by introducing the gradient vector ∇ and by recalling that the gradient of a scalar y is defined as:

∇y = (∂y/∂x1) i1 + (∂y/∂x2) i2 + . . . + (∂y/∂xn) in   (7.3)

where i1, i2, . . ., in are unit vectors and y is the objective function, a function of n variables:

y = y(x1, x2, . . ., xn)   (7.4)

With this terminology, the condition for optimality of a continuous function y is simply:

∇y = 0   (7.5)

However, the optimality may be associated with stationary points which could be minimum, maximum, saddle, or ridge points. Since objective functions are conventionally expressed as a minimization problem, one seeks the minimum of the objective function. Recall that for the univariate case, assuring that the optimal value found is a minimum (and not a maximum or a saddle point) involves computing the numerical value of the second derivative at this optimal point and checking that its value is positive. Graphically, a minimum for a continuous function is found (or exists) when the function is convex, while a saddle point is found for a combination function (see Fig. 7.8). In the multivariate optimization case, one checks whether the Hessian matrix (i.e., the symmetric matrix of second derivatives) is positive definite. It is tedious to check this condition by hand for any matrix whose dimensionality is greater than 2, and so computer programs are invariably used for such problems. A simple hand-calculation method (which works well for low-dimension problems) for ascertaining whether the optimal point is a minimum (or maximum) is simply to perturb the optimal solution vector by a small amount, compute the objective function, and determine whether this value is higher (or lower) than the optimal value found.

Fig. 7.8 Illustrations of convex, concave and combination functions. A convex function is one where every point on the line joining any two points on the graph does not lie below the graph at any point. A combination function is one that exhibits both convex and concave behavior during different portions, with the switch-over being the saddle point

Example 7.3.1 Determine the minimum value of the following function:

y = 1/(4x1) + 8x1²x2 + 1/x2²

First, the two first-order derivatives are found:

∂y/∂x1 = −1/(4x1²) + 16x1x2   and   ∂y/∂x2 = 8x1² − 2/x2³

Setting the above two expressions to zero and solving results in x1* = 0.2051 and x2* = 1.8114, at which condition the optimal value of the objective function is y* = 2.133. It is left to the reader to verify these results and to check whether this is indeed the minimum. ■
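The perturbation check described above is easy to script. A minimal Python sketch for Example 7.3.1 (the ±0.01 perturbation size is an arbitrary choice):

```python
def y(x1, x2):
    # objective function of Example 7.3.1 (valid for x1, x2 > 0)
    return 1 / (4 * x1) + 8 * x1**2 * x2 + 1 / x2**2

x1s, x2s = 0.2051, 1.8114       # stationary point found analytically
y_star = y(x1s, x2s)            # approximately 2.133

# evaluate the objective at small perturbations around the stationary point
deltas = (-0.01, 0.0, 0.01)
neighbors = [y(x1s + d1, x2s + d2)
             for d1 in deltas for d2 in deltas if (d1, d2) != (0.0, 0.0)]

is_minimum = all(v > y_star for v in neighbors)
print(round(y_star, 3), is_minimum)
```

All eight neighboring values exceed y*, which is consistent with the point being a minimum (a positive-definite Hessian check would confirm it rigorously).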

7.3.2 Direct Substitution Method for Equality Constrained Problems

The simplest approach is the direct substitution method where, for a problem involving n variables and m equality constraints, one tries to eliminate the m constraints by direct substitution and solve the objective function using the unconstrained solution method described above. This approach was used earlier in Example 7.1.1.

Example 7.3.2³ Direct substitution method
Consider the simple optimization problem stated as:

Minimize f(x) = 4x1² + 5x2²   (7.6a)
subject to: 2x1 + 3x2 = 6   (7.6b)

Either x1 or x2 can be eliminated without difficulty. Say the constraint equation is used to solve for x1, which is then substituted into the objective function. This yields the unconstrained objective function f(x2) = 14x2² − 36x2 + 36. The optimal value is x2* = 1.286, from which, by substitution, x1* = 1.071. The resulting value of the objective function is f(x)* = 12.857.

³ From Edgar et al. (2001) by permission of McGraw-Hill.


Fig. 7.9 Graphical representation of how direct substitution can reduce a function with two variables x1 and x2 into one with one variable. The unconstrained optimum is at (0, 0) at the center of the contours (From Edgar et al. 2001 by permission of McGraw-Hill)

This simple problem allows a geometric visualization that better illustrates the approach. As shown in Fig. 7.9, the objective function is a paraboloid plotted along the z-axis, with x1 and x2 being the other two axes. The constraint is represented by a plane surface which intersects the paraboloid as shown. The resulting intersection is a parabola whose optimum is the constrained solution being sought. Notice how this constrained optimum differs from the unconstrained optimum, which occurs at (0, 0) (Fig. 7.9). The above approach requires that one variable first be explicitly expressed as a function of the remaining variables and then eliminated from all equations; this procedure is continued till there are no more constraints. Unfortunately, such explicit elimination is rarely possible, so the approach is not of general applicability in most problems.
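The elimination in Example 7.3.2 is easily mirrored in code; a minimal Python sketch (the vertex formula −b/2a for the resulting quadratic is standard calculus):

```python
def f_substituted(x2):
    # eliminate x1 using the constraint 2*x1 + 3*x2 = 6
    x1 = (6 - 3 * x2) / 2
    return 4 * x1**2 + 5 * x2**2   # equals 14*x2**2 - 36*x2 + 36

# unconstrained quadratic in x2: minimum at x2* = 36 / (2 * 14)
x2_star = 36 / 28
x1_star = (6 - 3 * x2_star) / 2
print(x1_star, x2_star, f_substituted(x2_star))
```

This reproduces the values quoted in the example: x1* = 1.071, x2* = 1.286, f(x)* = 12.857.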

7.3.3 Lagrange Multiplier Method for Equality Constrained Problems

A more versatile and widely used approach that allows the constrained problem to be reformulated into an unconstrained one is the Lagrange multiplier approach. Consider an optimization problem involving an objective function y, a set of n decision variables x, and a set of m equality constraints h(x):

Minimize y = y(x)   (objective function)   (7.7a)
Subject to h(x) = 0   (equality constraints)   (7.7b)

The Lagrange multiplier method simply absorbs the equality constraints into the objective function and states that the optimum occurs when the following modified objective function is minimized:

J ≡ min{y(x)} = y(x) − λ1h1(x) − λ2h2(x) − . . . − λmhm(x)   (7.8)

where the quantities λ1, λ2, . . ., λm are called the Lagrange multipliers, one applied to each of the m equality constraints. The optimization problem thus involves minimizing y with respect to both x and the Lagrange multipliers. Eliminating the constraints comes at the price of increasing the dimensionality of the problem from n to (n + m); stated differently, one now seeks the optimal values of (n + m) variables instead of the n variables which optimize the function y.

A simple example with one equality constraint serves to illustrate this approach. The objective function y = 2x1 + 3x2 is to be optimized subject to the constraint x1x2² = 48. Figure 7.10 depicts this problem visually, with the two variables being the two axes and the objective function represented by a series of parallel lines for different assumed values of y. Since the constraint is a curved line, the optimal solution is the point where the tangent vector of the curve (shown as a dotted line) is parallel to these lines (shown as point A).

Example 7.3.3⁴ Optimizing a solar water heater system using the Lagrange method
A solar water heater consisting of a solar collector and a fully mixed storage tank is to be optimized for lowest first cost consistent with the following specified system performance. During the day, the storage temperature is to be raised gradually from an initial 30 °C (equal to the ambient temperature Ta) to a final desired temperature Tmax, while during the night heat is to be withdrawn from storage such that the storage temperature drops back to 30 °C for the next day's

⁴ From Reddy (1987).

operation. The system should be able to store 20 MJ of thermal heat over a typical day of the year, during which HT, the incident radiation over the collector operating time (assumed to be 10 h), is 12 MJ/m². The collector performance characteristics⁵ are FRη0 = 0.8 and FRUL = 4.0 W/m²·°C. The costs of the solar subsystem components are: fixed cost Cb = $600, collector-area-proportional cost Ca = $200/m² of collector area, and storage-volume-proportional cost Cs = $200/m³ of storage volume. Assume that the average inlet temperature to the collector TCi during a charging cycle over a day is equal to the arithmetic mean of Tmax and Ta. Let AC (m²) and VS (m³) be the collector area and storage volume respectively. The objective function is:

J = 600 + 200 AC + 200 VS   (7.9)

In essence, the optimization involves determining the most cost-effective sizes of the collector area and of the storage tank that can deliver the required amount of thermal energy at the end of the day. A larger collector area would require a smaller storage volume (and vice versa); but then the water temperature in the storage tank would be higher, thereby penalizing the thermal efficiency of the solar collector array. The constraint on the daily amount of solar energy collected is found from the collector performance model expressed as:

QC = AC [HT FRη0 − FRUL (TCi − Ta) Δt]   (7.10a)

where Δt is the number of seconds during which the collector operates. Plugging in numerical values results in:

20 × 10⁶ = AC [(12 × 10⁶)(0.8) − (4)(3600 × 10)((Tmax + 30)/2 − 30)]

or, working in MJ:

20 = (11.76 − 0.072 Tmax) AC   (7.10b)

A heat balance on the storage over the day yields:

QC = M cp (Tmax − Tinitial)  or  20 × 10⁶ = VS (1000)(4190)(Tmax − 30)   (7.11)

from which Tmax = 20/(4.19 VS) + 30. Substituting this back into the constraint Eq. 7.10b results in:

AC (9.6 − 0.344/VS) = 20

This allows the combined Lagrangian objective function to be deduced as:

J = 600 + 200 AC + 200 VS − λ [AC (9.6 − 0.344/VS) − 20]   (7.12)

The resulting set of Lagrangian equations is:

∂J/∂AC = 0 = 200 − λ (9.6 − 0.344/VS)
∂J/∂VS = 0 = 200 − λ AC (0.344/VS²)
∂J/∂λ = 0 = AC (9.6 − 0.344/VS) − 20

Solving this set of non-linear equations is not straightforward, and numerical search methods (discussed in Sect. 7.4) need to be adopted. In this case, the sought-after optimal values are AC* = 2.36 m² and VS* = 0.308 m³. The value of the Lagrange multiplier is λ = 23.58, and the corresponding initial cost is J* = $1134. The Lagrange multiplier can be interpreted as a sensitivity coefficient, which in this example corresponds to the marginal cost of solar thermal energy. In other words, increasing the thermal requirement by 1 MJ would increase the initial cost of the optimal solar system by λ = $23.58. ■

Fig. 7.10 Optimization of the linear function y = 2x1 + 3x2 subject to the constraint shown. The problem is easily solved using the Lagrange multiplier method to yield optimal values of (x1* = 3, x2* = 4). Graphically, the optimum point A occurs where the constraint function and the lines of constant y (which, in this case, are linear) have a common normal line indicated by the arrow at A

⁵ See Pr. 5.7 for a description of the solar collector model.
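For this particular example, the numerical search can be reduced by hand: solving the first and third Lagrangian equations for λ and AC and substituting into the second collapses the system to the single relation (9.6 VS − 0.344)² = 6.88, which a simple bisection solves. A Python sketch (the bracket [0.05, 1.0] m³ is an assumed search range):

```python
def residual(vs):
    # combined Lagrangian condition after eliminating lambda and AC:
    # (9.6*VS - 0.344)**2 = 6.88
    return (9.6 * vs - 0.344) ** 2 - 6.88

# bisection on an assumed bracket [0.05, 1.0] m^3
lo, hi = 0.05, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if residual(lo) * residual(mid) <= 0:
        hi = mid
    else:
        lo = mid

vs = 0.5 * (lo + hi)
ac = 20 / (9.6 - 0.344 / vs)        # collector area from the constraint
lam = 200 / (9.6 - 0.344 / vs)      # Lagrange multiplier
cost = 600 + 200 * ac + 200 * vs
print(round(vs, 3), round(ac, 2), round(lam, 2), round(cost))
```

The values agree with those quoted in the example to within its rounding (VS* ≈ 0.309, AC* ≈ 2.36, λ ≈ 23.57, J* ≈ $1133).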

7.3.4 Problems with Inequality Constraints

Most practical problems have constraints in terms of the independent variables, and often these assume the form of


Fig. 7.11 Graphical solution of Example 7.3.4. The dashed constraint lines are x1 ≤ 5.5, x2 ≤ 3.5, and x1 + x2 ≤ 7; the objective function line shown is x1 + 2x2 = 10.5

inequality constraints. There are several semi-analytical techniques that allow the constrained optimization problem to be reformulated into an unconstrained one, and the way this is done is what differentiates these methods. In such a case, one can avoid generalized optimization solvers if such software programs are unavailable. Notice that no inequality constraints appear in Examples 7.3.2 or 7.3.3. When optimization problems involve inequality constraints, these can be re-expressed as equality constraints by introducing additional variables, called slack (or artificial) variables. Each inequality constraint requires a new slack variable. The order of the optimization problem will increase, but the efficiency gained in the subsequent numerical solution approach outweighs this drawback. The following simple example serves to illustrate this approach.

Example 7.3.4 Consider the following problem with x1 and x2 being continuous variables:

Objective function: maximize J = (x1 + 2x2)   (7.13a)

Subject to the constraints:
2x2 ≤ 7
x1 + x2 ≤ 7   (7.13b)
2x1 ≤ 11
x1, x2 ≥ 0

The calculus-based solution involves introducing three additional variables for the first three constraints (to simplify the analysis, the last two constraints requiring x1 and x2 to be positive can be discarded and the resulting solution verified to satisfy them). The problem becomes:

Max J = (x1 + 2x2)
2x2 + x3 = 7
x1 + x2 + x4 = 7
2x1 + x5 = 11

with x3, x4, x5 ≥ 0. Because of the ≥ sign, the slack variables x3, x4, and x5 can only assume zero or positive values. Using standard matrix inversion results in a maximum value of the objective function J* = 10.5 at x1* = 3.5, x2* = 3.5. The graphical solution is shown in Fig. 7.11. The dashed lines indicate the constraints, and the region within which the maximum should lie after meeting the constraints is shown partially hatched. The objective function line, drawn as a solid line, assumes its maximum value at the circled point.

The above example was a linear optimization problem since the objective function and all the constraints were linear. For such problems, the slack variables are first-order unknown quantities. For nonlinear problems involving a non-linear objective function or one or more non-linear constraints, the standard way of introducing slack variables is as quadratic terms. In the example above, if the second constraint were x1² + x2 ≤ 7, it would be expressed as x1² + x2 + x4² = 7. The other slack variables x3 and x5 would then also be represented by their squares.
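Since Example 7.3.4 is a linear program, its optimum lies at a vertex of the feasible region. A brute-force Python sketch that enumerates constraint-boundary intersections (a stand-in for the matrix-inversion step mentioned in the text, not the simplex method itself):

```python
from itertools import combinations

# constraints written as a1*x1 + a2*x2 <= b
constraints = [((0, 2), 7),    # 2*x2 <= 7
               ((1, 1), 7),    # x1 + x2 <= 7
               ((2, 0), 11),   # 2*x1 <= 11
               ((-1, 0), 0),   # x1 >= 0
               ((0, -1), 0)]   # x2 >= 0

def feasible(p, tol=1e-9):
    return all(a[0] * p[0] + a[1] * p[1] <= b + tol for a, b in constraints)

# intersect every pair of constraint boundaries (Cramer's rule) and keep
# the feasible vertices
vertices = []
for (a, b), (c, d) in combinations(constraints, 2):
    det = a[0] * c[1] - a[1] * c[0]
    if abs(det) < 1e-12:
        continue                      # parallel boundaries
    x1 = (b * c[1] - d * a[1]) / det
    x2 = (a[0] * d - c[0] * b) / det
    if feasible((x1, x2)):
        vertices.append((x1, x2))

best = max(vertices, key=lambda p: p[0] + 2 * p[1])
print(best, best[0] + 2 * best[1])
```

This recovers the optimum (3.5, 3.5) with J* = 10.5, matching the graphical solution of Fig. 7.11.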

7.3.5 Penalty Function Method

Another widely used method for constrained optimization is the penalty function (or penalty factor) method, where the problem is converted to an unconstrained one. It is especially useful when the constraint is not very rigid or not very important. Consider the problem stated in Example 7.3.2, with the possibility that the constraints can be inequality constraints as well. Then, a new unconstrained function is framed as:

J ≡ min{y(x)} = min[ y(x) + Σ_{i=1}^{k} Pi (hi)² ]   (7.14)

where Pi is called the penalty factor for constraint i, with k being the number of constraints. The choice of this penalty factor provides the relative weighting of the constraint compared to the function. For high Pi values, the search will satisfy the constraints but move more slowly in optimizing the function. If Pi is too small, the search may terminate without satisfying the constraints adequately. The penalty term can, in general, assume any functional form⁶, but the nature of the problem may often influence the selection. For example, when a forward model is being calibrated with experimental data, one has some prior knowledge of the numerical values of the model parameters. Instead of simply performing a calibration based on minimizing the least-square errors, one could frame the problem as an unconstrained penalty-factor problem where the function to be minimized consists of a term representing the root sum of square errors plus a penalty term given by the square deviations of the model parameters from their respective prior estimates. The following example illustrates this approach, which is further described in Sect. 9.5.2 when dealing with non-linear parameter estimation.

Example 7.3.5 Minimize the following problem using the penalty function approach:

y = 5x1² + 4x2²   s.t. 3x1 + 2x2 = 6   (7.15)

Let us assume a simple form of the penalty factor and frame the problem as:

J* = min(J) = min[ y + P(h)² ] = min[ 5x1² + 4x2² + P(3x1 + 2x2 − 6)² ]   (7.16)

Then:

∂J/∂x1 = 10x1 + 6P(3x1 + 2x2 − 6) = 0   (7.17a)

and

∂J/∂x2 = 8x2 + 4P(3x1 + 2x2 − 6) = 0   (7.17b)

Combining these two equations yields x1 = 6x2/5 which, when substituted back into Eq. 7.17a, gives:

x2 = 36P / (168P/5 + 12)

The optimal values of the variables are found as the limiting values when P becomes very large. In this case, x2* = 1.071 and, subsequently, from x1 = 6x2/5, x1* = 1.286; these are the optimal solutions sought. ■

⁶ The penalty function should always remain positive; for example, one could specify an absolute value for the function rather than the square.
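The limiting behavior in Example 7.3.5 can be checked numerically: for a given P, the two stationarity conditions form a 2×2 linear system. A Python sketch in which P = 10⁶ stands in for "very large":

```python
def penalty_optimum(P):
    # stationarity of J = 5*x1**2 + 4*x2**2 + P*(3*x1 + 2*x2 - 6)**2
    # gives the linear system:
    #   (10 + 18P)*x1 + 12P*x2 = 36P
    #   12P*x1 + (8 + 8P)*x2 = 24P
    a11, a12, b1 = 10 + 18 * P, 12 * P, 36 * P
    a21, a22, b2 = 12 * P, 8 + 8 * P, 24 * P
    det = a11 * a22 - a12 * a21         # simplifies to 80 + 224P
    return ((b1 * a22 - b2 * a12) / det,
            (a11 * b2 - a21 * b1) / det)

x1, x2 = penalty_optimum(1e6)
print(round(x1, 3), round(x2, 3))
```

As P grows, the solution approaches (9/7, 15/14) ≈ (1.286, 1.071), the exact constrained optimum.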

7.4 Numerical Unconstrained Search Methods

Most practical optimization problems must be solved using numerical search methods. The search toward an optimum is done either exhaustively (or blindly) or systematically and progressively, using an iterative approach to gradually zoom in on the optimum. Because the search is performed at discrete points, the precise optimum will not be known. The best that can be achieved is to specify an interval of uncertainty In, which is the range of x values in which the optimum is known to exist after n trials or function calls. The explanation of the various solution methods below is mainly meant for conceptual understanding.

7.4.1 Univariate Methods

The search methods differ depending on whether the problem is univariate or multivariate, unconstrained or constrained, or has continuous or discontinuous functions. Some basic methods for univariate problems without constraints are described below.

(a) Exhaustive or direct search: This is the least imaginative method but is very straightforward, simple to use, and appropriate for simple situations. As illustrated for a maximum-seeking situation in Fig. 7.12, the initial range or interval I0 over which the solution is being sought is divided into a number of discrete equal intervals (which dictates the "interval of uncertainty"), and the function values are calculated at all the points (seven in the figure) simultaneously. The maximum point is then easily identified. The interval of uncertainty after n function calculations is In = I0/[(n + 1)/2].

(b) Basic sequential search: This involves progressively eliminating ranges or regions of the search space based on pairs of observations done sequentially. As illustrated in Fig. 7.13 for the unimodal single-variate problem, one starts by dividing the interval I0 = (a, b) into, say, three intervals and calculating the function values y1 and y2 as shown. If y2 < y1, one can then say that the maximum lies in the range (a, x2), and if y2 > y1, it lies in the range (x1, b). For the case when y1 = y2, one would assume that the optimal value lies close to the center of this interval. Note that the search

Fig. 7.12 Conceptual illustration of the direct search method

Fig. 7.13 Conceptual illustration of the basic sequential search process

Fig. 7.14 Comparison of reduction ratio of different univariate search methods. Note the logarithmic scale of the ordinate. (Adapted from Stoecker 1989)

process reuses one of the two function evaluations for the next step. The search is continued until the optimum point is determined within the preset range of uncertainty. This search algorithm is not very efficient computationally, and better numerical methods are available.

(c) More efficient sequential search methods: These methods differ from the basic sequential search method in that irregularly spaced intervals are used over the search interval. The three most commonly used are the dichotomous search, the Golden Section search, and the Fibonacci method (see, e.g., Beveridge and Schechter 1970 or Venkataraman 2002). Numerical efficiency (or power) of a solution method involves both robustness of the solution and fast execution time. A metric used to compare the execution time of different search methods is the reduction ratio RR = (I0/In), where I0 is the original interval of uncertainty and In is the range of uncertainty after n trials. Figure 7.14 allows a comparison of the three sequential search methods with the exhaustive search method. The dichotomous search algorithm places the two starting points closer to the mid-point rather than spacing them equally over the starting interval as in the basic search procedure. This narrows the search interval for the next iteration, which increases the search efficiency and the RR. The way the spacing of the two points is selected is what differentiates the three algorithms.
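Of the three, the Golden Section scheme is the simplest to sketch in code. A minimal Python version, written as a minimizer (maximizing f corresponds to minimizing −f); the 0.618 ratio and the reuse of one interior point per iteration are the essence of the method:

```python
def golden_section_min(f, a, b, tol=1e-6):
    r = 0.6180339887498949          # Golden Section ratio
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                 # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)    # only one new function call needed
            f1 = f(x1)
        else:                       # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

# illustrative use: minimize a simple convex function on [0, 5]
x_star = golden_section_min(lambda x: x * x - 4 * x + 5, 0.0, 5.0)
print(round(x_star, 4))
```

Because r² = 1 − r, the surviving interior point of the old interval is exactly an interior point of the new one, which is why each iteration costs a single function evaluation while shrinking the interval by the factor 0.618.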


Fig. 7.15 (a) The Golden Section rule divides a segment into two intervals following the ratio r = 0.618 as shown. (b) Two search points x1 and x2 are determined within the search interval (a, b) following r1 = r2 = r

The Fibonacci method is said to be the most efficient, especially when greater accuracy is demanded. It requires that the number of trials n be selected in advance, which is a limitation when one has no prior knowledge of the behavior of the function near the maximum. If the function is very steep close to the maximum, selecting too few points would not yield an accurate estimate of the maximum. Another minor disadvantage is that it requires calculations of the function y(x) at rather odd values of x. A modified Fibonacci search method has also been developed for situations where one is unable to select the number of trials in advance (Beveridge and Schechter 1970). The Golden Section method is a compromise, being slightly less efficient than the Fibonacci method but requiring no preselection of the number of trials n. While the Fibonacci method requires different ratios of the ranges to be selected as the search progresses, the Golden Section method assumes only a single value, namely 0.618. Figure 7.15 illustrates how the two initial intervals r1 = r2 = r allow the points x1 and x2 to be determined. Two function calls at x1 and x2 allow one to narrow the interval (a, b) to (x1, b). The next calculation is done by considering the interval (x1, b) and dividing it again into two ranges with end points (x1, x2′). The function value at x1 can be reused, while the function value at the new point x2′ has to be computed afresh. The search is thus continued until an acceptable range of uncertainty is reached. The Golden Section method is robust and is a widely used numerical search method. It is obvious that the above discussion is pertinent for unimodal functions; multimodal functions require much greater care.

(d) Newton-Raphson method

Gradient-based methods allow faster convergence than the previous methods. They are numerical methods, but the step size is varied based on the slope of the function. The Newton-Raphson method is quite popular; it was developed for finding roots of single non-linear equations but can be applied to optimization problems as well if the equation is taken to be the derivative of the objective function. It is iterative, with each step determined from a linear Taylor series expansion of the derivative of the objective function, namely (dJ/dx). The algorithm is essentially as follows: (i) assume a starting value x0 and calculate the first derivative (dJ/dx) at x0; (ii) determine the slope of the first-derivative function (d²J/dx²) at x0; (iii) update the search value based on this slope to determine the next value x1 = (x0 + Δx); (iv) continue till the desired convergence is reached. Step (iii) of the algorithm is based on the following linear approximation, whose left-hand side is set to zero:

dJ/dx (x0 + Δx) = dJ/dx (x0) + d²J/dx² (x0) · Δx   (7.18a)

which can be rewritten in terms of the step size:

Δx = − (dJ/dx)(x0) / (d²J/dx²)(x0)   (7.18b)

Example 7.4.1⁷ Minimize

J(x) = (x − 1)² (x − 2)(x − 3)   s.t. 2 ≤ x ≤ 4   (7.19)

The numerical search process is illustrated graphically in Fig. 7.16. As shown, the function has roots (i.e., it cuts the x-axis) at points 1, 2, and 3. The function J(x) and its first derivative are also plotted in the figure. The first derivative of the objective function is the equation whose roots must be determined:

(dJ/dx) = 2(x − 1)(x − 2)(x − 3) + (x − 1)²(x − 3) + (x − 1)²(x − 2) = 0

and the second derivative is:

(d²J/dx²) = 2(x − 2)(x − 3) + 4(x − 1)(x − 3) + 4(x − 1)(x − 2) + 2(x − 1)²

⁷ From Venkataraman (2002).


Fig. 7.16 Graphical illustration of Example 7.4.1 using the Newton-Raphson method

Assume, say, a starting value of x0 = 3 (note: this value is selected to fall within the range constraint stipulated in the problem). The function value is J = 0, the first derivative (dJ/dx) = 4, and (d²J/dx²) = 16. From Eq. (7.18b), Δx = −0.25. Taking x1 = 2.75, the second iteration yields J = −0.5742, (dJ/dx) = 0.875 and (d²J/dx²) = 9.25, from which Δx = −0.0946 and, at the updated point, (dJ/dx) = 0.104. The value of the derivative has decreased, indicating that we are closer to the root, but more iterations are required. In fact, five iterations are needed to reach a value of (dJ/dx) = 0, at which point x = 2.6404 and J = −0.6197. As a precaution, especially when dealing with non-linear functions, it is urged that the search be repeated with different starting values to assure oneself that a global minimum has indeed been reached. It is important to realize that if one had stipulated the constraint as 0 ≤ x ≤ 4 and chosen a starting value of x0 = 0.5, the solution would have converged to x = 1, which is a local minimum (see Fig. 7.16). This highlights the dangers of local convergence when dealing with non-linear functions. Further, it is clear that fewer iterations are needed if the starting value is close to the global solution. Under certain circumstances, all gradient-based methods may fail to converge, such as when the function derivative during the search equals zero. Algorithms based on a combination of gradient-based and search methods have also been developed; they exhibit desirable robustness while remaining efficient.
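The iteration of Example 7.4.1 is short to code. A Python sketch, starting at x0 = 3 with a fixed iteration count in place of a convergence test:

```python
def dJ(x):
    # first derivative of J(x) = (x-1)**2 * (x-2) * (x-3)
    return (2 * (x - 1) * (x - 2) * (x - 3)
            + (x - 1)**2 * (x - 3)
            + (x - 1)**2 * (x - 2))

def d2J(x):
    # second derivative of J(x)
    return (2 * (x - 2) * (x - 3) + 4 * (x - 1) * (x - 3)
            + 4 * (x - 1) * (x - 2) + 2 * (x - 1)**2)

x = 3.0                             # starting value inside 2 <= x <= 4
for _ in range(20):
    x -= dJ(x) / d2J(x)             # Newton step, Eq. 7.18b

J = (x - 1)**2 * (x - 2) * (x - 3)
print(round(x, 4), round(J, 4))
```

The first two iterates reproduce the hand calculation (3 → 2.75 → 2.655), and the loop settles at x ≈ 2.6404 with J ≈ −0.6197. Restarting from x = 0.5 would instead converge to the local minimum at x = 1, as discussed above.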

Fig. 7.17 Conceptual illustration of finding a minimum point of a bi-variate function using a pattern or lattice search method. From an initial point 1, the best subsequent move involves determining the function values around that point at discrete grid points (points 2 through 9) and moving to the point with the lowest function value

7.4.2 Multivariate Methods

Realistically, the efficiency of single-variate search algorithms is not a concern given the computing power available nowadays. For multivariate problems, on the other hand, efficiency is a critical aspect since the number of combinations increases exponentially. Numerous multivariate search methods are available in the literature, but only the basic ones are described below. These methods can be grouped into zero-order (when only the function values are to be determined), first-order (based on the first derivative or linear gradient), and second-order (requiring the second-order derivatives or quadratic polynomials). One can also categorize the methods as "numerical" or "analytical" or a combination of both. Search methods are most robust and appropriate for non-differentiable or discontinuous functions, while calculus-based methods are generally efficient for problems with continuous functions. One distinguishes between valley-descending methods for minimization problems and hill-climbing methods for maximization problems. Three general solution approaches for unconstrained optimization problems are described below in terms of bi-variate functions for easier comprehension.

(a) Pattern or lattice search is a directed search method where one starts at one point in the search space (shown as point 1 in Fig. 7.17), calculates values of the function at several points around the initial point (points 2–9), and moves to the point which has the lowest value (shown as point 5). This process is repeated till the overall minimum is found. This combination of exploratory moves and heuristic search done iteratively is the basis of the Hooke-Jeeves pattern search method, which is quite popular. Sometimes, one may use a coarse grid search initially; find the optimum within an interval of


Fig. 7.18 Conceptual illustration of finding the minimum point of a bi-variate function using the univariate search method. From an initial point 1, the optimal value of x2 is found keeping x1 fixed, and so on till the optimal point 5 is found

uncertainty, and then repeat the search using a finer grid. Note that this is not a calculus-based method, since the function calls are made at discrete surrounding points, and it is not very efficient computationally, especially for higher-dimension problems. However, it is more robust than calculus-based algorithms and simple to implement.

(b) Univariate search method (Fig. 7.18) involves finding the minimum of one variable at a time, keeping the others constant, and repeating this process iteratively. This method accelerates the process of reaching the minimum (or maximum) point of a function compared to the lattice search. One starts by using some preliminary values for all variables other than the one being optimized and finds the optimum value for the selected variable using a one-dimensional search process or a calculus-based one-step process (shown as x1). One then selects a second variable to optimize while retaining this optimal value of the first variable, finds the optimal value of the second variable, and so on for all remaining variables, until no significant improvement is found between successive searches. The entire process often requires more than one iteration, as shown in Fig. 7.18; a real danger is that the search can get trapped in a local minimum region. The one-dimensional searches do not necessarily require numerical derivatives, giving this algorithm an advantage with functions that are not easily differentiable or are discontinuous. However, the search process is inherently not very efficient.

Example 7.4.2 Illustration of the univariate search method
Consider the following function of two variables, which is to be minimized using the univariate search process starting with an initial value of x2 = 3.


Fig. 7.19 Conceptual illustration of finding a minimum point of a bi-variate function using the steepest descent search method. From an initial point 1, the gradient of the function is determined and the next search point is obtained by moving in that direction, and so on till the optimal point 4 is found

y = x1 + 16/(x1·x2) + x2/2    (7.20a)

First, the partial derivatives are found:

∂y/∂x1 = 1 - 16/(x2·x1²)  and  ∂y/∂x2 = -16/(x1·x2²) + 1/2    (7.20b)

Next, the initial value of x2 = 3 is used to find the next iterative value of x1 from the (∂y/∂x1) function as follows:

1 - 16/((3)·x1²) = 0, from where x1 = 4/√3 = 2.309

The other partial derivative is finally used with this value of x1 to yield:

-16/((4/√3)·x2²) + 1/2 = 0, from where x2 = 3.722
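The alternating update cycle of Example 7.4.2 can be sketched in code; the closed-form updates below are simply the two first-order conditions solved for one variable at a time, and the iteration count is an arbitrary illustrative choice:

```python
from math import sqrt

# Example 7.4.2: y = x1 + 16/(x1*x2) + x2/2, minimized one variable at a time
y = lambda x1, x2: x1 + 16.0 / (x1 * x2) + x2 / 2.0

x2 = 3.0                      # initial guess, as in the text
for cycle in range(25):
    x1 = sqrt(16.0 / x2)      # solves dy/dx1 = 1 - 16/(x2*x1^2) = 0
    x2 = sqrt(32.0 / x1)      # solves dy/dx2 = -16/(x1*x2^2) + 1/2 = 0

print(round(x1, 3), round(x2, 3))   # approaches the optimum x1 = 2, x2 = 4
```

The first cycle reproduces the hand calculation above (x1 = 2.309, then x2 = 3.722), and repeated cycles converge rapidly to the optimum.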

The new value of x2 is now used for the next cycle, and the iterative process repeated until consecutive improvements turn out to be sufficiently small to suggest convergence. It is left to the reader to verify that the optimal values are x1 = 2 and x2 = 4. ■

(c) Steepest-descent search (Fig. 7.19) is a widely used calculus-based approach because of its efficiency. The computational algorithm involves three steps: one starts with a guess value (represented by point 1), which is selected somewhat arbitrarily but, if possible, close to the optimal value. One then evaluates the gradient of the function at the current point by computing the partial derivatives either analytically or numerically. Finally,


one moves along this gradient (hence the terminology "steepest") by deciding, somewhat arbitrarily, on the step size. The relationship between the step sizes Δxi and the partial derivatives (∂y/∂xi) is:

Δx1/(∂y/∂x1) = Δx2/(∂y/∂x2) = ... = Δxi/(∂y/∂xi)    (7.21)

Steps 2–3 are performed iteratively until the minimum (or maximum) point is reached. A note of caution: too large a step size can result in numerical instability, while too small a step size increases computation time. The above valley-descending methods are suited for function minimization. The algorithm is easily modified to the hill-climbing situation for problems requiring maximization.

Example 7.4.3 Illustration of the steepest descent method
Consider the following function with three variables to be minimized:

y = 72x1/x2 + 360/(x1·x3) + x1·x2 + 2x3    (7.22)

Assume a starting point of (x1 = 5, x2 = 6, x3 = 8). At this point, the value of the function is y = 115. These numerical values are inserted in the expressions for the partial derivatives:

∂y/∂x1 = 72/x2 - 360/(x3·x1²) + x2 = 72/6 - 360/((8)·(5)²) + 6 = 16.2
∂y/∂x2 = -72x1/x2² + x1 = -72(5)/(6)² + 5 = -5
∂y/∂x3 = -360/(x1·x3²) + 2 = -360/((5)·(8)²) + 2 = 0.875    (7.23)

In order to compute the next point, a step size must be assumed. Arbitrarily assume Δx1 = -1 (verify that taking a negative value results in a decrease in the function value y). Applying Eq. 7.21 with the values of Eq. 7.23 results in Δx1/16.2 = Δx2/(-5) = Δx3/0.875, from where Δx2 = 0.309 and Δx3 = -0.054. Thus, the new point is (x1 = 4, x2 = 6.309, x3 = 7.946). The reader can verify that the new point has resulted in a decrease in the functional value from 115 to 98.1. Repeated use of the search method will gradually result in the optimal value being found. ■

(d) More Efficient Methods based on Quadratic Convergence. An improvement over the pattern or lattice search algorithm is Powell's Conjugate Direction Method, which searches along a set of mutually conjugate directions (a generalization of orthogonality adapted to the curvature of the objective function). Instead of discrete function calls at surrounding points to direct the search as in pattern search (referred to as a zero-order method since it does not require derivatives to be determined), this method uses a quadratic polynomial approximation for local interpolation of the objective function. The solution of the quadratic approximation serves as the starting point of the next iteration, and so on, while providing an indication of the step size as well. Note that if the objective function is quadratic to start with, the approximation is exact, and the minimum point is found in one step. Otherwise, several iterations are needed, but the convergence is very rapid. The Powell algorithm is not a calculus-based method; however, it is unsuitable for complicated non-linear objective functions. Also, it may be computationally inefficient for higher-dimensional problems and, worse, may not converge to the global optimum if the search space is not symmetrical. The Fletcher-Reeves Method greatly improves the search efficiency of the steepest gradient method by adopting the concept of quadratic convergence. The transformation allows determining good search directions and distances based on the shape of the target function near the initial guess minimum, and then progresses towards the local minimum. This is a calculus-based method which involves using the Hessian matrix to determine the search direction. The advantage is that this method uses information about the local curvature of the objective function as well as its local gradients, which often tends to stabilize the search results. Textbooks such as Venkataraman (2002) provide more details about the mathematical theory and ways to code these methods into a software program.
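The steepest-descent iteration of Example 7.4.3 can be sketched as follows; the analytic gradient reproduces the values of Eq. 7.23 at the starting point, while the fixed step size and iteration count are our own illustrative choices (the text instead sets the step via Eq. 7.21 with an assumed Δx1):

```python
# Example 7.4.3: y = 72*x1/x2 + 360/(x1*x3) + x1*x2 + 2*x3 (Eq. 7.22)
def y(x1, x2, x3):
    return 72.0 * x1 / x2 + 360.0 / (x1 * x3) + x1 * x2 + 2.0 * x3

def grad(x1, x2, x3):
    # partial derivatives of Eq. 7.23
    return (72.0 / x2 - 360.0 / (x3 * x1 ** 2) + x2,
            -72.0 * x1 / x2 ** 2 + x1,
            -360.0 / (x1 * x3 ** 2) + 2.0)

x = [5.0, 6.0, 8.0]            # starting point from the text, where y = 115
g0 = grad(*x)                  # (16.2, -5.0, 0.875), matching Eq. 7.23

alpha = 0.01                   # small fixed step size (illustrative choice)
for _ in range(5000):
    g = grad(*x)
    x = [xi - alpha * gi for xi, gi in zip(x, g)]
```

With a small enough fixed step, the iterates descend steadily from y = 115 toward the (unconstrained) minimum of the function; too large a step would exhibit exactly the numerical instability cautioned against above.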

7.5 Linear Programming (LP)

7.5.1 Standard Form

Recall the concept of numerical efficiency (or power) of a method of solution, involving both robustness of the solution and fast execution times. Optimization problems which can be framed as a linear problem (even at the expense of a little loss in accuracy) have great numerical efficiency. Only if the objective function and the constraints (either equalities or inequalities) are both linear functions is the problem designated as a linear optimization problem; otherwise, it is deemed a non-linear optimization problem. The objective function can involve one or more functions to be either minimized or maximized (either objective can be treated identically since it is easy to convert one into the other).

8 "Programming" is synonymous with planning activities or optimization in operations research.


The standard form of linear programming problems is:

minimize f(x) = cᵀx    (7.24a)

subject to g(x): Ax = b    (7.24b)

where x is the column vector of variables of dimension n, b that of the constraint limits of dimension m, c that of the cost coefficients of dimension n, and A is the (m × n) matrix of constraint coefficients.

Example 7.5.1 Express the following linear two-dimensional problem in standard matrix notation:

Maximize f(x): 3186 + 620x1 + 420x2    (7.25a)

subject to
g1(x): 0.5x1 + 0.7x2 ≤ 6.5
g2(x): 4.5x1 - x2 ≤ 35    (7.25b)
g3(x): 2.1x1 + 5.2x2 ≤ 60

with range constraints on the variables x1 and x2 being that these should not be negative. This is a problem with two variables (x1 and x2). However, three slack variables need to be introduced to reframe the three inequality constraints as equality constraints. This makes the problem one with five unknown variables. The three inequality constraints are rewritten as:

g1(x): 0.5x1 + 0.7x2 + x3 = 6.5
g2(x): 4.5x1 - x2 + x4 = 35    (7.26)
g3(x): 2.1x1 + 5.2x2 + x5 = 60

Hence, the terms appearing in the standard form (Eqs. 7.24a and b) are:

c = [-620 -420 0 0 0]ᵀ,  x = [x1 x2 x3 x4 x5]ᵀ,

A = | 0.5  0.7  1  0  0 |
    | 4.5  -1   0  1  0 |
    | 2.1  5.2  0  0  1 |,   b = [6.5 35 60]ᵀ    (7.27)

Note that the objective function is recast as a minimization problem simply by reversing the signs of the coefficients. Also, the constant does not appear in the optimization since it can be simply added to the optimal value of the function at the end. Step-by-step solutions of such optimization problems are given in several textbooks such as Edgar et al. (2001), Hillier and Lieberman (2001) and Stoecker (1989).

A commercial optimization software program was used to determine the optimal value of the above objective function: f*(x) = 9803.8. Note that in this case, since the inequalities are of the "less than or equal to" type, the numerical values of the slack variables (x3, x4, x5) will be non-negative. The optimal values for the primary variables are x1* = 8.493 and x2* = 3.219, while those for the slack variables are x3* = 0, x4* = 0, and x5* = 25.424 (implying that constraints 1 and 2 in Eq. 7.25b have turned out to be binding, i.e., equality constraints). ■
There is a great deal of literature on efficient algorithms to solve linear problems, which are referred to as linear programming methods. Because of its efficiency, the Simplex algorithm is the most popular numerical technique for solving large sets of linear equations. It proceeds by moving from one feasible solution to another, with each step improving the value of the objective function; it also provides the necessary information for performing a sensitivity analysis at the same time. Hence, formulating problems as linear problems (even when they are not strictly so) has a great advantage in the solution phase. Such problems arise in numerous real-world applications where limited resources such as machines (an airline with a fixed number of airplanes that must serve a preset number of cities each day), material, etc. are to be allocated or scheduled in an optimal manner among several competing solutions/pathways.
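Because Example 7.5.1 has only two primary variables, the optimum quoted above can be cross-checked by brute-force vertex enumeration, i.e., intersecting every pair of constraint boundaries and keeping the best feasible corner point (a sketch for illustration, not an efficient LP solver):

```python
from itertools import combinations

# constraints of Eq. 7.25b written as a*x1 + b*x2 <= r; the last two rows
# encode the non-negativity bounds as -x1 <= 0 and -x2 <= 0
cons = [(0.5, 0.7, 6.5), (4.5, -1.0, 35.0), (2.1, 5.2, 60.0),
        (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
obj = lambda x1, x2: 3186.0 + 620.0 * x1 + 420.0 * x2   # Eq. 7.25a

best = None
for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue                        # parallel boundary lines
    x1 = (r1 * b2 - r2 * b1) / det      # intersection point (Cramer's rule)
    x2 = (a1 * r2 - a2 * r1) / det
    if all(a * x1 + b * x2 <= r + 1e-9 for a, b, r in cons):
        if best is None or obj(x1, x2) > best[0]:
            best = (obj(x1, x2), x1, x2)

fstar, x1s, x2s = best
```

The best feasible vertex lies at the intersection of constraints g1 and g2, consistent with the zero slack values x3* = x4* = 0 reported above.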

7.5.2 Example of an LP Problem9

This example illustrates how the objective function and the constraints are to be framed given the necessary data, and then expressed in standard form. Specifically, the problem involves optimizing the mix of different mitigation technology pathways available to reduce pollution from a steel-manufacturing company. A steel company wishes to reduce its pollution emissions (specifically particulates, sulfur oxides, and hydrocarbons) which are generated in two of its processes: blast furnaces for making pig iron and open-hearth furnaces for changing iron into steel. For both types of equipment, there are three viable technological solutions: taller smokestacks, filters, and better fuels. The required total reductions for each of the three pollutants and the reduction of emissions from each abatement technological option are given in Table 7.1, while the costs of the abatement methods are shown in Table 7.2.

9 Adapted from Hillier and Lieberman (2001).

Table 7.1 Maximum feasible reduction in emission rates (10⁶ lb/year) for different abatement technologies, and required total annual emission reductions for different pollutants

                 Taller smokestacks     Filters                Better fuels           Reqd emission
Pollutant        Blast   Open-hearth    Blast   Open-hearth    Blast   Open-hearth    reduction (10⁶ lb/year)
Particulates     12      9              25      20             17      13             60
Sulfur oxides    35      42             18      31             56      49             150
Hydrocarbons     37      53             28      24             29      20             125

Table 7.2 Annual costs for different abatement technologies if maximum feasible capacity is implemented

Abatement method      Cost-Blast furnace ($ millions)    Cost-Open-hearth furnaces ($ millions)
Taller smokestacks    8                                  10
Filters               7                                  6
Better fuels          11                                 9

Table 7.3 Decision variables: fractions (xi) of the maximum feasible capacity of an abatement technology implemented for the two manufacturing options

Abatement method      Blast furnaces    Open-hearth furnaces
Taller smokestacks    x1                x2
Filters               x3                x4
Better fuels          x5                x6

This problem is formulated in terms of the six fractions xi shown in Table 7.3. The unit costs (shown in Table 7.2) are assumed to be constant, and so are the emission rates (shown in Table 7.1). The objective function is framed in terms of minimizing total cost subject to the required emission-reduction constraints and the fact that the fractions xi must be non-negative and less than 1.

Minimize J = 8x1 + 10x2 + 7x3 + 6x4 + 11x5 + 9x6
s.t. 12x1 + 9x2 + 25x3 + 20x4 + 17x5 + 13x6 = 60
     35x1 + 42x2 + 18x3 + 31x4 + 56x5 + 49x6 = 150
     37x1 + 53x2 + 28x3 + 24x4 + 29x5 + 20x6 = 125
     x1, x2, x3, x4, x5, x6 ≥ 0
     x1, x2, x3, x4, x5, x6 ≤ 1    (7.28a, b)

Following the standard notation (Eqs. 7.24a and b), the problem is stated as:

Minimize J = cᵀx
Constraints: g(x): Ax = b

c = [8 10 7 6 11 9]ᵀ,  x = [x1 x2 x3 x4 x5 x6]ᵀ,

A = | 12  9   25  20  17  13 |
    | 35  42  18  31  56  49 |
    | 37  53  28  24  29  20 |,   b = [60 150 125]ᵀ    (7.29)

A commercial solver yields the following values at the optimal point: x1* = 1.0, x2* = 0.623, x3* = 0.343, x4* = 1.0, x5* = 0.048, x6* = 1.0, while the optimal objective function (total cost) is J* = $32.154 million.

Thus, since x1* = 1.0, all the blast furnaces that can be converted to have taller smokestacks should be modified accordingly, and so on. It is left to the reader to perform a post-optimality analysis to determine the sensitivity of the solution. For example, the individual manufacturing units come in discrete sizes, and so the six fractions have to be rounded to the closest number of units; how the total cost and the total emissions change in such a case needs to be assessed. Strictly, this example is a mixed integer programming problem, which is usually harder to solve. The above approach is a simplification which works well in most cases.
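The reported optimum can be spot-checked by substituting it back into the constraint rows of Eq. 7.29 (a plain consistency check, not a solver; the small residuals arise because the reported fractions are rounded to three decimals):

```python
# data of Eq. 7.29: emission-reduction coefficients, targets, and unit costs
A = [[12, 9, 25, 20, 17, 13],
     [35, 42, 18, 31, 56, 49],
     [37, 53, 28, 24, 29, 20]]
b = [60, 150, 125]
c = [8, 10, 7, 6, 11, 9]
x = [1.0, 0.623, 0.343, 1.0, 0.048, 1.0]   # reported optimal fractions

# residual of each emission-reduction constraint (should be near zero)
residuals = [sum(aij * xj for aij, xj in zip(row, x)) - bi
             for row, bi in zip(A, b)]
# total annual cost in $ millions (should be near the reported J*)
cost = sum(ci * xi for ci, xi in zip(c, x))
```

Such a substitution check is a useful habit whenever optimal values are quoted to limited precision, since it also reveals which constraints are active.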

7.5.3 Linear Network Models

Network models are suitable for modeling systems with discrete nodes/vertices/points (such as junctions) interconnected by links/lines/edges/branches (such as water pipes for district energy distribution or power lines) through which matter or power can flow. They have been increasingly applied to engineered infrastructures recovering from partial or complete failure due to extreme weather events10 (such as electric power transmission and wide-area water distribution systems) and have even found applications in the social sciences and for modeling internet and social media communication interactions/dynamics (a comprehensive text is that by Newman 2010). A recent building sciences application of network modeling was developed by Sonta and Jain (2020), in which a social and organizational human network structure was learned using ambient sensing data from distributed plug load energy sensors in commercial buildings. In essence, network models are "simplified representations that reduce a system to an abstract structure capturing only the basics of connection patterns with vertices or nodes for components and edges capturing some basic relationship of the node and of the system" (Alderson and Doyle 2010). The focus is on the topology of the essential structural interconnections among components (and not just on individual components), and on the behavior of the system under event-based disruptions of such interconnections. A network with m edges connecting n nodes (called a graph) can be formulated as a set of n linear equations using Kirchhoff's conservation law, which results in a (m × n) matrix called an incidence matrix. For simple cases without complex constraints, this leads to direct solutions (see Strang 1998 or Newman 2010). However, any realistic network would generally require a numerical method to solve the set of linear equations. It has been pointed out (for example, by Alderson et al. 2015) that the representation of an actual engineered system by a simplified surrogate network can be misleading if done simplistically. Hence, some sort of validation of the network topology, modeling equations, and simulation is needed before one can place confidence in the analysis results.

10 Such issues are now being increasingly studied under the general area referred to as "resilience."
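To make the incidence-matrix idea concrete, the following builds the (m × n) matrix for a small hypothetical directed network and evaluates the Kirchhoff-style balance at each node for a candidate set of link flows (the network topology and flow values are illustrative, not from the text):

```python
# hypothetical directed network: 4 nodes, 5 links (labels are illustrative)
nodes = [1, 2, 3, 4]
links = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

# (m x n) incidence matrix: one row per link, -1 at the tail node, +1 at
# the head node, 0 elsewhere
B = [[(-1 if n == u else 1 if n == v else 0) for n in nodes]
     for (u, v) in links]

# Kirchhoff-style node balance for a candidate link-flow vector:
# the net inflow at node j is sum_i B[i][j] * flow[i]
flow = [3.0, 2.0, 1.0, 2.0, 3.0]
net = [sum(B[i][j] * flow[i] for i in range(len(links)))
       for j in range(len(nodes))]
```

For this flow pattern, the interior nodes balance exactly, while node 1 acts as a source (net -5) and node 4 as a sink (net +5); each row of B sums to zero by construction, which is what encodes the conservation law.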

7.5.4 Example of Maximizing Flow in a Transportation Network

This example is taken from Vugrin et al. (2014) to illustrate how to analyze transportation networks. Figure 7.20 represents a simple network with 7 nodes and 12 links where the objective is to maximize flow from one starting node 1 to another specified end node 7. The limiting capacities of the various links are specified (and indicated in the figure). A fictitious return link (shown dotted) with infinite capacity needs to be introduced to complete the circuit. The optimization model for this flow problem can be framed as:

max x71(t)
s.t.  Σ_{i∈In} xi(t) - Σ_{i∈On} xi(t) = 0,  n = 1, ..., 7
      0 ≤ xi(t) ≤ Ki(t),  ∀i    (7.30)

Note that the time element t has been shown even though this analysis only involves the steady-state situation. The second expression is the conservation constraint whereby the sum of inflows In is equal to the sum of outflows On at each node n. Ki denotes the limiting capacity of link i. The symbol ∀ is used to state that the constraint applies to all members i of the set. For the uninterrupted case, i.e., when no links are broken, the maximum flow is 14 units, while the flows through the

Fig. 7.20 Flow network topology with 7 nodes and 12 links. The intent is to maximize flow from node 1 to node 7 under different breakage scenarios. The limiting flow capacities of various links are shown above the corresponding lines. The dotted line is a fictitious link to complete the flow circuit

Table 7.4 Flows through different links for two scenarios in order to maximize total flow from node 1 to node 7 (see Fig. 7.20)

Link            Uninterrupted case    Compromised case (links 1-4, 2-3, 3-4 are broken)
1-2             3.0                   3.0
1-3             7.0                   7.0
1-4             4.0                   –
2-3             0.0                   –
2-5             3.0                   3.0
3-4             0.0                   –
3-5             4.0                   4.0
3-6             3.0                   3.0
4-6             4.0                   0.0
6-5             1.0                   0.0
5-7             8.0                   7.0
6-7             6.0                   3.0
Flow from 1-7   14.0                  10.0

individual links are assembled in Table 7.4. The same equations can be modified to analyze the situation when one or more of the links breaks. The corresponding flows for a breakage scenario when links 1-4, 2-3 and 3-4 are compromised are also assembled in Table 7.4. In this case, the maximum flow reduces to 10 units. Such analyses can be performed assuming different scenarios of one or more link breakages. Such types of evaluations are usually done in the framework of reliability analyses during the design of the networks.
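The flows reported in Table 7.4 for the uninterrupted case can be checked directly against the conservation constraint of Eq. 7.30:

```python
# link flows from Table 7.4, uninterrupted case (link (u, v) carries f units)
flows = {(1, 2): 3.0, (1, 3): 7.0, (1, 4): 4.0, (2, 3): 0.0, (2, 5): 3.0,
         (3, 4): 0.0, (3, 5): 4.0, (3, 6): 3.0, (4, 6): 4.0, (6, 5): 1.0,
         (5, 7): 8.0, (6, 7): 6.0}

def net_outflow(n):
    # conservation constraint of Eq. 7.30: outflow minus inflow at node n
    out = sum(f for (u, v), f in flows.items() if u == n)
    inn = sum(f for (u, v), f in flows.items() if v == n)
    return out - inn

# intermediate nodes 2..6 must balance; node 1 supplies what node 7 absorbs
balanced = all(abs(net_outflow(n)) < 1e-9 for n in range(2, 7))
total = net_outflow(1)
```

The check confirms that every intermediate node balances and that the net supply at node 1 equals the 14 units absorbed at node 7, as reported in the table.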

7.5.5 Mixed Integer Linear Programming (MILP)

Mixed integer problems (MILP) are a special category of linear optimization problems where some of the variables are integers or even binary variables (such as a piece of


equipment being on or off or for a "yes/no" decision coded as 1 and 0). Integer/binary variables arise in scheduling problems involving multiple pieces of equipment. For example, if a large facility has numerous power generation units to meet a variable load, determining which units to operate so as to minimize operating costs would be an integer problem, while determining the fraction of their rated capacity at which to operate would be a continuous variable problem. Both issues taken together would be treated as a MILP problem (as illustrated in the solved example in Sect. 7.7). For example, Henze et al. (2008) developed and validated an optimization environment for a pharmaceutical facility with a chilled water plant with ten different chillers (electrical and absorption) that adopts mixed integer programming to optimize chiller selection (scheduling) and dispatch for any cooling load condition, while an overarching dynamic programming approach selects the optimal charge/discharge strategy of the chilled water thermal energy storage system. Another example is a manufacturer who can produce different types of widgets and must decide how many items of each widget type to manufacture in order to maximize profit. Typically, such problems can be set up as standard linear optimization problems, with the added requirement that some of the variables must be integers. MILP problems are generally solved using an LP-based branch-and-bound algorithm (see Hillier and Lieberman 2001). The basic LP-based branch-and-bound approach can be described as follows. Start by removing all the integrality restrictions on decision variables which can only take on integer values. The resulting LP is called the LP relaxation of the original MILP. On solving this problem, if it so happens that the result satisfies all the integrality restrictions, even though these were not explicitly imposed, then that is the optimal solution sought.
If not, as is usually the case, then the normal procedure is to pick one variable that is restricted to be integer, but whose value in the LP relaxation is fractional. For the sake of argument, suppose that this variable is x and its value in the LP relaxation is 3.3. One can then exclude this value by, in turn, imposing the constraints x ≤ 3.0 and x ≥ 4.0. This process is done sequentially for all the integer variables and is somewhat tedious. Usually, commercial optimization programs have in-built capabilities, and the user does not need to specify/perform such additional steps. Generally, it can be stated that (mixed) integer programming problems are much harder to solve than linear programming problems. MILP problems also arise in circumstances where the constraints are either-or, or the more general case when “K out of N” constraints need to be satisfied. A simple example illustrates the former case (Hillier and Lieberman 2001). Consider a case when at least one of the inequalities must hold:

7

Optimization Methods

Either 3x1 + 2x2 ≤ 18
Or x1 + 4x2 ≤ 16    (7.31)

This can be reformulated as:

3x1 + 2x2 ≤ 18 + My
x1 + 4x2 ≤ 16 + M(1 - y)    (7.32)

where M is a very large number and y is a binary variable (either 1 or 0). Solving these two constraints along with the objective function will provide the solution. A practical example of how MILP can be used for supervisory control is given in Sect. 7.7 wherein the component models of various energy equipment are framed as linear functions (a useful simplification in many cases).
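The equivalence between the either-or form (Eq. 7.31) and the big-M reformulation (Eq. 7.32) can be illustrated by enumerating the binary variable y over a grid of sample points (the grid and the value of M are our own illustrative choices):

```python
M = 1e6   # a sufficiently large constant for the points considered

def either_or(x1, x2):
    # Eq. 7.31: at least one of the two inequalities must hold
    return (3 * x1 + 2 * x2 <= 18) or (x1 + 4 * x2 <= 16)

def big_m(x1, x2):
    # Eq. 7.32: feasible if some y in {0, 1} satisfies both relaxed rows;
    # y = 0 enforces the first inequality, y = 1 the second
    return any((3 * x1 + 2 * x2 <= 18 + M * y) and
               (x1 + 4 * x2 <= 16 + M * (1 - y)) for y in (0, 1))

# the two formulations agree at every point of a coarse sample grid
agree = all(either_or(a, b) == big_m(a, b)
            for a in range(0, 30) for b in range(0, 30))
```

Setting y = 0 switches off the relaxation of the first constraint while making the second trivially satisfiable (and vice versa for y = 1), which is exactly how the binary variable enforces "at least one of the two."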

7.5.6 Example of Reliability Analysis of a Power Network

Consider a simple electric power transmission system with seven loads and two generators (at nodes 1 and 3) as shown in Fig. 7.21.11 The step-down transformers are the nodes of this network, while the high-voltage power lines are the links or lines indicated by arrows. The loss of power in the lines will be neglected, and the analysis will be done assuming DC current flow (AC current analysis is more demanding computationally, requiring the solution of non-linear equations, and one needs to consider sophisticated stabilizing feedback control loops). Moreover, network models only capture energy/power quantities, not current and voltage, as essential variables in a power system; this limits their ability to model cascading failures. Network models essentially capture snapshots of power systems at discrete time intervals, while actual power networks are continuous-time dynamical systems.
(a) Mathematical model
The flow model for the network shown i