T. Agami Reddy • Gregor P. Henze
Applied Data Analysis and Modeling for Energy Engineers and Scientists Second Edition
T. Agami Reddy The Design School and the School of Sustainable Engineering and the Built Environment Arizona State University Tempe, AZ, USA
Gregor P. Henze Department of Civil, Environmental and Architectural Engineering University of Colorado Boulder, CO, USA
ISBN 978-3-031-34868-6    ISBN 978-3-031-34869-3 (eBook)
https://doi.org/10.1007/978-3-031-34869-3
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Paper in this product is recyclable.
Dedicated to the men who have had a profound and signiﬁcant impact on my life (TAR): my grandfather Satyakarma, my father Dayakar, my brother Shantikar, and my son Satyajit. Dedicated to the strong, loving, and inspiring women in my life (GPH): my mother Uta Birgit, my wife Martha Marie, and our daughters Sophia Miriam and Josephine Charlotte.
Preface (Second Edition)
This second edition has been undertaken over a dozen years after the first edition and is a complete revision meant to modernize, update, and expand topic coverage and reference case study examples. The general intent remains the same, i.e., a practical textbook on applied data analysis and modeling targeting students and professionals in engineering and applied science working on energy and environmental issues and systems in general, and in the building energy domain in particular. Statistical textbooks often tend to be opaque, making it difficult to acquire an intuitive understanding of how to (and when not to) apply the numerous analysis techniques to one's chosen field. The style of this book is to present a simple, clear,¹ and logically laid-out structure of the various aspects of statistical theory² and practice, along with suggestions, discussion, and case study examples meant to enhance comprehension and act as a catalyst for self-discovery by the reader. The book remains modular, and several chapters can be studied as stand-alone units. The structure of the first edition has been retained, but important enhancements have been made. The first six chapters deal with basic topics covered in most statistical textbooks (and could serve as a first course if needed), while the remaining six chapters deal with more advanced topics with domain-relevant discussion and case study examples. The latter chapters have been revised extensively with new statistical methods and subject matter, along with numerous examples from recently published technical papers meant to nurture and stimulate a more research-focused mindset in the reader. The chapter on "classification and clustering" in the first edition has been renamed "statistical learning through data analysis" given the enormous advances in data science, and has been greatly expanded in scope and treatment. The chapters on inverse methods as applied to black-box, gray-box, and white-box models (Chaps.
9 and 10) have been better structured, thoroughly revised, and improved. A new section on sustainability assessments has been added to the last chapter on decision-making. It tries to dispel the currently confusing nomenclature in the sustainability literature by defining and scoping various terms and clearly distinguishing between the different assessment frameworks and their application areas. An attempt is made to combine traditional decision-making with the broader domain of "sustainability assessments."

In recent years, "science-based data analysis" is a term often used by certain sections of society to lend credence to whatever opinion they wish to promote, based on some sort of analysis of some sort of data; it is close to assuming the aura of a "new religion." Data analysis and modeling is an art with principles grounded in science. The same data set can often be analyzed in different ways by different analysts, thereby affording a great deal of methodological freedom. Unfortunately, a lack of rigor and an excessive reliance on freely available software packages undermine the attitude of humility and cautious scrutiny historically expected of the scientific method. Therefore, we hold that the significant body of statistical analysis methods is something those wishing to become competent and trustworthy analysts are encouraged to

¹ "If you cannot explain the concepts simply, you probably do not understand them properly."—Richard Feynman.
² Theory is something with principles which are not obvious initially but from which surprising consequences can be deduced and even the principles confirmed.
acquire as a foundation, but more important is the need to develop a mindset and a skill set built on years of hands-on experience and self-evaluation.

With the advent of powerful computing and convenient-to-use statistical packages, analysts can perform numerous different types of analysis before reaching a conclusion, a convenience not available to analysts merely a few decades ago. As alluded to above, this has also led to cases of misuse and even error. Numerous technical papers report results of statistical analysis that are incomplete and deficient, which leads to an unfortunate erosion of confidence in scientific findings among the research and policy community, not to mention the public. Further, there are several cases where the original approach to a specific topic went down an analysis path later found to be inappropriate, but subsequent researchers (and funding agencies) kept pushing that pathway simply because of historic inertia. Hence, it is imperative that one have the courage to be impartial about one's results and research findings. There are several instances where science, or at least applied science, is not self-correcting, which can be attributed to the tendency to belittle confirmatory research, to limited funding, to constant shifting of research focus by funding agencies, academics, and researchers, and to the mindset that changing course will open one's previous research to criticism.

The original author (TAR) is very pleased to have his colleague, Prof. Gregor Henze, join the authorship of this edition. One of the most noticeable differences in the second edition of this book is the inclusion of electronic resources. These will enhance senior-level and graduate instruction while also serving as a self-learning aid to professionals in this domain area.
For readers to gain exposure to (and perhaps become proficient in) performing hands-on analysis, the open-source Python and R programming languages have been adopted in the form of Jupyter notebooks and R Markdown files, which can be downloaded from https://github.com/henzeresearchgroup/adam. This repository contains numerous data sets and sample computer code reflective of real-world problems, and will continue to grow as new examples are developed and graciously contributed by readers of this book. The link also allows the large data tables in Appendix B and various chapters to be downloaded conveniently for analysis.
Acknowledgments

In addition to the numerous talented and dedicated colleagues who contributed in various ways to the first edition of this book, we would like to acknowledge, in particular, the following:

• Colleagues: Bass Abushakra, Marlin Addison, Brad Allenby, Juan-Carlos Baltazar, David Claridge, Daniel Feuermann, Srinivas Katipamula, George Runger, Kris Subbarao, Frank Vignola, and Radu Zmeureanu.
• Former students: Thomaz Carvalhaes, Srijan Didwania, Ranajoy Dutta, Phillip Howard, Alireza Inanlouganji, Mushfiq Islam, Saurabh Jalori, Salim Moslehi, Emmanuel Omere, Travis Sabatino, and Steve Snyder.

TAR would like to acknowledge the love and encouragement from his wife Shobha, their children Agaja and Satyajit, and granddaughters Maya and Mikaella. GPH would like to acknowledge the encouragement and support of his wife Martha and their children Sophia and Josephine.

Tempe, AZ, USA
Boulder, CO, USA
T. Agami Reddy Gregor P. Henze
Preface (First Edition)
A Third Need in Engineering Education

At its inception, engineering education was predominantly process oriented, while engineering practice tended to be predominantly system oriented.³ While it was invaluable to have a strong fundamental knowledge of the processes, educators realized the need for courses where this knowledge translated into an ability to design systems; therefore, most universities, starting in the 1970s, mandated that seniors take at least one design/capstone course. However, a third aspect is acquiring increasing importance: the need to analyze, interpret, and model data. Such a skill set is proving to be crucial in all scientific activities, nowhere more so than in engineering and the physical sciences. How can data collected from a piece of equipment be used to assess the claims of the manufacturer? How can performance data from a natural or a man-made system be used, respectively, to maintain it more sustainably or to operate it more efficiently? Such needs are driven by the fact that system performance data is easily available in our present-day digital age, where sensor and data acquisition systems have become reliable, cheap, and part of the system design itself. This applies both to experimental data (gathered from experiments performed according to some predetermined strategy) and to observational data (where one can neither intrude on system functioning nor control the experiment, such as in astronomy). Techniques for data analysis also differ depending on the size of the data; smaller data sets may require the use of "prior" knowledge of how the system is expected to behave or how similar systems have been known to behave in the past.
Let us consider a specific instance of observational data: once a system is designed and built, how to evaluate its condition relative to design intent and, if possible, operate it in an "optimal" manner under variable operating conditions (say, based on cost, on minimal environmental impact such as carbon footprint, or on any appropriate pre-specified objective). Thus, data analysis and data-driven modeling methods as applied to this instance can be meant to achieve certain practical ends—for example:

(a) Verifying stated claims of the manufacturer;
(b) Product improvement or product characterization from performance data of a prototype;
(c) Health monitoring of a system, i.e., how does one use quantitative approaches to reach sound decisions on the state or "health" of the system based on its monitored data?
(d) Controlling a system, i.e., how best to operate and control it on a day-to-day basis?
(e) Identifying measures to improve system performance, and assessing the impact of these measures;
(f) Verifying the performance of implemented measures, i.e., are the remedial measures impacting system performance as intended?
³ Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Intent

Data analysis and modeling is not an end in itself; it is a well-proven and often indispensable aid for subsequent decision-making, such as allowing realistic assessments and predictions to be made concerning expected behavior, the current operational state of the system, and/or the impact of any intended structural or operational changes. It has its roots in statistics, probability, regression, mathematics (linear algebra, differential equations, numerical methods, ...), modeling, and decision analysis. Engineering and science graduates are somewhat comfortable with mathematics, but they do not usually get any exposure to decision analysis at all. Statistics, probability, and regression analysis are usually squeezed into a sophomore term, with the result that they remain "a shadowy mathematical nightmare, and ... a weakness forever"⁴ even to academically good graduates. Further, many of these concepts, tools, and procedures are taught as disparate courses not only in physical sciences and engineering but also in life sciences, statistics, and econometrics departments. This has led many in the physical sciences and engineering communities to have a pervasive "mental block," apprehensiveness, or lack of appreciation of this discipline altogether. Though these analysis skills can be learnt over several years by some (while others never learn them well enough to be comfortable even after years of practice), what is needed is a textbook which provides:

1. A review of classical statistics and probability concepts,
2. A basic and unified perspective of the various techniques of data-based mathematical modeling and analysis,
3. An understanding of the "process" along with the tools,
4. A proper combination of classical methods with the more recent machine learning and automated tools which the widespread use of computers has spawned, and
5.
Well-conceived examples and problems involving real-world data that illustrate these concepts within the purview of specific areas of application.

Such a text is likely to dispel the current sense of unease and provide readers with the necessary measure of practical understanding and confidence to interpret their numbers rather than merely generate them. This would also have the added benefit of advancing the current state of knowledge and practice, in that the professional and research community would better appreciate, absorb, and even contribute to the numerous research publications in this area.
Approach and Scope

Forward models needed for system simulation and design have been addressed in numerous textbooks and have been well inculcated into the undergraduate engineering and science curriculum for several decades. It is the issue of data-driven methods which I feel is inadequately reinforced in undergraduate and first-year graduate curricula, and hence the basic rationale for this book. Further, this book is not meant to be a monograph or a compilation of information from papers, i.e., not a literature review. It is meant to serve as a textbook for senior undergraduate or first-year graduate students or for continuing education professional courses, as well as a self-study reference book for working professionals with adequate background. Applied statistics and data-based analysis methods find applications in various engineering, business, medical, physical, natural, and social sciences. Though the basic concepts are the same, the diversity in these disciplines results in rather different focus and differing emphasis of the analysis methods. This diversity may be in the process itself, in the type and quantity of data, and in the intended purpose of the analysis. For example, many engineering systems have
⁴ Keller, D.K., 2006. The Tao of Statistics, Sage Publications, London, UK.
low "epistemic" uncertainty, or uncertainty associated with the process itself, and also allow easy gathering of adequate performance data. Such systems are typically characterized by strong relationships between variables which can be formulated in mechanistic terms, and accurate models can consequently be identified. This is in stark contrast to fields such as economics and the social sciences, where even qualitative causal behavior is often speculative, and the quantity and quality of data rather poor. In fact, even different types of engineered and natural systems require widely different analysis tools. For example, the electrical and certain mechanical engineering disciplines (e.g., those involving rotary equipment) largely rely on frequency-domain analysis methods, while time-domain methods are more suitable for most thermal and environmental systems. This consideration has led me to limit the scope of the analysis techniques described in this book to thermal, energy-related, environmental, and industrial systems.

There are those students for whom a mathematical treatment and justification helps in better comprehension of the underlying concepts. However, my personal experience has been that the great majority of engineers do not fall in this category, and hence a more pragmatic approach is adopted. I am not particularly concerned with proofs, deductions, and statistical rigor, which tend to overwhelm the average engineering student. The intent is, rather, to impart a broad conceptual and theoretical understanding as well as a solid working familiarity (by means of case studies) with the various facets of data-driven modeling and analysis as applied to thermal and environmental systems. On the other hand, this is neither a cookbook nor a reference book listing various models of the numerous equipment and systems which comprise thermal systems; rather, it stresses underlying scientific, engineering, statistical, and analysis concepts.
This book should not be considered a substitute for specialized books, nor should their importance be trivialized. A good general professional needs to be familiar, if not proficient, with a number of different analysis tools and with how they "map" onto each other, so as to select the most appropriate tools for the occasion. Though nothing can replace hands-on experience in design and data analysis, familiarity with the appropriate theoretical concepts would not only shorten modeling and analysis time but also enable better engineering analysis to be performed. Further, those who have gone through this book will gain the basic understanding required to tackle the more advanced topics dealt with in the literature at large, and hence elevate the profession as a whole. This book has been written with a certain amount of zeal in the hope that it will give this field some impetus and lead to its gradual emergence as an identifiable and important discipline (just as that enjoyed by a course on modeling, simulation, and design of systems), ultimately becoming a required senior-level or first-year graduate course in most engineering and science curricula.

This book has been intentionally structured so that the same topics (namely, statistics, parameter estimation, and data collection) are treated first at a "basic" level, primarily by reviewing the essentials, and then at an "intermediate" level. This allows the book to have broader appeal and permits a gentler absorption of the needed material by certain students and practicing professionals. As pointed out by Asimov,⁵ the Greeks demonstrated that abstraction (or simplification) in physics allowed a simple and generalized mathematical structure to be formulated, which led to greater understanding than would otherwise have been possible, along with the ability to subsequently restore some of the real-world complicating factors which were ignored earlier.
Most textbooks implicitly follow this premise by presenting "simplistic" illustrative examples and problems. I strongly believe that a book on data analysis should also expose the student to the "messiness" present in real-world data. To that end, examples and problems dealing with case studies involving actual (either raw or marginally cleaned) data have been included. The hope is that this will provide the student with the necessary training and confidence to tackle real-world analysis situations.
⁵ Asimov, I., 1966. Understanding Physics: Light, Magnetism and Electricity, Walker Publications.
Assumed Background of Reader

This book is written for two audiences: the basic treatment is meant for the general engineering and science senior as well as the general practicing engineer, while the intermediate treatment is meant for the general graduate student and the more advanced professional entering the fields of thermal and environmental sciences. The exponential expansion of scientific and engineering knowledge, as well as its cross-fertilization with allied emerging fields such as computer science, nanotechnology, and bioengineering, has created the need for a major re-evaluation of the thermal science undergraduate and graduate engineering curricula. The relatively few professional and free elective academic slots available to students require that traditional subject matter be combined into fewer classes, whereby the associated loss in depth and rigor is compensated for by a better understanding of the connections among different topics within a given discipline as well as between traditional and newer ones.

It is presumed that the reader has the necessary academic background (at the undergraduate level) in traditional topics such as physics, mathematics (linear algebra and calculus), fluids, thermodynamics, and heat transfer, as well as some exposure to experimental methods, probability, statistics, and regression analysis (taught in lab courses at the freshman or sophomore level). Further, it is assumed that the reader has some basic familiarity with the important energy and environmental issues facing society today. However, special effort has been made to provide pertinent review of such material so as to make this a sufficiently self-contained book. Most students and professionals are familiar with the uses and capabilities of the ubiquitous spreadsheet program.
Though many of the problems can be solved with the existing (or add-on) capabilities of such spreadsheet programs, it is urged that the instructor or reader select an appropriate statistical program for the statistical computing work because of the added sophistication it provides. This book does not delve into how to use these programs; rather, its focus is education-based, intended to provide the knowledge and skill sets necessary for the judgment and confidence needed to use them well, as opposed to training-based, where the focus would be to teach facts and specialized software.
Acknowledgements

Numerous talented and dedicated colleagues contributed in various ways over the several years of my professional career; some by direct association, others indirectly through their textbooks and papers, both of which were immensely edifying and stimulating to me personally. The list of acknowledgements of such meritorious individuals would be very long indeed, and so I have limited myself to those who have either provided direct valuable suggestions on the overview and scope of this book or have generously given their time in reviewing certain chapters. In the former category, I would like to gratefully mention Drs. David Claridge, Jeff Gordon, Gregor Henze, John Mitchell, and Robert Sonderegger, while in the latter, Drs. James Braun, Patrick Gurian, John House, Ari Rabl, and Balaji Rajagopalan. I am also appreciative of interactions with several exceptional graduate students, and would like to especially thank the following, whose work has been adopted in case study examples in this book: Klaus Andersen, Song Deng, Jason Fierko, Wei Jiang, Itzhak Maor, Steven Snyder, and Jian Sun.

Writing a book is a tedious and long process; the encouragement and understanding of my wife, Shobha, and our children, Agaja and Satyajit, were sources of strength and motivation.

Tempe, AZ, USA
December 2010
T. Agami Reddy
Contents
1 Mathematical Models and Data Analysis ..... 1
   1.1 Forward and Inverse Approaches ..... 1
      1.1.1 Preamble ..... 1
      1.1.2 The Energy Problem and Importance of Buildings ..... 1
      1.1.3 Forward or Simulation Approach ..... 2
      1.1.4 Inverse or Data Analysis Approach ..... 2
      1.1.5 Discussion of Both Approaches ..... 3
   1.2 System Models ..... 3
      1.2.1 What Is a System Model? ..... 3
      1.2.2 Types of Models ..... 3
   1.3 Types of Data ..... 4
      1.3.1 Classification ..... 4
      1.3.2 Types of Uncertainty in Data ..... 6
   1.4 Mathematical Models ..... 7
      1.4.1 Basic Terminology ..... 7
      1.4.2 Block Diagrams ..... 7
      1.4.3 Mathematical Representation ..... 9
      1.4.4 Classification ..... 10
      1.4.5 Steady-State and Dynamic Models ..... 13
   1.5 Mathematical Modeling Approaches ..... 14
      1.5.1 Broad Categorization ..... 14
      1.5.2 Simulation or Forward Modeling ..... 14
      1.5.3 Inverse Modeling ..... 17
      1.5.4 Calibrated Simulation ..... 19
   1.6 Data Analytic Approaches ..... 19
      1.6.1 Data Mining or Knowledge Discovery ..... 20
      1.6.2 Machine Learning or Algorithmic Models ..... 20
      1.6.3 Introduction to Big Data ..... 21
   1.7 Data Analysis ..... 22
      1.7.1 Introduction ..... 22
      1.7.2 Basic Stages ..... 22
      1.7.3 Example of a Data Collection and Analysis System ..... 23
   1.8 Topics Covered in Book ..... 25
   Problems ..... 27
   References ..... 30

2 Probability Concepts and Probability Distributions ..... 31
   2.1 Introduction ..... 31
      2.1.1 Classical Concept of Probability ..... 31
      2.1.2 Bayesian Viewpoint of Probability ..... 32
      2.1.3 Distinction Between Probability and Statistics ..... 32
   2.2 Classical Probability ..... 32
      2.2.1 Basic Terminology ..... 32
      2.2.2 Basic Set Theory Notation and Axioms of Probability ..... 33
      2.2.3 Axioms of Probability ..... 34
      2.2.4 Joint, Marginal, and Conditional Probabilities ..... 35
      2.2.5 Permutations and Combinations ..... 38
   2.3 Probability Distribution Functions ..... 39
      2.3.1 Density Functions ..... 39
      2.3.2 Expectations and Moments ..... 42
      2.3.3 Function of Random Variables ..... 43
      2.3.4 Chebyshev's Theorem ..... 44
   2.4 Important Probability Distributions ..... 45
      2.4.1 Background ..... 45
      2.4.2 Distributions for Discrete Variables ..... 45
      2.4.3 Distributions for Continuous Variables ..... 50
   2.5 Bayesian Probability ..... 58
      2.5.1 Bayes' Theorem ..... 58
      2.5.2 Application to Discrete Probability Variables ..... 61
      2.5.3 Application to Continuous Probability Variables ..... 63
   2.6 Three Kinds of Probabilities ..... 64
   Problems ..... 66
   References ..... 74

3 Data Collection and Preliminary Analysis ..... 75
   3.1 Sensors and Their Characteristics ..... 75
   3.2 Data Collection Systems ..... 77
      3.2.1 Generalized Measurement System ..... 77
      3.2.2 Types and Categories of Measurements ..... 80
      3.2.3 Data Recording Systems ..... 80
   3.3 Raw Data Validation and Preparation ..... 81
      3.3.1 Definitions ..... 81
      3.3.2 Limit Checks ..... 81
      3.3.3 Consistency Checks Involving Conservation Balances ..... 82
      3.3.4 Outlier Rejection by Visual Means ..... 83
      3.3.5 Handling Missing Data ..... 83
   3.4 Statistical Measures of Sample Data ..... 87
      3.4.1 Summary Descriptive Measures ..... 87
      3.4.2 Covariance and Pearson Correlation Coefficient ..... 88
   3.5 Exploratory Data Analysis (EDA) ..... 89
      3.5.1 What Is EDA? ..... 89
      3.5.2 Purpose of Data Visualization ..... 91
      3.5.3 Static Univariate Graphical Plots ..... 92
      3.5.4 Static Bi- and Multivariate Graphical Plots ..... 95
      3.5.5 Interactive and Dynamic Graphics ..... 100
      3.5.6 Basic Data Transformations ..... 101
   3.6 Overall Measurement Uncertainty ..... 102
      3.6.1 Need for Uncertainty Analysis ..... 102
      3.6.2 Basic Uncertainty Concepts: Random and Bias Errors ..... 102
      3.6.3 Random Uncertainty ..... 103
      3.6.4 Bias Uncertainty ..... 105
      3.6.5 Overall Uncertainty ..... 105
      3.6.6 Chauvenet's Statistical Criterion of Data Rejection ..... 106
   3.7 Propagation of Errors ..... 106
      3.7.1 Taylor Series Method for Cross-Sectional Data ..... 107
      3.7.2 Monte Carlo Method for Error Propagation Problems ..... 111
   3.8 Planning a Non-Intrusive Field Experiment ..... 113
   Problems ..... 115
   References ..... 121

4 Making Statistical Inferences from Samples
   4.1 Introduction
   4.2 Basic Univariate Inferential Statistics
      4.2.1 Sampling Distribution and Confidence Interval of the Mean
      4.2.2 Hypothesis Test for Single Sample Mean
      4.2.3 Two Independent Sample and Paired Difference Tests on Means
      4.2.4 Single and Two Sample Tests for Proportions
      4.2.5 Single and Two Sample Tests of Variance
      4.2.6 Tests for Distributions
      4.2.7 Test on the Pearson Correlation Coefficient
   4.3 ANOVA Test for Multi-Samples
      4.3.1 Single-Factor ANOVA
      4.3.2 Tukey's Multiple Comparison Test
   4.4 Tests of Significance of Multivariate Data
      4.4.1 Introduction to Multivariate Methods
      4.4.2 Hotelling T² Test
   4.5 Non-Parametric Tests
      4.5.1 Signed and Rank Tests for Medians
      4.5.2 Kruskal–Wallis Multiple Samples Test for Medians
      4.5.3 Test on Spearman Rank Correlation Coefficient
   4.6 Bayesian Inferences
4.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.2 Estimating Population Parameter from a Sample . . . . . . . . . . . . . 4.6.3 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7 Some Considerations About Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7.1 Random and NonRandom Sampling Methods . . . . . . . . . . . . . . 4.7.2 Desirable Properties of Estimators . . . . . . . . . . . . . . . . . . . . . . . 4.7.3 Determining Sample Size During Random Surveys . . . . . . . . . . 4.7.4 Stratiﬁed Sampling for Variance Reduction . . . . . . . . . . . . . . . . 4.8 Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.1 Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.2 Application to Probability Problems . . . . . . . . . . . . . . . . . . . . . 4.8.3 Different Methods of Resampling . . . . . . . . . . . . . . . . . . . . . . . 4.8.4 Application of Bootstrap to Statistical Inference Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8.5 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
123 123 124
Linear Regression Analysis Using Least Squares . . . . . . . . . . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Objective of Regression Analysis . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Ordinary Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . .
124 127 129 133 134 136 137 138 139 141 142 142 143 146 147 149 150 152 152 152 153 154 154 155 156 158 159 159 159 159 160 162 163 167 169 169 170 170 170
xvi
Contents
5.3
Simple OLS Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Estimation of Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . 5.3.2 Statistical Criteria for Model Evaluation . . . . . . . . . . . . . . . . . . 5.3.3 Inferences on Regression Coefﬁcients and Model Signiﬁcance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.4 Model Prediction Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Multiple OLS Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Higher Order Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.2 Matrix Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.3 Point and Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.4 Beta Coefﬁcients and Elasticity . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.5 Partial Correlation Coefﬁcients . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.6 Assuring Model Parsimony—Stepwise Regression . . . . . . . . . . . 5.5 Applicability of OLS Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 5.5.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.2 Sources of Errors During Regression . . . . . . . . . . . . . . . . . . . . . 5.6 Model Residual Analysis and Regularization . . . . . . . . . . . . . . . . . . . . . . 5.6.1 Detection of IllConditioned Behavior . . . . . . . . . . . . . . . . . . . . 5.6.2 Leverage and Inﬂuence Data Points . . . . . . . . . . . . . . . . . . . . . . 5.6.3 Remedies for Nonuniform Residuals . . . . . . . . . . . . . . . . . . . . . 5.6.4 Serially Correlated Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6.5 Dealing with Misspeciﬁed Models . . . . . . . . . . . . . . . . . . . . . . . 5.7 Other Useful OLS Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 
5.7.1 ZeroIntercept Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.7.2 Indicator Variables for Local Piecewise Models— Linear Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.7.3 Indicator Variables for Categorical Regressor Models . . . . . . . . . 5.8 Resampling Methods Applied to Regression . . . . . . . . . . . . . . . . . . . . . . . 5.8.1 Basic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.8.2 Jackknife and kFold CrossValidation . . . . . . . . . . . . . . . . . . . 5.8.3 Bootstrap Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.10 Parting Comments on Regression Analysis and OLS . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Design of Physical and Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.1 Types of Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.2 Purpose of DOE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.3 DOE Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Overview of Different Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Different Types of ANOVA Tests . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Link Between ANOVA and Regression . . . . . . . . . . . . . . . . . . . 6.2.3 Recap of Basic Model Functional Forms . . . . . . . . . . . . . . . . . . 6.3 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.1 Levels, Discretization, and Experimental Combinations . . . . . . . 6.3.2 Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.3 Unrestricted and Restricted Randomization . . . . . . . . . . . . . . . . 6.4 Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.1 Full Factorial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.2 2k Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
171 171 173 175 177 178 178 182 182 185 187 187 189 189 191 191 191 194 197 202 203 205 205 205 207 208 208 208 210 210 213 214 221 223 223 223 224 225 225 225 226 226 227 227 229 229 230 230 234
Contents
xvii
7
6.4.3 Concept of Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4.4 Fractional Factorial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5 Block Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.1 Complete Block Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.2 Latin Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6 Response Surface Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.6.3 First and SecondOrder Models . . . . . . . . . . . . . . . . . . . . . . . . 6.6.4 Central Composite Design and the Concept of Rotation . . . . . . . 6.7 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.2 Similarities and Differences Between Physical and Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.3 Monte Carlo and Allied Sampling Methods . . . . . . . . . . . . . . . . 6.7.4 Sensitivity Analysis for Screening . . . . . . . . . . . . . . . . . . . . . . . 6.7.5 Surrogate Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
238 240 241 241 243 245 245 246 246 247 250 250
Optimization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1.1 What Is Optimization? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1.2 Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Terminology and Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.1 Deﬁnition of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.2 Categorization of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.3 Types of Objective Functions and Constraints . . . . . . . . . . . . . . 7.2.4 Sensitivity Analysis and PostOptimality Analysis . . . . . . . . . . . 7.3 Analytical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.1 Unconstrained Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Direct Substitution Method for Equality Constrained Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.3 Lagrange Multiplier Method for Equality Constrained Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.4 Problems with Inequality Constraints . . . . . . . . . . . . . . . . . . . . . 7.3.5 Penalty Function Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Numerical Unconstrained Search Methods . . . . . . . . . . . . . . . . . . . . . . . . 7.4.1 Univariate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.2 Multivariate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Linear Programming (LP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 Standard Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.2 Example of a LP Problem . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.3 Linear Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.4 Example of Maximizing Flow in a Transportation Network . . . . . 7.5.5 Mixed Integer Linear Programing (MILP) . . . . . . . . . . . . . . . . . 7.5.6 Example of Reliability Analysis of a Power Network . . . . . . . . . 7.6 Nonlinear Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.1 Standard Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.2 Quadratic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.3 Popular Numerical Multivariate Search Algorithms . . . . . . . . . .
267 267 267 268 269 269 269 271 272 272 272
252 253 254 259 260 260 265
273 274 275 276 277 277 280 282 282 283 284 285 285 286 288 288 289 289
xviii
8
9
Contents
7.7 Illustrative Example: Integrated Energy System (IES) for a Campus . . . . . . 7.8 Examples of Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
290 294 299 306
Analysis of Time Series Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.3 Basic Behavior Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1.4 Illustrative Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 General Model Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 Smoothing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.1 Arithmetic Moving Average (AMA) . . . . . . . . . . . . . . . . . . . . . 8.3.2 Exponentially Weighted Moving Average (EWA) . . . . . . . . . . . 8.3.3 Determining Structure by CrossValidation . . . . . . . . . . . . . . . . 8.4 OLS Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4.1 Trend Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4.2 Trend and Seasonal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4.3 Forecast Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4.4 Fourier Series Models for Periodic Behavior . . . . . . . . . . . . . . . 8.4.5 Interrupted Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5 Stochastic Time Series Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.2 ACF, PACF, and Data Detrending . . . . . . . . . . . . . . . . . . . . . . . 8.5.3 ARIMA Class of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.4 Recommendations on Model Identiﬁcation . . . . . . . . . . . . . . 
. . . 8.6 ARMAX or Transfer Function Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6.1 Conceptual Approach and Beneﬁt . . . . . . . . . . . . . . . . . . . . . . . 8.6.2 Transfer Function Modeling of Linear Dynamic Systems . . . . . . 8.7 Quality Control and Process Monitoring Using Control Chart Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.7.1 Background and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.7.2 Shewart Control Charts for Variables . . . . . . . . . . . . . . . . . . . . . 8.7.3 Shewart Control Charts for Attributes . . . . . . . . . . . . . . . . . . . . 8.7.4 Practical Implementation Issues of Control Charts . . . . . . . . . . . 8.7.5 TimeWeighted Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.7.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
309 309 309 311 312 313 314 314 315 316 318 320 320 321 322 324 327 328 328 329 332 337 339 339 339
Parametric and NonParametric Regression Methods . . . . . . . . . . . . . . . . . . . 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Important Concepts in Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . 9.2.1 Structural Identiﬁability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.2 IllConditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.3 Numerical Identiﬁability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3 Dealing with Collinear Regressors: Variable Selection and Shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3.1 Problematic Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3.2 Principal Component Analysis and Regression . . . . . . . . . . . . . . 9.3.3 Ridge and Lasso Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3.4 Chiller Case Study Involving Collinear Regressors . . . . . . . . . . . 9.3.5 Other Multivariate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . .
355 355 357 357 359 360
341 341 342 344 346 347 349 350 353
361 361 362 367 368 372
Contents
xix
10
9.4
Going Beyond OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.2 Maximum Likelihood Estimation (MLE) . . . . . . . . . . . . . . . . . . 9.4.3 Generalized Linear Models (GLM) . . . . . . . . . . . . . . . . . . . . . . 9.4.4 BoxCox Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.5 Logistic Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4.6 Error in Variables (EIV) and Corrected Least Squares . . . . . . . . . 9.5 NonLinear Parametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5.1 Detecting NonLinear Correlation . . . . . . . . . . . . . . . . . . . . . . . 9.5.2 Different NonLinear Search Methods . . . . . . . . . . . . . . . . . . . . 9.5.3 Overview of Various Parametric Regression Methods . . . . . . . . . 9.6 NonParametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.2 Extensions to Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.3 Basis Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.4 Polynomial Regression and Smoothing Splines . . . . . . . . . . . . . 9.7 Local Regression: LOWESS Smoothing Method . . . . . . . . . . . . . . . . . . . 9.8 Neural Networks: MultiLayer Perceptron (MLP) . . . . . . . . . . . . . . . . . . . 9.9 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
373 373 375 377 378 381 384 386 386 388 390 390 390 391 393 393 394 394 399 401 406
Inverse Methods for Mechanistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.1 Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.2 Approaches and Their Characteristics . . . . . . . . . . . . . . . . . . . . 10.1.3 Mechanistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.1.4 Scope of Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 GrayBox Static Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.1 Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.2 Performance Models for Solar Photovoltaic Systems . . . . . . . . . 10.2.3 GrayBox and BlackBox Models for WaterCooled Chillers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.4 Sequential Stagewise Regression and Selection of Data Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2.5 Case Study of NonIntrusive Sequential Parameter Estimation for Building Energy Flows . . . . . . . . . . . . . . . . . . . . 10.2.6 Application to Policy: DoseResponse . . . . . . . . . . . . . . . . . . . . 10.3 Certain Aspects of Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3.1 Types of Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.3.2 Measures of Information Content . . . . . . . . . . . . . . . . . . . . . . . 10.3.3 Functional Testing and Data Fusion . . . . . . . . . . . . . . . . . . . . . . 10.4 GrayBox Models for Dynamic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.2 Sequential Estimation of Thermal Network Model Parameters from Controlled Tests . . . . . . . . . 
. . . . . . . . . . . . . . 10.4.3 NonIntrusive Identiﬁcation of Thermal Network Models and Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4.4 State Space Representation and Compartmental Models . . . . . . . 10.4.5 Example of a Compartmental Model . . . . . . . . . . . . . . . . . . . . . 10.4.6 Practical Issues During Identiﬁcation . . . . . . . . . . . . . . . . . . . . . 10.5 Bayesian Regression and Parameter Estimation: Case Study . . . . . . . . . . . 10.6 Calibration of Detailed Simulation Programs . . . . . . . . . . . . . . . . . . . . . .
409 410 410 411 413 413 413 413 414 417 419 420 424 426 426 426 431 432 432 433 434 437 437 439 441 446
xx
Contents
10.6.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.2 The Basic Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.3 Detailed Simulation Models for Energy Use in Buildings . . . . . . 10.6.4 Uses of Calibrated Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.5 Causes of Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.6 Deﬁnition of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.7 Raw Input Tuning (RIT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.6.8 SemiAnalytical Methods (SAM) . . . . . . . . . . . . . . . . . . . . . . . 10.6.9 Physical Parameter Estimation (PPE) . . . . . . . . . . . . . . . . . . . . . 10.6.10 Thoughts on Statistical Criteria for GoodnessofFit . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
446 446 447 449 450 451 451 453 456 456 459 464
11
Statistical Learning Through Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Distance as a Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Unsupervised Learning: Clustering Approaches . . . . . . . . . . . . . . . . . . . . 11.3.1 Types of Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3.2 CentroidBased Partitional Clustering by KMeans . . . . . . . . . . . 11.3.3 DensityBased Partitional Clustering Using DBSCAN . . . . . . . . 11.3.4 Agglomerative Hierarchical Clustering Methods . . . . . . . . . . . . . 11.4 Supervised Learning: StatisticalBased Classiﬁcation Approaches . . . . . . . 11.4.1 Different Types of Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 11.4.2 DistanceBased Classiﬁcation: kNearest Neighbors . . . . . . . . . . 11.4.3 Naive Bayesian Classiﬁcation . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4.4 Classical RegressionBased Classiﬁcation . . . . . . . . . . . . . . . . . 11.4.5 Discriminant Function Analysis . . . . . . . . . . . . . . . . . . . . . . . . 11.4.6 Neural Networks: Radial Basis Function (RBF) . . . . . . . . . . . . . 11.4.7 Support Vector Machines (SVM) . . . . . . . . . . . . . . . . . . . . . . . 11.5 Decision Tree–Based Classiﬁcation Methods . . . . . . . . . . . . . . . . . . . . . . 11.5.1 RuleBased Method and DecisionTree Representation . . . . . . . . 11.5.2 Criteria for Tree Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5.3 Classiﬁcation and Regression Trees (CART) . . . . . . . . . . . . . . . 11.5.4 Ensemble Method: Random Forest . . . . . . . . . . . . . . . . . . . . . . 11.6 Anomaly Detection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
11.6.2 Graphical and Statistical Methods . . . . . . . . . . . . . . . . . . . . . . . 11.6.3 ModelBased Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.6.4 Data Mining Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.7 Applications to Reducing Energy Use in Buildings . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
467 467 469 470 470 471 475 478 482 482 482 486 486 490 492 493 495 495 496 497 499 504 504 505 505 505 506 510 512
12
DecisionMaking, Risk Analysis, and Sustainability Assessments . . . . . . . . . . 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.1 Types of DecisionMaking Problems and Applications . . . . . . . . 12.1.2 Purview of Reliability, Risk Analysis, and DecisionMaking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.1.3 Example of Discrete DecisionMaking . . . . . . . . . . . . . . . . . . . . 12.1.4 Example of Chiller FDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Single Criterion DecisionMaking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.1 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2.2 Representing Problem Structure: Inﬂuence Diagrams and Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
515 515 515 518 518 519 521 521 522
Contents
xxi
12.2.3 12.2.4 12.2.5 12.2.6 12.2.7 12.2.8
Single and Multi-Stage Decision Problems
Value of Perfect Information
Different Criteria for Outcome Evaluation
Discretizing Probability Distributions
Utility Value Functions for Modeling Risk Attitudes
Monte Carlo Simulation for First-Order and Nested Uncertainties
12.3 Risk Analysis
12.3.1 The Three Aspects
12.3.2 The Empirical Approach
12.3.3 Context of Environmental Risk to Humans
12.3.4 Other Areas of Application
12.4 Case Study: Risk Assessment of an Existing Building
12.5 Multi-Criteria Decision-Making (MCDM) Methods
12.5.1 Introduction and Description of Terms
12.5.2 Classification of Methods
12.5.3 Basic Mathematical Operations
12.6 Single Discipline MCDM Methods: Techno-Economic Analysis
12.6.1 Review
12.6.2 Consistent Attribute Scales
12.6.3 Inconsistent Attribute Scales: Dominance and Pareto Frontier
12.6.4 Case Study of Conflicting Criteria: Supervisory Control of an Engineered System
12.7 Sustainability Assessments: MCDM with Multi-Discipline Attribute Scales
12.7.1 Definitions and Scope
12.7.2 Indicators and Metrics
12.7.3 Sustainability Assessment Frameworks
12.7.4 Examples of Non-, Semi-, and Fully-Aggregated Assessments
12.7.5 Two Case Studies: Structure-Based and Performance-Based
12.7.6 Closure
Problems
References
Appendices
Index
1 Mathematical Models and Data Analysis
Abstract
This chapter starts with an introduction to forward and inverse models and provides a practical context for their distinctive usefulness, specific capabilities, and scopes in terms of a major societal concern, namely the high energy use in buildings. This is followed by a description of the various types of models generally encountered, namely conceptual, physical, and mathematical, with the last type being the sole focus of this book. Next, different types of data collection schemes and the different types of uncertainty encountered are discussed. This is followed by introducing the elements of mathematical models and the different ways to classify them, such as linear and nonlinear, lumped and distributed, dynamic and steady-state, etc. Subsequently, how algebraic and first-order differential equations capture different characteristics related to the response of sensors is illustrated. Next, the distinction between simulation or forward (or well-defined or well-specified) problems and inverse (or data-driven or ill-defined) problems is highlighted. This chapter introduces analysis approaches relevant to the latter, which include calibrated forward models and statistical models identified primarily from data, which can be black-box or gray-box. The latter can again be separated into (i) partial gray-box models, i.e., inspired by only a partial understanding of system functioning, and (ii) reduced-order mechanistic gray-box models. More recently, a new class of analysis methods has evolved, namely the data analytic approaches, which include data mining or knowledge discovery, machine learning, and big data analysis. These methods, which have been largely driven by increasing computational power and sensing capabilities, are briefly discussed. Next, the various steps involved in a statistical analysis study are discussed, followed by an example of data collection and analysis of field-monitored data from an engineering system.
Finally, the various topics covered in each chapter of this book are outlined.
1.1 Forward and Inverse Approaches

1.1.1 Preamble
Applied data analysis and modeling of system performance is historically older than simulation modeling. The ancients, starting as far back as 12,000 years ago, observed the movements of the sun, moon, and stars in order to predict their behavior and initiate certain tasks such as planting crops or readying for winter. Theirs was a necessity compelled by survival; surprisingly, it is still relevant today. The threat of climate change and its dire consequences are being studied by scientists using essentially similar types of analysis tools, i.e., tools that use measured data to refine and calibrate models, extrapolate, and evaluate the effect of different scenarios and mitigation measures. These tools fall under the general purview of inverse data analysis and modeling methods, and it would be expedient to illustrate their potential and relevance with a case study application that the reader can relate to practically.
1.1.2 The Energy Problem and Importance of Buildings
One of the major societal problems currently facing mankind is the issue of energy, not only due to the gradual depletion of fossil fuels but also due to the adverse climatic and health effects that burning them creates. According to the U.S. Department of Energy (USDOE), total worldwide primary energy consumption in 2021 was about 580 exajoules (580 × 10¹⁸ J). The average annual growth rate is about 2%, which suggests a doubling time of 35 years. The United States accounts for 17% of worldwide energy use (with only 5% of the world's population), while the building sector alone (residential plus commercial buildings) in the United States consumes about 40% of the total primary energy use,
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/9783031348693_1
over 76% of the electricity generated, and is responsible for 40% of the CO2 emitted. Improvement in energy efficiency in all sectors of the economy has rightly been identified as a major and pressing need, and aggressive programs and measures are being implemented worldwide. By 2030, USDOE estimates that building energy use in the United States could be cut by more than 20% using technologies known to be cost-effective today, and by more than 35% if research goals are met. Much higher savings are technically possible. Building efficiency must be viewed as improving the performance of a complex system designed to provide occupants with a comfortable, safe, and attractive living and work environment. This requires superior architectural and engineering designs, quality construction practices, and intelligent operation and maintenance of the structures. Identifying energy conservation and efficiency opportunities, verifying by monitoring whether anticipated benefits are in fact realized when such measures/systems are implemented, operating buildings optimally, etc.; all these tasks require skills in data analysis and modeling.
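The quoted 2% growth rate and 35-year doubling time are mutually consistent and easy to check; a minimal sketch of the standard exponential-growth calculation (the 2% figure is from the text):

```python
import math

# Doubling time for a quantity growing at a constant annual rate:
# solve (1 + r)^t = 2 for t.
growth_rate = 0.02  # 2% per year, as quoted in the text
doubling_time = math.log(2) / math.log(1 + growth_rate)
print(round(doubling_time, 1))  # about 35 years
```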
1.1.3 Forward or Simulation Approach
Building energy simulation models (or forward models) are mechanistic (i.e., based on a mathematical formulation of the physical behavior) and deterministic (i.e., there is no randomness in the inputs or outputs).1 They require as inputs the hourly climatic data of the selected location; the layout, orientation, and physical description of the building (such as wall material, thickness, glazing type and fraction, type of shading overhangs, etc.); the type of mechanical and electrical systems available inside the building in terms of air distribution secondary systems; performance specifications of primary equipment (chillers, boilers, etc.); and the hourly operating and occupancy schedules of the building. The simulation predicts hourly or sub-hourly energy use during the entire year, from which monthly total energy use and peak use, along with utility rates, provide an estimate of the operating costs of the building. The primary strength of such a forward simulation model is that it is based on sound engineering principles usually taught in colleges and universities, and consequently it has gained widespread acceptance by the design and professional community. Major public-domain simulation codes (e.g., EnergyPlus 2009) have been developed, with hundreds of man-years invested in their development by very competent professionals. This modeling approach is generally useful for design purposes, where different design options are to be evaluated before the actual system is built.
1 These terms will be described more fully in Sect. 1.5.2.
1.1.4 Inverse or Data Analysis Approach
Inverse modeling methods, on the other hand, are used when performance data of the system is available, and one uses this data for certain specific purposes, such as predicting or controlling the behavior of the system under different operating conditions, identifying energy conservation opportunities, verifying the effect of energy conservation measures and commissioning practices once implemented, or even verifying that the system is performing as intended (called condition monitoring). Consider the case of an existing building whose energy consumption is known (either utility bill data or monitored data). The following are some of the tasks for which knowledge of data analysis methods may be advantageous to a building energy specialist: (a) Commissioning tests: How can one evaluate whether a component or a system is installed and commissioned properly? (b) Comparison with design intent: How does the consumption compare with design predictions? In case of discrepancies, are they due to anomalous weather, unintended building operation, improper equipment operation, or other causes? (c) Demand-side management (DSM): How would the energy consumption decrease if certain operational changes were made, such as lowering thermostat settings, ventilation rates, or indoor lighting levels? (d) Operation and maintenance (O&M): How much energy could be saved by retrofits to the building shell, changes to air handler operation from constant air volume to variable air volume, changes to the various control settings, or replacement of the old chiller with a new and more energy-efficient one? (e) Monitoring and verification (M&V): If retrofits are implemented in the system, can one verify that the savings are due to the retrofit and not to other confounding causes, e.g., the weather or changes in building occupancy?
(f) Automated fault detection, diagnosis, and evaluation (AFDDE): How can one automatically detect faults in heating, ventilating, airconditioning, and refrigerating (HVAC&R) equipment, which reduce operating life and/or increase energy use? What are the ﬁnancial implications of this degradation? Should this fault be rectiﬁed immediately or at a later time? What speciﬁc measures need to be taken? (g) Optimal supervisory operation: How can one characterize HVAC&R equipment (such as chillers, boilers, fans, pumps, etc.) in their installed state and optimize the control and operation of the entire system? (h) Smartgrid interactions: How to best facilitate dynamic energy interactions between building energy systems
and the smart grid, with advanced communication and load control capability and high solar/wind energy penetration?
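Tasks (c) through (e) typically start from an inverse baseline model identified from utility data. A minimal sketch, fitting a two-parameter weather-normalized baseline (a weather-independent base load plus a heating slope) by ordinary least squares; the monthly degree-day and consumption values below are synthetic, invented purely for illustration:

```python
# Illustrative monthly heating degree-days (HDD) and energy use (kWh);
# these numbers are synthetic, not from the text.
hdd = [620, 540, 410, 250, 120, 40, 10, 20, 90, 280, 450, 590]
kwh = [31500, 28900, 24800, 19900, 15800, 13300, 12400, 12700,
       14900, 20800, 26100, 30600]

n = len(hdd)
mean_x, mean_y = sum(hdd) / n, sum(kwh) / n
# Ordinary least-squares slope and intercept for kwh = base + slope * hdd.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hdd, kwh))
         / sum((x - mean_x) ** 2 for x in hdd))
base_load = mean_y - slope * mean_x  # weather-independent consumption

# For M&V, post-retrofit savings would then be estimated as
# (baseline prediction for the post-retrofit weather) - (measured use).
```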
1.1.5 Discussion of Both Approaches
All the above questions are better addressed by data analysis methods. The forward approach could also be used by, say, (i) going back to the blueprints of the building and of the HVAC system and repeating the analysis performed at the design stage while using actual building schedules and operating modes, and (ii) performing a calibration or tuning of the simulation model (i.e., varying the inputs in some fashion), since the predicted performance is unlikely to match the observed performance. This process is, however, tedious, and much effort has been invested by the building professional community in this regard with only limited success (Reddy 2006). A critical limitation of the calibrated simulation approach is that the data being used to tune the forward simulation model must meet certain criteria, and even then, all the numerous inputs required by the forward simulation model cannot be mathematically identified (this is referred to as an "over-parameterized problem"). Though awkward, labor intensive, and not entirely satisfactory in its current state of development, the calibrated building energy simulation model is still an attractive option and has its place in the toolkit of data analysis methods (discussed at length in Sect. 10.6). The fundamental difficulty is that there is no general and widely used model or software for dealing with data-driven applications as they apply to building energy; only specialized software programs have been developed, which allow certain types of narrow analysis to be performed. In fact, given the wide diversity in applications of data-driven models, it is unlikely that any one methodology or software program will ever suffice.
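The idea of "tuning the inputs" can be illustrated with a deliberately tiny sketch: a one-parameter forward model (a hypothetical heat-loss coefficient UA, with made-up "measured" data) calibrated by minimizing the sum of squared errors over a grid of candidate values:

```python
# Hypothetical measurements: (temperature difference in K, heat load in W).
measured = [(5.0, 100.0), (10.0, 205.0), (15.0, 295.0)]

def forward_model(ua, dT):
    """Simplistic forward model: load = UA * dT."""
    return ua * dT

def sse(ua):
    """Sum of squared errors between model predictions and measurements."""
    return sum((forward_model(ua, dT) - q) ** 2 for dT, q in measured)

# Grid search over candidate UA values (0.1 to 50.0 W/K in steps of 0.1).
best_ua = min((k / 10 for k in range(1, 501)), key=sse)
```

Real calibrated simulation involves dozens of interacting inputs, which is precisely why it becomes an over-parameterized problem; this sketch only shows the principle for a single identifiable parameter.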
This leads to the basic premise of this book: there exists a crucial need for building energy professionals to be familiar with, and competent in, a wide range of data analysis methods and tools, so that they can select the one that best meets their purpose, with the end result that buildings will be operated and managed much more energy-efficiently than they are currently. Building design simulation tools have played a significant role in lowering energy use in buildings. These are necessary tools, and their importance should not be understated. Historically, most of the business revenue in architectural engineering and HVAC&R firms was generated from design/build contracts, which required extensive use of simulation and design software programs. Hence, the professional community is fairly knowledgeable in this area, and several universities teach classes geared toward the use of building energy modeling (BEM) simulation programs.
The last 40 years or so have seen a dramatic increase in building energy services, as evidenced by the number of firms that offer services in this area. The acquisition of the required understanding, skills, and tools relevant to this aspect differs from that required for building design. Other market forces are also at play. The recent interest in "green" and "sustainable" products has resulted in a plethora of products and practices aggressively marketed by numerous companies. Often, claims that one product can save much more energy than another, or that one device is more environmentally friendly than others, prove, unfortunately, to be unfounded under closer scrutiny. Unbiased evaluation and independent verification of such claims are imperative; otherwise, the whole "green" movement may degrade into mere "greenwashing" rather than overcoming a dire societal challenge. A sound understanding of applied data analysis is imperative for this purpose, and future science and engineering graduates have an important role to play. Thus, the raison d'être of this book is to provide a general introduction and a broad foundation to the mathematical, statistical, and modeling aspects of data analysis methods.
1.2 System Models

1.2.1 What Is a System Model?
A system is the object under study, which could be as simple or as complex as one may wish to consider. It is any ordered, interrelated set of things, and their attributes. A model is a construct that allows one to represent the reallife system so that it can be used to predict the future behavior of the system under various “what–if” scenarios. The construct could be a scaled down physical version of the actual system (widely adopted historically in engineering) or a mental construct. The development of a model is not the ultimate objective; in other words, it is not an end by itself. It is a means to an end, the end being a credible means to make decisions that could involve systemspeciﬁc issues (such as gaining insights about inﬂuential drivers and system dynamics, or predicting system behavior, or determining optimal control conditions) as well as those involving a broader context (such as operation management, deciding on policy measures and planning, etc.).
1.2.2 Types of Models
One differentiates between different types of models (Fig. 1.1): (a) Abstract models can be: (i) Conceptual (or qualitative or descriptive models), where the system’s behavior is summarized in
Fig. 1.1 Different types of models (hierarchy: Physical models divide into Scale and Analogue; Abstract models into Conceptual and Mathematical; Mathematical models into Empirical and Physics-based; Physics-based models into Analytical and Numerical)
non-analytical ways because only general qualitative trends of the system are known. They are mental/intuitive abstract constructs that capture subjective/qualitative behavior or expectations of how something works based on prior experience. Such models are primarily used as an aid to thought or communication. (ii) Mathematical models, which capture system response using mathematical equations; these are further discussed below. (b) Physical models can be either: (i) Scaled-down (or up) physical constructs, whose characteristics resemble those of the physical system being studied. They often supported and guided the work of earlier scientists and engineers and are still extensively used for validating mathematical models (such as architectural daylight experiments in test rooms), or (ii) Analogue models, which are actual physical setups meant to reproduce the physics of the systems and which allow secondary measurements of the system to be made (flow, energy, etc.). (c) Mathematical models can be further subdivided into (see Fig. 1.1): (i) Empirical models, which are abstract models based on observed/monitored data with a loose mathematical structure. They capture general qualitative trends of the system based on data describing properties of the system, summarized in a graph, a table, or a curve fit to observation points. Such models presume some knowledge of the fundamental quantitative trends but lack accurate understanding. Examples include models of econometric, medical, sociological, and anthropological behavior. (ii) Physics-based mechanistic models (or structural models), which use metric or count data and are based on mathematical relationships derived from physical laws such as Newton's laws, the laws of thermodynamics and heat transfer, etc. Such
models can be used for prediction (during system design) or for proper system operation and control (involving data analysis). This group of models can be classified into: • Analytical, for which closed-form mathematical solutions exist for the equation or set of equations. • Numerical, which require numerical procedures to solve the equation(s). Alternatively, mathematical models can be considered to be: • Exact structural models, where the equation is thought to apply rigorously, i.e., the relationship between variables and parameters in the model is exact, or as close to exact as the current state of scientific understanding permits. • Inexact structural models, where the equation applies only approximately, either because the process is not fully known or because one chose to simplify the exact model so as to make it more usable. A typical example is the dose–response model, which characterizes the relation between the amount of toxic agent imbibed by an individual and the incidence of adverse health effects.
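The analytical/numerical distinction can be made concrete with the first-order response equation dT/dt = -(T - T_inf)/tau (of the kind used later for sensor response; the parameter values here are arbitrary): the closed-form exponential solution versus a forward-Euler numerical solution of the same equation.

```python
import math

tau, T0, T_inf = 10.0, 100.0, 20.0  # time constant (s), initial and ambient temps

def analytical(t):
    """Closed-form (analytical) solution of dT/dt = -(T - T_inf)/tau."""
    return T_inf + (T0 - T_inf) * math.exp(-t / tau)

def euler(t, dt=0.01):
    """Forward-Euler (numerical) solution of the same equation."""
    T = T0
    for _ in range(int(t / dt)):
        T += dt * (-(T - T_inf) / tau)
    return T
```

For this simple equation the numerical route is unnecessary, but for most coupled or nonlinear physics-based models no closed form exists and only the numerical approach remains.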
1.3 Types of Data

1.3.1 Classification
Data2 can be classified in different ways. One classification scheme is to distinguish between experimental data, gathered under controlled test conditions where the observer can perform tests in the manner or sequence intended, and

2 Several authors make a strict distinction between "data," which is plural, and "datum," which is singular and implies a single data point. No such distinction is made throughout this book, and the word "data" is used to designate either.
Fig. 1.2 The rolling of a die is an example of discrete data, where the data can only assume whole numbers. Even if the die is fair, one would not expect, out of 60 throws, the numbers 1 through 6 to appear exactly 10 times each, but only approximately so
observational data, collected while the system is under normal operation or when the system cannot be controlled (as in astronomy). Another classification scheme is based on the type of data: (a) Categorical/qualitative data, which involve nonnumerical descriptive measures or attributes, such as belonging to one of several categories. One can further distinguish between: (i) Nominal (or unordered) data, consisting of attribute data with no rank, such as male/female, yes/no, married/unmarried, eye color, engineering major, etc. (ii) Ordinal data, i.e., data that has some order or rank, such as a building envelope that is leaky, medium, or tight, or a day that is hot, mild, or cold. Such data can be converted into an arbitrary rank order by simple numerical coding (e.g., hot/mild/cold days can be ranked as 1/2/3). Such rank-ordered data can be manipulated arithmetically to some extent. (b) Numerical/quantitative data, i.e., data obtained from measurements of such quantities as time, weight, and height. Further, there are two different kinds: (i) Count or discrete data, which can take on only a finite or countable number of values. An example is the data series one would expect from rolling a die 60 times (Fig. 1.2). (ii) Continuous or metric data, involving measurements of time, weight, height, energy, or other quantities. Such data may take on any value in an interval (most metric data is continuous, and hence is not countable); for example, the daily average outdoor dry-bulb temperature in Philadelphia, PA, over a year (Fig. 1.3). Further, one can distinguish between: – Data measured on an interval scale, which has an arbitrary zero point (such as the Celsius scale), so that only differences between values are meaningful. – Data measured on a ratio scale, which has a zero point that cannot be arbitrarily changed (such as mass, volume, or temperature in kelvin); both differences and ratios are meaningful.
Fig. 1.3 Continuous data separated into a large number of bins (in this case, 300) resulted in the above histogram of the hourly outdoor drybulb temperature (in °F) in Philadelphia, PA, over a year. A smoother distribution would have resulted if a smaller number of bins had been selected
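The discrete-data example of Fig. 1.2 is easy to reproduce by simulation; a sketch of 60 throws of a fair die (the random seed is arbitrary, chosen only for repeatability):

```python
import random

random.seed(1)  # arbitrary seed, for repeatability
rolls = [random.randint(1, 6) for _ in range(60)]
counts = {face: rolls.count(face) for face in range(1, 7)}
# Each face appears roughly, but in general not exactly, 10 times.
```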
For data analysis purposes, it is often important to view data based on their dimensionality, i.e., the number of axes needed to graphically present the data. A univariate data set consists of observations of a single variable, a bivariate set of two variables, and a multivariate set of more than two variables. A fourth type of distinction between data types is by the source or origin of the data: (a) Population is the collection or set of all individuals (or items, or characteristics) representing the same quantity, with a connotation of completeness, i.e., the entire group of items being studied, whether they be the freshman student body of a university, instrument readings of a test quantity, or points on a curve. (b) Sample is a portion or limited number of items from a population from which information or readings are collected. There are again two types of samples: – Single-sample is a single reading or a succession of readings taken at the same time, or at different times but under identical conditions. – Multi-sample is a repeated measurement of a fixed quantity using altered test conditions, such as different observers or different instruments or both. Many experiments may appear to yield multi-sample data but are actually single-sample. For example, if the same instrument is used for data collection at different times, the data should be regarded as single-sample, not multi-sample. One can differentiate between different types of multi-sample data. Consider the case of solar thermal collector testing (as described in Pr. 5.7 of Chap. 5). In essence, the collector is subjected to different inlet fluid temperature levels
Fig. 1.4 Example of multi-sample data in the framework of a "round-robin" experiment of testing the same solar thermal collector at six different test facilities (shown by different symbols) following the same testing methodology. The test data is used to determine and plot the collector efficiency versus the reduced temperature, along with uncertainty bands (see Pr. 5.7 for nomenclature). (Streed et al. 1979)
under different values of incident solar radiation and ambient air temperature, using an experimental facility with instrumentation of prespecified accuracy. The test results are processed according to certain performance models and plotted as collector efficiency versus reduced temperature level. The test protocol involves performing replicate tests at similar reduced temperature levels, and this is one type of multi-sample data. Another type of multi-sample data arises when the same collector is tested at different test facilities nationwide. The results of such a "round-robin" test are shown in Fig. 1.4, where one detects variations around the trend line given by the performance model; these can be attributed to differences in test facility and instrumentation, and to slight variations in how the test protocols were implemented at the different facilities. (c) Two-stage experiments are successive staged experiments where the chance results of the first stage determine the conditions under which the next stage will be carried out. For example, when checking the quality of a lot of mass-produced articles, it is frequently possible to decrease the average sample size by carrying out the inspection in two stages. One may first take a small sample and accept the lot if all articles in the sample are satisfactory; otherwise, a larger second sample is inspected. Finally, one needs to distinguish between: (i) a duplicate, which is a separate specimen taken from the same source as the first specimen and tested at the same time and in the same manner, and (ii) a replicate, which is the same specimen tested again at a different time. Thus, while duplication allows one to test samples until they are destroyed (such as tensile strength testing of an iron specimen), replicate testing stops short of doing permanent damage to the samples.
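The two-stage inspection scheme in (c) can be sketched by simulation; the defect rate and sample sizes below are invented for illustration:

```python
import random
import statistics

random.seed(7)  # arbitrary seed, for repeatability

def items_inspected(defect_rate, n1=5, n2=50):
    """Two-stage plan: accept the lot if the small first sample is all
    good; otherwise inspect a larger second sample."""
    first_sample_bad = any(random.random() < defect_rate for _ in range(n1))
    return n1 + n2 if first_sample_bad else n1

# Average number of items inspected per lot at a 2% defect rate;
# analytically this is 5 + (1 - 0.98**5) * 50, i.e., close to 9.8,
# far fewer than inspecting 55 items every time.
avg = statistics.fmean(items_inspected(0.02) for _ in range(5000))
```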
1.3.2 Types of Uncertainty in Data
If the same results are obtained when an experiment is repeated under the same conditions, one says that the experiment is deterministic. It is this deterministic nature of science that allows theories or models to be formulated and permits the use of scientific theory for prediction (Hodges and Lehman 1970). However, all observational or experimental data invariably contain a certain amount of inherent noise or randomness, which introduces a certain degree of uncertainty in the results or conclusions. Due to the instrument or measurement technique, improper understanding of all influential factors, or the inability to measure some of the driving parameters, random and/or bias errors usually infect deterministic data. However, there are also experiments whose results vary due to the very nature of the experiment; for example, gambling outcomes (throwing of dice, card games, etc.). These are called random experiments. Without uncertainty or randomness, there would be little need for statistics. Probability theory and inferential statistics have been largely developed to deal with random experiments, and the same approach has also been adapted to the analysis of deterministic experimental data. Both inferential statistics and stochastic model building have to deal with the random nature of observational or experimental data, and thus require knowledge of probability. There are several types of uncertainty in data, and all of them have to do with the inability to determine the true state of affairs of a system (Haimes 1998). A succinct classification involves the following sources of uncertainty: (a) Purely stochastic variability (or aleatory uncertainty), where the ambiguity in outcome is inherent in the nature of the process, and no amount of additional measurements can reduce the inherent randomness.
Common examples involve coin tossing or card games. These processes are inherently random (on either a temporal or a spatial basis), and their outcome, while uncertain, can be anticipated on a statistical basis. (b) Epistemic uncertainty, i.e., ignorance or lack of complete knowledge of the process, which results in certain influential variables not being considered (and, thus, not measured). (c) Inaccurate measurement of numerical data due to instrument or sampling errors. (d) Cognitive vagueness involving human linguistic description. For example, people use words like tall/short or very important/not important, which cannot be quantified exactly. This type of uncertainty is generally associated with qualitative and ordinal data where subjective elements come into play. The traditional approach is to use probability theory along with statistical techniques to address uncertainties of types (a), (b), and (c). The variability due to sources (b) and (c) can be diminished by taking additional measurements, by using more accurate instrumentation, by better experimental design, and by acquiring better insight into specific behavior with which to develop more accurate models. Several authors apply the term "uncertainty" to only these two sources. Finally, source (d) can be modeled using probability approaches, though some authors argue that it is more convenient and appropriate to use fuzzy logic to model such vagueness in human speech.
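The contrast between source (c) and source (a) can be demonstrated numerically: averaging N repeated measurements shrinks the spread of the estimated mean by roughly 1/sqrt(N), which is why additional measurements help against measurement error but cannot remove the inherent randomness of, say, a single coin toss. The sensor value and noise level below are illustrative.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for repeatability
true_value, sigma = 25.0, 0.5  # illustrative reading and measurement-noise std

def mean_of_n(n):
    """Average of n noisy measurements of the same fixed quantity."""
    return statistics.fmean(random.gauss(true_value, sigma) for _ in range(n))

# Spread of the estimated mean for N = 10 vs. N = 100 measurements:
spread_10 = statistics.stdev(mean_of_n(10) for _ in range(2000))
spread_100 = statistics.stdev(mean_of_n(100) for _ in range(2000))
# spread_10 / spread_100 is close to sqrt(10), i.e., roughly 3.
```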
1.4 Mathematical Models

1.4.1 Basic Terminology
One can envision two different types of systems: open systems, in which energy and/or matter flows into and out of the system, and closed systems, in which matter is not exchanged with the environment but energy flows can be present. A system model is a description of the system. Empirical and mechanistic models are made up of three components: (i) Input variables (also referred to as regressor, forcing, exciting, covariate, exogenous, or independent variables in the engineering, statistical, and econometric literature), which act on the system. Note that there are two types of such variables: those controllable by the experimenter, and uncontrollable or extraneous variables, such as climatic variables. (ii) System structure and parameters/properties, which provide the necessary mathematical description of the
systems in terms of physical and material constants; for example, thermal mass, overall heat transfer coefficients, or mechanical properties of elements. (iii) Output variables (also called response, state, endogenous, or dependent variables), which describe the system response to the input variables. A structural model of a system is a mathematical relationship between one or several input variables and parameters and one or several output variables. Its primary purpose is to allow better physical understanding of the phenomenon or process or, alternatively, to allow accurate prediction of system response. This is useful for several purposes: for example, preventing adverse phenomena from occurring, designing (or optimizing) the system properly, or improving system performance by evaluating modifications to the system. A satisfactory mathematical model is subject to two contradictory requirements (Edwards and Penney 1996): it must be sufficiently detailed to represent the phenomenon it is attempting to explain or capture, yet it must be sufficiently simple to make the mathematical analysis practical. This requires judgment and experience of the modeler, backed by experimentation and validation.3
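The three components just listed can be seen together in a deliberately minimal lumped thermal model of a room, C dTi/dt = Q + UA (To - Ti): inputs (outdoor temperature To and internal gains Q), parameters (loss coefficient UA and thermal capacitance C), and output (indoor temperature Ti). All numerical values here are invented for illustration.

```python
UA = 250.0  # parameter: overall heat loss coefficient (W/K), illustrative
C = 5.0e6   # parameter: lumped thermal capacitance (J/K), illustrative

def step(Ti, To, Q, dt=60.0):
    """One explicit time step of C*dTi/dt = Q + UA*(To - Ti).
    Inputs: To (outdoor temp, degC) and Q (internal gains, W).
    Output: the updated indoor temperature Ti (degC)."""
    return Ti + dt * (Q + UA * (To - Ti)) / C

Ti = 20.0                    # initial indoor temperature (degC)
for _ in range(60):          # simulate one hour at 1-min steps
    Ti = step(Ti, To=0.0, Q=3000.0)
# Ti relaxes toward the steady state To + Q/UA = 12 degC.
```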
1.4.2 Block Diagrams
An information flow or block diagram4 is a standard shorthand manner of schematically representing the input and output quantities of an element or a system, as well as the computational sequence of variables. It is a concept widely used in the context of system modeling and simulation, since a block implies that its output can be calculated provided the inputs are known. Block diagrams are useful for setting up the set of model equations to solve in order to simulate or analyze systems or components. As illustrated in Fig. 1.5, a centrifugal pump could be represented by one of many possible block diagrams (as shown in Fig. 1.6), depending on which parameters are of interest. If the model equation is cast in a form such that the outlet pressure p2 is the response variable and the inlet pressure p1 and the volumetric fluid flow rate v are the forcing variables, then the associated block diagram is that shown in Fig. 1.6a. Another type of block diagram is shown in Fig. 1.6b, where the flow rate v is the response variable. The arrows indicate the direction of unilateral information or signal flow, which in turn can be viewed as a cause–effect
3 Validation is defined as the process of bringing the user's confidence about the model to an acceptable level, either by comparing its performance to other more accepted models or by experimentation.
4 Block diagrams should not be confused with material flow diagrams, which for a given system configuration are unique. On the other hand, there can be numerous ways of assembling block diagrams depending on how the problem is framed.
8
1
relationship; this is why these models are termed causal. Thus, such diagrams depict the manner in which the simulation models of the various components of a system need to be formulated. In general, a system or process is subject to one or more inputs (or stimulus or excitation or forcing functions) to which it responds by producing one or more outputs (or system response). If the observer is unable to act on the system, i.e., change some or any of the inputs, so as to produce a desired output, the system is not amenable to
1 Mathematical Models and Data Analysis

Fig. 1.5 Schematic of a centrifugal pump rotating at speed s (say, in rpm), which pumps a water flow rate v from lower pressure p1 to higher pressure p2

Fig. 1.6 Different block diagrams for modeling a pump depending on how the problem is formulated: (a) with outlet pressure p2 as the response variable, (b) with flow rate v as the response variable, and (c) with a feedback loop and summing point acting on the speed signal s
control. If, however, the inputs can be varied, then control is feasible. Thus, a control system is defined as an arrangement of physical components connected or related in such a manner as to command, direct, or regulate itself or another system (Stubberud et al. 1994). One needs to distinguish between open and closed loops, and block diagrams provide a convenient way of doing so. (a) An open loop control system is one in which the control action is independent of the output (see Fig. 1.7a). Two important features are: (i) its ability to perform accurately is determined by its calibration, i.e., by how accurately one is able to establish the input–output relationship; and (ii) such systems are generally not unstable. A practical example is an automatic toaster, which is simply controlled by a timer. If the behavior of an open loop system is not completely understood, or if unexpected disturbances act on it, then there may be considerable and unpredictable variations in the output. (b) A closed loop control system, also referred to as a feedback control system, is one in which the control action is somehow dependent on the output (Fig. 1.7b). If the value of the response y(t) is too low or too high, the control action modifies the manipulated variable (shown as u(t)) appropriately. Such systems are designed to cope with a lack of exact knowledge of system behavior, inaccurate component models, and unexpected disturbances. Thus, increased accuracy is achieved by reducing the sensitivity of the output-to-input ratio to variations in system characteristics (i.e., increased bandwidth, defined as the range of variation in the inputs over which the system will respond satisfactorily) or to random perturbations of the system by the environment. Closed loop systems have a serious disadvantage though: they can
Fig. 1.7 Open and closed loop systems for a controlled output y(t). (a) Open loop: the input x(t) drives the system directly to produce y(t). (b) Closed loop: a feedback element returns the controlled output y(t) to a summing point at the input x(t), and a control element adjusts the manipulated variable u(t) fed to the system, which is also subject to a disturbance
1.4 Mathematical Models
inadvertently develop unstable oscillations. This issue is an important one by itself and is treated extensively in control textbooks. Using the same example of a centrifugal pump but going one step further would lead us to the control of the pump. For example, if the inlet pressure p1 is speciﬁed, and the pump needs to be operated or controlled (i.e., say by varying its rotational speed s) under variable outlet pressure p2 so as to maintain a constant ﬂuid ﬂow rate v, then some sort of control mechanism or feedback is often used (shown in Fig. 1.6c). The small circle at the intersection of the signal s and the feedback represents a summing point that denotes the algebraic operation being carried out. For example, if the feedback signal is summed with the signal s, a “+” sign is placed just outside the summing point. Such graphical representations are called signal ﬂow diagrams and are used in process or system control, which requires inverse modeling and parameter estimation.
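The closed-loop idea of Fig. 1.6c can be sketched numerically. The linear pump characteristic below and all its coefficients are hypothetical assumptions (not taken from the text), chosen only to show a proportional feedback loop holding the flow rate at its setpoint while the outlet pressure varies:

```python
# Minimal sketch of closed-loop pump control: a proportional controller
# adjusts pump speed s to hold the flow rate v at a setpoint while the
# outlet pressure p2 varies. The linear pump model v = k1*s - k2*(p2 - p1)
# and all coefficient values are illustrative assumptions.

def pump_flow(s, p1, p2, k1=0.02, k2=0.05):
    """Hypothetical linear pump characteristic: flow rises with speed,
    falls with the pressure rise the pump must work against."""
    return k1 * s - k2 * (p2 - p1)

def control_pump(v_set, p1, p2, s=1000.0, gain=20.0, steps=200):
    """Discrete proportional feedback: the summing point compares the
    setpoint with the measured flow and corrects the speed."""
    for _ in range(steps):
        v = pump_flow(s, p1, p2)
        s += gain * (v_set - v)   # feedback correction on the speed signal
    return s, pump_flow(s, p1, p2)

s_low, v_low = control_pump(v_set=10.0, p1=100.0, p2=300.0)
s_high, v_high = control_pump(v_set=10.0, p1=100.0, p2=400.0)
# The loop holds v at its setpoint in both cases; a higher outlet
# pressure p2 simply demands a higher steady-state speed s.
```

Note that the loop gain was chosen small enough for the discrete iteration to converge; too high a gain would produce exactly the unstable oscillations mentioned above.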
1.4.3 Mathematical Representation
Let us start by explaining the difference between parameters and variables in a model. A deterministic model is a mathematical relationship, derived from physical considerations, between variables and parameters. The quantities in a model that can be measured independently during an experiment are the "variables," which can be either input or output variables (as described earlier). To formulate the relationship among variables, one usually introduces "constants" that denote inherent properties of nature or of the engineering system, called parameters. Consider the dynamic model of a component or system represented by the block diagram in Fig. 1.8. For simplicity, let us assume a linear model with no lagged terms in the forcing variables. Then, the model can be represented in matrix form as:

Yt = A Yt−1 + B Ut + C Wt   with   Y1 = d   (1.1)

Fig. 1.8 Block diagram of a simple component with parameter vectors {A, B, C}. Vectors U and W are the controllable/observable and the uncontrollable/disturbing inputs, respectively, while Y is the state variable or system response

where the output or state variable at time t is Yt. The forcing or input variables are of two types: vector U denoting observable and controllable input variables, and vector W denoting uncontrollable input variables or disturbing inputs that may or may not be observable. The parameter vectors of the model are {A, B, C}, while d represents the initial condition vector.

Examples of Simple Models

(a) Pressure drop Δp of a fluid flowing at velocity v through a pipe of hydraulic diameter Dh and length L:

Δp = f (L/Dh) (ρ v²/2)   (1.2)

where f is the friction factor and ρ is the density of the fluid. For a given system, v can be viewed as the independent or input variable, while the pressure drop is the state variable. The factors f, L, and Dh are the system or model parameters, and ρ is a property of the fluid. Note that the friction factor f is itself a function of the velocity, which makes the problem a bit more complex. Sometimes the distinction between parameters and variables is ambiguous and depends on the context, i.e., the objective of the study and the manner in which the experiment is performed. For example, in Eq. 1.2, the pipe length has been taken to be a fixed system parameter since the intention was to study pressure drop against fluid velocity. If, however, the objective were to determine the effect of pipe length on pressure drop for a fixed velocity, the length would then be viewed as the independent variable.

(b) Rate of heat transfer from a fluid to a surrounding solid:

Q̇ = UA (Tf − To)   (1.3)

where the parameter UA is the overall heat conductance, and Tf and To are the mean fluid and solid temperatures (which are the input variables).

(c) Rate of heat added to a flowing fluid:

Q̇ = ṁ cp (Tout − Tin)   (1.4)

where ṁ is the fluid mass flow rate, cp is its specific heat at constant pressure, and Tout and Tin are the exit and inlet fluid temperatures. It is left to the reader to identify the input variables, state variables, and model parameters.

(d) Lumped model of the water temperature Ts in a storage tank with an immersed heating element, losing heat to the environment, given by the first-order ordinary differential equation (ODE):

M cp (dTs/dt) = P − UA (Ts − Tenv)   (1.5)

where M cp is the thermal heat capacitance of the tank (water plus tank material), Tenv is the environment temperature, and P is the auxiliary power (or heat rate) supplied to the tank. It is left to the reader to identify the input variables, state variables, and parameters.
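As a concrete illustration of model (d), Eq. 1.5 can be integrated numerically. The sketch below uses a simple explicit Euler scheme; the tank capacitance, heater power, and loss conductance values are illustrative assumptions, not taken from the text:

```python
# Numerical sketch of the storage-tank model of Eq. 1.5,
#   M*cp * dTs/dt = P - UA*(Ts - Tenv),
# integrated with an explicit Euler scheme. All numerical values
# (tank size, heater power, UA) are illustrative assumptions.

def simulate_tank(Ts0=20.0, Tenv=20.0, P=3.0, UA=0.05, Mcp=800.0,
                  dt=10.0, t_end=200_000.0):
    """Return the tank temperature trajectory over time.
    Ts0, Tenv in degC; P in kW; UA in kW/K; Mcp in kJ/K; dt, t_end in s."""
    Ts, traj = Ts0, [Ts0]
    for _ in range(int(t_end / dt)):
        dTs_dt = (P - UA * (Ts - Tenv)) / Mcp   # Eq. 1.5 rearranged
        Ts += dt * dTs_dt                       # explicit Euler step
        traj.append(Ts)
    return traj

traj = simulate_tank()
# The temperature rises toward the steady state Ts = Tenv + P/UA,
# where heater input exactly balances the loss to the environment.
T_steady = 20.0 + 3.0 / 0.05   # = 80 degC for the assumed values
```

The time constant of this first-order system is M cp/UA (here 16,000 s), which is why the simulated horizon was taken as many multiples of it.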
1.4.4 Classification
Predicting the behavior of a system requires a mathematical representation of the system components. The process of deciding on the level of detail appropriate for the problem at hand is called abstraction (Cha et al. 2000). This process has to be undertaken with care: (i) oversimplification may result in a loss of important system behavior predictability, while (ii) an overly detailed model may demand undue data-collection effort and computational resources, as well as time spent in understanding the model assumptions and the results generated. There are different ways by which mathematical models can be classified. Some of these are shown in Table 1.1 and described below.

(i) Distributed Versus Lumped Parameter

In a distributed parameter system, the elements of the system are continuously distributed along the system geometry, so that the variables they influence must be treated as differing not only in time but also in space, i.e., from point to point. Partial differential or difference equations are usually needed. Recall that a partial differential equation (PDE) is a differential equation involving partial derivatives of an unknown function with respect to at least two independent variables. One distinguishes between two general cases:

• The independent variables are space variables only.
• The independent variables are both space and time variables.

Though partial derivatives of multivariable functions are ordinary derivatives with respect to one variable (the others being kept constant), the study of PDEs is not an easy extension of the theory for ordinary differential equations (ODEs). The solution of PDEs requires fundamentally different approaches. Recall that ODEs are solved by first finding general solutions and then using subsidiary conditions to determine arbitrary constants. However, such arbitrary constants in general solutions of ODEs are replaced by arbitrary functions in PDEs, and determining these arbitrary functions from subsidiary conditions is usually impossible. In other words, general solutions of ODEs are of limited use in solving PDEs. In general, the solution of a PDE and its subsidiary conditions (called initial or boundary conditions) needs to be determined simultaneously. Hence, it is wise to simplify a PDE model as far as possible when dealing with data analysis situations.

In a lumped parameter system, the elements are small enough (or the objective of the analysis is such that simplification is warranted) that each element can be treated as if it were concentrated (i.e., lumped) at one particular spatial point in the system. The position of the point can change with time but not in space. Such systems are usually adequately modeled by ODEs or difference equations. A heated billet cooling in air could be analyzed as either a distributed or a lumped parameter system, depending on whether the Biot number (Bi) is greater than or less than 0.1 (Fig. 1.9). Recall that the Biot number is proportional to the ratio of the internal to the external heat flow resistances of the body. So, a small Biot number would imply that the resistance to heat flow attributed to the internal body temperature
Table 1.1 Ways of classifying mathematical models

Different classification schemes:
1. Distributed vs lumped parameter
2. Dynamic vs static or steady-state
3. Deterministic vs stochastic
4. Continuous vs discrete
5. Linear vs nonlinear in the functional model
6. Linear vs nonlinear in the model parameters
7. Time invariant vs time variant
8. Homogeneous vs non-homogeneous
9. Simulation vs performance models
10. Physics based (white-box) vs data based (black-box) and mix of both (gray-box)

Adapted from Eisen (1988)
Fig. 1.9 Cooling of a solid sphere in air can be modeled as a lumped model provided the Biot number Bi < 0.1. This number is proportional to the ratio of the heat conductive resistance (1/k) inside the sphere to the convective resistance (1/h) from the outer envelope of the sphere to the air
Fig. 1.10 Thermal networks to model heat flow through a homogeneous plane wall of surface area A and wall thickness Δx. (a) Schematic of the wall with the indoor and outdoor temperatures and convective heat flow coefficients. (b) Lumped model with two resistances and one capacitance (2R1C model). (c) Higher nth order model with n layers of equal thickness (Δx/n). The numerical discretization assumes all capacitances to be equal, while only the (n − 2) internal resistances (excluding the two end resistances) are taken to be equal. (From Reddy et al. 2016)
gradient is small enough that it can be neglected without biasing the analysis. Thus, a small body with high thermal conductivity and a low convection coefficient can be adequately modeled as a lumped system. Another example of lumped model representation is the 1-D heat flow through the wall of a building (Fig. 1.10a), using the analogy between heat flow and electricity flow. The internal and external convective film heat transfer coefficients are represented by hi and ho, respectively, while k, ρ, and cp are the thermal conductivity, density, and specific heat of the wall material, respectively. In the lower limit, the wall can be discretized into one lumped layer of capacitance C with two resistors, as shown by the electric network of Fig. 1.10b (referred to as a 2R1C network). In the upper limit, the network can be represented by "n" nodes (see Fig. 1.10c). The 2R1C simplification does lead to some errors, which under certain circumstances are outweighed by the convenience it provides while yielding acceptable results.

(ii) Dynamic Versus Steady-State

Dynamic models are defined as those that allow transient system or equipment behavior to be captured, with explicit recognition of the time-varying behavior of both output and input variables. The steady-state or static or zero-order model is one that assumes no time variation in its input variables (and hence no change in the output variable either). One can also distinguish an intermediate type, referred to as quasi-static models. Cases arise when the input variables (such as incident solar radiation on a solar hot water panel) are constantly changing at a short time scale (say, at the minute scale), while the thermal output needs to be predicted at hourly intervals only. The dynamic behavior is poorly predicted by the solar collector model at such high-frequency time scales, and so the input variables can be "time-averaged" to make them constant during a specific hourly interval. This is akin to introducing a "low-pass filter" for the inputs. Thus, the use of quasi-static models allows one to predict the system output(s) in discrete time steps or intervals during a given day, with the system inputs averaged (or summed) over each of the time intervals fed into the model. These models could be either zero-order or low-order ODEs. Dynamic models are usually represented by PDEs, or by ODEs when spatially lumped. One could solve them directly, and the simple cases are illustrated in Sect. 1.4.5. Since solving these equations gets harder as the order of the model increases, it is often more convenient to recast the differential equations in a time-series formulation using response functions or transfer functions, which involve time-lagged values of the input variable(s) only, or of both the inputs and the response, respectively. This formulation is discussed in Chap. 8. The time-series formulation of a steady-state or zero-order model results in simple algebraic equations with no time-lagged values of the input variable(s) appearing in the function.
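The time-averaging step behind quasi-static models can be sketched as follows. The minute-scale irradiance signal and the constant-efficiency collector model below are synthetic assumptions, used only to illustrate the "low-pass filter" idea:

```python
import math

# Sketch of the quasi-static idea: a rapidly fluctuating input (synthetic
# minute-scale solar irradiance) is averaged over each hourly interval,
# and the hourly averages are fed to a static model. The collector model
# q = eta * G with constant efficiency eta is an illustrative assumption.

def minute_radiation(minute):
    """Synthetic minute-scale irradiance [W/m2]: mean level + fast ripple."""
    return 600.0 + 100.0 * math.sin(2 * math.pi * minute / 20.0)

g = [minute_radiation(m) for m in range(24 * 60)]

# "Low-pass filter": replace the minute signal by its hourly averages.
hourly_avg = [sum(g[h * 60:(h + 1) * 60]) / 60.0 for h in range(24)]

eta = 0.7                                   # assumed constant efficiency
q_hourly = [eta * G for G in hourly_avg]    # quasi-static hourly output
```

Because the assumed 20-minute ripple averages out over each hour, the hourly inputs are effectively constant, which is exactly the condition under which a zero- or low-order model is adequate.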
(iii) Deterministic Versus Stochastic

A deterministic system is one whose response to specified inputs under specified conditions is completely predictable (to within a certain accuracy, of course) from physical laws; the response is precisely reproducible time and time again. A stochastic system is one whose specific output can be predicted only to within an uncertainty range, which could be due to two reasons: (i) the inputs themselves are random and vary unpredictably within a specified range of values (such as the electric power output of a wind turbine subject to gusting winds), and/or (ii) the models are not accurate (e.g., the dose–response of individuals subject to asbestos inhalation). Concepts from probability theory are required to make predictions about the response. Most observed data have some stochasticity in them, either due to measurement noise/errors or due to the random nature of the process itself. If the random element is so small that it is negligible, then the process or system can be treated in a purely deterministic framework. The orbits of the planets, though well described by Kepler's laws, have small disturbances due to other secondary effects, but Newton was able to treat them as deterministic and verify his law of gravitation. On the other hand, Brownian molecular motion is purely random and has to be treated by stochastic methods.

(iv) Continuous Versus Discrete

A continuous system is one in which all the essential variables are continuous in nature and the time over which the system operates is some interval (or intervals) of the real numbers. Usually such systems need differential equations to describe them. A discrete system is one in which all essential variables are discrete and the time over which the system operates is a finite subset of the real numbers. Such a system can be described by difference equations.
In most engineering applications, the system or process being studied is fundamentally continuous. However, the continuous output signal from a system is usually converted into a discrete signal by sampling. Alternatively, the continuous system can be replaced by its discrete analog which, of course, has a discrete signal. Hence, the analysis of discrete data is the more widespread situation in data analysis applications.
Fig. 1.11 Principle of superposition of a linear system: if input x1 produces output y1 and input x2 produces output y2, then the combined input c1x1 + c2x2 produces the output c1y1 + c2y2
(v) Linear Versus Nonlinear

A system is said to be linear if, and only if, it has the following property: if an input x1(t) produces an output y1(t), and if an input x2(t) produces an output y2(t), then an input [c1 x1(t) + c2 x2(t)] produces an output [c1 y1(t) + c2 y2(t)] for all pairs of inputs x1(t) and x2(t) and all pairs of real number constants c1 and c2. This concept is illustrated in Fig. 1.11. An equivalent concept is the principle of superposition, which states that the response of a linear system due to several inputs acting simultaneously is equal to the sum of the responses of each input acting alone. This is an extremely important concept since it allows the response of a complex system to be determined more simply by decomposing the input driving function into simpler terms, solving the equation for each term separately, and then summing the individual responses to obtain the desired aggregated response. Such a strategy is common in detailed hour-by-hour building energy simulation programs (Reddy et al. 2016). An important distinction needs to be made between a linear model and a model that is linear in its parameters. For example:

• y = ax1 + bx2 is linear in both the model and the parameters a and b.
• y = a sin x1 + bx2 is a nonlinear model but is linear in its parameters.
• y = a exp(bx1) is nonlinear in both the model and the parameters.

In all fields, linear differential or difference equations are by far more widely used than nonlinear equations. Even if the models are nonlinear, every attempt is made, due to the subsequent convenience it provides, to make them linear, either by a suitable transformation (such as a logarithmic transform) or by piecewise linearization, i.e., linear approximation over a smaller range of variation. The advantages of linear systems over nonlinear systems are many:

• Linear systems are simpler to analyze.
• General theories are available to analyze them.
• They do not have singular solutions (simpler engineering problems rarely have them anyway).
• Well-established methods are available, such as the state space approach (see Sect. 10.4.4), for analyzing even relatively complex sets of equations. The practical advantage of this type of time domain transformation is that large systems of higher-order ODEs can be transformed into a first-order system of simultaneous equations that, in turn, can be solved rather easily by numerical methods.
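The superposition property underlying these advantages can be checked numerically. The first-order system and the particular inputs below are illustrative choices, not taken from the text:

```python
# Numerical check of the superposition principle for a linear first-order
# system tau*y' + y = u, discretized with an explicit Euler step. Because
# every update is linear in u and y, the response to c1*u1 + c2*u2 equals
# c1*response(u1) + c2*response(u2) up to floating-point roundoff.

def response(u, tau=5.0, dt=0.1):
    """Euler-integrated output of tau*y' + y = u(t), with y(0) = 0."""
    y, out = 0.0, []
    for u_t in u:
        y += dt * (u_t - y) / tau
        out.append(y)
    return out

n = 500
u1 = [1.0] * n                        # step input
u2 = [0.01 * t for t in range(n)]     # ramp input
c1, c2 = 2.0, -3.0

combined = response([c1 * a + c2 * b for a, b in zip(u1, u2)])
summed = [c1 * a + c2 * b for a, b in zip(response(u1), response(u2))]
# combined and summed agree element by element (to roundoff), which is
# exactly the decompose-solve-sum strategy described in the text.
```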
(vi) Time Invariant Versus Time Variant

A system is time invariant or stationary if neither the form of the equations characterizing the system nor the model parameters vary with time, under either constant or varying inputs; otherwise, the system is time variant or nonstationary. In some cases, when the model structure is poor and/or when the data are very noisy, time variant models are used, requiring either online or offline updating depending on the frequency of the input forcing functions and how quickly the system responds. Examples of such instances abound in electrical engineering applications. Usually, one tends to encounter time invariant models in less complex thermal and environmental engineering applications.

(vii) Homogeneous Versus Non-Homogeneous

If there are no external inputs and the system behavior is determined entirely by its initial conditions, then the system is called homogeneous or unforced or autonomous; otherwise, it is called non-homogeneous or forced. Consider the general form of an nth order time-invariant or stationary linear ODE:

A y(n) + B y(n−1) + . . . + M y″ + N y′ + O y = P(x)   (1.6)

where y′, y″, and y(n) are the first, second, and nth derivatives of y with respect to x, and A, B, . . ., M, N, and O are constants. The function P(x) frequently corresponds to some external influence on the system and is a function of the independent variable. Often, the independent variable is the time variable t. This is intentional since time comes into play when the dynamic behavior of most physical systems is modeled. However, the variable x can be assigned any other physical quantity as appropriate. To completely specify the problem, i.e., to obtain a unique solution y(x), one needs to specify two additional factors: (i) the interval of x over which a solution is desired, and (ii) a set of n initial conditions. If these conditions are such that y(x) and its (n − 1) derivatives are specified for x = 0, then the problem is called an initial value problem. Thus, one distinguishes between:

(a) The homogeneous form where P(x) = 0, i.e., there is no external driving force. The solution of the differential equation

A y(n) + B y(n−1) + . . . + M y″ + N y′ + O y = 0   (1.7)

yields the free response of the system. The homogeneous solution is a general solution whose arbitrary constants are then evaluated using the initial (or boundary) conditions, thus making it unique to the situation.

(b) The non-homogeneous form where P(x) ≠ 0 and Eq. 1.6 applies. The forced response of the system is associated with the case when all the initial conditions are identically zero, i.e., y(0), y′(0), . . ., y(n−1)(0) are all zero. Thus, the implication is that the forced response is dependent only on the external forcing function P(x).

The total response of the linear time-invariant ODE is the sum of the free response and the forced response (thanks to the superposition principle). When system control is being studied, slightly different terms are often used to specify the total dynamic system response: (a) the steady-state response is that part of the total response that does not approach zero as time approaches infinity, and (b) the transient response is that part of the total response that approaches zero as time approaches infinity.

(viii) Simulation Versus Performance-Based Models

The distinguishing trait between simulation and performance models is the basis on which the model structure is framed (this categorization is quite important). Simulation models are used to predict system performance during the design phase, when no actual system exists and design alternatives are being evaluated. A performance-based model (also referred to as a "stochastic data model") relies on measured performance data of the actual system to provide insights into the model structure and to estimate its parameters, or simply to predict future performance (such models are referred to as "algorithmic models"). Both of these model approaches are discussed in Sects. 1.5 and 1.6.
1.4.5 Steady-State and Dynamic Models

Let us illustrate steady-state and dynamic system responses using the example of measurement sensors. Steady-state models (also called zero-order models) apply when the input variables (and hence the output variables) are maintained constant. A zero-order model for the dynamic performance of measuring systems is used (i) when the variation in the quantity to be measured is very slow compared to how quickly the instrument responds, or (ii) as a standard of comparison for other more sophisticated models. For a zero-order instrument, the output is directly proportional to the input (Doebelin 1995):
a0 qo = b0 qi   (1.8a)

or

qo = K qi   (1.8b)

where a0 and b0 are the system parameters, assumed time invariant, qo and qi are the output and the input quantities, respectively, and K = b0/a0 is called the static sensitivity of the instrument. Hence, only K is required to completely specify the response of the instrument. Thus, the zero-order instrument is an ideal instrument: no matter how rapidly the measured variable changes, the output signal faithfully and instantaneously reproduces the input. The next step in complexity is the first-order model:

a1 (dqo/dt) + a0 qo = b0 qi   (1.9a)

or

τ (dqo/dt) + qo = K qi   (1.9b)

where τ is the time constant of the instrument (τ = a1/a0), and K is the static sensitivity of the instrument. Thus, two numerical parameters are used to completely specify a first-order instrument. The solution to Eq. 1.9b for a step change in input is:

qo(t) = K qis (1 − e^(−t/τ))   (1.10)

where qis is the value of the input quantity after the step change. After a step change in the input, the steady-state value of the output will be K times the input qis (just as in the zero-order instrument). This is shown as a dotted horizontal line in Fig. 1.12 with a numerical value of 20. The time constant characterizes the speed of response: the smaller its value, the faster the response, and vice versa, for any kind of input. Figure 1.12 illustrates the dynamic response and the associated time constants for two instruments subject to a step change in the input. Numerically, the time constant represents the time taken for the response to reach 63.2% of its final change, or to come within 36.8% of the final value. This follows from Eq. 1.10 by setting t = τ, i.e., qo(t)/(K qis) = (1 − e^(−1)) = 0.632. Another useful measure of response speed for any instrument is the 5% settling time, i.e., the time for the output signal to get to within 5% of the final value. For any first-order instrument, it is equal to about three times the time constant.

Fig. 1.12 Step responses of two first-order instruments, one with a small and one with a large time constant, on a plot of instrument reading (y-axis) versus time from the step change in input, in seconds (x-axis). The response is characterized by the time constant, which is the time for the instrument reading to reach 63.2% of the steady-state value
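The quantities just discussed follow directly from Eq. 1.10, as the short sketch below shows; the particular values of K, qis, and τ are illustrative assumptions:

```python
import math

# Sketch of Eq. 1.10 for a first-order instrument: after a step input,
# the reading approaches K*qis with time constant tau. The values
# K = 2, qis = 10, tau = 4 s are illustrative assumptions.

K, qis, tau = 2.0, 10.0, 4.0

def step_response(t):
    """Eq. 1.10: qo(t) = K*qis*(1 - exp(-t/tau))."""
    return K * qis * (1.0 - math.exp(-t / tau))

final = K * qis                          # steady-state value (= 20 here)

frac_at_tau = step_response(tau) / final # fraction of final change at t = tau
# frac_at_tau = 1 - e^-1 = 0.632, the 63.2% figure quoted in the text.

# 5% settling time: the time at which the output is within 5% of the
# final value, i.e., exp(-t/tau) = 0.05, so t = tau*ln(20) ~ 3*tau.
t_settle = tau * math.log(20.0)
```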
1.5 Mathematical Modeling Approaches

1.5.1 Broad Categorization
As shown in Fig. 1.13, one can differentiate between two broad types of mathematical approaches meant for: (a) simulation or forward (or well-defined or well-specified) problems; and (b) inverse problems, which include calibrated forward models and statistical models identified primarily from data, which can be black-box or gray-box. The latter can again be designated as partial gray-box, i.e., inspired by partial understanding, or mechanistic gray-box, i.e., physics-based reduced-order structural models.
1.5.2 Simulation or Forward Modeling

Simulation or forward or white-box or detailed mechanistic models are based on the laws of physics and permit accurate and microscopic modeling of the various fluid flow, heat and mass transfer phenomena, etc., that occur within engineered systems. Students are generally familiar with the algebraic, ordinary, and partial differential equations (ODEs and PDEs, respectively) used to represent the temporal (or transient or dynamic) and spatial performance of numerous equipment, subsystems, and systems, and with how to solve them analytically or numerically (sequentially or iteratively) over different time periods and durations (from minutes to hours, days, and years). A high level of physical understanding is necessary to develop these models, complemented with some
Fig. 1.13 Overview of various traditional mathematical modeling approaches with applications. Mathematical models divide into forward models (simulation-based white-box models, used for system design applications) and inverse models based on performance data. The latter comprise calibrated forward (white-box) models, used for performance prediction of existing systems, and data-driven or statistical models: black-box models (curve fitting), partial gray-box models (mechanistic), and physics-based reduced-order gray-box models (mechanistic), used for understanding, prediction, and control
expertise in numerical analysis. Consequently, these models have found their niche in design studies prior to building a system, where the effect of different system configurations needs to be evaluated under different operating conditions and scenarios. Adopting the model specified by Eq. 1.1 and Fig. 1.8, such problems are framed as:

Given {U, W} and {A, B, C}, determine Y   (1.11)
The objective is to predict the response or state variables of a specified model with known structure and known parameters when subject to specified input or forcing variables. This is also referred to as the "well-defined problem" since it has a unique solution if formulated properly (in other words, the degree of freedom is zero). Such models are implicitly studied in classical mathematics and also in system simulation design courses. For example, consider a simple steady-state problem wherein the operating point of a pump and piping network is represented by black-box models relating the pressure drop (Δp) and the volumetric flow rate (V), as shown in Fig. 1.14:

Δp = a1 + b1 V + c1 V²   for the pump
Δp = a2 + b2 V + c2 V²   for the pipe network   (1.12)
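A sketch of how such an operating point can be found numerically follows; the coefficient values are illustrative assumptions, not taken from the text:

```python
# Sketch of the forward problem of Eq. 1.12: intersect a pump curve and a
# system curve, each quadratic in the flow rate V. The coefficients are
# illustrative assumptions; the intersection is found by bisection on the
# difference of the two curves, so no numerical library is needed.

def pump(V):      # drooping pump curve: head falls as flow rises
    return 80.0 - 2.0 * V - 0.5 * V**2

def system(V):    # rising system curve: friction losses grow with flow
    return 10.0 + 1.0 * V + 0.3 * V**2

def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f changes sign there."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

V0 = bisect(lambda V: pump(V) - system(V), 0.0, 20.0)
dp0 = pump(V0)    # the operating point (dp0, V0) satisfies both curves
```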
Solving the two equations simultaneously yields the performance condition of the operating point, i.e., the pressure drop and flow rate (Δp0, V0). Note that the numerical values of the model parameters {ai, bi, ci} are known, and that Δp and V are the two variables, while the two equations provide the two constraints. This simple example has obvious extensions to the solution of differential equations where the combined spatial and temporal response is sought.

Fig. 1.14 Example of a forward problem where solving two simultaneous equations, one representing the pump curve and the other the system curve, yields the operating point

In order to ensure accuracy of prediction, the models have tended to become increasingly complex, especially with the advent of powerful and inexpensive computing power. The divide-and-conquer mindset is prevalent in this approach, often with detailed mathematical equations based on scientific laws used to model microelements of the complete system. This approach presumes detailed knowledge not only of the various natural phenomena affecting system behavior but also of the magnitude of the various interactions (e.g., heat and mass transfer coefficients, friction coefficients, etc.). The main advantage of this approach is that the system need not be physically built in order to predict its behavior. Thus, this approach is ideal in the preliminary design and analysis stage and is most often employed as such. Note that incorporating superfluous variables and needless modeling detail does increase computing time and complexity in the numerical resolution; however, if done correctly, it does not usually compromise the accuracy of the solution obtained.

Example 1.5.1 Simulation of a chiller. Consider an example of simulating a chilled water cooling plant consisting of the condenser, compressor, and evaporator, as shown in Fig. 1.15.5 Simple black-box models are used for easier comprehension.

Fig. 1.15 Schematic of the cooling plant for Example 1.5.1, showing the compressor (power draw P), the evaporator (load qe, refrigerant temperature te, inlet water temperature ta), the condenser (heat rejection qc, refrigerant temperature tc, inlet water temperature tb), and the expansion valve

The steady-state cooling capacity qe (in kWt6) and the compressor electric power draw P (in kWe) are functions of the refrigerant evaporator temperature te and the refrigerant condenser temperature tc (both in °C), and are supplied by the equipment manufacturer:

qe = 239.5 + 10.073 te − 0.109 te² − 3.41 tc − 0.00250 tc² − 0.2030 te tc + 0.00820 te² tc + 0.0013 te tc² − 0.000080005 te² tc²   (1.13)
ta
and P =  2:634  0:3081t e  0:00301t 2e þ 1:066t c  0:00528t 2c  0:0011t e t c  0:000306t 2e t c þ0:000567t e t 2c þ 0:0000031t 2e t 2c
ð1:14Þ
Another equation needs to be introduced for the heat rejected at the condenser qc (in kWt). This is simply given by a heat balance on the system (i.e., from the first law of thermodynamics) as:

qc = qe + P   (1.15)
The forward problem would entail determining the unknown values Y = {te, tc, qe, P, qc}. Since there are five unknowns, five equations are needed. In addition to the three equations above, two additional ones are required. They are the heat transfer equations at the evaporator and at the condenser between the refrigerant (assumed to be changing phase, and hence at a constant temperature) and the circulating water:

qe = me cp (ta − te) [1 − exp(−UAe / (me cp))]   (1.16)

and

qc = mc cp (tc − tb) [1 − exp(−UAc / (mc cp))]   (1.17)
where cp is the specific heat of water = 4.186 kJ/kg K.

5 From Stoecker (1989), by permission of McGraw-Hill.
6 kWt denotes that the units correspond to thermal energy, while kWe denotes energy in electric units.

Further, values of the parameters are specified:

• Water flow rate through the evaporator, me = 6.8 kg/s, and through the condenser, mc = 7.6 kg/s
• Thermal conductance of the evaporator, UAe = 30.6 kW/K, and that of the condenser, UAc = 26.5 kW/K
• Inlet water temperature to the evaporator, ta = 10 °C, and that to the condenser, tb = 25 °C

Solving Eqs. 1.13–1.17 results in:

te = 2.84 °C, tc = 34.05 °C, qe = 134.39 kW, and P = 28.34 kW

To summarize, the performance of the various pieces of equipment and their interactions have been represented by mathematical equations, which allow a single solution set to be determined. This is the case of the well-defined forward problem adopted in system simulation and design studies.

There are instances when the same system could be subject to the inverse modeling approach. Consider the case when a cooling plant similar to the one assumed above already exists, and the facility manager wishes to instrument the various components in order to: (i) verify that the system is performing adequately, and (ii) vary some of the operating variables so that the power consumed by the compressor is reduced. In such a case, the numerical model coefficients given in Eqs. 1.13 and 1.14 will be unavailable, and so will the UA values, either because the manufacturer-provided models cannot be found in the documentation or because the equipment has degraded somewhat such that the original models are no longer accurate. Model calibration will involve determining these values from experimental data gathered by appropriately submetering the evaporator, condenser, and compressor on both the refrigerant and the water coolant sides. How best to make these measurements, how accurate the instrumentation should be, what the sampling frequency should be, how long one should monitor, etc. are all issues that fall within the purview of the design of field monitoring. Uncertainty in the measurements, as well as the fact that the assumed models are approximations of reality, will introduce model prediction errors, and so the verification of the actual system against measured performance will have to consider such aspects realistically.
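The forward solution of Example 1.5.1 can be reproduced numerically. Since qe, P, and qc all follow from te and tc, the five equations reduce to two residuals in two unknowns. The sketch below (our own variable names, a plain Newton iteration with a finite-difference Jacobian; not code from the book) solves them:

```python
import math

# Manufacturer performance curves, Eqs. 1.13 and 1.14 (te, tc in deg C)
def q_evap(te, tc):
    return (239.5 + 10.073*te - 0.109*te**2 - 3.41*tc - 0.00250*tc**2
            - 0.2030*te*tc + 0.00820*te**2*tc + 0.0013*te*tc**2
            - 0.000080005*te**2*tc**2)

def power(te, tc):
    return (-2.634 - 0.3081*te - 0.00301*te**2 + 1.066*tc - 0.00528*tc**2
            - 0.0011*te*tc - 0.000306*te**2*tc + 0.000567*te*tc**2
            + 0.0000031*te**2*tc**2)

# Plant parameters from the example
me, mc, cp = 6.8, 7.6, 4.186      # kg/s, kg/s, kJ/kg K
UAe, UAc = 30.6, 26.5             # kW/K
ta, tb = 10.0, 25.0               # inlet water temperatures, deg C

def residuals(te, tc):
    # Heat-exchanger relations, Eqs. 1.16 and 1.17
    qe_hx = me*cp*(ta - te)*(1 - math.exp(-UAe/(me*cp)))
    qc_hx = mc*cp*(tc - tb)*(1 - math.exp(-UAc/(mc*cp)))
    # Both must agree with the performance curves and the heat balance (Eq. 1.15)
    return (q_evap(te, tc) - qe_hx,
            q_evap(te, tc) + power(te, tc) - qc_hx)

# Newton iteration with a finite-difference Jacobian
te, tc = 0.0, 30.0                # initial guess
for _ in range(50):
    f1, f2 = residuals(te, tc)
    if abs(f1) < 1e-9 and abs(f2) < 1e-9:
        break
    h = 1e-6
    f1p, f2p = residuals(te + h, tc)
    a, c = (f1p - f1)/h, (f2p - f2)/h   # df1/dte, df2/dte
    f1q, f2q = residuals(te, tc + h)
    b, d = (f1q - f1)/h, (f2q - f2)/h   # df1/dtc, df2/dtc
    det = a*d - b*c
    te -= ( d*f1 - b*f2) / det          # 2x2 Newton update
    tc -= (-c*f1 + a*f2) / det

print(f"te = {te:.2f} C, tc = {tc:.2f} C, "
      f"qe = {q_evap(te, tc):.2f} kW, P = {power(te, tc):.2f} kW")
```

The iteration converges to the values quoted in the text (te ≈ 2.84 °C, tc ≈ 34.05 °C).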
1.5.3 Inverse Modeling
It is rather difficult to succinctly define inverse problems, since they apply to different classes of problems with applications in diverse areas, each with its own terminology and viewpoints (it is no wonder that the topic suffers from the "blind men and the elephant" syndrome). Generally speaking, inverse problems are those that involve identification of model structure (system identification) and/or estimation of model parameters where the system under study already exists, and one uses measured or observed system behavior to aid in the model building and/or refinement. Different model forms may capture the data trend; this is why some argue that inverse problems can be referred to as "ill-defined" or "ill-posed." Following Eq. 1.1 and Fig. 1.8, inverse problems can be conceptually framed as either:

– Parameter estimation problems: given {Y, U, W, d}, determine {A, B, C}   (1.18)

– Control models: given {Y″} and {A, B, C}, determine {U, W, d}   (1.19)

where Y″ is meant to denote that only limited measurements may be available for the state variables. Typically, one (i) takes measurements of the various parameters (or regressor variables) affecting the output (or response variables) of a device or a phenomenon, (ii) identifies a quantitative correlation between them by regression, and (iii) uses it to make predictions about system behavior under future operating conditions. As shown in Fig. 1.13, one can distinguish between black-box and gray-box approaches in terms of how the quantitative correlation is identified. Table 1.2 summarizes different characteristics of both these approaches and those of the simulation modeling approach.

(i) The black-box approach identifies a simple mathematical function between response and regressor variables assuming that either (i) nothing is known about the
Table 1.2 Characteristics of different types of modeling approaches

Approach            | Model type                                | Time variation of system inputs/outputs | Types of equations  | Physical understanding
Simulation          | White-box (detailed mechanistic)          | Dynamic; quasi-static                   | PDE, ODE, algebraic | High
Mechanistic inverse | Gray-box (semi-physical, reduced-order)   | Dynamic; quasi-static; steady-state     | ODE, algebraic      | Medium
Empirical inverse   | Black-box (curve fit)                     | Static or steady-state                  | Algebraic           | Low

ODE ordinary differential equation, PDE partial differential equation
innards or the inside workings of the system, or (ii) there is very little or only partial understanding of system behavior. The functioning of the system is opaque to the user.

(ii) The gray-box approach uses a physics-based understanding of system structure and functioning to identify a mathematical model and deduce physically relevant system parameters from measured response and input variables. The resulting models are usually reduced-order, i.e., lumped models based on first-order ODEs or algebraic equations. Such system parameters (e.g., overall heat loss coefficient, time constant) can serve to improve our mechanistic understanding of the phenomenon or system behavior. The mechanistic model structure can yield more accurate predictions and provide better control capability. The identification of such models, which combine phenomenological plausibility with mathematical simplicity, generally requires both a good understanding of the physical phenomenon or of the systems/equipment being modeled and competence in statistical methods.

These analysis methods were first proposed several hundred years ago and have seen major improvements over the years (especially during the last hundred), resulting in a rich literature in this area with great diversity of techniques and levels of sophistication. Traditionally, the distinction was made between parametric and nonparametric methods, and these are discussed in Chaps. 5 and 9. With the advent of computing power, resampling methods have been gaining in importance and popularity because they provide (i) additional flexibility in model building and in predictive accuracy, (ii) more robustness in estimating the errors in both model coefficients and predictions, and (iii) the ability to handle much larger sets of regressor variables than traditional parametric modeling. These are discussed in Chap. 5.
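As a concrete black-box illustration of step (ii) above, a quadratic model Δp = a + bV + cV² can be fit to discrete (V, Δp) observations by ordinary least squares via the normal equations. The data and true coefficient values below are hypothetical (Chap. 5 treats regression properly); the point is only that the "function between response and regressor variables" is identified directly from data:

```python
# Hypothetical (V, dp) observations generated from dp = 2.0 + 0.5 V + 0.1 V^2
data = [(v, 2.0 + 0.5*v + 0.1*v**2) for v in [1, 2, 3, 4, 5, 6, 7, 8]]

# Build the 3x3 normal equations (X^T X) beta = X^T y for beta = (a, b, c)
n = 3
A = [[0.0]*n for _ in range(n)]
rhs = [0.0]*n
for v, dp in data:
    x = [1.0, v, v*v]                 # regressor row [1, V, V^2]
    for i in range(n):
        rhs[i] += x[i]*dp
        for j in range(n):
            A[i][j] += x[i]*x[j]

# Solve by Gaussian elimination with partial pivoting
for k in range(n):
    p = max(range(k, n), key=lambda r: abs(A[r][k]))
    A[k], A[p] = A[p], A[k]
    rhs[k], rhs[p] = rhs[p], rhs[k]
    for r in range(k + 1, n):
        f = A[r][k] / A[k][k]
        for j in range(k, n):
            A[r][j] -= f*A[k][j]
        rhs[r] -= f*rhs[k]
beta = [0.0]*n
for i in range(n - 1, -1, -1):
    beta[i] = (rhs[i] - sum(A[i][j]*beta[j] for j in range(i + 1, n))) / A[i][i]

a, b, c = beta
print(f"dp = {a:.3f} + {b:.3f} V + {c:.3f} V^2")
```

With noise-free data the fit recovers the generating coefficients; with real measurements the residuals would carry the measurement uncertainty discussed later.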
The gray-box approach requires context-specific approximate numerical or analytical solutions for linear and nonlinear problems, and often involves model selection as well as parameter estimation. Ill-conditioning, i.e., extreme sensitivity of the solution to the data (see Sect. 9.2), is often due to the repetitive nature of the data collected while the system is under normal operation. There is a rich and diverse body of knowledge on inverse methods applied to physical systems, and numerous textbooks, monographs, and research papers are available on the subject. Chapter 10 addresses these problems at more length.

In summary, different models and parameter estimation techniques need to be adopted depending on whether: (i) the intent is to subsequently predict system behavior within the temporal and/or spatial range of the input variables; in such cases, simple and well-known methods such as curve fitting
Fig. 1.16 Example of a parameter estimation problem where the model parameters of a presumed function of pressure drop versus volume flow rate are identified from discrete experimental data points
may suffice (Fig. 1.16); or (ii) the intent is to subsequently understand, predict, or control system behavior outside the temporal and/or spatial range of the input variables; in such cases, physically based models are generally more appropriate.

Example 1.5.2 Dose–response models. An example of how model-driven methods differ from a straightforward curve fit is given below (the same example is treated in more depth in Sect. 10.2.6). Consider models of the risk to humans exposed to toxins (or biological poisons), which are extremely deadly even in small doses. Dose is the total mass of toxin that the human body ingests. Response is the measurable physiological change in the body produced by the toxin, which can have many manifestations; here, the focus is on human cells becoming cancerous. There are several aspects to this problem relevant to inverse modeling: (i) Can the observed data of dose versus response provide some insight into the process that induces cancer in biological cells? (ii) How valid are these results extrapolated down to low doses? (iii) Since laboratory tests are performed on animal subjects, how valid are these results when extrapolated to humans?

The manner in which one chooses to extrapolate the dose–response curve downward depends either on the assumption one makes regarding the basic process itself or on how one chooses to err (which has policy-making implications). For example, erring too conservatively in terms of risk would overstate the risk and prompt the implementation of more precautionary measures, which some critics would fault as an unjustified and improper use of limited resources. There are no simple answers to these queries (until the basic process itself is completely understood).

There is yet another major issue. Since different humans (and test animals) react differently to the same dose, the response is often interpreted as a probability of cancer being induced, which can be framed as a risk. Probability is bound to play an important role in the nature of the process; hence the adoption by various agencies (such as the U.S. Environmental Protection Agency) of probabilistic methods for risk assessment and modeling.

Figure 1.17 illustrates three methods of extrapolating dose–response curves down to low doses (Heinsohn and Cimbala 2003). The dots represent observed laboratory tests performed at high doses. Three types of models are fit to the data. They all agree at high doses; however, they deviate substantially at low doses because the models are functionally different. While model I is a nonlinear model applicable to highly toxic agents, curve II is generally taken to apply to contaminants that are quite harmless at low doses (i.e., the body is able to metabolize the toxin at low doses). Curve III is intermediate between the other two. The above models are somewhat empirical (or black-box) and provide little understanding of the basic process itself. Models based on simplified but phenomenological considerations of how biological cells become cancerous have also been developed; these are described in Sect. 10.2.6.

Fig. 1.17 Three different inverse models depending on toxin type for extrapolating dose–response observations at high doses to the response at low doses. (From Heinsohn and Cimbala 2003, by permission of CRC Press)

1.5.4 Calibrated Simulation

The calibrated simulation approach can be viewed as a hybrid of the forward and inverse methods (refer to Fig. 1.13). Here one uses a mechanistic model originally developed for the purpose of system simulation, and modifies or "tunes" the numerous model parameters so that model predictions match observed system behavior as closely as possible. Often, only a subset or limited number of measurements of system states and forcing function values are available, resulting in a highly overparameterized problem with more than one possible solution. Following Eq. 1.1, such inverse problems can be framed as:

given {Y″, U″, W″, d″}, determine {A″, B″, C″}   (1.20)

where the ″ notation is used to represent limited measurements or a reduced parameter set.

Example 1.5.1 is a simple simulation or forward model with explicit algebraic equations for each component and no feedback loops. Detailed simulation programs are much more complex (with hundreds of variables, complex interactions, boundary conditions, etc.) involving ODEs or PDEs; one example is computational fluid dynamics (CFD) models for indoor air quality studies. Calibrating such models is extremely difficult given the lack of proper instrumentation to measure detailed spatial and temporal fields, and the inability to conveniently compartmentalize the problem so that the inputs and outputs of sub-blocks can be framed and calibrated individually, as done in the cooling plant example above. Thus, in view of such limitations in the data, developing a simpler system model consistent with the available data, while retaining the underlying mechanistic considerations as far as possible, is a more appealing approach, albeit a challenging one. Such an approach is called the "gray-box" approach, involving inverse models (see Fig. 1.13). An in-depth discussion of calibrated simulation approaches is provided in Sect. 10.6.
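As a toy illustration of such tuning, a single parameter can sometimes be backed out in closed form rather than iteratively. Using the nomenclature of Example 1.5.1, Eq. 1.16 can be inverted for the evaporator conductance UAe from one measured operating point (the numbers below are the example's results, treated here as hypothetical field "measurements"):

```python
import math

# Hypothetical measured operating point (nomenclature of Example 1.5.1)
me, cp = 6.8, 4.186          # kg/s, kJ/kg K
ta, te = 10.0, 2.84          # inlet water and evaporator temperatures, deg C
qe_meas = 134.39             # kW, "measured" evaporator load

# Invert Eq. 1.16: qe = me*cp*(ta - te)*(1 - exp(-UAe/(me*cp)))
eff = qe_meas / (me*cp*(ta - te))     # heat-exchanger effectiveness
UAe = -me*cp*math.log(1 - eff)        # calibrated conductance, kW/K
print(f"calibrated UAe = {UAe:.1f} kW/K")
```

The recovered value is close to the 30.6 kW/K assumed in the forward example; with noisy measurements one would instead estimate UAe by least squares over many operating points.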
1.6 Data Analytic Approaches
Several authors, for example Sprent (1998), distinguish between (i) data-driven models, i.e., those suggested by the data at hand and commensurate with knowledge about system behavior (somewhat akin to our definition of black-box models discussed above), and (ii) model-driven approaches, i.e., those that assume a prespecified model and use the data to determine the model parameters (synonymous with the gray-box inverse methods discussed above). Data-driven or statistical methods have thus traditionally been separated into black-box and gray-box approaches. An alternate view (Fig. 1.18), reflective of current thinking, is to distinguish between traditional stochastic methods and data analytic methods (Breiman 2001 goes so far as to refer to these as different cultures). The traditional methods, consisting of parametric, nonparametric, and resampling methods (which have been introduced earlier), fall under
Fig. 1.18 An overview of different approaches under the broad classification of "traditional stochastic" methods (classical parametric, nonparametric, resampling) and "data analytics" methods (data mining, machine learning, big data)
"stochastic data modeling" and were meant primarily to understand and predict system behavior. In the last few decades, there has been an explosion of other analysis methods, loosely called data analytic methods, which are less based on statistics and probability and more on data exploration and computer-based learning algorithms. The superiority of such algorithmic7 methods is manifest in situations where the patterns in the data are too complex for humans to grasp or learn. The algorithmic approach is better suited for learning, knowledge discovery, or gaining insight into important relationships/correlations (i.e., finding patterns in data), while providing superior predictive capability and control of nonlinear systems under complex situations. Several statisticians (e.g., Breiman 2001) argue the need to emphasize this approach given its diversity of application and its ability to make better use of large data sets in the real world. This book deals largely with the traditional stochastic modeling methods, with only Chap. 11 devoted to data analytic methods.
1.6.1 Data Mining or Knowledge Discovery
Data mining (DM) is defined as the science of extracting useful information from large or enormous data sets; that is why it is also referred to as knowledge discovery. The associated suite of approaches was developed in fields outside statistics. Though DM is based on a range of techniques, from the very simple to the sophisticated (involving such methods as clustering, classification, anomaly detection, etc.), it has the distinguishing feature that it is concerned with sifting through large/enormous amounts of data with no clear aim in mind except to discern hidden information;
7 The terminology follows Breiman (2001), who argues that algorithmic modeling rather than traditional stochastic approaches is much better suited to tackling modern-day problems, which involve complex systems and decision-making behavior based on numerous factors, variables, and large sets of informational data.
discover patterns, associations, and trends; or summarize data behavior (Dunham 2003). Thus, not only does its distinctiveness lie in the data management problems associated with storing and retrieving large amounts of data from perhaps multiple data sets, but also in it being much more exploratory and less formalized in nature than is statistics and model building where one analyzes a relatively small data set with some speciﬁc objective in mind. Data mining has borrowed concepts from several ﬁelds such as multivariate statistics and Bayesian theory, as well as less formalized ones such as machine learning, artiﬁcial intelligence, pattern recognition, and data management so as to bound its own area of study and deﬁne the speciﬁc elements and tools involved. It is the result of the digital age where enormous digital databases abound from the mundane (supermarket transactions, credit card records, telephone calls, Internet postings, etc.) to the very scientiﬁc (astronomical data, medical images, etc.). Thus, the purview of data mining is to explore such databases in order to ﬁnd patterns or characteristics (called data discovery) or even in response to some very general research question not provided by any previous mechanistic understanding of the social or engineering system, so that some action can be taken resulting in a beneﬁt or value to the owner. Data mining techniques are brieﬂy discussed in Chap. 11.
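Clustering, one of the DM techniques mentioned above, can be illustrated with a minimal sketch. The daily energy-use readings below are hypothetical, and a real application would use a library implementation; the point is only that groups emerge from the data with no model specified in advance:

```python
# Hypothetical daily energy-use readings (kWh) drawn from two usage regimes
readings = [10.2, 11.1, 9.8, 10.5, 30.4, 29.8, 31.2, 10.9, 30.9, 29.5]

# Simple one-dimensional k-means with k = 2
centers = [min(readings), max(readings)]          # crude initial guesses
for _ in range(20):
    groups = [[], []]
    for x in readings:                            # assign to nearest center
        i = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        groups[i].append(x)
    new = [sum(g) / len(g) for g in groups]       # recompute centers
    if new == centers:                            # converged
        break
    centers = new

print("cluster centers:", [round(c, 2) for c in centers])
```

Here the algorithm separates the low-use and high-use days and reports their mean consumption levels; nothing about "two regimes" was told to it beforehand.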
1.6.2 Machine Learning or Algorithmic Models
Machine learning (ML), or predictive algorithmic modeling, is the field of study that develops algorithms that computers follow in order to identify and extract patterns from data. Its primary purpose is developing prediction models: accuracy is the main aim, with understanding or explaining the data trends being of secondary concern (Kelleher and Tierney 2018). It has also been defined as the field of study that gives computers the ability to learn without being explicitly programmed. Thus, ML models and algorithms learn to map input to output in an iterative manner with the internal
Fig. 1.19 Examples of some infrastructure-related technological applications of big data from the perspective of different stakeholders (SMI smart metering infrastructure)
structure changing continuously. It is especially useful when the patterns in the data are too complex for traditional statistical analysis methods to handle. The learning of ML algorithms such as neural network models improves with additional data and is adaptive to changes in environmental conditions. ML is one of the important ﬁelds in computer science. Chapter 11 presents some of the important ML algorithms such as neural networks.
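The iterative input-to-output mapping described above can be illustrated with the simplest possible "learner": fitting a linear map y = wx + b by gradient descent on squared error. This is a sketch with hypothetical data, not an excerpt from Chap. 11; neural networks generalize the same idea to many parameters and nonlinear maps:

```python
# Hypothetical training pairs generated from y = 3x + 1
data = [(x, 3.0*x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0                      # initial model parameters
lr = 0.05                            # learning rate
for epoch in range(2000):            # iterative internal-structure updates
    gw = gb = 0.0
    for x, y in data:
        err = (w*x + b) - y          # prediction error on one example
        gw += 2*err*x / len(data)    # gradient of mean squared error w.r.t. w
        gb += 2*err / len(data)      # gradient w.r.t. b
    w -= lr*gw                       # descend along the gradient
    b -= lr*gb

print(f"learned map: y = {w:.3f} x + {b:.3f}")
```

The parameters converge toward the generating values (w ≈ 3, b ≈ 1); feeding additional data points would simply add terms to the gradient sums, which is one sense in which such learners "improve with additional data."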
1.6.3 Introduction to Big Data
One of the major impacts on modern-day society comes from the discipline called big data or data science. It involves harnessing information in novel ways and extracting a certain level of quantification, in terms of trends and behavior, from large data sets. While this knowledge may not enhance fundamental understanding, insight, or wisdom,8 it produces useful actionable insights on goods and services of significant value to businesses and society. "It is meant to inform rather than explain" (Mayer-Schonberger and Cukier 2013, p. 4); in that sense, DM and ML algorithms are at the core of its suite of analysis tools. The basic trait that distinguishes it from statistical learning methods is that it is based on processing huge amounts of heterogeneous multisource data (sensors, videos, Internet searches, social media, surveys, government records, etc.), characterized by variety, large noise, huge volume, velocity (real-time streaming), data fusion from multiple disparate sources, and sensor fusion from multiple sensors. The thinking is that the size of the data set can compensate for the use of simple models and noisier, non-curated data. However, there are some major concerns:
8 "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?" T.S. Eliot (1934).
• Can obscure the long-term view/behavior of phenomena
• Increases the danger of false learning/beliefs and overconfidence
• Raises major issues with privacy and ethics

A whole new set of tools, procedures, and software has been developed for datafication, which is the process of transforming raw data (Internet searches, online purchases, texts, visual and social media content, information from phones, cars, etc.) into a quantifiable format so that it can be tabulated, stored, and analyzed. The value of the data shifts from its primary use to its potential future use. The new profession of the data scientist combines numerous traditional skills: statistics and data mining, computer science and machine learning, software programming, informatics, planning, etc. Numerous textbooks have appeared recently on data science, for example, Kelleher and Tierney (2018).

There is a debate on whether big data processing capabilities, and the trends and behavior extracted therefrom, diminish the value of the domain expert in applications that traditionally fell under the purview of statistical and engineering analysis. Nevertheless, big data offers great promise to reshape entire business sectors and to address several societal and global problems. Some examples from a technological perspective as applied to infrastructure applications are listed below and shown in Fig. 1.19:

– Improving current sensing and data analysis capabilities in individual buildings.
– Be able to analyze the behavior of large numbers of buildings/cities in terms of energy efficiency, and integrate higher renewable energy penetration with the smart grid.
– Be able to analyze the benefits and limitations of engineered infrastructures and routine operational practices on society and different entities.
– Be able to identify gaps in current societal needs, use this knowledge to develop strategies to enhance operational efﬁciency and decarbonization, track the implementation of these strategies, and assess the beneﬁt once implemented. – Be able to provide and enhance routine services as well as emergency response measures.
1.7 Data Analysis

1.7.1 Introduction
Data analysis is not performed just for its own sake; its usefulness lies in the support it provides to such objectives as gaining insight into system behavior, characterizing current system performance against a baseline, deciding whether retrofits and suggested operational changes to the system are warranted, quantifying the uncertainty in predicting the future behavior of the present system, suggesting robust, cost-effective, and risk-averse ways to operate an existing system, avoiding catastrophic system failure, etc. Analysis is often a precursor to decision-making in the real world. There is another discipline with overlapping and complementary aims to those of data analysis and modeling: risk analysis and decision-making. It provides both an overall paradigm and a set of tools with which decision makers can construct and analyze a model of a decision situation (Clemen and Reilly 2001). Even though it does not give specific answers to the problems faced by a person, decision analysis provides structure, guidance, and analytical tools on how to logically and systematically tackle a problem, model uncertainty in different ways, and hopefully arrive at rational decisions in tune with the personal preferences of the individual who has to live with the choice(s) made. While it is applicable to problems without uncertainty but with multiple outcomes, its strength lies in being able to analyze complex multiple-outcome problems that are inherently uncertain or stochastic, compounded with the utility functions or risk preferences of the decision maker. There are different sources of uncertainty in a decision-making process, but the one pertinent to data modeling and analysis in the context of this book is that associated with fairly well-behaved and well-understood engineering systems with relatively low uncertainty in their performance data. This is the reason why, historically, engineering students were not subjected to a class in decision analysis.
However, many engineering systems are operated wherein the attitudes and behavior of people operating these systems assume importance; in such cases, there is a need to adapt
many of the decision analysis tools and concepts with traditional data analysis and modeling techniques. This issue is addressed in Chap. 12.
1.7.2 Basic Stages
In view of the diversity of ﬁelds to which data analysis is applied, an allencompassing deﬁnition would have to be general. One good deﬁnition is: “an evaluation of collected observations so as to extract information useful for a speciﬁc purpose.” The evaluation relies on different mathematical and statistical tools depending on the intent of the investigation. In the area of science, the systematic organization of observational data, such as the orbital movement of the planets, provided a means for Newton to develop his laws of motion. Observational data from deep space allow scientists to develop/reﬁne/verify theories and hypotheses about the structure, relationships, origins, and presence of certain phenomena (such as black holes) in the cosmos. At the other end of the spectrum, data analysis can also be viewed as simply: “the process of systematically applying statistical and logical techniques to describe, summarize, and compare data.” From the perspective of an engineer/scientist, data analysis is a process that when applied to system performance data, collected either intrusively or nonintrusively, allows certain conclusions about the state of the system to be drawn, and, thereby, to initiate followup actions. Studying a problem through the use of statistical data analysis usually involves four basic steps (Arsham 2008): (a) Deﬁning the Problem The context of the problem and the exact deﬁnition of the problem being studied need to be framed. This allows one to design both the data collection system and the subsequent analysis procedures to be followed. (b) Collecting the Data In the past (say, 70 years back), collecting the data was the most difﬁcult part, and was often the bottleneck of data analysis. Nowadays, one is overwhelmed by the large amounts of data resulting from the great strides in sensor and data collection technology; and data cleaning, handling, and summarizing have become major issues. 
Paradoxically, the design of data collection systems has been marginalized by an apparent belief that extensive computation can make up for any deﬁciencies in the design of data collection. Gathering data without a clear deﬁnition of the problem often results in failure or limited success. Data can be
collected from existing sources or obtained through observation and experimental studies designed to obtain new data. In an experimental study, the variable of interest is identiﬁed. Then, one or more factors in the study are controlled so that data can be obtained about how the factors inﬂuence the variables. In observational studies, no attempt is made to control or inﬂuence the variables of interest either intentionally or due to the inability to do so (two examples are surveys and astronomical data). (c) Analyzing the Data There are various statistical and analysis approaches and tools that one can bring to bear depending on the type and complexity of the problem and the type, quality, and completeness of the data available. Several categories of problems encountered in data analysis are shown in Figs. 1.13 and 1.18. Probability is an important aspect of data analysis since it provides a mechanism for measuring, expressing, and analyzing the uncertainties associated with collected data and mathematical models used. This, in turn, impacts the conﬁdence in our analysis results: uncertainty in future system performance predictions, conﬁdence level in our conﬁrmatory conclusions, uncertainty in the validity of the action proposed, etc. The majority of the topics addressed in this book pertain to this category. (d) Reporting the Results The ﬁnal step in any data analysis effort involves preparing a report. This is the written document that logically describes all the pertinent stages of the work, presents the data collected, discusses the analysis results, states the conclusions reached, and recommends further action speciﬁc to the issues of the problem identiﬁed at the onset. The ﬁnal report and any technical papers resulting from it are the only documents that survive over time and are invaluable to other professionals. Unfortunately, the task of reporting is often cursory and not given its due importance. 
The term "intelligent" data analysis has been used with a connotation different from the traditional one (Berthold and Hand 2003). The term is used not in the sense of increasing the knowledge/intelligence of the user or analyst in applying traditional tools, but in the sense that the statistical tools themselves have some measure of intelligence built into them. A simple example is when a regression model has to be identified from data. Software packages and programmable platforms are available that allow hundreds of built-in functions to be evaluated and a prioritized list of models to be identified. The recent evolution of computer-intensive methods (such as bootstrapping and Monte Carlo methods) along with soft computing algorithms (such as artificial neural networks, genetic algorithms, etc.) enhances the
computational power and capability of traditional statistics, model estimation, and data analysis methods. Such added capabilities of modernday computers and the sophisticated manner in which the software programs are written allow “intelligent” data analysis to be performed.
1.7.3 Example of a Data Collection and Analysis System
Data can be separated into experimental or observational depending on whether the system operation can be modiﬁed by the observer or not. Consider a system where the initial phase of designing and installing the monitoring system is complete. Figure 1.20 is a ﬂowchart depicting various stages in the collection, analysis, and interpretation of data collected from an engineering thermal9 system while in operation. The various elements involved are: (a) A measurement system consisting of various sensors of prespeciﬁed types and accuracy. The proper location, commissioning, and maintenance of these sensors are important aspects of this element. (b) The data sampling element whereby the output of the various sensors is read at a predetermined frequency. The low cost of automated data collection has led to increasingly higher sampling rates. Typical frequencies for thermal systems are in the range of 1 s–1 min. (c) The cleaning of raw data for spikes, gross errors, misrecordings, and missing or dead channels; average (or sum) the data samples and, if necessary, store them in a dynamic fashion (i.e., online) in a central electronic database with an electronic time stamp. (d) The averaging of raw data and storing in a database; typical periods are in the range of 1–30 min. One can also include some ﬁner checks for data quality by ﬂagging data when they exceed physically stipulated ranges. This process need not be done online but could be initiated automatically and periodically, say, every day. It is this data set that is queried as necessary for subsequent analysis. (e) The above steps in the data collection process are performed on a routine basis. This data can be used to advantage provided one can frame the issues relevant to the client and determine which of these can be satisﬁed. 
Examples of such routine uses are assessing overall time-averaged system efficiencies and preparing weekly performance reports, as well as subtler actions such as supervisory control and automated fault detection.
⁹ Electrical systems have different considerations since they mostly use very high-frequency sampling rates.
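Steps (c) and (d) above — dropping missing or out-of-range raw samples and averaging the survivors into storage intervals — can be sketched in a few lines. The range limits and the 15-min bin width below are illustrative assumptions, not values prescribed in the text.

```python
# Sketch of steps (c)-(d): clean 1-min raw samples against a physically
# plausible range, then average the surviving samples into 15-min bins.
# Range limits and bin width are illustrative assumptions.

def clean_and_average(samples, lo, hi, bin_size=15):
    """samples: list of (minute_index, value) pairs at 1-min spacing."""
    bins = {}
    for minute, value in samples:
        if value is None or not (lo <= value <= hi):
            continue  # flag/drop missing or out-of-range readings
        bins.setdefault(minute // bin_size, []).append(value)
    # average each bin; bins with no valid data are simply absent
    return {b: sum(v) / len(v) for b, v in bins.items()}

# Example: a hypothetical supply-water temperature channel with one spike
# and one dead-channel reading over 30 min of 1-min samples
raw = [(m, 60.0 + 0.1 * m) for m in range(30)]
raw[5] = (5, 999.0)   # spike
raw[12] = (12, None)  # dead channel
avg = clean_and_average(raw, lo=0.0, hi=120.0)
```

In practice this stage would run against the time-stamped database described in step (c), but the logic is the same.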
[Fig. 1.20 flowchart stages: Measurement System Design → System Monitoring → Data Sampling (1 s–1 min) → Clean and Store Raw Data (initial cleaning and flagging of missing, misrecorded, and dead channels; gross error detection; removal of spikes) → Average and Store Data (1–30 min) → Define Issue to be Analyzed (formulate client's intention as an engineering problem; determine analysis approach and data needed) → Extract Data Subset for Intended Analyses (data transformation, data filtering, outlier detection, data validation) → Perform Engineering Analysis (statistical inference, identifying patterns in data, regression analysis, parameter estimation, system identification) → Perform Decision Analysis (data adequate for sound decision? is prior presumption correct? how to improve operation and/or efficiency? which risk-averse strategy to select? how to react to catastrophic risk?) → Present Decision to Client → End; with feedback loops to perform additional analyses or to redesign and take additional measurements]
Fig. 1.20 Flowchart depicting various stages in data analysis and decisionmaking as applied to continuous monitoring of thermal systems
Table 1.3 Analysis methods covered in this book

Chapter  Topic
1    Introduction to mathematical models and description of different classes of analysis approaches
2    Probability and statistics, important probability distributions, Bayesian statistics
3    Data collection, exploratory data analysis, measurement uncertainty, and propagation of errors
4    Inferential statistics, nonparametric tests, Bayesian, sampling and resampling methods
5    Linear ordinary least squares (OLS) regression, residual analysis, point and interval estimation, resampling methods
6    Design of experiments (factorial and block, response surface, context of computer simulations)
7    Traditional optimization methods; linear, nonlinear, and dynamic programming
8    Time series analysis, trend and seasonal models, stochastic methods (ARIMA), quality control
9    Advanced regression (parametric, nonparametric, collinearity, nonlinear, and neural networks)
10   Parameter estimation of static and dynamic gray-box models, calibration of simulation models
11   Data analytics, unsupervised learning (clustering), supervised (classification)
12   Decision-making, risk analysis, and sustainability assessment methods
(f) Occasionally, the owner would like to evaluate major changes such as equipment changeout or the addition of new equipment, or would like to improve overall system performance or reliability without knowing exactly how to achieve this. Alternatively, one may wish to evaluate system performance under an exceptionally hot spell lasting several days. This is when specialized consultants are brought in to make recommendations to the owner. Historically, such analysis was based on the professional expertise of the consultant with minimal or no measurements of the actual system. However, both the financial institutions that lend the money for implementing these changes and the upper management of the company owning the system now insist on a more transparent engineering analysis based on actual data. Hence, the preliminary steps involving relevant data extraction and more careful data proofing and validation are essential.
(g) Extracted data are then subjected to engineering analyses that can be collectively referred to as applied data modeling and analysis. These involve statistical inference, identifying patterns in the data, regression analysis, parameter estimation, performance extrapolation, classification or clustering, deterministic modeling, etc.
(h) Performing a decision analysis, in our context, involves using the results of the engineering analyses and adding a further layer of analysis that includes modeling uncertainties (involving, among other issues, a sensitivity analysis), modeling stakeholder preferences, and structuring decisions. Several iterations may be necessary between this element and those involving engineering analysis and data extraction.
(i) The presentation of the various choices suggested by the decision analysis to the owner or decision maker so that a final course of action may be determined. Sometimes, it may be necessary to perform additional analyses or even modify or enhance the capabilities of the measurement system in order to satisfy client needs.
1.8 Topics Covered in Book
The overall structure of the book is depicted in Table 1.3 along with a simple suggestion as to how this book could be used for two courses if desired. This chapter has provided a general introduction to mathematical models and discussed the different types of problems and analysis tools available for data-driven modeling and analysis. Chapter 2 reviews basic probability concepts (both classical and Bayesian) and covers various important probability distributions with emphasis on their practical usefulness. Chapter 3 covers data collection and proofing, along with exploratory data analysis and descriptive statistics. The latter entails performing "numerical detective work" on the data and developing methods for screening, organizing, summarizing, and detecting basic trends in the data (such as graphs and tables), which help in information gathering and knowledge generation. Historically, formal statisticians shied away from exploratory data analysis, considering it either too simple to warrant serious discussion or too ad hoc in nature to be expounded in logical steps (McNeil 1977). This area had to await the pioneering work of John Tukey and others to obtain a formal structure. A brief overview is provided in this book, and the interested reader can refer to Hoaglin et al. (1983) or Tukey (1988) for an excellent perspective. The concepts of measurement uncertainty and propagation of errors are also addressed, and relevant equations provided. Chapter 4 covers statistical inference involving hypothesis testing of single-sample and multi-sample
parametric tests (such as analysis of variance [ANOVA]) involving univariate and multivariate samples. Inferential problems are those that involve making uncertainty inferences or calculating confidence intervals of population estimates from selected samples. These methods are the backbone of classical statistics from which other approaches evolved (covered in Chaps. 5, 6, and 9). Nonparametric tests and Bayesian inference methods have also proven useful in certain cases, and these approaches are covered as well. The advent of computers has led to the very popular sampling and resampling methods that reuse the available sample multiple times, resulting in more intuitive, versatile, and robust point and interval estimation. Chapter 5 deals with inferential statistics applicable to linear regression situations, perhaps the most prevalent application. In essence, the regression problem involves (i) taking measurements of the various parameters (or regressor variables) and of the output (or response variables) of a device/system or a phenomenon, (ii) identifying a causal quantitative correlation between them by regression, (iii) estimating the model coefficients/parameters, and (iv) using the model to make predictions about system behavior under future operating conditions. When a regression model is identified from data, the data cannot be considered to include the entire "population," i.e., all the observations one could possibly conceive. Hence, model parameters and model predictions suffer from uncertainty, which falls under the purview of inferential statistics. There is a rich literature in this area, called "model building," with great diversity of techniques and levels of sophistication. Traditional regression methods using ordinary least squares (OLS) for univariate and multivariate linear problems, along with advanced parametric and nonparametric methods, are covered.
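As a minimal numerical illustration of steps (i)–(iv) — the data here are invented, not from the text — the OLS slope and intercept of a single-regressor linear model follow directly from the sample sums:

```python
# Minimal OLS sketch for a one-regressor model y = b0 + b1*x (invented data).
def ols_fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx          # slope = Sxy / Sxx
    b0 = ybar - b1 * xbar   # intercept from the point of means
    return b0, b1

# Noise-free check: data generated from y = 2 + 3x should be recovered
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = ols_fit(x, y)
```

With measured (noisy) data the same estimator applies; the uncertainty of b0 and b1 is then the inferential-statistics question Chap. 5 addresses.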
Residual analysis and the detection of leverage and influential points are also discussed, along with simple remedial measures one could take if the residual behavior is improper. Resampling methods applied in a regression context are also presented. Chapter 6 covers experimental design methods and discusses factorial and response surface methods that allow extending hypothesis testing to multiple variables as well as identifying sound performance models. "Design of experiments" (DOE) is the process of prescribing the exact manner in which samples for testing are to be selected, and the conditions and sequence under which the testing is to be performed, such that the relationship or model between a response variable and a set of regressor variables can be identified in a robust and accurate manner. The extension of traditional DOE approaches to computer simulation-based design of energy-efficient buildings involving numerous design variables is also discussed. The material in these five chapters (Chaps. 2, 3, 4, 5, and 6) is generally covered in undergraduate statistics
and probability courses and can be used for that purpose. It can also serve as review or refresher material (especially useful to the general practitioner) for a second course meant to cover more advanced concepts and statistical techniques and to better prepare graduate students and energy researchers. Chapter 7 reviews various traditional optimization methods, separated into analytical (such as the Lagrange multiplier method) and univariate and multivariate numerical methods. Linear programming problems (network models and mixed-integer problems) as well as nonlinear programming problems are covered. Optimization is at the heart of numerous data analysis situations (including regression model building) and is essential to current societal problems such as the design and future planning of energy-efficient and resilient infrastructure systems. Time series analysis methods, treated in Chap. 8, are a set of tools that include traditional model-building techniques as well as those that capture the sequential behavior of the data and its noise. They involve the analysis, interpretation, and manipulation of time series signals in either the time domain or the frequency domain. Several methods to smooth time series data in the time domain are presented. Forecasting models based on OLS modeling and the more sophisticated class of stochastic models (such as autoregressive and moving average methods) suitable for linear dynamic systems are discussed. An overview is also provided of control chart techniques extensively used for process control and condition monitoring. Chapter 9 deals with subtler and more advanced topics related to parametric and nonparametric regression analysis. The dangers of collinearity among regressors during multivariate regression are addressed, and ways to minimize such effects (such as principal component analysis, ridge regression, and shrinkage methods) are discussed.
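The shrinkage idea just mentioned can be shown in its simplest one-variable form (the data and penalty value below are invented for illustration): for a centered regressor, ridge regression simply adds a penalty λ to the denominator of the OLS slope, shrinking the estimate toward zero.

```python
# Ridge shrinkage for a single centered regressor (invented data):
# OLS slope = Sxy/Sxx; ridge slope = Sxy/(Sxx + lam), shrunk toward zero.
def ridge_slope(x, y, lam):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / (sxx + lam)

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.2, 4.9, 7.1, 9.0]
b_ols = ridge_slope(x, y, lam=0.0)    # ordinary least squares slope
b_ridge = ridge_slope(x, y, lam=5.0)  # smaller in magnitude than b_ols
```

In the multivariate case the same penalty is added to the diagonal of the normal-equation matrix, which is what stabilizes the estimates when regressors are collinear.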
An overarching class of models, namely generalized linear models (GLM), is introduced that combines in a unified framework both strictly linear models and nonlinear models that can be transformed into linear ones. Next, parameter estimation of intrinsically nonlinear models as well as nonparametric estimation methods involving smoothing and regression splines and the use of kernel functions are covered. Finally, the multilayer perceptron (MLP) neural network model is discussed. Inverse modeling is an approach to data analysis that combines the basic physics of the process with statistical methods so as to achieve a better understanding of the system dynamics and thereby predict system performance. Chapter 10 presents an overview of the types of problems that fall under inverse estimation methods applied to structural models: (a) static gray-box models involving algebraic equations, (b) dynamic gray-box models involving differential equations, and (c) the calibration of white-box
models involving detailed simulation programs. The concept of the information content of collected data and associated quantitative metrics are also introduced. Chapter 11 deals with data analytic methods, which include data mining and machine learning. These are directly concerned with practical applications (discerning hidden information; discovering patterns, associations, and trends; or summarizing data behavior) through data exploration and computer-based learning algorithms. The problems can be broadly divided into two categories: unsupervised learning approaches (such as clustering methods) and supervised learning approaches (such as classification methods). Classification problems are those where one would like to develop a model to statistically distinguish or "discriminate" differences between two or more groups when one knows beforehand that such groupings exist in the data set provided, and subsequently to assign, allocate, or classify a future unclassified observation into a specific group with the smallest probability of error. During clustering, the number of clusters or groups is not known beforehand (thus, a more difficult problem), and the intent is to allocate a set of observations into groups that are similar or "close" to one another. Several important subtypes of both these categories are presented and discussed. Decision theory is the study of methods for arriving at "rational" decisions under uncertainty. Chapter 12 covers this topic (including risk analysis) and further provides an introduction to sustainability assessment methods, which have assumed central importance in recent years. An overview of quantitative decision-making methods is followed by the suggestion that this area be divided into single- and multicriteria methods and further separated into single-discipline and multidiscipline applications that usually involve inconsistent attribute scales.
How the decision-making process is actually applied to the selection of the most appropriate course of action for engineered systems is discussed, and various relevant concepts are introduced. A general introduction to sustainability and the importance of assessments in this area are provided. The two primary sustainability assessment frameworks, namely the structure-based and the performance-based, are described along with their various subcategories and analysis procedures; illustrative case study examples are also provided.
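The unsupervised flavor of the Chap. 11 material can be made concrete with a tiny one-dimensional k-means pass (the data, starting centers, and cluster count are invented): no group labels are given, yet the algorithm discovers the grouping on its own.

```python
# Tiny 1-D k-means sketch (k=2, invented data): unsupervised learning,
# since no group memberships are known beforehand.
def kmeans_1d(data, c1, c2, iters=10):
    for _ in range(iters):
        # assign each point to the nearer of the two current centers
        g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        # recompute centers as group means
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return c1, c2

# Two obvious groups, around 1 and around 10
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
c1, c2 = kmeans_1d(data, c1=0.0, c2=5.0)
```

A classification (supervised) version of the same problem would instead be handed the labeled groups and asked only to assign a new observation to one of them.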
Problems

Pr. 1.1 Identify which of the following equations are linear functional models, which are linear in their parameters (a, b, c), and which are both:
(a) y = a + bx + cx^2
(b) y = a + bx + c/x^2
(c) y = a + b(x - 1) + c(x - 1)^2
(d) y = (a0 + b0·x1 + c0·x1^2) + (a1 + b1·x1 + c1·x1^2)·x2
(e) y = a + b·sin(c + x)
(f) y = a + b·sin(cx)
(g) y = a + b·x^c
(h) y = a + b·x^1.5
(i) y = a + b·e^x
Pr. 1.2 Consider the equation for pressure drop in a pipe given by Eq. 1.2.
(a) Recast the equation such that it expresses the fluid volume flow rate (rather than velocity) in terms of pressure drop and other quantities.
(b) Draw a block diagram to represent the case when feedback control is used to control the flow rate from the measured pressure drop.

Pr. 1.3 Consider Eq. 1.5, which is a lumped model of a fully mixed hot water storage tank. Assume the initial temperature is Ts,initial = 60 °C while the ambient temperature is constant at 20 °C.
(i) Deduce the expression for the time constant of the tank in terms of model parameters.
(ii) Compute its numerical value when Mcp = 9.0 MJ/°C and UA = 0.833 kW/°C.
(iii) What will the storage tank temperature be after 6 h under cooldown (with P = 0)?
(iv) How long will the tank temperature take to drop to 40 °C under cooldown?
(v) Derive the solution for the transient response of the storage tank under electric power input P.
(vi) If P = 50 kW, calculate and plot the response when the tank is initially at 30 °C (akin to Fig. 1.12).

Pr. 1.4 Consider Fig. 1.9, where a heated sphere is being cooled. The analysis simplifies considerably if the sphere can be modeled as a lumped one. This can be done if the Biot number Bi = h·Le/k < 0.1. Assume that the external heat transfer coefficient is h = 10 W/m^2·°C and that the radius of the sphere is 15 cm. The equivalent length of the sphere is Le = Volume/Surface area. Determine whether the lumped model assumption is appropriate for spheres made of the following materials:
(a) Steel with thermal conductivity k = 34 W/m·°C
(b) Copper with thermal conductivity k = 340 W/m·°C
(c) Wood with thermal conductivity k = 0.15 W/m·°C
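The lumped-capacitance criterion in Pr. 1.4 lends itself to a quick script (for a sphere, Le = Volume/Surface area = r/3 = 0.05 m here); this is a sketch of the check, not a full solution write-up.

```python
# Biot number check for Pr. 1.4: Bi = h*Le/k < 0.1 justifies a lumped model.
h = 10.0          # external heat transfer coefficient, W/m2-degC
Le = 0.15 / 3.0   # sphere equivalent length: volume/surface area = r/3, m

def biot(k):
    return h * Le / k

materials = {"steel": 34.0, "copper": 340.0, "wood": 0.15}
lumped_ok = {name: biot(k) < 0.1 for name, k in materials.items()}
# high-conductivity metals pass the criterion; wood (Bi > 1) does not
```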
Fig. 1.21 Steady-state heat flow through a composite wall of surface area A made up of three layers in series. (a) Sketch. (b) Electrical resistance analog (Pr. 1.6). (From Reddy et al. 2016)
Pr. 1.5 The thermal network representation of a homogeneous plane wall is illustrated in Fig. 1.10. Draw the 3R2C network representation and derive expressions for the three resistors and the two capacitors in terms of the two air film coefficients and the wall properties (Hint: follow the approach illustrated in Fig. 1.10 for the 2R1C network).

Pr. 1.6 Consider the composite wall of surface area A shown in Fig. 1.21 with four interfaces designated (1, 2, 3, 4). It consists of three different materials (A, B, C), each with its thermal conductivity k and thickness Δx. The outer surface temperatures of materials A and C are T1 and T4, respectively, while the steady-state heat flow rate is Q.
(a) Write the equation for steady-state heat flow through this wall. This is the forward application. When would one use this equation?
(b) This equation can also be used for inverse modeling applications. Give two practical instances when this applies. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.7 A building with a floor area of A = 1000 m^2 and inside wall height of h = 3 m has an air infiltration (i.e., leakage) rate of 0.4 air changes per hour (Note: one air change is equal to the volume of the room). The outdoor temperature is To = 2 °C and the indoor temperature is Ti = 22 °C.
(a) Write the equation for the rate of heat input Q that must be provided by the building heating system to warm the cold outside air, assuming the location to be at sea level.
(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.8 Consider a situation where a water pump is to deliver a quantity of water F = 500 L/s from a well of depth d = 60 m to the top of a building of height h = 30 m. The friction pressure drop through the pipe is Δp = 30 kPa. The pump-motor efficiency is ηp = 60%.
(a) Write the equation for electric power consumption in terms of the various quantities specified. This is the forward situation. Under what situations would this equation be used?
(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.

Pr. 1.9 Two pumps in parallel viewed from the forward and the inverse perspectives
Consider Fig. 1.22, which will be analyzed in both the forward and data-driven approaches.
(a) Forward problem:¹⁰ Two pumps with parallel networks deliver F = 0.01 m^3/s of water from a reservoir to the destination. The pressure drops in pascals (Pa) of each network are given by Δp1 = (2.1 × 10^10)·F1^2 and Δp2 = (3.6 × 10^10)·F2^2, where F1 and F2 are the flow rates through each branch in m^3/s. Assume that

¹⁰ From Stoecker (1989), by permission of McGraw-Hill.
[Fig. 1.22 Pumping system with two pumps in parallel (Pr. 1.9): total flow F = 0.01 m^3/s splits into branch flows F1 and F2, with ΔP1 = (2.1 × 10^10)·F1^2 and ΔP2 = (3.6 × 10^10)·F2^2]
[Fig. 1.23 Perspective of the forward problem for the lake contamination situation (Pr. 1.10): incoming stream Qs = 5.0 m^3/s, Cs = 10.0 mg/L, with a contaminated outfall]
pumps and their motor assemblies have the same efficiency. Let P1 and P2 be the electric power in watts (W) consumed by the two pump-motor assemblies.
(i) Sketch the block diagram for this system with total electric power as the output variable.
(ii) Frame the total power P as the objective function to be minimized against total delivered water F.
(iii) Solve the problem for F1 and P.
(b) Inverse problem: Now consider the same system in the inverse framework, where one would instrument the existing system such that operational measurements of P for different F1 and F2 are available.
(i) Frame the function appropriately using insights into the functional form provided by the forward model.
(ii) The simplifying assumption of constant pump efficiency is unrealistic. How would the above function need to be reformulated if efficiency can be taken to be a quadratic polynomial (or black-box model) of flow rate, as shown below for the first piping branch (with a similar expression applying for the second branch): η1 = a1 + b1·F1 + c1·F1^2.
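For part (a), the constrained minimization can be sketched numerically. Since both pump-motor assemblies are assumed equally efficient, the constant efficiency cancels out of the optimal split, so it is taken as unity below; the grid-search resolution is an arbitrary choice.

```python
# Pr. 1.9(a) sketch: minimize pumping power P ~ dp1*F1 + dp2*F2 subject to
# F1 + F2 = F, with dp1 = 2.1e10*F1^2 and dp2 = 3.6e10*F2^2 (Pa).
# A simple grid search suffices for this one-dimensional problem.
F = 0.01  # total delivered flow, m3/s

def power(F1):
    F2 = F - F1
    return 2.1e10 * F1**3 + 3.6e10 * F2**3  # W, per unit efficiency

F1_opt = min((i * F / 100000 for i in range(1, 100000)), key=power)
# Analytically, setting dP/dF1 = 0 gives F1/F2 = sqrt(3.6/2.1) ≈ 1.309
```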
[Fig. 1.23 (Pr. 1.10), additional data: outfall Qw = 0.5 m^3/s, Cw = 100.0 mg/L; lake V = 10.0 × 10^6 m^3, k = 0.20/day, C = ?; outgoing stream Qm = ? m^3/s, Cm = ? mg/L]
Pr. 1.10 Lake contamination viewed from the forward and the inverse perspectives
A lake of volume V is fed by an incoming stream with volumetric flow rate Qs contaminated with concentration Cs¹¹ (Fig. 1.23). The outfall of another source (say, the sewage from a factory) also discharges a flow Qw of the same pollutant with concentration Cw. The wastes in the stream and sewage have a decay coefficient k.
(a) Let us consider the forward model approach. In order to simplify the problem, the lake will be considered to be a fully mixed compartment, and evaporation and seepage losses to the lake bottom will be neglected. In such a case, the concentration of the outflow is equal to that in the lake, i.e., Cm = C. Then, the steady-state concentration in the lake can be determined quite simply:

Input rate = Output rate + Decay rate

where Input rate = QsCs + QwCw, Output rate = QmCm = (Qs + Qw)Cm, and Decay rate = kCV.¹²

This results in: C = (QsCs + QwCw) / (Qs + Qw + kV)

¹¹ From Masters and Ela (2008), by permission of Pearson Education.
¹² This term is the first-order Taylor series approximation of exp(kCVt), where t = time.
Verify the above-derived expression, and also check that C = 3.5 mg/L when the numerical values for the various quantities given in Fig. 1.23 are used.
(b) Now consider the inverse control problem, where an actual situation can be generally represented by the model treated above. One can envision several scenarios; let us consider a simple one. Flora and fauna downstream of the lake have been found to be adversely affected, and an environmental agency would like to investigate this situation by installing appropriate instrumentation. The agency believes that the factory is polluting the lake, which the factory owner disputes. Since it is rather difficult to get a good reading of spatially averaged concentrations in the lake, the experimental procedure involves measuring the cross-sectionally averaged concentrations and volumetric flow rates of the incoming, outgoing, and outfall streams.
(i) Using the above model, describe the agency's thought process whereby they would conclude that the factory is indeed the major cause of the pollution.
(ii) Identify arguments that the factory owner can raise to rebut the agency's findings.
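The numerical check requested in part (a) can be scripted directly; the one subtlety is converting k from per day to per second so that the kV term has the same units (m^3/s) as the flow rates.

```python
# Pr. 1.10(a) check: steady-state lake concentration
# C = (Qs*Cs + Qw*Cw) / (Qs + Qw + k*V), with consistent units.
Qs, Cs = 5.0, 10.0     # incoming stream: m3/s, mg/L
Qw, Cw = 0.5, 100.0    # outfall: m3/s, mg/L
V = 10.0e6             # lake volume, m3
k = 0.20 / 86400.0     # decay coefficient, 1/day -> 1/s

C = (Qs * Cs + Qw * Cw) / (Qs + Qw + k * V)  # mg/L
```

Here kV ≈ 23.1 m^3/s, so the denominator is 28.6 and C works out to about 3.5 mg/L, as the problem statement asks you to confirm.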
References

Arsham, H., http://home.ubalt.edu/ntsbarsh/statdata/Topics.htm, downloaded August 2008.
Berthold, M. and D.J. Hand (eds.), 2003. Intelligent Data Analysis, 2nd Edition, Springer, Berlin.
Breiman, L., 2001. Statistical modeling: The two cultures, Statistical Science, vol. 16, no. 3, pp. 199–231.
Cha, P.D., J.J. Rosenberg and C.L. Dym, 2000. Fundamentals of Modeling and Analyzing Engineering Systems, 2nd Ed., Cambridge University Press, Cambridge, UK.
Clemen, R.T. and T. Reilly, 2001. Making Hard Decisions with Decision Tools, Brooks Cole, Duxbury, Pacific Grove, CA.
Doebelin, E.O., 1995. Engineering Experimentation: Planning, Execution and Reporting, McGraw-Hill, New York.
Dunham, M., 2003. Data Mining: Introductory and Advanced Topics, Pearson Education Inc.
Edwards, C.H. and D.E. Penney, 1996. Differential Equations and Boundary Value Problems, Prentice Hall, Englewood Cliffs, NJ.
Eisen, M., 1988. Mathematical Methods and Models in the Biological Sciences, Prentice Hall, Englewood Cliffs, NJ.
Energy Plus, 2009. Energy Plus Building Energy Simulation software, developed by the National Renewable Energy Laboratory (NREL) for the U.S. Department of Energy, under the Building Technologies program, Washington, DC, USA. http://www.nrel.gov/buildings/energy_analysis.html#energyplus.
Haimes, Y.Y., 1998. Risk Modeling, Assessment and Management, John Wiley and Sons, New York.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey, 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York.
Hodges, J.L. and E.L. Lehmann, 1970. Basic Concepts of Probability and Statistics, 2nd Edition, Holden-Day.
Kelleher, J.D. and B. Tierney, 2018. Data Science, MIT Press, Cambridge, MA.
Masters, G.M. and W.P. Ela, 2008. Introduction to Environmental Engineering and Science, 3rd Ed., Prentice Hall, Englewood Cliffs, NJ.
Mayer-Schönberger, V. and K. Cukier, 2013. Big Data, John Murray, London, UK.
McNeil, D.R., 1977. Interactive Data Analysis, John Wiley and Sons, New York.
Reddy, T.A., 2006. Literature review on calibration of building energy simulation programs: Uses, problems, procedures, uncertainty and tools, ASHRAE Transactions, 112(1), January.
Reddy, T.A., J.F. Kreider, P. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings: Principles and Practice of Energy Efficient Design, 3rd Edition, CRC Press, Boca Raton, FL.
Sprent, P., 1998. Data Driven Statistical Methods, Chapman and Hall, London.
Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Streed, E.R., J.E. Hill, W.C. Thomas, A.G. Dawson and B.D. Wood, 1979. Results and Analysis of a Round Robin Test Program for Liquid-Heating Flat-Plate Solar Collectors, Solar Energy, 22, p. 235.
Stubberud, A., I. Williams, and J. DiStefano, 1994. Outline of Feedback and Control Systems, Schaum Series, McGraw-Hill.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (Editor), Wadsworth and Brookes/Cole Advanced Books and Software, Pacific Grove, CA.
2 Probability Concepts and Probability Distributions
Abstract
This chapter reviews basic notions of probability (or stochastic variability), which is the formal study of the laws of chance, i.e., situations where the ambiguity in outcome is inherent in the nature of the process itself. Probability theory allows the idealized behavior of different types of systems to be modeled and provides the mathematical underpinning of statistical inference. In that respect, it can be viewed as pertaining to the forward modeling domain. Both primary views of probability, namely the frequentist (or classical) and the Bayesian, are covered, and a discussion is provided of the difference between probability and statistics. The basic laws of probability are presented, followed by an introductory treatment of set theory nomenclature and algebra. Relevant concepts of random variables, namely density functions, moment-generating attributes, and transformation of variables, are also covered. Next, some of the important discrete and continuous probability distributions are presented along with a discussion of their genealogy, their mathematical form, and their application areas. Subsequently, Bayes' theorem is derived, and how it provides a framework to include prior knowledge in multistage tests is illustrated using examples involving forward and reverse tree diagrams. Finally, the three kinds of empirical probabilities, namely absolute, relative, and subjective, are described with illustrative examples.
2.1 Introduction

2.1.1 Classical Concept of Probability
Random data by its very nature is indeterminate. So how can a scientific theory attempt to deal with indeterminacy? Probability theory does just that and is based on the fact that though the result of any particular trial or experiment or event cannot be predicted, a long sequence of performances taken together reveals a stability that can serve as the basis for fairly precise predictions. Consider the case when an experiment is carried out several times and the anticipated event E occurs in some of them. Relative frequency is the ratio denoting the fraction of trials in which event E occurs. It is usually estimated empirically after the event as:

p(E) = (number of times E occurred) / (number of times the experiment was carried out)    (2.1)
For certain simpler events, one can determine this proportion without actually carrying out the experiment; this is referred to as being wise before the event. For example, the relative frequency of getting heads (selected as a "success" event) when tossing a fair coin is 0.5. In any case, this a priori proportion is interpreted as the long-run relative frequency and is referred to as the probability of event E occurring. This is the classical or frequentist or traditionalist definition of probability, on which probability theory is founded. This interpretation arises from the strong law of large numbers (a well-known result in probability theory), which states that the average of a sequence of independent random variables having the same distribution will converge to the mean of that distribution. If a six-faced die is rolled, the probability of getting a preselected number between 1 and 6 (say, 4) will vary from event to event, but the long-run average will tend toward 1/6. The classical probability concepts are often described or explained in terms of dice-tossing or coin-flipping or card-playing outcomes since these are intuitive and simple to comprehend, but their applicability is much wider and extends to all sorts of problems, as will become evident in this chapter.
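The long-run convergence of relative frequency toward the a priori probability can be demonstrated with a short simulation (the trial count and seed are arbitrary choices):

```python
import random

# Long-run relative frequency of rolling a 4 with a fair six-faced die:
# by the strong law of large numbers, it converges toward 1/6 ≈ 0.1667.
random.seed(1)
n = 100_000
hits = sum(1 for _ in range(n) if random.randint(1, 6) == 4)
freq = hits / n  # empirical estimate of p(E) per Eq. 2.1
```

Rerunning with larger n tightens the estimate around 1/6, which is exactly the frequentist interpretation described above.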
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/9783031348693_2
2.1.2 Bayesian Viewpoint of Probability
The classical or traditional or objective probability concepts are associated with the frequentist view of probability, i.e., interpreting probability as the long-run frequency. This has a nice intuitive interpretation, hence its appeal. However, people have argued that many processes are unique events that do not occur repeatedly, thereby questioning the validity of the frequentist or objective probability viewpoint. Further, even when one may have some basic preliminary idea of the probability associated with a certain event, the classical view excludes such subjective insights in the determination of probability. The Bayesian approach, however, recognizes such issues by allowing one to update assessments of probability that integrate prior knowledge with observed events, thereby allowing better conclusions to be reached. It can thus be viewed as an approach combining a priori probability (i.e., estimated ahead of the experiment) with knowledge gained after the experiment is over. Both the classical and the Bayesian approaches converge to the same results as increasingly more data (or information) is gathered. It is when the data sets are small that the additional insight offered by the Bayesian approach becomes advantageous. Thus, the Bayesian view is not at odds with the frequentist approach, but rather adds (or allows the addition of) refinement to it. This can be a great benefit in many types of analysis, and therein lies its appeal. Bayes' theorem and its application to discrete and continuous probability variables are discussed in Sect. 2.5, while Sect. 4.6 presents its application to estimation and hypothesis testing problems.
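A small numerical illustration of such a prior-to-posterior update: suppose a coin produces 8 heads in 10 tosses. The two competing hypotheses and the 50/50 prior below are invented for illustration (a fair coin, p = 0.5, versus a biased coin with p = 0.8); Bayes' theorem then weighs the binomial likelihoods by the priors.

```python
from math import comb

# Two-hypothesis Bayesian update for 8 heads in 10 tosses.
# Hypotheses (illustrative): H_fair (p=0.5) vs. H_biased (p=0.8), prior 50/50.
def binom_lik(p, heads=8, n=10):
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

prior_fair = 0.5
post_fair = (prior_fair * binom_lik(0.5)) / (
    prior_fair * binom_lik(0.5) + (1 - prior_fair) * binom_lik(0.8)
)
# post_fair ≈ 0.13: the data shift belief substantially toward the biased coin
```

With more tosses, the posterior would continue to sharpen, illustrating how the Bayesian and frequentist answers converge as data accumulate.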
2.1.3 Distinction Between Probability and Statistics

The distinction between probability and statistics is often not clear cut, and sometimes the terminology adds to the confusion.1 In its simplest sense, probability theory generally allows one to predict the behavior of the system "before" the event under the stipulated assumptions, while statistics refers to a body of a posteriori knowledge whose application allows one to make sense of the data collected. Thus, probability concepts provide the theoretical underpinnings of those aspects of statistical analysis that involve random behavior or noise in the actual data being analyzed. Recall that in Sect. 1.3.2, a distinction was made between four types of uncertainty or unexpected variability in the data. The first was due to the stochastic or inherently random nature of the process itself, which no amount of experimentation, even if done perfectly, can overcome. The study of probability theory is mainly mathematical and applies to this type, i.e., to situations/processes/systems whose random nature is known to be of a certain type or can be modeled such that its behavior (i.e., certain events being produced by the system) can be predicted in the form of probability distributions. Thus, probability deals with the idealized behavior of a system under a known type of randomness. Unfortunately, most natural or engineered systems do not fit neatly into any one of these groups, and so when performance data of a system is available, the objective may be:

(i) To try to understand the overall nature of the system from its measured performance, i.e., to explain what caused the system to behave in the manner it did.
(ii) To try to make inferences about the general behavior of the system from a limited amount of data.

Consequently, some authors have suggested that the probability approach be viewed as a "deductive" science where the conclusion is drawn without any uncertainty, while statistics is an "inductive" science where only an imperfect conclusion can be reached, with the added problem that this conclusion hinges on the types of assumptions one makes about the random nature of the process and its forcing functions! Here is a simple example to illustrate the difference. Consider the flipping of a coin supposed to be fair. The probability of getting "heads" is 1/2. If, however, "heads" came up eight times in the last ten trials, what is the probability that the coin is not fair? Statistics allows an answer to this type of "inverse" inquiry, while probability is the approach for the "forward" type of questioning.

1 For example, "statistical mechanics" in physics has nothing to do with statistics at all but is a type of problem studied under probability.
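The coin example just discussed lends itself to a quick numerical illustration of the Bayesian updating idea of Sect. 2.1.2. The sketch below is not from the text: the "biased" alternative (p = 0.8) and the 50/50 prior are assumed purely for illustration.

```python
from math import comb

# Posterior probability that the coin is fair, given 8 heads in 10 flips.
# Illustrative sketch: a discrete prior over two hypotheses, "fair" (p = 0.5)
# and "biased" (p = 0.8, an assumed alternative), updated via Bayes' theorem.
def posterior_fair(heads, flips, p_biased=0.8, prior_fair=0.5):
    like_fair = comb(flips, heads) * 0.5**heads * 0.5**(flips - heads)
    like_biased = comb(flips, heads) * p_biased**heads * (1 - p_biased)**(flips - heads)
    num = like_fair * prior_fair
    return num / (num + like_biased * (1 - prior_fair))

print(round(posterior_fair(8, 10), 3))   # 0.127
```

With only ten flips, the posterior probability that the coin is fair drops to roughly 0.13 under these assumptions; as more data are gathered, the frequentist and Bayesian answers converge, as noted in Sect. 2.1.2.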
2.2 Classical Probability

2.2.1 Basic Terminology
A random (or stochastic) variable is one whose numerical value depends on the outcome of a random phenomenon, experiment, or trial, i.e., one whose value depends on chance and is thus not entirely predictable. For example, rolling a dice is a random experiment. There are two types of random variables: • Discrete random variables—those that can take on only a finite or countable number of values. • Continuous random variables—those that may take on any value in an interval. The following basic notions relevant to the study of probability apply primarily to discrete random variables:
(i) Simple event or trial of a random experiment is one that has a single outcome. It cannot be decomposed into anything simpler. For example, getting a {6} when a dice is rolled. (ii) Sample space (some refer to it as "universe") is the set of all possible outcomes of a single trial. For the rolling of a six-faced dice, the sample space is S = {1, 2, 3, 4, 5, 6}. (iii) Compound or composite event is one involving a grouping of several simple events. For example, getting a preselected number, say A = {6}, from all the possible outcomes of rolling two six-faced dice together would constitute a composite event. (iv) Complement of an event A is the set of outcomes in the sample space not contained in A. For example, Ā = {2, 3, 4, 5, 7, 8, 9, 10, 11, 12} is the complement of the event A above.
2.2.2 Basic Set Theory Notation and Axioms of Probability

A "set" in mathematics is a collection of well-defined distinct objects that can be considered as an object in its own right. Set theory has its own definitions, axioms, notations, and algebra, which are closely related to the properties of random variables. Familiarity with the algebraic manipulation of sets can enhance the understanding and manipulation of probability concepts. The outcomes of simple or compound events can also be considered to be elements of a set. For example, the sample space S of outcomes from rolling a dice is a collection of items or a set, usually indicated by S = {1, 2, 3, 4, 5, 6}. If set B were to denote B = {1, 2, 3}, it would be a subset of S or "belong to S," which is expressed as B ∈ S. The symbol ∉ represents "not belonging to." Another mathematical representation that B is a subset of S is B ⊂ S or S ⊃ B. Generally, if E is the set of numbers between 3 and 6 (inclusive), it can be stated as E = {x | 3 ≤ x ≤ 6}. Finally, a compound or joint event is one that arises from operations involving two or more events occurring at the same time. The concepts of complement, union, intersection, etc. discussed below in the context of probability also apply to set manipulation.

The Venn diagram is a pictorial representation wherein elements are shown as points in a plane and sets as closed regions within an enclosing rectangle denoting the universal set or sample space. It offers a convenient manner of illustrating set and subset interaction and allows an intuitive understanding of compound events and the properties of their combined probabilities. Figure 2.1 illustrates the following concepts:

Fig. 2.1 Venn diagrams for a few simple cases. (a) Event A is denoted as a region in space S. The probability of event A is represented by the ratio of the area inside the circle to that inside the rectangle. (b) The intersection of events A and B is the common overlapping area (shown hatched). (c) Events A and B are mutually exclusive or disjoint events (A ∩ B = Ø). (d) Event B is a subset of event A
• The universe of outcomes or sample space S is a set denoted by the area enclosed within a rectangle, while the probability of a particular event (say, event A) is denoted by a region within it (Fig. 2.1a).
• Union of two events A and B (Fig. 2.1b) is represented by the set of outcomes in either A or B or both, and is denoted by A ∪ B (where the symbol ∪ is conveniently remembered as the "u" of "union"). This is akin to an addition, and the composite event is denoted mathematically as C = A ∪ B = B ∪ A. An example is the number of cards in a pack of 52 cards that are either hearts or spades (= 52 × (1/4 + 1/4) = 26).
• Intersection of two events A and B is represented by the set of outcomes that are in both A and B simultaneously. This is akin to a multiplication, and is denoted by D = A ∩ B = B ∩ A. It is represented by the hatched area in Fig. 2.1b. An example is drawing a card from a deck and finding it to be a red jack (probability = (1/2) × (1/13) = 1/26). The figure also shows the areas denoted by the intersection of A and B with their complements Ā and B̄.
• Mutually exclusive events or disjoint events are those that have no outcomes in common (Fig. 2.1c). In other words, the two events cannot occur together during the same trial. If events A and B are disjoint, this is expressed as A ∩ B = Ø, where Ø denotes a null or empty set. An example is drawing a red spade (nil).
• Event B is inclusive in event A when all outcomes of B are contained in those of A, i.e., B is a subset of A (Fig. 2.1d). This is expressed as B ⊂ A or A ⊃ B or B ∈ A. An example is the number of cards less than six (event B) among the red cards (event A). The figure also shows the area (A – B) representing the difference between events A and B.
2.2.3 Axioms of Probability
Let us now apply the above concepts to random variables and denote the sample space S as consisting of two events A and B with probabilities p(A) and p(B), respectively. Then:

(i) The probability of any event, say A, cannot be negative. This is expressed as:

p(A) ≥ 0   (2.2)

(ii) The probabilities of all events must sum to unity (i.e., be normalized):

p(S) = p(A) + p(B) = 1   (2.3)

(iii) The probabilities of mutually exclusive events A and B add up:

p(A ∪ B) = p(A) + p(B)   (2.4)

If a dice is rolled twice, the outcomes can be assumed to be mutually exclusive. If event A is the occurrence of 2 and event B that of 3, then p(A or B) = 1/6 + 1/6 = 1/3, i.e., the additive rule (Eq. 2.4) applies. The extension to more than two mutually exclusive events is straightforward. Some other inferred relations are:

(iv) The probability of the complement of event A:

p(Ā) = 1 − p(A)   (2.5)

which follows from the fact that A ∪ Ā = S.

(v) The probability for either A or B to occur (when they are not mutually exclusive) is:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)   (2.6)

This is intuitively obvious from the Venn diagram (see Fig. 2.1b) since the hatched area (representing p(A ∩ B)) gets counted twice in the sum, and so needs to be deducted once. This equation can also be deduced from the axioms of probability. Note that if events A and B are mutually exclusive, then Eq. 2.6 reduces to Eq. 2.4.

Example 2.2.1
(a) Set theory approach
Consider two sets A and B defined by integers from 1 to 10: A = {1, 4, 5, 7, 8, 9} and B = {2, 3, 5, 6, 8, 10}. Then A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = sample space S. Also, A ∩ B = {5, 8}, A – B = {1, 4, 7, 9}, and B – A = {2, 3, 6, 10}. The reader is urged to draw the corresponding Venn diagram for better conceptual understanding.
(b) Probability approach
Let the two sets be redefined based on the number of integers in each set. Then p(A) = 6/10 = 0.6 and p(B) = 0.6. Clearly, the two sets are overlapping since their probabilities sum to greater than one. The union p(A ∪ B) = 1, since the union is the entire space S. Rearranging Eq. 2.6 gives the intersection p(A ∩ B) = p(A) + p(B) − 1 = 0.2, which is consistent with (a), where the intersection consisted of two elements out of the 10 in the set.

Example 2.2.2
For three non-mutually exclusive events, Eq. 2.6 can be extended to:

p(A ∪ B ∪ C) = p(A) + p(B) + p(C) − p(A ∩ B) − p(A ∩ C) − p(B ∩ C) + p(A ∩ B ∩ C)   (2.7)

This is clear from the corresponding Venn diagram shown in Fig. 2.2. The last term (A ∩ B ∩ C) denotes the intersection area of all three events occurring simultaneously; it is counted three times in the first three terms and deducted three times as part of (A ∩ B), (A ∩ C), and (B ∩ C), and so needs to be added back once.
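The set-theory and probability calculations of Example 2.2.1 can be checked directly with Python's built-in set type; the helper p below (probability as relative set size) is ours, not the book's notation.

```python
# Checking Example 2.2.1 with Python sets.
A = {1, 4, 5, 7, 8, 9}
B = {2, 3, 5, 6, 8, 10}
S = set(range(1, 11))

assert A | B == S                # union is the whole sample space
assert A & B == {5, 8}           # intersection
assert A - B == {1, 4, 7, 9}     # difference A - B
assert B - A == {2, 3, 6, 10}

# Probabilities as relative set sizes, illustrating Eq. 2.6:
p = lambda E: len(E) / len(S)
assert abs(p(A | B) - (p(A) + p(B) - p(A & B))) < 1e-12
print(p(A & B))   # 0.2, matching the rearrangement of Eq. 2.6
```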
Fig. 2.2 Venn diagram for the union of three non-mutually exclusive events, showing the regions A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C (Example 2.2.2)

Fig. 2.3 Two components connected in series and in parallel (Example 2.2.3)
2.2.4 Joint, Marginal, and Conditional Probabilities
(a) Joint probability of two independent events represents the case when both events occur together at the same point in time. Such events and more complex probability problems are not appropriate for Venn diagram representation. Then, following the multiplication law:

p(A and B) = p(A) · p(B) if A and B are independent   (2.8)

These are called product models. The notations p(A ∩ B) and p(A and B) can be used interchangeably. Consider a six-faced dice-tossing experiment. If the outcome of event A is the occurrence of an even number, then p(A) = 1/2. If the outcome of event B is that the number is less than or equal to 4, then p(B) = 2/3. The probability that both outcomes occur when a dice is rolled is p(A and B) = 1/2 × 2/3 = 1/3. This is consistent with our intuition since outcomes {2, 4} would satisfy both events.

Example 2.2.3
Probability concepts directly apply to reliability problems associated with engineered systems. Consider two electronic components A and B with different rates of failure expressed as probabilities, say p(A) = 0.1 and p(B) = 0.25. What is the failure probability of a system made up of the two components if connected (a) in series and (b) in parallel? Assume independence, i.e., the failure of one component is independent of the other. The two cases are shown in Fig. 2.3 with the two components A and B connected in series and in parallel. Reliability is the probability of functioning properly and is the complement of the probability of failure, i.e., reliability R(A) = 1 – p(A) = 1 – 0.1 = 0.9, and R(B) = 1 – p(B) = 1 – 0.25 = 0.75.

(a) Series connection: For the system to function, both components should be functioning. Then the joint probability of the system functioning (or the reliability):

R(A and B) = R(A) · R(B) = (0.9 × 0.75) = 0.675

The failure probability of the system is p(system failure) = 1 – R(A and B) = 1 – 0.675 = 0.325. For the special case when both components have the same probability of failure p, it is left to the reader to verify that the system reliability or probability of system functioning = (1 – p)².

(b) Parallel connection: For the system to fail, both components should fail. In this case, it is better to work with failure probabilities. Then, the joint probability of the system failing is p(A) · p(B) = 0.1 × 0.25 = 0.025, which is much lower than the 0.325 found for the components-in-series scenario. This result is consistent with the intuitive fact that components in parallel increase the reliability of the system. The corresponding probability of functioning is R(A and B) = 1 – 0.025 = 0.975. For the special case when both components have the same probability of failure p, it is left to the reader to verify that the system reliability or probability of system functioning = (1 – p²).

(b) Marginal probability of an event A compared to another event B refers to its probability of occurrence irrespective of B. It is sometimes referred to as the "unconditional probability" of A on B. Let the space contain only events A and B. Since S can be taken to be the union of the event space B and its complement B̄, the probability of A can be expressed in terms of the sum of the disjoint parts of B:

p(A) = p(A ∩ B) + p(A ∩ B̄) = p(A and B) + p(A and B̄)   (2.9)

This expression (a statement of the total probability rule) can be extended to the case of more than two joint events. This equation will be made use of in Sect. 2.5.
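The series and parallel reliability results of Example 2.2.3 follow directly from the product model of Eq. 2.8; a minimal sketch (the function names are ours):

```python
# Series and parallel reliability for two independent components
# (failure probabilities from Example 2.2.3).
def series_reliability(p_fail_a, p_fail_b):
    # Both must function: product of the component reliabilities (Eq. 2.8)
    return (1 - p_fail_a) * (1 - p_fail_b)

def parallel_reliability(p_fail_a, p_fail_b):
    # The system fails only if both components fail
    return 1 - p_fail_a * p_fail_b

print(round(series_reliability(0.1, 0.25), 3))       # 0.675
print(round(1 - series_reliability(0.1, 0.25), 3))   # 0.325, series failure
print(round(parallel_reliability(0.1, 0.25), 3))     # 0.975
```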
Example 2.2.4
Percentage data of annual income versus age have been gathered from a large population living in a certain region (see Table 2.1). Let X be the income and Y the age. The marginal probability of X for each class is simply the sum of the probabilities in each column, and that of Y the sum of those in each row. Thus, p(X ≤ $40,000) = 0.15 + 0.10 + 0.08 = 0.33, and so on. Also, verify that the marginal probabilities of X and of Y each sum to 1.00 (to satisfy the normalization condition). ■

(c) Conditional probability: There are several situations involving compound outcomes that are sequential or successive in nature. The chance result of the first stage determines the conditions under which the next stage occurs. Such events, called two-stage (or multistage) events, involve step-by-step outcomes that can be represented as a probability tree. This allows better visualization of how the probabilities progress from one stage to the next. If A and B are events, then the probability that event B occurs given that A has already occurred is given by the conditional probability of B given A:

p(B/A) = p(A ∩ B) / p(A)   (2.10)
An example of a conditional probability event is the drawing of a spade from a pack of cards from which a first card has already been drawn. If it is known that the first card was not a spade, then the probability of drawing a spade the second time is 13/51. On the other hand, if the first card drawn was a spade, then the probability of getting a spade on the second draw is 12/51 = 4/17. A special but important case is when p(B/A) = p(B). In this case, B is said to be independent of A because the fact that event A has occurred does not affect the probability of B occurring. In this case, one gets back Eq. 2.8.
Mutually exclusive events and independent events are not to be confused. The former is a property of the events themselves occurring in the same trial, while the latter is a property that arises from event probabilities that are sequential or staged over time. The distinction is clearer if one keeps in mind that:
– If events A and B are mutually exclusive, p(A ∩ B) = p(AB) = 0.
– If events A and B are independent, then p(AB) = p(A) · p(B).
– If events A and B are mutually exclusive and independent, then p(AB) = 0 = p(A) · p(B), and so one of the events cannot or should not occur.

Example 2.2.5
A single fair dice is rolled. Let event A = {even outcome} and event B = {outcome is divisible by 3}.
(a) List the various outcomes in the sample space: {1 2 3 4 5 6}
(b) List the outcomes in A and find p(A): {2 4 6}, p(A) = 1/2
(c) List the outcomes in B and find p(B): {3 6}, p(B) = 1/3
(d) List the outcomes in A ∩ B and find p(A ∩ B): {6}, p(A ∩ B) = 1/6
(e) Are the events A and B independent? Yes, since Eq. 2.8 holds ■

Example 2.2.6
Two defective bulbs have been mixed with eight good ones. Let event A = {first bulb is good} and event B = {second bulb is good}.
(a) If two bulbs are chosen at random with replacement, the two events are independent. What is the probability that both are good? Given p(A) = 8/10 and p(B) = 8/10, then from Eq. 2.8:
p(A ∩ B) = (8/10) × (8/10) = 64/100 = 0.64

(b) What is the probability that two bulbs drawn in sequence (i.e., not replaced) are good, where the status of the bulb after the first draw is known to be good? From Eq. 2.8, p(both bulbs drawn are good):

p(A ∩ B) = p(A) · p(B/A) = (8/10) × (7/9) = 28/45 = 0.622 ■

Table 2.1 Computing marginal probabilities from a probability table (Example 2.2.4)

                          Income (X)
Age (Y)                   ≤$40,000   40,000–90,000   ≥$90,000   Marginal probability of Y
Under 25                  0.15       0.09            0.05       0.29
Between 25 and 40         0.10       0.16            0.12       0.38
Above 40                  0.08       0.20            0.05       0.33
Marginal probability of X 0.33       0.45            0.22       Should sum to 1.00 both ways
Example 2.2.7
Two events A and B have the following probabilities: p(A) = 0.3, p(B) = 0.4, and p(Ā ∩ B) = 0.28.
(a) Determine whether the events A and B are independent or not. From Eq. 2.5, p(Ā) = 1 − p(A) = 0.7. Next, one verifies whether Eq. 2.8 holds, i.e., whether p(Ā ∩ B) = p(Ā) · p(B), or whether 0.28 is equal to (0.7 × 0.4). Since it is, one can state that events A and B are independent.
(b) Find p(A ∪ B). From Eqs. 2.6 and 2.8:
p(A ∪ B) = p(A) + p(B) − p(A ∩ B) = p(A) + p(B) − p(A) · p(B) = 0.3 + 0.4 − (0.3)(0.4) = 0.58
Example 2.2.8
Generating a probability tree for a residential air-conditioning (AC) system. Assume that the AC is slightly undersized for the house it serves. There are two possible outcomes (S, satisfactory; and NS, not satisfactory) depending on whether the AC is able to maintain the desired indoor temperature. The outcomes depend on the outdoor temperature, and, for simplicity, its annual variability is grouped into three categories: very hot (VH), hot (H), and not hot (NH). The probabilities for outcomes S and NS to occur in each of the three day-type categories are shown in the conditional probability tree diagram (Fig. 2.4), while the joint probabilities computed following Eq. 2.8 are assembled in Table 2.2. Note that the relative probabilities of the three branches in the first stage, and of the two branches of each outcome, add to unity (e.g., in the very hot category, the S and NS outcomes add to 1.0, and so on). Further, note that the joint probabilities shown in the table must also sum to unity (it is advisable to perform such verification checks). The probability of the indoor conditions being satisfactory is determined as p(S) = 0.02 + 0.27 + 0.6 = 0.89, while p(NS) = 0.08 + 0.03 + 0 = 0.11. It is wise to verify that p(S) + p(NS) = 1.0. ■
Fig. 2.4 The conditional probability tree for the residential air-conditioner when two outcomes are possible (S, satisfactory; and NS, not satisfactory) for each of three day-types (VH, very hot; H, hot; and NH, not hot). (Example 2.2.8)

Table 2.2 Joint probabilities of various outcomes (Example 2.2.8)

p(VH ∩ S) = 0.1 × 0.2 = 0.02    p(VH ∩ NS) = 0.1 × 0.8 = 0.08
p(H ∩ S) = 0.3 × 0.9 = 0.27     p(H ∩ NS) = 0.3 × 0.1 = 0.03
p(NH ∩ S) = 0.6 × 1.0 = 0.6     p(NH ∩ NS) = 0.6 × 0 = 0
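The joint probabilities of Example 2.2.8 are products of the branch probabilities along the tree; a short sketch of that computation:

```python
# Joint probabilities for the air-conditioner tree of Example 2.2.8:
# p(day type) times p(outcome | day type), following Eq. 2.8.
p_day = {"VH": 0.1, "H": 0.3, "NH": 0.6}
p_S_given = {"VH": 0.2, "H": 0.9, "NH": 1.0}   # p(satisfactory | day type)

p_S = sum(p_day[d] * p_S_given[d] for d in p_day)
p_NS = sum(p_day[d] * (1 - p_S_given[d]) for d in p_day)

print(round(p_S, 2), round(p_NS, 2))   # 0.89 0.11
assert abs(p_S + p_NS - 1.0) < 1e-12   # the verification check advised in the text
```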
Table 2.3 Probabilities of various outcomes (Example 2.2.9)

p(A ∩ R) = 1/2 × 1/2 = 1/4     p(B ∩ R) = 1/2 × 3/4 = 3/8
p(A ∩ W) = 1/2 × 1/2 = 1/4     p(B ∩ W) = 1/2 × 0 = 0
p(A ∩ G) = 1/2 × 0 = 0         p(B ∩ G) = 1/2 × 1/4 = 1/8
Example 2.2.9 Two-stage experiment
Consider a problem where there are two boxes with marbles as specified:

Box A: (1 red and 1 white); and Box B: (3 red and 1 green)

A box is chosen at random and a marble drawn from it. What is the probability of getting a red marble? One is tempted to say that since there are 4 red marbles in total out of 6 marbles, the probability is 2/3. However, this is incorrect, and the proper analysis approach requires that one frame this problem as a two-stage experiment. The first stage is the selection of the box, and the second the drawing of the marble. Let event A (or event B) denote choosing Box A (or Box B). Let R, W, and G represent red, white, and green marbles. The resulting probabilities are shown in Table 2.3.
Fig. 2.5 The first stage of the forward probability tree diagram involves selecting a box (either A or B) while the second stage involves drawing a marble that can be red (R), white (W), or green (G) in color. The total probability of drawing a red marble is 5/8. (Example 2.2.9)
Thus, the probability of getting a red marble = 1/4 + 3/8 = 5/8. The above example is depicted in Fig. 2.5, where the reader can visually note how the probabilities propagate through the probability tree. This is called the "forward tree" to differentiate it from the "reverse tree" discussed in Sect. 2.5. The above example illustrates how a two-stage experiment must be approached. First, one selects a box, which by itself does not tell us whether the marble is red (since one has yet to pick a marble). Only after a box is selected can one use the prior probabilities regarding the color of the marbles inside the box in question to determine the probability of picking a red marble. These prior probabilities can be viewed as conditional probabilities; for example, p(A ∩ R) = p(R/A) · p(A) ■
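The forward-tree value of Example 2.2.9 can be cross-checked both exactly and by simulation; a sketch (the naive 2/3 answer would fail this check):

```python
import random

# Two-stage experiment of Example 2.2.9: pick a box at random, then a marble.
# Exact forward-tree value: 1/2 * 1/2 + 1/2 * 3/4 = 5/8.
boxes = {"A": ["R", "W"], "B": ["R", "R", "R", "G"]}

exact = sum(0.5 * box.count("R") / len(box) for box in boxes.values())
print(exact)   # 0.625

random.seed(1)
trials = 100_000
hits = sum(random.choice(random.choice(list(boxes.values()))) == "R"
           for _ in range(trials))
print(abs(hits / trials - exact) < 0.01)   # simulation agrees with the tree
```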
2.2.5 Permutations and Combinations
The study of probability requires a sound knowledge of combinatorial mathematics, which is concerned with developing rules for situations involving permutations and combinations.

(a) Permutation P(n, k) is the number of ways in which k objects can be selected from n objects with the order being important. It is given by:

P(n, k) = n! / (n − k)!   (2.11)

A special case is the number of permutations of n objects taken n at a time:

P(n, n) = n! = n(n − 1)(n − 2) . . . (2)(1)   (2.12)

(b) Combination C(n, k) is the number of ways in which k objects can be selected from n objects with the order not being important. It is given by:

C(n, k) = n! / [(n − k)! k!], also written as the binomial coefficient (n k)   (2.13)

Note that the same equation also defines the binomial coefficients, since the expansion of (a + b)^n according to the binomial theorem is:

(a + b)^n = Σ_{k=0}^{n} (n k) a^(n−k) b^k   (2.14)
Example 2.2.10
(a) Calculate the number of ways in which three people from a group of seven can be seated in a row. This is a case of permutation since the order is important. The number of possible ways is given by Eq. 2.11:

P(7, 3) = 7!/(7 − 3)! = (7)(6)(5) = 210

(b) Calculate the number of combinations in which three people can be selected from a group of seven. Here the order is not important, and the combination formula (Eq. 2.13) can be used. Thus:

C(7, 3) = 7!/[(7 − 3)! 3!] = (7)(6)(5)/[(3)(2)(1)] = 35 ■
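Both counts in Example 2.2.10 are available in the Python standard library (math.perm and math.comb, Python 3.8+), which also lets one confirm Eqs. 2.11 and 2.13 term by term:

```python
from math import perm, comb, factorial

# Example 2.2.10 revisited with standard-library counting functions.
print(perm(7, 3))   # 210 ways to seat 3 of 7 people in a row (order matters)
print(comb(7, 3))   # 35 ways to select 3 of 7 people (order irrelevant)

# Eqs. 2.11 and 2.13 spelled out:
assert perm(7, 3) == factorial(7) // factorial(7 - 3)
assert comb(7, 3) == factorial(7) // (factorial(7 - 3) * factorial(3))
```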
Table 2.4 Number of combinations for equipment scheduling in a large physical plant of a campus

Status (0, off; 1, on)
                                Prime movers        Boilers        Chillers-Vapor compr.   Chillers-Absorption   Number of combinations
One of each                     0–1                 0–1            0–1                     0–1                   2^4 = 16
Two of each: assumed identical  0–0, 0–1, 1–1       0–0, 0–1, 1–1  0–0, 0–1, 1–1           0–0, 0–1, 1–1         3^4 = 81
Two of each: nonidentical
  except for boilers            0–0, 0–1, 1–0, 1–1  0–0, 0–1, 1–1  0–0, 0–1, 1–0, 1–1      0–0, 0–1, 1–0, 1–1    4^3 × 3^1 = 192
Another type of combinatorial problem is the factorial problem to be discussed in Chap. 6 while dealing with design of experiments. Consider a specific example involving equipment scheduling at the physical plant of a large campus that includes prime movers (diesel engines or turbines that produce electricity), boilers, and chillers (vapor compression and absorption machines). Such equipment needs a certain amount of time to come online, and so operators typically keep some of it "idling" so that it can start supplying electricity/heating/cooling at a moment's notice. The operating states can be designated by a binary variable, say "1" for on-status and "0" for off-status. Extensions of this concept include cases where, instead of two states, one could have m states. An example of three states is when, say, two identical boilers are to be scheduled. One could have three states altogether: (i) when both are off (0–0), (ii) when both are on (1–1), and (iii) when only one is on (1–0). Since the boilers are identical, state (iii) is identical to 0–1. In case the two boilers are of different sizes, there would be four possible states. The number of combinations possible for "n" such equipment items, where each one can assume "m" states, is given by m^n. Some simple cases for scheduling four different types of energy equipment in a physical plant are shown in Table 2.4.
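The m^n counting rule behind Table 2.4 can be confirmed by brute-force enumeration; a sketch using itertools.product:

```python
from itertools import product

# Counting on/off schedules, as in Table 2.4. With n equipment items that can
# each take m states, there are m**n combinations; enumerating them with
# itertools.product confirms the count.
n_equipment = 4                      # prime mover, boiler, two chiller types
two_state = list(product((0, 1), repeat=n_equipment))
print(len(two_state))                # 16 = 2**4, the "one of each" row

three_state = list(product(("0-0", "0-1", "1-1"), repeat=n_equipment))
print(len(three_state))              # 81 = 3**4, the "two identical of each" row
```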
2.3 Probability Distribution Functions

2.3.1 Density Functions
The notions of discrete and continuous random variables were introduced in Sect. 2.2.1. The concepts relevant to discrete outcomes of events or discrete random variables were addressed in Sect. 2.2, and these will now be extended to continuous random variables. The distribution of a random variable represents the probability of it taking its various possible values. For example, if the y-axis in Fig. 1.2 of the six-faced dice rolling experiment were to be changed into a relative frequency (= 1/6), the resulting histogram would graphically represent the corresponding probability density function (PDF) (Fig. 2.6a). Thus, the probability of getting a 2 in the rolling of a dice is 1/6. Since this is a discrete random variable, the function takes on specific values at discrete points of the x-axis (which represents the outcomes).
The same type of y-axis normalization done to the data shown in Fig. 1.3 would result in the PDF for the case of continuous random data. This is shown in Fig. 2.7a for the random variable taken to be the hourly outdoor dry-bulb temperature over the year at Philadelphia, PA. Notice that this is the envelope of the histogram of Fig. 1.3. Since the variable is continuous, it is not meaningful to try to determine the probability of, say, a temperature outcome of exactly 57.5 °F. One would instead be interested in the probability of outcomes within a range, say 55–60 °F. The probability can then be determined as the area under the PDF, as shown in Fig. 2.7b. It is for such continuous random variables that the cumulative distribution function (CDF) is useful. It is simply the cumulative area under the curve starting from the lowest value of the random variable X to the current value (Fig. 2.8). The vertical scale directly gives the probability (or, in this case, the fractional time) that X is less than or equal to a certain value. Thus, the probability (X ≤ 60) is about 0.58. The concept of CDF also applies to discrete variables, but as a discontinuous stepped curve, as illustrated in Fig. 2.6b for the dice rolling example. To restate, depending on whether the random variable is discrete or continuous, one gets discrete or continuous PDFs. Though most experimentally gathered data is discrete, the underlying probability theory is based on the data being continuous. Replacing the integration sign by the summation sign in the equations that follow allows extending the definitions to discrete distributions. Let f(x) be the PDF associated with a random variable X. This is a function that provides the probability that a discrete random variable X takes on a particular value x among its various possible values. The axioms of probability (Eqs. 2.2 and 2.3) for the discrete case are expressed for the case of continuous random variables as: • PDF cannot be negative: f(x) ≥ 0
In problems where the normal distribution is used, it is more convenient to standardize the random variable.
2.4 Important Probability Distributions

Fig. 2.16 Figures meant to illustrate the fact that the shaded areas are the physical representations of the tabulated standardized probability values in Table A3 (corresponds to Example 2.4.9ii). (a) Lower limit. (b) Upper limit
The standardized random variable z = (x − μ)/σ has a mean of zero and a variance of unity. This results in the standard normal curve or z-curve:

N(z; 0, 1) = (1/√(2π)) exp(−z²/2)   (2.42b)

In actual problems, the standard normal distribution is used to determine the probability of the random variable having a value within a certain interval, say z between z1 and z2. Then Eq. 2.42b can be modified into:

N(z1 ≤ z ≤ z2) = ∫ from z1 to z2 of (1/√(2π)) exp(−z²/2) dz   (2.42c)
The shaded area in Table A3 permits evaluating the above integral, i.e., determining the associated probability, assuming z1 = −∞. Note that for z = 0, the probability given by the shaded area is equal to 0.5. Since not all texts adopt the same format in which to present these tables, the user is urged to use caution in interpreting the values shown in such tables.

Example 2.4.9 Graphical interpretation of probability using the standard normal table
Resistors made by a certain manufacturer have a nominal value of 100 ohms, but their actual values are normally distributed with a mean of μ = 100.6 ohms and standard deviation σ = 3 ohms. Find the percentage of resistors that will have values:
(i) Higher than the nominal rating. The standard normal variable z(X = 100) = (100 – 100.6)/3 = −0.2. From
Table A3, this corresponds to a probability of (1 – 0.4207) = 0.5793, or 57.93%.
(ii) Within 3 ohms of the nominal rating (i.e., between 97 and 103 ohms). The lower limit z1 = (97 – 100.6)/3 = −1.2, and the tabulated probability from Table A3 is p(z = –1.2) = 0.1151 (as illustrated in Fig. 2.16a). The upper limit is z2 = (103 – 100.6)/3 = 0.8. However, care should be taken in properly reading the corresponding value from Table A3, which gives probability values only for z < 0. One first determines the probability for the negative value symmetric about 0, i.e., p(z = –0.8) = 0.2119 (shown in Fig. 2.16b). Since the total area under the curve is 1.0, p(z ≤ 0.8) = 1.0 – 0.2119 = 0.7881. Finally, the required probability p(–1.2 < z < 0.8) = (0.7881 – 0.1151) = 0.6730, or 67.3%. ■

Inspection of Table A3 allows the following statements to be made, which are often adopted during statistical inferencing:
• The interval μ ± σ contains approximately [1 – 2(0.1587)] = 0.683, or 68.3% of the observations.
• The interval μ ± 2σ contains approximately 95.4% of the observations.
• The interval μ ± 3σ contains approximately 99.7% of the observations.

Another manner of using the standard normal table is for the "backward" problem. In such cases, instead of the z value being specified and the probability deduced, the probability is specified and the z value is to be deduced.
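The Table A3 lookups in Example 2.4.9 can be reproduced in closed form with the error function, a standard identity for the normal CDF (the helper name phi is ours, not the book's notation):

```python
from math import erf, sqrt

# Standard normal CDF via the error function; this plays the role of a
# Table A3 lookup.
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 2.4.9: resistors with mu = 100.6 ohms, sigma = 3 ohms.
print(round(1 - phi((100 - 100.6) / 3), 4))   # 0.5793, above nominal rating
z1 = (97 - 100.6) / 3    # -1.2
z2 = (103 - 100.6) / 3   # 0.8
print(round(phi(z2) - phi(z1), 3))            # 0.673, within 3 ohms
print(round(phi(1) - phi(-1), 3))             # 0.683, the mu +/- sigma interval
```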
Example 2.4.10
Reinforced and prestressed concrete structures are designed so that the compressive stresses are carried mostly by the concrete itself. For this and other reasons, the main criterion by which the quality of concrete is assessed is its compressive strength. Specifications for concrete used in civil engineering jobs may require specimens of specified size and shape (usually cubes) to be cast and tested on site. One can assume the normal distribution to apply.

If the mean and standard deviation of this distribution are μ and σ, and the civil engineer wishes to determine the "statistical minimum strength" x, specified as the strength below which only, say, 5% of the cubes are expected to fail, one searches Table A3 and determines the value of z for which the probability is 0.05, i.e., p(z = −1.645) = 0.05. This would correspond to x = μ − 1.645σ.

Fig. 2.17 Comparison of the normal (or Gaussian) z-curve to two Student's t-curves with different degrees of freedom (d.f.). As the d.f. decreases, the PDF for the Student's t-distribution flattens out and deviates increasingly from the normal distribution
(b) Student’s tDistribution One important application of the normal distribution is that it allows making statistical inferences about population means from random samples (see Sect. 4.2). In case the random samples are small (say, n < 30), then the Student’s tdistribution, rather than the normal distribution, should be used. If one assumes that the sampled population is approximately  μÞ normally distributed, then the random variable t = ðx p has s
nÞ
the Student’s tdistribution t(μ, s, ν) where μ is the sample mean, s is the sample standard deviation, and ν is the degrees of freedom = (n  1). Thus, the number of degrees of freedom (d.f.) equals the number of data points minus the number of constraints or restrictions placed on the data. Table A4 (which is set up differently from the standard normal table) provides numerical values of the tdistribution for different degrees of freedom at different conﬁdence levels. How to use these tables will be discussed in Sect. 4.2. Unlike the z curve, one has a family of tdistributions for different values of ν. Qualitatively, the tdistributions are similar to the standard normal distribution in that they are symmetric about a zero mean, while they are slightly wider than the corresponding normal distribution as indicated in Fig. 2.15. However, in terms of probability values represented by areas under the curves as in Example 2.4.11, the differences between the normal and the Student’s tdistributions are large enough to warrant retaining this distinction (Fig. 2.17).
Example 2.4.11 Differences between normal and Student’s t inferences
Consider Example 2.4.10, where the distribution of the strength of concrete samples tested followed a normal distribution with a mean and standard deviation of μ and σ. The probability that the “minimum strength” x, specified as the strength below which only say 5% of the cubes are expected to fail, was determined from Table A3 to be p(z = −1.645) = 0.05. It was then inferred that the statistical strength would correspond to x = μ − 1.645σ. The Student’s t-distribution allows one to investigate how this interval changes when different numbers of samples are taken. The mean and standard deviation now correspond to those of the sample. The critical values are the multipliers of the standard deviation, and the results are assembled in the table below for different numbers of samples tested (found from Table A4 for a single-tailed distribution of 95%). For an infinite number of samples, we get back the critical value found for the normal distribution.

Number of samples     5      10     30     ∞
Degrees of freedom    4      9      29     ∞
Critical values       2.132  1.833  1.699  1.645
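The critical values in the table above can be reproduced without a statistics package by numerically integrating the Student’s t PDF and inverting the CDF by bisection; this is only an illustrative sketch, since Table A4 (or any statistical library) gives the same numbers directly:

```python
import math

def t_pdf(x, nu):
    """PDF of the Student's t-distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_cdf(x, nu, steps=2000):
    """CDF via Simpson's rule on [0, x], using symmetry about 0 (x >= 0)."""
    h = x / steps
    s = t_pdf(0.0, nu) + t_pdf(x, nu)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    return 0.5 + s * h / 3

def t_critical(nu, p=0.95):
    """Smallest t whose CDF reaches p, found by bisection."""
    lo, hi = 0.0, 50.0
    while hi - lo > 1e-7:
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return mid

for nu in (4, 9, 29):
    print(nu, round(t_critical(nu), 3))   # 2.132, 1.833, 1.699
```

As the degrees of freedom grow, the critical value approaches the normal-distribution value of 1.645, consistent with the last column of the table.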
2.4 Important Probability Distributions
(c) Lognormal Distribution

This distribution is appropriate for non-negative skewed data for which the symmetrical normal distribution is no longer appropriate. If a variate X is such that ln(X) is normally distributed, then the distribution of X is said to be lognormal. With ln(X) ranging from −∞ to +∞, X would range from 0 to +∞. It is characterized by two parameters, the mean and standard deviation (μ, σ), as follows:

L(x; μ, σ) = [1 / (σ·x·√(2π))] exp[−(ln x − μ)² / (2σ²)]   when x ≥ 0
           = 0   elsewhere   (2.43)

Note that the two parameters pertain to the logarithmic mean and standard deviation, i.e., to ln(X) and not to the random variable X itself. The lognormal curves are a family of skewed curves, as illustrated in Fig. 2.18. Lognormal failure laws apply when the degradation in lifetime of a system is proportional to the current state of degradation. Typical applications involve flood frequency in civil engineering, crack growth and mechanical wear in mechanical engineering, failure of electrical transformers or faults in electrical cables in electrical engineering, and pollutants produced by chemical plants and threshold values for drug dosage in environmental engineering. Lognormal distributions are also often used to characterize “fragility curves,” which represent the probability of damage due to extreme natural events (hurricanes, earthquakes, etc.) to the built environment, such as buildings and other infrastructure. For example, Winkler et al. (2010) adopted topological and terrain-specific (μ, σ) parameters to model failure of electric power transmission lines, substations, and supporting towers/poles during hurricanes.

Fig. 2.18 Lognormal distributions for different mean and standard deviation values

Example 2.4.12 Using lognormal distributions for pollutant concentrations
Concentrations of pollutants produced by chemical plants are often modeled by lognormal distributions and used to evaluate compliance with government regulations. The concentration of a certain pollutant, in parts per million (ppm), is assumed lognormal with parameters μ = 4.6 and σ = 1.5. What is the probability that the concentration exceeds 10 ppm? One can use Eq. 2.43, or, simpler still, use the z tables (Table A3) after a suitable transformation of the random variable:

L(X > 10) = 1 − N[ln(10); 4.6, 1.5] = 1 − N[(ln(10) − 4.6)/1.5]
          = 1 − N(−1.531) = 1 − 0.0630 = 0.937 ■

(d) Gamma Distribution

The gamma distribution (also called the Erlang distribution) is a good candidate for modeling random phenomena that can only be positive and are unimodal (akin to the lognormal distribution). The gamma distribution is derived from the gamma function for positive values of α, which, one may recall from mathematics, is defined by the integral:
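As a numerical check of Example 2.4.12 (a sketch using the standard library in place of the printed z table):

```python
import math
from statistics import NormalDist

# Example 2.4.12: lognormal pollutant concentration with mu = 4.6, sigma = 1.5
# (these are the parameters of ln X). P(X > 10) = 1 - Phi((ln 10 - mu)/sigma)
mu, sigma = 4.6, 1.5
z = (math.log(10) - mu) / sigma          # ≈ -1.532
p_exceed = 1 - NormalDist().cdf(z)       # ≈ 0.937
```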
Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx,   α > 0   (2.44a)

Integration results in the following expression for non-negative integers k:

Γ(k + 1) = k!   (2.44b)
The continuous random variable X has a gamma distribution with positive parameters α and λ if its density function is given by:

G(x; α, λ) = λ^α e^(−λx) x^(α−1) / (α − 1)!   x > 0
           = 0   elsewhere   (2.44c)

The mean and variance of the gamma distribution are:

μ = α/λ and σ² = α/λ²   (2.44d)
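As a numerical sanity check of Eq. 2.44d, one can integrate x·G(x) and (x − μ)²·G(x) over the positive axis and recover α/λ and α/λ²; the sketch below uses Γ(α) in place of (α − 1)!, which is equivalent for integer α:

```python
import math

def gamma_pdf(x, alpha, lam):
    # G(x; alpha, lam) = lam^alpha * x^(alpha-1) * exp(-lam*x) / Gamma(alpha)
    return lam ** alpha * x ** (alpha - 1) * math.exp(-lam * x) / math.gamma(alpha)

def simpson(f, a, b, n=20000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

alpha, lam = 3, 1
mean = simpson(lambda x: x * gamma_pdf(x, alpha, lam), 0.0, 60.0)
var = simpson(lambda x: (x - mean) ** 2 * gamma_pdf(x, alpha, lam), 0.0, 60.0)
# mean ≈ alpha/lam = 3 and var ≈ alpha/lam**2 = 3, as Eq. 2.44d predicts
```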
Variation of the parameter α (called the shape factor) and λ (called the scale parameter) allows a wide variety of shapes to be generated (see Fig. 2.19). From Fig. 2.11, one notes that
the Gamma distribution is the parent distribution for many other distributions discussed. If α → ∞ and λ = 1, the gamma distribution approaches the normal (see Fig. 2.11). When α = 1, one gets the exponential distribution. When α = ν/2 and λ = 1/2, one gets the chi-square distribution (discussed below).

Fig. 2.19 Gamma distributions for different combinations of the shape parameter α and the scale parameter β = 1/λ

(e) Exponential Distribution

A special case of the gamma distribution for α = 1 is the exponential distribution. It is the continuous-distribution analogue to the geometric distribution, which is applicable to discrete random variables. Its PDF is given by:

E(x; λ) = λ e^(−λx)   if x ≥ 0
        = 0   otherwise   (2.45a)

where λ is the mean number of occurrences per unit time or distance. The mean and variance of the exponential distribution are:

μ = 1/λ and σ² = 1/λ²   (2.45b)

The distribution is represented by a family of curves for different values of λ (see Fig. 2.20). Exponential failure laws apply to products whose current age does not have much effect on their remaining lifetimes; hence, this distribution is said to be “memoryless.” It is used to model such processes as the interval between two occurrences, e.g., the distance between consecutive faults in a cable, the time between chance failures of a component (such as a fuse) or a system, the time between consecutive emissions of α-particles, or the time between successive arrivals at a service facility. The exponential and the Poisson distributions are closely related: while the latter represents the number of failures per unit time, the exponential represents the time between successive failures. In the context of faults in a long cable, if the number of faults per unit length is Poisson distributed, then the cable length between faults is exponentially distributed. Its CDF is given by:

CDF[E(a, λ)] = ∫₀^a λ e^(−λx) dx = 1 − e^(−λa)   (2.45c)

Fig. 2.20 Exponential distributions for three different values of the parameter λ
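Equation 2.45c and the “memoryless” property can be illustrated numerically; the rate λ = 0.4 per year below is an arbitrary choice for illustration:

```python
import math

def exp_cdf(a, lam):
    # CDF[E(a, lam)] = 1 - exp(-lam * a), Eq. 2.45c
    return 1 - math.exp(-lam * a)

lam = 0.4
# Memorylessness: P(X > s + t | X > s) equals the unconditional P(X > t)
s, t = 3.0, 2.0
p_cond = (1 - exp_cdf(s + t, lam)) / (1 - exp_cdf(s, lam))
p_uncond = 1 - exp_cdf(t, lam)
# both equal exp(-0.8) ≈ 0.4493: having survived s years changes nothing
```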
Example 2.4.13 Temporary disruptions to the power grid can occur due to random events such as lightning, transformer failures, forest fires, etc. The exponential distribution has been known to be a good function to model such failures. If these occur, on average, say, once every 2.5 years, then λ = 1/2.5 = 0.40 per year.

(a) What is the probability that there will be no more than one disruption next year? From Eq. 2.45c:

CDF[E(X ≤ 1; λ)] = 1 − e^(−0.4(1)) = 1 − 0.6703 = 0.3297
(b) What is the probability that there will be more than two disruptions next year? This is the complement of no more than two disruptions:

Probability = 1 − CDF[E(X ≤ 2; λ)] = 1 − [1 − e^(−0.4(2))] = 0.4493 ■

(f) Weibull Distribution

Another widely used distribution is the Weibull distribution, which has been found to be applicable to datasets from a wide variety of systems and natural phenomena. It has been used to model the time of failure or life of a component as well as engine emissions of various pollutants. Moreover, the Weibull distribution has been found to be very appropriate for modeling the reliability of a system, i.e., the failure time of the weakest component of a system (bearing, pipe joint failure, etc.). The continuous random variable X has a Weibull distribution with parameters α and β (shape and scale factors, respectively) if its density function is given by:

W(x; α, β) = (α/β^α) x^(α−1) exp[−(x/β)^α]   for x ≥ 0
           = 0   elsewhere   (2.46a)

with mean

μ = β Γ(1 + 1/α)   (2.46b)

Figure 2.21 shows the versatility of this distribution for different sets of α and β values. Also shown is the special case of W(1,1), which is the exponential distribution. For α > 1, the curves become close to bell-shaped and somewhat resemble the normal distribution. The expression for the CDF is given by:

CDF[W(x; α, β)] = 1 − exp[−(x/β)^α]   for x ≥ 0
               = 0   elsewhere   (2.46c)

Fig. 2.21 Weibull distributions for different values of the two parameters α and β (the shape and scale factors, respectively)

Fig. 2.22 PDF of the Weibull distribution W(2, 7.9) (Example 2.4.14)

Example 2.4.14 Modeling wind distributions using the Weibull distribution
The Weibull distribution is widely used to model the hourly variability of wind velocity. The mean wind speed and its distribution on an annual basis, which are affected by local climate conditions, terrain, and height of the tower, are important in order to determine the annual power output from a wind turbine of a certain design whose efficiency changes with wind speed. It has been found that the shape factor α varies between 1 and 3 (when α = 2, the distribution is called the Rayleigh distribution). The probability distribution shown in Fig. 2.22 has a mean wind speed of 7 m/s. In this case:
(a) The numerical value of the parameter β, assuming the shape factor α = 2, can be calculated from the gamma function Γ(1 + 1/2) = 0.8862, from which β = 7/0.8862 = 7.9.
(b) Using the PDF given by Eq. 2.46a, it is left to the reader to compute the probability of the wind speed being equal to 10 m/s (and verify the solution against Fig. 2.22, which indicates a value of 0.064). ■
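The two steps of Example 2.4.14 can be scripted directly (a sketch; the 0.064 check value is read off Fig. 2.22):

```python
import math

# (a) Scale factor from the mean: mu = beta * Gamma(1 + 1/alpha)  (Eq. 2.46b)
alpha, mean_speed = 2, 7.0
beta = mean_speed / math.gamma(1 + 1 / alpha)     # 7 / 0.8862 ≈ 7.9

def weibull_pdf(x, alpha, beta):
    # Eq. 2.46a: (alpha / beta^alpha) x^(alpha-1) exp[-(x/beta)^alpha]
    return (alpha / beta ** alpha) * x ** (alpha - 1) * math.exp(-(x / beta) ** alpha)

# (b) Density at a wind speed of 10 m/s, cf. the ≈0.064 read off Fig. 2.22
p10 = weibull_pdf(10, alpha, beta)
```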
(g) Chi-square Distribution

A third special case of the gamma distribution arises when α = ν/2 and λ = 1/2, where ν is a positive integer called the degrees of freedom. This distribution, called the chi-square (χ²) distribution, plays an important role in inferential statistics, where it is used as a test of significance for hypothesis testing and analysis-of-variance types of problems. It is based on the standard normal distribution with mean of 0 and standard deviation of 1. Just like the t-statistic, there is a family of distributions for different values of ν (Fig. 2.23). The somewhat complicated PDF is given by Eq. 2.47a, but its usefulness lies in the determination of a range of values from this distribution; more specifically, it provides the probability of observing a value of χ² from 0 to a specified value (Wolberg, 2006). Note that the distribution cannot assume negative values and that it is positively skewed. Table A5 assembles critical values of the chi-square distribution for different values of the degrees-of-freedom parameter ν and for different significance levels. The usefulness of these tables will be discussed in Sect. 4.2. The PDF of the chi-square distribution is:

χ²(x; ν) = [1 / (2^(ν/2) Γ(ν/2))] x^(ν/2 − 1) e^(−x/2)   x > 0
         = 0   elsewhere   (2.47a)

while the mean and variance values are:

μ = ν and σ² = 2ν   (2.47b)

Fig. 2.23 Chi-square distributions for different values of the variable ν denoting the degrees of freedom

(h) F-Distribution

While the t-distribution allows comparison between two sample means, the F-distribution allows comparison between two or more sample variances. It is defined as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. The F-distribution is also represented by a family of plots (Fig. 2.24), where each plot is specific to a set of numbers representing the degrees of freedom of the two random variables (ν1, ν2). Table A6 assembles critical values of the F-distributions for different combinations of these two parameters, and its use will be discussed in Sect. 4.2.

Fig. 2.24 Typical F-distributions for two different combinations of the random variables (ν1 and ν2)

(i) Uniform Distribution

The uniform probability distribution is the simplest of all PDFs and applies to both continuous and discrete data whose outcomes are all equally likely, i.e., have equal probabilities. Flipping a coin for heads/tails or rolling a six-sided die for numbers between 1 and 6 are examples that come readily to mind. The probability density function for the discrete case, where X can assume values x1, x2, . . ., xk, is given by:

U(x; k) = 1/k   (2.48a)

with mean μ = (1/k) Σ_{i=1}^{k} x_i and variance σ² = (1/k) Σ_{i=1}^{k} (x_i − μ)²   (2.48b)
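For the six-sided die mentioned above, Eq. 2.48b gives μ = 3.5 and σ² = 35/12 ≈ 2.917, which a two-line check confirms:

```python
faces = range(1, 7)                            # equally likely outcomes, k = 6
mu = sum(faces) / 6                            # 3.5
var = sum((x - mu) ** 2 for x in faces) / 6    # 35/12 ≈ 2.917
```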
Fig. 2.25 The uniform distribution assumed continuous over the interval [c, d]
For random variables that are continuous over an interval (c, d), as shown in Fig. 2.25, the PDF is given by:

U(x) = 1/(d − c)   when c < x < d
     = 0   otherwise   (2.48c)

The mean and variance of the uniform distribution (using the notation shown in Fig. 2.25) are given by:

μ = (c + d)/2 and σ² = (d − c)²/12   (2.48d)

The probability of random variable X being between say x1 and x2 is:

U(x1 ≤ X ≤ x2) = (x2 − x1)/(d − c)   (2.48e)
Example 2.4.15 A random variable X has a uniform distribution with c = −5 and d = 10 (Fig. 2.25).
(a) On average, what proportion will have a negative value? (Answer: 1/3)
(b) On average, what proportion will fall between −2 and 2? (Answer: 4/15) ■

(j) Beta Distribution
Fig. 2.26 Various shapes assumed by the Beta distribution depending on the values of the two model parameters
Another versatile distribution is the Beta distribution, which is appropriate for continuous random variables bounded between 0 and 1 (such as those representing proportions). It is a two-parameter model that is given by:

Beta(x; p, q) = [(p + q − 1)! / ((p − 1)!(q − 1)!)] x^(p−1) (1 − x)^(q−1)   (2.49a)

Depending on the values of p and q, one can model a wide variety of curves, from u-shaped ones to skewed distributions (Fig. 2.26). The distributions are symmetrical when p and q are equal, with the curves becoming peakier as the numerical values of the two parameters increase. Skewed distributions are obtained when the parameters are unequal. The mean and variance of the Beta distribution are:

μ = p/(p + q) and σ² = pq / [(p + q)²(p + q + 1)]   (2.49b)
This distribution originates from the Binomial distribution, and one can detect the obvious similarity of a twooutcome affair with speciﬁed probabilities. The usefulness of this distribution will become apparent in Sect. 2.5.3, dealing with the Bayesian approach to problems involving continuous probability distributions.
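Equations 2.49a and 2.49b are easy to verify numerically for integer parameters; the sketch below uses Beta(x; 2, 3), for which the mean should be 2/5 and the variance pq/[(p + q)²(p + q + 1)] = 6/150 = 0.04 (the normalization (p + q − 1)!/((p − 1)!(q − 1)!) is the standard one for integer p and q):

```python
import math

def beta_pdf(x, p, q):
    # Beta(x; p, q) with integer p, q
    coeff = math.factorial(p + q - 1) / (math.factorial(p - 1) * math.factorial(q - 1))
    return coeff * x ** (p - 1) * (1 - x) ** (q - 1)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

p, q = 2, 3
total = simpson(lambda x: beta_pdf(x, p, q), 0, 1)                  # ≈ 1 (valid PDF)
mean = simpson(lambda x: x * beta_pdf(x, p, q), 0, 1)               # ≈ p/(p+q) = 0.4
var = simpson(lambda x: (x - mean) ** 2 * beta_pdf(x, p, q), 0, 1)  # ≈ 0.04
```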
2.5 Bayesian Probability

2.5.1 Bayes’ Theorem
It was stated in Sect. 2.1.2 that the Bayesian viewpoint can enhance the usefulness of the classical frequentist notion of probability.³ Its strength lies in the fact that it provides a framework to include prior information in a two-stage (or multistage) experiment. If one substitutes the term p(A) in Eq. 2.10 by that given by Eq. 2.9 (known as Bayes’ Rule), one gets:

p(B/A) = p(A ∩ B) / [p(A ∩ B) + p(A ∩ B̄)]   (2.50)

Also, one can rearrange Eq. 2.10 into: p(A ∩ B) = p(A)·p(B/A) = p(B)·p(A/B). This allows expressing Eq. 2.50 in the following form, referred to as the law of total probability or Bayes’ theorem:

p(B/A) = p(A/B)·p(B) / [p(A/B)·p(B) + p(A/B̄)·p(B̄)]   (2.51)

Bayes’ theorem, superficially, appears to be simply a restatement of the conditional probability equation given by Eq. 2.10. The question is why is this reformulation so insightful or advantageous? First, the probability is now re-expressed in terms of its disjoint parts B and B̄, and second, the probabilities have been “flipped,” i.e., p(B/A) is now expressed in terms of p(A/B). Consider the two events A and B. If event A is observed while event B is not, this expression allows one to infer the “flip” probability, i.e., the probability of occurrence of B from that of the observed event A. In Bayesian terminology, Eq. 2.51 can be written as:

Posterior probability of event B given event A
= (Likelihood of A given B)(Prior probability of B) / Prior probability of A   (2.52)

Thus, the probability p(B) is called the prior probability (or unconditional probability), since it represents the opinion before any data was collected, while p(B/A) is said to be the posterior probability, which is reflective of the opinion revised in light of new data. The term “likelihood” is synonymous with “the conditional probability” of A given B, i.e., p(A/B).

Equation 2.51 applies to the case when only one of two events is possible. It can be extended to the case of more than two events that partition the space S. Consider the case where one has n events, B1. . .Bn, which are disjoint and make up the entire sample space. Figure 2.27 shows a sample space of four events. Then, the law of total probability states that the probability of an event A is the sum of its disjoint parts:

p(A) = Σ_{j=1}^{n} p(A ∩ Bj) = Σ_{j=1}^{n} p(A/Bj)·p(Bj)   (2.53)

Then

p(Bi/A) = p(A/Bi)·p(Bi) / Σ_{j=1}^{n} p(A/Bj)·p(Bj)   (2.54)

where the numerator is the product of the likelihood and the prior, and the left-hand side is the posterior probability. This expression is known as Bayes’ theorem for multiple events. To restate, the marginal or prior probabilities p(Bi) for i = 1, . . ., n are assumed to be known in advance, and the intention is to update or revise our “belief” on the basis of the observed evidence of event A having occurred. This is captured by the probability p(Bi/A) for i = 1, . . ., n, called the posterior probability. This is the weight one can attach to each event Bi after event A is known to have occurred.

³ There are several texts that deal only with Bayesian statistics; for example, Bolstad (2004).
Fig. 2.27 Bayes’ theorem for multiple events depicted on a Venn diagram. In this case, the sample space is assumed to be partitioned into four discrete events B1. . .B4. If an observable event A shown by the circle has already occurred, the conditional probability of B3 is p(B3/A) = p(B3 ∩ A)/p(A). This is the ratio of the hatched area to the total area inside the ellipse
Example 2.5.1 Consider the two-stage experiment of Example 2.2.9 with six marbles of three colors in two boxes. Assume that the experiment has been performed and that a red marble has been obtained. One can use the information known beforehand, i.e., the prior probabilities R, W, and G, to determine from which box the marble came. Note that the probability of the red marble having come from Box A, represented by p(A/R), is now the conditional probability of the “flip” problem. This is called the posterior probability of event A with event R having occurred. Thus, from the law of total probability, the posterior conditional probabilities (Eq. 2.51) are:

– For the red marble to be from Box B:

p(B/R) = p(R/B)·p(B) / [p(R/B)·p(B) + p(R/B̄)·p(B̄)]
       = (1/2)(3/4) / [(1/2)(3/4) + (1/2)(1/2)] = 3/5

– For the red marble to be from Box A:

p(A/R) = (1/2)(1/2) / [(1/2)(1/2) + (1/2)(3/4)] = 2/5
The reverse probability tree for this experiment is shown in Fig. 2.28. The reader is urged to compare this with the forward tree diagram of Example 2.2.9. The probabilities of 1.0 for both W and G outcomes imply that there is no uncertainty at all in predicting where the marble came from. This is obvious since only Box A contains W, and only Box B contains G. However, for the red marble, one cannot be sure of its origin, and this is where a probability measure must be determined. ■

Fig. 2.28 The probabilities of the reverse tree diagram at each stage are indicated. If a red marble (R) is picked, the probabilities that it came from either Box A or Box B are 2/5 and 3/5, respectively (Example 2.5.1)

Example 2.5.2 Forward and reverse probability trees for fault detection of equipment
A large piece of equipment is being continuously monitored by an add-on fault detection system developed by another vendor in order to detect faulty operation. The vendor of the fault detection system states that their product correctly identifies faulty operation when indeed it is faulty (this is referred to as sensitivity) 90% of the time. This implies that there is a probability p = 0.10 of a “false negative” occurring (i.e., a missed opportunity to signal a fault). Also, the vendor quoted that the correct status prediction rate or specificity of the detection system (i.e., the system identifying the equipment as healthy when indeed it is so) is 0.95, implying that the “false positive” or false alarm rate is 0.05. Finally, historic data seem to indicate that the large piece of equipment tends to develop faults only 1% of the time.

Figure 2.29 shows how this problem can be systematically represented by a forward tree diagram. State A is the fault-free state and state B is the faulty state. Further, each of these states can have two outcomes, as shown. While outcomes A1 and B1 represent, respectively, correctly identified fault-free and faulty operations, the other two outcomes are errors arising from an imperfect fault detection system. Outcome A2 is the false positive event (or false alarm or error type II, which will be discussed at length in Sect. 4.2), while outcome B2 is the false negative event (or missed opportunity or error type I). The figure clearly illustrates that the probabilities of A and B occurring, along with the conditional probabilities p(A1/A) = 0.95 and p(B1/B) = 0.90, result in the probabilities of each of the four states as shown in the figure.

The reverse tree situation, shown in Fig. 2.30, corresponds to the following situation: a fault has been signaled. What is the probability that this is a false alarm?
Fig. 2.29 The forward tree diagram showing the four events that may result when monitoring the performance of a piece of equipment (Example 2.5.2)
Using Eq. 2.51:

p(A/A2) = p(A2/A)·p(A) / [p(A2/A)·p(A) + p(B1/B)·p(B)]
        = (0.05)(0.99) / [(0.05)(0.99) + (0.90)(0.01)]
        = 0.0495 / (0.0495 + 0.009) = 0.846

Fig. 2.30 Reverse tree diagram depicting two possibilities. If an alarm sounds, it could be either an erroneous one (outcome A from A2) or a valid one (B from B1). Further, if no alarm sounds, there is still the possibility of missed opportunity (outcome B from B2). The probability that it is a false alarm is 0.846, which is too high to be acceptable in practice. How to decrease this is discussed in the text

Working backward using the forward tree diagram (Fig. 2.29) allows one to visually understand the basis of the quantities appearing in the expression above. The value of 0.846 for the false alarm probability is very high for practical situations and could well result in the operator disabling the fault detection system altogether. One way of reducing this false alarm rate, and thereby enhancing robustness, is to increase the sensitivity of the detection device from its current 90% to something higher by altering the detection threshold. This would result in a higher missed opportunity rate, which one must accept as the price of reduced false alarms. For example, the current missed opportunity rate is:

p(B/B2) = p(B2/B)·p(B) / [p(B2/B)·p(B) + p(A1/A)·p(A)]
        = (0.10)(0.01) / [(0.10)(0.01) + (0.95)(0.99)]
        = 0.001 / (0.001 + 0.9405) ≈ 0.001
This is probably lower than what is needed, and so the above-suggested remedy is one that can be considered. A practical way to reduce the false alarm rate is not to take action when a single alarm is sounded but to do so only when several faults are flagged. Such procedures are adopted by industrial process engineers using control chart techniques (see Sect. 8.7). Note that as the piece of machinery degrades, the percentage of time when faults are likely to develop will increase from the current 1% to something higher. This will have the effect of lowering the false alarm rate (left to the reader to convince himself why). ■

Bayesian statistics provide the formal manner by which prior opinion expressed as probabilities can be revised in the light of new information (from additional data collected) to yield posterior probabilities. When combined with the relative consequences or costs of being right or wrong, this allows one to address decision-making problems, as pointed out in the example above. It has had some success in engineering (as well as in the social sciences), where subjective judgment, often referred to as intuition or experience gained in the field, is relied upon heavily. Bayes’ theorem is a consequence of the probability laws and is accepted by all statisticians. It is the interpretation of probability that is controversial. The two approaches differ in how probability is defined:

• Classical viewpoint: long-run relative frequency of an event.
• Bayesian viewpoint: degree of belief held by a person about some hypothesis, event, or uncertain quantity (Phillips 1973).
Advocates of the classical approach argue that human judgment is fallible while dealing with complex situations, and this was the reason why formal statistical procedures were developed in the ﬁrst place. Introducing the vagueness of human judgment as done in Bayesian statistics would dilute the “purity” of the entire mathematical approach. Advocates of the Bayesian approach, on the other hand, argue that the “personalist” deﬁnition of probability should not be interpreted as the “subjective” view. Granted that the prior probability varies from one individual to the other based on their own experience, but with additional data collection all these views get progressively closer. Thus, with enough data, the initial divergent opinions would become indistinguishable. Hence, they argue, the Bayesian method brings consistency to informal thinking when complemented with collected data, and should, thus, be viewed as a mathematically valid approach.
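The reverse-tree probabilities of Example 2.5.2 can be reproduced with a short script (all rates as quoted in that example):

```python
# Priors and conditional rates from Example 2.5.2
p_fault = 0.01                 # equipment is faulty 1% of the time
p_ok = 1 - p_fault
sensitivity = 0.90             # p(alarm | faulty)
specificity = 0.95             # p(no alarm | healthy)

# P(healthy | alarm): false-alarm probability via Eq. 2.51
p_alarm_ok = 1 - specificity   # 0.05
p_false_alarm = (p_alarm_ok * p_ok) / (p_alarm_ok * p_ok + sensitivity * p_fault)

# P(faulty | no alarm): missed-opportunity probability
p_miss = 1 - sensitivity       # 0.10
p_missed = (p_miss * p_fault) / (p_miss * p_fault + specificity * p_ok)
```

Raising `p_fault` (a degrading machine) or `sensitivity` lowers `p_false_alarm`, which is the trade-off discussed in the example.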
2.5.2 Application to Discrete Probability Variables

The following examples illustrate how the Bayesian approach can be applied to discrete data.

Example 2.5.3⁴ Consider a machine whose prior PDF of the proportion X of defectives is given by Table 2.7. If a random sample of size 2 is selected, and one defective is found, the Bayes’ estimate of the proportion of defectives produced by the machine is determined as follows.

Table 2.7 Prior PDF of proportion of defectives (x) (Example 2.5.3)
x      0.1   0.2
f(x)   0.6   0.4

Let y be the number of defectives in the sample. The probability that the random sample of size 2 yields one defective is given by the binomial distribution, since this is a two-outcome situation:

f(y/x) = B(y; 2, x) = C(2, y) x^y (1 − x)^(2−y),   y = 0, 1, 2

If x = 0.1, then f(1/0.1) = B(1; 2, 0.1) = C(2, 1)(0.1)¹(0.9)¹ = 0.18. Similarly, for x = 0.2, f(1/0.2) = 0.32. Thus, the total probability of finding one defective in a sample of size 2 is:

f(y = 1) = (0.18)(0.6) + (0.32)(0.4) = 0.108 + 0.128 = 0.236

The posterior probability f(x/y = 1) is then:
• for x = 0.1: 0.108/0.236 = 0.458
• for x = 0.2: 0.128/0.236 = 0.542

Finally, the Bayes’ estimate of the proportion of defectives x is:

x̄ = (0.1)(0.458) + (0.2)(0.542) = 0.1542

which is quite different from the value of 0.5 given by the classical method. ■

⁴ From Walpole et al. (2007), by permission of Pearson Education.
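Example 2.5.3 can be expressed compactly in code (binomial likelihoods for one defective in a sample of two, combined with the prior of Table 2.7):

```python
from math import comb

prior = {0.1: 0.6, 0.2: 0.4}           # Table 2.7
n, y = 2, 1                            # one defective in a sample of two

# Binomial likelihood f(y | x) for each candidate proportion x
likelihood = {x: comb(n, y) * x ** y * (1 - x) ** (n - y) for x in prior}

total = sum(likelihood[x] * prior[x] for x in prior)            # f(y = 1) = 0.236
posterior = {x: likelihood[x] * prior[x] / total for x in prior}
estimate = sum(x * posterior[x] for x in prior)                 # ≈ 0.1542
```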
Example 2.5.45 Using the Bayesian approach to enhance value of concrete piles testing Concrete piles driven in the ground are used to provide bearing strength to the foundation of a structure (building, bridge, etc.). Hundreds of such piles are used in large construction projects. These piles should not develop defects such as cracks or voids in the concrete, which would lower compressive strength. Tests are performed by engineers on piles selected at random during the concrete pour process in order to assess overall foundation strength. Let the random discrete variable be the proportion of defective piles out of the entire lot, which is taken to assume ﬁve discrete values as shown in the ﬁrst column of Table 2.8. Consider the case where the prior experience of an engineer as to the proportion of defective piles from similar sites is given in the second column of the table below. Before any testing is done, the expected value of the probability of ﬁnding one pile to be defective is: p = (0.20) (0.30) + (0.4)(0.40) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05) = 0.44 (as shown in the last row under the second column). This is the prior probability. Had he drawn a conclusion on just a single test that turns out to be defective, without using his prior judgment, he would have concluded that all the piles were defective; clearly, an overstatement. Suppose the ﬁrst pile tested is found to be defective. How should the engineer revise his prior probability of the proportion of piles likely to be defective? Bayes’ theorem (Eq. 2.51) can be used. For proportion x = 0.2, the posterior conditional probability is: 5
From Ang and Tang (2007), by permission of John Wiley and Sons.
62
2 Probability Concepts and Probability Distributions
p(x = 0.2) = (0.2)(0.3) / [(0.2)(0.3) + (0.4)(0.4) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05)]
           = 0.06/0.44 = 0.136

This is the value that appears in the first row under the third column. Similarly, the posterior probabilities for the other values of x can be determined; together they yield the expected value E(x) = 0.55. Hence, a single inspection has led the engineer to revise his prior opinion upward from 0.44 to 0.55. The engineer would probably get a second pile tested, and if it also turns out to be defective, the associated probabilities are shown in the fourth column of Table 2.8. For example, for x = 0.2:

p(x = 0.2) = (0.2)(0.136) / [(0.2)(0.136) + (0.4)(0.364) + (0.6)(0.204) + (0.8)(0.182) + (1.0)(0.114)]
           = 0.0272/0.55 = 0.049
The expected value in case the two piles tested turn out to be defective increases to 0.66. In the limit, if each successive pile tested turns out to be defective, one recovers the classical distribution listed in the last column of the table. The progression of the PDF from the prior to the infinite case is illustrated in Fig. 2.31. Note that as more tested piles turn out to be defective, the evidence from the data gradually overwhelms the prior judgment of the engineer. However, it is only when collecting data is so expensive or time-
Table 2.8 Illustration of how a prior PDF is revised with new data (Example 2.5.4)

Proportion of          Prior   After one pile     After two piles    ...  Limiting case of
defectives (x)         PDF     tested is found    tested are found        infinite defectives
                               defective          defective
0.2                    0.30    0.136              0.049                   0.0
0.4                    0.40    0.364              0.262                   0.0
0.6                    0.15    0.204              0.221                   0.0
0.8                    0.10    0.182              0.262                   0.0
1.0                    0.05    0.114              0.205                   1.0
Expected probability   0.44    0.55               0.66                    1.0
of defective piles

Fig. 2.31 Illustration of how the prior discrete PDF is affected by data collection following Bayes' theorem (Example 2.5.4)
consuming that decisions must be made from limited data that the power of the Bayesian approach becomes evident. Of course, if one engineer's prior judgment is worse than that of another engineer, then his conclusion from the same data would be poorer than that of the other engineer. It is this type of subjective disparity that antagonists of the Bayesian approach are uncomfortable with. On the other hand, proponents of the Bayesian approach would argue that experience (even if intangible) gained in the field is a critical asset in engineering applications and that discarding this type of heuristic knowledge entirely is naïve and shortsighted. ■

There are instances when no previous knowledge or information is available about the behavior of the random variable; this is sometimes referred to as a prior of pure ignorance. It can be shown that this assumption for the prior leads to results identical to those of the traditional probability approach (see Example 2.5.5).
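The discrete updating of Example 2.5.4 (Table 2.8) takes only a few lines to reproduce; the following is our own minimal sketch, not code from the book, and the variable names are ours:

```python
# Discrete Bayesian updating for the pile-testing example (Table 2.8).
x = [0.2, 0.4, 0.6, 0.8, 1.0]           # candidate proportions of defectives
prior = [0.30, 0.40, 0.15, 0.10, 0.05]  # engineer's prior PDF

def update(pdf):
    """Posterior PDF after one more tested pile is found defective.
    The likelihood of drawing a defective pile is x itself (Eq. 2.51)."""
    joint = [xi * pi for xi, pi in zip(x, pdf)]
    total = sum(joint)                   # marginal p(defective) under pdf
    return [j / total for j in joint]

def expected(pdf):
    """Expected proportion of defectives under a given PDF."""
    return sum(xi * pi for xi, pi in zip(x, pdf))

post1 = update(prior)   # after one defective pile
post2 = update(post1)   # after a second defective pile

print(round(expected(prior), 2))  # 0.44 (prior)
print(round(post1[0], 3))         # 0.136 (first row, third column)
print(round(expected(post1), 2))  # 0.55
print(round(expected(post2), 2))  # 0.66
```

Running the update repeatedly reproduces the columns of Table 2.8 and shows the data gradually overwhelming the prior.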
2.5.3 Application to Continuous Probability Variables
Bayes' theorem can also be extended to the case of continuous random variables (Ang and Tang 2007). Let X be the random variable with a prior PDF denoted by p(x). Though any appropriate distribution can be chosen, the Beta distribution is particularly convenient⁶ and is widely used to characterize the prior PDF. Another commonly used prior is the uniform distribution, called a diffuse prior. For consistency with convention, a slightly different nomenclature than that of Eq. 2.51 is adopted. Assume that the Beta distribution (Eq. 2.49a) can be rewritten to yield the prior:

p(x) ∝ x^a (1 − x)^b    (2.55)

Recall that the higher the values of the exponents a and b, the more peaked the distribution, indicating that the prior distribution is relatively well defined. Let L(x) represent the conditional probability or likelihood function of observing y "successes" out of n observations. Then, the posterior probability is given by:

f(x|y) ∝ L(x) p(x)    (2.56)
In the context of Fig. 2.27, the likelihood of the unobservable events B1...Bn is the conditional probability that A has occurred given Bi for i = 1, ..., n, i.e., p(A|Bi). The likelihood function can be gleaned from probability considerations
⁶ Because of the mathematical simplicity it provides as well as its ability to capture a wide variety of PDF shapes.
in many cases. Consider Example 2.5.4 involving testing the foundation piles of buildings. The Binomial distribution gives the probability of x failures in n independent Bernoulli trials, provided the trials are independent and the probability of failure in any one trial is p. This applies when one holds p constant and studies the behavior of the PDF of the number of defectives x. If instead one holds x constant and lets the probability vary over its possible values, one obtains the likelihood function. Suppose n piles are tested and y piles are found to be defective or subpar. In this case, the likelihood function is written as follows for the Binomial PDF:

L(x) = C(n, y) x^y (1 − x)^(n−y)    for 0 ≤ x ≤ 1    (2.57)

where C(n, y) denotes the binomial coefficient.
Notice that the Beta prior has the same form as the likelihood function. Consequently, the posterior distribution given by Eq. 2.56 assumes the form:

f(x|y) = k x^(a+y) (1 − x)^(b+n−y)    (2.58)
where k is independent of x and is a normalization constant introduced to satisfy the probability law that the area under the PDF be unity. What is interesting is that the information contained in the prior has the net result of "artificially" augmenting the number of observations taken. While the classical approach would use the likelihood function with exponents y and (n − y) (see Eq. 2.57), these are inflated to (a + y) and (b + n − y) in Eq. 2.58 for the posterior distribution. This is akin to having taken more observations and supports the previous statement that the Bayesian approach is particularly advantageous when the number of observations is low. The examples below illustrate the use of Eq. 2.58.

Example 2.5.5 Let us consider the same situation as that treated in Example 2.5.4 for the concrete pile testing. However, the proportion of defectives X is now a continuous random variable for which no prior distribution can be assigned. This implies that the engineer has no prior information, and in such cases a uniform distribution (or diffuse prior) is assumed:

p(x) = 1.0    for 0 ≤ x ≤ 1

The likelihood function for the case of the single tested pile turning out to be defective is x, i.e., L(x) = x. From Eq. 2.58, the posterior distribution is then:

f(x|y) = k · x · (1.0)

The normalizing constant is:
k = 1 / ∫₀¹ x dx = 2
Hence, the posterior probability distribution is:

f(x|y) = 2x    for 0 ≤ x ≤ 1
The Bayesian estimate of the proportion of defectives, when one pile is tested and it turns out to be defective, is then:

p = E(x|y) = ∫₀¹ x · 2x dx = 0.667

Fig. 2.32 Probability distributions of the prior, likelihood function, and the posterior (Example 2.5.5). (From Ang and Tang 2007, by permission of John Wiley and Sons)
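The result can be checked numerically; the midpoint-grid normalization below is our own sketch, not part of the text:

```python
# Numerical check of Example 2.5.5: diffuse (uniform) prior and one
# tested pile found defective, so posterior ∝ L(x) · p(x) = x · 1.0.
n = 100_000
dx = 1.0 / n
xs = [(i + 0.5) * dx for i in range(n)]   # midpoint grid on [0, 1]

unnorm = [x * 1.0 for x in xs]            # likelihood × prior
k = 1.0 / (sum(unnorm) * dx)              # normalization constant (= 2)
post = [k * u for u in unnorm]            # posterior PDF, = 2x

mean = sum(x * p for x, p in zip(xs, post)) * dx   # E(x|y)
print(round(k, 3))     # 2.0
print(round(mean, 3))  # 0.667
```

The numeric mean matches the analytical Bayesian estimate of 2/3.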
■

Example 2.5.6⁷ Enhancing historical records of wind velocity using the Bayesian approach
Buildings are designed to withstand a maximum wind speed, which depends on the location. The probability x that the wind speed will not exceed 120 km/h more than once in 5 years is to be determined. Past records of wind speeds at a nearby location indicated that the following Beta distribution would be an acceptable prior for the probability distribution (Eq. 2.49a):

p(x) = 20 x^3 (1 − x)    for 0 ≤ x ≤ 1
Further, the likelihood that the annual maximum wind speed will exceed 120 km/h in 1 out of 5 years is given by the Binomial distribution as:

L(x) = C(5, 4) x^4 (1 − x) = 5 x^4 (1 − x)
Hence, the posterior probability is deduced following Eq. 2.58:

f(x|y) = k [5 x^4 (1 − x)] [20 x^3 (1 − x)] = 100k x^7 (1 − x)^2

where the constant k can be found from the normalization criterion:

k = 1 / ∫₀¹ 100 x^7 (1 − x)^2 dx = 3.6
From Ang and Tang (2007), by permission of John Wiley and Sons.
Finally, the posterior PDF is given by:

f(x|y) = 360 x^7 (1 − x)^2    for 0 ≤ x ≤ 1

Plots of the prior, likelihood, and posterior functions are shown in Fig. 2.32. Notice how the posterior distribution has become more peaked, reflecting the fact that the single test datum has provided the analyst with more information than that contained in either the prior or the likelihood function alone. ■
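The constants above are easy to verify numerically; the grid computation below is our own sketch under the same midpoint-rule assumption as before:

```python
# Numerical check of Example 2.5.6: posterior ∝ 5x^4(1-x) · 20x^3(1-x)
# = 100 x^7 (1-x)^2; normalized, this becomes 360 x^7 (1-x)^2.
n = 200_000
dx = 1.0 / n
xs = [(i + 0.5) * dx for i in range(n)]   # midpoint grid on [0, 1]

unnorm = [100 * x**7 * (1 - x)**2 for x in xs]
k = 1.0 / (sum(unnorm) * dx)              # normalization constant
post = [k * u for u in unnorm]

mode = xs[max(range(n), key=post.__getitem__)]   # most probable x
print(round(k, 2))     # 3.6
print(round(mode, 3))  # 0.778 (analytically 7/9)
```

The peak near x ≈ 0.78 is the "more peaked" posterior visible in Fig. 2.32.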
2.6 Three Kinds of Probabilities
The previous sections in this chapter presented basic notions of classical probability and showed how the Bayesian viewpoint is appropriate for certain types of problems. Both these viewpoints are still associated with the concept of probability as the relative frequency of an occurrence. In a broader context, one should distinguish between three kinds of probabilities:

(i) Objective or absolute probability, which is the classical measure interpreted as the "long-run frequency" of the outcome of an event. It is an informed estimate of an event that in its simplest form is a constant; for example, historical records yield the probability of a flood occurring this year or of the infant mortality rate in the United States. It would be unchanged for all individuals since it is empirical, having been deduced from historical records. Table 2.9 assembles probability estimates for the occurrence of natural disasters with 10 and 1000 fatalities per event (indicative of the severity level) during different time spans (1, 10, and 20 years). Note that floods and tornadoes have relatively small return times for small events while
Table 2.9 Estimates of absolute probabilities for different natural disasters in the United States. (Adapted from Barton and Nishenko 2008)

             10 fatalities per event                       1000 fatalities per event
Disaster     1 year  10 years  20 years  Return time      1 year  10 years  20 years  Return time
                                         (years)                                      (years)
Earthquakes  0.11    0.67      0.89      9                0.01    0.14      0.26      67
Hurricanes   0.39    0.99      >0.99     2                0.06    0.46      0.71      16
Floods       0.86    >0.99     >0.99     0.5              0.004   0.04      0.08      250
Tornadoes    0.96    >0.99     >0.99     0.3              0.006   0.06      0.11      167
Table 2.10 Leading causes of death in the United States, 1992. (Adapted from Kolluru et al. 1996)

Cause                                          Annual deaths (× 1000)  Percent (%)
Cardiovascular or heart disease                720                     33
Cancer (malignant neoplasms)                   521                     24
Cerebrovascular diseases (strokes)             144                     7
Pulmonary disease (bronchitis, asthma, etc.)   91                      4
Pneumonia and influenza                        76                      3
Diabetes mellitus                              50                      2
Non-motor vehicle accidents                    48                      2
Motor vehicle accidents                        42                      2
HIV/AIDS                                       34                      1.6
Suicides                                       30                      1.4
Homicides                                      27                      1.2
All other causes                               394                     18
Total annual deaths (rounded)                  2177                    100

earthquakes and hurricanes have relatively short return times for large events. Such probability considerations can be determined at a finer geographical scale, and they play a key role in the development of codes and standards for designing large infrastructure (such as dams) as well as small systems (such as residential buildings). Note that the probabilities do not add up to one since one cannot define the entire population of possible events.

(ii) Relative probability, where the chance of occurrence of one event is stated in terms of another. This is a way of comparing the effects or outcomes of different types of adverse events happening to a system or a population when the absolute probabilities are difficult to quantify. For example, the relative risk for lung cancer is (approximately) 10 if a person has smoked before, compared to a nonsmoker: he is 10 times more likely to get lung cancer than a nonsmoker. Table 2.10 shows leading causes of death in the United States in the year 1992. Here the observed numbers of deaths due to various causes are used to determine a relative risk expressed as a percent (%) in the last column. Thus, heart disease, which accounts for 33% of the total deaths, is about 16 times riskier than motor vehicle deaths. However, as a note of caution, these are values aggregated across the whole population during a specific time interval and need to be interpreted accordingly. State and
government analysts separate such relative risks by location, age group, gender, and race for public policymaking purposes.

(iii) Subjective probability, which differs from one person to another, is an informed or best guess about an event that can change as our knowledge of the event increases. Subjective probabilities are those where the objective view of probability has been modified to treat two types of events: (i) when the occurrence is unique and is unlikely to repeat itself, or (ii) when an event has occurred but one is unsure of the final outcome. In such cases, one still has to assign some measure of likelihood of the event occurring and use this in the analysis. Thus, a subjective interpretation is adopted, with the probability representing a degree of belief in the outcome selected as having actually occurred (which could be based on a scientific analysis subject to different assumptions or even on "gut feeling"). There are no "correct answers," simply a measure reflective of one's subjective judgment. A good example of such subjective probability is one involving forecasting whether the impacts on gross world product of a 3°C global climate change by 2090 would be large or not. A survey was conducted involving 20 leading researchers working on global warming issues but with different technical backgrounds, such as scientists,
Fig. 2.33 Example illustrating large differences in subjective probability. A group of prominent economists, ecologists, and natural scientists were polled so as to obtain their estimates of the loss of gross world product due to a doubling of atmospheric carbon dioxide (which is likely to occur by the end of the twenty-first century when mean global temperatures increase by 3°C). The two ecologists predicted the highest adverse impact while the lowest four individuals were economists. (From Nordhaus 1994)
engineers, economists, ecologists, and politicians, who were asked to assign a probability estimate (along with 10% and 90% confidence intervals). Though this was not a scientific study as such, since the whole area of expert opinion elicitation is still not fully mature, there was nevertheless a protocol in how the questioning was performed, which led to the results shown in Fig. 2.33. The medians and the 10% and 90% confidence intervals predicted by different respondents show great scatter, with the ecologists estimating impacts to be 20–30 times higher (the two rightmost bars in the figure), while the economists on average predicted large consequences to amount to only a 0.4% loss in gross world product. An engineer or a scientist may be uncomfortable with such subjective probabilities, but there are certain types of problems where this is the best one can do with current knowledge. Thus, formal analysis methods must accommodate such information, and it is here that Bayesian techniques can play a key role.
Problems

Pr. 2.1 Three sets are defined from the integers 1 to 12: A = {1, 3, 5, 6, 8, 10}, B = {4, 5, 7, 8, 11}, and C = {2, 9, 12}.

(a) Represent these sets in a Venn diagram.
(b) What are A ∪ B, A ∩ B, A ∪ C, A ∩ C?
(c) What are Ā ∪ B, A ∪ B̄, Ā ∪ C, A ∪ C̄?
(d) What are A·B, A − B, A + B?
Pr. 2.2 A county office determined that of the 1000 homes in their area, 400 were older than 20 years (event A), 500 were constructed of wood (event B), and 400 had central air-conditioning (AC) (event C). Further, it is found that events A and B occur in 300 homes, that events A or C occur in 625 homes, that all three events occur in 150 homes, and that no event occurs in 225 homes. If a single house is picked, determine the following probabilities (also draw the Venn diagrams):

(a) That it is older than 20 years and has central AC.
(b) That it is older than 20 years and does not have central AC.
(c) That it is older than 20 years and is not made of wood.
(d) That it has central AC and is made of wood.

Pr. 2.3 A university researcher has submitted three research proposals to three different agencies. Let E1, E2, and E3 be the outcomes that the first, second, and third bids are successful, with probabilities p(E1) = 0.15, p(E2) = 0.20, and p(E3) = 0.10. Assuming independence, find the following probabilities using group theory:

(a) That all three bids are successful.
(b) That at least two bids are successful.
(c) That at least one bid is successful.

Verify the above results using the probability tree approach.
Fig. 2.34 Components in parallel and in series (Problem 2.4)
Pr. 2.4 Example 2.2.3 illustrated how to compute the reliability of a system made up of two components A and B. As an extension, it would be insightful to determine whether the reliability of the system is better enhanced by (i) duplicating the whole system in parallel, or (ii) duplicating the individual components in parallel. Consider a system made up of two components A and B. Figure 2.34a represents case (i) while Fig. 2.34b represents case (ii).

(a) If p(A) = 0.1 and p(B) = 0.8 are the failure probabilities of the two components, what are the probabilities of the system functioning properly for both configurations?
(b) Prove that the functional probability of system (i) is greater than that of system (ii) without assuming any specific numerical values. Derive the algebraic expressions for proper system functioning for both configurations from which this proof can be deduced.

Pr. 2.5 Consider the two system schematics shown in Fig. 2.35. At least one pump must operate when one chiller is operational, and both pumps must operate when both chillers are on. Assume that both chillers have identical reliabilities of 0.90 and that both pumps have identical reliabilities of 0.95.

(a) Without any computation, make an educated guess as to which system would be more reliable overall when (i) one chiller operates, and (ii) both chillers operate.
(b) Compute the overall system reliability for each configuration separately under cases (i) and (ii) defined above.

Pr. 2.6⁸ An automatic sprinkler system for a high-rise apartment has two different types of activation devices for each sprinkler
⁸ From McClave and Benson (1988) with permission of Pearson Education.
Fig. 2.35 Two possible system conﬁgurations (for Pr. 2.5)
head. Reliability of such devices is a measure of the probability of success, i.e., that the device will activate when called upon to do so. Type A and Type B devices have reliability values of 0.90 and 0.85, respectively. In case a fire does start, calculate:

(a) The probability that the sprinkler head will be activated (i.e., at least one of the devices works).
(b) The probability that the sprinkler will not be activated at all.
(c) The probability that both activation devices will work properly.
(d) Verify the above results using the probability tree approach.

Pr. 2.7 Consider the following probability distribution of a random variable X:

f(x) = (1 + b) x^b    for 0 ≤ x ≤ 1
     = 0              elsewhere

Use the method of moments:

(a) To find the estimate of the parameter b.
(b) To find the expected value of X.
Pr. 2.8 Consider the following cumulative distribution function (CDF):

F(x) = 1 − exp(−2x)    for x > 0
     = 0               for x ≤ 0

(a) Construct and plot the cumulative distribution function.
(b) What is the probability of X < 2?
(c) What is the probability of 3 < X < 5?

Pr. 2.9 The joint density for the random variables (X, Y) is given by:

f(x, y) = 10xy^2    for 0 < x < y < 1
        = 0         elsewhere

values of abs(r) close to 0 → weak²

² A more statistically sound procedure is described in Sect. 4.2.7, which allows one to ascertain whether observed correlation coefficients are significant or not depending on the number of data points.
It is very important to note that inferring non-association of two variables x and y from their correlation coefficient is misleading since r only indicates the strength of a linear relationship. Hence, a poor correlation does not mean that no relationship exists between them (e.g., a second-order relation may exist between x and y; see Fig. 3.13f). Detection of nonlinear correlations is addressed in Sect. 9.5.1. Note also that correlation analysis does not indicate whether the relationship is causal, i.e., whether the variation in the y-variable is directly caused by that in the x-variable. Finally, keep in mind that correlation analysis does not provide an equation for predicting the value of a variable; this is done under model building (see Chaps. 5 and 9).

Example 3.4.2 The following observations are taken of the extension of a spring under different loads (Table 3.4). Using Eq. 3.8, the standard deviations of load and extension are 3.7417 and 18.2978, respectively, while the correlation coefficient r = 0.9979 (following Eq. 3.12). This indicates a very strong positive linear correlation between the two variables, as one would expect. ■
3.5 Exploratory Data Analysis (EDA)

3.5.1 What Is EDA?
EDA is an analysis process (some use the phrase "attitude toward data analysis") that was developed and championed by John Tukey (1970) and considerably expanded and popularized by subsequent statisticians (e.g., Hoaglin et al. 1983). Rather than directly proceeding to the then-traditional confirmatory analysis (such as hypothesis testing) or stochastic model building (such as regression), Tukey suggested that the analyst should start by "looking at the data and seeing what it seems to say." Such
3 Data Collection and Preliminary Analysis
Fig. 3.13 Illustration of various plots with different correlation strengths: (a) moderate linear positive correlation, (b) perfect linear positive correlation, (c) moderate linear negative correlation, (d) perfect linear negative correlation, (e) no correlation at all, (f) correlation exists but it is not linear. (From Wonnacott and Wonnacott 1985, by permission of John Wiley and Sons)
Table 3.4 Extension of a spring with applied load (Example 3.4.2)

Load (Newtons)    2     4     6     8     10    12
Extension (mm)    10.4  19.6  29.9  42.2  49.2  58.5

an investigation of data visualization and exploration, it was argued, would be more effective since it would reveal otherwise hidden or unexpected behavior in the dataset, allow evaluation of the assumptions about data behavior presumed by traditional statistical analyses, and provide better guidance in the selection and use of new and appropriate statistical techniques. The timely advances in computer technology and software sophistication allowed new graphical techniques to be developed, along with ways to transform/normalize variables to correct for unwanted shape/spread of their distributions. In short, EDA can be summarized as a process of the data guiding the analysis! At first glance, EDA techniques may seem ad hoc and not to follow any unifying structure; however, the interested reader can refer
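The figures quoted in Example 3.4.2 for the Table 3.4 data are easy to reproduce; the sketch below uses sample (n − 1) statistics:

```python
# Sample standard deviations and Pearson correlation coefficient r
# for the spring-extension data of Table 3.4 (Example 3.4.2).
from statistics import mean, stdev

load = [2, 4, 6, 8, 10, 12]                  # Newtons
ext = [10.4, 19.6, 29.9, 42.2, 49.2, 58.5]   # mm

n = len(load)
sx, sy = stdev(load), stdev(ext)             # sample standard deviations
cov = sum((x - mean(load)) * (y - mean(ext))
          for x, y in zip(load, ext)) / (n - 1)
r = cov / (sx * sy)

print(round(sx, 4), round(sy, 4))  # 3.7417 18.2978
print(round(r, 4))                 # 0.9979
```

The value of r very close to +1 confirms the strong positive linear correlation noted in the example.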
to Hoaglin et al. (1983) for a rationale for developing the EDA techniques and for an explanation and illustration of the connections between EDA and classical statistical theory. Heiberger and Holland (2015) provide in-depth coverage and illustrate how different types of graphical data displays can be used to aid exploratory data analysis and enhance various types of statistical computation and analysis, such as inference, hypothesis testing, regression, time series, and experimental design (topics that are also covered in different chapters of this book). The book also provides software code in the open-source R statistical environment for generating such visual displays (https://github.com/henzeresearchgroup/adam/).
3.5.2 Purpose of Data Visualization
Data visualization is done with graphs and serves two purposes. First, during exploration of the data, it provides a better means of assimilating broad qualitative trends than tabular data can. Second, it provides an excellent means of communicating to the reader what the author wishes to state or illustrate (recall the adage "a picture is worth a thousand words"). Hence, data visualization can serve as a medium to communicate information, not just to explore data trends (an excellent reference is Tufte 2001). However, it is important to be clear as to the intended message or purpose of the graph, and also to tailor it to the intended audience's background and understanding. A pretty graph may be visually appealing but may obfuscate rather than clarify or highlight the aspects being communicated. For example, unless one is experienced, it is difficult to read numerical values off 3D graphs. Thus, graphs should present data clearly and accurately without hiding or distorting the underlying intent. Table 3.5 provides a succinct summary of graph formats appropriate for different applications. Graphical methods are usually more insightful than numerical screening in identifying data errors in smaller datasets with few variables. For large datasets with several variables, they become onerous if basic software, such as the ubiquitous spreadsheet, is used. Additionally, the strength of a graphical analysis during EDA is to visually point out to the analyst relationships (linear or nonlinear) between two or more variables in instances when a sound physical understanding is lacking, thereby aiding in the selection of an appropriate regression model. Present-day graphical visualization tools allow much more than this simple objective,
some of which will become apparent below. There is a very large number of graphical ways of presenting data, and it is impossible to cover them all. Only a small set of representative and commonly used plots will be discussed below, while other types of plots will be presented in later chapters as relevant. Nowadays, several high-end graphical software programs allow complex, and sometimes esoteric, plots to be generated. Graphical representations of data are the backbone of exploratory data analysis. They are usually limited to one-, two-, and three-dimensional (1D, 2D, and 3D) data. In the last few decades, there has been a dramatic increase in the types of graphical displays, largely due to the seminal contributions of Tukey (1970), Cleveland (1985), and Tufte (1990, 2001). A particular graph is selected based on its ability to emphasize certain characteristics or behavior of one-dimensional data, or to indicate relations between two- and three-dimensional data. A simple manner of separating these characteristics is to view them as being:

(i) Cross-sectional (i.e., data collected at one point in time or when time is not a factor)
(ii) Time-series data
(iii) Hybrid or combined
(iv) Relational (i.e., emphasizing the joint variation of two or more variables)

An emphasis on visualizing data to be analyzed has resulted in statistical software programs becoming increasingly convenient to use and powerful. Any data analysis effort involving univariate and bivariate data should start by looking at basic plots (higher-dimensional data require the more elaborate plots discussed later).
Table 3.5 Type and function of appropriate graph formats

Type of message   Function                                             Typical format
Component         Shows relative size of various parts of a whole      Pie chart (for one or two important components), Bar chart, Dot chart, Line chart
Relative amounts  Ranks items according to size, impact, degree, etc.  Bar chart, Line chart, Dot chart
Time series       Shows variation over time                            Bar chart (for few intervals), Line chart
Frequency         Shows frequency of distribution among certain        Histogram, Line chart, Box-and-Whisker
                  intervals
Correlation       Shows how changes in one set of data are related     Paired bar, Line chart, Scatter diagram
                  to another set of data
Downloaded from the Energy Information Agency (EIA) website in 2009, which was since removed (http://www.eia.doe.gov/neic/graphs/introduc. htm)
3.5.3 Static Univariate Graphical Plots
Fig. 3.14 Box-and-whisker plot and its association with a standard normal distribution. The box represents the central 50th-percentile range while the whiskers extend 1.5 times the interquartile range (IQR) on either side. (From Wikipedia website)

Commonly used graphics for cross-sectional representation are mean and standard deviation plots, stem-and-leaf diagrams, dot plots, histograms, box-whisker-mean plots, distribution plots, bar charts, pie charts, area charts, and quantile plots. Mean and standard deviation plots summarize the data distribution using the two most basic measures; however, this summary is of limited use (and even misleading) when the distribution is skewed.

(a) Histograms
For univariate discrete or continuous data, plotting histograms is very straightforward while providing a compact visual representation of the spread and shape (such as unimodal or bimodal) of the relative frequency distribution. There are no hard and fast rules for selecting the number of bins (N_bins) or classes for continuous data, probably because there is no proper theoretical basis. The shape of the underlying distribution is better captured with a larger number of bins, but then each bin will contain fewer observations and may exhibit jagged behavior (see Fig. 1.3 for the variation of outdoor air temperature in Philadelphia). Generally, the larger the number of observations n, the more
classes can be used, though as a guide it should be between 5 and 20. Devore and Farnum (2005) suggest:

Number of bins or classes = N_bins = n^(1/2)    (3.14a)

which would suggest that if n = 100, N_bins = 10. Doebelin (1995) proposes another equation:

N_bins = 1.87 (n − 1)^0.4    (3.14b)
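Both rules of thumb are easy to compute; a small sketch (function names ours):

```python
# Two rules of thumb for the number of histogram bins (Eqs. 3.14a, b).
def nbins_sqrt(n):
    """Devore and Farnum (2005): N_bins = n^(1/2), rounded."""
    return round(n ** 0.5)

def nbins_doebelin(n):
    """Doebelin (1995): N_bins = 1.87 (n - 1)^0.4, rounded."""
    return round(1.87 * (n - 1) ** 0.4)

print(nbins_sqrt(100))      # 10
print(nbins_doebelin(100))  # 12
```

For n = 100 the two rules give 10 and 12 bins, matching the worked values in the text.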
which would suggest that if n = 100, N_bins = 12.

(b) Box-and-whisker plots
Box-and-whisker plots better summarize the distribution, but this is done using percentiles. Figure 3.14 depicts the shape for univariate continuous data, identifies various important quantities, and illustrates how these can be associated with the Gaussian or standard normal distribution. The lower and upper box values Q1 and Q3 (called hinges) correspond to the 25th and 75th percentiles (recall the interquartile range [IQR] defined in Sect. 3.4.1), while the whiskers commonly extend to 1.5 times the IQR on either side. From Fig. 3.14, it is obvious that the IQR contains 50% of the data observations, while the range bounded by the
Table 3.6 Values of time taken (in minutes) for 20 students to complete an exam (see Example 3.5.1)

37.0  37.5  38.1  40.0  40.2  40.8  41.0  42.0  43.1  43.9
44.1  44.6  45.0  46.1  47.0  62.0  64.3  68.8  70.1  74.5
low and high whiskers for a normally distributed dataset would contain 99.3% of the total data (close to the 99.7% value corresponding to ±3 standard deviations). Such a representation reveals the skewness in the data and also indicates outlier points. Any observation lying between (1.5 × IQR) and (3.0 × IQR) from the closest quartile is considered a mild outlier and shown as a closed or filled point, while one falling outside (3.0 × IQR) from the closest quartile is taken to be an extreme outlier and shown as an open circle. However, if the most extreme observation falls within the (1.5 × IQR) spread, then the whisker line should be terminated at that point (Tukey 1970).

(c) QQ plots
Though a box-and-whisker plot or a plot of the distribution itself can suggest the shape of the underlying distribution, a better visual manner of ascertaining whether a presumed or specified theoretical probability distribution is consistent with the dataset being analyzed is by means of quantile plots. Recall that quantiles are points dividing the range of a probability distribution (or any sample of univariate data) into continuous intervals with equal probabilities; hence quartiles are a good way of segmenting a distribution. Further, percentiles are a subset of quantiles that divide the data into 100 equally sized groups. In essence, a quantile plot is one where the quantiles of the dataset are plotted as a cumulative distribution. A quantile–quantile or QQ plot is one where the quantiles of the data are plotted against the quantiles of a specified standardized theoretical distribution.³ How the points align with or deviate from the 45° reference line allows one to visually determine whether the two distributions are in agreement and, if not, may indicate likely causes for the difference. The QQ plot can also be generated with quantiles from one sample plotted against those of another.
A special case of the more general QQ plot is the normal probability plot, where the comparison is made against the normal distribution plotted on the y-axis. There is some variation between different statistical software packages in the terminology adopted and in how these plots are displayed (an issue to be kept in mind). QQ plots are better interpreted with larger datasets, but the following example is meant for illustrative purposes.
³ Another similar type of plot for comparing two distributions is the P-P plot, which is based on the cumulative probability distributions. However, the Q-Q plot, based on the cumulative quantile distribution, is more widely used.
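As a rough sketch of the Q-Q idea, SciPy's `probplot` computes the quantile pairs (and the straight-line fit) that such a plot displays; the sample data below are synthetic, not the exam times of Table 3.6:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=40)  # synthetic completion times, min

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line fitted through the Q-Q points
(theor_q, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# For normally distributed data the Q-Q points hug a straight line,
# so the fit's correlation coefficient r is close to 1; strong curvature
# or breaks (as in Fig. 3.15) would pull r well below 1
```

Plotting `ordered` against `theor_q`, together with the fitted line, reproduces the usual normal probability plot.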
Fig. 3.15 Q-Q plot of the data in Table 3.6 against a standardized Gaussian distribution
Example 3.5.1 An instructor wishes to ascertain whether the time taken by her students to complete the final exam follows a normal or Gaussian distribution. The values in minutes shown in Table 3.6 have been recorded. The normal quantile plot for this dataset against a Gaussian (or standard normal) distribution is shown in Fig. 3.15. The pattern is obviously nonlinear, and so a Gaussian distribution is inappropriate for this data. The apparent break in the data on the right side of the graph indicates the presence of outliers (in this case, caused by five students taking much longer to complete the exam). ■

Example 3.5.2 Consider the same dataset as in Example 3.4.1. The following plots have been generated (shown in Fig. 3.16):
(a) Box-and-whisker plot (note that the two whiskers are not equal, indicating a slight skew).
(b) Histogram of the data (assuming 9 bins) shown as relative frequency; the y-axis values of the individual bars should add to 100.
(c) Normal probability plot, which allows evaluating whether the distribution is close to a normal distribution. The tails deviate from the straight line, indicating departure from normality. Note that the normal quantile is now plotted on the y-axis and rescaled compared to the Q-Q plot of Fig. 3.15. Different software programs generate such plots differently. One also comes across Q-Q plots where both axes represent quantiles, one from the normal distribution and one from the sample distribution.
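As a sketch of the relative-frequency histogram idea, the bar heights in percent can be computed with NumPy; the data values below are illustrative, not the Table 3.2 data:

```python
import numpy as np

values = np.array([52, 55, 57, 58, 60, 61, 63, 64, 66, 68,
                   70, 71, 73, 75, 78, 82, 90, 95, 101, 110.0])  # illustrative

counts, bin_edges = np.histogram(values, bins=9)   # 9 bins, as in Fig. 3.16b
rel_freq = 100.0 * counts / counts.sum()           # relative frequency in %

# The relative frequencies of the individual bars sum to 100
```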
3
Data Collection and Preliminary Analysis
Fig. 3.16 Various exploratory plots for the dataset in Table 3.2
Fig. 3.17 Common components of the box plot and the violin plot for an arbitrary continuous variable (shown on the y-axis)
(d) Run chart (or time-series plot), meant to retain the time-series nature of the data, which the other graphics do not. As generated, the run chart is meaningless because the data were entered into the spreadsheet in the wrong sequence, column-wise instead of row-wise. Had the data been entered correctly, the run chart would have been a monotonically increasing curve and would have been more meaningful. ■
(d) Violin Plots Another visually appealing plot that conveys essentially the same information as the box plot (but more completely) is the violin plot (Fig. 3.17). It reveals the probability density of the data at different values (each half of the violin shows the same distribution; this convention has been adopted for symmetry and visual aesthetics). The plot is usually smoothed by a kernel density estimator. The dot in the middle of the box inside the violin is the median, and the first and third quartiles are drawn (the box represents the interquartile range).
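The box-plot quantities described in this subsection (quartiles, IQR, and the 1.5 × IQR and 3.0 × IQR outlier fences) can be computed directly; a sketch with an illustrative toy sample:

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 24, 45.0])  # toy sample

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Mild outliers lie between 1.5*IQR and 3.0*IQR beyond the nearest quartile;
# extreme outliers lie beyond 3.0*IQR from the nearest quartile
mild_lo, mild_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
ext_lo, ext_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr

extreme = data[(data < ext_lo) | (data > ext_hi)]
mild = data[((data < mild_lo) | (data > mild_hi))
            & (data >= ext_lo) & (data <= ext_hi)]
```

For this sample the value 45 falls beyond the upper extreme fence and would be drawn as an open circle in the box plot.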
3.5 Exploratory Data Analysis (EDA)
3.5.4 Static Bi- and Multivariate Graphical Plots
There are numerous graphical representations that fall in this category, and only an overview of the more common plots is provided here. The box plot representation, discussed earlier, also allows a convenient visual comparison of the similarities and differences in spread and shape between two or more datasets when multiple box plots are plotted side by side.

(a) Pie Charts Multivariate stationary data, such as worldwide percentages of total primary energy sources, can be represented by the widely used pie chart (Fig. 3.18a), which allows the relative aggregate amounts of the variables to be clearly visualized. The same information can also be plotted as a bar chart (Fig. 3.18b), which is not quite as revealing.

Fig. 3.18 Two different ways of plotting stationary data. Data corresponds to worldwide percentages of total primary energy supply in 2003. (a) Pie chart. (b) Bar chart. (From IEA, World Energy Outlook, IEA, Paris, France, 2004)

(b) Elaborate Bar Charts More elaborate bar charts (such as those shown in Fig. 3.19) allow numerical values of more than one variable to be plotted such that their absolute and relative amounts are clearly highlighted. The plots depict differences between the electricity sales during each of the four quarters of the year over 6 years. Such plots can be drawn as compounded plots to allow better visual intercomparisons (Fig. 3.19a). Column charts or stacked charts (Fig. 3.19b, c) show the same information
as that in Fig. 3.19a but are stacked one above another instead of showing the numerical values side by side. One plot shows the stacked values normalized so that they sum to 100%, while another stacks them so as to retain their numerical values. Finally, the same information can be plotted as an area chart (Fig. 3.19d), wherein both the time-series trend and the relative magnitudes are clearly highlighted.

Fig. 3.19 Different types of bar and area plots to illustrate year-by-year variation (over 6 years) in quarterly electricity sales (in gigawatt-hours) for a certain city

(c) Scatter and Time-Series Plots Time-series plots, relational plots, and scatter plots (such as x-y plots) between two variables are the most widely used types of graphical displays. Scatter plots allow visual determination of the trend line between two variables and of the extent to which the data scatter around it (Fig. 3.20). An important issue is that the manner of selecting the range of the variables can mislead the eye. The same data is plotted in Fig. 3.21 on two different scales, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a). This is
referred to as the lie factor, defined as the ratio of the apparent size of an effect in the graph to the actual size of the effect in the data (Tufte 2001). The data at hand and the intent of the analysis should dictate the scale of the two axes, but it is difficult in practice to determine this heuristically.⁴ It is in such instances that statistical measures can be used to provide an indication of the magnitude of the graphical scales.

(d) Bubble Plots Bubble plots allow observations with three attributes to be plotted. The 2D version of such plots is the well-known x-y scatter plot. An additional variable is represented by enlarging the dot, i.e., the bubble size. Figure 3.22 is illustrative of such a representation for the commute patterns in major US cities in 2008.
⁴ Generally, it is wise, at least at the outset, to adopt scales starting from zero, view the resulting graphs, and then adjust the scales as appropriate.
Fig. 3.20 Scatter plot (or x-y plot) of worldwide population growth over time showing past values and projected values, with 2010 as the current year. In this case, a second-order (quadratic) regression model has been selected to plot the trend line. The actual population in 2021 was 7.84 billion (quite close to the projected value)
Fig. 3.21 Figure to illustrate how the effect of resolution can mislead visually. The same data is plotted in the two plots, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a)
Fig. 3.22 Bubble plot showing the commute patterns in major US cities in 2008. The size of the bubble represents the number of commuters. (From Wikipedia website, downloaded 2010)
Fig. 3.24 Scatter plot combined with box-whisker-mean (BWM) plot of the same data as shown in Fig. 3.10. (From Haberl and Abbas 1998, by permission of Haberl)
Fig. 3.23 Several types of combination charts are possible. The plots shown allow visual comparison of the standardized (mean subtracted and divided by the standard deviation) daily whole-house electricity use in 29 similar residences against the overlaid standard normal distribution. (From Reddy 1990)
(e) Combination Charts Combination charts can take numerous forms, but, in essence, are those where two basic but different ways of representing data are combined together in one graph. One example is Fig. 3.23 where the histogram depicts actual data spread, the distribution of which can be visually evaluated against the standard normal line overlaid on the histogram. The dataset corresponds to the daily energy use of 29 residences with similar diurnal energy use (classiﬁed as Stratum 5) during the summer and winter days. Occupant vagaries can be likened to randomness/noise in the electricity use data. Possible causes for the closeness or deviation from normality for each season can provide physical insights to the analyst and also allow different electricity curtailment measures to be modeled in a probabilistic framework. For purposes of data checking, x–y plots are perhaps most appropriate as discussed in Sect. 3.3.4. The x–y scatter plot
(Fig. 3.10) of hourly cooling energy use of a large institutional building versus outdoor temperature allowed outliers to be detected. The same data could be summarized by combined box-and-whisker plots (first suggested by Tukey 1988) as shown in Fig. 3.24. Here the x-axis range is subdivided into discrete bins (in this case, 5 °F bins), showing the median values (joined by a continuous line) along with the 25th percentiles on either side of the mean (shown boxed), the 10th and 90th percentiles indicated by the vertical whiskers from the box, and the values below the 10th percentile and above the 90th percentile shown as individual points.⁵ Such a representation is clearly a useful tool for data quality checking, for detecting underlying patterns in data over different subranges of the independent variable, and for ascertaining the shape of the data spread around this pattern. Note that the cooling energy use line seems to plateau at outdoor temperatures above 80 °F. What does this indicate about the installed capacity of the chiller? Probably that the chiller was undersized at the outset, that it has degraded over time, or that the building loads have increased over time.

(f) Component-Effect Plots In case the functional relationship between the independent and dependent variables changes due to known causes, it is advisable to plot these in different frames. For example, hourly energy use in a commercial building is known to change with time of day, and, moreover, the functional relationship is quite different depending on the season (time of year). Component-effect plots are multiple plots between the variables for cold, mild, and hot
⁵ Note that the whisker end points are different from those described earlier in Sect. 3.5.3. Different textbooks and studies adopt slightly different selection criteria.
periods of the year, combined with different box-and-whisker plots for different hours of the day. They provide more clarity in underlying trends and scatter, as illustrated in Fig. 3.25, where the time of year is broken up into three temperature bins.

Fig. 3.25 Example of a combined box-whisker-component plot depicting how hourly energy use varies with hour of day during a year for different outdoor temperature bins for a large commercial building. (From ASHRAE 2002, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.26 Contour plot characterizing the sensitivity of total power consumption (condenser water pump power plus tower fan power) to condenser water-loop controls for a single chiller load, ambient wet-bulb temperature, and chilled water supply temperature. (From Braun et al. 1989, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

(g) Contour Plots A contour plot depicts the relationship between the system response (or dependent variable) and two independent variables plotted on the two axes. This relationship is
captured by a series of lines drawn for different preselected values of the response variable. This is illustrated by Fig. 3.26, where the total power of the condenser loop of a cooling system is the sum of the pump power and the cooling tower fan power, both of which are functions of their operating speeds (since they can be modulated). The two axes of the figure are the normalized fan and pump speeds relative to their rated values. The minimum total power is shown by a cross at the center of the innermost contour circle; one notes that this minimum is quite broad, and different combinations of the two control variables are admissible. Such plots clearly indicate the degree of latitude allowable in the operating control settings of the pump and the fan, and reveal the nonlinear sensitivity of total power as these control points stray from the optimal setting. Insights into how different combinations of the two independent variables impact total system power are useful to system operators.

(h) Scatter Plot Matrix Figure 3.27, called a scatter plot matrix, is another useful representation for visualizing multivariate data. Here the various permutations of the variables are shown as individual scatter plots. The idea, though not novel, has merit because of the way the graphs are organized and presented. The graphs are arranged in rows and columns such that each row or column has all the graphs relating a certain variable to all the others; thus, the variables have shared axes. Though there are twice as many graphs as minimally needed (since each graph has a counterpart with the axes interchanged), the redundancy is sometimes useful to the analyst in better detecting underlying trends.
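A scatter plot matrix can be generated directly with pandas; a minimal sketch, using synthetic stand-ins for the three climatic variables of Fig. 3.27 (not the actual Phoenix data):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
# Illustrative stand-ins for dry-bulb temperature (°F), humidity ratio,
# and solar radiation (Btu/hr-ft^2)
df = pd.DataFrame({
    "TDB": rng.normal(55, 10, 200),
    "HR": rng.normal(0.004, 0.001, 200),
    "Solar": rng.uniform(0, 250, 200),
})

# One row and one column of panels per variable, with shared axes;
# the diagonal shows each variable's own histogram
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```

With three variables this produces a 3 × 3 grid of panels, mirroring the layout described above.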
Fig. 3.27 Scatter plot matrix (or carpet plots) for multivariable graphical data visualization. The data corresponds to hourly climatic data for Phoenix, AZ, for January 1990. The bottom left-hand corner frame indicates how solar radiation in Btu/hr-ft² (x-axis) varies with dry-bulb temperature (in °F) and is a flipped and rotated image of that at the top right-hand corner. The HR variable represents humidity ratio (in lbm/lba). Points that fall distinctively outside the general scatter can be flagged as outliers
Fig. 3.28 Three-dimensional (3D) surface chart of mean hourly whole-house electricity use during different hourly segments of the day across several residences. (From Reddy 1990)

(i) 3D Plots Three-dimensional (3D) plots have been used increasingly over the past several decades. They allow plotting the variation of a variable influenced by two independent factors. They also allow trends to be gauged and are visually appealing, although the numerical values of the variables are difficult to read. The data plotted in Fig. 3.28 corresponds to the mean hourly energy use of 29 residences with similar diurnal energy use (classified as Stratum 5). The day has been broken up into eight segments of 3 h each to reduce clutter in the graph. Such a graph can reveal probabilistic trends in how occupants consume electricity so that different demand curtailment measures can be evaluated by electric utilities in a probabilistic modeling framework. Another benefit of such 3D plots is their ability to aid in the identification of oversights. For example, energy use data collected from a large commercial building could be improperly timestamped, such as by overlooking the daylight-saving shift or misaligning 24-h holiday profiles (Fig. 3.29). One drawback associated with these graphs is the difficulty of viewing exact details, such as the specific hour or day on which a misalignment occurs. Some analysts complain that 3D surface plots obscure data that lies behind "hills" or in "valleys." Clever use of color or dotted lines has been suggested to make such graphs easier to interpret.
(j) Domain-Specific Charts Different disciplines have developed different types of plots and graphical representations. Tornado diagrams are commonly used to illustrate variable sensitivity during risk analysis, and spider plots are common in multicriteria decision-making studies (Chap. 12). One example from HVAC studies is the well-known psychrometric chart (Reddy et al. 2016), which allows one to determine (for a given location characterized by its elevation above sea level) the various properties of air-water mixtures, such as dry-bulb temperature, absolute humidity, relative humidity, specific volume, enthalpy, and wet-bulb temperature. Solar scientists and architects have developed the sun-path diagram, which allows one to determine the position of the sun in the sky (defined by the solar altitude and azimuth angles) at different times of the day and year for a given latitude (40° N in Fig. 3.30). Such a representation has also been used to determine periods of the year when shading occurs from neighboring obstructions. Such considerations are important when siting solar systems or designing buildings.
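The solar altitude underlying a sun-path diagram follows from basic geometry; a sketch using the common sinusoidal declination approximation (the function name and sample inputs are ours, for illustration only):

```python
import math

def solar_altitude(latitude_deg, day_of_year, solar_hour):
    """Solar altitude angle (deg) from standard geometric relations."""
    # Declination via the common approximation: 23.45 sin(360 (284+n)/365)
    decl = 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))
    hour_angle = 15.0 * (solar_hour - 12.0)  # deg; negative before solar noon
    L, d, h = map(math.radians, (latitude_deg, decl, hour_angle))
    sin_alt = (math.sin(L) * math.sin(d)
               + math.cos(L) * math.cos(d) * math.cos(h))
    return math.degrees(math.asin(sin_alt))

# Solar noon at 40° N on June 21 (n = 172): altitude near 90 - 40 + 23.45
alt_summer = solar_altitude(40.0, 172, 12.0)
alt_winter = solar_altitude(40.0, 355, 12.0)  # near the winter solstice
```

Evaluating such expressions over the hours and days of the year, together with the corresponding azimuth relation, traces out the sun-path curves of Fig. 3.30.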
3.5.5 Interactive and Dynamic Graphics
The above types of plots can be generated by most present-day data analysis software programs. More
Fig. 3.29 Example of a three-dimensional plot of measured hourly electricity use in a commercial building over 9 months. (From ASHRAE 2002, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.30 Figure illustrating an overlay plot for shading calculations. The sun-path diagram is generated by computing the solar altitude and azimuth angles for a given latitude (40° N in this case) during different times of the day and times of the year. Trees and other objects can obstruct the observer; such periods are conveniently determined by drawing the contours of these objects, characterized by the angles φp and βp computed from basic geometry, and overlaying them on the sun-path diagram. (From Reddy et al. 2016, by permission of CRC Press)
specialized software programs allow interactive data visualization, which provides much greater insight and intuitive understanding of data trends, correlations, outliers, and local behavior, especially when large amounts of data are being analyzed. Animation has also been used to advantage in understanding time-series system behavior from monitored data, since effects such as diurnal and seasonal differences in building energy use can be conveniently investigated. Animated scatter plots of the x and y variables, filtering, zooming, brushing, distortion lenses, etc., in conjunction with judicious use of color, can provide even better visual insights to the professional and enhance classroom learning as well. The interested reader can refer to Keim and Ward (2003) for a more in-depth classification and treatment of advanced data visualization techniques most appropriate for large datasets. Different domains have developed specialized visualization tools. Glaser and Ubbelohde (2001) describe novel high-performance visualization techniques for viewing time-dependent data common to building energy simulation program output. Some of these techniques include: (i) brushing
and linking, where the user can investigate the behavior during a few days of the year; (ii) tessellating a 2D chart into multiple smaller 2D charts, giving a four-dimensional (4D) view of the data such that a single value of a representative sensor can be evenly divided into smaller spatial plots arranged by time of day; (iii) magic lenses that can zoom into a certain portion of the room; and (iv) magic brushes. These techniques enable rapid inspection of trends and singularities that cannot be gleaned from conventional viewing methods.
3.5.6 Basic Data Transformations
Another important aspect of EDA is data transformation. Such transformations are meant to simplify the analysis by removing effects such as strong asymmetry, many outliers in one tail, or batches of data with different spreads, and by promoting linear relationships between regressor and response variables; in short, to yield more effective insights into the dataset being analyzed (Hoaglin et al. 1983). Typical examples include converting into appropriate units, taking
ratios, rescaling, and applying mathematical corrections to the data such as taking logarithms or square roots. Such transformations are discussed in various chapters of this book (e.g., Chap. 5 in the context of regression model building, Chap. 11 for data mining, and Chap. 12 for sustainability assessments). Only the very basic rescaling or normalization methods are described below:

(a) Decimal scaling moves the decimal point but still preserves most of the original data. The observations of a given variable are divided by 10^x, where x is the smallest integer such that all the scaled observations lie between −1 and 1. For example, say the largest value is 289 and the smallest value is −150. Then, since x = 3, all observations are divided by 1000 so as to lie between [−0.150 and 0.289].

(b) Min-max scaling allows for a better distribution of observations over the range of variation than does decimal scaling. It does this by redistributing the values to lie between [0 and 1]. Each observation is normalized as:

z_i = (x_i − x_min) / (x_max − x_min)    (3.15)

where x_max and x_min are the maximum and minimum numerical values, respectively, of the x variable. Sometimes x_min may be close or equal to zero, and then Eq. 3.15 simplifies to:

z_i = x_i / x_max    (3.16)
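As a sketch, the decimal and min-max scalings (Eqs. 3.15 and 3.16), along with the standard deviation scaling of Eq. 3.17 below, can be applied with NumPy; the data vector is illustrative:

```python
import numpy as np

x = np.array([289.0, -150.0, 40.0, 125.0, -60.0])  # illustrative data

# (a) Decimal scaling: divide by 10**k with the smallest integer k
#     such that all scaled values lie in [-1, 1]
k = int(np.ceil(np.log10(np.max(np.abs(x)))))
z_decimal = x / 10**k                       # here k = 3 -> divide by 1000

# (b) Min-max scaling (Eq. 3.15): redistribute values to lie in [0, 1]
z_minmax = (x - x.min()) / (x.max() - x.min())

# (c) Standard deviation scaling (Eq. 3.17): zero mean, unit std deviation
z_std = (x - x.mean()) / x.std(ddof=1)      # sample standard deviation
```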
Note that though this transformation may look very appealing, the scaling relies largely on the minimum and maximum values, which are generally not very robust and may be error prone. Min-max scaling is a linear transformation; one could also scale some of the variables using nonlinear functions such as power or logarithmic functions (as discussed in Sect. 12.5.3).

(c) Standard deviation scaling is widely used for distance measures (such as in multivariate statistical analysis) but transforms data into a form unrecognizable from the original data. Here, each observation is transformed as follows:

z_i = (x_i − x̄) / s_x    (3.17)

where x̄ and s_x are the mean and standard deviation, respectively, of the x variable. The dataset is said to be converted into one with zero mean and unit standard deviation (a transformation adopted in several statistical tests and during multivariate regression).

3.6 Overall Measurement Uncertainty

3.6.1 Need for Uncertainty Analysis

Any measurement exhibits some difference between the measured value and the true value and, therefore, has an associated uncertainty. A statement of a measured value without an accompanying uncertainty statement has limited meaning. Uncertainty is the interval around the measured value within which the true value is expected to fall at some stated confidence level (CL). "Good data" is not characterized by point values only; the data should fall within an acceptable uncertainty interval or, in other words, should provide an acceptable degree of confidence in the result. Measurements made in the field are especially subject to errors. In contrast to measurements taken under the controlled conditions of a laboratory setting, field measurements are typically made under less predictable circumstances and with less accurate and less expensive instrumentation. Furthermore, field measurements are vulnerable to errors arising from:
(a) Variable measurement conditions, so that the method employed may not be the best choice for all operating conditions.
(b) Limited instrument field calibration, because it is typically more complex and expensive than laboratory calibration.
(c) Limitations in the ability to adjust instruments in the field.
With appropriate care, many of these sources of error can be addressed: (i) through the optimization of the measurement system to provide maximum beneﬁt for the chosen budget, and (ii) through the systematic development of a procedure by which an uncertainty statement can be ascribed to the result. The results of a practitioner who does not consider sources of error are likely to be questioned by others, especially since the engineering community is increasingly becoming sophisticated and mature about the proper reporting of measured data and associated uncertainties.
3.6.2 Basic Uncertainty Concepts: Random and Bias Errors
There are several standard documents for evaluating and reporting uncertainties in measurements, parameters, and methods, and for propagating those uncertainties to the test result. The International Organization for Standardization (ISO) standards are the basis from which professional organizations have developed standards specific to their purpose (e.g., ASME PTC 19.1-2018). The following material is largely drawn from Guideline 2 (ASHRAE G2 2005), which deals with engineering analysis of experimental data. Uncertainty sources may be classified as either systematic/bias or random; both are treated as random variables, with, however, different multipliers applied to them. The end result of an uncertainty analysis is a numerical estimate of the test uncertainty with an appropriate CL.

(a) Bias or systematic error (or precision or fixed error) is analogous to sensor precision (see Sect. 3.1). In certain cases, the fixed offset (bias) errors can be determined. For example, a bias is present if a temperature sensor always reads 1 °C higher than the true value from a certified calibration procedure, and this miscalibration error can be corrected. However, other causes, such as improper placement of the sensor, degradation, or the particular measurement technique, produce perturbations in the sensor reading (akin to Fig. 3.1b). These perturbations are also treated as a random variable but are characterized by a fixed and unchanging uncertainty value that does not reduce even with multisampling (this aspect is elaborated in Sect. 3.6.4). Thus, for that specific situation, a simple bias correction to the measurements cannot be applied.

(b) Random error (or inaccuracy error) is an error due to unpredictable and unknown variations in the experiment that cause readings to take random values on either side of some mean value. Measurements may be accurate or inaccurate, depending on how well an instrument can reproduce subsequent readings of an unchanged input (Fig. 3.31). Only random errors can be treated by statistical methods. There are two types of random errors: (i) additive errors, which are independent of the magnitude of the observations, and (ii) multiplicative errors, which depend on the magnitude of the observations (Fig. 3.32). Usually instrument accuracy is stated in terms of percent of full scale, and in such cases the uncertainty of a reading is taken to be additive, i.e., irrespective of the magnitude of the reading.

Random errors are differences from one observation to the next due to both sensor noise and extraneous conditions affecting the sensor. The random error changes from one observation to the next, but its mean (average value) over a very large number of observations is taken to approach zero. Random error generally has a well-defined probability distribution that can be used to bound its variability in statistical terms, as described in the next two subsections, when a finite number of observations is made of the same variable.
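The additive versus multiplicative distinction can be illustrated with synthetic data: an additive error band stays constant across the measurement range, while the magnitude of a multiplicative error grows with the reading itself:

```python
import numpy as np

rng = np.random.default_rng(1)
x_true = np.linspace(1.0, 100.0, 200)               # noise-free trend

additive = x_true + rng.normal(0.0, 2.0, 200)       # fixed +/-2-unit error band
multiplicative = x_true * (1.0 + rng.normal(0.0, 0.02, 200))  # ~2% of reading

# The additive residuals have roughly constant spread (uncorrelated with the
# reading), whereas the multiplicative residuals grow with the reading
corr_add = np.corrcoef(x_true, np.abs(additive - x_true))[0, 1]
corr_mult = np.corrcoef(x_true, np.abs(multiplicative - x_true))[0, 1]
```

Plotted against `x_true`, these two residual patterns reproduce the constant and fanning uncertainty bands sketched conceptually in Fig. 3.32.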
Fig. 3.31 The four general manifestations of measurement bias and precision errors in the population average estimated from sample measurements: (a) unbiased and precise; (b) biased and precise; (c) unbiased and imprecise; (d) biased and imprecise
Fig. 3.32 Conceptual ﬁgures illustrating how additive and multiplicative errors affect the uncertainty bands around the trend line
3.6.3 Random Uncertainty
Based on measurements of a random variable X, the true value of X can be specified to lie in the interval (X_best ± U_X), where X_best is usually the mean of the measurements taken and U_X is the uncertainty in X corresponding to the estimated effect of combining fixed and random errors.
The uncertainty being reported is specific to a confidence level (CL),⁶ which can be directly interpreted as a probability. The confidence interval (CI) defines the range of values, or the bounds/limits, that can be expected to include the true value with a stated probability. For example, a statement that the CI at the 95% CL is 5.1-8.2 implies that the true value will be contained within the interval bounded by {5.1, 8.2} in 19 out of 20 predictions (95% probability), or that one is 95% confident that the true value lies between 5.1 and 8.2. This is a loose interpretation (but one easier for practitioners to understand); the more accurate one is that the CI applies to the difference between the sample and the population means and not to the population mean itself. An uncertainty statement with a low CL is usually of little use. For example, in the previous example, if a CL of 40% is used instead of 95%, the interval becomes a tight {7.6, 7.7}. However, only 8 out of 20 predictions will likely lie between 7.6 and 7.7. Conversely, it is useless to seek a 100% CL, since then the true value of the quantity would lie between plus and minus infinity.

Multisample data (repeated measurements of a fixed quantity using altered test conditions, such as different observers or different instrumentation or both) provides greater reliability and accuracy than single-sample data (measurements by one person using a single instrument). For the majority of engineering cases, it is impractical and too costly to perform a true multisample experiment. Strictly speaking, merely taking repeated readings with the same procedure and equipment does not provide multisample results; however, such a procedure is often accepted by the engineering community as a fair approximation of a multisample experiment. Depending upon the sample size of the data (greater or less than about 30 samples), different statistical considerations and equations apply. The issue of estimating CIs is further discussed in Chap. 4, while operational equations are presented below. These levels or limits are directly based on the Gaussian and Student-t distributions presented in Sect. 2.4.3.

(a) Random uncertainty in large samples (n > about 30): The best estimate of a variable x is usually its sample mean value x̄. The limits of the CI are determined from the sample standard deviation s_x. The typical procedure is then to assume that the individual data values are scattered about the mean following a certain probability distribution function, within (±z·s_x) of the mean, where z is a multiplier described below. Usually a normal probability curve (Gaussian distribution) is assumed to represent the dispersion in experimental

⁶ Several publications cite uncertainty intervals without specifying a corresponding CL; such practice should be avoided.
data, unless the process is known to follow one of the other standard distributions (discussed in Sect. 2.4). For a normal distribution, the standard deviation indicates the following degrees of dispersion of the values about the mean. From Table A3, for z = 1.96,7 the area shown shaded is 0.025, which translates to 0.05 for a twotailed distribution, implying that 95% of the data will be within (±1.96sx) of the mean. Thus, the zmultiplier has a direct relationship with the CL selected (assuming a known probability distribution). The CL for the mean of n number of multisample random data, with no ﬁxed error, is: z:s z:s xmin = x  p x and xmax = x þ p x n n
ð3:18aÞ
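Equation 3.18a is simple to script. The minimal sketch below (Python; the function name is our own choice) computes the two-sided large-sample CI; the numbers anticipate the field-length data of Example 3.6.1(a).

```python
import math

def ci_large_sample(mean, s, n, z=1.96):
    """Two-sided CI for the mean of a large sample (Eq. 3.18a), no fixed error.

    z = 1.96 corresponds to a two-tailed 95% CL for the normal distribution."""
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

lo, hi = ci_large_sample(mean=30, s=3, n=50)
print(round(lo, 2), round(hi, 2))   # 29.17 30.83
```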
(b) Random uncertainty in small samples (n < about 30): In many circumstances, the analyst will not be able to collect a large number of data points and may be limited to a dataset of fewer than 30 values. Under such conditions, the mean value and the standard deviation are computed as before, but the z-value applicable for the normal distribution cannot be used. The appropriate values, called t-values, are tabulated for different degrees of freedom d.f. (ν = n − 1) and for the acceptable degree of confidence (see Table A4⁸). The CI for the mean value of x, when no fixed (bias) errors are present in the measurements, is given by:

x_min = x̄ − t·s_x/√n  and  x_max = x̄ + t·s_x/√n    (3.18b)

For example, consider the case of d.f. = 10 and a two-tailed 95% CL. One finds from Table A4 that t = 2.228 for 95% CL. Note that this reduces to t = 2.086 for d.f. = 20 and reaches the z-value of 1.96 as d.f. → ∞.

⁷ Note that the value of 1.96 corresponds to very large samples (>120 or so). For a sample size of 30, the z-value can be read off Table A4, which shows a value of 2.045 for degrees of freedom = 30 − 1 = 29 at a two-tailed 95% CL. However, it is common practice to simply assume the z-values for samples greater than 30, even though this introduces some error.
⁸ Table A4 assembles critical values for both the one-tailed and two-tailed distributions, while most of the discussion here applies to the latter. See Sect. 4.2.1 for the distinction between the two.

Example 3.6.1 Estimating confidence intervals (CI)
(a) The length of a field is measured 50 times. The mean is 30 with a standard deviation of 3. Determine the 95% CI assuming no fixed error.
This is a large-sample case, for which the z-multiplier is 1.96. Hence, from Eq. 3.18a, the 95% CI = 30 ± (1.96)(3)/(50)^{1/2} = 30 ± 0.83 = {29.17, 30.83}.
(b) Only 21 measurements are taken, and the same mean and standard deviation as in (a) are found. Determine the 95% CI assuming no fixed error.
This is a small-sample case, for which the t-value = 2.086 for d.f. = 20. Then, from Eq. 3.18b, the 95% CI turns out to be wider: 30 ± (2.086)(3)/(21)^{1/2} = 30 ± 1.37 = {28.63, 31.37}. ■
3.6.4 Bias Uncertainty
Estimating the bias or fixed error of a random variable at a specified confidence level (commonly, 95% CL) is described below. The fixed error B_x for a given value x is assumed to be a single value drawn from some larger distribution of possible fixed errors. The treatment is similar to that of random errors, with the major difference that only one value is considered even though several observations may be taken. When further knowledge is lacking, a normal distribution is usually assumed. Hence, if a manufacturer specifies the fixed uncertainty as B_x = ±1.0 °C at 95% CL (compared to some standard reference device), then one assumes that the fixed error belongs to a larger distribution (taken to be Gaussian) with a standard deviation s_B = 0.5 °C (since the corresponding z-value ≃ 2.0).
3.6.5 Overall Uncertainty

The overall uncertainty of a measured variable x combines the random and bias uncertainty estimates. Though several forms of this expression appear in different texts, a convenient working formulation is as follows:

U_x = [ B_x² + (t·s_x/√n)² ]^{1/2}    (3.19)

where:
U_x = overall uncertainty in the value x at a specified CL
B_x = uncertainty in the bias or fixed component at the specified CL
s_x = standard deviation estimate for the random component
n = sample size
t = t-value at the specified CL for the appropriate degrees of freedom
Example 3.6.2 For a single measurement, the statistical concept of standard deviation does not apply. Nonetheless, one could estimate it from the manufacturer's specifications if available. One wishes to estimate the overall uncertainty at 95% CL in an individual measurement of water flow rate in a pipe under the following conditions:
(a) Full-scale meter reading 150 L/s
(b) Actual flow reading 125 L/s
(c) Random error of instrument is ±6% of full-scale reading at 95% CL
(d) Fixed (bias) error of instrument is ±4% of full-scale reading at 95% CL
The solution is rather simple since all stated uncertainties are at 95% CL. It is implicitly assumed that the normal distribution applies. The random error = 150 × 0.06 = ±9 L/s. The fixed error = 150 × 0.04 = ±6 L/s. The overall uncertainty can be estimated from Eq. 3.19 with n = 1:

U_x = (6² + 9²)^{1/2} = ±10.82 L/s

The fractional overall uncertainty at 95% CL = U_x/x = 10.82/125 = 0.087 = 8.7% ■
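The combination rule of Eq. 3.19 is easy to script. The sketch below (Python; the function name is our own) reproduces Examples 3.6.2 and 3.6.3. Note that the ±9 L/s random error is already the expanded 95% CL value, so the t-multiplier is left at 1 and only the √n averaging effect is applied.

```python
import math

def overall_uncertainty(bias, random_95, n, t=1.0):
    """Overall uncertainty U_x = [B_x^2 + (t*s_x/sqrt(n))^2]^(1/2), per Eq. 3.19.

    Here random_95 is the random error already expressed at the desired CL,
    so t defaults to 1; averaging n readings shrinks only this component."""
    return math.sqrt(bias**2 + (t * random_95 / math.sqrt(n))**2)

# Example 3.6.2: one reading, fixed error 6 L/s, random error 9 L/s (both at 95% CL)
U1 = overall_uncertainty(bias=6.0, random_95=9.0, n=1)
# Example 3.6.3: averaging 25 readings leaves the fixed error unchanged
U25 = overall_uncertainty(bias=6.0, random_95=9.0, n=25)
print(round(U1, 2), round(U25, 2))   # 10.82 6.26
```

The large fixed error dominates once n is moderately big, which is why further averaging yields diminishing returns.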
Example 3.6.3 Consider Example 3.6.2. In an effort to reduce the overall uncertainty, 25 readings of the flow are taken instead of only one. The resulting uncertainty in this case is determined as follows:
• The bias error remains unchanged at ±6 L/s.
• The random error decreases by a factor of √n to 9/(25)^{1/2} = ±1.8 L/s.
• Then, from Eq. 3.19, the overall uncertainty is: U_x = (6² + 1.8²)^{1/2} = ±6.26 L/s.
• The fractional overall uncertainty at 95% CL = U_x/x = 6.26/125 = 0.05 = 5.0%.
Increasing the number of readings from 1 to 25 reduces the absolute uncertainty in the flow measurement from ±10.82 L/s to ±6.26 L/s and the relative uncertainty from ±8.7% to ±5.0%. Because of the large fixed error, further increases in the number of readings would result in only a small reduction in the overall uncertainty. ■

Example 3.6.4 A flow meter manufacturer stipulates a random error of 5% for his meter at 95.5% CL (i.e., at z = 2). Once installed, the engineer estimates that the bias error due to the placement of the meter in the flow circuit is 2% at 95.5% CL. The flow meter takes a reading every minute, but only the mean value of 15 such measurements is recorded once every 15 min. Estimate the overall uncertainty of the mean of the recorded values at 99% CL.
The bias uncertainty can be associated with the normal tables. From Table A3, z = 2.575 has an associated two-tailed probability of 0.01, which corresponds to the 99% CL. Given that the bias error at 95.5% CL (z = 2) is 2%, the bias uncertainty at 99% CL (z = 2.575) is 2% × (2.575/2) = 2.575%. Next, the random error at z = 1 is half of that at z = 2, i.e., 2.5%. However, the number of observations is less than 30, and so the Student-t distribution has to be used for the random uncertainty component. From Table A4, the critical t-value = 2.977 for d.f. = 15 − 1 = 14 and a two-tailed CL of 99%. Hence, from Eq. 3.19, the overall uncertainty of the recorded values at 99% CL is:

U_x = [ 2.575² + ((2.977)(2.5)/(15)^{1/2})² ]^{1/2} = 3.21% ■

Table 3.7 Table for Chauvenet's criterion of rejecting outliers

Number of readings n | Deviation ratio d_max/s_x
5 | 1.65
6 | 1.73
7 | 1.80
10 | 1.96
15 | 2.13
20 | 2.24
25 | 2.33
30 | 2.51
50 | 2.57
100 | 2.81
300 | 3.14
500 | 3.29
1000 | 3.48
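The CL rescaling used in Example 3.6.4 can be checked numerically. The sketch below hardcodes the z- and t-values quoted in the example (variable names are our own); the exact arithmetic gives about 3.21%.

```python
import math

# Values quoted in Example 3.6.4: z = 2 at 95.5% CL, z = 2.575 at 99% CL,
# and t = 2.977 for d.f. = 14 at a two-tailed 99% CL
bias_99 = 2.0 * (2.575 / 2.0)   # 2% bias at z = 2, rescaled to 99% CL -> 2.575%
s_random = 5.0 / 2.0            # 5% random error at z = 2 -> 2.5% at one sigma
n = 15                          # readings averaged into each recorded value

# Eq. 3.19: combine the rescaled bias with the t-expanded random component
U99 = math.sqrt(bias_99**2 + (2.977 * s_random / math.sqrt(n))**2)
print(round(U99, 2))            # 3.21 (percent)
```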
3.6.6 Chauvenet's Statistical Criterion of Data Rejection
The statistical considerations described above lead to analytical screening methods that can flag data errors not caught by graphical methods alone. Though several types of rejection criteria have been proposed, perhaps the best known is Chauvenet's criterion, which is said to provide an objective and quantitative method for data rejection. This criterion, which presumes that the errors are normally distributed and have constant variance, specifies that any reading out of a series of n readings shall be rejected if its deviation from the mean value of the series, d_max = |x_i − x̄|, is so large that the two-sided probability of occurrence of such a deviation falls below 1/(2n). The criterion should not be applied to small datasets, since the Gaussian assumption does not hold in such cases. Chauvenet's criterion is approximately given by the following regression model:

d_max/s_x = 0.8478 + 0.5375·ln(n) − 0.02309·(ln n)²    (3.20)

where s_x is the standard deviation of the series and n is the number of data points.⁹ The deviation ratio for different numbers of readings is given more accurately in Table 3.7. For example, if one takes 15 observations, an observation shall be discarded if its deviation from the mean exceeds d_max = (2.13)·s_x. This data rejection should be done only once; more than one round of elimination using Chauvenet's criterion is not advised. Note that Chauvenet's criterion rests on inherent assumptions that may not be justified. For example, the underlying distribution may not be normal but could have a longer tail, in which case one may be throwing out good data. A more scientific manner of dealing with outliers is not to reject data points but to use either weighted regression or robust regression, where observations farther from the center are given less weight (see Sects. 5.6 and 9.9).
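A single-pass Chauvenet screen is straightforward to implement. The sketch below uses the regression fit of Eq. 3.20 for the deviation ratio; the function names and the small synthetic dataset are our own.

```python
import math

def chauvenet_ratio(n):
    """Approximate deviation ratio d_max/s_x from the regression fit of Eq. 3.20."""
    ln_n = math.log(n)
    return 0.8478 + 0.5375 * ln_n - 0.02309 * ln_n**2

def chauvenet_reject(data):
    """Return the observations a single Chauvenet pass would discard."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    d_max = chauvenet_ratio(n) * s
    return [x for x in data if abs(x - mean) > d_max]

print(round(chauvenet_ratio(15), 2))                        # 2.13, matching Table 3.7
print(chauvenet_reject([10, 10.1, 9.9, 10.05, 9.95, 20]))   # the stray 20 is flagged
```

Consistent with the text, the routine is meant to be applied once only, and only to samples large enough for the Gaussian assumption to be defensible.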
3.7 Propagation of Errors
⁹ The regression fit has an R-square of 99.6% and a root mean square error (RMSE) of 0.0358; these goodness-of-fit statistics are explained in Sect. 5.3.2.

In many cases, the variable used for data analysis is not directly measured; rather, values of several associated variables are measured and then combined through a functional relationship. The objective of this section is to present the methodology to estimate the overall error/uncertainty¹⁰ of a functional value y knowing the uncertainties in the individual input variables x_i. The random and fixed components, which together constitute the overall error, must be estimated separately. The treatment that follows, though limited to random errors, could also apply to bias errors. It is recommended that the Taylor series method be applied when the errors are relatively small compared to the measurement values (say, 5% or so). When the errors are large (say, over 15%), the Monte Carlo (MC) method (Sect. 3.7.2) is preferable.
3.7.1 Taylor Series Method for Cross-Sectional Data
The general approach to estimating the error of a function y = y(x₁, x₂, ..., x_n), whose independently measured variables are all specified at the same confidence level, is to use the first-order Taylor series expansion (often referred to as the Kline–McClintock propagation of errors equation):

ε_y = [ Σ_{i=1}^{n} ( (∂y/∂x_i)·ε_{x,i} )² ]^{1/2}    (3.21)

where:
ε_y = error in the function value
ε_{x,i} = error in the measured quantity x_i (equivalent to U_x in the previous section)

Neglecting terms higher than the first order (implied by a first-order Taylor series expansion), the propagation-of-error expressions for the basic arithmetic operations are given below. Let x₁ and x₂ have errors ε_{x1} and ε_{x2}. Then, for the basic arithmetic operations, Eq. 3.21 simplifies to:

Addition or subtraction: for y = x₁ ± x₂,

ε_y = (ε_{x1}² + ε_{x2}²)^{1/2}    (3.22)

Multiplication: for y = x₁·x₂,

ε_y = (x₁·x₂)·[ (ε_{x1}/x₁)² + (ε_{x2}/x₂)² ]^{1/2}    (3.23)

Division: for y = x₁/x₂,

ε_y = (x₁/x₂)·[ (ε_{x1}/x₁)² + (ε_{x2}/x₂)² ]^{1/2}    (3.24)

For functions involving multiplication and division only, it is much simpler to use the following expression than the more general Eq. 3.21. If y = x₁·x₂/x₃, then the fractional standard deviation is given by:

ε_y/y = [ (ε_{x1}/x₁)² + (ε_{x2}/x₂)² + (ε_{x3}/x₃)² ]^{1/2}    (3.25)

¹⁰ This chapter uses the terms "uncertainty" and "error" interchangeably. There is, however, a distinction. The word "error" is usually used in the context of sensors and derived measurements when the bias and random influences are small compared to the magnitude of the observation (say, 5% or less). When these are large, the term "uncertainty" is more appropriate; for example, when population statistics are derived from sample data, or when the error propagation analysis involves very large uncertainties in the individual variables, warranting a stochastic approach.
The error or uncertainty in the result depends on the squares of the uncertainties in the independent variables. This means that if the uncertainty in one variable is larger than in the others, it is the largest uncertainty that dominates. To illustrate, suppose there are three variables with an uncertainty of magnitude 1 and one variable with an uncertainty of magnitude 5. The uncertainty in the result would be (5² + 1² + 1² + 1²)^{0.5} = (28)^{0.5} = 5.29. Clearly, the effect of the uncertainty in the single largest variable dominates the others. An analysis of the relative magnitudes of uncertainties plays an important role during the design of an experiment and the procurement of instrumentation. Very little is gained by trying to reduce the "small" uncertainties, since it is the "large" ones that dominate. Any improvement in the overall experimental result must be achieved by improving the instrumentation or experimental technique responsible for these relatively large uncertainties. This concept is illustrated below.

Example 3.7.1¹¹ Relative error in Reynolds number for flow in a pipe
Water is flowing in a pipe at a certain measured rate. The temperature of the water is measured, and the viscosity and density are then found from tables of water properties. Determine the probable errors of the Reynolds number (Re) at the low- and high-flow conditions given the information assembled in Table 3.8. Recall that Re = ρVd/μ. Since the function involves multiplication and division only, it is easier to work with Eq. 3.25.

¹¹ Adapted from Schenck (1969), by permission of McGraw-Hill.
Table 3.8 Error table of the four quantities that define the Reynolds number (Example 3.7.1)

Quantity | Minimum flow | Maximum flow | Random error at full flow (95% CL) | % Error, minimum | % Error, maximum
Velocity V (m/s) | 1 | 20 | 0.1 | 10 | 0.5
Pipe diameter d (m) | 0.2 | 0.2 | 0 | 0 | 0
Density ρ (kg/m³) | 1000 | 1000 | 1 | 0.1 | 0.1
Viscosity μ (kg/m·s) | 1.12 × 10⁻³ | 1.12 × 10⁻³ | 0.45 × 10⁻⁵ | 0.4 | 0.4

Note that the last two columns under "% Error" are computed from the previous three columns of data.
[Fig. 3.33 Expected variation in experimental relative error (at 95% CL) with magnitude of Reynolds number (Example 3.7.1); scatter plot of relative error in Re versus Reynolds number Re]
At the minimum-flow condition, the relative error in Re (assuming no error in the pipe diameter value) is:

ε(Re)/Re = [ (0.1/1)² + (1/1000)² + (0.45 × 10⁻⁵/1.12 × 10⁻³)² ]^{1/2} = [0.1² + 0.001² + 0.004²]^{1/2} = 0.100 or 10%

On the other hand, at the maximum-flow condition, the percentage error is:

ε(Re)/Re = [0.005² + 0.001² + 0.004²]^{1/2} = 0.0065 or 0.65%

The above example reveals that (i) at low-flow conditions the error is 10%, which reduces to 0.65% at high-flow conditions, and (ii) at low-flow conditions the other sources of error are absolutely dwarfed by the 10% error due to flow measurement uncertainty. Thus, the only way to improve the experiment is to improve the flow measurement accuracy. If the experiment is run without changes, one can confidently expect the data at the low-flow end to show a broad scatter that becomes smaller as the velocity is increased. This phenomenon is captured by the 95% CI shown as relative errors in Fig. 3.33. ■
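Equation 3.25 can be applied programmatically to the Table 3.8 data. A minimal sketch (the helper name is our own; the pipe diameter is taken as error-free, as in the example):

```python
import math

def fractional_error(*components):
    """Relative error of a pure product/quotient function, per Eq. 3.25.

    Each component is a (value, absolute_error) pair."""
    return math.sqrt(sum((err / val) ** 2 for val, err in components))

# Table 3.8 data for Re = rho*V*d/mu (diameter error is zero and drops out)
visc = (1.12e-3, 0.45e-5)
low  = fractional_error((1.0, 0.1), (1000.0, 1.0), visc)    # minimum flow
high = fractional_error((20.0, 0.1), (1000.0, 1.0), visc)   # maximum flow
print(round(low, 3), round(high, 4))   # 0.1 0.0065, i.e., 10% and 0.65%
```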
Equation 3.21 applies when the measured variables are uncorrelated. If they are correlated, their interdependence can be quantified by the covariance (defined by Eq. 3.11). If two variables x₁ and x₂ are correlated, then the error of their sum is given by:

ε_y = [ ε_{x1}² + ε_{x2}² + 2·cov(x₁, x₂) ]^{1/2}    (3.26)
Note that the covariance term can assume positive or negative values, and so the combined error can be higher or lower than that for uncorrelated independent variables.

Example 3.7.2 Uncertainty in overall heat transfer coefficient
The equation of the overall heat transfer coefficient U of a heat exchanger consisting of one fluid flowing inside and another fluid flowing outside a steel pipe of negligible thermal resistance is:

U = (1/h₁ + 1/h₂)⁻¹ = h₁·h₂/(h₁ + h₂)    (3.27)

where h₁ and h₂ are the individual coefficients of the two fluids on either side of the pipe. If h₁ = 15 W/m²·°C with a fractional error of 5% at 95% CL and h₂ = 20 W/m²·°C with a fractional error of 3%, also at 95% CL, what will be the fractional random uncertainty of the U coefficient at 95% CL, assuming the bias error to be zero?
In this case, because of the addition term, one has to use the fundamental equation given by Eq. 3.21, for which the partial derivatives need to be computed. One could proceed analytically using basic calculus:

∂U/∂h₁ = [h₂(h₁ + h₂) − h₁·h₂]/(h₁ + h₂)² = h₂²/(h₁ + h₂)²    (3.28)

∂U/∂h₂ = [h₁(h₁ + h₂) − h₁·h₂]/(h₁ + h₂)² = h₁²/(h₁ + h₂)²    (3.29)

The absolute uncertainty ε_U in the overall heat transfer coefficient U is given by Eq. 3.21:

ε_U = [ (∂U/∂h₁ · ε_{h1})² + (∂U/∂h₂ · ε_{h2})² ]^{1/2}    (3.30)

where ε_{h1} and ε_{h2} are the errors of the coefficients h₁ and h₂, respectively, and are determined as:

ε_{h1} = 0.05 × 15 = 0.75 and ε_{h2} = 0.03 × 20 = 0.60

Plugging numerical values into the expression for U given by Eq. 3.27, one gets U = 8.571, while the partial derivatives given by Eqs. 3.28 and 3.29 are computed as:

∂U/∂h₁ = 0.3265 and ∂U/∂h₂ = 0.1837

Finally, from Eq. 3.30, the absolute error in the overall heat transfer coefficient U is:

ε_U = [ (0.3265 × 0.75)² + (0.1837 × 0.60)² ]^{1/2} = 0.269 at 95% CL

The fractional error in U = (0.269/8.571) = 3.1%. ■

Another method of determining partial derivatives is to adopt a perturbation approach, which allows the local sensitivity or slope of the function to be evaluated numerically. A computer routine can be written to perform this task. One method is based on approximating the partial derivatives by a central finite-difference approach. If y = y(x₁, x₂, ..., x_n), then:

∂y/∂x₁ ≈ [y(x₁ + Δx₁, x₂, ...) − y(x₁ − Δx₁, x₂, ...)]/(2·Δx₁)
∂y/∂x₂ ≈ [y(x₁, x₂ + Δx₂, ...) − y(x₁, x₂ − Δx₂, ...)]/(2·Δx₂), etc.    (3.31)

No strict rules for the size of the perturbation or step size Δx can be framed, since it depends on the underlying shape of the function. Perturbations in the range of 1–4% of the value are reasonable choices, and one should evaluate the stability of the numerically computed partial derivative by repeating the calculation for a few different step sizes. In cases involving complex experiments with extended debugging phases, one should update the uncertainty analysis whenever a change is made in the data reduction program. Commercial software programs are also available with inbuilt uncertainty propagation formulae. This procedure is illustrated in Example 3.7.3.

Example 3.7.3 Uncertainty in exponential growth models
Exponential growth models are used to describe several commonly encountered phenomena, from population growth to consumption of resources. The amount of a resource consumed over time Q(t) can be modeled as:

Q(t) = ∫₀ᵗ P₀·e^{rt} dt = (P₀/r)·(e^{rt} − 1)    (3.32)

where P₀ = initial consumption rate and r = exponential rate of growth. The world coal consumption in 1986 was equal to 5.0 billion (short) tons, and the recoverable reserves of coal were estimated at 1000 billion tons.
(a) If the growth rate is assumed to be 2.7% per year, how many years will it take for the total coal reserves to be depleted?
Rearranging Eq. 3.32 results in:

t = (1/r)·ln(1 + Q·r/P₀)    (3.33)

or t = (1/0.027)·ln[1 + (1000)(0.027)/5] = 68.75 years

(b) Assume that the growth rate r and the recoverable reserves Q are subject to random uncertainty. If the uncertainties of both quantities are taken to be normal with standard deviations of 0.2% (absolute) and 10% (relative), respectively, determine the lower and upper estimates of the years to depletion at the 95% CL.
While the partial derivatives can be derived analytically in this case, the use of Eq. 3.21 following a numerical approach is adopted for illustration. The pertinent results of using Eq. 3.31 with a perturbation multiplier of 1% applied to the base values of r (= 0.027) and Q (= 1000) are assembled in Table 3.9. From these:

∂t/∂r = (68.37917 − 69.12924)/(0.02727 − 0.02673) = −1389, and
∂t/∂Q = (69.06297 − 68.43795)/(1010 − 990) = 0.03125
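The central-difference recipe of Eq. 3.31 applied to Eq. 3.33 reproduces these partial derivatives. In the sketch below the function names are our own, and P₀ = 5 (billion tons per year) is taken from the example.

```python
import math

def depletion_time(Q, r, P0=5.0):
    """Years to depletion from Eq. 3.33; P0 = 5 billion tons/yr per the example."""
    return math.log(1.0 + Q * r / P0) / r

def central_diff(f, x, dx):
    """Central finite-difference estimate of df/dx, per Eq. 3.31."""
    return (f(x + dx) - f(x - dx)) / (2.0 * dx)

# 1% perturbations about the base point r = 0.027, Q = 1000
dt_dr = central_diff(lambda r: depletion_time(1000.0, r), 0.027, 0.00027)
dt_dQ = central_diff(lambda Q: depletion_time(Q, 0.027), 1000.0, 10.0)

# Eq. 3.21 with eps_r = 0.002 (absolute) and eps_Q = 10% of 1000
eps_t = math.sqrt((dt_dr * 0.002) ** 2 + (dt_dQ * 0.1 * 1000.0) ** 2)
print(round(dt_dr), round(dt_dQ, 5), round(eps_t, 2))   # -1389 0.03125 4.18
```

Repeating the calls with a 2% step, as the text advises, is a one-line change and confirms the derivatives are stable here.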
Table 3.9 Numerical computation of the partial derivatives of t with respect to r and Q (Example 3.7.3)

Multiplier | r (with Q = 1000) | t (from Eq. 3.32) | Q (with r = 0.027) | t (from Eq. 3.32)
0.99 | 0.02673 | 69.12924 | 990 | 68.43795
1.00 | 0.02700 | 68.75178 | 1000 | 68.75178
1.01 | 0.02727 | 68.37917 | 1010 | 69.06297
Then, from Eq. 3.21:

ε_t = [ (∂t/∂r · ε_r)² + (∂t/∂Q · ε_Q)² ]^{1/2} = [ ((−1389)(0.002))² + ((0.03125)(0.1)(1000))² ]^{1/2} = [2.778² + 3.125²]^{1/2} = 4.181

Thus, the lower and upper limits at the 95% CL (with z = 1.96) are 68.75 ± (1.96)(4.181) = {60.55, 76.94} years.
The analyst should repeat the above procedure with, say, a perturbation multiplier of 2% in order to evaluate the stability of the numerically derived partial derivatives. If these differ substantially, it is urged that the function be plotted and scrutinized for irregular behavior around the point of interest. ■

Example 3.7.4 Selecting instrumentation during the experimental design phase
A general uncertainty analysis is recommended at the planning phase for the purposes of proper instrument selection. This analysis should intentionally be kept simple. It is meant to identify the primary sources of uncertainty, to evaluate the relative weights of the different sources, and to perform a sensitivity analysis.
An experimental program is being considered involving continuous monitoring of a large chiller under field conditions. The objective of the monitoring is to determine the chiller coefficient of performance (COP) on an hourly basis. The fractional uncertainty in the COP should not be greater than 5% at 95% CL. The rated full load is 450 tons of cooling (1 ton = 12,000 Btu/h). The chiller is operated under constant chilled water and condenser water flow rates. Only random errors are to be considered. The COP of a chiller is defined as the ratio of the amount of cooling at the evaporator (Q_ch) to the electric power (E) consumed:

COP = Q_ch/E    (3.34)

While the power E can be measured directly, the amount of cooling Q_ch has to be determined from individual measurements of the chilled water volumetric flow rate and the difference between the supply and return chilled water temperatures, along with water properties:

Q_ch = ρ·V·c·ΔT    (3.35)

where:
ρ = density of water
V = chilled water volumetric flow rate, assumed constant during operation (rated flow rate = 1080 gpm)
c = specific heat of water
ΔT = temperature difference between the entering and leaving chilled water at the evaporator (a quantity that changes during operation)

From Eq. 3.25, the fractional uncertainty in the COP (neglecting the small effect of uncertainties in the density and specific heat terms) is:

U_COP/COP = [ (U_V/V)² + (U_ΔT/ΔT)² + (U_E/E)² ]^{1/2}    (3.36)
Note that since this is a preliminary uncertainty analysis, only random errors are considered. The maximum flow reading of the selected meter is 1500 gpm with 4% uncertainty at 95% CL. This leads to an absolute uncertainty of 1500 × 0.04 = 60 gpm. The first term U_V/V is a constant and does not depend on the chiller load, since the flow through the evaporator is maintained constant at 1080 gpm. Thus:

(U_V/V)² = (60/1080)² = 0.0031 and U_V/V = ±0.056

The random error at 95% CL for the type of commercial-grade sensor to be used for temperature measurement is 0.2 °F. Consequently, the error in the measurement of the temperature difference is ΔT error = (0.2² + 0.2²)^{1/2} = 0.28 °F. From manufacturer catalogs, the temperature difference between supply and return chilled water at full load can be assumed to be 10 °F. The fractional uncertainty at full load is then:

(U_ΔT/ΔT)² = (0.28/10)² = 0.00078 and U_ΔT/ΔT = ±0.028

The power instrument has a full-scale value of 400 kW with an error of 1% at 95% CL, i.e., an error of 4.0 kW. The chiller rated capacity is 450 tons of cooling, with an assumed realistic lower bound of 0.8 kW per ton of cooling. The anticipated electric draw at full load of the chiller is 0.8 × 450 = 360 kW. The fractional uncertainty at full load is then:

(U_E/E)² = (4.0/360)² = 0.00012 and U_E/E = ±0.011

Thus, the fractional uncertainty in the power is about one fifth of that in the flow rate. Propagation of the above errors (Eq. 3.36) yields the fractional uncertainty of the measured COP at 95% CL at full chiller load:

U_COP/COP = (0.0031 + 0.00078 + 0.00012)^{1/2} = 0.063 = 6.3%

It is clear that the fractional uncertainty of the proposed instrumentation is not satisfactory for the intended purpose. Looking at the fractional uncertainties of the individual contributions, the logical remedy is to select a more accurate flow meter or one with a lower maximum flow reading. ■
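The uncertainty budget of Example 3.7.4 can be sketched in a few lines (variable names are our own), which makes it easy to swap in candidate instruments and re-check the 5% target:

```python
import math

# Fractional uncertainty budget for the measured COP (Eq. 3.36),
# using the instrument values assumed in Example 3.7.4
u_flow  = (1500 * 0.04) / 1080                 # flow meter: 4% of 1500 gpm full scale
u_dT    = math.sqrt(0.2**2 + 0.2**2) / 10.0    # two 0.2 F sensors over a 10 F difference
u_power = (400 * 0.01) / 360                   # power meter: 1% of 400 kW full scale

u_cop = math.sqrt(u_flow**2 + u_dT**2 + u_power**2)
print(round(u_cop, 3))                         # 0.063, i.e., 6.3% -- above the 5% target
```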
3.7.2 Monte Carlo Method for Error Propagation Problems
The previous method of ascertaining errors/uncertainty, based on the first-order Taylor series expansion, is widely used, but it has limitations. If the relative uncertainty is large, the method may be inaccurate for nonlinear functions, since it relies on first-order derivatives describing local functional behavior. Further, for the CI of the functional variable to have a statistical interpretation, the errors have to be normally distributed. Finally, deriving partial derivatives of complex and interlinked analytical functions (as is common for models involving system simulation of various individual components) is a tedious and error-prone affair. A more general manner of dealing with uncertainty propagation is to use Monte Carlo (MC) methods, which are widely used for a number of complex applications (and treated at more length in Sects. 6.7.3, 10.6.7, and 12.2.8). These are numerical methods for solving problems involving random numbers and require considerations of probability. MC, in essence, is a numerical process of repeatedly calculating a mathematical function in which the input variables and/or the function parameters are random or contain uncertainty with prescribed probability distributions. Specifically, the individual inputs and/or parameters are sampled randomly from their prescribed probability distributions to form one repetition (or run or trial). The corresponding numerical solution is one possible outcome of the function. This process of generating runs is repeated a large number of times, resulting in a distribution of the functional values that
can then be represented as probability distributions, as histograms, by summary statistics, or by the CI for any percentile threshold chosen. Such insights cannot be gained from the traditional Taylor series error propagation approach. The MC process is computer intensive and requires thousands of runs to be performed. However, the entire process is simple and easily implemented even in spreadsheet programs (which have inbuilt functions for generating pseudo-random numbers from selected distributions). Specialized engineering software programs are also available. A certain amount of arbitrariness is associated with the process because MC simulation is a numerical method. Several authors propose approximate formulae for determining the number of trials, but a simple method is as follows. Start with a large number of trials (say, 1000), and generate pseudo-random numbers with the assumed probability distribution. Since they are pseudo-random, the mean and the distribution (say, the standard deviation) may deviate somewhat from the desired ones (depending on the accuracy of the algorithm used). Generate a few such sets and pick the one whose mean and standard deviation are closest to the desired quantities. Use this set to simulate the corresponding values of the function. This can be repeated a few times until the mean and standard deviation stabilize around some average values, which can be taken to be the answer. The analyst should also evaluate the effect of the number of trials on the results, say by using 3000 trials and ascertaining that the 1000-trial and 3000-trial results are similar. If they are not, sets with increasingly large numbers of trials should be used until the results converge. The approach is best understood by means of a simple example.

Example 3.7.5 Using the Monte Carlo (MC) method to determine uncertainty in exponential growth models
Let us solve the problem given in Example 3.7.3 by the MC method.
The approach involves setting up a spreadsheet table as shown in Table 3.10. Since only two variables (namely Q and r) have uncertainty, one only needs to assign two columns to these and a third column to the desired quantity, i.e., the time t over which the total coal reserves will be depleted. The first row shows the calculation using the mean values, and one sees that the value t = 68.75 found in part (a) of Example 3.7.3 is obtained (this serves to verify the spreadsheet cell formula). The analyst then generates random numbers for Q and r with the corresponding means and standard deviations as specified and shown in the column headings of the table. MC methods, being numerical, require that a large sample be generated in order to obtain reliable results.
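The same analysis runs just as easily in a general-purpose language as in a spreadsheet. The sketch below uses standard-library Python (a fixed seed is our choice, for repeatability) and draws 10,000 trials rather than 1000:

```python
import math
import random
import statistics

def depletion_time(Q, r, P0=5.0):
    """Years to depletion from Eq. 3.33; P0 = 5 billion tons/yr per Example 3.7.3."""
    return math.log(1.0 + Q * r / P0) / r

random.seed(1)  # fixed seed so the run is repeatable
trials = [depletion_time(random.gauss(1000.0, 100.0), random.gauss(0.027, 0.002))
          for _ in range(10000)]

mean_t = statistics.mean(trials)
sd_t = statistics.stdev(trials)
ts = sorted(trials)
print(round(mean_t, 1), round(sd_t, 1))      # sample mean and SD of the simulated t
print(round(ts[500], 1), round(ts[9500], 1)) # approximate 5th and 95th percentiles
```

The resulting mean and spread are in line with Table 3.10, and sorting the trials gives the percentiles of Table 3.11 directly.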
Table 3.10 The first few and last few calculations used to determine uncertainty in variable t using the Monte Carlo (MC) method

Run # | Q (1000, 100) | r (0.027, 0.002) | t (years)
1 | 1000.0000 | 0.0270 | 68.7518
2 | 1050.8152 | 0.0287 | 72.2582
3 | 1171.6544 | 0.0269 | 73.6445
4 | 1098.2454 | 0.0284 | 73.2772
5 | 1047.5003 | 0.0261 | 69.0848
6 | 1058.0283 | 0.0247 | 67.7451
7 | 946.8644 | 0.0283 | 68.5256
8 | 1075.5269 | 0.0277 | 71.8072
9 | 967.9137 | 0.0278 | 68.6323
10 | 1194.7164 | 0.0262 | 73.3758
... | ... | ... | ...
990 | 1133.6639 | 0.0278 | 73.6712
991 | 997.0123 | 0.0252 | 66.5173
992 | 896.6957 | 0.0257 | 63.8175
993 | 1056.2361 | 0.0283 | 71.9108
994 | 1033.8229 | 0.0298 | 72.8905
995 | 1078.6051 | 0.0295 | 73.9569
996 | 1137.8546 | 0.0276 | 73.4855
997 | 950.8749 | 0.0263 | 66.3670
998 | 1023.7800 | 0.0264 | 68.7452
999 | 950.2093 | 0.0248 | 64.5692
1000 | 849.0252 | 0.0247 | 61.0231
Mean | 1005.0 | 0.0272 | 68.91
SD | 101.82 | 0.00199 | 3.919

The last two rows indicate the mean and standard deviation (SD) of the individual columns (Example 3.7.5)
Fig. 3.34 Pertinent plots of the Monte Carlo (MC) analysis for the time variable t (years): (a) histogram; (b) normal probability plot with 95% limits
Often, 1000 normal distribution samples are selected but it is advisable to repeat the analysis a few times for more robust results. For example, instead of having (1000, 100) for the mean and standard deviation of Q, the 1000 samples have (1005.0, 101.82). The corresponding mean and standard deviation of t are found to be (68.91, 3.919) compared to the previously estimated values of (68.75, 4.181). Further, even though normal distributions were assumed for variables Q and r, the functional form for time t results in a nonnormal
distribution as can be seen by the histogram and the normal probability plots of Fig. 3.34. Hence it is more meaningful to look at the percentiles rather than simply the mean and standard deviation (shown in Table 3.10). These are easily deduced from the MC runs and shown in Table 3.11. Such additional insights into the distribution of the variable t provided by the MCgenerated data are certainly advantageous. ■
Table 3.11 Percentiles for the time variable t (years) determined from the 1000 MC runs of Table 3.10

% | Percentile
1.0 | 58.509
5.0 | 62.331
10.0 | 63.8257
25.0 | 66.3155
50.0 | 68.968
75.0 | 71.6905
90.0 | 73.818
95.0 | 74.9445
99.0 | 77.051

3.8 Planning a Non-Intrusive Field Experiment
One needs to differentiate between two conditions under which data can be collected. On the one hand, one can have a controlled setting where the various variables of interest can be altered by the experimenter. In such a case, referred to as intrusive testing, one can plan an “optimal” experiment where one can adjust the inputs and boundary or initial conditions as well as choose the number and location of the sensors so as to minimize the effect of errors on estimated values of the parameters (treated in Chap. 6). On the other hand, one may be in a situation where one is a mere “spectator,” i.e., the system or phenomenon cannot be controlled, and the data is collected under nonexperimental conditions (as is the case of astronomical observations). Such an experimental protocol, known as nonintrusive identiﬁcation, is usually not the best approach. In certain cases, the driving forces may be so weak or repetitive that even when a “long” dataset is used for identiﬁcation, a strong enough or varied output signal cannot be elicited for proper statistical treatment (see Sect. 10.3). An intrusive or controlled experimental protocol, wherein the system is artiﬁcially stressed to elicit a strong response, is more likely to yield robust and accurate models and their parameter estimates. However, in some cases, the type and operation of the system may not allow such intrusive experiments to be performed. Further, one should appreciate differences between measurements made in a laboratory setting and in the ﬁeld. The potential for errors, both bias and random, is usually much greater in the latter. Not only can measurements made on a piece of laboratory equipment be better designed and closely controlled, but they will be more accurate as well because more expensive sensors can be selected and placed correctly in the system. For example, proper ﬂow measurement requires that the ﬂow meter be placed 30 pipe diameters after a bend to ensure the ﬂow proﬁle is well established. 
A laboratory setup can be designed accordingly, while field conditions may not allow such conditions to be met satisfactorily. Further, systems being operated in the field may not allow controlled tests to be performed, and one has to develop a model or make decisions based on what one can observe. Any experiment should be well planned, involving several rational steps (e.g., ascertaining that the right sensors and equipment are chosen, that the right data collection protocol and scheme are followed, and that the appropriate data analysis procedures are selected). It is advisable to explicitly adhere to the following steps (ASHRAE 2005): (i) Identify the experimental goals and the acceptable accuracy that can be achieved within the time and budget available for the experiment. (ii) Identify the entire list of measurable variables and relationships. If some are difficult to measure, find alternative variables. (iii) Establish measured variables and limits (theoretical limits and expected bounds) to match the selected instrument limits. Also, determine instrument limits: all sensor and measurement instruments have physical limits that restrict their ability to accurately measure quantities of interest. (iv) Make a preliminary instrument selection based on the accuracy, repeatability, and features of the instrument, as well as its cost. Regardless of the instrument chosen, it should have been calibrated within the last 12 months or within the interval required by the manufacturer, whichever is less. The required accuracy of the instrument will depend upon the acceptable level of uncertainty for the experiment. (v) Document the uncertainty of each measured variable using information gathered from manufacturers or past experience with specific instrumentation. This information will then be used in estimating the overall uncertainty of the results using propagation-of-error methods. (vi) Perform a preliminary uncertainty analysis of the proposed measurement procedures and experimental methodology.
This should be completed before the procedures and methodology are ﬁnalized in order to estimate the uncertainty in the ﬁnal results. The higher the accuracy required of measurements, the higher the accuracy of sensors needed to obtain the raw data. The uncertainty analysis is the basis for selection of a measurement system that provides acceptable uncertainty at least cost. How to perform such a preliminary uncertainty analysis was illustrated by Example 3.7.4. (vii) Final instrument selection and methods should be based on the results of the preliminary uncertainty
analysis and selection of instrumentation. Revise selection, if necessary, to achieve the acceptable uncertainty in the experiment results. (viii) Install instrumentation in accordance with manufacturer’s recommendations. Any deviation in the installation from the manufacturer’s recommendations should be documented and the effects of the deviation on instrument performance evaluated. A change in instrumentation or location may be required if in situ uncertainty exceeds acceptable limits determined by the preliminary uncertainty analysis. (ix) Perform initial data quality veriﬁcation to ensure that the measurements taken are not too uncertain and represent reality. Instrument calibration and independent checks of the data are recommended. Independent checks can include sensor validation, energy balances, and material balances (see Sect. 3.3). (x) Collect data. The challenge for data acquisition in any experiment is to collect the required amount of information while avoiding collection of superﬂuous information. Superﬂuous information can overwhelm simple measures taken to follow the progress of an experiment and can complicate data analysis and report generation. The relationship between the desired result—either static, periodic stationary, or transient— and time is the determining factor for how much information is required. A static, nonchanging result requires only the steadystate result and proof that all transients have died out. A periodic stationary result, the simplest dynamic result, requires information for one period and proof that the one selected is one of three consecutive periods with identical results within acceptable uncertainty. Transient or nonrepetitive results—whether a single pulse or a continuing, random result—require the most information. Regardless of the result, the dynamic characteristics of the measuring system and the full transient nature of the result must be documented for some relatively short interval of time. 
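The preliminary uncertainty analysis of step (vi) can be sketched numerically. The snippet below is a minimal illustration (not from the text): it propagates independent uncertainties through a generic result R = f(x1, ..., xn) using first-order Taylor-series propagation with central-difference partial derivatives; the heat-rate function and its numbers are hypothetical.

```python
import math

def propagate_uncertainty(f, x, u, rel_step=1e-6):
    """First-order (Taylor-series) propagation of independent errors:
    u_R^2 = sum_i (dR/dx_i * u_i)^2, with partials taken by central differences."""
    total = 0.0
    for i, (xi, ui) in enumerate(zip(x, u)):
        h = rel_step * (abs(xi) or 1.0)
        xp, xm = list(x), list(x)
        xp[i] = xi + h
        xm[i] = xi - h
        dfdx = (f(xp) - f(xm)) / (2 * h)
        total += (dfdx * ui) ** 2
    return math.sqrt(total)

# Hypothetical example: heat rate Q = m * cp * dT, with assumed uncertainties
Q = lambda v: v[0] * v[1] * v[2]
uQ = propagate_uncertainty(Q, [10.0, 4.19, 5.0], [0.2, 0.05, 0.1])
```

The same routine can be reused at the final-uncertainty-analysis stage (step xii) simply by substituting the as-measured values and calibration-based uncertainties.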
Identifying good models requires a certain amount of diversity in the data, i.e., the data should cover the spatial domain of variation of the independent variables. Some basic suggestions pertinent to controlled experiments are summarized below; they are also relevant for non-intrusive data collection. (a) Range of variability: The most obvious way in which an experimental plan can be made compact and efficient is to space the variables in a predetermined manner. If a functional relationship between an independent variable X and a dependent variable Y is sought, the most obvious way is to select end points or limits of the test,
3 Data Collection and Preliminary Analysis
Fig. 3.35 A possible X–Y–Z envelope with X and Y as the independent variables. The dashed lines enclose the total family of points over the feasible domain space
thus covering the test envelope or domain that encloses the complete family of data. For a model of the type Z = f(X,Y), a plane area or map is formed (see Fig. 3.35). Functions involving more variables are usually broken down into a series of maps. The above discussion relates to controllable regressor variables. Extraneous variables, by their very nature, cannot be varied at will. An example is phenomena driven by climatic variables. The energy use of a building is affected, among others, by outdoor dry-bulb temperature, humidity, and solar radiation. Since these cannot be varied at will, a proper experimental data collection plan would entail collecting data during different seasons of the year. (b) Grid spacing considerations: Once the domains or ranges of variation of the variables are defined, the next step is to select the grid spacing. Being able to anticipate the system behavior from theory or from prior publications leads to a better experimental design. For a relationship between X and Y that is known to be linear, the optimal grid is to space the points at the two extremities. However, if a linear relationship between X and Y is sought for a phenomenon that can only be approximated as linear, it would be best to space the x points evenly. For nonlinear or polynomial functions, an equally spaced test sequence in X is clearly not optimal. Consider the pressure drop through a new fitting as a function of flow. It is known that the relationship is quadratic. Choosing an experiment with equally spaced X values would result in a plot such as that shown in Fig. 3.36a. One would have more observations in the
Fig. 3.36 Two different experimental designs for proper identification of the parameter (k) appearing in the model for pressure drop versus velocity of a fluid flowing through a pipe, assuming ΔP = kV². The grid spacing shown in (a) is the more common one based on equal increments in the regressor variable, while that in (b) is likely to yield more robust estimation but would require guesstimating the range of variation of the pressure drop
low pressure-drop region and fewer in the higher range. One may argue that an optimal spacing would be to select the velocity values such that the pressure drop readings are more or less evenly spaced (see Fig. 3.36b). Which of the two is better depends on the instrument precision. If the pressure drop instrument has constant uncertainty over the entire range of variation of the experiment, then the test spacing shown in Fig. 3.36b is better because it is uniform. But if the fractional uncertainty of the instrument decreases with increasing pressure drop values, then the points at the lower end have higher uncertainty. In that case it is better to take more readings at the lower end, as in the spacing sequence shown in Fig. 3.36a. (xi) Accomplish data reduction and analysis, which involves the distillation of raw data into a form that is usable for further analysis. It may involve averaging multiple measurements, quantifying necessary conditions (e.g., steady state), comparing with physical limits or expected ranges, and rejecting outlying measurements. (xii) Perform the final uncertainty analysis, which is done after the entire experiment has been completed and when the results of the experiments are to be documented or reported. This will take into account unknown field effects and variances in instrument accuracy during the experiment. A final uncertainty analysis involves the following steps: (i) Estimate the fixed (bias) error based upon instrument calibration results, and (ii) document the random error due to the instrumentation based upon instrument calibration results. The fixed errors needed for the detailed uncertainty analysis are usually more difficult to estimate with a high degree of certainty. Minimizing fixed errors can
be accomplished by careful calibration with referenced standards. (xiii) Reporting results is the primary means of communication. Different audiences require different reports with various levels of detail and background information. The report should be structured to clearly explain the goals of the experiment and the evidence gathered to achieve the goals. It should describe the data reduction, data analysis, and uncertainty analysis performed. Graphical and mathematical representations are often used. On graphs, error bars placed vertically and horizontally on representative points are a very clear way to present expected uncertainty. A data analysis section and a conclusion are critical sections and should be prepared with great care while being succinct and clear.
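The two grid-spacing strategies of Fig. 3.36 can be sketched as follows; the fitting constant k = 1, the 6-point grid, and the velocity range are assumed purely for illustration.

```python
# Sketch (assumed k = 1, 6 points, velocity range 1..6) of the two test grids
# for the quadratic model dP = k * V^2 discussed with Fig. 3.36.
k = 1.0
n = 6
v_max = 6.0

# (a) Equally spaced velocities -> pressure-drop readings crowd at the low end
v_equal = [v_max * (i + 1) / n for i in range(n)]
dp_a = [k * v**2 for v in v_equal]

# (b) Equally spaced pressure drops -> solve V = sqrt(dP/k) for the set points
dp_equal = [k * v_max**2 * (i + 1) / n for i in range(n)]
v_b = [(dp / k) ** 0.5 for dp in dp_equal]
```

Printing `dp_a` against `dp_equal` makes the crowding argument of the text visible: the first grid concentrates readings at low ΔP, while the second spreads them uniformly over the response range.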
Problems Pr. 3.1 Consider the data given in Table 3.2. Determine: (a) The 10% trimmed mean value. (b) Which observations can be considered to be “mild” outliers (>1.5 × IQR)? (c) Which observations can be considered to be “extreme” outliers (>3.0 × IQR)? (d) Identify outliers using Chauvenet’s criterion given by Eq. 3.20. Compare them with those obtained by using Table 3.7. (e) Compare the results from (b), (c), and (d). (f) Generate a few plots (such as Fig. 3.16) using your statistical software. Other types of plots can be generated but they should be relevant.
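For Pr. 3.1, the 10% trimmed mean and the IQR-based outlier fences can be sketched as below; since Table 3.2 is not reproduced here, the sample values are hypothetical placeholders.

```python
import statistics

def trimmed_mean(data, frac=0.10):
    """Mean after dropping frac of the points from each tail (10% trim here)."""
    xs = sorted(data)
    k = int(len(xs) * frac)
    return statistics.mean(xs[k:len(xs) - k])

def iqr_fences(data, mult=1.5):
    """Outlier fences at Q1 - mult*IQR and Q3 + mult*IQR
    (mult = 1.5 for 'mild', 3.0 for 'extreme' outliers)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - mult * iqr, q3 + mult * iqr

# Hypothetical sample (Table 3.2 itself is not reproduced here)
sample = [10.1, 10.4, 9.8, 10.2, 10.0, 9.9, 10.3, 25.0, 10.2, 9.7]
tm = trimmed_mean(sample)
lo, hi = iqr_fences(sample, 1.5)
mild = [x for x in sample if x < lo or x > hi]
```

Note that `statistics.quantiles` uses the "exclusive" quartile convention by default; other conventions shift the fences slightly, which is worth stating when reporting flagged outliers.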
Pr. 3.2 Consider the data given in Table 3.6. Perform an exploratory data analysis involving pertinent statistical summary measures and generate at least three pertinent graphical plots along with a discussion of findings. Pr. 3.3 A nuclear power facility produces a vast amount of heat that is usually discharged into the aquatic system. This heat raises the temperature of the aquatic system, resulting in a greater concentration of chlorophyll, which in turn extends the growing season. To study this effect, water samples were collected monthly at three stations for one year. Station A is located closest to the hot water discharge, and Station C the farthest (Table 3.12). You are asked to perform the following tasks with the time-series data and annotate with pertinent comments: (a) Flag any outlier points: (i) visually, (ii) using box-and-whisker plots, and (iii) following Chauvenet's criterion. (b) Compute pertinent statistical descriptive measures (after removing outliers).
Table 3.12 Data table for Problem 3.3

Month       Station A   Station B   Station C
January       9.867       3.723       4.410
February     14.035       8.416      11.100
March        10.700      12.723       4.470
April        13.853       9.168       8.010
May           7.067       4.778      14.080
June         11.670       9.145       8.990
July          7.357       8.463       3.350
August        3.358       4.086       4.500
September     4.210       4.233       6.830
October       3.630       2.320       5.800
November      2.953       3.843       3.480
December      2.640       3.610       3.020

Data available electronically on book website.

(c) Generate at least two pertinent graphical plots and discuss the relevance of these plots in terms of the insights they provide. (d) Compute the covariance and correlation coefficients between the three stations and draw relevant conclusions (do this both without and with outlier rejection).

Pr. 3.4 Consider a basic indirect heat exchanger where the heat exchange rates associated with the cold and hot fluid flow sides are given by:

Q_actual = m_c c_pc (T_c,o − T_c,i)   (cold side, heating)
Q_actual = m_h c_ph (T_h,i − T_h,o)   (hot side, cooling)        (3.37)

where m, T, and c_p are the mass flow rate, temperature, and specific heat, respectively. The subscripts o and i stand for outlet and inlet, and c and h denote the cold and hot streams, respectively. Assume the values and uncertainties of the various parameters shown in Table 3.13:

Table 3.13 Parameters and uncertainties to be assumed (Pr. 3.4)

Parameter   Nominal value     95% Uncertainty
c_pc        1 Btu/lb·°F       ±5%
m_c         475,800 lb/h      ±10%
T_c,i       34 °F             ±1 °F
T_c,o       46 °F             ±1 °F
c_ph        0.9 Btu/lb·°F     ±5%
m_h         450,000 lb/h      ±10%
T_h,i       55 °F             ±1 °F
T_h,o       40 °F             ±1 °F
From ASHRAE G-2 (2005) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org.
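As a numerical sketch related to Pr. 3.4(a), the nominal loads and the combined 95% fractional uncertainties of Q = m·c_p·ΔT follow from the Table 3.13 values, treating all inputs as uncorrelated:

```python
import math

# Nominal values from Table 3.13 (flows in lb/h, temperatures in °F, cp in Btu/lb-°F)
m_c, cp_c, t_ci, t_co = 475_800.0, 1.0, 34.0, 46.0
m_h, cp_h, t_hi, t_ho = 450_000.0, 0.9, 55.0, 40.0

q_cold = m_c * cp_c * (t_co - t_ci)   # Btu/h, cold-side heating
q_hot = m_h * cp_h * (t_hi - t_ho)    # Btu/h, hot-side cooling

# For Q = m*cp*dT with independent inputs, fractional uncertainties add
# in quadrature; each dT combines two ±1 °F temperature uncertainties.
u_dt = math.sqrt(1.0**2 + 1.0**2)
frac_cold = math.sqrt(0.10**2 + 0.05**2 + (u_dt / (t_co - t_ci))**2)
frac_hot = math.sqrt(0.10**2 + 0.05**2 + (u_dt / (t_hi - t_ho))**2)
```

The resulting fractional uncertainties (roughly 15–16% on each side) are what make the heat-balance check of part (b) inconclusive unless the two loads differ by more than their combined uncertainty bands.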
(a) Compute the heat exchanger loads and the uncertainty ranges for the hot and cold sides assuming all variables to be uncorrelated. (b) What would you conclude regarding the uncertainty around the heat balance checks? (c) It has been found that the inlet hot fluid temperature is correlated with the hot fluid mass flow rate with a correlation coefficient of −0.6 (i.e., when the hot fluid flow rate increases, the corresponding inlet temperature decreases). How would your results for (a) and (b) change? Provide a short discussion. Hint: The standard deviations can be taken to be half of the 95% uncertainty values listed in Table 3.13.

Pr. 3.5 Consider Example 3.7.4 where the uncertainty analysis on chiller COP was done at full-load conditions. What about part-load conditions, especially since there is no collected data? One could use data from chiller manufacturer catalogs for a similar type of chiller, or one could assume that part-load operation will affect the inlet minus the outlet chilled water temperature (ΔT) in a proportional manner, as stated below. (a) Compute the 95% CL uncertainty in the COP at 70% and 40% full load assuming the evaporator water flow rate to be constant. At part load, the evaporator temperature difference is reduced proportionately to the chiller load, while the electric power drawn is assumed to increase from a full-load value of 0.8 kW/t to 1.0 kW/t at 70% full load and to 1.2 kW/t at 40% full load. (b) Would the instrumentation be adequate, or would it be prudent to consider better instrumentation if the fractional COP uncertainty at 95% CL is to be less than 10%? (c) Note that fixed (bias) errors have been omitted from the analysis, and some of the assumptions in predicting part-load chiller performance can be questioned. A similar exercise with slight variations in some of the assumptions, called a sensitivity study, would be prudent at this stage. How would you conduct such an investigation?

Pr. 3.6 Determining cooling coil degradation based on effectiveness
The thermal performance of a cooling coil can also be characterized by the concept of effectiveness widely used for thermal modeling of traditional heat exchangers. In such coils, a stream of humid air flows across a coil supplied by chilled water and is cooled and dehumidified as a result. In this case, the effectiveness can be determined as:

ε = (actual heat transfer rate)/(maximum possible heat transfer rate) = (h_ai − h_ao)/(h_ai − h_ci)        (3.38)

where h_ai and h_ao are the enthalpies of the air stream at the inlet and outlet, respectively, and h_ci is the enthalpy of the entering chilled water. The effectiveness is independent of the operating conditions provided the mass flow rates of air and chilled water remain constant. An HVAC engineer would like to determine whether the coil has degraded after it has been in service for a few years. For this purpose, she measures the current coil performance at air and water flow rates identical to those when originally installed, as shown in Table 3.14. Note that the uncertainties in determining the air enthalpies are relatively large due to the uncertainty associated with measuring bulk air stream temperatures and humidities. However, the uncertainty in the enthalpy of the chilled water is only half of that of air.

Table 3.14 Data table for Problem 3.6

Quantity                        Units    When installed   Current   95% Uncertainty
Entering air enthalpy (h_ai)    Btu/lb   38.7             36.8      5%
Leaving air enthalpy (h_ao)     Btu/lb   27.2             28.2      5%
Entering water enthalpy (h_ci)  Btu/lb   23.2             21.5      2.5%

(a) Assess, at 95% CL, whether the cooling coil has degraded or not. Clearly state any assumptions you make during the evaluation. (b) What are the relative contributions of the uncertainties in the three enthalpy quantities to the uncertainty in the effectiveness value? Do these differ from the installed period to the time when the current tests were performed?

Pr. 3.7 Table 3.15 assembles values of the total electricity generated by five different types of primary energy sources and their associated total emissions (EIA 1999). Clearly, coal and oil generate a lot of emissions of pollutants, which are harmful not only to the environment but also to public health.
Table 3.15 Data table for Problem 3.7: US power generation mix and associated pollutant emissions (emissions in short tons; 1 short ton = 2000 lb)

Fuel          kWh (1999)   % Total   SO2        NOx        CO2
Coal          1.77E+12     55.7      1.13E+07   6.55E+06   1.90E+09
Oil           8.69E+10     2.7       6.70E+05   1.23E+05   9.18E+07
Natural gas   2.96E+11     9.3       2.00E+03   3.76E+05   1.99E+08
Nuclear       7.25E+11     22.8      0.00E+00   0.00E+00   0.00E+00
Hydro/Wind    3.00E+11     9.4       0.00E+00   0.00E+00   0.00E+00
Totals        3.18E+12     100.0     1.20E+07   7.05E+06   2.19E+09
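The normalization suggested in the hint of Pr. 3.7(a) can be sketched for SO2 alone; in this rough sketch only coal is assumed to emit SO2 in the France-style mix (nuclear is emission-free in Table 3.15, and the small oil/gas contributions are neglected):

```python
# Table 3.15 totals (emissions in short tons)
us_kwh = 3.18e12
so2_total, nox_total, co2_total = 1.20e7, 7.05e6, 2.19e9

# Normalized emission rates per kWh for the current US mix (the hint's first step)
so2_per_kwh = so2_total / us_kwh
nox_per_kwh = nox_total / us_kwh
co2_per_kwh = co2_total / us_kwh

# Coal-specific SO2 rate, then a France-style mix (21% coal, 79% nuclear)
rate_so2_coal = 1.13e7 / 1.77e12            # short tons SO2 per coal kWh
so2_france_mix = 0.21 * us_kwh * rate_so2_coal
so2_reduction = so2_total - so2_france_mix
pct_reduction = 100 * so2_reduction / so2_total
```

The same three lines, repeated with the NOx and CO2 columns, give the remaining reductions asked for in part (a); part (b) then layers the stated relative uncertainties on top.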
Data available electronically on book website.

France, on the other hand, has a mix of 21% coal and 79% nuclear. (a) Calculate the total and percentage reductions in the three pollutants should the United States change its power generation mix to mimic that of France (Hint: First normalize the emissions per kWh for all three pollutants). (b) The relative uncertainties of the three pollutants SO2, NOx, and CO2 are 10%, 12%, and 5%, respectively. Assuming lognormal distributions for all quantities, compute the uncertainty in the total reductions of the three pollutants estimated in (a) above.

Pr. 3.8 Uncertainty in savings from energy conservation retrofits
There is great interest in implementing retrofit measures meant to conserve energy in individual devices as well as in buildings. These measures must be justified economically, and including uncertainty in the estimated energy savings is an important element of the analysis. Consider the rather simple problem involving replacing an existing electric motor with a more energy-efficient one. The annual energy savings E_save in kWh/year are given by:

E_save = (0.746)(HP)(Hours)(1/η_old − 1/η_new)        (3.39)

with the symbols described in Table 3.16 along with their numerical values. (a) Determine the absolute and relative uncertainties in E_save under these conditions. (b) If this uncertainty had to be reduced, which variable would you target for further refinement?

Table 3.16 Data table for Problem 3.8

Symbol   Description                             Value   95% Uncertainty
HP       Horsepower of the end-use device        40      5%
Hours    Number of operating hours in the year   6500    10%
η_old    Efficiency of the old motor             0.85    4%
η_new    Efficiency of the new motor             0.92    2%
Fig. 3.37 Sketch of an all-air HVAC system supplying conditioned air to the indoor zones/rooms of a building (Problem 3.9). The labeled streams are outdoor air (OA), return air (RA), and the mixed air (MA) supplied to the building zones via the air-handler unit
(c) What is the minimum value of η_new at which the lower bound of the 95% CL interval of E_save is greater than zero?

Pr. 3.9 Uncertainty in estimating the outdoor air fraction in HVAC systems
Ducts in heating, ventilating, and air-conditioning (HVAC) systems supply conditioned air (SA) to the various spaces in a building, and also exhaust the air from these spaces, called return air (RA). A sketch of an all-air HVAC system is shown in Fig. 3.37. Occupant health requires that a certain amount of outdoor air (OA) be brought into the HVAC system while an equal amount of return air is exhausted to the outdoors. The OA and the RA mix at a point just before the air-handler unit. Outdoor air ducts have dampers installed to control the OA since excess OA leads to unnecessary energy wastage. One of the causes of recent complaints from occupants has been identified as inadequate OA, and sensors installed inside the ducts could modulate the dampers accordingly. Flow measurement is always problematic on a continuous basis. Hence, the OA flow is inferred from measurements of the air temperature TR inside the RA stream, TO inside the OA stream, and TM inside the mixed air (MA) stream. The supply air is deduced by measuring the fan speed with a tachometer,
using a differential pressure gauge to measure the static pressure rise, and using the manufacturer's equation for the fan curve. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error. (a) From a sensible heat balance (with changes in specific heat with temperature neglected), derive the following expression for the outdoor air fraction (ratio of outdoor air to mixed air): OA_f = (T_R − T_M)/(T_R − T_O). (b) Derive the expression for the uncertainty in OA_f and calculate the 95% CL in OA_f if T_R = 70 °F, T_O = 90 °F, and T_M = 75 °F.

Pr. 3.10 Sensor placement in HVAC ducts with consideration of flow nonuniformity
Consider the same situation as in Pr. 3.9. Usually, the air ducts have large cross-sections. The problem with inferring outdoor air flow using temperature measurements is the large thermal nonuniformity usually present in these ducts due to both stream separation and turbulence effects. Moreover, temperature (and, hence, density) differences between the OA and MA streams result in poor mixing. Table 3.17 gives the results of a traverse in the mixed air duct with nine measurements, using an equally spaced 3 × 3 grid whose sections are designated S#1–S#9. The measurements were replicated four times under the same outdoor conditions. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error.

Table 3.17 Temperature readings (in °F) at the nine sections (S#1–S#9) of the mixed air (MA) duct (Pr. 3.10); each cell lists the four replicate readings

S#1: 55.6, 54.6, 55.8, 54.2   S#2: 56.3, 58.5, 57.6, 63.8   S#3: 53.7, 50.2, 59.0, 49.4
S#4: 66.4, 67.8, 68.7, 67.6   S#5: 58.0, 62.4, 62.3, 65.8   S#6: 61.2, 56.3, 64.7, 58.8
S#7: 63.5, 65.0, 63.6, 64.8   S#8: 67.4, 67.4, 66.8, 65.7   S#9: 63.9, 61.4, 62.4, 60.6

Data available electronically on book website.

Determine: (a) The worst and best grid locations for placing a single sensor (to be determined based on analyzing the recordings at each of the nine grid locations and for all four time periods). (b) The maximum and minimum errors at 95% CL one could expect in the average temperature across the duct cross-section, if the best grid location for the single sensor was adopted.

Pr. 3.11 Consider the uncertainty in the heat transfer coefficient illustrated in Example 3.7.2. The example was solved analytically using the Taylor series approach. You are asked to solve the same example using the Monte Carlo (MC) method: (a) Using 100 data points. (b) Using 1000 data points. Compare the results from this approach with those in the solved example. Also determine the 2.5th and the 97.5th percentile bounds and compare results.

Pr. 3.12 You will repeat Example 3.7.3 involving uncertainty in exponential growth using the Monte Carlo (MC) method. (a) Instead of computing the standard deviation, plot the distribution of the time variable t to evaluate its shape for 100 trials. Generate probability plots (such as q–q plots) against a couple of the promising distributions. (b) Determine the mean, median, 25th percentile, 75th percentile, and the 2.5th and 97.5th percentiles. (c) Compare the 2.5th and 97.5th percentile values to the corresponding values in Example 3.7.3. (d) Generate the box-and-whisker plot and compare it with the results of (b).

Pr. 3.13 In 2015, the United States had about 1000 GW of installed electricity generation capacity. (a) Assuming an exponential electric growth rate of 1% per year, what would be the needed electricity generation capacity in 2050? (b) If the penetration target set for renewables and energy conservation is 75% of the electricity capacity in 2050, what should be their needed annual growth rate (taken to be exponential)? Assume an initial value of 14% renewable capacity in 2015 (which includes hydropower). (c) If the electric growth rate assumed in (a) has an uncertainty of 15% (taken to be normally distributed), calculate the uncertainty associated with the growth rate computed in (b).

Pr. 3.14 Uncertainty in the estimation of biological dose over time for an individual
Consider an occupant inside a building in which a biological agent has been accidentally released. The dose (D) is the cumulative amount of the agent to which the human body is subjected, while the response is the measurable physiological change produced by the agent.
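Before turning to the dose problem, the growth-rate arithmetic of Pr. 3.13(a) and (b) is easily checked, assuming continuous exponential growth N(t) = N0·e^(r·t):

```python
import math

# Pr. 3.13 sketch: exponential growth N(t) = N0 * exp(r * t)
n0 = 1000.0        # GW installed in 2015
r = 0.01           # 1% per year
years = 35         # 2015 -> 2050

# (a) Capacity needed in 2050
cap_2050 = n0 * math.exp(r * years)

# (b) Required renewables growth rate: from 14% of 1000 GW in 2015
# to 75% of the 2050 capacity, over the same 35 years
ren_2015 = 0.14 * n0
ren_2050 = 0.75 * cap_2050
r_renew = math.log(ren_2050 / ren_2015) / years
```

Part (c) can then reuse this calculation inside a Monte Carlo loop, perturbing r by its stated 15% uncertainty and observing the spread in r_renew.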
The widely accepted approach for quantifying dose is to assume functional forms based on first-order kinetics. For biological and radiological agents
where the process of harm being done is cumulative, one can use Haber's law (Heinsohn and Cimbala 2003):

D(t) = k ∫_{t1}^{t2} C(t) dt        (3.40)

where C(t) is the indoor concentration at a given time t, k is a constant that includes effects such as the occupant breathing rate, the absorption efficiency of the agent or species, etc., and t1 and t2 are the start and end times. This relationship is often used to determine health-related exposure guidelines for toxic substances. For a simple one-zone building, the free response, i.e., the temporal decay given in terms of the initial concentration C(t1), is determined by:

C(t) = C(t1) exp[−a(t − t1)]        (3.41)

where the model parameter a is a function of the volume of the space and the outdoor and supply air flow rates. The above equation is easy to integrate during any time period from t1 to t2, thus providing a convenient means of computing the total occupant inhaled dose when occupants enter or leave the contaminated zones at arbitrary times. Let a = 0.017186 with 11.7% uncertainty while C(t1) = 7000 cfu/m³ (cfu: colony forming units) with 15% uncertainty. Assume k = 1. (a) Determine the total dose to which the individual is exposed at the end of 15 min. (b) Compute the uncertainty of the corresponding total dose in absolute and relative terms. Solve this problem following both the numerical partial derivative method and the Monte Carlo (MC) method and compare results.

Pr. 3.15 Propagation of optical and tracking errors in solar concentrators
Solar concentrators are optical devices meant to increase the incident solar radiation flux density (power per unit area) on a receiver. Separating the solar collection component (viz., the reflector) and the receiver can allow heat losses per collection area to be reduced. This would result in higher fluid operating temperatures at the receiver. However, there are several sources of errors that lead to optical losses:
(i) Due to nonspecular or diffuse reflection from the reflector, which could be due to improper curvature of the reflector surface during manufacture (shown in Fig. 3.38a), or to progressive dust accumulation over the surface as the system operates in the field.
(ii) Due to tracking errors arising from improper tracking mechanisms as a result of improper alignment sensors or nonuniformity in drive mechanisms (usually, the tracking is not continuous; a sensor activates a motor every few minutes, which realigns the reflector to the solar disk as it moves in the sky). The result is a spread in the reflected radiation as illustrated in Fig. 3.38b.
(iii) Improper reflector and receiver alignment during the initial mounting of the structure or due to small ground/pedestal settling over time.
The above errors are characterized by root mean square random errors (or RMSE) and their combined effect can be determined statistically following the basic propagation of errors formula. Bias errors such as that arising from structural
Fig. 3.38 Different types of optical and tracking errors (Problem 3.15). (a) Micro-roughness in the solar concentrator surface leads to a spread in the reflected radiation; the roughness is illustrated with a dotted line for the ideal reflector surface and a solid line for the actual surface. (b) Tracking errors lead to a spread in the incoming solar radiation, shown as a normal distribution. Note that a tracker error of σ_track results in a reflection error σ_reflec = 2σ_track from Snell's law. The factor of 2 also applies to other sources where the error occurs as light both enters and leaves the optical device (see Eq. 3.42)
Table 3.18 Data table for Problem 3.15

Component    Source of error         RMSE (fixed value)   Variation over time
Solar disk   Finite angular size     9.6 mrad             –
Reflector    Curvature manufacture   1.0 mrad             –
Reflector    Dust buildup            –                    0–2 mrad
Tracker      Sensor misalignment     2.0 mrad             –
Tracker      Drive nonuniformity     –                    0–10 mrad
Receiver     Misalignment            2.0 mrad             –
mismatch can be partially corrected by one-time or regular corrections and are not considered. Note that these errors need not be normally distributed, but such an assumption is often made in practice. Thus, RMSE values representing the standard deviations of these errors are often used for such types of analysis. (a) You will analyze the absolute and relative effects of this source of radiation spread at the receiver considering the various other optical errors described above, using the numerical values shown in Table 3.18:

σ_total spread = [(σ_solar disk)² + (2σ_manuf)² + (2σ_dust build)² + (2σ_sensor)² + (2σ_drive)² + (σ_rec misalign)²]^(1/2)        (3.42)

(b) Plot the variation of the total error as a function of the tracker drive nonuniformity error for three discrete values of dust buildup (0, 1, and 2 mrad). Note that the finite angular size of the solar disk results in incident solar rays that are not parallel but subtend an angle of about 33 min of arc, or 9.6 mrad.
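Equation 3.42 with the fixed Table 3.18 values can be sketched as a small function of the two time-varying errors (dust buildup and drive nonuniformity), which is all that part (b) needs to sweep:

```python
import math

def total_spread(sigma_dust, sigma_drive):
    """Eq. 3.42 with the fixed RMSE values of Table 3.18 (all in mrad);
    dust buildup and drive nonuniformity are the time-varying inputs."""
    sigma_sun, sigma_manuf, sigma_sensor, sigma_rec = 9.6, 1.0, 2.0, 2.0
    return math.sqrt(sigma_sun**2 + (2 * sigma_manuf)**2
                     + (2 * sigma_dust)**2 + (2 * sigma_sensor)**2
                     + (2 * sigma_drive)**2 + sigma_rec**2)

best = total_spread(0.0, 0.0)      # clean reflector, perfect drive
worst = total_spread(2.0, 10.0)    # maximum dust and drive errors
```

Because the terms add in quadrature, the 9.6 mrad solar-disk term dominates until the drive error grows to a few milliradians, after which the (2σ_drive)² term takes over, which is the behavior the requested plot reveals.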
References

Abbas, M., and J.S. Haberl, 1994. Development of indices for browsing large building energy databases, Proc. Ninth Symp. Improving Building Systems in Hot and Humid Climates, pp. 166–181, Dallas, TX, May.
ASHRAE G-14, 2002. Guideline 14-2002: Measurement of Energy and Demand Savings, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASHRAE G-2, 2005. Guideline 2-2005: Engineering Analysis of Experimental Data, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASME PTC 19.1, 2018. Test Uncertainty, American Society of Mechanical Engineers, New York, NY.
Ayyub, B.M. and R.H. McCuen, 1996. Numerical Methods for Engineers, Prentice-Hall, Upper Saddle River, NJ.
Baltazar, J.C., D.E. Claridge, J. Ji, H. Masuda and S. Deng, 2012. Use of First Law Energy Balance as a Screening Tool for Building Energy Data, Part 2: Experiences on its Implementation as a Data Quality Control Tool, ASHRAE Trans., Vol. 118, Pt. 1, Conf. Paper CH-12-C021, pp. 167–174, January.
Braun, J.E., S.A. Klein, J.W. Mitchell and W.A. Beckman, 1989. Methodologies for optimal control of chilled water systems without storage, ASHRAE Trans., 95(1), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
Cleveland, W.S., 1985. The Elements of Graphing Data, Wadsworth and Brooks/Cole, Pacific Grove, CA.
Devore, J., and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
Doebelin, E.O., 1995. Measurement Systems: Application and Design, 4th Ed., McGraw-Hill, New York, NY.
EIA, 1999. Electric Power Annual 1999, Vol. II, October 2000, DOE/EIA-0348(99)/2, Energy Information Administration, US DOE, Washington, D.C. http://www.eia.doe.gov/eneaf/electricity/epav2/epav2.pdf
Glaser, D. and S. Ubbelohde, 2001. Visualization for time dependent building simulation, 7th IBPSA Conference, pp. 423–429, Rio de Janeiro, Brazil, Aug. 13–15.
Haberl, J.S. and M. Abbas, 1998. Development of graphical indices for viewing building energy data: Part I and Part II, ASME J. Solar Energy Engg., Vol. 120, pp. 156–167.
Hawkins, D., 1980. Identification of Outliers, Chapman and Hall, Kluwer Academic Publishers, Boston/Dordrecht/London.
Heiberger, R.M. and B. Holland, 2015. Statistical Analysis and Data Display, 2nd Ed., Springer, New York, NY.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey (Eds.), 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York, NY.
Holman, J.P. and W.J. Gajda, 1984. Experimental Methods for Engineers, 5th Ed., McGraw-Hill, New York, NY.
Keim, D. and M. Ward, 2003. Chap. 11: Visualization, in Intelligent Data Analysis, M. Berthold and D.J. Hand (Eds.), 2nd Ed., Springer-Verlag, Berlin, Germany.
Reddy, T.A., J.K. Kreider, P.S. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings, 3rd Ed., CRC Press, Boca Raton, FL.
Reddy, T.A., 1990. Statistical analyses of electricity use during the hottest and coolest days of summer for groups of residences with and without air-conditioning, Energy, Vol. 15(1), pp. 45–61.
Schenck, H., 1969. Theories of Engineering Experimentation, 2nd Ed., McGraw-Hill, New York, NY.
Tufte, E.R., 1990. Envisioning Information, Graphics Press, Cheshire, CT.
Tufte, E.R., 2001. The Visual Display of Quantitative Information, 2nd Ed., Graphics Press, Cheshire, CT.
Tukey, J.W., 1970. Exploratory Data Analysis, Vol. 1, Addison-Wesley, Reading, MA.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (Ed.), Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
Wang, Z., T. Parkinson, P. Li, B. Lin and T. Hong, 2019. The squeaky wheel: Machine learning for anomaly detection in subjective thermal comfort votes, Building and Environment, Vol. 151, pp. 219–227, Elsevier.
Wonnacott, R.J. and T.H. Wonnacott, 1985. Introductory Statistics, 4th Ed., John Wiley & Sons, New York, NY.
4 Making Statistical Inferences from Samples
Abstract
This chapter covers various concepts and statistical methods for inferring reliable parameter estimates about a population from sample data, using knowledge of probability and probability distributions. More specifically, such statistical inferences involve point estimation, confidence interval estimation, hypothesis testing of means and variances from two or more samples, analysis of variance methods, goodness-of-fit tests, and correlation analysis. The basic principle of inferential statistics is that a random sample (or subset) drawn from a population tends to exhibit the same properties as those of the entire population. Traditional single and multiple parameter estimation techniques for sample means, variances, correlation coefficients, and empirical distributions are presented. This chapter covers the popular single-factor ANOVA technique, which allows one to test whether the mean values of data taken from several different groups are essentially equal or not, i.e., whether the samples emanate from different populations or are essentially from the same population. Also treated are nonparametric statistical procedures, best suited for ordinal data or for noisy data, which do not assume a specific probability distribution from which the sample(s) is taken. They are based on relatively simple heuristic ideas and are generally more robust but, on the other hand, generally less powerful and efficient. How prior information on the population (i.e., the Bayesian approach) can be used to make sounder statistical inferences from samples and for hypothesis testing problems is also discussed. Further, various types of sampling methods are described, followed by a discussion of estimators and their desirable properties. Finally, resampling methods, which reuse an available sample drawn from a population multiple times to make statistical inferences of parameter estimates, are treated; though computationally expensive, they are more intuitive, conceptually simple, and versatile, and allow robust point and interval estimation. Not surprisingly, they have become indispensable techniques in modern-day statistical analyses.
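The resampling idea previewed at the end of the abstract can be illustrated with a minimal bootstrap sketch. The code below uses only the Python standard library; the data values are hypothetical and purely for illustration:

```python
import random
import statistics

random.seed(1)

# A single observed sample (hypothetical data, for illustration only)
sample = [12.1, 9.8, 11.4, 10.2, 13.5, 9.9, 10.8, 12.7, 11.1, 10.4]

def bootstrap_means(data, n_boot=10000):
    """Resample the data with replacement and collect the mean of each resample."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(n_boot)]

boots = sorted(bootstrap_means(sample))
# Percentile-based 95% confidence interval for the population mean
lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots))]
print(f"sample mean = {statistics.fmean(sample):.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```

The same sample is reused thousands of times, which is exactly the trade of computation for analytical simplicity described above.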
4.1 Introduction
The primary reason for resorting to sampling as against measuring the whole population is to reduce expense, to make quick decisions (say, in the case of a real-time production process), or because, often, it is simply impossible to do otherwise. Random sampling, the most common form of sampling, involves selecting samples (or subsets) from the population in a random and independent manner. If done correctly, it reduces or eliminates bias while enabling inferences to be made about the population from the sample. Such inferences are usually made on distributional characteristics or parameters such as the mean value or the standard deviation. Estimators are mathematical formulae or expressions applied to sample data to deduce the estimate of the true parameter (i.e., its numerical value). For example, Eqs. (3.4a) and (3.8) in Chap. 3 are the estimators for deducing the mean and standard deviation of a data set. Given the uncertainty involved, the point estimates must be specified along with statistical confidence intervals (CI), albeit under the presumption that the samples are random. Methods of doing so are treated in Sects. 4.2 and 4.3 for single and multiple samples involving single parameters, and in Sect. 4.4 involving multiple parameters. Unfortunately, certain unavoidable, or even undetected, biases may creep into the supposedly random sample, and this could lead to improper or biased inferences. This issue, as well as a more complete discussion of sampling and sampling design, is covered in Sect. 4.7. Resampling methods, which are techniques that involve reanalyzing an already drawn sample, are discussed in Sect. 4.8.

Parameter tests on population estimates assume that the sample data are random and independently drawn, and thus treated as random variables. The sampling fraction, in the case of finite populations, is very much smaller than the population size (often by about one to three orders of magnitude). Further, the data of the random variable are assumed to be close to normally distributed. There is an entire field of inferential statistics based on nonparametric or distribution-free tests which can be applied to population data with unknown probability distributions. Though nonparametric tests are encumbered by fewer restrictive assumptions and are easier to apply and understand, they are less efficient than parametric tests (in that their uncertainty intervals are wider). These are briefly discussed in Sect. 4.5, while Bayesian inferencing, whereby one uses prior information to enhance the inference-making process, is addressed in Sect. 4.6.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/9783031348693_4
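The estimators mentioned above (the sample mean and standard deviation of Eqs. 3.4a and 3.8) reduce to a few lines of code. The sketch below uses only the Python standard library; the twelve measurement values are hypothetical, for illustration only:

```python
import math
import statistics

# Hypothetical sample of 12 measurements (illustrative values only)
data = [4.2, 3.9, 4.5, 4.1, 3.8, 4.4, 4.0, 4.3, 3.7, 4.6, 4.1, 4.2]

mean = statistics.fmean(data)          # point estimate of the population mean
s = statistics.stdev(data)             # point estimate of the population std deviation
se = s / math.sqrt(len(data))          # standard error of the mean, using s for sigma

print(f"mean = {mean:.3f}, sd = {s:.3f}, SE = {se:.3f}")
```

The standard error computed on the last line is the quantity that the confidence intervals of the next section are built around.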
4.2 Basic Univariate Inferential Statistics

4.2.1 Sampling Distribution and Confidence Interval of the Mean
(a) Sampling distribution of the mean
Consider a huge box filled with N round balls of unknown but similar diameter (the population). Let μ be the population mean and σ the population standard deviation of the ball diameter. If a sample of n balls is drawn and recognizing that the mean diameter X̄ of the sample will vary from sample to sample, what can one say about the distribution of X̄? It can be shown that the sample mean X̄ would behave like a normally distributed random variable such that its expected value is:

E(X̄) = μ    (4.1)

and the standard error (SE) of X̄ is:

SE(X̄) = σ / n^(1/2)    (4.2)

(Note that the standard deviation is a measure of the variability within a sample, while the SE is a measure of the variability of the mean between samples.) Since commonly the population standard deviation σ is not known, the sample standard deviation sx (given by Eq. 3.8) can be used instead. In case the population sample is small, and sampling is done without replacement, then Eq. (4.2) is modified to:

SE(X̄) = (sx / n^(1/2)) × [(N − n)/(N − 1)]^(1/2)    (4.3)

where N is the population size, and n the sample size. Note that if N ≫ n, one effectively gets back Eq. (4.2).

The parameters (say, e.g., the mean) of a single sample are unlikely to be equal to those of the population. Since the sample mean values vary from sample to sample, this pattern of variation is referred to as the sampling distribution of the mean. Can the information contained in the sampling distribution be reliably extended to provide estimates of the population mean? The answer is yes. However, since randomness is involved from one sample to another, the answer can only be framed in terms of probabilities or confidence levels. In summary, the confidence interval (CI) of the population parameter estimates is based on sample size; the larger the better. It also depends on the variance in outcomes of different samples or trials. The larger the variance from sample to sample, the larger ought to be the sample size for the same degree of confidence. The distribution of the sample data is presumed to be Gaussian or normal. Such an assumption is based on the Central Limit Theorem (one of the most important theorems in statistical theory), which states that if independent random samples of n observations are selected with replacement from a population with any arbitrary distribution, then the distribution of the sample means X̄ will approximate a Gaussian distribution provided n is sufficiently large (n > 30). The larger the sample n, the closer the sampling distribution approximates the Gaussian (Fig. 4.1). A consequence of the theorem is that it leads to a simple method of computing approximate probabilities of sums of independent random variables. It explains the remarkable fact that the empirical frequencies of so many natural "populations" exhibit bell-shaped (i.e., normal or Gaussian) curves. Let X1, X2, . . ., Xn be a sequence of independent identically distributed random variables with population mean μ and variance σ². Then the distribution of the random variable Z (Sect. 2.4.3):

Z = (X̄ − μ) / (σ / n^(1/2))    (4.4)
approaches the standard normal as n tends toward infinity. Note that this theorem is valid for any distribution of X; therein lies its power. (That the sum of two Gaussian distributions from a population would be another Gaussian variable, a property called invariance under addition, is somewhat intuitive. Why the sum of two non-Gaussian distributions should gradually converge to a Gaussian is less so, and hence the importance of this theorem.) Probabilities for random quantities can be found by determining areas under the standard normal curve as described in Sect. 2.4.3. Suppose one takes a random sample of size n from a population of mean μ and standard deviation σ. Then the random variable Z has (i) approximately the standard normal distribution if n > 30 regardless of the distribution of the population, and (ii) exactly the standard normal distribution if the population itself is normally distributed, regardless of the sample size (Fig. 4.1).

Fig. 4.1 Illustration of the central limit theorem. The sampling distribution of X̄ contrasted with the parent population distribution for three cases. The first case (left column of figures) shows sampling from a normal population. As sample size n increases, the standard error of X̄ decreases. The next two cases show that even though the populations are not normal, the sampling distribution still becomes approximately normal as n increases. (From Wonnacutt and Wonnacutt 1985 by permission of John Wiley and Sons)
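The behavior described by the Central Limit Theorem is easy to verify numerically. The short simulation below is a sketch using only the Python standard library; the exponential population and the sample sizes are illustrative choices, not from the text. It draws repeated samples from a markedly non-normal population and checks that the standard error of the sample means shrinks as σ/n^(1/2), per Eq. (4.2):

```python
import random
import statistics

random.seed(42)

def sampling_distribution(n, trials=20000):
    """Draw `trials` samples of size n from an exponential population
    (mean = std deviation = 1.0) and return the list of sample means."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

for n in (5, 30, 100):
    means = sampling_distribution(n)
    se_observed = statistics.stdev(means)   # spread of the sample means
    se_theory = 1.0 / n ** 0.5              # Eq. (4.2): SE = sigma / n^(1/2)
    print(f"n={n:3d}  observed SE={se_observed:.4f}  theoretical SE={se_theory:.4f}")
```

A histogram of `means` for n = 30 or 100 would look close to Gaussian even though the parent population is strongly skewed, which is the point of Fig. 4.1.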
Note that when sample sizes are small (n < 30) and the underlying distribution is unknown, the Student t-distribution, which has wider uncertainty bands (Sect. 2.4.3), should be used with (n − 1) degrees of freedom instead of the Gaussian (Fig. 2.15 and Table A.4). Unlike the z-curve, there are several t-curves depending on the degrees of freedom (d.f.). At the limit of infinite d.f., the t-curve collapses into the z-curve.

Fig. 4.2 Illustration of critical cutoff values between one-tailed and two-tailed tests for the standard normal distribution. The shaded areas, representing the probability values (p) corresponding to 95% CL or significance level α = 0.05, illustrate how these two types of tests differ in terms of the critical values (which are determined from Table A.3)
(b) One-tailed and two-tailed tests
An important concept needs to be clarified, namely, "when does one use a one-tailed as against a two-tailed test?" In the two-tailed test, one is testing whether the sample parameter is different (i.e., smaller or larger) than that of the stipulated population. In cases where one wishes to test whether the sample parameter is specifically larger (or specifically smaller) than that of the stipulated population, the one-tailed test is used. The tests are set up and addressed in like manner, the difference being in how the significance level, expressed as a probability (p) value, is finally determined. The smaller the p-value, the stronger the statistical evidence that the observed data did not happen by chance. The shaded areas of the normal distributions shown in Fig. 4.2 illustrate the difference in how the one-tailed and two-tailed types of tests have to be performed. For a significance level α of 0.05 or probability p = 0.05 that the observed value is lower than the mean value, the cutoff value is −1.645 for the one-tailed test and −1.96 for the two-tailed test, as indicated in Fig. 4.2. The nomenclature adopted to denote the critical cutoff values for two-tailed and one-tailed tests is zα/2 = −1.96 and zα = −1.645 respectively for 95% CL (which can be determined from Table A.3).

(c) Confidence interval for the mean
In the subsection above, the behavior of many samples, all taken from one population, was considered. Here, only one large random sample from a population is selected and analyzed to make an educated guess on parameter estimates of the population such as its mean and standard deviation. This process is called inductive reasoning, or arguing backwards from a set of observations to reach a reasonable hypothesis. However, the benefit provided by having to select only one sample of the population comes at a price: one must necessarily accept some uncertainty in the estimates. Based on a sample taken from a population:

(a) One can test whether the sample mean differs from a known population mean (this is covered in this subsection).
(b) One can deduce interval bounds of the population mean at a specified probability or confidence level, expressed as a confidence interval CI (covered in Sect. 4.2.2).

The concept of CI was introduced in Sect. 3.6.3 in reference to instrument errors. This concept, pertinent to random variables in general, is equally applicable to sampling. A 95% CI is traditionally interpreted as implying that there is a 95% chance that the difference between the sample and population mean values is contained within this interval (and that this conclusion may be erroneous 5% of the time). The range is obtained from the z-curve by finding the value at which the area under the curve (i.e., the probability) is equal to 0.95 (Fig. 4.2). From Table A.3, the corresponding two-tailed cutoff value zα/2 is 1.96 (corresponding to a probability value of [(1 − 0.95)/2] = 0.025). This implies that the probability is:

P(−1.96 < (X̄ − μ)/(sx/n^(1/2)) < 1.96) ≈ 0.95
or  X̄ − 1.96 sx/n^(1/2) < μ < X̄ + 1.96 sx/n^(1/2)    (4.5a)

for large samples (n > 30). The half-width of the 95% CI about the mean value is (1.96 sx/n^(1/2)) and is called the bound of the error of estimation. For small samples, instead of the random variable z, one uses the Student-t variable. Note that Eq. (4.5a) refers to the long-run bounds, i.e., in the long run, roughly 95% of the intervals will contain μ. If one is interested in predicting a single X value that has yet to be observed, one uses the following equation, and the term prediction interval (PI) is used (Devore and Farnum 2005):

PI(X) = X̄ ± tα/2 sx (1 + 1/n)^(1/2)    (4.6)

where tα/2 is the two-tailed cutoff value determined from the Student t-distribution at d.f. = (n − 1) at the desired confidence level, and sx is the sample standard deviation. It is apparent that the prediction intervals are much wider than the CI because the quantity "1" within the brackets of Eq. (4.6) will generally dominate (1/n). This means that there is a lot more uncertainty in predicting the value of a single observation X than there is in estimating the mean value μ.

Example 4.2.1 Evaluating manufacturer-quoted lifetime of light bulbs from sample data
A manufacturer of xenon light bulbs for street lighting claims that the distribution of the lifetimes of his best model has a mean μ = 16 years and a standard deviation σ = 2.0 years when the bulbs are lit for 12 h every day. Suppose that a city official wants to check the claim by purchasing a sample of 36 of these bulbs and subjecting them to tests that determine their lifetimes.
(i) Assuming the manufacturer's claim to be true, describe the sampling distribution of the mean lifetime of a sample of 36 bulbs.
Even though the shape of the distribution is unknown, the Central Limit Theorem suggests that the normal distribution can be used. Thus μ = x̄ = 16 and SE(x̄) = 2.0/36^(1/2) = 0.33 years. (The nomenclature of using X̄ or x̄ is somewhat confusing. The symbol X̄ is usually used to denote the random variable, while x̄ is used to denote one of its values.)
(ii) What is the probability that the sample purchased by the city official has a mean lifetime of 15 years or less?
The normal distribution N(16, 0.33) is drawn, and the darker shaded area to the left of x̄ = 15 as shown in Fig. 4.3 provides the probability of the city official observing a mean lifetime of 15 years or less.
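The numbers in Example 4.2.1 can be checked with a few lines of code. The sketch below uses only the Python standard library, evaluating the standard error from Eq. (4.2) and the shaded-area probability from the standard normal CDF (expressed via the error function):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 16.0, 2.0, 36          # manufacturer's claim and sample size
se = sigma / math.sqrt(n)             # Eq. (4.2): SE = 2.0/6 = 0.33 years
z = (15.0 - mu) / se                  # standardize x-bar = 15 via Eq. (4.4)
p = normal_cdf(z)                     # P(sample mean <= 15 years)

print(f"SE = {se:.3f} years, z = {z:.2f}, P(x-bar <= 15) = {p:.5f}")
```

The probability works out to roughly 0.1%, i.e., observing a sample mean of 15 years or less would be strong evidence against the manufacturer's claim.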
Fig. 4.3 Sampling distribution of X̄ for a normal distribution N(16, 0.33). Shaded area represents the probability of the mean life of the sample of bulbs being 15 years or less

1200 h, but the convention is to frame the alternative hypothesis as "different from" the null hypothesis.
Assume a sample size of n = 100 bulbs manufactured by the new process and set the significance or error level of the test to be α = 0.05. This is clearly a one-tailed test since the new bulb manufacturing process should have a longer life, not just a different one from that of the traditional process, and the critical value must be selected accordingly. The mean life x̄ of the sample of 100 bulbs is assumed to be normally distributed with mean value of 1200 and standard error σ/n^(1/2) = 300/100^(1/2) = 30. From the standard normal table (Table A.3), the one-tailed critical z-value is zα = 1.64. Recalling that the critical value is defined as:

zα = (x̄c − μ0) / (σ/n^(1/2))

leads to x̄c = 1200 + 1.64 × 300/(100)^(1/2) = 1249, or about 1250. Suppose testing of the 100 tubes yields a value of x̄ = 1260. As x̄ > x̄c, one would reject the null hypothesis at the 0.05 significance (or error) level. This is akin to jury trials where the null hypothesis is taken to be that the accused is innocent, and the burden of proof during hypothesis testing is on the alternate hypothesis, i.e., on the prosecutor to show overwhelming evidence of the culpability of the accused. If such overwhelming evidence is absent, the null hypothesis is preferentially favored. ■
There is another way of looking at this testing procedure (Devore and Farnum 2005):
(a) Null hypothesis H0 is true, but one has been exceedingly unlucky and got a very improbable sample with mean x̄. In other words, the observed difference turned out to be significant when, in fact, there is no real difference. Thus, the null hypothesis has been rejected erroneously. The innocent man has been falsely convicted;
(b) H0 is not true after all. Thus, it is no surprise that the observed x̄ value was so high, or that the accused is indeed culpable.
The second explanation is likely to be more plausible, but there is always some doubt because statistical decisions inherently contain probabilistic elements. In other words, statistical tests of hypothesis do not always yield conclusions with absolute certainty: they have inbuilt margins of error, just like jury trials are known to hand down wrong verdicts. Specifically, two types of errors can be distinguished:
(i) Concluding that the null hypothesis is false, when in fact it is true, is called a Type I error; it represents the probability α (i.e., the preselected significance level) of erroneously rejecting the null hypothesis. This is also called the false positive or "false alarm" rate. The upper normal distribution shown in Fig. 4.4 has a mean value of 1200 (equal to the population or claimed mean value) with a standard error of 30. The area to the right of
Fig. 4.4 The two kinds of error that occur in a classical test. Upper plot: if H0 is true (population distribution N(1200, 30)), the significance level α is the probability of erring by rejecting the true hypothesis H0 (Type I error), given by the area to the right of the critical value of about 1250. Lower plot: if Ha is true (sample distribution N(1260, 30)), β is the probability of erring by judging that the false hypothesis H0 is acceptable (Type II error), given by the area to the left of the critical value. The numerical values correspond to data from Example 4.2.2
the critical value of 1250 represents the probability of a Type I error occurring.
(ii) The flip side, i.e., concluding that the null hypothesis is true, when in fact it is false, is called a Type II error and represents the probability β of erroneously failing to reject the null hypothesis; this is also called the false negative rate. The lower plot of the normal distribution shown in Fig. 4.4 now has a mean of 1260 (the mean value of the sample) with a standard error of 30, while the area to the left of the critical value x̄c indicates the probability β of a Type II error.
The two types of error are inversely related, as is clear from the vertical line in Fig. 4.4 drawn through both plots. A decrease in the probability of one type of error is likely to result in an increase in the probability of the other. Unfortunately, one cannot simultaneously reduce both by selecting a smaller value of α. The analyst would select the significance level depending on the tolerance for, or seriousness of the consequences of, either type of error specific to the circumstance. Recall that the probability of making a Type I error is called the significance level of the test. The probability of correctly rejecting a false null hypothesis, equal to (1 − β), is referred to as the statistical power. The only way of reducing both types of
errors is to increase the sample size with the expectation that the standard error would decrease, and the sample mean would get closer to the population mean. One ﬁnal issue relates to the selection of the test statistic. One needs to distinguish between the following two instances: (i) if the population variance σ is known and for sample sizes n > 30, then the zstatistic is selected for performing the test along with the standard normal tables (as done for Example 4.2.2 above); (ii) if the population variance is unknown or if the sample size n < 30, then the tstatistic is selected (using the sample standard deviation sx instead of σ) for performing the test using Studentt tables with the appropriate degree of freedom.
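The various quantities of Example 4.2.2 and the two error probabilities depicted in Fig. 4.4 can be verified with a short script. This is a sketch using only the Python standard library; the critical z of 1.64 follows the example's rounding, and the alternative mean of 1260 is the observed sample mean used in the figure:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu0, sigma, n = 1200.0, 300.0, 100     # null-hypothesis mean, population sd, sample size
z_alpha = 1.64                         # one-tailed critical z (example's rounding of 1.645)

se = sigma / math.sqrt(n)              # standard error = 30
x_crit = mu0 + z_alpha * se            # critical mean life, about 1249

x_bar = 1260.0                         # observed sample mean
reject_h0 = x_bar > x_crit             # one-tailed decision rule

# Type I error: area above x_crit under the null distribution N(1200, 30)
alpha = 1.0 - normal_cdf((x_crit - mu0) / se)
# Type II error: area below x_crit if the true mean were 1260, i.e., N(1260, 30)
beta = normal_cdf((x_crit - x_bar) / se)
power = 1.0 - beta                     # probability of correctly rejecting a false H0

print(f"x_crit = {x_crit:.0f}, reject H0: {reject_h0}")
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, power = {power:.3f}")
```

Rerunning the script with a larger n shows both α-related and β-related areas shrinking, which is the sample-size remedy discussed above.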
4.2.3 Two Independent Sample and Paired Difference Tests on Means
As opposed to hypothesis tests for a single population mean, there are hypothesis tests that allow one to compare two population mean values from samples taken from each population. Two basic presumptions for the tests (described below) to be valid are that the standard deviations of the populations are reasonably close, and that the populations are approximately normally distributed.

(a) Two independent samples test
The test is based on information (namely, the mean and the standard deviation) obtained from taking two independent random samples from the two populations under consideration whose variances are unknown and unequal. Using the same notation as before for population and sample, and using subscripts 1 and 2 to denote the two samples, the random variable

z = [(x̄1 − x̄2) − (μ1 − μ2)] / [s1²/n1 + s2²/n2]^(1/2)    (4.7)

is said to approximate the standard normal distribution for large samples (n1 > 30 and n2 > 30), where s1 and s2 are the standard deviations of the two samples. The denominator is called the standard error (SE) and is a measure of the total variability of both samples combined. Notice the similarity between the expression for SE and that for the combined error of two independent measurements given by the quadrature sum of their squared values (Eq. 3.22). The confidence intervals CI of the difference in the population means can be determined as:

μ1 − μ2 = (x̄1 − x̄2) ± zα SE(x̄1, x̄2)  where  SE(x̄1, x̄2) = [s1²/n1 + s2²/n2]^(1/2)    (4.8)

where zα is the critical value at the selected significance level. Thus, the testing of the two samples involves a single random variable combining the properties of both. For smaller sample sizes, Eq. (4.8) still applies, but the z-standardized variable is replaced with the Student-t variable. The critical values are found from the Student t-tables with degrees of freedom d.f. = n1 + n2 − 2. If the variances of the population are known, then these should be used instead of the sample variances.
When the samples are small and only when the variances of both populations are close, some textbooks suggest that the two samples be combined, and the parameter be treated as a single random variable. Here, instead of using individual standard deviation values s1 and s2, a combined quantity called the pooled variance sp² is used:

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)  with d.f. = n1 + n2 − 2    (4.9)

Note that the pooled variance is simply the weighted average of the two sample variances. The use of the pooled variance approach is said to result in tighter confidence intervals, and hence its appeal. The random variable approximates the t-distribution, and the confidence interval CI of the difference in the population means is:

μ1 − μ2 = (x̄1 − x̄2) ± tα SE(x̄1, x̄2)  where  SE(x̄1, x̄2) = [sp² (1/n1 + 1/n2)]^(1/2)    (4.10)

Note that the above equation is said to apply when the variances of both samples are close. However, Devore and Farnum (2005) strongly discourage the use of the pooled variance approach as a general rule, and so the better approach, when in doubt, is to use Eq. (4.8) so as to be conservative. Manly (2005), on the other hand, states that the independent random sample test is fairly robust to the assumptions of normality and equal population variance, especially when the sample size exceeds 20 or so. The assumption of equal population variances is said not to be an issue if the ratio of the two variances is within 0.4–2.5.

Example 4.2.3 Verifying savings from energy conservation measures in homes
Certain electric utilities with limited generation capacities fund contractors to weather-strip residences in an effort to reduce infiltration losses, thereby reducing building loads, which in turn lower electricity needs. (Such energy conservation programs result in a revenue loss in electricity sales, but often this is more cost-effective to utilities in terms of deferred generation capacity expansion costs. Another reason for implementing conservation programs is the mandatory requirement set by Public Utility Commissions (PUC), which regulate electric utilities in the United States.) Suppose an electric utility wishes to determine the cost-effectiveness of their weatherstripping program by comparing the annual electric energy use of 200 similar residences in a given community, half of which were weatherstripped, and the other half were not. Samples collected from both types of residences yield:

Control sample: x̄1 = 18,750; s1 = 3200 and n1 = 100.
Weatherstripped sample: x̄2 = 15,150; s2 = 2700 and n2 = 100.

The mean difference (x̄1 − x̄2) = 18,750 − 15,150 = 3600, i.e., the mean savings fraction or percentage in each weatherstripped residence is 19.2% (= 3600/18,750) of the mean baseline or control home. However, there is an uncertainty associated with this mean value since only a sample has been analyzed. This uncertainty is characterized as a
bounded range for the mean difference. The one-tailed critical value at the 95% CL corresponding to a significance level α = 0.05 is zα = 1.645 from Table A.3. Then from Eq. (4.8):

μ1 − μ2 = (x̄1 − x̄2) ± 1.645 × [s1²/100 + s2²/100]^(1/2)

To complete the calculation of the confidence interval (CI), it is assumed, given that the sample sizes are large, that the sample variances are reasonably close to the population variances. Thus, the CI is approximately:

(18,750 − 15,150) ± 1.645 × [3200²/100 + 2700²/100]^(1/2) = 3600 ± 689 = (2911 and 4289).

These intervals represent the lower and upper values of saved energy at the 95% CL. To conclude, one can state that the savings are positive, i.e., one can be 95% confident that there is an energy benefit in weatherstripping the homes. More specifically, the mean saving percentage is 19.2% (= [(18,750 − 15,150)/18,750]) of the baseline value with an uncertainty of 19.1% (= 689/3600) in the savings at the 95% CL. Thus, the uncertainty in the savings estimate is quite large. In practice, the energy savings fraction is usually (much) smaller than what was assumed above, and this could result in the uncertainty in the savings being as large as the savings amount itself. Such a concern reflects realistic situations where the efficacy of energy conservation programs in homes is often difficult to verify accurately. One could try increasing the sample size or resorting to stratified sampling (see Sect. 4.7.4), but these may not necessarily be more conclusive. Another option is to adopt a less stringent confidence level such as the 90% CL. ■

(b) Paired difference test
The previous section dealt with independent samples from two populations with close to normal probability distributions. There are instances when the samples are somewhat correlated, and such interdependent samples are called paired samples. This interdependence can also arise when the samples are taken at the same time and are affected by a time-varying variable which is not explicitly considered in the analysis. Rather than the individual values, the difference is taken as the only random sample since it is likely to exhibit much less variability than those of the two samples. Thus, the confidence intervals calculated from paired data will be narrower than those calculated from two independent samples. Let di be the difference between individual readings of two small paired samples (n < 30), and d̄ their mean value. Then, the t-statistic is taken to be:

t = d̄ / SE  where  SE = sd / n^(1/2)    (4.11a)

and the CI around d̄ is:

μd = d̄ ± tα sd / n^(1/2)    (4.11b)

Hypothesis testing of means for paired samples is done the same way as that for a single independent mean and is usually (but not always) superior to an independent sample test. Paired difference tests are used for comparing "before and after" or "with and without" types of experiments done on the same group in turn, say, to assess the effect of an action performed. For example, the effect of an additive in gasoline meant to improve gas mileage can be evaluated statistically by considering a set of data representing the difference in the gas mileage of n cars which have each been subjected to tests involving "no additive" and "with additive." Its usefulness is illustrated by the following example, which is a more direct application of paired difference tests.

Example 4.2.4 Comparing energy use of two similar buildings based on utility bills—the wrong way
Buildings which are designed according to certain performance standards are eligible for recognition as energy-efficient buildings by federal and certification agencies. A recently completed building (B2) was awarded such an honor. The federal inspector, however, denied the request of another owner of an identical building (B1) located close by who claimed that the differences in energy use between both buildings were within statistical error. An energy consultant was hired by the owner to prove that B1 is as energy efficient as B2. He chose to compare the monthly mean utility bills over a year between the two commercial buildings using data recorded over the same 12 months and listed in Table 4.1 (the analysis would be more conclusive if bill data over several years were used, but such data may be hard to come by). This problem can be addressed using the two-sample test method. The null hypothesis is that the mean monthly utility charges μ1 and μ2 for the two buildings are equal against the alternative hypothesis that Building B2 is more energy efficient than B1 (thus, a one-tailed test is appropriate).
Since the sample sizes are less than 30, the t-statistic has to be used instead of the standard normal z-statistic. The pooled variance approach given by Eq. (4.9) is appropriate in this instance. It is computed as:
132
4
Making Statistical Inferences from Samples
Table 4.1 Monthly utility bills and the corresponding outdoor temperature for the two buildings being compared (Example 4.2.4)

Month      B1 utility cost ($)   B2 utility cost ($)   Difference (B1 − B2)   Outdoor temp. (°C)
1          693                   639                   54                     3.5
2          759                   678                   81                     4.7
3          1005                  918                   87                     9.2
4          1074                  999                   75                     10.4
5          1449                  1302                  147                    17.3
6          1932                  1827                  105                    26.0
7          2106                  2049                  57                     29.2
8          2073                  1971                  102                    28.6
9          1905                  1782                  123                    25.5
10         1338                  1281                  57                     15.2
11         981                   933                   48                     8.7
12         873                   825                   48                     6.8
Mean       1349                  1267                  82
Std. dev.  530.07                516.03                32.00

Fig. 4.5 Month-by-month variation of the utility bills for the two buildings B1 and B2 (Example 4.2.5)
[Fig. 4.5: line plot of the monthly utility bills of B1 and B2 and of their difference; y-axis: Utility Bills ($/month), x-axis: Month of Year]
s²p = [(12 − 1)(530.07)² + (12 − 1)(516.03)²] / (12 + 12 − 2) = 273,630.6

while the t-statistic can be deduced by rearranging Eq. (4.10):

t = [(1349 − 1267) − 0] / [273,630.6 × (1/12 + 1/12)]^(1/2) = 82/213.54 = 0.38
The t-value is very small and will not lead to rejection of the null hypothesis even at a significance level α = 0.10 (from Table A.4, the one-tailed critical value is 1.321 for CL = 90% and d.f. = 12 + 12 − 2 = 22). Thus, the consultant would report that insufficient statistical evidence exists to state that the two buildings are different in their energy consumption. As explained next, this example demonstrates a faulty analysis.
Example 4.2.5 Comparing energy use of two similar buildings based on utility bills—the right way
There is, however, a problem with the way the energy consultant performed the test of the previous example. Close observation of the data as plotted in Fig. 4.5 would lead one not only to suspect that this conclusion is erroneous, but also to observe that the utility bills of the two buildings tend to rise and fall together because of seasonal variations in the outdoor temperature. Hence the condition that the two samples are independent is violated. It is in such circumstances that a paired test is relevant. Here, the test is meant to determine whether the monthly mean of the differences in utility bills between buildings B1 and B2 (x̄D) is zero (null hypothesis) or is positive. In this case:

t-statistic = (x̄D − 0)/(sD/√nD) = (82 − 0)/(32/√12) = 8.88 with d.f. = 12 − 1 = 11
4.2 Basic Univariate Inferential Statistics
133
Fig. 4.6 Conceptual illustration of three characteristic cases that may arise during two-sample testing of medians. The box and whisker plots provide some indication as to the variability in the results of the tests. Case (a) very clearly indicates that the samples are very much different, case (b) also suggests the same but with a little less certitude. Finally, it is more difficult to draw conclusions from case (c), and it is in such cases that statistical tests are useful
where the values of 82 and 32 are found from Table 4.1. For a significance level of α = 0.05 and d.f. = 11, the one-tailed test (Table A.4) suggests a critical value t0.05 = 1.796. Because 8.88 is much higher than this critical value, one can safely reject the null hypothesis. In fact, Bldg. 1 is less energy efficient than Bldg. 2 even at a significance level of 0.0005 (or CL = 99.95%), and the owner of B1 does not have a valid case at all! This illustrates how misleading results can be obtained if inferential tests are misused, or if the analyst ignores the underlying assumptions behind a particular test. It is important to keep in mind the premise that the random variables should follow a normal distribution, and so some preliminary exploratory data analysis using multiple years of utility bills is advisable (Sect. 3.5).
Figure 4.6 illustrates, in a simple conceptual manner, the three characteristic cases which can arise when comparing the means of two populations based on sampled data. Recall that the box and whisker plot is a type of graphical display of the shape of the distribution where the solid line denotes the median, the upper and lower hinges of the box indicate the interquartile range values (25th and 75th percentiles), with the whiskers extending to 1.5 times this range. Case (a) corresponds to the case where the whisker of one box plot extends to the lower quartile of the second box plot. One does not need to perform a statistical test to conclude that the two population means are different. Case (b) also suggests a difference between population means but with a little less certitude. Case (c) illustrates the case when the two whisker bands are quite close, and the value of statistical tests becomes apparent. As a rough rule of thumb, if the 25th percentile for one sample exceeds the median line of the other sample, one could conclude that the means are likely to be different (Walpole et al. 2007).
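The two analyses of Examples 4.2.4 and 4.2.5 can be reproduced in a few lines, for example in Python with SciPy (a sketch, assuming SciPy is available; the data are taken from Table 4.1):

```python
from scipy import stats

b1 = [693, 759, 1005, 1074, 1449, 1932, 2106, 2073, 1905, 1338, 981, 873]
b2 = [639, 678, 918, 999, 1302, 1827, 2049, 1971, 1782, 1281, 933, 825]

# "Wrong way": independent two-sample t-test with pooled variance (Example 4.2.4)
t_ind, p_ind = stats.ttest_ind(b1, b2, equal_var=True, alternative="greater")

# "Right way": paired t-test on the monthly differences (Example 4.2.5)
t_pair, p_pair = stats.ttest_rel(b1, b2, alternative="greater")

print(round(t_ind, 2), round(t_pair, 2))   # about 0.38 and 8.88, as in the text
```

The contrast between the two t-values (0.38 vs. 8.88) makes the cost of ignoring the pairing immediately visible.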
4.2.4 Single and Two Sample Tests for Proportions
There are several cases where surveys are performed to determine fractions or proportions of populations who either have preferences of some sort or have purchased a certain type of equipment. For example, the gas company may wish to determine what fraction of its customer base has natural gas heating as against another source (e.g., electric heat pumps). The company performs a survey on a random sample from which it would like to extrapolate and ascertain confidence limits on this fraction. It is in a case such as this, where each response can be interpreted as either a "success" (the customer has gas heat) or a "failure"—in short, a binomial experiment (see Sect. 2.4.2b)—that the following test is useful.

(a) Single sample test
Let p be the population proportion one wishes to estimate from the sample proportion p̂, which is determined as: p̂ = (number of successes in sample)/(total number of trials) = x/n. Then, provided the sample is large (n ≥ 30), the proportion p̂ is an unbiased estimator of p with an approximately normal distribution. Dividing the expression for the standard deviation of the Bernoulli trials (Eq. 2.37b) by n² yields the standard error of the sampling distribution of p̂:

SE(p̂) = [p̂(1 − p̂)/n]^(1/2)    (4.12)
Thus, the large-sample CI for p for the two-tailed case at a significance level α is given by:

CI = p̂ ± zα/2 [p̂(1 − p̂)/n]^(1/2)    (4.13)
Example 4.2.6 In a random sample of n = 100 new residences in Scottsdale, AZ, it was found that 63 had swimming pools. Find the 95% CI for the fraction of buildings which have pools.
In this case, n = 100 while p̂ = 63/100 = 0.63. From Table A.3, the two-tailed critical value z0.05/2 = 1.96, and hence from Eq. (4.13), the two-tailed 95% CI for p is:

0.63 − 1.96 [0.63(1 − 0.63)/100]^(1/2) < p < 0.63 + 1.96 [0.63(1 − 0.63)/100]^(1/2)

or 0.5354 < p < 0.7246. ■
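Eq. (4.13) applied to this example can be sketched as follows (assuming SciPy is available; SciPy is used only to supply the z-value):

```python
import math
from scipy import stats

n, x = 100, 63                            # residences surveyed, those with pools
p_hat = x / n
z = stats.norm.ppf(1 - 0.05 / 2)          # two-tailed z-value for 95% CL (about 1.96)
se = math.sqrt(p_hat * (1 - p_hat) / n)   # Eq. (4.12)
lo, hi = p_hat - z * se, p_hat + z * se
print(round(lo, 4), round(hi, 4))         # about 0.5354 and 0.7246
```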
Example 4.2.7 The same equations can also be used to determine a sample size such that the error in estimating p does not exceed a certain range or error e. For instance, one would like to determine, from Example 4.2.6 data, the sample size which will yield an estimate of p within 0.02 or less at 95% CL. Then, recasting Eq. (4.13) results in a sample size:

n = z²α/2 p̂(1 − p̂)/e² = (1.96)²(0.63)(1 − 0.63)/(0.02)² = 2239 ■

It must be pointed out that the above example is somewhat misleading since one does not know the value of p̂ beforehand. One may have a preliminary idea, in which case the sample size n would be an approximate estimate, and this may have to be revised once some data are collected.

(b) Two sample tests
The intent here is to estimate whether statistically significant differences exist between the proportions of two populations based on one sample drawn from each population. Assume that the two samples are large and independent. Let p̂1 and p̂2 be the sampling proportions. Then, the sampling distribution of (p̂1 − p̂2) is approximately normal, with (p̂1 − p̂2) being an unbiased estimator of (p1 − p2) and the standard error given by:

SE(p̂1 − p̂2) = [p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2]^(1/2)    (4.14)

The following example illustrates the procedure.

Example 4.2.8 Hypothesis testing of increased incidence of lung ailments due to radon in homes
The Environmental Protection Agency (EPA) would like to determine whether the fraction of residents with health problems living in an area where the subsoil is known to have elevated radon concentrations is statistically higher than the corresponding fraction where subsoil radon concentrations are negligible. Specifically, the agency wishes to test the hypothesis at the 95% CL that the fraction of residents p1 with lung ailments in radon-prone areas is higher than the fraction p2 corresponding to low radon level locations. The following data are collected:

High radon level area: n1 = 100, p̂1 = 0.38
Low radon area: n2 = 225, p̂2 = 0.22
null hypothesis H0: (p1 − p2) = 0
alternative hypothesis H1: (p1 − p2) ≠ 0

One calculates the random variable using Eq. (4.14) to compute the SE:

z = (p̂1 − p̂2) / [p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2]^(1/2) = (0.38 − 0.22) / [(0.38)(0.62)/100 + (0.22)(0.78)/225]^(1/2) = 2.865

A one-tailed test is appropriate, and from Table A.3 the critical value is z0.05 = 1.65 for the 95% CL. Since the calculated z-value > zα, the null hypothesis can be rejected. Thus, one would conclude that those living in areas of high radon levels have statistically higher incidence of lung ailments than those who do not. Further inspection of Table A.3 reveals that z = 2.865 corresponds to a probability value of about 0.002 (or close to the 99.8% CL). Should the EPA require mandatory testing of all homes at some expense to all homeowners, or should some other policy measure be adopted? These types of considerations fall under the purview of decision-making discussed in Chap. 12. ■
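The two-proportion z-test of Example 4.2.8 can be sketched as follows (assuming SciPy is available; SciPy is used only for the normal tail probability):

```python
import math
from scipy import stats

n1, p1 = 100, 0.38        # high-radon area: sample size and sample proportion
n2, p2 = 225, 0.22        # low-radon area
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # Eq. (4.14)
z = (p1 - p2) / se
p_one_tailed = 1 - stats.norm.cdf(z)
print(round(z, 3), round(p_one_tailed, 4))   # about 2.865 and 0.002
```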
4.2.5 Single and Two Sample Tests of Variance
Recall that when a sample mean is used to provide an estimate of the population mean μ, it is more informative to give a confidence interval CI for μ instead of simply stating the value x̄. A similar approach can be adopted for estimating the population variance from that of a sample.
(a) Single sample test
The CI for a population variance σ² based on the sample variance s² is to be determined. To construct such a CI, one will use the fact that if a random sample of size n is taken from a population that is normally distributed with variance σ², then the random variable

χ² = (n − 1)s²/σ²    (4.15)
has the Pearson chi-square distribution with ν = (n − 1) degrees of freedom (described in Sect. 2.4.3). The advantage of using χ² instead of s² is akin to standardizing a variable to a normal random variable. Such a transformation allows standard tables (such as Table A.5) to be used for determining probabilities irrespective of the magnitude of s². The basis of these probability tables is again akin to finding the areas under the chi-square curves.

Example 4.2.9 A company which makes boxes wishes to determine whether their automated production line requires major servicing or not. The decision will be based on whether the weight from one box to another is significantly different from a maximum permissible population variance value of σ² = 0.12 kg². A sample of 10 boxes is selected, and their variance is found to be s² = 0.24 kg². Is this difference significant at the 95% CL?
From Eq. (4.15), the observed chi-square value is χ² = (10 − 1)(0.24)/0.12 = 18. Inspection of Table A.5 for ν = 9 degrees of freedom reveals that for a significance level α = 0.05, the critical chi-square value is χ²α = 16.92 and, for α = 0.025, χ²α = 19.02. Thus, the result is significant at α = 0.05 or 95% CL but not at the 97.5% CL. Whether to service the automated production line based on these statistical tests would involve weighing the cost of service against the associated benefits in product quality. ■

(b) Two sample tests
This instance applies to the case when two independent random samples are taken from two populations that are normally distributed, and one needs to determine whether the variances of the two populations are different or not. Such tests find application prior to conducting t-tests on two means, which presume equal variances. Let σ1 and σ2 be the standard deviations of the two populations, and s1 and s2 the sample standard deviations. If σ1 = σ2, then the random variable

F = s1²/s2²    (4.16)
has the F-distribution (described in Sect. 2.4.3) with degrees of freedom (d.f.) = (ν1, ν2) where ν1 = (n1 − 1) and ν2 = (n2 − 1). Note that the distributions are different for different combinations of ν1 and ν2. The probabilities for F can be determined using areas under the F curves or from tabulated values as in Table A.6. Note that the F-test applies to independent samples and, unfortunately, is known to be rather sensitive to the assumption of normality. Hence, some argue against its use altogether for two-sample testing (for example, Manly 2005).

Example 4.2.10 Comparing variability in daily productivity of two workers
It is generally acknowledged that worker productivity increases if the environment is properly conditioned to meet the stipulated human comfort requirements. One is interested in comparing the mean productivity of two office workers under the same conditions. However, before undertaking that evaluation, one is unsure about the assumption of equal variances in productivity of the workers (i.e., how consistent the workers are from one day to another). This test can be used to check the validity of this assumption. Suppose the following data have been collected for two workers under the same environment and performing similar tasks. An initial analysis of the data suggests that the normality condition is met for both workers:

Worker A: n1 = 13 days, mean x̄1 = 26.3 production units, standard deviation s1 = 8.2 production units.
Worker B: n2 = 18 days, mean x̄2 = 19.7 production units, standard deviation s2 = 6.0 production units.

The intent here is to compare not the means but the standard deviations. The F-statistic is determined by always choosing the larger variance as the numerator. Then F = (8.2/6.0)² = 1.87. From Table A.6, the critical F-value is Fα = 2.38 for (13 − 1) = 12 and (18 − 1) = 17 degrees of freedom at a significance level α = 0.05. Thus, as illustrated in Fig. 4.7, one is forced to accept the null hypothesis since the calculated F-value < Fα, and conclude that the data do not provide enough evidence to indicate that the population variances of the two workers are statistically different at α = 0.05. Hence, one can now proceed to use the two-sample t-test with some confidence to determine whether the difference in the means between both workers is statistically significant or not.
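The variance-ratio test of Example 4.2.10 can be sketched as follows (assuming SciPy is available for the critical F-value):

```python
from scipy import stats

s1, n1 = 8.2, 13    # worker A: sample std. dev. and number of days
s2, n2 = 6.0, 18    # worker B
F = (s1 / s2) ** 2                               # larger variance in numerator (Eq. 4.16)
F_crit = stats.f.ppf(1 - 0.05, n1 - 1, n2 - 1)   # critical value, d.f. = (12, 17)
print(round(F, 2), round(F_crit, 2), F < F_crit)
```

Since the computed F falls below the critical value, the equal-variance assumption is retained, as in the text.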
Table 4.2 Expected number of homes for different number of non-code compliance values if the process is assumed to be a Poisson distribution with sample mean of 0.5 (Example 4.2.11)

X (number of non-code compliance values)   P(x)·n          Expected number
0                                          (0.6065) 380    230.470
1                                          (0.3033) 380    115.254
2                                          (0.0758) 380    28.804
3                                          (0.0126) 380    4.788
4                                          (0.0016) 380    0.608
5 or more                                  (0.0002) 380    0.076
Total                                      (1.000) 380     380

[Fig. 4.7: F-distribution density with d.f. = (12, 17); the calculated F-value of 1.87 falls below the critical value of 2.38 at α = 0.05, outside the rejection region]
Fig. 4.7 Since the calculated F-value is lower than the critical value, one is forced to accept the null hypothesis (Example 4.2.10)

4.2.6 Tests for Distributions

Recall from Sect. 2.4.3 that the chi-square (χ²) statistic applies to discrete data. It is used to statistically test the hypothesis that a set of empirical or sample data does not differ significantly from that expected from a specified theoretical distribution. In other words, it is a goodness-of-fit test to ascertain whether the distribution of proportions of one group differs from another or not. The chi-square statistic is computed as:

χ² = Σk (fobs − fexp)²/fexp    (4.17)

where fobs is the observed frequency of each class or interval, fexp is the expected frequency for each class predicted by the theoretical distribution, and k is the number of classes or intervals. If χ² = 0, the observed and theoretical frequencies agree exactly. If not, the larger the value of χ², the greater the discrepancy. Tabulated values of χ² are used to determine significance for different values of degrees of freedom υ = k − 1 (Table A.5). Certain restrictions apply for proper use of this test. The sample size should be greater than 30, and none of the expected frequencies should be less than 5 (Walpole et al. 2007). In other words, a long tail of the probability curve at the lower end is not appropriate, and in that sense, some power is lost when the test is adopted. The following two examples serve to illustrate the process of applying the chi-square test.

Example 4.2.11 Ascertaining whether non-code compliance infringements in residences are random or not
A county official was asked to analyze the frequency of cases when home inspectors found new homes built by one specific builder to be non-code compliant and determine whether the violations were random or not. The following data for 380 homes were collected:

No. of code infringements   0     1    2    3   4
Number of homes             242   94   38   4   2
The underlying random process can be characterized by the Poisson distribution (see Sect. 2.4.2):

P(x) = λ^x exp(−λ)/x!

The null hypothesis, namely that the sample is drawn from a population that is Poisson distributed, is to be tested at the 0.05 significance level. The sample mean λ = [0(242) + 1(94) + 2(38) + 3(4) + 4(2)]/380 = 0.5 infringements per home. For a Poisson distribution with λ = 0.5, the underlying or expected values are found for different values of x as shown in Table 4.2. The last two categories have expected frequencies that are less than 5, which does not meet one of the requirements for using the test (as stated above). Hence, these will be combined into a new category called "3 or more cases," which will have an expected frequency of (4.788 + 0.608 + 0.076) = 5.472. The following statistic is calculated first:
χ² = (242 − 230.470)²/230.470 + (94 − 115.254)²/115.254 + (38 − 28.804)²/28.804 + (6 − 5.472)²/5.472 = 7.483
Since there are only 4 groups, the degrees of freedom υ = 4 − 1 = 3, and from Table A.5, the critical value at the 0.05 significance level is χ²α = 7.815. This would suggest that the null hypothesis cannot be rejected at the 0.05 significance level; however, the two values (7.483 and 7.815) are very close, and some further analysis may be warranted. ■
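The Poisson goodness-of-fit computation of Example 4.2.11 can be sketched as follows (assuming SciPy is available; the statistic differs very slightly from 7.483 because the expected frequencies are not rounded here):

```python
from scipy import stats

observed = [242, 94, 38, 6]          # homes with 0, 1, 2, and "3 or more" infringements
n, lam = 380, 0.5                    # total homes and sample mean rate

p = [stats.poisson.pmf(k, lam) for k in range(3)]
p.append(1 - sum(p))                 # lump the Poisson tail into "3 or more"
expected = [n * pi for pi in p]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # Eq. (4.17)
crit = stats.chi2.ppf(1 - 0.05, df=3)
print(round(chi2, 2), round(crit, 3), chi2 < crit)
```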
Table 4.3 Observed and computed (assuming gender independence) number of accidents in different circumstances (Example 4.2.12)

Circumstance   Male observed   Male expected   Female observed   Female expected   Total observed
At work        40              26.3            5                 18.7              45
At home        49              62.6            58                44.4              107
Other          18              18.1            13                12.9              31
Total          107                             76                                  183 = n

Example 4.2.12⁶ Evaluating whether injuries in males and females are independent of circumstance
Chi-square tests are also widely used as tests of independence using contingency tables. In 1975, more than 59 million Americans suffered injuries. More males (33.6 million) were injured than females (25.6 million). These statistics do not distinguish whether males and females tend to be injured in similar circumstances. A safety survey of n = 183 accident reports was selected at random to study this issue in a large city, and the results are summarized in Table 4.3. The null hypothesis is that the circumstance of an accident (whether at work or at home) is independent of the gender of the victim. This hypothesis is to be verified at a significance level of α = 0.01.
The degrees of freedom d.f. = (r − 1)(c − 1), where r is the number of rows and c the number of columns. Hence, d.f. = (3 − 1)(2 − 1) = 2. From Table A.5, the critical value is χ²α = 9.21 at α = 0.01 for d.f. = 2.
The expected values for the different joint occurrences (male/work, male/home, male/other, female/work, female/home, female/other) are shown in Table 4.3 and correspond to the case when the occurrences are independent. Recall from basic probability (Eq. 2.10) that if events A and B are independent, then p(A ∩ B) = p(A) · p(B), where p indicates the probability. In our case, if being "male" and "being involved in an accident at work" were truly independent, then p(work ∩ male) = p(work) · p(male). Consider the cell corresponding to male/at work. Its expected value = n · p(work ∩ male) = n · p(work) · p(male) = 183 · (45/183) · (107/183) = (45)(107)/183 = 26.3 (as shown in Table 4.3). Expected values for the other joint occurrences shown in the table have been computed in like manner. Thus, the chi-square statistic is:

χ² = (40 − 26.3)²/26.3 + (5 − 18.7)²/18.7 + ... + (13 − 12.9)²/12.9 = 24.3

Since χ²α = 9.21 is much lower than 24.3, the null hypothesis can be safely rejected at a significance level of 0.01. Hence, one would conclude that gender has a bearing on the circumstance in which accidents occur. ■

6 From Weiss (1987) by permission of Pearson Education.
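The contingency-table test of Example 4.2.12 can be sketched as follows (assuming SciPy is available; the function returns about 24.2 rather than 24.3 because the book rounds the expected frequencies):

```python
from scipy import stats

table = [[40, 5],     # at work: male, female (observed counts from Table 4.3)
         [49, 58],    # at home
         [18, 13]]    # other
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(round(chi2, 1), dof, p < 0.01)
```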
Another widely used test to determine whether the distribution of a set of empirical or sample data comes from a specified theoretical distribution is the Kolmogorov–Smirnov (KS) test (Shannon 1975). It is analogous to the chi-square test and is especially useful for small sample sizes since it does not require that adjacent categories with expected frequencies less than 5 be combined. The chi-square test is very powerful for large samples (sample sizes of about n > 100, with some authors even suggesting n > 30). The KS test is based on binning the cumulative probability distribution of the empirical data into a specific number of classes or intervals and is best used for 10 < n < 100. In general, the greater the number of intervals, the more discriminating the test. While the chi-square test requires a minimum of 5 data points in each class, the KS test can accommodate even a single observation in a class.
4.2.7 Test on the Pearson Correlation Coefficient
Recall that the Pearson correlation coefﬁcient was presented in Sect. 3.4.2 as a means of quantifying the linear relationship between samples of two variables. One can also deﬁne a population correlation coefﬁcient ρ for two variables. Section 4.2.1 presented methods by which the uncertainty around the population mean could be ascertained from the sample mean by determining conﬁdence limits. Similarly, one can make inferences about the population correlation coefﬁcient ρ from knowledge of the sample correlation coefﬁcient r. Provided both the variables are normally distributed (called a bivariate normal population), then Fig. 4.8 provides a convenient way of ascertaining the 95% CL of the population correlation coefﬁcient for different sample sizes. Say, r = 0.6 for a sample n = 10 pairs of observations, then the 95% CL interval limits for the population correlation coefﬁcient are (0.05 < ρ < 0.87), which are very wide. Notice how increasing the sample size shrinks these bounds. For example, when n = 100, the interval limits are (0.47 < ρ < 0.71). The accurate manner of determining whether the sample correlation coefﬁcient r found from analyzing a data set with two variables is signiﬁcant or not is to conduct a standard hypothesis test involving null and alternative hypothesis.
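The hypothesis test alluded to above can be sketched numerically: under H0: ρ = 0 for bivariate normal data, the statistic t = r[(n − 2)/(1 − r²)]^(1/2) follows a Student t-distribution with n − 2 d.f. (a standard result; the critical r values of Table A.7 can be reproduced from it). A sketch, assuming SciPy is available:

```python
import math
from scipy import stats

n, r = 12, 0.6                                      # 12 data pairs, observed correlation
t = r * math.sqrt((n - 2) / (1 - r ** 2))           # test statistic under H0: rho = 0
t_crit = stats.t.ppf(1 - 0.05, n - 2)               # one-tailed, alpha = 0.05
r_crit = t_crit / math.sqrt(t_crit ** 2 + n - 2)    # about 0.497, cf. Table A.7
print(round(t, 2), round(r_crit, 3))
```

The computed r_crit matches the Table A.7 value of 0.497 quoted below for this case.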
Fig. 4.8 Plot depicting 95% CI for population correlation in a bivariate normal population for various sample sizes n. The bold vertical line defines the lower and upper limits of ρ when r = 0.6 from a data set of 10 pairs of observations. (From Wonnacott and Wonnacott 1985 by permission of John Wiley and Sons)
Table A.7 lists the critical values of the sample correlation coefficient r for testing the null hypothesis that the population correlation coefficient is statistically significant (i.e., ρ ≠ 0) at the 0.05 and 0.01 significance levels for one- and two-tailed tests. The interpretation of these values is of some importance in many cases, especially when dealing with small data sets. Say, analysis of the 12 monthly bills of a residence revealed a linear correlation of r = 0.6 with degree-days at the location (see Pr. 2.28 for a physical explanation of the degree-day concept). Assume that a one-tailed test applies. The sample correlation suggests the presence of a correlation at a significance level α = 0.05 (the critical value from Table A.7 is ρc = 0.497), while there is none at α = 0.01 (for which ρc = 0.658). Note that certain simplified suggestions on interpreting values of r in terms of whether they are strong, moderate, or weak for general engineering analysis were given by Eq. (3.13); these are to be used with caution and were meant as rules of thumb only.
Instances do arise when the correlation coefficients determined on the same two variables, but from two different samples, are to be statistically compared. Fisher's r to z transformation method can be adopted, whereby the sampling distribution of the Pearson correlation coefficient r is converted into an approximately normally distributed random variable. Convert the two data sets (x1, y1) and (x2, y2) into two sets of normalized (zx1, zy1) and (zx2, zy2) scores and calculate the correlation coefficients zr1 and zr2 of both data sets separately following Eq. (3.12). Then calculate the random variable:

zobs = (zr1 − zr2) / [1/(n1 − 3) + 1/(n2 − 3)]^(1/2)    (4.18)
Standard statistical significance tests can then be performed on the random variable zobs. For example, say the observed value was zobs = 2.15. If the level of significance is set at 0.05, the critical value for a two-tailed test would be zcritical = 1.96. Since zobs > zcritical, the null hypothesis that the two correlations are not significantly different can be rejected.
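A sketch of the comparison, using the standard form of Fisher's transformation zr = arctanh(r) (the sample values here are made up for illustration; SciPy is assumed available for the critical z-value):

```python
import math
from scipy import stats

r1, n1 = 0.60, 40     # hypothetical first sample: correlation and size
r2, n2 = 0.20, 50     # hypothetical second sample
z1, z2 = math.atanh(r1), math.atanh(r2)              # Fisher r-to-z scores
z_obs = (z1 - z2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # Eq. (4.18)
z_crit = stats.norm.ppf(1 - 0.05 / 2)                # two-tailed, alpha = 0.05
print(round(z_obs, 2), abs(z_obs) > z_crit)
```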
4.3 ANOVA Test for Multi-Samples
The statistical methods known as ANOVA (analysis of variance) are a broad set of widely used and powerful techniques meant to identify and measure sources of variation within a data set. This is done by partitioning the total variation in the data into its component parts. Speciﬁcally, ANOVA uses variance information from several samples to make inferences about the means of the populations from which these samples were drawn (and, hence, the appellation).
Fig. 4.9 Conceptual explanation of the basis of a single-factor ANOVA test. (From Devore and Farnum 2005 with permission from Thomson Brooks/Cole)
[Fig. 4.9 panels: when H0 is true, the variation between sample means is SMALL relative to the variation within samples; when H0 is false, the between-samples variation is LARGE relative to the within-samples variation]

Recall that z-tests and t-tests described previously are used to test for differences in one parameter treated as a random variable (such as mean values) between two independent groups, depending on whether the sample sizes are greater or less than about 30, respectively. The two groups differ in some respect; they could be, say, two samples of 10 marbles each selected from a production line during different "time periods." The "time period" is the experimental variable, which is referred to as a factor in designed experiments and hypothesis testing. It is obvious that the cases treated in Sect. 4.2 are single-factor hypothesis tests (on mean and variance) involving single and two groups or samples; the extension to multiple groups is the single-factor (or one-way) ANOVA method. This section deals with the single-factor ANOVA method, which is a logical lead-in to multivariate techniques (discussed in Sect. 4.4) and to experimental design methods involving multiple factors or experimental variables (discussed in Chap. 6).

4.3.1 Single-Factor ANOVA

The ANOVA procedure uses just one test for comparing k sample means, just like that followed by the two-sample test. The following example allows a conceptual understanding of the approach. Say, four random samples of the same physical quantity have been selected, each from a different source. Whether the sample means differ enough to suggest different parent populations for the sources can be ascertained by comparing the within-sample variation to the variation between the four samples. The more the sample means differ, the larger will be the between-samples variation, as shown in Fig. 4.9b, and the less likely is the probability that the samples arise from the same population. The reverse is true if the ratio of between-samples variation to within-samples variation is small (Fig. 4.9a). Single-factor ANOVA methods test the null hypothesis of the form:

H0: μ1 = μ2 = ... = μk
Ha: at least two of the μi's are different    (4.19)
Adopting the following notation:

Sample sizes: n1, n2, ..., nk
Sample means: x̄1, x̄2, ..., x̄k
Sample standard deviations: s1, s2, ..., sk
Total sample size: n = n1 + n2 + ... + nk
Grand average: ⟨x⟩ = weighted average of all n responses

Then, one defines the between-sample variation, called the "treatment sum of squares⁷" (SSTr), as:

SSTr = Σi=1..k ni (x̄i − ⟨x⟩)²  with d.f. = k − 1    (4.20)

and the within-samples variation or "error sum of squares" (SSE) as:

SSE = Σi=1..k (ni − 1) si²  with d.f. = n − k    (4.21)

Together these two sources of variation comprise the "total sum of squares" (SST):

SST = SSTr + SSE = Σi=1..k Σj=1..ni (xij − ⟨x⟩)²  with d.f. = n − 1    (4.22)
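The decomposition in Eqs. (4.20)–(4.22) can be verified numerically on any grouped data set; a minimal sketch (the three groups below are made up for illustration):

```python
# Hypothetical data: three groups of three observations each
samples = [[13.1, 15.0, 14.0],
           [16.3, 15.7, 17.2],
           [13.7, 13.9, 12.4]]

n = sum(len(s) for s in samples)
grand = sum(x for s in samples for x in s) / n          # grand average <x>
means = [sum(s) / len(s) for s in samples]              # group means

sstr = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))   # Eq. (4.20)
sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)      # Eq. (4.21)
sst = sum((x - grand) ** 2 for s in samples for x in s)                 # Eq. (4.22)
print(round(sstr, 3), round(sse, 3), round(sst, 3))     # sst equals sstr + sse
```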
7 The term "treatment" was originally coined for historic reasons, where one was interested in evaluating the effect of treatments or specific changes in material mix and processing of a product during its development.
Table 4.4 Amount of vibration (values in microns) for five brands of bearings tested on six motor samples (Example 4.3.1)ᵃ

Sample      Brand 1   Brand 2   Brand 3   Brand 4   Brand 5
1           13.1      16.3      13.7      15.7      13.5
2           15.0      15.7      13.9      13.7      13.4
3           14.0      17.2      12.4      14.4      13.2
4           14.4      14.9      13.8      16.0      12.7
5           14.0      14.4      14.9      13.9      13.4
6           11.6      17.2      13.3      14.7      12.3
Mean        13.68     15.95     13.67     14.73     13.08
Std. dev.   1.194     1.167     0.816     0.940     0.479

a Data available electronically on book website
Table 4.5 ANOVA table for Example 4.3.1

Source   d.f.          Sum of squares   Mean square     F-value
Factor   5 − 1 = 4     SSTr = 30.855    MSTr = 7.714    8.44
Error    30 − 5 = 25   SSE = 22.838     MSE = 0.9135
Total    30 − 1 = 29   SST = 53.694

SST is simply (n − 1)s², where s is the standard deviation of the combined set of n data points. The statistic defined below as the ratio of two variances follows the F-distribution:

F = MSTr/MSE    (4.23)

where the mean between-sample variation is

MSTr = SSTr/(k − 1)    (4.24)

and the mean within-sample square error is

MSE = SSE/(n − k)    (4.25)

Recall that the p-value is the area of the F curve for (k − 1, n − k) degrees of freedom to the right of the F-value. If p-value ≤ α (the selected significance level), then the null hypothesis can be rejected. Note that the F-test is meant to be used for normal populations and equal population variances.

Example 4.3.1⁸ Comparing mean life of five motor bearings
A motor manufacturer wishes to evaluate five different motor bearings for motor vibration (which adversely results in reduced life). Each type of bearing is installed on different random samples of six motors. The amount of vibration (in microns) is recorded when each of the 30 motors is running. The data obtained are assembled in Table 4.4. Determine from an F-test whether the bearing brands influence motor vibration at the α = 0.05 significance level.
In this example, grand average ⟨x⟩ = 14.22, k = 5, and n = 30. The one-way ANOVA table is first generated as shown in Table 4.5 using Eqs. (4.20)–(4.25). For example:

SSTr = 6 × [(13.68 − 14.22)² + (15.95 − 14.22)² + (13.67 − 14.22)² + (14.73 − 14.22)² + (13.08 − 14.22)²] = 30.855

SSE = 5 × [1.194² + 1.167² + 0.816² + 0.940² + 0.479²] = 22.838

From the F-tables (Table A.6) and for α = 0.05, the critical F-value for d.f. = (4, 25) is Fc = 2.76, which is less than F = 8.44 computed from the data. Hence, one is compelled to reject the null hypothesis that all five means are equal, and conclude that the type of motor bearing does have a significant effect on motor vibration. In fact, this conclusion can be reached even at the more stringent significance level of α = 0.001.
The results of the ANOVA analysis can be conveniently illustrated by generating an effects plot, as shown in Fig. 4.10a. This clearly illustrates the relationship between the mean values of the response variable, i.e., vibration level, for the five different motor bearing brands. Brand 5 gives the lowest average vibration, while Brand 2 has the highest. Note that such plots, though providing useful insights, are not generally a substitute for an ANOVA analysis. Another way of plotting the data is a means plot (Fig. 4.10b), which includes 95% CL intervals as well as the information provided in Fig. 4.10a. Thus, a sense of the variation within samples can be gleaned. ■

8 From Devore and Farnum (2005) by permission of Cengage Learning.
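The one-way ANOVA of Example 4.3.1 can be cross-checked with SciPy (a sketch, assuming SciPy is available; data from Table 4.4):

```python
from scipy import stats

brand1 = [13.1, 15.0, 14.0, 14.4, 14.0, 11.6]
brand2 = [16.3, 15.7, 17.2, 14.9, 14.4, 17.2]
brand3 = [13.7, 13.9, 12.4, 13.8, 14.9, 13.3]
brand4 = [15.7, 13.7, 14.4, 16.0, 13.9, 14.7]
brand5 = [13.5, 13.4, 13.2, 12.7, 13.4, 12.3]

F, p = stats.f_oneway(brand1, brand2, brand3, brand4, brand5)
print(round(F, 2), p < 0.001)   # reproduces F of about 8.44
```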
4.3 ANOVA Test for Multi-Samples
Fig. 4.10 (a) Effect plot. (b) Means plot showing the 95% CL intervals around the mean values of the 5 brands (Example 4.3.1)
Table 4.6 Pairwise analysis of the five samples following Tukey's HSD procedure (Example 4.3.2)

Samples   Distance                   Conclusionᵃ
1,2       |13.68 − 15.95| = 2.27     μi ≠ μj
1,3       |13.68 − 13.67| = 0.01
1,4       |13.68 − 14.73| = 1.05
1,5       |13.68 − 13.08| = 0.60
2,3       |15.95 − 13.67| = 2.28     μi ≠ μj
2,4       |15.95 − 14.73| = 1.22
2,5       |15.95 − 13.08| = 2.87     μi ≠ μj
3,4       |13.67 − 14.73| = 1.06
3,5       |13.67 − 13.08| = 0.59
4,5       |14.73 − 13.08| = 1.65     μi ≠ μj

ᵃ Indicated only if distance > critical value of 1.62

4.3.2 Tukey's Multiple Comparison Test
A limitation of the ANOVA test is that, in case the null hypothesis is rejected, one is unable to determine the exact cause. For example, one poor motor bearing brand could have been the cause of this rejection in the example above even though there is no significant difference among the other four brands. Thus, one needs to be able to pinpoint the culprit. One could, of course, perform paired comparisons of two brands at a time. In the case of 5 sets, one would then make 10 such tests. Apart from the tediousness of such a procedure, making independent paired comparisons leads to a decrease in sensitivity, i.e., type I errors are magnified.⁹ Rigorous classical procedures that allow multiple comparisons to be made simultaneously have been proposed for this purpose (see Manly 2005). One such method is discussed in Sect. 4.4.2. In this section, Tukey's HSD (honestly significant difference) procedure is described, which is used to test the differences among multiple sample means by pairwise testing. It is limited to cases of equal sample sizes. This procedure allows the simultaneous formation of prespecified confidence intervals for all paired comparisons using the Student t-distribution. Separate tests are conducted to determine whether μi = μj for each pair (i, j) of means in an ANOVA study of k population means. Tukey's procedure is based on comparing the distance (or absolute value)
between any two sample means, |x̄i − x̄j|, to a threshold value T that depends on the significance level α as well as on the mean square error (MSE) from the ANOVA test. The T-value is calculated as:

T = qα (MSE / ni)^(1/2)    (4.26a)

⁹ This can be shown mathematically but is beyond the scope of this text.
where ni is the size of the sample drawn from each brand population, and the qα values, called the studentized range distribution values, are given in Table A.8 for α = 0.05 and d.f. = (k, n − k). If |x̄i − x̄j| > T, then one concludes that μi ≠ μj at the corresponding significance level. Otherwise, one concludes that there is no difference between the two means. Tukey also suggested a convenient visual representation to keep track of the results of all these pairwise tests. Tukey's HSD procedure and this representation are illustrated in the following example.

Example 4.3.2¹⁰ Using the same data as in Example 4.3.1, conduct a multiple comparison procedure to distinguish which of the motor bearing brands are superior to the rest.
Following Tukey's HSD procedure given by Eq. (4.26a), the critical distance between sample means at α = 0.05 is:

¹⁰ From Devore and Farnum (2005), by permission of Cengage Learning.
4 Making Statistical Inferences from Samples
Fig. 4.11 Graphical depiction summarizing the ten pairwise comparisons following Tukey's HSD procedure, with the five sample means arranged in ascending order: Brand 5 (13.08), Brand 3 (13.67), Brand 1 (13.68), Brand 4 (14.73), Brand 2 (15.95). Brand 2 is significantly different from Brands 1, 3, and 5, and so is Brand 4 from Brand 5 (Example 4.3.2)
T = qα (MSE / ni)^(1/2) = 4.15 × (0.913 / 6)^(1/2) = 1.62
where qα is found by interpolation from Table A.8 based on d.f. = (k, n − k) = (5, 25). The pairwise distances between the five sample means, shown in Table 4.6, can then be determined and the appropriate inferences made. Note that the number of pairwise comparisons is k(k − 1)/2, which for k = 5 equals the 10 comparisons shown in the table. The distance between the following pairs is less than T = 1.62: {1,3; 1,4; 1,5}, {2,4}, {3,4; 3,5}. This information is visually summarized in Fig. 4.11 by arranging the five sample means in ascending order and then drawing rows of bars connecting the pairs whose distances do not exceed T = 1.62. It is now clear that though Brand 5 has the lowest mean value, it is not significantly different from Brands 1 and 3. Hence, the final selection of a low-vibration motor bearing can be made from these three brands only (Brands 1, 3, 5). ■

The Tukey method is said to be too conservative when the intent is to compare the means of (k − 1) samples against a single sample taken to be the "control" or reference. An alternative multi-comparison test suitable in such instances is Dunnett's method (see, for example, Devore and Farnum 2005). Instead of performing pairwise comparisons among all the samples, this method computes the critical T-value following:

T = tα [MSE (1/ni + 1/nc)]^(1/2)    (4.26b)
where ni and nc are the sizes of the samples drawn from the individual groups and the control group, respectively, and tα is the critical value found from Table A.9 based on d.f. = (k − 1, n − k) degrees of freedom. Thus, the number of pairwise comparisons reduces to (k − 1) tests as against k(k − 1)/2 for the Tukey method. For the example above, the number of tests is only 4 instead of 10 for the Tukey approach.
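The Tukey screening of Table 4.6 can be sketched in a few lines of pure Python (q_alpha = 4.15 is the tabulated value from Table A.8; the brand means and MSE are those of Example 4.3.1):

```python
import math

# Tukey HSD pairwise screening (Example 4.3.2).
means = {1: 13.68, 2: 15.95, 3: 13.67, 4: 14.73, 5: 13.08}  # brand sample means
q_alpha, mse, n_i = 4.15, 0.913, 6    # Table A.8 value, ANOVA MSE, group size

T = q_alpha * math.sqrt(mse / n_i)    # critical distance, ~1.62 (Eq. 4.26a)
different = {(i, j)
             for i in means for j in means
             if i < j and abs(means[i] - means[j]) > T}
# different -> {(1, 2), (2, 3), (2, 5), (4, 5)}
```

The four flagged pairs reproduce the μi ≠ μj entries of Table 4.6; all involve either Brand 2 or the 4,5 pair, consistent with Fig. 4.11.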
4.4 Tests of Significance of Multivariate Data

4.4.1 Introduction to Multivariate Methods
Multivariate statistical analysis (also called multifactor analysis) deals with statistical inference as applied to multiple parameters (each considered to behave like a random variable) deduced from one or several samples taken from one or several populations. Multivariate methods can be used to make inferences about parameters such as sample means and variances. Rather than treating each parameter or variable separately, as done in t-tests and single-factor ANOVA, multivariate inferential methods allow multiple variables to be analyzed simultaneously as a system of measurements. This generally results in sounder inferences being made, a point elaborated below. The univariate probability distributions presented in Sect. 2.4 can also be extended to bivariate and multivariate distributions. Let x1 and x2 be two variables of the same type, say both discrete (for continuous variables, the summations in the equations below need to be replaced with integrals). Their joint distribution is given by:

f(x1, x2) ≥ 0  and  Σ over all (x1, x2) of f(x1, x2) = 1    (4.27)
Consider two sets of multivariate data, each consisting of p variables. However, the sets could be different in size, i.e., the number of observations in each set may be different, say n1 and n2. Let X̄1 and X̄2 be the sample mean vectors of dimension p. For example,

X̄1 = [x̄11, x̄12, . . ., x̄1i, . . ., x̄1p]    (4.28)
where x1i is the sample average over n1 observations of parameter i for the ﬁrst set. Further, let C1 and C2 be the sample covariance matrices of size (p × p) for the two sets respectively (the basic concepts of covariance and correlation were presented in Sect. 3.4.2). Then, the sample matrix of variances and covariances for the ﬁrst data set is given by:
Fig. 4.12 Two bivariate normal distributions and associated 50% and 90% contours assuming equal standard deviations for both variables. However, the left-hand plot (a) presumes the two variables to be uncorrelated, while in the right-hand plot (b) the two variables have a correlation coefficient of 0.75, which results in elliptical contours. (From Johnson and Wichern, 1988, by permission of Pearson Education)
C1 = [ c11  c12  ⋯  c1p
       c21  c22  ⋯  c2p
        ⋮    ⋮        ⋮
       cp1  cp2  ⋯  cpp ]    (4.29)
where cii is the variance for parameter i and cik the covariance for parameters i and k. Similarly, the sample correlation matrix, whose diagonal elements are equal to unity and whose other terms are scaled appropriately, is given by:

R1 = [ 1    r12  ⋯  r1p
       r21  1    ⋯  r2p
        ⋮    ⋮        ⋮
       rp1  rp2  ⋯  1   ]    (4.30)
Both matrices contain the pairwise associations between the variables, and they are symmetric about the diagonal since, say, r12 = r21, and so on. This redundancy is simply meant to allow easier reading. These matrices provide a convenient visual representation of the extent to which the different sets of variables are correlated with each other, thereby allowing strongly correlated sets to be easily identified. Note that correlations are not affected by shifting and scaling the data. Thus, standardizing the variables (subtracting the mean from each observation and dividing by the standard deviation) will still retain the correlation structure of the original data set while providing certain convenient interpretations of the results. Underlying assumptions for multivariate tests of significance are that the two samples have close to multivariate normal distributions with equal population covariance matrices. The multivariate normal distribution is a generalization of the univariate normal distribution when p ≥ 2, where p is the number of dimensions or parameters. Figure 4.12 illustrates how the bivariate normal distribution is distorted in the presence of correlated variables. The contour lines are circles for uncorrelated variables and ellipses for correlated ones.
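The invariance of correlation under shifting and scaling noted above can be checked directly. A small pure-Python sketch with made-up illustrative numbers (not data from the text):

```python
from statistics import mean, stdev

# Illustrative bivariate sample (hypothetical values).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

def corr(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)
    return cov / (stdev(a) * stdev(b))

def standardize(a):
    """Subtract the mean and divide by the standard deviation."""
    ma, sa = mean(a), stdev(a)
    return [(ai - ma) / sa for ai in a]

r_raw = corr(x, y)
r_std = corr(standardize(x), standardize(y))
# r_raw and r_std are identical up to round-off
```

Standardizing changes the covariance matrix (all diagonal entries become 1) but leaves the correlation matrix untouched, which is why R1 is often the more convenient object to inspect.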
4.4.2 Hotelling T² Test
The simplest extension of univariate statistical tests is the situation when two or more samples are evaluated to determine whether they originate from populations with: (i) different means, and (ii) different variances/covariances. One can distinguish between the following types of multivariate inference tests involving more than one parameter (Manly 2005): (a) comparison of several mean values of factors from two samples, best done using the Hotelling T²-test; (b) comparison of variances for two samples (several procedures have been proposed; the best known are Box's M-test, Levene's test based on the T²-test, and the Van Valen test); (c) comparison of mean values for several samples (several tests are available; the best known are Wilks' lambda statistic test, Roy's largest root test, and Pillai's trace statistic test); (d) comparison of variances for several samples (using Box's M-test). Only case (a) will be described here; the others are treated in texts such as Manly (2005).

Consider two samples with sample sizes n1 and n2. One wishes to compare differences in p random variables between the two samples. Let X̄1 and X̄2 be the mean vectors of the two samples. A pooled estimate of the covariance matrix is:

C = {(n1 − 1)C1 + (n2 − 1)C2} / (n1 + n2 − 2)    (4.31)
where C1 and C2 are the sample covariance matrices given by Eq. (4.29). Then, the Hotelling T²-statistic is defined as:

T² = [n1 n2 / (n1 + n2)] (X̄1 − X̄2)′ C⁻¹ (X̄1 − X̄2)    (4.32)

A large numerical value of this statistic suggests that the two population mean vectors are different. The null hypothesis test uses the transformed statistic:

F = (n1 + n2 − p − 1) T² / [(n1 + n2 − 2) p]    (4.33)

which follows the F-distribution with p and (n1 + n2 − p − 1) degrees of freedom. Since the T²-statistic is quadratic, it can also be written in double-sum notation as:

T² = [n1 n2 / (n1 + n2)] Σ (i = 1 to p) Σ (k = 1 to p) (x̄1i − x̄2i) c^{ik} (x̄1k − x̄2k)    (4.34)

where c^{ik} denotes the (i, k) element of the inverse matrix C⁻¹. The following solved example serves to illustrate the use of the above equations.

Example 4.4.1¹¹ Comparing mean values of two samples by pairwise and by Hotelling T² procedures
Consider two samples, each measured on p = 5 parameters. Sample 1 has 21 observations and sample 2 has 28. The mean vectors and covariance matrices of both samples have been calculated and are shown below:

X̄1 = [157.381, 241.000, 31.433, 18.500, 20.810]′

C1 = [ 11.048   9.100  1.557  0.870  1.286
        9.100  17.500  1.910  1.310  0.880
        1.557   1.910  0.531  0.189  0.240
        0.870   1.310  0.189  0.176  0.133
        1.286   0.880  0.240  0.133  0.575 ]

and

X̄2 = [158.429, 241.571, 31.479, 18.446, 20.839]′

C2 = [ 15.069  17.190  2.243  1.746  2.931
       17.190  32.550  3.398  2.950  4.066
        2.243   3.398  0.728  0.470  0.559
        1.746   2.950  0.470  0.434  0.506
        2.931   4.066  0.559  0.506  1.321 ]

¹¹ From Manly (2005) by permission of CRC Press.

If one performed paired t-tests with each parameter taken one at a time (as described in Sect. 4.2.3), one would compute the pooled variance for the first parameter as:

s1² = [(21 − 1)(11.048) + (28 − 1)(15.069)] / (21 + 28 − 2) = 13.36

and the t-statistic as:

t = (157.381 − 158.429) / [13.36 (1/21 + 1/28)]^(1/2) = −0.99
with 21 + 28 − 2 = 47 degrees of freedom. This is not significantly different from zero, as one can note from the p-value indicated in Table A.4. Table 4.7 assembles similar results for all the other parameters. One would conclude that none of the five parameters in the two data sets are statistically different.
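The univariate computation for the first parameter can be sketched as follows (pure Python, using the values quoted above):

```python
import math

# Pooled-variance t-test for parameter 1 of Example 4.4.1.
n1, n2 = 21, 28
mean1, mean2 = 157.381, 158.429
var1, var2 = 11.048, 15.069            # sample variances of parameter 1

s2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)   # pooled variance, ~13.36
t = (mean1 - mean2) / math.sqrt(s2 * (1 / n1 + 1 / n2))    # ~ -0.99, 47 d.f.
```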
Table 4.7 Paired t-tests for each of the five parameters taken one at a time (Example 4.4.1)

Parameter   First data set        Second data set       t-value        p-value
            Mean      Variance    Mean      Variance    (d.f. = 47)
1           157.38    11.05       158.43    15.07       −0.99          0.327
2           241.00    17.50       241.57    32.55       −0.39          0.698
3           31.43     0.53        31.48     0.73        −0.20          0.842
4           18.50     0.18        18.45     0.43        0.33           0.743
5           20.81     0.58        20.84     1.32        −0.10          0.921

To perform the multivariate test, one first calculates the pooled sample covariance matrix (Eq. 4.31):

C = (20 C1 + 27 C2) / 47
  = [ 13.358  13.748  1.951  1.373  2.231
      13.748  26.146  2.765  2.252  2.710
       1.951   2.765  0.645  0.350  0.423
       1.373   2.252  0.350  0.324  0.347
       2.231   2.710  0.423  0.347  1.004 ]

where, for example, the first entry is (20 × 11.048 + 27 × 15.069)/47 = 13.358. The inverse of the matrix C yields:

C⁻¹ = [  0.2061  −0.0694  −0.2395   0.0785  −0.1969
        −0.0694   0.1234  −0.0376  −0.5517   0.0277
        −0.2395  −0.0376   4.2219  −3.2624  −0.0181
         0.0785  −0.5517  −3.2624  11.4610  −1.2720
        −0.1969   0.0277  −0.0181  −1.2720   1.8068 ]

Substituting the elements of the above matrix in Eq. (4.34) leads to:

T² = [(21)(28) / (21 + 28)] × [(157.381 − 158.429)(0.2061)(157.381 − 158.429)
     + (157.381 − 158.429)(−0.0694)(241.000 − 241.571) + ⋯
     + (20.810 − 20.839)(1.8068)(20.810 − 20.839)]
   = 2.824

which from Eq. (4.33) results in an F-statistic = (21 + 28 − 5 − 1)(2.824) / [(21 + 28 − 2)(5)] = 0.517 with d.f. = (5, 43).
This is clearly not significant since Fcritical = 2.4 (from Table A.6), and so there is no evidence to support that the population means of the two groups are statistically different when all five parameters are simultaneously considered. In this case one could have drawn such a conclusion directly from Table 4.7 by looking at the pairwise p-values, but this may not always happen. ■ Other than the elegance provided, there are two distinct advantages of performing a single multivariate test as against a series of univariate tests: the probability of finding a type I (false positive) result purely by accident increases as the number of variables increases, and the multivariate test takes proper account of the correlation between variables. The above example illustrated the case where no significant differences in population means could be discerned either from univariate tests performed individually or from an overall multivariate test. However, there are instances when the latter test turns out to be significant as a result of the cumulative effects of all parameters while no one parameter is significantly different. The converse may also hold: the evidence provided by one significantly different parameter may be swamped by the lack of differences between the other parameters. Hence, it is advisable to perform both kinds of tests, as illustrated in the example above. Sections 4.2, 4.3, and 4.4 treated several cases of hypothesis testing. An overview of these cases is provided in Fig. 4.13 for greater clarity, with the specific subsection of each case also indicated. The ANOVA case corresponds to the lower right box, namely testing for differences in the means of a single factor or variable sampled from several populations, while the Hotelling T²-test corresponds to the case when the means of several variables from two samples are evaluated.
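The Hotelling T² computation of Example 4.4.1 can be sketched in pure Python. The `solve` helper below is an illustrative Gaussian-elimination routine standing in for the explicit matrix inverse (solving C y = d gives y = C⁻¹ d directly); the matrix values are those quoted in the example:

```python
# Hotelling T^2 test for two multivariate samples (Example 4.4.1).
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y

n1, n2, p = 21, 28, 5
x1 = [157.381, 241.000, 31.433, 18.500, 20.810]
x2 = [158.429, 241.571, 31.479, 18.446, 20.839]
C1 = [[11.048, 9.100, 1.557, 0.870, 1.286],
      [9.100, 17.500, 1.910, 1.310, 0.880],
      [1.557, 1.910, 0.531, 0.189, 0.240],
      [0.870, 1.310, 0.189, 0.176, 0.133],
      [1.286, 0.880, 0.240, 0.133, 0.575]]
C2 = [[15.069, 17.190, 2.243, 1.746, 2.931],
      [17.190, 32.550, 3.398, 2.950, 4.066],
      [2.243, 3.398, 0.728, 0.470, 0.559],
      [1.746, 2.950, 0.470, 0.434, 0.506],
      [2.931, 4.066, 0.559, 0.506, 1.321]]

# Pooled covariance matrix, Eq. (4.31)
C = [[((n1 - 1) * C1[i][j] + (n2 - 1) * C2[i][j]) / (n1 + n2 - 2)
      for j in range(p)] for i in range(p)]

d = [a - b for a, b in zip(x1, x2)]                   # mean difference vector
y = solve(C, d)                                       # y = C^{-1} d
T2 = n1 * n2 / (n1 + n2) * sum(di * yi for di, yi in zip(d, y))   # Eq. (4.32)
F = (n1 + n2 - p - 1) * T2 / ((n1 + n2 - 2) * p)                  # Eq. (4.33)
```

The result reproduces the book's T² ≈ 2.82 and F ≈ 0.52 up to the rounding of the tabulated matrices, well below Fcritical = 2.4 for d.f. = (5, 43).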
As noted above, formal use of statistical methods can become very demanding mathematically and computationally when multivariate and multiple samples are considered, and hence the advantage of using numerically based resampling methods (discussed in Sect. 4.8).
Fig. 4.13 Overview of the various types of parametric hypothesis tests treated in this chapter along with section numbers (tests are organized by one sample, two samples, and multi-samples; by one variable, two variables, and multivariate; and by means/proportions, variances, probability distributions, and correlation coefficients, with the Hotelling T² test in Sect. 4.4.2 and ANOVA in Sect. 4.3). Nonparametric tests are also covered briefly, as indicated (Sects. 4.5.1, 4.5.2, and 4.5.3). The term "variable" is used instead of "parameters" since the latter are treated as random variables
4.5 Non-Parametric Tests
The parametric tests described above have implicit built-in assumptions regarding the distributions from which the samples are taken. Comparison of samples from populations using the t-test and F-test can yield misleading results when the random variables being measured are not normally distributed and do not have equal variances. Obviously, the fewer the assumptions, the broader the potential applications of a test. One would like the significance tests used to lead to sound conclusions, or at least that the risk of coming to misleading or wrong conclusions be minimized. Two concepts relate to the latter aspect. The robustness of a test is a measure of its insensitivity to violations of the underlying assumptions. The power of a test, on the other hand, is a measure of the extent to which the cost of experimentation is reduced.

Nonparametric tests can be applied to continuous numerical measurements but tend to be more widely used with categorical or ordinal data, i.e., data that can only be ranked in order of magnitude. This occurs frequently in management, social sciences, and psychology. For example, a consumer survey respondent may rate one product as better than another but be unable to assign quantitative values to each product. Data involving such "preferences" cannot be subjected to the t- and F-tests. It is in such cases that one must resort to nonparametric statistics. Rather than use the actual numbers, nonparametric tests usually use relative ranks, sorting the data by rank (or score) and discarding their specific numerical values. Because nonparametric tests do not use all the information contained in the data, they are generally less powerful than parametric ones (i.e., they tend to lead to wider confidence intervals). On the other hand, they are more intuitive, simpler to perform, more robust, and less sensitive to outlier points when the data are noisy or when the underlying empirical distribution of the parameters is non-Gaussian. Much of the material discussed below can be found in statistical textbooks; for example, McClave and Benson (1988), Devore and Farnum (2005), Walpole et al. (2007), and Heiberger and Holland (2015).

The nonparametric tests described in this section are meant to evaluate whether the medians of drawn samples/groups are close enough to conclude that they have emanated from the same population. The tests are predicated on the homogeneity of variation assumption, i.e., that the population variances of the dependent variable are equal within all groups, with the same Gaussian or non-Gaussian distribution shape. Whether this is credible could be evaluated by visually inspecting the sample distributions. A better approach is to use Levene's variance test. It is based on the fact that the absolute differences between all scores and their (group)
means should be roughly equal over groups. A larger variance implies that, on average, the data values are "further away" from their mean, and vice versa. Thus, this test is very similar to a one-factor ANOVA test based not on the actual data, but on the absolute difference scores.
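The description above translates directly into code: compute each observation's absolute deviation from its group mean, then run a one-factor ANOVA on those difference scores. A minimal sketch in pure Python, with made-up illustrative data (not from the text):

```python
from statistics import mean

# Sketch of Levene's test: a one-factor ANOVA on absolute deviations
# from each group's mean. The three groups are hypothetical numbers.
groups = [
    [5.1, 4.8, 5.6, 5.0, 4.9],
    [7.2, 6.8, 7.9, 6.1, 7.5],
    [5.9, 6.2, 6.0, 6.3, 5.8],
]
z = [[abs(x - mean(g)) for x in g] for g in groups]   # absolute difference scores

k = len(z)                            # number of groups
n = sum(len(g) for g in z)            # total observations
grand = mean(v for g in z for v in g) # grand mean of the difference scores

sstr = sum(len(g) * (mean(g) - grand) ** 2 for g in z)   # between-group SS
sse = sum((v - mean(g)) ** 2 for g in z for v in g)      # within-group SS
W = (sstr / (k - 1)) / (sse / (n - k))   # compare against F(k - 1, n - k)
```

A large W indicates that the groups' spreads differ, i.e., the homogeneity-of-variance assumption is suspect.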
4.5.1 Signed and Rank Tests for Medians

Some of the common nonparametric tests for the medians of one-sample and two-sample data are described below. No restriction is placed on the empirical distributions other than that they be continuous and symmetric (so that the mean and median are approximately equal).

(a) The Sign Test for medians

The median is the value about which one expects 50% of the observations to lie above and 50% below. The magnitude of each observation is converted to either a + sign or a − sign depending on whether the individual observation value is above or below the hypothesized median value. One would expect half of the signs to be positive (+) and half to be negative (−). A statistical test is performed on the actual number of + and − values to draw an inference. The procedure is best illustrated by means of an example.

Example 4.5.1 A building maintenance technician claims that he can repair most non-major mechanical breakdowns of the HVAC system within a median time of 6 h. The maintenance supervisor records the following response times for the last seven events (assumed independent):

Response time (h)          5.2     6.7     5.5     5.8     6.3     5.6
Difference from median    −0.8    +0.7    −0.5    −0.2    +0.3    −0.4
Sign of difference         −       +       −       −       +       −

In the observed sample of 7 events, there are 2 instances when the response time is greater than the median value of 6 h, i.e., a proportion of 2/7 = 0.286. The probability that two "successes" occurred purely by chance can be ascertained by using the binomial distribution (Eq. 2.37a) with p = 0.5, since only one of two outcomes is possible (the response time is either greater or less than the stated median):

B(2; 7, 0.5) = C(7, 2) × (1/2)² × (1/2)⁵ = (7 × 6 / 2) × (1/2)⁷ = 0.164, i.e., 16.4%

Similarly, B(0; 7, 0.5) = 0.5⁷ = 0.0078, and B(1; 7, 0.5) = 0.0547. Finally, the probability that two or fewer successes occurred purely by chance = 0.0078 + 0.0547 + 0.164 = 0.227. Since this probability is far larger than any common significance level (say, α = 0.05), the null hypothesis that the median response time is 6 h cannot be rejected. ■

When the number of observations is large, the calculations may become tedious. In such cases, it is convenient to use the binomial distribution tables (Appendix A1); alternatively, the normal distribution is a good approximation for ease of computation.

(b) Mann–Whitney or Wilcoxon rank sum test for medians of two samples

The sign test can be said to discard too much of the information in the data. A stronger test is the Mann–Whitney test (also referred to as the Wilcoxon rank sum test), used to evaluate whether the medians of two separate samples have been drawn from the same population or not. It is thus the nonparametric version of the two-sample t-test for sample means (Sect. 4.2.3). The sampled data must be continuous, and the sampled population should be close to symmetric; it can be either Gaussian or non-Gaussian (if Gaussian, parametric tests are preferable). The Mann–Whitney test involves ranking the individual observations of both samples combined, and then summing the ranks of the two groups separately. A test is performed on the two sums to deduce whether the two samples come from the same population or not. While simple and intuitive, the test is nonetheless grounded in statistical theory. The following example illustrates the approach.

Example 4.5.2 Ascertaining whether oil company researchers and academics differ in their predictions of future atmospheric carbon dioxide levels
The intent is to compare the predictions of the change in atmospheric carbon dioxide levels between researchers who are employed by oil companies and those who are in academia. The data gathered, shown in Table 4.8, are the percentage increases in carbon dioxide from the current level over the next 10 years as predicted by six oil company researchers and seven academics. Perform a statistical test at the 0.05 significance level to evaluate the following hypotheses: (i) predictions made by oil company researchers differ from those made by academics; (ii) predictions made by oil company researchers tend to be lower than those made by academics.
Table 4.8 Wilcoxon rank sum test calculation for two independent samples (Example 4.5.2)

      Oil company researchers         Academics
      Prediction (%)    Rank          Prediction (%)    Rank
1     3.5               4             4.7               6
2     5.2               7             5.8               9
3     2.5               2             3.6               5
4     5.6               8             6.2               11
5     2.0               1             6.1               10
6     3.0               3             6.3               12
7     –                 –             6.5               13
Sum                     25                              66

(i) First, both groups are combined into a single group and ranks are assigned over the combined group, starting from rank = 1 for the lowest value and proceeding in ascending order. Next, the ranks of each group are separately tabulated and summed. Since there are 13 predictions, the ranks run from 1 through 13, as shown in the table. The test statistic is based on the rank sums of each group taken separately (hence the name of the test). If the sums are close, the implication is that there is no evidence that the probability distributions of the two groups are different, and vice versa. Let TA and TB be the rank sums of the two groups. Then the sum of all the individual ranks is:

TA + TB = n(n + 1)/2 = 13(13 + 1)/2 = 91    (4.35)

where n = n1 + n2, with n1 = 6 and n2 = 7. Note that n1 should be taken as the sample with fewer observations. Since (TA + TB) is fixed, a small value of TA implies a large value of TB, and vice versa. Hence, the greater the difference between the two rank sums, the greater the evidence that the samples come from different populations. Since one is testing whether the predictions by the two groups are different or not, the two-tailed significance test is appropriate. Table A.11 provides the lower and upper cutoff values for different values of n1 and n2 for both the one-tailed and the two-tailed tests. The lower and upper cutoff values are (28, 56) at the 0.05 significance level for the two-tailed test. The computed statistics TA = 25 and TB = 66 fall outside this range; hence the null hypothesis is rejected, and one would conclude that the predictions from the two groups are different.

(ii) Here one wishes to test the hypothesis that the predictions by oil company researchers are lower than those made by academics. One then uses a one-tailed test whose cutoff values are given in part (b) of Table A.11. These cutoff values at the 0.05 significance level are (30, 54), but only the lower value of 30 is relevant for the problem specified. The null hypothesis will be rejected only if TA < 30. Since this is so, the above data suggest that the null hypothesis can be rejected at a significance level of 0.05. ■

(c) The Wilcoxon signed rank sum test for medians of two paired samples

This test is meant for paired data where the samples taken are not independent. It is analogous to the two-sample paired difference test treated in Sect. 4.2.3b, where the paired differences in the data are converted to a single random variable. Again, the sampled data must be continuous, and the sampled population should be close to symmetric. The Wilcoxon signed rank sum test involves calculating the differences between the paired data, ranking their absolute values, and summing the positive and negative ranks separately. A statistical test is finally applied to reach a conclusion on whether the two distributions are significantly different or not. This is illustrated by the following example.

Example 4.5.3 Evaluating the predictive accuracy of two climate change models from expert elicitation
A policy maker wishes to evaluate the predictive accuracy of two different climate change models for predicting short-term (e.g., 30 years) carbon dioxide changes in the atmosphere. He consults 10 experts and asks them to grade the models on a scale from 1 to 10, with 10 being extremely accurate. Clearly, these data are not independent, since the same expert is asked to make two value judgments about the models being evaluated. The data shown in Table 4.9 are obtained (note that these are not ranked values, except for the last column, but grades from 1 to 10 assigned by the experts). The tests are on the medians, which are approximately equal to the means for symmetric distributions.
Null hypothesis H0: μ1 − μ2 = 0 (there is no significant difference in the distributions)
Alternative hypothesis Ha: μ1 − μ2 ≠ 0 (there is a significant difference in the distributions)

The paired differences are first computed (as shown in Table 4.9), the ranks are then generated based on the absolute differences, and finally the sums of the positive and negative ranks are computed. Note how the ranking has been assigned, since there are repeats in the absolute difference values. There are three values of "1" in the absolute difference column, which would occupy ranks 1–3; hence the mean rank "2" has been assigned to all three. Similarly, the three absolute differences of "2" are each given the rank "5", and so on. For the highest absolute difference of "6", the rank assigned is "10". The values shown in the last two rows of the table are also simple to deduce: one adds up the rank values corresponding to the cases when (A − B) is positive, and likewise when they are negative. These sums are found to be 46 and 9, respectively. The test statistic for the null hypothesis is T = min(T−, T+). In our case, T = 9. The smaller the value of T, the stronger the evidence that the difference between the two distributions is important. The rejection region for T is determined from Table A.12. The two-tailed critical value for n = 10 at the 0.05 significance level is 8. Since the computed value of T is higher, one cannot reject the null hypothesis, and so one would conclude that there is not enough evidence to suggest that one of the models is more accurate than the other at the 0.05 significance level. Note that if a significance level of 0.10 were selected, the null hypothesis would have been rejected. Looking at the ratings shown in Table 4.9, one notices that they seem to be generally higher for model A than for model B. In case one wishes to test the hypothesis, at a significance level of 0.05, that the experts deem model B to be less
accurate than model A, one would use T− as the test statistic and compare it to the one-tailed critical value in Table A.12. Since the critical value is 11 for n = 10, which is greater than 9, one would reject the null hypothesis. This example illustrates the fact that it is important to frame the problem correctly in terms of whether a one-tailed or a two-tailed test is more appropriate. ■
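The ranking mechanics of Example 4.5.3, including the mean-rank handling of ties, can be sketched in pure Python:

```python
# Wilcoxon signed rank sums for Example 4.5.3 (tied absolute
# differences receive the mean of the ranks they would occupy).
A = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]   # expert grades, model A
B = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]   # expert grades, model B

d = [a - b for a, b in zip(A, B)]                    # paired differences
order = sorted((abs(v), i) for i, v in enumerate(d)) # sort by |difference|

ranks = [0.0] * len(d)
i = 0
while i < len(order):
    j = i
    while j < len(order) and order[j][0] == order[i][0]:
        j += 1                       # find the extent of the tie group
    avg = (i + 1 + j) / 2            # mean of ranks i+1 .. j
    for k in range(i, j):
        ranks[order[k][1]] = avg
    i = j

T_plus = sum(r for r, v in zip(ranks, d) if v > 0)   # -> 46.0
T_minus = sum(r for r, v in zip(ranks, d) if v < 0)  # -> 9.0
T = min(T_plus, T_minus)             # test statistic, compared with Table A.12
```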
4.5.2 Kruskal–Wallis Multiple Samples Test for Medians

Recall that the single-factor ANOVA test was described in Sect. 4.3.1 for inferring whether mean values from several samples emanate from the same population or not, with the necessary assumption of normal distributions. The Kruskal–Wallis H test (Kruskal and Wallis 1952) is the nonparametric or distribution-free equivalent of the F-test used in one-factor ANOVA, but applies to the medians. It can also be taken to be the extension or generalization of the rank sum test to more than two groups. Hence, the test applies to the case when one wishes to compare more than two groups whose distributions should be symmetric but need not be normal. Again, the evaluation is based on the rank sums, where the ranking is made on the samples of all k groups combined. The test is framed as follows:

Null hypothesis H0: All populations have identical probability distributions
Alternative hypothesis Ha: Probability distributions of at least two populations are different

Let R1, R2, R3 denote the rank sums of, say, three samples. The H-test statistic measures the extent to which the three samples differ with respect to their relative ranks, and is given by:
Table 4.9 Wilcoxon signed rank sum test calculation for paired non-independent samples (Example 4.5.3)

Expert   Model A   Model B   Difference (A − B)   Absolute difference   Rank
1        6         4          2                   2                     5
2        8         5          3                   3                     7.5
3        4         5         −1                   1                     2
4        9         8          1                   1                     2
5        4         1          3                   3                     7.5
6        7         9         −2                   2                     5
7        6         2          4                   4                     9
8        5         3          2                   2                     5
9        6         7         −1                   1                     2
10       8         2          6                   6                     10
Sum of positive ranks T+ = 46
Sum of negative ranks T− = 9
4 Making Statistical Inferences from Samples
Table 4.10 Data table for Example 4.5.4 (data available electronically on the book website)

#    Agriculture (# employees)   Rank       Manufacturing (# employees)   Rank        Service (# employees)   Rank
1    10                          5          244                           25          17                      9.5
2    350                         27         93                            19          249                     26
3    4                           2          3532                          30          38                      15
4    26                          13         17                            9.5         5                       3
5    15                          8          526                           29          101                     20
6    106                         21         133                           22          1                       1
7    18                          11         14                            7           12                      6
8    23                          12         192                           23          233                     24
9    62                          17         443                           28          31                      14
10   8                           4          69                            18          39                      16
                                 R1 = 120                                 R2 = 210.5                          R3 = 134.5
H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \qquad (4.36)

where k is the number of groups, n_j is the number of observations in the jth sample, and n is the total sample size (n = n_1 + n_2 + ⋯ + n_k). The factor 12 arises naturally from the expression for the sample variance of the ranks of the outcomes (for the mathematical derivation, see Kruskal and Wallis, 1952). Thus, if the H-statistic is close to zero, one would conclude that all groups have the same mean rank, and vice versa. The distribution of the H-statistic is approximated by the chi-square distribution, which is used to make statistical inferences. The following example illustrates the approach.

Example 4.5.4^12 Evaluating probability distributions of the number of employees in three different occupations using a nonparametric test
One wishes to compare, at a significance level of 0.05, the number of employees in companies representing each of three different business classifications, namely agriculture, manufacturing, and service. Samples from ten companies each were gathered, as shown in Table 4.10. Since the distributions are unlikely to be normal (for example, one detects some large numbers in the first and third columns), a nonparametric test is appropriate. First, the individual ranks for all samples from the three classes combined are generated, as tabulated in the rank columns of Table 4.10. The rank sums R_j are also computed and shown in the last row. Note that n = 30, while n_j = 10. The test statistic H is computed next:

H = \frac{12}{30(31)} \left( \frac{120^2}{10} + \frac{210.5^2}{10} + \frac{134.5^2}{10} \right) - 3(31) = 99.097 - 93 = 6.097
The degrees of freedom are the number of groups minus one, i.e., d.f. = 3 − 1 = 2. From the chi-square tables (Table A.5), the critical value at α = 0.05 is 5.991. Since the computed H value exceeds this threshold, one would reject the null hypothesis at the 95% CL and conclude that at least two of the three probability distributions describing the number of employees in the sectors are different. However, the verdict is marginal since the computed H statistic is close to the critical value. It would be wise to consider the practical implications of the statistical inference test and perform a decision analysis study. ■
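Equation (4.36) is easy to verify numerically. The sketch below recomputes H for the Table 4.10 data in pure Python, deliberately omitting the tie correction that library routines (e.g., scipy.stats.kruskal) apply, so that it reproduces the hand calculation:

```python
# Kruskal-Wallis H statistic of Eq. (4.36) for the Table 4.10 data
# (no tie correction, matching the hand calculation of Example 4.5.4).
groups = {
    "agriculture":   [10, 350, 4, 26, 15, 106, 18, 23, 62, 8],
    "manufacturing": [244, 93, 3532, 17, 526, 133, 14, 192, 443, 69],
    "service":       [17, 249, 38, 5, 101, 1, 12, 233, 31, 39],
}

pooled = sorted(v for sample in groups.values() for v in sample)

def mean_rank(v):
    # 1-based positions in the pooled sorted list; tied values share the average
    pos = [i + 1 for i, x in enumerate(pooled) if x == v]
    return sum(pos) / len(pos)

n = len(pooled)   # 30
rank_sums = {g: sum(mean_rank(v) for v in s) for g, s in groups.items()}
H = 12 / (n * (n + 1)) * sum(R**2 / len(groups[g])
                             for g, R in rank_sums.items()) - 3 * (n + 1)
print(round(H, 3))   # 6.097
```

The rank sums come out to 120, 210.5, and 134.5, as in the last row of Table 4.10.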
4.5.3
Test on Spearman Rank Correlation Coefficient
The Pearson correlation coefficient (Sect. 3.4.2) was a parametric measure meant to quantify the correlation between two quantifiable variables. The Spearman rank correlation coefficient r_Sp is similar in definition to the Pearson correlation coefficient but uses the relative ranks of the data instead of the numerical values themselves. The same equation as Eq. (3.12) can be used to compute this measure, with its magnitude and sign interpreted in the same fashion. However, a simpler formula is often used to calculate the Spearman rank correlation coefficient (McClave and Benson 1988):

r_{Sp} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (4.37)

^12 From McClave and Benson (1988) by permission of Pearson Education.
4.5 Non-Parametric Tests
Table 4.11 Data table for Example 4.5.5 showing how to conduct the nonparametric correlation test

Faculty   Research grants ($)   Teaching evaluation (out of 10)   Research rank (u_i)   Teaching rank (v_i)   Difference (d_i)   Diff squared (d_i^2)
1         1,480,000             7.05                              5                     7                     −2                 4
2         890,000               7.87                              1                     8                     −7                 49
3         3,360,000             3.90                              10                    2                     8                  64
4         2,210,000             5.41                              8                     5                     3                  9
5         1,820,000             9.02                              7                     9                     −2                 4
6         1,370,000             6.07                              4                     6                     −2                 4
7         3,180,000             3.20                              9                     1                     8                  64
8         930,000               5.25                              2                     4                     −2                 4
9         1,270,000             9.50                              3                     10                    −7                 49
10        1,610,000             4.45                              6                     3                     3                  9
                                                                                                             Total              260

where n is the number of paired measurements, and d_i = u_i − v_i is the difference between the ranks of the ith measurement for the ranked variables u and v.

Example 4.5.5 Nonparametric testing of correlation between the sizes of faculty research grants and teaching evaluations
The provost of a major university wants to determine whether a statistically significant correlation exists between the research grants and the teaching evaluation ratings of its senior faculty. Data over 3 years have been collected, as assembled in Table 4.11, which also shows the manner in which the ranks have been generated and the quantities d_i = (u_i − v_i) computed. Using Eq. (4.37) with n = 10, the Spearman rank correlation coefficient is:

r_{Sp} = 1 - \frac{6(260)}{10(100 - 1)} = -0.576

Thus, one notes that there exists a negative correlation between the sample data. However, whether this holds at the population level requires that a statistical test be performed on the correlation coefficient r_Sp:

Null hypothesis H0: r_Sp = 0 (there is no significant population correlation)
Alternative hypothesis Ha: r_Sp ≠ 0 (there is a significant population correlation)

Table A.10 in Appendix A gives the absolute cutoff values for different significance levels of the Spearman rank correlation. For n = 10, the one-tailed absolute critical value for α = 0.05 is r_Sp,α = 0.564. This implies that there is a negative
correlation between research grants and teaching evaluations which differs statistically from 0 at a significance level of 0.05 (albeit barely). It is interesting to point out that had a parametric analysis been undertaken, the corresponding Pearson correlation coefficient (Sect. 3.4.2) would have been r = −0.620 and deemed significant at α = 0.05 (the critical value from Table A.7 is 0.549). The correlation coefficients by both methods are quite close (−0.576 and −0.620), with the parametric method indicating the stronger correlation. However, nonparametric tests are distribution-free and, in that sense, more robust. It is advisable, as far as possible, to perform both types of tests and then draw conclusions. ■

The aspect of how confidence intervals widen as n decreases has been previously discussed in Sect. 4.2.7 for the Pearson correlation coefficient. The number of data points n also has a large effect on whether the Spearman correlation coefficient r_Sp determined from a data set is significant or not (Wolberg, 2006). For values of n greater than about 10, the random variable z defined below is approximately a standard normal variable (assuming a Gaussian distribution):

z = r_{Sp} (n - 1)^{1/2} \qquad (4.38)
From Table A.3, the one-tailed critical value is z_α = 1.645 for a 5% significance level. From Eq. (4.38), for a sample of n = 101, the critical value is r_Sp,α = 1.645/(101 − 1)^{1/2} = 0.1645. However, for a sample size of n = 10, the critical value is r_Sp,α = 1.645/(10 − 1)^{1/2} = 0.548, which is 3.3 times greater than the previous estimate! This simple example serves to illustrate the importance of the number of data points on the significance test of a sample correlation coefficient.
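Equation (4.37) applied to the Table 4.11 data can be checked with a short script (a sketch; the simple ranking below works here because neither column contains ties):

```python
# Spearman rank correlation via Eq. (4.37), reproducing Example 4.5.5.
grants = [1480000, 890000, 3360000, 2210000, 1820000,
          1370000, 3180000, 930000, 1270000, 1610000]
teaching = [7.05, 7.87, 3.90, 5.41, 9.02, 6.07, 3.20, 5.25, 9.50, 4.45]

def ranks(values):
    # rank 1 = smallest; valid here because there are no ties
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

u, v = ranks(grants), ranks(teaching)
n = len(grants)
d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))   # sum of squared rank differences
r_sp = 1 - 6 * d2 / (n * (n**2 - 1))
print(d2, round(r_sp, 3))   # 260 -0.576
```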
4.6
Bayesian Inferences
4.6.1
Background
Bayes' theorem, and how it can be used for probability-related problems, was treated in Sect. 2.5. Its strength lies in the fact that it provides a framework for including prior information in a two-stage (or multistage) experiment, whereby one can draw stronger conclusions than one could with observational data alone. It is especially advantageous for small data sets, and it was shown that its predictions converge with those of the classical method in two cases: (i) as the data set of observations gets larger; and (ii) if the prior distribution is modeled as a uniform distribution. It was pointed out that advocates of the Bayesian approach view probability as a degree of belief held by a person about an uncertain issue, as compared to the objective view of long-run relative frequency held by traditionalists. This section will discuss how the Bayesian approach can be used to make statistical inferences from samples about an uncertain population parameter, and to address hypothesis testing problems.
4.6.2
Estimating Population Parameter from a Sample
Consider the case when the population mean μ is to be estimated (point and interval estimates) from the sample mean x̄, with the population distribution assumed to be Gaussian with a known standard deviation σ. This case is given by the sampling distribution of the mean x̄ treated in Sect. 4.2.1. The probability P of a two-tailed distribution at significance level α can be expressed as:

P\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{n^{1/2}} < \mu < \bar{x} + z_{\alpha/2} \frac{\sigma}{n^{1/2}} \right) = 1 - \alpha \qquad (4.39)
where n is the sample size and z is the value from the standard normal tables. The traditional or frequentist interpretation is that one can be (1 − α) confident that the above interval contains the true population mean (see Sect. 4.2.1c). However, the interval itself should not be interpreted as a probability interval for the parameter. The Bayesian approach uses the same formula, but the mean is modified since the posterior distribution is now used, which includes the sample data as well as the prior belief. The confidence interval is referred to as the credible interval (also referred to as the Bayesian confidence interval). The Bayesian interpretation is that the value of the mean is fixed but has been chosen from some known (or assumed) prior probability distribution. The data collected allow one to recalculate the probability of different values of the mean (i.e., the posterior probability), from which the (1 − α)
credible interval can be surmised. Thus, the traditional approach leads to a probability statement about the interval, while the Bayesian approach leads to one about the population mean parameter (Phillips 1973). The credible interval is usually narrower than the traditional confidence interval. The relevant procedure to calculate credible intervals for the case of a Gaussian population and a Gaussian prior is presented without proof below (Wonnacott and Wonnacott 1985). Let the prior distribution, assumed normal, be characterized by a mean μ0 and variance σ0², while the sample mean and standard deviation values are x̄ and s_x. Selecting a prior distribution is equivalent to having a quasi-sample of size n0 given by:

n_0 = \frac{s_x^2}{\sigma_0^2} \qquad (4.40)
The posterior mean and standard deviation μ* and σ* are then given by:

\mu^* = \frac{n_0 \mu_0 + n \bar{x}}{n_0 + n} \quad \text{and} \quad \sigma^* = \frac{s_x}{(n_0 + n)^{1/2}} \qquad (4.41)
Note that the expression for the posterior mean is simply the weighted average of the sample and the prior means and is likely to be less biased than the sample mean alone. Similarly, the sample standard deviation is divided by the square root of the total effective sample size, which results in increased precision. However, had a different prior rather than the normal distribution been assumed above, a slightly different interval would have resulted, which is another reason why traditional statisticians (so-called frequentists) are uneasy about fully endorsing the Bayesian approach.

Example 4.6.1 Comparison of classical and Bayesian confidence intervals
A certain solar PV module is rated at 60 W with a standard deviation of 2 W. Since the rating varies somewhat from one shipment to the next, a sample of 12 modules has been selected from a shipment and tested, yielding a mean of 65 W and a standard deviation of 2.8 W. Assuming a Gaussian distribution, determine the two-tailed 95% CI by both the traditional and the Bayesian approaches.

(a) Traditional approach:

\mu = \bar{x} \pm 1.96 \frac{s_x}{n^{1/2}} = 65 \pm 1.96 \frac{2.8}{12^{1/2}} = 65 \pm 1.58
Note that a value of 1.96 is used from the z tables even though the sample is small since the distribution is assumed to be Gaussian.
(b) Bayesian approach. Using Eq. (4.40) to calculate the quasi-sample size inherent in the prior:

n_0 = \frac{2.8^2}{2^2} = 1.96 \simeq 2.0

i.e., the prior is equivalent to information from testing an additional 2 modules. Next, Eq. (4.41) is used to determine the posterior mean and standard deviation:

\mu^* = \frac{2(60) + 12(65)}{2 + 12} = 64.29 \quad \text{and} \quad \sigma^* = \frac{2.8}{(2 + 12)^{1/2}} = 0.748

The Bayesian 95% CI is then:

\mu = \mu^* \pm 1.96 \sigma^* = 64.29 \pm 1.96(0.748) = 64.29 \pm 1.47

Since prior information has been used, the Bayesian interval is likely to be better centered and more precise (with a narrower interval) than the traditional or classical interval.
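The two-line posterior update of Eqs. (4.40) and (4.41) is easily scripted. The sketch below reproduces Example 4.6.1, rounding the quasi-sample size to 2 as done in the text:

```python
# Bayesian credible interval via Eqs. (4.40)-(4.41) for Example 4.6.1.
from math import sqrt

mu0, sigma0 = 60.0, 2.0       # prior: rated mean and standard deviation (W)
xbar, s_x, n = 65.0, 2.8, 12  # shipment sample statistics
z = 1.96                      # two-tailed z value for 95% confidence

n0 = round(s_x**2 / sigma0**2)              # Eq. (4.40): 1.96, rounded to 2
mu_star = (n0 * mu0 + n * xbar) / (n0 + n)  # Eq. (4.41): posterior mean
sigma_star = s_x / sqrt(n0 + n)             # Eq. (4.41): posterior std. dev.

print(f"{mu_star:.2f} +/- {z * sigma_star:.2f}")  # 64.29 +/- 1.47
```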
4.6.3
Hypothesis Testing
Section 4.2 dealt with the traditional approach to hypothesis testing, where one frames the problem in terms of two competing claims. The application areas discussed involved testing for a single sample mean, testing for two sample means and paired differences, testing for single and two sample variances, testing for distributions, and testing on the Pearson correlation coefficient. In all these cases, one proceeds by defining two hypotheses:
• The null hypothesis (H0), which represents the status quo, i.e., the hypothesis that will be accepted unless the data provide convincing evidence to the contrary.
• The research or alternative hypothesis (Ha), which is the premise that the variation observed in the data sample cannot be ascribed to random variability or chance alone, and that there must be some inherent structural or fundamental cause.
Thus, the traditional or frequentist approach is to divide the sample space into an acceptance region and a rejection region and posit that the null hypothesis can be rejected only if the observed value of the test statistic falls in the rejection region, i.e., is too unlikely to be ascribed to chance or randomness at the preselected significance level α. Advocates of the Bayesian
approach have several objections to this line of thinking (Phillips 1973):
(i) The null hypothesis is rarely of much interest. The precise specification of, say, the population mean is of limited value; rather, ascertaining a range would be more useful.
(ii) The null hypothesis is only one of many possible values of the uncertain variable, and the undue importance placed on this value is unjustified.
(iii) As additional data are collected, the inherent randomness in the collection process would lead to the null hypothesis being rejected in most cases.
(iv) Erroneous inferences from a sample may result if prior knowledge is not considered.
The Bayesian approach to hypothesis testing is not to base the conclusions on a traditional significance level like α < 0.05. Instead, it makes use of the posterior credible interval for the population mean μ of the sample collected, tested against a prior mean value μ0. The procedure is described in texts such as Bolstad (2004) and illustrated in the following example.

Example 4.6.2 Traditional and Bayesian approaches to determining confidence intervals
The life of a certain type of smoke detector battery is specified as having a mean of 32 months and a standard deviation of 0.5 months. A building owner decides to test this claim at a significance level of 0.05. He tests a sample of 9 batteries and finds a mean of 31 months and a sample standard deviation of 1 month. Note that this is a one-sided hypothesis test case.

(a) The traditional approach would entail testing H0: μ = 32 versus Ha: μ < 32. The Student-t value is:

t = \frac{31 - 32}{1/\sqrt{9}} = -3.0

From Table A.4, the critical value for d.f. = 8 is t_{0.05} = −1.86. Thus, he can reject the null hypothesis and state that the claim of the manufacturer is incorrect.

(b) The Bayesian approach, on the other hand, would require calculating the posterior probability of the null hypothesis. The prior distribution has a mean μ0 = 32 and variance σ0² = 0.5² = 0.25. First, use Eq. (4.40) to determine n0 = 1²/0.5² = 4; i.e., the prior information is "equivalent" to increasing the sample size by 4. Next, use Eq. (4.41) to determine the posterior mean and standard deviation:
\mu^* = \frac{4(32) + 9(31)}{4 + 9} = 31.3 \quad \text{and} \quad \sigma^* = \frac{1.0}{(4 + 9)^{1/2}} = 0.277

From here, t = (31.3 − 32.0)/0.277 = −2.53. This t-value lies beyond the critical value t_{0.05} = −1.86, so the building owner can again reject the null hypothesis. In this case, both approaches gave the same result, but this is not always true. ■
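The same posterior update drives the Bayesian side of Example 4.6.2; the sketch below keeps full precision, so the resulting t of about −2.5 differs slightly from the −2.53 obtained with rounded intermediates:

```python
# Posterior update of Example 4.6.2 (battery life) via Eqs. (4.40)-(4.41).
from math import sqrt

mu0, sigma0 = 32.0, 0.5      # prior (manufacturer's claim), months
xbar, s_x, n = 31.0, 1.0, 9  # sample statistics

n0 = s_x**2 / sigma0**2                     # Eq. (4.40): quasi-sample size = 4
mu_star = (n0 * mu0 + n * xbar) / (n0 + n)  # Eq. (4.41): posterior mean
sigma_star = s_x / sqrt(n0 + n)             # Eq. (4.41): posterior std. dev.
t_post = (mu_star - mu0) / sigma_star       # compare against t_crit = -1.86

print(round(mu_star, 1), round(sigma_star, 3), round(t_post, 2))
```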
4.7
Some Considerations About Sampling
4.7.1
Random and NonRandom Sampling Methods
A sample is a limited portion, or a finite number of items/elements/members, drawn from a larger entity called the population, about which information and characteristic traits are sought. Point and interval estimation, as well as the notions of inferential statistics covered in the previous sections, involved the use of samples drawn from some underlying population. The premise was that finite samples would reduce the expense associated with the estimation, while the associated uncertainty, which would consequently creep into the estimation process, could be estimated and managed. It is quite clear that the sample drawn must be representative of the population and that the sample size should be such that the uncertainty is within certain preset bounds. However, there are different ways by which one could draw samples; this aspect falls under the purview of sampling design. Since these methods have different implications, they are discussed in this section. There are three general rules of sampling design:
(i) The more representative the sample of the population, the better the results.
(ii) All else being equal, larger samples yield better results, i.e., more precise estimates with narrower uncertainty bands.
(iii) Larger samples cannot compensate for a poor sampling design plan or a poorly executed plan.
Some of the common sampling methods are described below:
(a) Random sampling (also called simple random sampling) is the simplest conceptually and is most widely used. It involves selecting the sample of n elements in such a way that all possible samples of n elements have the same chance of being selected. Two important strategies of random sampling involve:
(i) Sampling with replacement, in which the object selected is put back into the population pool and
has the possibility of being selected again in subsequent picks, and
(ii) Sampling without replacement, where the object picked is not put back into the population pool prior to picking the next item.
Random sampling without replacement of n objects from a population N can be practically implemented in one of several ways. The most common is to order the objects of the population (e.g., 1, 2, 3, . . ., N), use a random number generator to generate n numbers from 1 to N without replication, and pick only the objects whose numbers have been generated. This approach is illustrated by means of the following example. A consumer group wishes to select a sample of 5 cars from a lot of 500 cars for crash testing. It assigns integers from 1 to 500 to the cars on the lot, uses a random number generator to select a set of 5 integers, and then selects the 5 cars corresponding to the 5 integers picked randomly. Dealing with random samples has several advantages: (i) any random subsample of a random sample or its complement is also a random sample; (ii) after a random sample has been selected, any random sample from its complement can be added to it to form a larger random sample.
(b) Nonrandom sampling occurs when the selection of members from the population is done according to some method or preset process which is not random. Often it occurs unintentionally, with the experimenter thinking that he is dealing with random samples when he is not. In such cases, bias or skewness is introduced, and one obtains misleading confidence limits which may lead to erroneous inferences, depending on the degree of nonrandomness in the data set. However, in some cases, the experimenter intentionally selects the samples in a nonrandom manner and analyzes the data accordingly. This can result in the required conclusions being reached with reduced sample sizes, thereby saving resources.
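The car-lot illustration of random sampling without replacement maps directly onto a standard library call (a minimal sketch; the seed is ours, chosen only to make the draw reproducible):

```python
# Simple random sampling without replacement: pick 5 of 500 numbered cars.
import random

random.seed(1)                     # for a reproducible draw
lot = range(1, 501)                # cars numbered 1..500
picked = random.sample(lot, k=5)   # draws distinct integers (no replacement)
print(sorted(picked))
```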
There are different types of nonrandom sampling (ASTM E 1402 1996), and some of the important ones are listed below: (i) Stratiﬁed sampling in which the target population is such that it is amenable to partitioning into disjoint subsets or strata based on some criterion. Samples are selected independently from each stratum, possibly of different sizes. This improves efﬁciency of the sampling process in some instances and is discussed at more length in Sect. 4.7.4. (ii) Cluster sampling in which natural occurring strata or clusters are ﬁrst selected, then random sampling is done to identify a subset of clusters, and ﬁnally all the elements in the picked clusters are selected for analysis. For example, a state can be divided into
districts and then into municipalities for final sample selection. This approach is used often in marketing research.
(iii) Sequential sampling is a quality control procedure where a decision on the acceptability of a batch of products is made from tests done on a sample of the batch. Tests are done on a preliminary sample and, depending on the results, either the batch is accepted or further sampling tests are performed. This procedure usually requires, on average, fewer samples to be tested to meet a prestipulated accuracy.
(iv) Composite sampling, where elements from different samples drawn over a designated time period are combined together. An example is mixing water samples drawn hourly to form a composite sample over a day.
(v) Multistage or nested sampling, which involves selecting a sample in stages. A larger sample is first selected, and then subsequently smaller ones. For example, for testing indoor air quality in a population of office buildings, the design could involve selecting individual buildings during the first stage of sampling, choosing specific floors of the selected buildings in the second stage of sampling, and finally, selecting specific rooms on the floors chosen to be tested during the third and final stage.
(vi) Convenience sampling, also called opportunity sampling, is a method of choosing samples arbitrarily following the manner in which they are acquired. If the situation is such that a planned experimental design cannot be followed, the analyst must make do with the samples collected in this manner. Though impossible to treat rigorously, it is commonly encountered in many practical situations.

4.7.2
Desirable Properties of Estimators

Fig. 4.14 Concept of biased and unbiased estimators
Fig. 4.15 Concept of efficiency of estimators

The parameters from a sample are random variables since different sets of samples will result in different values of the parameters. Recall the definition of two seemingly analogous, but distinct, terms: estimators are mathematical expressions to be applied to sample data which yield random variables, while an estimate is a specific number or value of this random variable. Commonly encountered estimators are the mean, median, standard deviation, etc. Since the search for estimators is the crux of the parameter estimation process, certain basic notions and desirable properties of estimators need to be explicitly recognized (a good discussion is provided by Pindyck and Rubinfeld 1981). Many of these concepts are logical extensions of the concepts applicable to
errors and also apply to the regression models treated in Chap. 5. For example, consider the case where inferences about the population mean parameter μ are to be made from the sample mean estimator x̄.
(a) Lack of bias: A very desirable property is for the distribution of the estimator to have the parameter as its mean value (see Fig. 4.14). Then, if the experiment were repeated many times, one would at least be assured of being right on average. In such a case, the bias E(x̄ − μ) = 0, where E represents the expected value. An example of bias in an estimator is when (n) is used rather than (n − 1) while calculating the standard deviation following Eq. (3.8).
(b) Efficiency: Lack of bias provides no indication regarding variability. Efficiency is a measure of how small the dispersion can possibly get. The mean x̄ is said to be an efficient unbiased estimator if, for a given sample size, the variance of x̄ is smaller than the variance of any other unbiased estimator (see Fig. 4.15), i.e., it attains the smallest limiting variance that can be achieved. More often, a relative order of merit, called the relative efficiency, is used, defined as the ratio of both variances. Efficiency is desirable since the greater the efficiency associated with an estimation process, the stronger the statistical or inferential statements one can make about the estimated parameters. Consider the following example (Wonnacott and Wonnacott 1985). If a population being sampled is
Fig. 4.16 Concept of mean square error which includes bias and efﬁciency of estimators
symmetric, its center can be estimated without bias by either the sample mean x̄ or the sample median x̃. For some populations x̄ is more efficient; for others x̃ is. In the case of a normal parent distribution, the standard error of the median is SE(x̃) = 1.25σ/n^{1/2}. Since SE(x̄) = σ/n^{1/2}, the efficiency of x̄ relative to x̃ is:

\text{Efficiency} = \frac{\mathrm{var}(\tilde{x})}{\mathrm{var}(\bar{x})} = 1.25^2 = 1.56 \qquad (4.42)
(c) Mean square error: There are many circumstances in which one is forced to trade off the bias and variance of estimators. When the goal of a model is to maximize the precision of predictions, for example, an estimator with very low variance and some bias may be more desirable than an unbiased estimator with high variance (see Fig. 4.16). One criterion which is useful in this regard is the goal of minimizing the mean square error (MSE), defined as:

\mathrm{MSE}(\bar{x}) = E(\bar{x} - \mu)^2 = [\mathrm{Bias}(\bar{x})]^2 + \mathrm{var}(\bar{x}) \qquad (4.43)
where E(x) is the expected value of x. Thus, when x is unbiased, the mean square error and variance of the estimator x are equal. MSE may be regarded as a generalization of the variance concept. This leads to the generalized deﬁnition of the relative efﬁciency of two estimators, whether biased or unbiased: “efﬁciency is the ratio of both MSE values.” (d) Consistency: Consider the properties of estimators as the sample size increases. In such cases, one would like the estimator x to converge to the true value, or stated differently, the probability limit of x (plim x) should equal μ as sample size n approaches inﬁnity (see Fig. 4.17). This leads to the criterion of consistency: x is a consistent
Fig. 4.17 A consistent estimator is one whose distribution becomes gradually peaked as the sample size n is increased
estimator of μ if plim(x̄) = μ. In other words, as the sample size grows larger, a consistent estimator tends to approximate the true parameter, i.e., the mean square error of the estimator approaches zero. Thus, one of the conditions that makes an estimator consistent is that both its bias and variance approach zero in the limit. However, it does not necessarily follow that an unbiased estimator is a consistent estimator. Although consistency is an abstract concept, it often provides a useful preliminary criterion for sorting out estimators. Generally, one tends to be more concerned with consistency than with lack of bias as an estimation criterion. A biased yet consistent estimator may not equal the true parameter on average but will approximate the true parameter as the sample information grows. This is more reassuring practically than the alternative of finding a parameter estimate which is unbiased initially yet continues to deviate substantially from the true parameter as the sample size gets larger. However, to finally settle on the best estimator, efficiency is the more powerful criterion. As discussed earlier, the sample mean is preferable to the median for estimating the center of a normal population because the former is more efficient, though both estimators are clearly consistent and unbiased.
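The efficiency claim of Eq. (4.42) can be checked by simulation. The sketch below draws repeated normal samples and compares the variances of the sample median and sample mean; the ratio should land near 1.25² = 1.56 (replication counts and seed are arbitrary choices of ours):

```python
# Monte Carlo check of Eq. (4.42): for normal data the sample median is
# less efficient than the sample mean, var(median)/var(mean) ~ 1.56.
import random
import statistics

random.seed(42)
n, reps = 100, 2000
means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

rel_eff = statistics.pvariance(medians) / statistics.pvariance(means)
print(round(rel_eff, 2))   # typically close to 1.56
```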
4.7.3
Determining Sample Size During Random Surveys
Population census, market surveys, pharmaceutical ﬁeld trials, etc. are examples of survey sampling. These can be done in one of two ways which are discussed in this section and in the next. The discussion and equations presented in the previous subsections pertain to random sampling. Survey sampling frames the problem using certain terms slightly
different from those presented above. Here, a major issue is to determine the sample size which can meet a certain prestipulated precision at predefined confidence levels. The estimates from the sample should be close enough to the population characteristic to be useful for drawing conclusions and taking subsequent decisions. One generally assumes the underlying probability distribution to be normal (this may not be correct since lognormal distributions are also encountered often). Let RE be the relative error (often referred to as the margin of error) of the population mean μ at a confidence level (1 − α), where α is the significance level. For a two-tailed distribution, it is defined as:

RE_{1-\alpha} = z_{\alpha/2} \frac{SE(\bar{x})}{\mu} \qquad (4.44)
where SE(x̄) is the standard error of the sample mean given by Eq. (4.3). A measure of variability in the population needs to be introduced, and this is done through the coefficient of variation (CV), defined as:^13

CV = \frac{s_x}{\mu} = \frac{\text{std. dev.}}{\text{true mean}} \qquad (4.45)
where s_x is the sample standard deviation (if the population standard deviation is known, it is better to use that value). From a practical point of view, one should ascertain that s_x is lower than the maximum value s_x,max at a confidence level (1 − α) given by:

s_{x,\max,1-\alpha} = z_{\alpha/2} \cdot CV_{1-\alpha} \cdot \mu \qquad (4.46)
Let N be the population size. One could deduce the required sample size n from the above equation to reach the target RE_{1−α}. Replacing (N − 1) by N in Eq. (4.3), which is the expression for the standard error of the mean for small samples without replacement, results in:

SE(\bar{x})^2 = \frac{s_x^2}{n} \, \frac{N - n}{N} = \frac{s_x^2}{n} - \frac{s_x^2}{N} \qquad (4.47)
The required sample size n is found by rearranging terms and using Eqs. (4.44) and (4.46):

n = \frac{1}{\dfrac{SE(\bar{x})^2}{s_x^2} + \dfrac{1}{N}} = \frac{1}{\left( \dfrac{RE_{1-\alpha}}{z_{\alpha/2} \, CV_{1-\alpha}} \right)^2 + \dfrac{1}{N}} \qquad (4.48)
This is the functional form normally used in survey sampling to determine sample size, provided some prior estimate of the

^13 Note that this definition is slightly different from that of the CV defined by Eq. (3.9a) since the population mean rather than the sample mean is used.
population mean and standard deviation are known. In summary, sample sizes relative to the population are determined from three considerations: the margin of error, the confidence level, and the expected variability.

Example 4.7.1 Determination of the random sample size needed to verify peak reduction in residences at preset confidence levels
An electric utility has provided financial incentives to many customers to replace their existing air-conditioners with high-efficiency ones. This rebate program was initiated to reduce the aggregated electric peak during hot summer afternoons, which is dangerously close to the peak generation capacity of the utility. The utility analyst would like to determine the sample size necessary to assess whether the program has reduced the peak as projected, such that the relative error RE ≤ 10% at the 95% CL. The following information is given:

The total number of customers: N = 20,000
Estimate of the mean peak saving: μ = 2 kW (from engineering calculations)
Estimate of the standard deviation: s_x = 1 kW (from engineering calculations)

This is a one-tailed distribution problem at the 95% CL. Then, from Table A.4, z_{0.05} = 1.645 for large sample sizes. Inserting the values RE = 0.1 and CV = s_x/μ = 1/2 = 0.5 into Eq. (4.48), the required sample size is:

n = \frac{1}{\left( \dfrac{0.1}{(1.645)(0.5)} \right)^2 + \dfrac{1}{20{,}000}} = 67.4 \approx 70
It would be advisable to perform some sensitivity runs given that many of the assumed quantities are based on engineering calculations. It is simple to use the above approach to generate ﬁgures such as Fig. 4.18 for assessing tradeoff between reducing the margin of error versus increasing the cost of veriﬁcation (instrumentation, installation, and monitoring) as sample size is increased. Note that accepting additional error reduces sample size in a hyperbolic manner. For example, lowering the requirement that RE ≤10% to ≤15% decreases n from 70 to about 30; while decreasing RE requirement to ≤5% would require a sample size of about 270 (outside the range of the ﬁgure). On the other hand, there is not much one could do about varying CV since this represents an inherent variability in the population if random sampling is adopted. However, nonrandom stratiﬁed sampling, described next, could be one approach to reduce sample sizes. ■
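Equation (4.48) and the sensitivity analysis behind Fig. 4.18 are a one-liner to reproduce (a sketch using the Example 4.7.1 inputs):

```python
# Sample-size formula of Eq. (4.48) applied to Example 4.7.1.
N = 20_000      # population size (number of rebate customers)
RE = 0.10       # target relative error (margin of error)
z = 1.645       # one-tailed z value at 95% CL
CV = 1.0 / 2.0  # coefficient of variation, s_x / mu = 1 kW / 2 kW

n_req = 1.0 / ((RE / (z * CV)) ** 2 + 1.0 / N)
print(round(n_req, 1))   # 67.4, rounded up to about 70 in the text
```

Sweeping RE over a range of values regenerates the hyperbolic trade-off curves of Fig. 4.18.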
Fig. 4.18 Size of random sample needed to satisfy different relative errors of the population mean for two different values of population variability (CV of 25% and 50%). Data correspond to Example 4.7.1
Making Statistical Inferences from Samples
4.7.4 Stratified Sampling for Variance Reduction
Variance reduction techniques are a special type of sample estimating procedure which relies on the principle that prior knowledge about the structure of the model and the properties of the input can be used to increase the precision of estimates for a fixed sample size, or, conversely, to decrease the sample size required to obtain a fixed degree of precision. These techniques distort the original problem so that special methods can be used to obtain the desired estimates at a lower cost. Variance can be decreased by considering a larger sample size, which involves more work. So, the effort with which a parameter is estimated can be evaluated as (Shannon, 1975):

efficiency = (variance × work)⁻¹        (4.49)
This implies that a reduction in variance is not worthwhile if the work needed to achieve it is excessive. A common recourse among social scientists to increase efﬁciency per unit cost in statistical surveys is to use stratiﬁed sampling, which counts as a variance reduction technique. In stratiﬁed sampling, the distribution function to be sampled is broken up into several pieces, each piece is then sampled separately, and the results are later combined into a single estimate. The speciﬁcation of the strata to be used is based on prior knowledge about the characteristics of the population to be sampled. Often an order of magnitude variance reduction is achieved by stratiﬁed sampling as compared to the standard random sampling approach.
Example 4.7.2 (from Shannon, 1975) Example of stratified sampling for variance reduction
Suppose a home improvement center wishes to estimate the mean annual expenditure of its local residents in the hardware section and the drapery section. It is known that the expenditures by women differ more widely than those by men. Men visit the store more frequently and spend annually approximately $50; expenditures of as much as $100 or as little as $25 per year are found occasionally. Annual expenditures by women can vary from nothing to over $500. The variance for expenditures by women is therefore much greater, and the mean expenditure more difficult to estimate. Assume that 80% of the customers are men and that a sample size of n = 15 is to be taken. If simple random sampling were employed, one would expect the sample to consist of approximately 12 men (original male fraction of the population f1 = 12/15 = 0.8) and 3 women (original female fraction f2 = 0.2). However, assume that a sample that included n1 = 5 men and n2 = 10 women was selected instead (more women have been preferentially selected because their expenditures are more variable). Suppose the annual expenditures of the members of the sample turned out to be:

Men: 45, 50, 55, 40, 90
Women: 80, 50, 120, 80, 200, 180, 90, 500, 320, 75

It is intuitively clear that such data will lead to a more accurate estimate of the overall average than would the expenditures of 12 men and 3 women.
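Because the population is known to be 80% men and 20% women, the overall mean expenditure follows by weighting each stratum's sample mean by its population share; a minimal sketch with the sample data above:

```python
men = [45, 50, 55, 40, 90]
women = [80, 50, 120, 80, 200, 180, 90, 500, 320, 75]

# Population fractions (not the sample fractions of 5/15 and 10/15)
f_men, f_women = 0.80, 0.20

mean_men = sum(men) / len(men)        # mean expenditure per man
mean_women = sum(women) / len(women)  # mean expenditure per woman

overall = f_men * mean_men + f_women * mean_women
print(round(overall))  # close to the estimate of 79 quoted in the text
```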
The appropriate sample weights must be applied to the original sample data if one wishes to deduce the overall mean. Thus, if Mi and Wi are used to designate the ith sampled man and woman, respectively,

X̄ = (1/n) · [ (f1/(n1/n)) · Σ_{i=1}^{5} Mi + (f2/(n2/n)) · Σ_{i=1}^{10} Wi ]        (4.50)
  = (1/15) · [ (0.80/0.33) · 280 + (0.20/0.67) · 1695 ] ≈ 79

where 0.80 and 0.20 are the original weights in the population, and 0.33 and 0.67 the sample weights, respectively. This value is likely to be a more realistic estimate than if the sampling had been done based purely on the gender percentages of the customers. The above example is a simple case of stratified sampling where the customer base was first stratified into the two genders, and these strata were then sampled disproportionately. There are statistical formulae which suggest near-optimal sizes for stratified samples, for which the interested reader can refer to Devore and Farnum (2005) and other texts.

4.8 Resampling Methods

4.8.1 Basic Concept

Resampling methods reuse a single available sample to make statistical inferences about the population. The precision of a population-related estimate can be improved by drawing multiple samples from the population and inferring the confidence limits from these samples rather than determining them from classical analytical estimation formulae based on a single sample only. However, this is infeasible in most cases because of the associated cost and time of assembling multiple samples. The basic rationale behind resampling methods is to draw one single sample, treat this original sample as a surrogate for the population, and generate numerous subsamples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data-generating mechanism, to produce new samples from which the required estimates can be deduced numerically. It is obvious that the sample must be unbiased and reflective of the population (which it will be if the sample is drawn randomly); otherwise the precision of the method is severely compromised.

Efron and Tibshirani (1982) have argued that, given the available power of computing, one should move away from the constraints of traditional parametric theory, with its over-reliance on a small set of standard models for which theoretical solutions are available, and substitute computational power for theoretical analysis. This parallels the way numerical methods have, in large part, replaced closed-form solution techniques for differential equations in almost all fields of engineering mathematics. Versatile numerical techniques allow one to overcome such problems as the lack of knowledge of the probability distribution of the errors of the variables, and even to determine sampling distributions of such quantities as the median, the interquartile range, or the 5th and 95th percentiles, for which no traditional tests exist. The methods are conceptually simple, require low levels of mathematics, can be used to determine any estimate whatsoever of any parameter (not just the mean or variance), and even allow the empirical distribution of the parameter to be obtained. Thus, they have clear advantages when the assumptions of traditional parametric tests (such as normal distributions) are not met. Note that the estimation must be done directly and uniquely from the samples, and none of the parametric equations related to standard error, etc., discussed in Sect. 4.2 (such as Eq. 4.2 or Eq. 4.5) should be used. Since the needed additional computing power is easily provided by present-day personal computers, resampling methods have become increasingly popular and have even supplanted classical/traditional parametric tests.

4.8.2 Application to Probability Problems

How resampling methods can be used to solve probability-type problems is illustrated below (Simon 1992). Consider a simple example where one has six balls labeled 1–6. What is the probability that three balls picked with replacement will show 1, 2, 3 in that order? The traditional probability equation would yield (1/6)³. The same result can be determined by simulating the three-ball selection a large number of times. This approach, though tedious, is more intuitive since this is exactly what the traditional probability of the event is meant to represent, namely, the long-run frequency. One could repeat this three-ball selection, say, a million times, count the number of times one gets 1, 2, 3 in sequence, and from there infer the needed probability. The procedure rules, or the sequence of operations of drawing samples, must be written in computer code, after which the computer does the rest. Much more difficult problems can be simulated in this manner, and the method's advantages lie in its versatility, the low level of mathematics required, and, most importantly, its direct bearing on the intuitive interpretation of probability as the long-run frequency.
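The three-ball experiment just described is easily simulated; a minimal sketch (the seed and the 200,000-trial count are arbitrary illustrative choices):

```python
import random

random.seed(1)

def trial():
    """Draw three balls (with replacement) from balls labeled 1-6
    and check for the exact ordered sequence 1, 2, 3."""
    return [random.randint(1, 6) for _ in range(3)] == [1, 2, 3]

n_trials = 200_000
hits = sum(trial() for _ in range(n_trials))

p_simulated = hits / n_trials
p_exact = (1 / 6) ** 3  # about 0.00463

print(p_simulated, p_exact)
```

The simulated long-run frequency converges on the analytical value as the number of trials grows, which is exactly the intuitive interpretation of probability the text appeals to.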
4.8.3 Different Methods of Resampling
The creation of multiple subsamples from the original sample can be done in several ways, and this distinguishes one method from the other (an important distinction is whether the sampling is done with or without replacement). The three most common resampling methods are:
(a) The permutation method (or randomization method without replacement) is one where all possible subsets of r items (the subsample size) out of the total n items (the sample size) are generated and used to deduce the population estimate and its confidence levels or percentiles. This may require considerable effort in many cases, and so an equivalent and less intensive variant of this method is to use only a sample of all possible subsets. The size of the sample is selected based on the accuracy needed; about 1000 samples are usually adequate. The use of the permutation method when making inferences about the medians of two populations is illustrated below. The null hypothesis is that there is no difference between the two populations. First, both populations are sampled to create two independent random samples. The difference in the medians between both samples is computed. Next, two subsamples without replacement (say 10–20 cases per subgroup) are created from the two samples, and the difference in the medians between both resampled subgroups is recalculated. This is done a large number of times, say 1000 times. The resulting distribution contains the necessary information regarding the statistical confidence in the null hypothesis of the parameter being evaluated. For example, if the difference in medians between the two original samples was matched or exceeded in only 50 of the 1000 subgroups, then one concludes that the one-tailed probability of the original event was only 0.05. It is clear that such a sampling distribution can be generated for any statistic of interest, not just the median. However, the number of randomizations quickly becomes very large, and so one must select the number of randomizations with some care.
(b) The jackknife method creates subsamples without replacement.
The jackknife method, introduced by Quenouille in 1949 and later extended by Tukey in 1958, is a technique of universal applicability and great flexibility that allows confidence intervals to be determined for an estimate calculated from subgroups while reducing the bias of the estimator. There are several numerical schemes for implementing the jackknife. The original "leave-one-out" implementation simply creates n subsamples of (n − 1) data points, wherein a single different observation is omitted from each subgroup or subsample. There is no randomness in the results, since the same parameter estimates and confidence intervals will be obtained if the procedure is repeated several times. However, if n is large, this
process may be time consuming. A more recent and popular version is the "k-fold cross-validation" method, where one (i) divides the random sample of n observations into k groups of equal size, (ii) omits one group at a time and determines so-called pseudo-estimates from the remaining (k − 1) groups, and (iii) estimates the actual confidence intervals of the parameters.
(c) The bootstrap method (popularized by Efron in 1979) is similar but differs in that no groups are formed; the different sets of data sequences are generated by repeated sampling with replacement from the observational data set (Davison and Hinkley 1997). Individual estimators deduced from such samples directly permit estimates and confidence intervals to be determined. The analyst must select the number of randomizations, while the sample size is selected to be equal to that of the original sample. The method would appear to be circular; i.e., how can one acquire more insight by resampling the same sample? The simple explanation is that "the population is to the sample as the sample is to the bootstrap sample." Though the jackknife is a viable method, it has been supplanted by the bootstrap method, which has emerged as the most efficient of the resampling methods in that better estimates of parameters such as the mean, median, variance, percentiles, empirical distributions, and confidence limits are obtained. Several improvements to the naïve bootstrap have been proposed (such as the bootstrap-t method), especially for long-tailed distributions or for time series data. There is a possibility of confusion between the bootstrap method and the Monte Carlo approach (presented in Sect. 3.7.2). The tie between them is obvious: both are based on repetitive sampling and then direct examination of the results.
A key difference between the methods, however, is that bootstrapping uses the original or initial sample as the population from which to resample, whereas Monte Carlo simulation is based on setting up a sample data generation process for the inputs of the simulation or computational model.
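The permutation recipe of method (a) can be sketched as follows for the difference in medians; the two samples and the 1000-permutation count here are made up for illustration:

```python
import random

def perm_test_median(sample_a, sample_b, n_perm=1000, seed=0):
    """One-tailed permutation test on the difference in medians.

    The pooled observations are repeatedly re-split (without
    replacement) into groups of the original sizes; the p-value is
    the fraction of re-splits whose median difference is at least as
    large as the observed one.
    """
    rng = random.Random(seed)

    def median(xs):
        s = sorted(xs)
        m = len(s) // 2
        return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

    observed = median(sample_a) - median(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if diff >= observed:
            count += 1
    return count / n_perm

# Two clearly separated illustrative samples -> small one-tailed p-value
p = perm_test_median([12, 14, 15, 16, 18, 20], [5, 6, 7, 8, 9, 10])
print(p)
```

The same skeleton works for any statistic of interest: replace `median` with the estimator of choice, as the text notes.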
4.8.4 Application of Bootstrap to Statistical Inference Problems
The use of the bootstrap method is illustrated in this section for two types of problems: determining confidence intervals and correlation analysis. At its simplest, the algorithm of the bootstrap method consists of the following steps (Devore and Farnum 2005):

15 The k-fold cross-validation method is also used in regression modeling (Sect. 5.8) and in tree-based classification problems (Sect. 11.5).
1. Obtain a random sample of size n from the population.
2. Generate a random sample of size n with replacement from the original sample in step 1.
3. Calculate the statistic of interest for the sample in step 2.
4. Repeat steps 2 and 3 many times to form an approximate sampling distribution of the statistic.

It is important to note that bootstrapping requires that sampling be done with replacement, and about 1000 samples are often required. It is advisable for the analyst to perform a few evaluations with different numbers of samples in order to be more confident about the results. The following example illustrates the implementation of the bootstrap method.

Example 4.8.1 Using the bootstrap method for deducing confidence intervals
The data in Table 4.12 correspond to the breakdown voltage (in kV) of an insulating liquid, which is indicative of its dielectric strength. Determine the 95% CI.
First, use the large-sample confidence interval formula to estimate the two-tailed 95% CI of the mean. Summary quantities are: sample size n = 48, Σxi = 2626, and Σxi² = 144,950, from which x̄ = 54.7 and standard deviation s = 5.23. The 95% CI following the traditional parametric approach is then:

54.7 ± 1.96 · (5.23/√48) = 54.7 ± 1.5 = (53.2, 56.2)

The confidence intervals using the bootstrap method are now recalculated to evaluate differences. A histogram of 1000 samples of n = 48 each, drawn with replacement, is shown in Fig. 4.19. The 95% CI corresponds to the two-tailed 0.05 significance level. Thus, one selects 1000 × (0.05/2) = 25 units from each end of the distribution, i.e., the 25th and the 975th largest values, which yield (53.33, 56.27), very close to the parametric range determined earlier. This example illustrates the fact that bootstrap intervals usually agree with traditional parametric ones when all the assumptions underlying the latter are met; it is when they do not that the power of the bootstrap stands out. Further, the bootstrap resampling approach can also provide 95% CIs for other quantities not shown in the figure, such as the standard deviation (4.075, 6.238) and the median (53.0, 56.0). ■

The following example illustrates the versatility of the bootstrap method for determining the correlation between two variables, a problem which is recast as comparing two sample means.

Example 4.8.2 Using the bootstrap method with a nonparametric test to ascertain correlation of two variables
One wishes to determine whether there exists a correlation between athletic ability and intelligence level of teenage students. A sample of 10 high school athletes was obtained, involving their athletic and IQ scores. The data are listed in descending order of athletic score in the first two columns of Table 4.13. A nonparametric approach is adopted to solve this problem; the parametric version would be the test on the Pearson correlation coefficient (Sect. 4.2.7). The athletic scores and the IQ scores are rank ordered from 1 to 10, as shown in the
Fig. 4.19 Histogram of bootstrap sample means with 1000 samples (Example 4.8.1)

Table 4.12 Data table for Example 4.8.1
62 59 54 46 57 53
50 64 55 55 48 52
53 50 57 53 63 50
57 53 50 54 57 55
41 64 55 52 57 60
53 62 50 47 55 50
55 50 56 47 53 56
61 68 55 55 59 58

Table 4.13 Data table for Example 4.8.2 along with ranks (data available electronically on book website)
Athletic score  IQ score  Athletic rank  IQ rank
97              114       1              3
94              120       2              1
93              107       3              7
90              113       4              4
87              118       5              2
86              101       6              8
86              109       7              6
85              110       8              5
81              100       9              9
76              99        10             10

16 From Devore and Farnum (2005) by permission of Cengage Learning.
17 From Simon (1992) by permission of Duxbury Press.

Fig. 4.20 Histogram based on 100 trials of the sum of 5 random IQ ranks from the sample of 10. Note that in only 2% of the trials was the sum equal to 17 or lower (Example 4.8.2)
last two columns of the table. The two observations (athletic rank, IQ rank) are treated together since one would like to determine their joint behavior. The table is split into two groups of five "high" and five "low." An even split of the group is advocated since it uses the available information better and usually leads to greater "efficiency." The sum of the observed IQ ranks of the five top athletes is (3 + 1 + 7 + 4 + 2) = 17. The resampling scheme involves numerous trials in which a subset of 5 numbers is drawn randomly from the set {1, ..., 10}. One then adds these five IQ ranks for each individual trial. If the sums observed across trials are consistently higher than 17, this will indicate that the best athletes have not earned their observed IQ scores purely by chance. The probability can be directly estimated from the proportion of trials whose sum was 17 or lower. Figure 4.20 depicts the histogram of the IQ rank sum of 5 random observations using 100 trials (a rather low number of trials, meant for illustration purposes). Note that in only 2% of the trials was the sum 17 or lower. Hence, one can state, with 98% confidence, that there does exist a correlation between athletic ability and IQ score. It is instructive to compare this conclusion against one from a parametric method. The Pearson correlation coefficient (Sects. 3.4.2 and 4.2.7) between the raw athletic scores and the IQ scores is r = 0.7093 for this sample, with a p-value of 2.2%, which is almost identical to the approximate p-value of 2.0% determined by the resampling method.
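Example 4.8.2's resampling scheme takes only a few lines; with 10,000 trials instead of 100, the estimate settles near the exact probability, which enumeration shows to be 4/252 ≈ 1.6% (the seed and trial count are our choices):

```python
import random

iq_ranks_top5 = [3, 1, 7, 4, 2]    # IQ ranks of the five best athletes
observed_sum = sum(iq_ranks_top5)  # 17

rng = random.Random(7)
n_trials = 10_000

# Each trial draws 5 ranks without replacement from {1..10} and sums them
count_low = sum(
    1 for _ in range(n_trials)
    if sum(rng.sample(range(1, 11), 5)) <= observed_sum
)
p_value = count_low / n_trials
print(p_value)
```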
4.8.5 Closing Remarks
Resampling methods can be applied to diverse problems (Good 1999): (i) for determining probability in complex
situations, (ii) to estimate confidence intervals (CI) of an estimate during univariate sampling of a population, (iii) for hypothesis testing to compare estimates of two samples, (iv) to estimate confidence bounds during regression, and (v) for classification. These problems can all be addressed by classical methods provided one makes certain assumptions regarding the probability distributions of the random variables. The analytic solutions can be daunting to those who use these statistical methods rarely, and one can even select the wrong formula by error. Resampling is much more intuitive and provides a way of simulating the physical process without having to deal with the, sometimes obfuscating, statistical constraints of the analytic methods. A big virtue of resampling methods is that they extend classical statistical evaluation to cases which cannot be dealt with mathematically. The downside to the use of these methods is that they require larger computing resources (two or three orders of magnitude). This issue is no longer a constraint because of the computing power of modern-day personal computers. Resampling methods are also referred to as computer-intensive methods, although other techniques discussed in Sect. 10.6 are more often associated with this general appellation. It has been suggested that one should use a parametric test when the samples are large, say when the number of observations is greater than 40, or when they are small (

350 at the 0.05 significance level.

Pr. 4.15 Comparison of human comfort correlations between Caucasian and Chinese subjects
Human indoor comfort can be characterized by the occupants' feeling of well-being in the indoor environment. It depends on several interrelated and complex phenomena involving subjective as well as objective criteria. Research initiated over 50 years ago and subsequent chamber studies have helped define acceptable thermal comfort ranges for indoor occupants.
Perhaps the most widely used standard is ASHRAE Standard 55-2004 (ASHRAE 2004), which is described in several textbooks (e.g., Reddy et al., 2016). The basis of the standard is the thermal sensation scale determined by the votes of the occupants following the scale in Table 4.23. The individual votes of all the occupants are then averaged to yield the predicted mean vote (PMV). This is one of the two indices relevant to defining acceptability of a large population of people exposed to a certain indoor environment. PMV = 0 is defined as the neutral state (neither cool nor warm), while positive values indicate that occupants feel warm, and vice versa. The mean scores from the chamber studies are then regressed against the influential environmental parameters to yield an empirical correlation which can be used as a means of prediction:

PMV = a*·Tdb + b*·Pv + c*        (4.51)

where Tdb is the indoor dry-bulb temperature (°C), Pv is the partial pressure of water vapor (kPa), and the numerical values of the coefficients a*, b*, and c* depend on such factors as sex, age, hours of exposure, clothing levels, type of activity, etc. The values relevant to healthy adults in an office setting for a 3-h exposure period are given in Table 4.24.
Fig. 4.21 Predicted percentage of dissatisﬁed (PPD) people as function of predicted mean vote (PMV) following Eq. (4.52)
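The PPD–PMV curve of Fig. 4.21 follows the standard curve-fit relation (given as Eq. (4.52) in the text); a minimal sketch of evaluating it:

```python
import math

def ppd(pmv):
    """Predicted percentage dissatisfied (%) as a function of PMV
    (the standard PMV-PPD curve fit, Eq. 4.52)."""
    return 100 - 95 * math.exp(-(0.03353 * pmv**4 + 0.2179 * pmv**2))

# Even at the neutral vote (PMV = 0), 5% remain dissatisfied
print(ppd(0.0))
print(ppd(1.0), ppd(-1.0))  # the curve is symmetric in PMV
```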
In general, the distribution of votes will always show considerable scatter. The second index is the percentage of people dissatisfied (PPD), defined as the percentage of people voting outside the range of −1 to +1 for a given value of PMV. When the PPD is plotted against the mean vote of a large group characterized by the PMV, one typically finds a distribution such as that shown in Fig. 4.21. This graph shows that even under optimal conditions (i.e., a mean vote of zero), at least 5% are dissatisfied with the thermal comfort. Hence, because of individual differences, it is impossible to specify a thermal environment that will satisfy everyone. A curve-fit expression between PPD and PMV has also been suggested:

PPD = 100 − 95 · exp[−(0.03353·PMV⁴ + 0.2179·PMV²)]        (4.52)

Note that the overall approach is consistent with the statistical approach of approximating distributions by the two primary measures, the mean and the standard deviation. However, in this instance, the standard deviation (characterized by the PPD) has been empirically found to be related to the mean value, namely the PMV (Eq. 4.52). A research study was conducted in China by Jiang (2001) in order to evaluate whether the above types of correlations, developed using American and European subjects, are applicable to Chinese subjects as well. The environmental chamber test protocol was generally consistent with previous Western studies. The total number of Chinese subjects in
the pool was about 200, and several tests were done with smaller batches (about 10–12 subjects per batch, evenly split between males and females). Each batch of subjects first spent some time in a preconditioning chamber, after which they were moved to the main chamber. The environmental conditions (dry-bulb temperature Tdb, relative humidity RH, and air velocity) of the main chamber were controlled such that: Tdb (±0.3 °C), RH (±5%), and air velocity. The chamber test conditions are better visualized if plotted on the psychrometric chart shown in Fig. 4.22. Based on these data, one would like to determine whether the psychological responses of Chinese people are different from those of American/European people.

(a) Formulate the various types of statistical tests one would perform, stating the intent of each test.
(b) Perform some or all of these tests and draw relevant conclusions.
(c) Prepare a short report describing your entire analysis.

Hint: One of the data points is suspect. Also use Eqs. (4.51) and (4.52) to generate the values pertinent to Western subjects prior to making comparative evaluations.

Fig. 4.22 Chamber test conditions plotted on a psychrometric chart for Chinese subjects (Problem 4.15)

References

CV, this would indicate that the model deviates more at the lower range, and vice versa. There is a third way in which the RMSE can be normalized, though it is not used as much in the statistical literature: simply divide the RMSE by the range of variation of the response variable y:

CV″ = RMSE / (ymax − ymin)        (5.10c)

This measure has a nice intuitive appeal, and its range is bounded between [0, 1].

6 Parsimony in the context of regression model building is a term meant to denote the most succinct model, i.e., one without any statistically superfluous regressors.
5.3 Simple OLS Regression
(e) Mean Bias Error
The mean bias error (MBE) is defined as the mean difference between the actual data values and the model-predicted values:

MBE = [ Σ_{i=1}^{n} (yi − ŷi) ] / (n − p)        (5.11a)

Note that when a model is identified by OLS using the original set of data, the MBE should be zero (to within round-off errors of the computer). Only when, say, the model identified from a first set of observations is used to predict the value of the response variable under a second set of conditions will the MBE be different from zero (see Sect. 5.8 for further discussion). Under the latter circumstances, the MBE is also called the mean simulation or prediction error. A normalized MBE (or NMBE) is often used and is defined as the MBE given by Eq. 5.11a divided by the mean value of the response variable ȳ:

NMBE = MBE / ȳ        (5.11b)

(f) Mean Absolute Deviation
The mean absolute deviation (MAD) is defined as the mean absolute difference between the actual data values and the model-predicted values:

MAD = [ Σ_{i=1}^{n} |yi − ŷi| ] / (n − p)        (5.12)

This metric is also called the mean absolute error (MAE) and is a measure of the systematic bias in the model.

Example 5.3.2 Using the data from Example 5.3.1, repeat the exercise using a spreadsheet program. Calculate the R², RMSE, and CV values.
From Eq. 5.2, SSE = 323.3 and SSR = 3390.5. From this, SST = SSE + SSR = 3713.8. Then from Eq. 5.7a, R² = 91.3%, while from Eq. 5.8, RMSE = 3.2295, from which CV = 0.095 = 9.5%. ■

(g) Skill Factor
Another measure is sometimes used to compare the relative improvement of one model over another when applied to the same data set. The relative score or skill factor (SF) quantifies the improvement in predictive accuracy that, say, model B would bring compared to another model, say model A. It is common to use the RMSE of both models as the basis:

SF(%) = [(RMSE_model A − RMSE_model B) / RMSE_model A] × 100,    −∞ < SF ≤ 100        (5.13)

Hence, SF < 0% would indicate that model B is poorer than model A. Conversely, SF > 0 would mean that model B is an improvement. If, say, SF = 35%, this could be interpreted as the predictive accuracy of model B being 35% better than that of model A. The limit SF = 100% is indicative of the upper limit of perfect predictions, that is, RMSE (model B) = 0. The SF metric is a conceptually appealing measure that allows several potential models to be directly evaluated and ranked compared to a baseline or reference model.

The adjusted R-square, the RMSE (or CV), the MBE (or NMBE), and the MAD are perhaps the most widely used metrics to evaluate competing regression model fits to data. Under certain circumstances, one model may be preferable to another in terms of one index but not the other. The analyst is then perplexed as to which index to pick as the primary one. In such cases, the specific intent of how the model is going to be subsequently applied should be considered, which may suggest the model selection criterion.

5.3.3 Inferences on Regression Coefficients and Model Significance

Once an overall regression model is identified, is the model statistically significant? If it is not, the entire identification process loses its value. The F-statistic, which tests for the significance of the overall regression model (not that of a particular regressor), is defined as:

F = (variance explained by the regression) / (variance not explained by the regression)
  = MSR / MSE = [SSR / (p − 1)] / [SSE / (n − p)]        (5.14a)

Note that the degrees of freedom for SSR = (p − 1), while that for SSE = (n − p). Thus, the smaller the value of F, the poorer the regression model. It will be noted that the F-statistic is directly related to R² as follows:

F = [(n − p) / (p − 1)] · [R² / (1 − R²)]        (5.14b)

Hence, the F-statistic can alternatively be viewed as a measure to test the significance of R² itself. During univariate regression, the F-test is really the same as a Student t-test for the significance of the slope coefficient. In the general case, the F-test allows one to test the joint hypothesis of whether all coefficients of the regressor variables are equal to zero or not.

Example 5.3.3 Calculate the F-statistic for the model identified in Example 5.3.1. What can you conclude about the significance of the fitted model?
From Eq. 5.14a,

F = (3390.5 / 323.3) × [(33 − 2) / (2 − 1)] = 325
which clearly indicates that the overall regression fit is significant. The reader can verify that Eq. 5.14b also yields an identical value of F. ■

Note that the values of the coefficients a and b based on the given sample of n observations are only estimates of the true model parameters α and β. If the experiment is repeated, the estimates of a and b are likely to vary from one set of experimental observations to another. OLS estimation assumes that the model residual ε is a random variable with zero mean. Further, the residuals εi at specific values of x are taken to be randomly distributed, which is akin to saying that the distributions shown in Fig. 5.3 at specific values of x are normal and have equal variance. After getting an overall picture of the regression model, it is useful to study the significance of each individual regressor on the overall statistical fit in the presence of all other regressors. The Student t-statistic is widely used for this purpose and is applied to each regression parameter. For the slope parameter b in Eq. 5.1, the Student t-value is

t = b / sb        (5.15a)

The estimated standard deviation (also referred to as the "standard error of the sampling distribution") of the slope parameter b is given by

sb = RMSE / √Sxx        (5.15b)

with Sxx being the sum of squares given by Eq. 5.6b and RMSE by Eq. 5.8.
with Sxx being the sum of squares given by Eq. 5.6b and RMSE by Eq. 5.8. For the intercept parameter a in Eq. 5.1, Student tvalue a0 t= sa
ð5:16aÞ
where the estimated standard deviation of the intercept parameter a is
Linear Regression Analysis Using Least Squares n
sa = RMSE
1=2
xi 2
i
ð5:16bÞ
n:Sxx
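These standard errors, together with the overall goodness-of-fit statistics, can be reproduced numerically from the summary quantities of the running example (n = 33, p = 2, SSE = 323.3, SSR = 3390.5, Sxx = 4152.18, x̄ = 33.4545, and the model prediction ŷ = 21.9025 at x0 = 20); a minimal sketch:

```python
import math

# Summary quantities from Example 5.3.1 and its follow-ups
n, p = 33, 2
SSE, SSR = 323.3, 3390.5
Sxx, x_bar = 4152.18, 33.4545

R2 = SSR / (SSE + SSR)                 # Eq. 5.7a
RMSE = math.sqrt(SSE / (n - p))        # Eq. 5.8
F = (SSR / SSE) * (n - p) / (p - 1)    # Eq. 5.14a

s_b = RMSE / math.sqrt(Sxx)            # Eq. 5.15b, std. error of the slope

# Standard errors for the mean response and for an individual
# prediction at x0 = 20, where the fitted model gives y_hat = 21.9025
x0, y_hat, t = 20.0, 21.9025, 2.04     # t_{0.025} at 31 d.o.f.
core = 1 / n + (x0 - x_bar) ** 2 / Sxx
se_mean = RMSE * math.sqrt(core)
se_pred = RMSE * math.sqrt(1 + core)

ci = (y_hat - t * se_mean, y_hat + t * se_mean)  # CI for the mean response
pi = (y_hat - t * se_pred, y_hat + t * se_pred)  # PI for a new observation
print(R2, RMSE, F, s_b, ci, pi)
```

The printed values match the figures quoted in the surrounding examples (R² ≈ 0.913, RMSE ≈ 3.23, F ≈ 325, and the interval bounds discussed below).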
Basically, the t-test as applied to regression model building is a formal statistical test to determine how significantly different an individual coefficient is from zero in the presence of the remaining coefficients. Stated simply, it enables an answer to the following question: would the fit become poorer if the regressor variable in question were not used in the model at all?

Recall that the confidence intervals CI refer to the limits for the mean response at specified values of the predictor variables for a specified confidence level (CL). Let β and α denote the hypothesized true values of the slope and intercept coefficients. The CI for the model parameters are determined as follows. For the slope term:

b − t_{α/2}·RMSE/√Sxx < β < b + t_{α/2}·RMSE/√Sxx

For the mean response at x = 20, using the linear model identified in Example 5.3.1, the 95% CL limits are:

21.9025 − (2.04)(0.87793) < ⟨y20⟩ < 21.9025 + (2.04)(0.87793)

or

20.112 < ⟨y20⟩ < 23.693 at 95% CL. ■

Example 5.3.9 Calculate the 95% PI for predicting the individual response for x = 20 using the linear model identified in Example 5.3.1.

Using Eq. 5.21,

[Var(y0)]^{1/2} = (3.2295) · [1 + 1/33 + (20 − 33.4545)²/4152.18]^{1/2} = 3.3467

Further, t_{0.05/2} = 2.04. Using an expression analogous to Eq. 5.20 yields the PI for the individual response:

21.9025 − (2.04)(3.3467) < y20 < 21.9025 + (2.04)(3.3467)

or

15.075 < y20 < 28.730 at 95% CL. ■

5.4 Multiple OLS Regression

Regression models can be classified as:

(i) Univariate or multivariate, depending on whether only one or several regressor variables are being considered.
(ii) Single-equation or multi-equation, depending on whether only one or several interconnected response variables are being considered.
(iii) Linear or nonlinear, depending on whether the model is linear or nonlinear in its parameters (and not its functional form). Thus, a regression equation such as y = a + bx + cx² is said to be linear in its parameters {a, b, c} though it is nonlinear in its functional structure (see Sect. 1.4.4 for a discussion on the classification of mathematical models).

Certain simple univariate equation models are shown in Fig. 5.7. Frame (a) depicts simple linear models (one with a positive slope and another with a negative slope), while frames (b) and (c) are higher-order polynomial models which, though nonlinear in the function, are models linear in their parameters. The other figures depict nonlinear models. Analysts often adopt a linear approximation (especially over a limited range) even if the relationship in the data is not strictly linear. If a function such as that shown in frame (d) is globally nonlinear, and if the domain of the experiment is limited, say, to the right knee of the curve (bounded by points c and d), then a linear function in this region could be postulated. Models tend to be preferentially framed as linear ones largely due to the simplicity in subsequent model building and the prevalence of solution methods based on matrix algebra.

5.4.1 Higher Order Linear Models
When more than one regressor variable is known to influence the response variable, a multivariate model will explain more of the variation and provide better predictions than a univariate model. The parameters of such a model can be identified using multiple regression techniques. This section will discuss certain important issues regarding multivariate, single-equation models linear in the parameters using the OLS approach. For now, the treatment is limited to regressors which are uncorrelated or independent.

Fig. 5.7 General shape of regression curves (From Shannon 1975 by permission of Pearson Education)

Consider a data set of n readings that include k regressor variables. The number of model parameters p will then be (k + 1) because one of the parameters is the intercept or constant term. The corresponding form, called the additive multivariate linear regression (MLR) model, is:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
(5.22a)
where ε is the error or unexplained variation in y. Due to the lack of any interaction terms, the model is referred to as "additive." The simple interpretation of the numerical value of the model parameters is that βi represents the unit influence of xi on y (i.e., the slope dy/dxi). Note that this is strictly valid only when the variables are independent or uncorrelated, which, often, is not true.
The same model formulation is equally valid for a kth-degree polynomial regression model, which is a special case of Eq. 5.22a with x1 = x, x2 = x², etc.:

y = β0 + β1x + β2x² + ⋯ + βkx^k + ε
(5.23)
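Since Eq. 5.23 is linear in its parameters, a polynomial fit reduces to multiple linear regression on a design matrix whose columns are the powers of x. A brief sketch (helper names are illustrative, not from the book):

```python
import numpy as np

def poly_design_matrix(x, k):
    """Columns [1, x, x^2, ..., x^k]: Eq. 5.23 recast in the MLR form of
    Eq. 5.22a with x1 = x, x2 = x^2, and so on."""
    x = np.asarray(x, float)
    return np.column_stack([x ** j for j in range(k + 1)])

def poly_ols(x, y, k):
    """Least squares estimates of the k+1 polynomial parameters."""
    X = poly_design_matrix(x, k)
    beta, *_ = np.linalg.lstsq(X, np.asarray(y, float), rcond=None)
    return beta
```

Fitting noise-free quadratic data recovers its coefficients to machine precision, which is a useful sanity check on the design matrix.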
Polynomial models are commonly used to represent empirical behavior and can capture a variety of shapes. Usually, they are limited to second-order (quadratic) or third-order (cubic) functional forms. Let xij denote the ith observation of regressor j. Then Eq. 5.22a can be rewritten as

yi = β0 + β1xi1 + β2xi2 + ⋯ + βkxik + ε
(5.22b)
Often, it is most convenient to transform the regressor variables and express them as a difference from the mean
(this approach is used in Sect. 5.4.4 and also in Sect. 6.4 while dealing with experimental design methods). This transformation is also useful to reduce the ill-conditioning effects of multicollinearity, which introduces errors and large uncertainties in the model parameter estimates (discussed in Sect. 9.3). Specifically, Eq. 5.22a can be transformed into:

y = β0′ + β1(x1 − x̄1) + β2(x2 − x̄2) + ⋯ + βk(xk − x̄k) + ε   (5.24)

An important special case is the second-order or quadratic regression model (when p = 3) shown in Fig. 5.7b. The straight line is now replaced by parabolic curves depending on the sign of the quadratic parameter (either positive or negative).

Multivariate model development utilizes some of the same techniques as discussed in the univariate case. The first step is to identify all variables that can influence the response as predictor variables. It is the analyst's responsibility to identify these potential predictor variables based on his or her knowledge of the physical system. It is then possible to plot the response against all possible predictor variables to identify any obvious trends. The greatest single disadvantage to this approach is the sheer labor involved when the number of possible regressor variables is high.

A situation that arises in multivariate regression is the concept of variable synergy, commonly called interaction between variables (this is a consideration in other problems, for example, when dealing with the design of experiments). This occurs when two or more variables interact and impact system response to a degree greater than when the variables operate independently. In such a case, the first-order linear model with two interacting regressor variables takes the form:

y = β0 + β1x1 + β2x2 + β3x1x2 + ε
(5.25)
The term (β3x1x2) is called the interaction term. How the interaction parameter affects the shape of the family of curves is illustrated in Fig. 5.8.

Fig. 5.8 Plots illustrating the effect of interaction among two regressor variables due to the presence of cross-product terms. (a) Non-interacting. (b) Interacting (From Neter et al. 1983)

The origin of this model function is easy to derive. In the non-interacting case, the lines for different values of regressor x1 are essentially parallel, and so the slope terms for both models are equal. Let the model with the first regressor be y = a′ + bx1, while the intercept is given by a′ = f(x2) = a + cx2. Combining both equations results in y = a + bx1 + cx2. This corresponds to Fig. 5.8a. For the interaction case, both the intercept and the slope of y versus x2 are functions of x1. Hence, representing a′ = a + bx1 and b′ = c + dx1, then:

y = a + bx1 + (c + dx1)x2 = a + bx1 + cx2 + dx1x2

which is identical in structure to Eq. 5.25. Simple linear functions have been assumed above. It is straightforward to derive expressions for higher-order models by analogy. For example, the second-order (or quadratic) model without interacting variables is:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
(5.26)
For a second-order model with interacting terms, the corresponding expression can be easily derived. Consider the linear polynomial model with one regressor (with the error term dropped):

y = b0 + b1x1 + b2x1²
(5.27)
If the parameters {b0, b1, b2} can themselves be expressed as second-order polynomials of another regressor x2, the full model will have nine regression parameters:

y = b00 + b10x1 + b01x2 + b11x1x2 + b20x1² + b02x2² + b21x1²x2 + b12x1x2² + b22x1²x2²   (5.28)
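The nine regressors of Eq. 5.28 are simply all products x1^i · x2^j with i, j = 0, 1, 2; building them explicitly makes the parameter count obvious (helper name is illustrative):

```python
import numpy as np

def biquadratic_terms(x1, x2):
    """The nine regressor columns of Eq. 5.28, ordered
    [1, x1, x2, x1*x2, x1^2, x2^2, x1^2*x2, x1*x2^2, x1^2*x2^2]."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return np.column_stack([
        np.ones_like(x1), x1, x2, x1 * x2,
        x1 ** 2, x2 ** 2, x1 ** 2 * x2, x1 * x2 ** 2, x1 ** 2 * x2 ** 2,
    ])
```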
Fig. 5.9 Mortality ratio of men as a function of age and percent of normal weight (Ezekiel and Fox 1959)

Fig. 5.10 Response contour diagrams (Neter et al. 1983). (a) Non-interacting independent variables: y = 20 + 0.95x1 − 0.50x2. (b) Interacting independent variables: y = 5x1 + 7x2 + 3x1x2

The functional dependence of a response variable (mortality ratio) on two independent variables (age and percent of normal weight) is perhaps better illustrated using 3D plots as shown in Fig. 5.9. Clearly, any model used to fit the shape of this curved surface would require higher-order functional models with interacting terms. For example, men who are either underweight or overweight at age 22 seem to have higher mortality rates than normal (y-axis = 100 is normal), but this is not so for the 52-year age group, where overweight is the only high-risk factor. This example also illustrates the fact that such polynomial models can fit simple ridges, peaks, valleys, and saddles. It is important to emphasize that the analyst should strive to identify the simplest model possible, with the model order as low as possible, in the case of multivariate regression.

3D plots, such as in Fig. 5.9, are sometimes hard to read, and response contour plots for two-independent-variable situations are often more telling. Instead of plotting the
dependent or response variable on the z-axis, the two independent variables are shown on the x- and y-axes, and discrete values of the response variable are shown as contour lines. Fig. 5.10 illustrates this type of graphical presentation for two cases: without and with interaction between the two independent variables. The synergistic behavior of independent variables can result in two or more variables working together to "overpower or usurp" another variable's prediction capability. As a result, it is necessary to always check the importance of each individual predictor variable while performing multivariate regression. Those variables with low absolute values of the t-statistic should be omitted from the model and the remaining predictors used to reestimate the model parameters. The stepwise regression method described in Sect. 5.4.6 is based on this concept.
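The contrast between the two panels of Fig. 5.10 can be checked numerically: with a cross-product term, the slope ∂y/∂x1 depends on the value of x2, whereas for the additive model it does not. A minimal sketch using the two models quoted in the Fig. 5.10 caption:

```python
def y_additive(x1, x2):
    """Non-interacting model of Fig. 5.10a."""
    return 20 + 0.95 * x1 - 0.50 * x2

def y_interacting(x1, x2):
    """Interacting model of Fig. 5.10b: slope dy/dx1 = 5 + 3*x2."""
    return 5 * x1 + 7 * x2 + 3 * x1 * x2

def slope_wrt_x1(f, x2, x1=1.0, h=1e-6):
    """Central-difference estimate of the partial derivative dy/dx1."""
    return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
```

For the additive model the slope stays at 0.95 for any x2; for the interacting model it grows from 5 at x2 = 0 to 35 at x2 = 10, which is why the contour lines in Fig. 5.10b fan out.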
5.4.2 Matrix Formulation

When dealing with multiple regression, it is advantageous to resort to matrix algebra because of the compactness and ease of manipulation it offers without loss in clarity. Though the solution is conveniently provided by a computer, a basic understanding of the matrix formulation is nonetheless useful. In matrix notation (with Y^T denoting the transpose of Y), the linear model given by Eq. 5.22b can be expressed as follows for the n data points (with the matrix dimensions shown in subscripted brackets for better understanding):

Y(n,1) = X(n,p) β(p,1) + ε(n,1)   (5.29)

where p is the number of parameters in the model (= k + 1 for a linear model), k is the order of the model, and n is the number of data points, each consisting of one response and p regressor observations. The individual terms can be expressed as:

Y^T = [y1 y2 … yn],  β^T = [β0 β1 … βk],  ε^T = [ε1 ε2 … εn]   (5.30)

and

X =
| 1  x11  …  x1k |
| 1  x21  …  x2k |
| …  …    …  …   |
| 1  xn1  …  xnk |   (5.31)

The first column of 1s is meant for the constant term; it is strictly not needed but is convenient for matrix manipulation. The interpretation of the matrix elements is simple. For example, x21 refers to the second observation of the first regressor x1, and so on. The descriptive measures applicable for a single variable can be extended to multiple variable models of order k and written in compact matrix notation.

5.4.3 Point and Interval Estimation

The approach involving the minimization of SSE for the univariate case (Sect. 5.3.1) can be generalized to multivariate linear regression. Here, the parameter set β is to be identified such that the sum of squares function L is minimized:

L = Σ(i=1..n) εi² = ε^T ε = (Y − Xβ)^T (Y − Xβ)   (5.32)

or

∂L/∂β = −2X^T Y + 2X^T Xβ = 0   (5.33)

which leads to the system of normal equations

X^T X b = X^T Y   (5.34)

with

X^T X =
| n     Σxi1       …  Σxik     |
| Σxi1  Σxi1²      …  Σxi1·xik |
| …     …          …  …        |
| Σxik  Σxik·xi1   …  Σxik²    |   (5.35)

where all sums run over i = 1 to n. The above matrix is a symmetric matrix with the main diagonal elements being the sums of squares of the elements in the columns of X and the off-diagonal elements being the sums of the cross-products. From here, the regression model coefficient vector b is the least squares estimator of β given by:

b = (X^T X)^{-1} X^T Y = C X^T Y   (5.36)

provided the matrix (X^T X) is not singular. Note that the matrix C = (X^T X)^{-1}, called the variance-covariance matrix of the estimated regression coefficients, is also a symmetric matrix, with the main diagonal elements being the variances of the model coefficient estimators and the off-diagonal elements being the covariances. Under OLS regression, the variance of the model parameters is given by:

Var(b) = σ²(X^T X)^{-1} = σ²C   (5.37)

where σ² is the mean square error of the model error terms:

σ² = (sum of square errors)/(n − p)   (5.38)

An unbiased estimator of σ² is the sample s² or residual mean square

s² = ε^T ε/(n − p) = SSE/(n − p)   (5.39)

For predictions within the range of variation of the original data, the mean and individual response values are normally distributed with the variance given by the following:

(a) For the mean response at a specific set of x0 values, or the confidence interval CI, under OLS
var(y0) = s² X0^T (X^T X)^{-1} X0   (5.40)

(b) The variance of an individual prediction, or the prediction level, is

var(y0) = s² [1 + X0^T (X^T X)^{-1} X0]   (5.41)

where 1 is a column vector of unity. Two-tailed confidence intervals CI at a significance level α are:

y0 ± t(n − k, α/2) · var^{1/2}(y0)   (5.42)

Example 5.4.1 Part-load performance of fans (and pumps)
Part-load performance curves do not follow the idealized fan laws due to various irreversible losses. For example, decreasing the flow rate by half of the rated flow does not result in a (1/8)th decrease in its rated power consumption as predicted by the fan laws. Hence, actual tests are performed for such equipment under different levels of loading. The performance tests of the flow rate and the power consumed are then normalized by the rated or 100% load conditions, called part load ratio (PLR) and fractional full-load power (FFLP), respectively. Polynomial models can then be fit between these two quantities with PLR as the regressor and FFLP as the response variable. Data assembled in Table 5.2 were obtained from laboratory tests on a variable speed drive (VSD) control, which is a very energy-efficient device and increasingly installed.

(a) What is the matrix X in this case if a second-order polynomial model is to be identified of the form y = β0 + β1x1 + β2x1²?
(b) Using the data given in the table, identify the model and report relevant statistics on both parameters and overall model fit.
(c) Compute the confidence interval and the prediction interval at 0.05 significance level for the response at values of PLR = 0.2 and 1.00 (i.e., the extreme points).

Table 5.2 Data table for Example 5.4.1

PLR    0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
FFLP   0.05  0.11  0.19  0.28  0.39  0.51  0.68  0.84  1.00

Solution
(a) The independent variable matrix X given by Eq. 5.26b is:

X =
| 1  0.2  0.04 |
| 1  0.3  0.09 |
| 1  0.4  0.16 |
| 1  0.5  0.25 |
| 1  0.6  0.36 |
| 1  0.7  0.49 |
| 1  0.8  0.64 |
| 1  0.9  0.81 |
| 1  1.0  1.00 |
(b) The results of the regression are shown below:

Parameter   Estimate     Standard error   t-statistic   p-value
Constant    -0.0204762   0.0173104        -1.18288      0.2816
PLR          0.179221    0.0643413         2.78547      0.0318
PLR^2        0.850649    0.0526868        16.1454       0.0000

Analysis of Variance

Source          Sum of squares   Df   Mean square    F-ratio   p-value
Model           0.886287          2   0.443144       5183.10   0.0000
Residual        0.000512987       6   0.0000854978
Total (Corr.)   0.8868            8
Goodness-of-fit: R² = 99.9%, Adj-R² = 99.9%, RMSE = 0.009246, and mean absolute deviation (MAD) = 0.00584. The equation of the fitted model is (with appropriate rounding)

FFLP = −0.0205 + 0.1792 PLR + 0.8506 PLR²   (5.43)

Since the model p-value in the ANOVA table is less than 0.05, there is a statistically significant relationship between FFLP and PLR at 95% CL. However, the p-value of the constant term is large (>0.05), and a model without an intercept term is more appropriate, as physical considerations suggest, since the power consumed by the pump is zero if there is no flow. The values shown are those provided by the software package. Note that the standard errors and the Student t-values for the model coefficients shown in the table cannot be computed from Eqs. 5.14 to 5.15, which apply to the simple linear model. The equations for polynomial regression are rather complicated to solve by hand, and the interested reader can refer to texts such as Neter et al. (1983) for more details.

The 95% CI and PI are shown in Fig. 5.11. Because the fit is excellent, these are very narrow and close to each other. The predicted values as well as the 95% CI and PI for the two data points are given in the table below. Note that the uncertainty range is relatively larger at the lower value than at the higher range.
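The tabulated estimates can be cross-checked by solving the normal equations (Eq. 5.34) directly on the Table 5.2 data; a minimal sketch, not taken from the book:

```python
import numpy as np

plr = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
fflp = np.array([0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00])

X = np.column_stack([np.ones_like(plr), plr, plr ** 2])  # matrix of part (a)
b = np.linalg.solve(X.T @ X, X.T @ fflp)                 # normal equations, Eq. 5.34
resid = fflp - X @ b
rmse = np.sqrt(resid @ resid / (len(plr) - 3))           # n - p with p = 3
```

This should reproduce the fitted model of Eq. 5.43, b ≈ (−0.0205, 0.1792, 0.8506), with RMSE ≈ 0.0092.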
Fig. 5.11 Plot of ﬁtted model along with 95% CI and 95% PI
x     Predicted y   95% prediction limits     95% confidence limits
                    Lower       Upper         Lower       Upper
0.2   0.0493939     0.0202378   0.0785501     0.0310045   0.0677834
1.0   1.00939       0.980238    1.03855       0.991005    1.02778
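The interval columns above follow from Eqs. 5.40 to 5.42 applied to the fitted quadratic model; a sketch assuming the critical value t(6, 0.025) ≈ 2.447 (the fit has n − p = 9 − 3 = 6 degrees of freedom):

```python
import numpy as np

plr = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
fflp = np.array([0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00])
X = np.column_stack([np.ones_like(plr), plr, plr ** 2])
C = np.linalg.inv(X.T @ X)            # variance-covariance factor, Eq. 5.36
b = C @ X.T @ fflp
resid = fflp - X @ b
s2 = resid @ resid / (len(plr) - 3)   # residual mean square, Eq. 5.39
t_crit = 2.447                        # t(df = 6, alpha/2 = 0.025)

def intervals(x0):
    """Predicted response with CI and PI half-widths at regressor value x0."""
    X0 = np.array([1.0, x0, x0 ** 2])
    y0 = X0 @ b
    var_ci = s2 * (X0 @ C @ X0)       # mean response, Eq. 5.40
    var_pi = s2 * (1 + X0 @ C @ X0)   # individual prediction, Eq. 5.41
    return y0, t_crit * np.sqrt(var_ci), t_crit * np.sqrt(var_pi)
```

Evaluating intervals(1.0) should recover the last row of the table above to about three decimal places.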
Example 5.4.2 Table 5.3 gives the solubility of oxygen in water (in mg/L) at 1 atm pressure for different temperatures and different chloride concentrations (in mg/L).

(a) Plot the data and formulate two different potential models for oxygen solubility (the response variable) against the two regressors.
(b) Evaluate both models and identify the better one. Give justification for your choice. Report pertinent statistics for model parameters as well as for the overall model fit.

(a) The above data set (28 data points in all) is plotted in Fig. 5.12a. One notes that the series of plots are slightly nonlinear but parallel, suggesting a higher-order model without interaction terms. Hence, the second-order polynomial models without interaction are probably more logical, but let us investigate both the first-order and second-order linear models.

(b1) Analysis results of the first-order model (n = 28 and p = 3). Goodness-of-fit indicators: R² = 96.83%, Adj-R² = 96.57%, RMSE = 0.41318. All three model parameters are statistically significant as indicated by the p-values (<0.05).

If the computed Cp value is much greater than p, this would indicate a biased model because of underfitting (Walpole et al. 1998). Measures other than Adj-R² and the Mallows Cp statistic have been proposed for subset selection. Two of the most widely adopted ones are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (see, e.g., James et al. 2013).

Another model selection approach, which can be automated to evaluate models with a large number of possible parameters, is the iterative approach, which comes in three variants (in many cases, all three may yield slightly different results).

(b1) Backward Elimination Method
One begins by selecting an initial model that includes the full set of possible predictor variables from the candidate pool, and then successively dropping one variable at a time based on its contribution to the reduction of SSE (or dropping the variable which results in the smallest decrease in R²) until a sudden large drop in R² is noticed. The OLS method is used to estimate all model parameters along with t-values for each model parameter. If all model parameters are statistically significant, the model building process stops. If some model parameters are not significant, the model parameter of least significance (lowest t-value) is omitted from the regression equation, and the reduced model is refit. This process continues until all parameters that remain in the model are statistically significant.

(Footnote: Actually, there is no "best" model since random variables are involved. A better term would be "most plausible" and should include mechanistic considerations, if appropriate.)

(b2) Forward Selection Method
One begins with an equation containing no regressors (i.e., a constant model). The model is then augmented by including the regressor variable with the highest simple correlation with the response variable (or the one which will increase R² by the highest amount). If this regression coefficient is significantly different from zero, it is retained, and the search for a second variable is made. This process of adding regressors one by one is terminated when the last variable entering the equation is not statistically significant or when all the variables are included in the model. Clearly, this approach involves fitting many more models than the backward elimination method.

(b3) Stepwise Regression Method
Prior to the advent of resampling methods (Sect. 5.8), stepwise regression was the preferred model-building approach. It combines both the procedures discussed above. Stepwise regression begins by computing correlation coefficients between the response and each predictor variable (like the forward selection method). The variable most highly correlated with the response is then allowed to "enter the regression equation." The parameter for the single-variable regression equation is then estimated along with a measure of the goodness of fit.

The next most highly correlated predictor variable is identified, given the current variable already in the regression equation. This variable is then allowed to enter the equation and the parameters reestimated along with the goodness of fit. Following each parameter estimation, t-values for each parameter are calculated and compared to the t-critical value to determine whether all parameters are still statistically significant. Any parameter that is not statistically significant is removed from the regression equation. This process continues until no more variables "enter" or "leave" the regression equation.

In general, it is best to select the model that yields a reasonably high "goodness of fit" for the fewest parameters in the model (referred to as model parsimony). The final decision on model selection requires the judgment of the model builder based on mechanistic insights into the problem. Again, one must guard against
the danger of overfitting by performing a cross-validation check (Sect. 5.8). When a black-box model is used containing several regressors, stepwise regression would improve the robustness of the model identified by reducing the number of regressors and, thus, hopefully reduce the adverse effects of multicollinearity between the remaining regressors. Many packages use the F-test indicative of the overall model instead of the t-test on individual parameters to perform the stepwise regression. It is suggested that stepwise regression not be used in case the regressors are highly correlated, since it may result in non-robust models. However, the backward procedure is said to better handle such situations than the forward selection procedure.

A note of caution is warranted in using stepwise regression for engineering models based on mechanistic considerations. In certain cases, stepwise regression may omit a regressor which ought to be influential when using a particular data set, while the regressor is picked up when another data set is used. This may be a dilemma when the model is to be used for subsequent predictions. In such cases, discretion based on physical considerations should trump purely statistical model building. Resampling methods, such as the cross-validation method (Sect. 5.8.2), can be used as the final judge to settle on the most appropriate model among a relatively small number of variable subsets found by automatic model selection. A sounder estimate of the model prediction error is also directly provided.

Example 5.7.3 Proper model identification with multivariate regression models
An example of multivariate regression is the development of model equations to characterize the performance of refrigeration compressors. It is possible to regress the compressor manufacturer's tabular data of compressor performance using the following simple biquadratic formulation (see Fig. 5.14 for nomenclature):

y = b0 + b1·Tcho + b2·Tcdi + b3·Tcho² + b4·Tcdi² + b5·Tcho·Tcdi   (5.48a)
where y represents either the compressor power (Pcomp) or the cooling capacity (Qch). OLS is then used to develop estimates of the six model parameters, b0 to b5, based on the compressor manufacturer's data (from ASHRAE 2005, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org).

The biquadratic model was used to estimate the parameters for compressor cooling capacity (in refrigeration tons; 1 ton of refrigeration = 12,000 Btu/h) for a screw compressor. The model and its corresponding parameter estimates are given in Table 5.5. Although the overall curve fit for the data was excellent (R² = 99.96%), the t-values of b2 and b4 were clearly insignificant, indicating that the first-order term Tcdi and the second-order term Tcdi² should be dropped.

Table 5.5 Results of the first- and second-stage model building (Example 5.7.3)

              With all parameters       With significant parameters only
Coefficient   Value        t-value      Value       t-value
b0            152.50        6.27        114.80      73.91
b1            3.71         36.14        3.91        11.17
b2            -0.335       -0.62        –           –
b3            0.0279       52.35        0.027       14.82
b4            -0.000940    -0.32        –           –
b5            -0.00683     -6.13        -0.00892    -2.34

As an aside, it has been suggested by some authors that one should "maintain hierarchy," that is, include in the model lower-order terms of a specific variable even if found to be statistically insignificant by stepwise analysis or an analysis akin to the one done above. This practice is not universally accepted, though, and is better left to the discretion of the analyst. A second-stage regression is done by omitting these regressors, resulting in the following model with the coefficients and t-values shown in Table 5.5:

y = b0 + b1·Tcho + b3·Tcho² + b5·Tcho·Tcdi   (5.48b)

All the parameters in the simplified model are significant and the overall model fit remains excellent: R² = 99.5%. ■
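The two-stage refit of Example 5.7.3 is backward elimination in miniature: drop the weakest regressor, refit, repeat. A generic sketch (the function and the synthetic data are illustrative, not the book's code; the critical t-value of 2.0 is a rough rule of thumb):

```python
import numpy as np

def backward_eliminate(X, y, names, t_crit=2.0):
    """Backward elimination: starting from the full model, repeatedly drop
    the regressor with the smallest |t-value| until all remaining
    coefficients are significant. The intercept (column 0) is never dropped."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        Xk = X[:, keep]
        C = np.linalg.inv(Xk.T @ Xk)
        b = C @ Xk.T @ y                            # OLS estimates, Eq. 5.36
        resid = y - Xk @ b
        s2 = resid @ resid / (len(y) - len(keep))   # residual mean square
        tvals = b / np.sqrt(s2 * np.diag(C))
        worst = min(range(1, len(keep)), key=lambda j: abs(tvals[j]))
        if abs(tvals[worst]) >= t_crit:             # everything significant
            break
        keep.pop(worst)                             # refit without weakest term
    Xk = X[:, keep]
    b = np.linalg.solve(Xk.T @ Xk, Xk.T @ y)
    return [names[i] for i in keep], b
```

On synthetic data where, say, x2 has no true influence, the procedure typically eliminates x2 while retaining the influential regressors and the intercept.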
5.5 Applicability of OLS Parameter Estimation

5.5.1 Assumptions
The term “least squares” regression is generally applied to linear models, although the concept can be extended to nonlinear functions as well. The ordinary least squares (OLS) regression method is a special and important subclass whose parameter estimates are best only when a number of conditions regarding the functional form and the model residuals or errors are met (discussed below). It enables simple (univariate) or multivariate linear regression models to be identiﬁed from data, which can then be used for future prediction of the response variable along with its uncertainty intervals. It also allows statistical statements to be made
about the estimated model parameters, a process known as "inference." No statistical assumptions are used to obtain the OLS estimators for the model coefficients. When nothing is known regarding measurement errors, OLS is often the best choice for estimating the parameters. However, to make statistical statements about these estimators and the model predictions, it is necessary to acquire information regarding the measurement errors. Ideally, one would like the error terms (or residuals) εi to be normally distributed, without serial correlation, with mean zero and constant variance. The implications of each of these four assumptions, as well as a few additional ones, will be briefly addressed below, since some of these violations may lead to biased coefficient estimates, to distorted estimates of the standard errors and confidence intervals, and to improper conclusions from statistical tests.

(a) Errors should have zero mean: If this is not true, the OLS estimator of the intercept will be biased. The impact of this assumption not being correct is generally viewed as the least critical among the various assumptions. Mathematically, this implies that the expected error values E(εi) = 0.

(b) Errors should be normally distributed: If this is not true, statistical tests and confidence intervals are incorrect for small samples, though the OLS coefficient estimates are unbiased. Fig. 5.3, which illustrates this behavior, has already been discussed. This problem can be avoided by having larger samples and verifying that the model is properly specified.

(c) Errors should have constant variance: var(εi) = σ². This violation of the basic OLS assumption results in increased standard errors of the estimates and widened model prediction confidence intervals (though the OLS estimates themselves are unbiased). In this sense, there is a loss in statistical power.
This condition, in which the variance of the residuals or error terms is not constant, is called heteroscedasticity and is discussed further in Sect. 5.6.3.

(d) Errors should not be serially correlated: This violation is equivalent to having fewer independent data and also results in a loss of statistical power, with the same consequences as (c) above. Serial correlation may occur due to the manner in which the experiment is carried out. Extraneous factors, that is, factors beyond our control (such as the weather), may leave little or no choice as to how the experiments are executed. An example of a reversible experiment is the classic pipe-friction experiment, where the flow through a pipe is varied to cover both laminar and turbulent flows, and the associated friction drops are observed. Gradually increasing the flow one way (or decreasing it the other
way) may introduce biases in the data, which will subsequently also bias the model parameter estimates. In other circumstances, certain experiments are irreversible. For example, the loading on a steel sample to produce a stress–strain plot must be performed by gradually increasing the loading till the sample breaks; one cannot proceed in the other direction. Usually, the biases brought about by the test sequence are small, and this may not be crucial. In mathematical terms, this condition, for a first-order case, can be written as the expected value of the product of two consecutive errors, E(εi·εi+1) = 0. This assumption, which is said to be the hardest to verify, is further discussed in Sect. 5.6.4.

(e) Errors should be uncorrelated with the regressors: The consequences of this violation are that OLS coefficient estimates are biased and the predicted OLS confidence intervals understated, that is, narrower. This violation is a very important one and is often due to "misspecification error" or underfitting. Omission of influential regressor variables and improper model formulation (assuming a linear relationship when it is not) are likely causes. This issue is discussed at more length in Sect. 5.6.5.

(f) Regressors should not have any measurement error: Violation of this assumption in some (or all) regressors will result in biased OLS coefficient estimates for those (or all) regressors. The model can be used for prediction, but the confidence intervals will be understated. Strictly speaking, this assumption is hardly ever satisfied since there is always some measurement error. However, in most engineering studies, measurement errors in the regressors are not large compared to the random errors in the response, and so this violation may not have important consequences. As a rough rule of thumb, this violation becomes important when the errors in x reach about a fifth of the random errors in y, and when multicollinearity is present.
If the errors in x are known, there are procedures that allow unbiased coefficient estimates to be determined (see Sect. 9.4.6). Mathematically, this condition is expressed as Var(xi) = 0.

(g) Regressor variables should be independent of each other: This violation applies to models identified by multiple regression when the regressor variables are correlated with each other (called multicollinearity). This is true even if the model provides an excellent fit to the data. Estimated regression coefficients, though unbiased, will tend to be unstable (their values tend to change greatly when a data point is dropped or added), and the OLS standard errors and the prediction intervals will be understated. Multicollinearity is likely to be a problem only when one (or more) of the correlation coefficients among the regressors exceeds 0.8–0.85 or so. Sect. 9.3 deals with this issue at more length.
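Assumptions (a) and (d) lend themselves to quick numerical screens; a common check for first-order serial correlation is the Durbin–Watson statistic (values near 2 suggest no serial correlation; values near 0 or 4 indicate strong positive or negative correlation, respectively). A minimal sketch, not from the book:

```python
import numpy as np

def residual_checks(resid):
    """Return the residual mean (assumption a: should be near zero) and the
    Durbin-Watson statistic d = sum((e_i - e_{i-1})^2) / sum(e_i^2)
    (assumption d: d near 2 suggests no first-order serial correlation)."""
    resid = np.asarray(resid, float)
    return resid.mean(), np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```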
5.5.2 Sources of Errors During Regression
Perhaps the most crucial issue during parameter identification is the type of measurement inaccuracy present. This has a direct influence on the estimation method to be used. Though statistical theory has neatly classified this behavior into a finite number of groups, the data analyst is often stymied by data that do not fit into any one category, and the remedial action advocated does not seem to entirely remove the adverse data conditioning. A certain amount of experience is required to surmount this type of adversity, which, further, is circumstance-specific. As discussed earlier, there can be two types of errors:
(a) Measurement error. The following subcases can be identified depending on whether the error occurs:
(i) In the dependent variable, in which case the model form is:

yi + δi = β0 + β1 xi   (5.49a)

or in the regressor variable, in which case the model form is:

yi = β0 + β1 (xi + γi)   (5.49b)

or in both dependent and regressor variables:

yi + δi = β0 + β1 (xi + γi)   (5.49c)
Further, the errors δ and γ (which will be jointly represented by ε) can be additive, in which case εi ≠ f(yi, xi), or multiplicative, in which case εi = f(yi, xi), or, worse still, a combination of both. Section 9.4.1 discusses this issue further.
(b) Model misspecification error: How this would affect the model residuals εi is difficult to predict and is extremely circumstance-specific. Misspecification could be due to several factors, for example:
(i) One or more important variables have been left out of the model
(ii) The functional form of the model is incorrect
Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model (see Sect. 9.2). This is likely to introduce both bias and random noise in the parameter estimation process except when the model R² is very high (R² > 0.9). This issue is further discussed in Sect. 5.6. Formal statistical procedures do not explicitly treat this case but limit themselves to type (a) errors and, more specifically, to case (i) assuming purely additive or multiplicative errors. The implicit assumptions in OLS and their implications, if violated, are described below.
5.6 Model Residual Analysis and Regularization¹¹
5.6.1 Detection of Ill-Conditioned Behavior
The availability of statistical software has resulted in routine and easy application of OLS to multiple linear models. However, there are several underlying assumptions that affect the individual parameter estimates of the model as well as the overall model itself. Once a model has been identified, the general tendency of the analyst is to hasten to use the model for whatever purpose was intended. However, it is extremely important (and this phase is often overlooked) that an assessment of the model be done to determine whether the OLS assumptions are met; otherwise, the model is likely to be deficient or misspecified and yield misleading results. In the last 50 years or so, much progress has been made on how to screen model residual behavior to gain diagnostic insight into model deficiency or misspecification, take remedial action, or adopt more advanced regression techniques.¹² Some of the simple methods to screen and correct for ill-behaved model residuals are presented in this chapter, while more advanced statistical concepts and regression methods are addressed in Chap. 9. A few idealized plots illustrate some basic patterns of improper residual behavior, which are addressed in more detail in the later sections of this chapter. Fig. 5.15 illustrates the effect of omitting an important dependence; the residuals suggest that an additional variable, one which distinguishes between the two groups, should be introduced into the model. The presence of outliers and the need for more robust regression schemes that are immune to such outliers are illustrated in Fig. 5.16. The presence of nonconstant variance (or heteroscedasticity) in the residuals is a very common violation, and one of several possible manifestations is shown in Fig. 5.17. This particular residual behavior is likely to be remedied by using a log transform of the response variable instead of the variable itself. Another approach is to use the weighted least squares estimation procedures described later in this section.
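To supplement visual screening, simple numeric indicators can hint at the funnel and outlier patterns of Figs. 5.16 and 5.17. The helper below is our own illustration (not a formal test from the text): it correlates |residual| with the regressor as a heteroscedasticity hint and counts gross outliers:

```python
import numpy as np

def residual_screen(x, y_obs, y_fit):
    """Crude numeric companions to residual plots: a funnel-shape hint
    (does |residual| grow with x?) and a count of >3-sigma residuals."""
    e = y_obs - y_fit
    funnel_hint = np.corrcoef(x, np.abs(e))[0, 1]  # clearly > 0 suggests a funnel
    z = (e - e.mean()) / e.std(ddof=1)
    return funnel_hint, int(np.sum(np.abs(z) > 3))

# Synthetic demo: noise standard deviation grows with x (heteroscedastic)
rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 300)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1 * x)
y_fit = np.polyval(np.polyfit(x, y, 1), x)
hint, n_gross = residual_screen(x, y, y_fit)
print(round(hint, 2), n_gross)
```

Such indicators only supplement, never replace, inspection of the residual plots themselves.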
Though nonconstant variance is easy to detect visually, its cause is difficult to identify. Fig. 5.18 illustrates a typical behavior that arises when a
11 Herschel: “. . . almost all of the greatest discoveries in astronomy have resulted from the consideration of what . . . (was) termed residual phenomena.”
12 Unfortunately, many of these techniques are not widely used by those involved in energy-related data analysis.
Fig. 5.15 The residuals can be separated into two distinct groups (shown as crosses and dots), which suggests that the response variable is related to another regressor not considered in the regression model. This improper residual pattern can be rectified by reformulating the model to include this additional variable. One example of such a time-based event system change is shown in Fig. 8.17

Fig. 5.16 Outliers indicated by crosses suggest that data should be checked and/or robust regression used instead of OLS

Fig. 5.17 Residuals with bow shape and increased variability (i.e., the error increases as the response variable y increases) can often be rectified by a log transformation of y

Fig. 5.18 Bow-shaped residuals can often be rectified by evaluating higher-order linear models

Fig. 5.19 Serial correlation is indicated by a pattern in the residuals when plotted in the sequence the data was collected, that is, when plotted against time, even though time may not be a regressor in the model
linear function is used to model a quadratic variation. The proper corrective action will increase the predictive accuracy of the model (RMSE will be lower), result in the estimated parameters being more efﬁcient (i.e., lower standard errors), and most importantly, allow more sound and realistic interpretation of the model prediction uncertainty bounds. Figure 5.19 illustrates the occurrence of serial correlations in time series data which arises when the error terms are not independent. Such patterned residuals occur commonly during model development and provide useful insights into
model deficiency. Serial correlation (or autocorrelation) has special pertinence to time series data (or data ordered in time) collected from the in-situ performance of mechanical and thermal systems and equipment. Autocorrelation is present if adjacent residuals show a trend or a pattern of clusters above or below the zero value that can be discerned visually. Such correlations can either suggest that additional variables have been left out of the model (model misspecification error) or could be due to the nature of the process itself (the dynamic nature of the process, further treated in Chap. 8 on time series analysis). The latter arises because equipment loading over a day would follow an overall cyclic curve (as against random jumps from, say, full load to half load) consistent with the diurnal cycle and the way the system is operated. In such cases, positive residuals would tend to be followed by positive residuals, and vice versa. Problems associated with model underfitting and overfitting are usually the result of a failure to identify the nonrandom pattern in time series data. Underfitting does not capture enough of the variation in the response variable which the corresponding set of regressor variables can possibly explain. For example, all four models fit to their respective sets of data as shown in Fig. 5.20 have identical R² values and t-statistics but are distinctly different in how they capture the data variation. Only plot (a) can be described by a linear model. The data in (b) need to be fitted by a higher-order model, while one data point in (c) and (d) distorts the entire
Fig. 5.20 Plot of the data (x, y) with the ﬁtted lines for four data sets. The models have identical R2 and tstatistics but only the ﬁrst model is a realistic model (From Chatterjee and Price 1991, with permission from John Wiley and Sons)
model. Blind model fitting (i.e., relying only on model statistics) is, thus, inadvisable. This aspect is further discussed in Sect. 5.6.5. Overfitting implies capturing randomness in the model, that is, attempting to fit the noise in the data. A rather extreme example is when one attempts to fit a model with six parameters to six data points which have some inherent experimental error. The model has zero degrees of freedom, and the set of six equations can be solved without error (i.e., RMSE = 0). This is clearly unphysical because the model parameters have also "explained" the random noise in the observations in a deterministic manner. Both underfitting and overfitting can be detected by performing certain statistical tests on the residuals. The most commonly used test for white noise (i.e., uncorrelated residuals) involving model residuals is the Durbin–Watson (DW) statistic defined by:

DW = [Σ_{i=2}^{n} (εi − εi−1)²] / [Σ_{i=1}^{n} εi²]   (5.50a)
where εi is the residual at time interval i, defined as εi = yi − ŷi. An approximate relationship can also be used (Chatterjee and Price 1991):

DW ≈ 2 (1 − r)   (5.50b)
where r is the correlation coefficient (Eq. 3.12) between time-lagged residuals. If there is no serial or autocorrelation present, the expected value of DW is 2 (the limiting range being 0–4). The closer DW is to 2, the stronger the evidence that there is no autocorrelation in the data. If the model underfits, DW < 2, while DW > 2 indicates an overfitted model. Tables are available for approximate significance tests with different numbers of regressor variables and numbers of data points. Table A.13 assembles lower and upper critical values of the DW statistic to test autocorrelation. These apply to positive DW values; if, however, a test is to be conducted with negative DW values, the quantity (4 − DW) should be used. For example, if n = 20 and the model has three variables (p = 3), the null hypothesis that the correlation coefficient is equal to zero can be rejected at the 0.05 significance level if its value is either below 1.00 or above 1.68. Note that the critical values in the table are one-sided, that is, they apply to a one-tailed distribution. It is important to note that the DW statistic is only sensitive to correlated errors in adjacent observations, that is, when only first-order autocorrelation is present. For example, if the time series has seasonal patterns, then higher-order autocorrelations may be present which the DW statistic will be unable to detect. More advanced concepts and modeling are discussed in Sect. 8.5.3 while treating stochastic time series data.
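Equation 5.50a translates to a few lines of numpy (a sketch; the function name is ours). The demo checks the two regimes discussed above: white-noise residuals give DW near 2, while strongly positively autocorrelated residuals give DW well below 2:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic (Eq. 5.50a). Only sensitive to first-order
    autocorrelation: ~2 means none, <2 positive, >2 negative."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
e_white = rng.normal(size=5000)           # uncorrelated residuals
e_corr = np.empty(5000)                   # AR(1) residuals with strong
e_corr[0] = rng.normal()                  # positive autocorrelation (0.9)
for i in range(1, 5000):
    e_corr[i] = 0.9 * e_corr[i - 1] + rng.normal()

dw_white, dw_corr = durbin_watson(e_white), durbin_watson(e_corr)
r = np.corrcoef(e_corr[:-1], e_corr[1:])[0, 1]
print(round(dw_white, 2), round(dw_corr, 2), round(2 * (1 - r), 2))
```

The last two printed values illustrate the approximation DW ≈ 2(1 − r) of Eq. 5.50b.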
5.6.2 Leverage and Influence Data Points
Most of the aspects discussed above relate to identifying general patterns in the residuals of the entire data set. Another issue is the ability to identify subsets of data that have an unusual or disproportionate influence on the estimated model in terms of parameter estimation. Being able to flag such influential subsets or individual points allows one to investigate their validity, or to glean insights for better experimental design, since they may contain the most interesting system behavioral information. Note that such points are not necessarily "bad" data points which should be omitted, but rather are to be viewed as "distinctive" observations in the overall data set. It is useful to provide a geometrical understanding of outlier points and their potential impact on the model parameter estimates. No matter how carefully an experiment is designed and performed, there always exists the possibility of serious errors. These errors could be due to momentary instrument malfunction (say, dirt sticking onto a paddle wheel of a flow meter), power surges (which may cause data logging errors), or the engineering system deviating from its intended operation due to random disturbances. Usually, it is difficult to pinpoint the cause of the anomalies. The experimenter is often not fully sure whether the outlier is anomalous, or whether it is a valid or legitimate data point which does not conform to what the experimenter "thinks" it should. In such cases, throwing out a data point may amount to data "tampering" or fudging of results. Usually, data that exhibit such anomalous tendencies are a minority. Even then, if the data analyst retains these questionable observations, they can bias the results of the entire analysis since they exert an undue influence and can dominate a computed relationship between two variables.
Fig. 5.21 Illustrating different types of outliers. Point A is very probably a doubtful point; point B might be bad but could potentially be a very important point in terms of revealing unexpected behavior; point C is close enough to the general trend and should be retained until more data is collected
Consider the case of outliers during regression for the univariate case. Data points are said to be outliers when their model residuals are large relative to the other points. A visual investigation can help one distinguish between endpoints and center points (this is the intent of exploratory data analysis Sect. 3.5). For example, point A of Fig. 5.21 is quite obviously an outlier, and if the rejection criterion orders its removal, one should proceed to do so. On the other hand, point B, which is near the end of the data domain, may not be a bad point at all, but merely the beginning of a new portion of the curve (say, the onset of turbulence in an experiment involving laminar ﬂow). Similarly, even point C may be valid and important. Hence, the only way to remove this ambiguity is to take more observations at the lower end. Thus, a simple heuristic is to reject points only when they are center points. Several advanced books present formal statistical treatment of outliers in a regression context. One can diagnose whether the data set is illconditioned or not, as well as identify and reject, if needed, the necessary outliers that cause illconditioning during the modelbuilding process (e.g., Belsley et al. 1980). Consider Fig. 5.22a. The outlier point will have little or no inﬂuence on the regression parameters identiﬁed, and in fact retaining it would be beneﬁcial since it would lead to a reduction in model parameter variance. The behavior shown in Fig. 5.22b is more troublesome because the estimated slope is almost wholly determined by the extreme point. In fact, one may view this situation as a data set with only two data points, or one may view the single point as a spurious point and remove it from the analysis. Gathering more data at that range would be advisable but may not be feasible; this is where the judgment of the analyst or prior information about the underlying trend line is useful. 
How, and the extent to which, each of the data points affects the outcome of the regression line will
Fig. 5.22 Two other examples of outlier points. While the outlier point in (a) is most probably a valid point, this is not clear for the outlier point in (b). Either more data must be collected or, failing that, it is advisable to delete this point from any subsequent analysis (From Belsley et al. (1980) by permission of John Wiley and Sons)
determine whether that particular point is an influence point or not. Scatter plots often reveal such outliers easily for single regressor situations but are inappropriate for multivariate cases. Hence, several statistical measures have been proposed to deal with multivariate situations, the influence and leverage indices being widely used (Belsley et al. 1980; Cook and Weisberg 1982; Chatterjee and Price 1991). The leverage of a datum point quantifies the extent to which that point is "isolated" in the x-space, that is, its distinctiveness in terms of the regressor variables. Consider the following symmetric matrix (called the hat matrix):

H = X (X^T X)^{-1} X^T = [pij]   (5.51)
where X is a data matrix with n rows (n is the number of observations) and p columns (given by Eq. 5.31). The order of the H matrix would be (n × n), that is, equal to the number of observations. The diagonal element pii is defined as the leverage of the ith data point. Since the diagonal elements can be related to the distance between xi and the mean x̄, with values between 0 and 1, their average value is equal to (p/n). Points with pii > [3(p/n)] are regarded as points with high leverage (sometimes the threshold is taken as [2(p/n)]). Large residuals are traditionally used to highlight suspect data points or data points unduly affecting the regression model. Instead of looking at the residuals εi, it is more meaningful to study a normalized or scaled value, namely the R-student residuals, where

R-studenti = εi / [RMSE · (1 − pii)^{1/2}]   (5.52)
Thus, studentized residuals measure how many standard deviations each observed value deviates from a model fitted using all of the data except that observation. Points with |R-student| > 3 can be said to be influence points that correspond to a significance level of 0.01. Sometimes a less conservative value of 2 is used, corresponding to the 0.05 significance level, with the underlying assumption that the residuals or errors are Gaussian. A data point is said to be influential if its deletion, singly or in combination with a relatively few others, causes statistically significant changes in the fitted model coefficients. There are several measures used to describe influence; a common one is DFITS:

DFITSi = εi (pii)^{1/2} / [si (1 − pii)^{1/2}]   (5.53)
where εi is the residual error of observation i, and si is the standard deviation of the residuals computed without the ith residual. Points with |DFITS| ≥ 2 [p/(n − p)]^{1/2} are flagged as influential points. It is advisable to identify points with high leverage first, and then examine them in terms of the R-student statistic and the DFITS index for a final determination. Influential observations can impact the final regression model in different ways (Hair et al. 1998). For example, in Fig. 5.23a, the model residuals are not significant, and the two influential observations shown as filled dots reinforce the general pattern in the model and lower the standard error of the parameters and of the model prediction. Thus, the two points would be considered to be leverage points that are beneficial to our model building. Influential points that adversely impact model building are illustrated in Fig. 5.23b and c. In the former, the two influential points almost totally account for the observed relationship but would not have been identified as outlier points. In Fig. 5.23c, the two influential points have totally altered the model identified, and the actual data points would have shown up as points with large residuals which the analyst would probably have identified as spurious. The next frame (d) illustrates the instance when an influential point changes the intercept of the model but leaves the
Fig. 5.23 (a–f) Common patterns of influential observations (From Hair et al. 1998 by permission of Pearson Education)
slope unaltered. The two ﬁnal frames, Fig. 5.23e, f, illustrate two, hard to identify and rectify, cases when two inﬂuential points reinforce each other in altering both the slope and the intercept of the model though their relative positions are very much different. Note that data points that satisfy both these statistical criteria, that is, are both inﬂuential and have high leverage, are the ones worthy of closer scrutiny. Most statistical programs have the ability to ﬂag such points, and hence performing this analysis is fairly straightforward. Thus, in summary, individual data points can be outliers, leverage, or inﬂuential points. Leverage of a point is a measure of how unusual the point lies in the xspace. As mentioned above, just because a point has high leverage does not make it inﬂuential. An inﬂuence point is one that has an important effect on the regression model if that particular point were to be removed from the data set. Inﬂuential points are the ones that need particular attention since they provide insights about the robustness of the ﬁt. In any case, all three measures (leverage pii, DFITS, and Rstudent) provide indications as to the role played by different observations toward the overall model ﬁt. Ultimately, the decision to either retain or reject such points is somewhat based on judgment.
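For the straight-line case, all three indices can be computed directly from the hat matrix of Eq. 5.51. The numpy sketch below is an illustration only: the leave-one-out standard deviation si is approximated simply, so statistical packages using other studentization conventions will report slightly different numbers:

```python
import numpy as np

def influence_measures(x, y):
    """Leverage p_ii (diagonal of Eq. 5.51), scaled residuals in the spirit
    of Eq. 5.52, and DFITS (Eq. 5.53) for a straight-line OLS fit."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])           # n x p data matrix, p = 2
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix (Eq. 5.51)
    p_ii = np.diag(H)                              # leverages; average = p/n
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                               # OLS residuals
    p = X.shape[1]
    rmse = np.sqrt(np.sum(e ** 2) / (n - p))
    r_scaled = e / (rmse * np.sqrt(1.0 - p_ii))    # Eq. 5.52 form
    # s_i: residual standard deviation with the i-th residual left out
    s_i = np.sqrt((np.sum(e ** 2) - e ** 2) / (n - p - 1))
    dfits = e * np.sqrt(p_ii) / (s_i * np.sqrt(1.0 - p_ii))  # Eq. 5.53 form
    return p_ii, r_scaled, dfits

x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[5] += 3.0                # perturb one point so the residuals are non-zero
p_ii, r_scaled, dfits = influence_measures(x, y)
print(p_ii.round(3), int(np.argmax(np.abs(r_scaled))))
```

Note that the leverages sum to p (the trace of the hat matrix), consistent with their average value of p/n stated above.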
Table 5.6 Data table for Example 5.6.1ᵃ

x     y [ε ~ N(0, 1)]    y1 (last point corrupted)
1     11.69977           11.69977
2     12.72232           12.72232
3     16.24426           16.24426
4     19.27647           19.27647
5     21.19835           21.19835
6     23.73313           23.73313
7     21.81641           21.81641
8     25.76582           25.76582
9     29.09502           29.09502
10    28.9133            50

ᵃ Data available electronically on book website
Example 5.6.1 Example highlighting the different characteristics of residuals versus influence points. Consider the following made-up data (Table 5.6) where x ranges from 1 to 10, and the model is y = 10 + 1.5x, to which random normal noise ε ~ N(0, σ = 1) has been added to give y1 (second column). The response of the last observation has been intentionally corrupted to a value of 50 as shown (say, due to a momentary spike in power supply to the instrument).
Fig. 5.24 (a) Observed vs predicted plot. (b) Residual plot versus regressor plot
How well a linear model fits the data is depicted in Fig. 5.24. Not surprisingly, the table of unusual residuals shown below does include the last observation since its studentized absolute residual value is greater than 3.0 (99% CL). It has been flagged as an influential point since it has a major impact on the model coefficients. However, the same point has not been flagged as a leverage point since it is not "isolated" in the x-space. This example serves to highlight the different impacts of leverage versus influence points. ■

Influential points flagged by the statistical package

Row    x      y      Predicted y    Residual    Studentized residual
10     10.0   50.0   37.2572        12.743      11.43
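The flagging above can be replicated with a short numpy computation on the Table 5.6 data. (The 11.43 reported by the package is a deletion-based studentized residual; the simpler internally scaled residual computed below differs in magnitude but singles out the same observation, and confirms that its leverage stays below the 3p/n threshold of 0.6.)

```python
import numpy as np

# Table 5.6: the y1 column, with the last observation corrupted to 50
x = np.arange(1, 11, dtype=float)
y1 = np.array([11.69977, 12.72232, 16.24426, 19.27647, 21.19835,
               23.73313, 21.81641, 25.76582, 29.09502, 50.0])

X = np.column_stack([np.ones_like(x), x])          # straight-line OLS fit
beta = np.linalg.lstsq(X, y1, rcond=None)[0]
e = y1 - X @ beta                                  # residuals
n, p = X.shape
H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
p_ii = np.diag(H)                                  # leverages
rmse = np.sqrt(np.sum(e ** 2) / (n - p))
r_scaled = e / (rmse * np.sqrt(1.0 - p_ii))        # scaled residuals

# The corrupted point has by far the largest residual (about 12.74,
# matching the tabulated 12.743) but unremarkable leverage (~0.345 < 0.6)
print(round(e[-1], 3), round(p_ii[-1], 3))
```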
5.6.3 Remedies for Nonuniform Residuals
Nonuniform model residuals or heteroscedasticity can be due to: (i) the nature of the process investigated, (ii) noise in the data, or (iii) the method of data collection from samples that are known to have different variances. Three possible generic remedies for nonconstant variance are to (Chatterjee and Price 1991):
(a) Introduce additional variables into the model and collect new data
The physics of the problem, along with model residual behavior, can shed light on whether certain key variables, left out of the original fit, need to be introduced. This aspect is further discussed in Sect. 5.6.5.
(b) Transform the dependent variable
This is appropriate when the errors in measuring the dependent variable follow a probability distribution whose variance is a function of the mean of the distribution. In such cases, the model residuals are likely to exhibit heteroscedasticity that can be removed by using exponential, Poisson, or binomial transformations. For example, a variable that is distributed binomially with parameters n and p has mean (n·p) and variance [n·p·(1 − p)] (Sect. 2.4.2). For a Poisson variable, the mean and variance are equal. The transformations shown in Table 5.7 will stabilize the variance, and the distribution of the transformed variable will be closer to the normal distribution. The logarithmic transformation is also widely used in certain cases to transform a nonlinear model into a linear
Table 5.7 Transformations in dependent variable y likely to stabilize nonuniform model variance

Distribution    Variance of y in terms of its mean μ    Transformation
Poisson         μ                                       y^{1/2}
Binomial        μ(1 − μ)/n                              sin⁻¹(y^{1/2})
Table 5.8 Data table for Example 5.6.2ᵃ

Obs #    x      y
1        294    30
2        247    32
3        267    37
4        358    44
5        423    47
6        311    49
7        450    56
8        534    62
9        438    68
10       697    78
11       688    80
12       630    84
13       709    88
14       627    97
15       615    100
16       999    109
17       1022   114
18       1015   117
19       700    106
20       850    128
21       980    130
22       1025   160
23       1021   97
24       1200   180
25       1250   112
26       1500   210
27       1650   135

ᵃ Data available electronically on book website
one (see Sect. 9.4.3). When the variables have a large standard deviation compared to the mean, working with the data on a log scale often has the effect of dampening variability and reducing asymmetry. This is often an effective means of removing heteroscedasticity as well. However, this approach is valid only when the magnitude of the residuals increases (or decreases) with that of one of the variables.

Example 5.6.2 Example of variable transformation to remedy improper residual behavior
The following example serves to illustrate the use of variable transformation. Table 5.8 shows data/observations from 27 departments in a university, with y the number of faculty and staff and x the number of students. A simple linear regression yields a model with R² = 77.6% and RMSE = 21.73. However, the residuals reveal an unacceptable behavior, with a strong funnel shape (see Fig. 5.25a). Instead of a linear model in y, a linear model in ln(y) is investigated. In this case, the model R² = 76.1% and RMSE = 0.25. However, these statistics should NOT be compared directly with the previous indices since the y variable is no longer the same (in one case, it is "y"; in the other, "ln(y)"). Leaving this issue aside for now, notice that a first-order model does reduce some of the improper residual variance, but an inverted "U" shape behavior can still be detected, indicating model misspecification (see Fig. 5.25b).
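These fits are easy to reproduce from Table 5.8 with numpy's polyfit (a sketch; with the tabulated values, the linear model should roughly reproduce the R² = 77.6% quoted above, and the quadratic model in ln(y) of Eq. 5.54 should show a small negative x² coefficient):

```python
import numpy as np

# Table 5.8: x = number of students, y = number of faculty and staff
x = np.array([294, 247, 267, 358, 423, 311, 450, 534, 438, 697, 688, 630,
              709, 627, 615, 999, 1022, 1015, 700, 850, 980, 1025, 1021,
              1200, 1250, 1500, 1650], dtype=float)
y = np.array([30, 32, 37, 44, 47, 49, 56, 62, 68, 78, 80, 84, 88, 97, 100,
              109, 114, 117, 106, 128, 130, 160, 97, 180, 112, 210, 135],
             dtype=float)

def r_squared(obs, fit):
    return 1.0 - np.sum((obs - fit) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

c_lin = np.polyfit(x, y, 1)                        # linear model in y
r2_lin = r_squared(y, np.polyval(c_lin, x))

c_quad = np.polyfit(x, np.log(y), 2)               # quadratic model in ln(y)
r2_quad = r_squared(np.log(y), np.polyval(c_quad, x))

print(round(r2_lin, 3), round(r2_quad, 3), c_quad)
```

As cautioned in the text, the two R² values are not directly comparable, since they refer to different response variables (y versus ln(y)).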
Finally, using a quadratic model along with the ln transformation results in a model:

ln(y) = 2.8516 + 0.00311267 x − 0.00000110226 x²   (5.54)

The residuals shown in Fig. 5.25c are now quite well behaved as a result of such a transformation. ■
(c) Perform weighted least squares
This approach is more flexible, and several variants exist (Chatterjee and Price 1991). As described earlier, OLS model residual behavior can exhibit nonuniform variance (called heteroscedasticity) even if the model is structurally complete, that is, the model is not misspecified. This violates one of the standard OLS assumptions. In a multiple regression model, the detection of heteroscedasticity may not be very straightforward since only one or two variables may be the culprits. Examination of the residuals versus each variable in turn, along with intuition and understanding of the physical phenomenon being modeled, can be of great help. Otherwise, the OLS estimates will lack precision, and the estimated standard errors of the model parameters will be wider. If this phenomenon occurs, the model identification should be redone with explicit recognition of this fact. During OLS, the sum of the squared model residuals of all points is minimized with no regard to the values of the individual points or to points from different domains of the range of variability of the regressors. The basic concept of weighted least squares (WLS) is to simply assign different weights to different points according to a certain (rational) statistical scheme. The magnitude of the weight of an observation indicates the importance to be given to that observation. A note of caution is that the weights are not known exactly, and assigning values to them is circumstance-specific. The general formulation of WLS is that the following function should be minimized:

WLS function = Σ wi (yi − β0 − β1 x1i − ⋯ − βp xpi)²   (5.55)

where wi are the weights of individual points. These are formulated differently depending on the weighting scheme selected.
(c-i) Errors Proportional to x Resulting in Funnel-Shaped Residuals
Consider the simple model y = α + βx + ε whose residuals ε have a standard deviation that increases with the regressor variable (resulting in the funnel-like shape in Fig. 5.26). Dividing the terms of the model by x results in:
Fig. 5.25 (a) Residual plot of linear model. (b) Residual plot of logtransformed linear model. (c) Residual plot of logtransformed quadratic model (Eq. 5.54)
y/x = α/x + β + ε/x   or   y′ = α x′ + β + ε′   (5.56)

where y′ = y/x, x′ = 1/x, and ε′ = ε/x.

Fig. 5.26 Type of heteroscedastic model residual behavior which arises when errors are proportional to the magnitude of the x variable
with the variance of ε′ becoming constant and equal to a constant k². This is akin to weighting different vertical slices of the regressor variable by var(εi) = k² xi². If the assumption about the weighting scheme is correct, the transformed model will be homoscedastic, and the model parameters α and β will be efficiently estimated by OLS (i.e., the standard errors of the estimates will be optimal). The above transformation is only valid when the model residuals behave as shown in Fig. 5.26. If residuals behave differently, then different transformations or weighting schemes should be explored. Whether a particular transformation is adequate or not can only be gauged by the behavior of the variance of the residuals. Note that the analyst must
Table 5.9 Measured x and y variables, OLS residuals deduced from Eq. 5.58a, and the weights calculated from Eq. 5.58b (Example 5.6.3)ᵃ

x       y       OLS residual εi    wi
1.15    0.99     0.26329            0.9882
1.90    0.98    −0.59826            1.7083
3.00    2.60    −0.2272             6.1489
3.00    2.67    −0.1572             6.1489
3.00    2.66    −0.1672             6.1489
3.00    2.78    −0.0472             6.1489
3.00    2.80    −0.0272             6.1489
5.34    5.92     0.435964          15.2439
5.38    5.35    −0.17945           13.6185
5.40    4.33    −1.22216           12.9092
5.40    4.89    −0.66216           12.9092
5.45    5.21    −0.39893           11.3767
7.70    7.68    −0.48358            0.9318
7.80    9.81     1.53288            0.8768
7.81    6.52    −1.76847            0.8716
7.85    9.71     1.37611            0.8512
7.87    9.82     1.463402           0.8413
7.91    9.81     1.407986           0.8219
7.94    8.50     0.063924           0.8078
9.03    9.47    −0.20366            0.4694
9.07    11.45    1.730922           0.4614
9.11    12.14    2.375506           0.4535
9.14    11.50    1.701444           0.4477
9.16    10.65    0.828736           0.4440
9.37    10.64    0.580302           0.4070
10.17   9.78    −1.18802            0.3015
10.18   12.39    1.410628           0.3004
10.22   11.03    0.005212           0.2963
10.22   8.00    −3.02479            0.2963
10.22   11.90    0.875212           0.2963
10.18   8.68    −2.29937            0.3004
10.50   7.25    −4.0927             0.2696
10.23   13.46    2.423858           0.2953
10.03   10.19   −0.61906            0.3167
10.23   9.93    −1.10614            0.2953

ᵃ Data available electronically on book website
perform two separate regressions: First, an OLS regression to determine the residual amounts of the individual data points, and then a WLS regression for ﬁnal parameter identiﬁcation. This is often referred to as twostage estimation. (cii) Replicated Measurements with Different Variance It could happen, especially with designed experiments involving one regressor variable only that one obtains replicated measurements on the response variable corresponding to a set of ﬁxed values of the explanatory variables. For example, consider the case when the regressor variable x takes several discrete values. If the physics of the phenomenon cannot provide any theoretical basis on how to select a particular weighting scheme, then this must be determined heuristically from studying the data. If there is an increasing pattern in the heteroscedasticity present in the data, this could be modeled either by a logarithmic transform (as illustrated in Example 5.6.2) or a suitable variable transformation. Another more versatile approach that can be applied to any pattern of the residuals is illustrated. Each observed residual εij (where the index for discrete x values is i, and the number of observations at each discrete x value is j = 1, 2, . . . ni) is made up of two parts, that is, εij = yij  yi þ yi  yij . The ﬁrst part is referred to as pure error while the second part measures lack of ﬁt. An assessment of heteroscedasticity is based on pure error. Thus,
the WLS weight may be estimated as wi = 1/si², where the mean square error is:

si² = Σj (yij − ȳi)² / (ni − 1)    (5.57)
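Given replicate groups, Eq. 5.57 and the weights wi = 1/si² can be computed directly; a minimal numpy sketch with invented data (the rounding-based grouping rule is an assumption of this sketch, not from the text):

```python
import numpy as np

def pure_error_weights(x, y, decimals=0):
    """Group replicate observations by (rounded) x value, estimate the
    pure-error mean square s_i^2 per group (Eq. 5.57), and return the
    WLS weight w_i = 1/s_i^2 assigned to each observation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    groups = np.round(x, decimals)  # replicates share a rounded x value
    weights = np.empty_like(y)
    for g in np.unique(groups):
        mask = groups == g
        ni = mask.sum()
        if ni < 2:
            raise ValueError("each group needs replicates to estimate s_i^2")
        # Eq. 5.57: pure-error mean square within the replicate group
        s2 = ((y[mask] - y[mask].mean()) ** 2).sum() / (ni - 1)
        weights[mask] = 1.0 / s2
    return weights

# Replicates at three discrete x values with increasing spread (made-up data)
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1.0, 1.1, 0.9, 2.0, 2.4, 1.6, 3.0, 4.0, 2.0]
w = pure_error_weights(x, y)
```

Note how the weights decrease as the within-group scatter grows, down-weighting the noisier replicate groups in the subsequent WLS fit.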
Alternatively, a model can be fit to the mean values of x and the si² values in order to smoothen out the weighting function, and this function is used instead. Thus, this approach would also qualify as a two-stage estimation process. The following example illustrates this approach.

Example 5.6.3¹³ Example of two-stage weighted regression for replicate measurements

Consider the x–y data given in Table 5.9, noting that replicate measurements of y are taken at different values of x (which vary slightly).

Step 1: An OLS model is identified from the data:

y = −0.578954 + 1.1354 x with R² = 0.841 and RMSE = 1.4566    (5.58a)
13 From Draper and Smith (1981), with permission from John Wiley and Sons.

Fig. 5.27 (a) Data set and OLS regression line of observations with nonconstant variance and replicated observations in x. (b) Residuals of a simple linear OLS model fit (Eq. 5.58a) during step 1. (c) Residuals and regression line of a second-order polynomial OLS fit to the mean x and mean square error (MSE) of the replicate values during step 2 (Eq. 5.58b). (d) Residuals of the weighted regression model identified during step 3 (Eq. 5.58c)

From the summary tables of the regression analysis, one notes that the intercept term in the model is not statistically significant (p-value = 0.4 for the t-statistic), while the overall model fit given by the F-ratio is significant. A scatter plot of these data and the simple OLS linear model are shown in Fig. 5.27a. The residuals of the simple linear OLS model shown in Fig. 5.27b reveal, as expected, marked heteroscedasticity. Hence, the OLS model is bound to lead to misleading uncertainty bands even if the model predictions themselves are not biased. The residuals from the above OLS model are also shown in the 3rd column of Table 5.9.
5 Linear Regression Analysis Using Least Squares

Step 1: OLS model coefficients

Parameter   Least squares estimate   Standard error   t-statistic   p-value
Intercept   −0.578954                0.679186         −0.852423     0.4001
Slope       1.1354                   0.086218         13.169        0.0000

Step 1: OLS analysis of variance

Source          Sum of squares   D.f.   Mean square   F-ratio   p-value
Model           367.948          1      367.948       173.42    0.0000
Residual        70.0157          33     2.12169
Total (Corr.)   437.964          34

Step 2: The residuals of the OLS model are heteroscedastic. One needs to identify a regression model for the OLS model residuals. The data range of the regressor variable can be partitioned into five ranges (these are the discrete values of x for which observations were taken if the first two rows are omitted from the analysis). These five values of x and the corresponding averages of the mean square error si² following Eq. 5.58a are shown in the Step 2 table.

x     3        5.39    7.84     9.15     10.22
si²   0.0072   0.373   1.6482   0.8802   4.1152

The data pattern exhibits a quadratic shape (see Fig. 5.27c), and so a second-order polynomial model is regressed to these data to yield:

si² = 1.887 − 0.8727 x + 0.09967 x² with R² = 0.743    (5.58b)

Step 3: The regression weights wi can thus be deduced by using individual values of xi instead of x in the above equation. The values of the weights are also shown in Table 5.9 under the 4th column.

Step 4: Finally, a weighted regression is performed following the functional form given by Eq. 5.55 (most statistical packages have this capability) using the data under the 1st, 2nd, and 4th columns:

y = −0.942228 + 1.16252 x with R² = 0.896 and RMSE = 1.2725    (5.58c)

The residual plot is shown as Fig. 5.27d. Though the goodness of fit is only slightly better than that of the OLS model, the real advantage is that this model will have better prediction accuracy and more realistic (unbiased) prediction errors than Eq. 5.58a. ■

(c-iii) Nonpatterned Variance in the Residuals
A third type of nonconstant residual variance is one where no pattern is discerned with respect to the regressors, which can be discrete or vary continuously. In this case, a practical approach is to look at a plot of the model residuals against the response variable, divide the range of the response variable into as many regions as seem to have different variances, and calculate the standard deviation of the residuals for each of these regions. In that sense, the general approach parallels the one adopted in case (c-ii) when dealing with replicated values with a nonconstant variance; however, now, no model such as Eq. 5.58b is needed. The general approach would involve the following steps:

• First, fit an OLS model to the data.
• Next, discretize the domain of the regressor variables into a finite number of groups and determine εi², from which the weights wi for each of these groups can be deduced.
• Finally, perform a WLS regression to estimate the efficient model parameters.

Though this two-stage estimation approach is conceptually easy and appealing for simple models, it may become rather complex for multivariate models, and moreover, there is no guarantee that heteroscedasticity will be removed entirely.
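The three steps can be sketched for a straight-line model as follows (illustrative numpy code; the quantile binning rule and the synthetic data are assumptions of this sketch, not the text's):

```python
import numpy as np

def two_stage_wls(x, y, n_bins=3):
    """Two-stage estimation: (1) OLS fit, (2) weights from binned residual
    variance, (3) WLS fit of a straight line y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)        # step 1: OLS
    resid = y - X @ b_ols
    # step 2: discretize the regressor domain and estimate per-bin variance
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    w = np.empty_like(y)
    for k in range(n_bins):
        w[idx == k] = 1.0 / resid[idx == k].var(ddof=1)
    # step 3: WLS via row-scaling by sqrt(weights)
    sw = np.sqrt(w)
    b_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return b_ols, b_wls

# Synthetic heteroscedastic data: noise standard deviation grows with x
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 60)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.05 * x)
b_ols, b_wls = two_stage_wls(x, y)
```

Both fits estimate nearly the same slope here; the gain from WLS lies in the more realistic parameter uncertainty bands, as discussed above.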
5.6.4 Serially Correlated Residuals
Another manifestation of improper residual behavior is serial correlation. As stated earlier (Sect. 5.6.1), one should distinguish between the two different types of autocorrelation, namely pure autocorrelation and model misspecification, although it is often difficult to distinguish between them. The latter is usually addressed using the weight matrix approach (Pindyck and Rubinfeld 1981), which is fairly formal and general, but somewhat demanding. Pure autocorrelation relates to the case of "pseudo" patterned residual behavior, which arises because the regressor variables have strong serial correlation. This serial correlation behavior is subsequently transferred over to the model, and hence to its residuals, even when the regression model functional form is close to "perfect." The remedial approach to be adopted is to transform the original data set prior to the regression itself. There are several techniques for doing so, and the widely used Cochrane-Orcutt (CO) procedure is described here. It involves the use of generalized differencing to alter the linear model into one in which the errors are independent. The two-stage first-order CO procedure involves:
(i) Fitting an OLS model to the original variables
(ii) Computing the first-order serial correlation coefficient r of the model residuals (Eq. 3.12)
(iii) Transforming the original variables y and x into a new set of pseudo-variables:

yt* = yt − r·yt−1 and xt* = xt − r·xt−1    (5.59)

(iv) OLS regression on the pseudo-variables y* and x* to re-estimate the parameters (b0* and b1*) of the model
(v) Finally, obtaining the fitted regression model in the original variables by a back-transformation of the pseudo regression coefficients:

b0 = b0*/(1 − r) and b1 = b1*    (5.60)
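Steps (i) through (v) can be sketched for a simple linear model as follows (illustrative code, not the authors' implementation):

```python
import numpy as np

def cochrane_orcutt(x, y):
    """One pass of the two-stage first-order Cochrane-Orcutt procedure."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)        # (i) OLS on original data
    e = y - X @ b
    r = (e[:-1] @ e[1:]) / (e @ e)                   # (ii) lag-1 serial correlation
    ys = y[1:] - r * y[:-1]                          # (iii) pseudo-variables
    xs = x[1:] - r * x[:-1]
    Xs = np.column_stack([np.ones_like(xs), xs])
    bs, *_ = np.linalg.lstsq(Xs, ys, rcond=None)     # (iv) OLS on pseudo-variables
    b0 = bs[0] / (1.0 - r)                           # (v) back-transform (Eq. 5.60)
    b1 = bs[1]
    return b0, b1, r

# Synthetic data with AR(1) errors to mimic pure autocorrelation
rng = np.random.default_rng(2)
x = np.arange(100.0)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.8 * e[t - 1] + rng.normal(0.0, 1.0)
y = 1.0 + 2.0 * x + e
b0, b1, r = cochrane_orcutt(x, y)
```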
Though two estimation steps are involved, the entire process is simple to implement. This approach, when originally proposed, advocated that the process be continued till the residuals become random (say, based on the Durbin-Watson test). However, the current recommendation is that alternative estimation methods should be attempted if one iteration proves inadequate. This approach can be used during model parameter estimation of MLR models provided only one of the regressor variables is the cause of the pseudo-correlation. Also, a more sophisticated version of the CO procedure has been suggested by Hildreth and Lu (Chatterjee and Price 1991) involving only one estimation process, where the optimal value of r is determined along with the parameters. This, however, requires nonlinear estimation methods.

Example 5.6.4 Using the Cochrane-Orcutt (CO) procedure to remove first-order autocorrelation

Consider the case when observed preretrofit data of either cooling or heating energy consumption in a commercial building support a linear regression model as follows:

Ei = b0 + b1 Ti    (5.61)
where
Ti = daily average outdoor dry-bulb temperature
Ei = daily total energy use predicted by the model
i = subscript representing a particular day
b0 and b1 are the least-squares regression coefficients

How the above transformation yields a regression model different from OLS estimation is illustrated in
Fig. 5.28 How serial correlation in the residuals affects model identiﬁcation (Example 5.6.4)
Fig. 5.28 with year-long daily cooling energy use from a large institutional building in central Texas. The first-order autocorrelation coefficients of cooling energy and average daily temperature were both equal to 0.92, while that of the OLS residuals was 0.60. The Durbin-Watson statistic for the OLS residuals (i.e., untransformed data) was DW = 0.3, indicating strong residual autocorrelation, while that of the CO transform was 1.89, indicating little or no autocorrelation. Note that the CO transform is inadequate in cases of model misspecification and/or seasonal operational changes. ■
5.6.5 Dealing with Misspecified Models
An important source of error during model identification is model misspecification error. This is unrelated to measurement error and arises when the functional form of the model is not appropriate. This can occur due to:

(i) Inclusion of irrelevant variables: does not bias the estimation of the intercept and slope parameters, but generally reduces the efficiency of the slope parameters, that is, their variance will be larger. This source of error can be eliminated by, say, stepwise regression or simple tests such as t-tests.
(ii) Exclusion of an important variable: will result in the slope parameters being both biased and inconsistent.
(iii) Assumption of a linear model: when a linear functional form is erroneously assumed.
(iv) Incorrect model order: when one assumes a lower- or higher-order model than what the data warrant.
Fig. 5.29 Improvement in residual behavior for a model of hourly energy use of a variable air volume HVAC system in a commercial building as inﬂuential regressors are incrementally added to the model. (From Katipamula et al. 1998)
The latter three sources of error are very likely to manifest themselves in improper residual behavior (the residuals will show serial correlation or nonconstant variance behavior). The residual analysis may not identify the exact cause, and several attempts at model reformulation may be required to overcome this problem. Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, experimental or identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model. This could cause model misspecification.

Example 5.6.5 Example to illustrate how the inclusion of additional regressors can remedy improper model residual behavior
Energy use in commercial buildings accounts for about 19% of the total primary energy use in the United States and consequently is a prime area of energy conservation efforts. For this purpose, the development of baseline models, that is, models of energy use for a specific end-use before energy conservation measures are implemented, is an important modeling activity for monitoring and verification studies. Let us illustrate the effect of improper selection of regressor variables, or model misspecification, for modeling the measured thermal cooling energy use of a large commercial building operating 24 hours a day under a variable air volume HVAC system (Katipamula et al. 1998). Figure 5.29 illustrates the residual pattern when hourly energy use is modeled with only the outdoor dry-bulb temperature (To). The residual pattern is blatantly poor, exhibiting both nonconstant variance as well as systematic bias in the low range of the x-variable.
range of the xvariable. Once the outdoor dew point tempera14 the global horizontal solar radiation (qsol) and ture T þ dp ,
the internal building heat loads qi (such as lights and equipment) are introduced in the model, the residual behavior improves signiﬁcantly but the lower tail is still improper. Finally, when additional terms involving indicator variables I to both intercept and To are introduced (described in Sect. 5.7.2), is an acceptable residual behavior achieved. ■
5.7 Other Useful OLS Regression Models

5.7.1 Zero-Intercept Models
Sometimes the physics of the system dictates that the regression line passes through the origin. For the linear case, the model assumes the form:

y = β1 x + ε
ð5:62Þ
The interpretation of R2 under such a case is not the same as for the model with an intercept, and this statistic cannot be used to compare the two types of models directly. Recall that for linear models the R2 value indicates the percentage variation of the response variable about its mean explained by that of the regressor variable. For the nointercept case, the R2 value relates to the percentage variation of the response variable about the origin explained by the regressor variable. Thus, when comparing both models, one should decide on which is the better model based on their RMSE values.
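For the model of Eq. 5.62, minimizing the sum of squared errors yields the closed-form estimator b1 = Σ xi yi / Σ xi². A quick numeric check with made-up data:

```python
import numpy as np

# Made-up data roughly following y = 2x through the origin
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Zero-intercept OLS: minimize sum (y - b1*x)^2  =>  b1 = sum(x*y) / sum(x^2)
b1 = (x @ y) / (x @ x)

# RMSE can be compared across intercept and zero-intercept models; R^2 cannot,
# since the two R^2 statistics measure variation about different references
rmse = np.sqrt(np.mean((y - b1 * x) ** 2))
```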
5.7.2 Indicator Variables for Local Piecewise Models—Linear Splines
Spline functions are an important class of functions, described in numerical analysis textbooks in the framework of interpolation, which allow distinct functions to be used over different ranges while maintaining continuity in the function. They are extremely flexible functions in that they allow a wide range of locally different behavior to be captured within one elegant functional framework. In addition to interpolation, splines have been used in a regression context as well as for data smoothing (discussed below and in Sect. 9.6).
Fig. 5.30 Piecewise linear model or ﬁrstorder spline ﬁt with hinge point at xc. Such models are referred to as change point models in building energy modeling terminology
Thus, a globally nonlinear function can be decomposed into simpler local patterns. Two common cases are discussed below.

(a) The simpler case is one where it is known which points lie on which trend, that is, when the physics of the system is such that the location of the structural break or "hinge point" xc of the regressor is known. One could represent the two regions by a piecewise linear spline (as shown in Fig. 5.30); otherwise, a third-degree polynomial spline is often used to capture highly nonlinear trends (see Sect. 9.6.4). The objective here is to formulate a linear model and identify its parameters that best describe the data points in Fig. 5.30. One cannot simply divide the data into two regions and fit each region with a separate linear model since the two segments are unlikely to intersect at exactly the hinge point (the constraint that the model be continuous at the hinge point would be violated). A model of the following form would be acceptable:

y = b0 + b1 x + b2 (x − xc) I    (5.63a)

where the indicator (also called dummy or binary) variable

I = 1 if x > xc; 0 otherwise    (5.63b)
Hence, for the region x ≤ xc,

y = b0 + b1 x    (5.64)

and for the region x > xc,

y = (b0 − b2 xc) + (b1 + b2) x

Thus, the slope of the model is b1 before the break and (b1 + b2) afterward. The intercept term changes as well, from b0 before the break to (b0 − b2 xc) after the break. The logical extensions to linear spline models with two structural breaks or to higher-order splines involving quadratic and cubic terms are fairly straightforward and treated further in Sect. 9.6.4.

14 Actually, the outdoor humidity impacts energy use only when the dew point temperature Tdp exceeds a certain threshold, which many studies have identified to be about 55 °F (this is related to how the HVAC cooling coil is controlled to meet indoor occupant comfort). This conditional variable, indicated by a + superscript, is equal to (Tdp − 55) when the term is positive, and zero otherwise.

(b) The second case arises when the change point is not known. A simple approach is to look at the data, identify a "ballpark" range for the change point, perform numerous regression fits with the data set divided according to each possible value of the change point in this ballpark range, and pick that value which yields the best overall R² or RMSE. Alternatively, the more accurate but more complex approach is to cast the problem as a nonlinear estimation method with the change point variable as one of the parameters.

Example 5.7.1 Change point models for building utility bill analysis

The theoretical basis of modeling monthly energy use in buildings is discussed in several papers (e.g., Reddy et al. 1997, 2016). The interest in this particular time scale is obvious—such information is easily obtained from utility bills, which are usually available on a monthly time scale. The models suitable for this application are similar to linear spline models and are referred to as change point models by building energy analysts. A simple example is shown below to illustrate the above equations.

Electricity utility bills of a residence in Houston, TX have been normalized by the number of days in the month and assembled in Table 5.10 along with the corresponding month and monthly mean outdoor temperature values for Houston (the first three columns of the table). The intent is to use Eq. 5.63a to model this behavior. The scatter plot and the trend lines drawn in Fig. 5.31 suggest that the change point is in the range 17–19 °C. Let us
perform the calculation assuming a value of 17 °C. Defining an indicator variable:

I = 1 if x > 17 °C; 0 otherwise
Based on this assumption, the last two columns of the table have been generated to correspond to the two regressor variables in Eq. 5.63a. A linear multiple regression yields:

y = 0.1046 + 0.005904 x + 0.00905 (x − 17) I with R² = 0.996 and RMSE = 0.0055
ð5:65Þ
with all three parameters being statistically signiﬁcant. The reader can repeat this analysis assuming a different value for the change point (say xc = 18 °C) in order to study the
Fig. 5.31 Piecewise linear regression lines for building electric use with outdoor temperature. The change point is the point of intersection of the two lines. The combined model is called a change point model, which in this case is a four-parameter model given by Eq. 5.65
Table 5.10 Measured monthly energy use data and calculation step for deducing the change point independent variable assuming a base value of 17 °C. Data for Example 5.7.1a

Month   Mean outdoor temperature (°C)   Monthly mean daily electric use (kWh/m²/day)   x (°C)   (x − 17 °C)I (°C)
Jan     11                              0.1669                                         11       0
Feb     13                              0.1866                                         13       0
Mar     16                              0.1988                                         16       0
Apr     21                              0.2575                                         21       4
May     24                              0.3152                                         24       7
Jun     27                              0.3518                                         27       10
Jul     29                              0.3898                                         29       12
Aug     29                              0.3872                                         29       12
Sept    26                              0.3315                                         26       9
Oct     22                              0.2789                                         22       5
Nov     16                              0.2051                                         16       0
Dec     13                              0.1790                                         13       0

a Data available electronically on book website
sensitivity of the model to the choice of the change point value. Though only three parameters are determined by regression, this is an example of a four-parameter (or 4P) model in building science terminology. The fourth parameter is the change point xc, which also needs to be selected/determined. Specialized software programs have been developed to determine the optimal value of xc (i.e., the one which results in the minimum RMSE among different possible choices of xc) following a numerical search process akin to the one described in this example. ■
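The numerical search just described, fitting Eq. 5.63a over a grid of candidate change points and retaining the one with minimum RMSE, can be sketched with the Table 5.10 data (the candidate grid, and the RMSE convention using n rather than n − p in the denominator, are choices of this sketch):

```python
import numpy as np

# Table 5.10: monthly mean outdoor temperature (°C) and daily electric use
x = np.array([11, 13, 16, 21, 24, 27, 29, 29, 26, 22, 16, 13], float)
y = np.array([0.1669, 0.1866, 0.1988, 0.2575, 0.3152, 0.3518,
              0.3898, 0.3872, 0.3315, 0.2789, 0.2051, 0.1790])

def fit_change_point(x, y, xc):
    """Fit y = b0 + b1*x + b2*(x - xc)*I (Eq. 5.63a); return (coeffs, rmse)."""
    I = (x > xc).astype(float)
    X = np.column_stack([np.ones_like(x), x, (x - xc) * I])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((y - X @ b) ** 2))
    return b, rmse

# Fit at the assumed change point of 17 °C (Eq. 5.65)
b17, rmse17 = fit_change_point(x, y, 17.0)

# Grid search over a ballpark range of candidate change points
candidates = np.arange(15.0, 21.0, 0.5)
best_xc = min(candidates, key=lambda xc: fit_change_point(x, y, xc)[1])
```

The coefficients at xc = 17 °C should closely match those quoted in Eq. 5.65.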
5.7.3 Indicator Variables for Categorical Regressor Models

The use of indicator (also called dummy) variables has been illustrated in the previous section when dealing with spline models. Indicator variables are also used in cases when shifts in either the intercept or the slope are to be modeled, with the condition of continuity now being relaxed. Most variables encountered in mechanistic models are quantitative and continuous, that is, the variables are measured on a numerical scale. Variables which cannot be controlled are often called "covariates." Some examples are temperature, pressure, distance, energy use, and age. Occasionally, the analyst comes across models involving qualitative or categorical variables, that is, regressor data that belong in one of two (or more) possible categories. One would like to evaluate whether differences in intercept and slope between categories are significant enough to warrant two separate models or not. This concept is illustrated by the following example.

Whether the annual energy use of a regular commercial building is markedly higher than that of another certified as being energy efficient is to be determined. Data from several buildings which fall in each group are gathered to ascertain whether the presumption is supported by the actual data. Covariates that affect the normalized energy use (variable y) of both experimental groups are conditioned floor area (variable x1) and outdoor temperature (variable x2). Suppose that a linear relationship can be assumed with the same intercept for both groups. One approach would be to separate the data into two groups, one for regular buildings and one for efficient buildings, and develop regression models for each group separately. Subsequently, one could perform a t-test to determine whether the slope terms of the two models are significantly different or not. However, the assumption of a constant intercept term for both models may be erroneous, and this may confound the analysis.

A better approach is to use the entire data set and adopt a modeling approach involving indicator variables. Let

model 1 be for regular buildings: y = a1 + b1 x1 + c1 x2    (5.66a)

and model 2 be for energy-efficient buildings: y = a2 + b2 x1 + c2 x2    (5.66b)

The complete model (or model 3) would be formulated as:

y = a1 + b1 x1 + c1 x2 + I (a2 + b2 x1 + c2 x2)    (5.67a)

where I is an indicator variable such that

I = 1 for energy-efficient buildings; 0 for regular buildings    (5.67b)
Note that a basic assumption in formulating this model is that all three model parameters are affected by the building group. Formally, one would like to test the null hypothesis H0: a2 = b2 = c2 = 0. The hypothesis is tested by constructing an F-statistic for the comparison of the two models. Note that model 3 is referred to as the full model (FM) or pooled model. Model 1, when the null hypothesis holds, is the reduced model (RM). The idea is to compare the goodness-of-fit of the FM with that of the RM using both data sets combined. If the RM provides as good a fit as the FM, then the null hypothesis is valid. Let SSE(FM) and SSE(RM) be the corresponding model sums of squared errors or squared model residuals. Then, the following F-test statistic is defined:

F = {[SSE(RM) − SSE(FM)] / (p − m)} / {SSE(FM) / (n − p)}    (5.68)
where n is the number of data points, p is the number of parameters of the FM, and m is the number of parameters of the RM. If the observed F-value is larger than the tabulated value of F with (p − m) and (n − p) degrees of freedom at the prespecified significance level (provided by Table A.6), the RM is unsatisfactory, and the full model has to be retained. As a cautionary note, this test is strictly valid only if the OLS assumptions for the model residuals hold.

Example 5.7.2 Combined modeling of energy use in regular and energy-efficient buildings

Consider the data assembled in Table 5.11. Let us designate the regular buildings by group (A) and the energy-efficient buildings by group (B), with the problem simplified by assuming both types of buildings to be located in the same geographic location. Hence, the model has only one regressor variable involving floor area. The complete model with the indicator variable term given by Eq. 5.67a is used to verify whether group B buildings consume less energy than group A buildings.
Table 5.11 Data table for Example 5.7.2a

Energy use (y)   Floor area (x1)   Bldg type   Energy use (y)   Floor area (x1)   Bldg type
45.44            225               A           32.13            224               B
42.03            200               A           35.47            251               B
50.1             250               A           33.49            232               B
48.75            245               A           32.29            216               B
47.92            235               A           33.5             224               B
47.79            237               A           31.23            212               B
52.26            265               A           37.52            248               B
50.52            259               A           37.13            260               B
45.58            221               A           34.7             243               B
44.78            218               A           33.92            238               B

a Data available electronically on book website
The full model (FM) given by Eq. 5.67a reduces to the following form since only one regressor is involved: y = a + b1 x1 + b2 (I x1), where the variable I is an indicator variable such that it is 0 for group A and 1 for group B. The null hypothesis is H0: b2 = 0. The reduced model (RM) is y = a + b x1. It is identified using the entire data set without distinguishing between the building types.

The estimated model: y = 14.2762 + 0.14115 x1 − 13.2802 (I x1)    (5.69)
while the RM model: y = 5.7768 + 0.1491 x1

The analysis of variance results in SSE(FM) = 7.7943 and SSE(RM) = 889.245. The F-statistic in this case is:

F = [(889.245 − 7.7943)/1] / [7.7943/(20 − 3)] = 1922.5
One can thus safely reject the null hypothesis and state with confidence that buildings built as energy-efficient ones consume energy that is statistically lower than that of regular buildings. ■
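Eq. 5.68 is straightforward to compute; the sketch below reuses the SSE values from this example (the function name is illustrative):

```python
def nested_f_statistic(sse_rm, sse_fm, n, p, m):
    """F-statistic for comparing a reduced model (m parameters) against a
    full model (p parameters) fitted to n observations (Eq. 5.68)."""
    return ((sse_rm - sse_fm) / (p - m)) / (sse_fm / (n - p))

# Values from Example 5.7.2: n = 20 buildings, FM has p = 3, RM has m = 2
F = nested_f_statistic(889.245, 7.7943, n=20, p=3, m=2)
```

The computed value is then compared against the tabulated F with (p − m) and (n − p) degrees of freedom.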
5.8 Resampling Methods Applied to Regression

5.8.1 Basic Approach
The fundamental tasks in regression involve model selection, that is, identifying a suitable model and estimating the values and the uncertainty intervals of the model parameters, and model assessment, that is, deducing the predictive accuracy of the model for subsequent use. The classical OLS equations presented in Sects. 5.3 and 5.4 can be used for this purpose in conjunction with model residual analysis along with the Durbin-Watson statistic (Sect. 5.6.1) to guard against the dangers of model underfitting and overfitting. However, it should be recognized that the data set used is simply a sample of a much larger set of population data characterizing the behavior of the stochastic system under study. In that sense, the model selection and evaluation results are somewhat limited because they do not make full use of the variability inherent in samples drawn from a population. These limitations can be overcome by resampling methods (see Sect. 4.8), which offer distinct advantages in terms of better accuracy, robustness, versatility, and intuitive appeal in the context of regression modeling as well. They are widely regarded as being able to better perform the fundamental tasks involved in regression and, as a result, are becoming increasingly popular.

Recall from Sect. 4.8.3 that the basic rationale behind resampling methods is to draw one single sample or experimental data set, treat this original sample as a surrogate for the population, and generate numerous subsamples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data-generating mechanism, to produce new samples from which the required estimates can be deduced numerically. Note, however, that the resampling methods cannot overcome some of the limitations inherent in the original sample. The resampling samples are to the sample what the sample is to the population. Hence, if the sample does not adequately cover the spatial range or if the sample is not truly random, then the resampling results will be inaccurate as well.
5.8.2 Jackknife and k-Fold Cross-Validation
In most practical situations, it is misleading to use the entire data set to identify the regression model and report the resulting RMSE as the predictive accuracy of the model. This is due to the possibility of model overﬁtting and associated underestimation of the predictive RMSE error.
Overfitting refers to the situation in which the regression model is able to capture the training set with high accuracy but is poor at predicting new data. In other words, the model is overtrained on features specific to the training set, which may differ for new data. A better but somewhat empirical approach is to randomly partition the data set into two samples (say, in the proportion of 80/20), use the 80% portion of the data to train or develop the model, calculate the internal predictive error (say, the RMSE following Eq. 5.8), use the 20% portion of the data as the validation data set to predict or test the y values using the already identified model, and finally calculate the test or external or simulation error magnitude. The competing models can then be compared, and a selection made based on both the internal and external predictive errors pertinent to the training and testing data sets, respectively. The test error indices will generally be greater than the training errors; larger discrepancies are suggestive of greater overfitting, and vice versa. This general approach is the basis of the two most common resampling methods discussed below.

The jackknife method and its more recent version, the cross-validation method, were described in Sect. 4.8.3. The latter method of model evaluation, also referred to as holdout sample validation, can avoid model overfitting. With the advent of powerful computers, a variant, namely the "k-fold cross-validation" method, has become popular. It involves:

(i) Dividing the random sample of n observations into k groups of equal size.
(ii) Omitting one group at a time and performing the regression with the other (k − 1) groups.
(iii) Determining and saving the internal or modeling errors as well as the external or simulation or predictive errors (say, in terms of the RMSE values) of both sets of subsamples.
(iv) Using the saved parameter values to deduce the mean and uncertainty intervals for the model parameters, and computing the mean and confidence levels of the modeling and simulation prediction errors.

The model parameters and the test values determined in the last step are likely to be less biased, much more robust, and more representative of the actual model behavior than those obtained using classical methods.
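The steps above can be sketched for a straight-line model as follows (illustrative numpy code; the seed, fold assignment, and summary statistics are choices of this sketch, not the text's):

```python
import numpy as np

def kfold_cv_line(x, y, k=5, seed=0):
    """k-fold cross-validation of y = b0 + b1*x: returns the mean of the
    per-fold coefficients and the mean training and test RMSEs."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)                 # (i) k groups
    coefs, train_rmse, test_rmse = [], [], []
    for i in range(k):                             # (ii) omit one group at a time
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        X = np.column_stack([np.ones(train.size), x[train]])
        b, *_ = np.linalg.lstsq(X, y[train], rcond=None)
        coefs.append(b)
        # (iii) internal (training) and external (test) RMSEs
        train_rmse.append(np.sqrt(np.mean((y[train] - X @ b) ** 2)))
        yhat = b[0] + b[1] * x[test]
        test_rmse.append(np.sqrt(np.mean((y[test] - yhat) ** 2)))
    # (iv) ensemble summaries of parameters and prediction errors
    return np.mean(coefs, axis=0), np.mean(train_rmse), np.mean(test_rmse)

# Synthetic linear data for illustration
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 30)
coefs, train_rmse, test_rmse = kfold_cv_line(x, y, k=5)
```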
Note, however, that though the same equations are used to compute the RMSE indices, the degrees of freedom (d.f.) are different. Let n be the total number of data points. Then d.f. = [(k − 1)(n − p)]/k while computing the internal errors for model building or selection, and d.f. = n/k while computing the RMSE of the external predictive errors. There is a trade-off between high bias error and high variance in the choice of the number k. It is recommended that k be selected as either k = 5 or k = 10 (James et al. 2013), colloquially referred to as the "magic" numbers of folds.

Example 5.8.1 k-fold cross-validation

Consider Example 5.3.1, which involved fitting a simple OLS regression with 33 data observations of solids reduction (x-variable) and oxygen demand (y-variable). The regression analysis will be redone to illustrate the insights provided by k-fold cross-validation; k = 3 has been assumed in this simplified illustration. The data set has first been randomized to remove the monotonic increase in the regressor variable and broken up into 3 subsets of 11 observations each. Three data samples with different combinations of the subsets involving two subsets (i.e., 22 data points) can now be created. The analysis is performed with the 22-point sets for training, that is, to identify an OLS regression model, and the remaining subset of 11 data points is used for testing, that is, to compute the simulation or external prediction error. The results are summarized in Table 5.12. One notes that the model parameters vary from one run to another, indicating their random variability as different samples of data are selected. If the k-fold analysis were done with a greater number of folds (say, k = 5), one could have deduced the variance of these estimates, which would most likely be less biased and more accurate than those determined from classical methods.
The final model determined from k-fold regression analysis is the same as the one using all data points, while the estimate of the predictive error (as indicated by the test RMSE = 3.447) is the average of the RMSE values from the three samples. Thus, the extra effort of creating k-fold samples, identifying regression models, and calculating the prediction errors was simply to get a better estimate of the prediction RMSE error. It is a more
Table 5.12 Summary of the OLS regression analysis results using threefold cross-validation on the data from Example 5.3.1

Data                        OLS model              R2     Training RMSE   Test RMSE
All data (Example 5.3.1)    y = 3.830 + 0.9036 x   0.913  3.229           –
Threefold sample 1          y = 1.797 + 0.9734 x   0.933  3.176           3.628
Threefold sample 2          y = 3.962 + 0.8941 x   0.949  3.442           2.826
Threefold sample 3          y = 6.079 + 0.8339 x   0.911  2.980           3.886
Final model                 y = 3.830 + 0.9036 x   0.913  3.229           3.447
representative value which, as stated earlier, is usually greater than the training or internal RMSE.
5.8.3 Bootstrap Method

Recall that in Sect. 4.8.3, the use of the bootstrap method (one of the most powerful and popular methods currently in use) was illustrated to infer the variance and confidence intervals of parametric statistical measures in a univariate context, and also in a situation involving a nonparametric approach, where the correlation coefficient between two variables was to be deduced. Bootstrap is a statistical method in which random resampling with replacement is done repeatedly from an original or initial sample, and each bootstrapped sample is then used to compute a statistic (such as the mean, median, or interquartile range). The resulting empirical distribution of the statistic is examined and interpreted as an approximation to the true sampling distribution. Thus, bootstrapping is an ensemble training method, which "bags" the results from numerous data sets into an ensemble average. It is often used as a robust nonparametric alternative for inference-type problems when parametric assumptions are in doubt (e.g., knowledge of the probability distribution of the errors), or where parametric inference is impossible or requires very complicated formulas for the calculation of variance.

Say one has a data set of multivariate observations zi = {yi, x1i, x2i, . . .} with i = 1, . . ., n (this can be viewed, in the bootstrap context, as a sample of n observations taken from a population of possible observations). One distinguishes between two approaches:

(i) Case resampling, where the predictor and response observations i are random and change from sample to sample. One selects a certain number of bootstrap subsamples (say 1000) from zi, fits the model, and saves the model coefficients from each bootstrap sample. The generation of the confidence intervals for the regression coefficients is now similar to the univariate situation and is quite straightforward (Sects. 5.3.3 and 5.3.4). One of the benefits is that the correlation structure between the regressors is maintained.

(ii) Model-based resampling or fixed-X resampling, where the regressor data structure is already imposed or known with confidence. Here, the basic idea is to generate or resample the model residuals and not the observations themselves. This preserves the stochastic nature of the model structure, and so the variance is better representative of the model's own assumptions. The implementation involves attaching a random error to each yi, thereby producing a fixed-X bootstrap sample. The errors could be generated: (i) parametrically, from a normal distribution with zero mean and variance equal to the estimated error variance of the regression if normal errors can be assumed (this is analogous to the concept behind the Monte Carlo approach), or (ii) nonparametrically, by resampling residuals from the original regression. One would then regress the bootstrapped values of the response variable on the fixed X matrix to obtain bootstrap replications of the regression coefficients. This approach is often adopted with data from designed experiments.
The reader can refer to Efron and Tibshirani (1985), Davison and Hinkley (1997) and other more advanced papers such as Freedman and Peters (1984) for a more complete treatment.
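A minimal sketch of the case-resampling approach for a simple linear regression follows. The data are hypothetical, and percentile intervals are used here as one common way of forming the bootstrap confidence intervals.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y depends linearly on x with additive noise
n = 40
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

B = 1000
coefs = np.empty((B, 2))            # columns: intercept, slope
for b in range(B):
    i = rng.integers(0, n, n)       # case resampling with replacement
    slope, intercept = np.polyfit(x[i], y[i], 1)
    coefs[b] = (intercept, slope)

# Percentile-bootstrap 95% confidence intervals for each coefficient
ci_int = np.percentile(coefs[:, 0], [2.5, 97.5])
ci_slope = np.percentile(coefs[:, 1], [2.5, 97.5])
print("95% CI intercept:", ci_int)
print("95% CI slope:", ci_slope)
```

Because entire (x, y) cases are resampled together, the correlation structure among the regressors would be preserved in the multivariate case, as noted in the text.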
5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance
The objective of this analysis is to verify the claim of a company that had developed a refrigerant additive to improve chiller COP. The performance of a chiller before (called the pre-retrofit period) and after (called the post-retrofit period) addition of this additive was monitored for several months to determine whether the additive results in an improvement in chiller performance, and if so, by how much. The same four variables described in Example 5.4.3, namely two temperatures (Tcho and Tcdi), the chiller thermal cooling load (Qch), and the electrical power consumed (Pcomp), were measured at 15-min intervals. Note that the chiller COP can be deduced from the last two variables. Altogether, there were 4607 and 5078 data points for the pre- and post-periods, respectively.

Step 1: Perform Exploratory Data Analysis
At the outset, an exploratory data analysis should be performed to determine the spread of the variables and their occurrence frequencies during the pre- and post-periods, that is, before and after the addition of the refrigerant additive. Further, it is important to ascertain whether the operating conditions during both periods are similar. The eight frames of Fig. 5.32 summarize the spread and frequency of the important variables. The chiller outlet water temperature ranges are similar during both periods. However, the condenser water temperature and the chiller cooling load show much larger variability during the post-period. Finally, the cooling load and power consumed are noticeably lower during the post-period. The histogram of Fig. 5.33 suggests
15. The monitored data were provided by Ken Gillespie, for which we are grateful.
Fig. 5.32 Histograms depicting the range of variation and frequency of the four important variables (Tcho, Tcdi, Qch, and Pcomp) before and after the retrofit (pre = 4607 data points, post = 5078 data points). The condenser water temperature and the chiller cooling load show much larger variability during the post-period. The cooling load and power consumed are noticeably lower during the post-period.
that COPpost > COPpre. An ANOVA test with results shown in Table 5.13 and Fig. 5.34 also indicates that the mean post-retrofit power use is statistically different at the 95% CL from the pre-retrofit power use.

t-test to compare means:
Null hypothesis: mean(COPpost) = mean(COPpre)
Alternative hypothesis: mean(COPpost) ≠ mean(COPpre)
Assuming equal variances: t = 38.8828, p-value = 0.0

The null hypothesis is rejected at α = 0.05. Of particular interest is the CI for the difference between the means, which extends from 0.678 to 0.750. Since this interval does not contain the value 0.0, there is a statistically significant difference between the means of the two samples at the 95.0% CL. However, it would be incorrect to infer that COPpost > COPpre since the operating conditions are different, and thus one should not use the t-test to draw any conclusions. Hence, a regression model-based approach is warranted.

Table 5.13 Results of the ANOVA test of comparison of means at a significance level of 0.05
95.0% CI for mean of COPpost: 8.573 ± 0.03142 = [8.542, 8.605]
95.0% CI for mean of COPpre: 7.859 ± 0.01512 = [7.844, 7.874]
95.0% CI for the difference between the means assuming equal variances: 0.714 ± 0.03599 = [0.678, 0.750]

Fig. 5.33 Histogram plots of the coefficient of performance (COP) of the chiller before and after the retrofit. Clearly, there are several instances when COPpost > COPpre, but that could be due to operating conditions. Hence, a regression modeling approach is clearly warranted.

Step 2: Use the Entire Pre-retrofit Data to Identify a Model
The GN chiller models (Gordon and Ng 2000) are described in Sect. 10.2.3. The monitored data are first used to compute the transformed variables (temperatures in Kelvin, cooling load and power in kW) of the model given by Eq. 10.14b. Then, a linear regression is performed using Eq. 10.14a, given below with the standard errors of the coefficients shown within parentheses:

y = −0.00187 x1 + 261.2885 x2 + 0.022461 x3    (5.70)
     (0.00163)     (15.925)      (0.000111)

with adjusted R2 = 0.998.

This model is then retransformed into a model for power using Eq. 10.15, and the error statistics using the pre-retrofit data are found to be: RMSE = 9.36 kW and CV = 2.24%. Figure 5.35 shows the x–y plot from which one can visually evaluate the goodness of fit of the model. Note that the mean power use is 418.7 kW while the mean of the model residuals is 0.017 kW (very close to zero, as it should be; this step validates that the spreadsheet cells have been coded correctly with the right formulas).

Step 3: Calculate Savings in Electrical Power
The above chiller model, representative of the thermal performance of the chiller without refrigerant additive, is used to estimate savings by first predicting power consumption for each 15-min interval using the two operating temperatures and the load corresponding to the 5078 post-retrofit data points. Subsequently, savings in chiller power are deduced for each of the 5078 data points:

Power savings = Model predicted(pre-retrofit) − Measured(post-retrofit)    (5.71)

It is found that the mean power savings are −21.0 kW (i.e., an increase in power use) against the measured mean power use of 287.5 kW. Overlooking the few outliers, one can detect two distinct patterns in the x–y plot of Fig. 5.36: (i) the lower range of data (chiller power < 300 kW or so), where the differences between model predictions and post-retrofit measurements are minor (or nil), and (ii) the higher range of data, for which post-retrofit electricity power usage was
higher than that of the model identified from pre-retrofit data. This is the cause of the negative power savings determined above. The reason for the onset of two distinct patterns in operation is worthy of subsequent investigation.
Fig. 5.34 ANOVA test results in the form of box-and-whisker plots for chiller COP before and after addition of the refrigerant additive.

Step 4: Calculate Uncertainty in Savings and Draw Conclusions
The uncertainty arises from two sources: prediction model errors and power measurement errors. The latter are usually small, about 0.1% of the reading, which in this particular case is less than 1 kW. Hence, this contribution can be neglected during an initial investigation such as this one. The model uncertainty is given by:
absolute uncertainty in power use savings or reduction = (t-value × RMSE)    (5.72)

The t-value at the 90% CL is 1.65 and the RMSE of the model (for the pre-retrofit period) is 9.36 kW. Hence, the calculated increase in power due to the refrigerant additive is −21.0 kW ± 15.44 kW at the 90% CL. Thus, one would conclude that the refrigerant additive is actually penalizing chiller performance by 7.88%, since electric power use has increased.
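The Step 4 arithmetic can be verified directly. The numbers are taken from the case study; the variable names are ours.

```python
# Reproducing the Step 4 arithmetic of Eq. 5.72 (values from the case study)
t_value = 1.65           # t-value at the 90% CL (large sample)
rmse = 9.36              # model RMSE for the pre-retrofit period, kW
mean_savings = -21.0     # mean power "savings" (negative = increase), kW
mean_power_post = 287.5  # measured mean post-retrofit power use, kW

uncertainty = t_value * rmse                       # Eq. 5.72
predicted_pre = mean_power_post + mean_savings     # mean model-predicted power, kW
pct_increase = -100 * mean_savings / predicted_pre # increase relative to prediction

print(f"change in power = {mean_savings} +/- {uncertainty:.2f} kW at 90% CL")
print(f"performance penalty = {pct_increase:.2f}%")
```

Note that the 7.88% figure is the 21.0 kW increase expressed relative to the model-predicted mean power (266.5 kW), not the measured post-retrofit mean.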
Fig. 5.35 Measured vs. modeled plot of chiller power during the pre-retrofit period. The overall fit is excellent (RMSE = 9.36 kW and CV = 2.24%), and except for a few data points, the data seem well behaved. Total number of data points = 4607.

Fig. 5.36 Difference between post-period measured and pre-retrofit model-predicted chiller power, indicating that post-retrofit values are higher than those during the pre-retrofit period (mean increase = 21 kW or 7.88%). One can clearly distinguish two operating patterns in the data, suggesting some intrinsic behavioral change in chiller operation. The entire post-period data set of 5078 observations has been used in this analysis.

5.10 Parting Comments on Regression Analysis and OLS
Recall that OLS regression is an important subclass of regression analysis methods. The approach adopted in OLS was to minimize an objective function (also referred to as the loss function) expressed as the sum of the squared residuals (given by Eq. 5.3). One was able to derive closed-form solutions for the model parameters and their variance under certain simplifying assumptions as to how noise corrupts measured system performance. Such closed-form solutions cannot be obtained for many situations where the function to be minimized has to be framed differently, and these require the adoption of search methods. Thus, parameter estimation problems are, in essence, optimization problems where the objective function is framed in accordance with what one knows about the errors in the measurements and in the model structure. OLS yields the best linear unbiased estimates provided the conditions (called the Gauss-Markov conditions) stated in Sect. 5.5.1 are met. These conditions, along with the stipulation that an additive linear relationship exists between the response and regressor variables, can be summarized as:
• Data are random and normally distributed, with the expected value of the model residuals/errors equal to zero: E{εi} = 0, i = 1, . . ., N.
• Model residuals {ε1, . . ., εN} and regressors {x1, . . ., xN} are independent.
• Model residuals are uncorrelated: cov{εi, εj} = 0, i, j = 1, . . ., N, i ≠ j.
• Residuals have constant variance: var{εi} = σ2, i = 1, . . ., N    (5.73)
Further, OLS applies when the measurement errors in the regressors are small compared to that of the response and when the response variable is normally distributed. These conditions are often not met in practice. This has led to the development of a uniﬁed approach called generalized linear models (GLM) which is treated in Sect. 9.4.
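Simple numerical diagnostics can flag violations of these conditions before one trusts OLS inference. The sketch below (with synthetic, deliberately well-behaved data) checks the residual mean, a Durbin-Watson statistic for serial correlation, and a crude constant-variance comparison between the two halves of the x-range.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 50)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                           # model residuals

mean_resid = e.mean()                           # ~0 by construction for OLS
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # Durbin-Watson: ~2 if uncorrelated

# Crude constant-variance check: residual spread in the two halves of the x-range
s_lo, s_hi = e[:25].std(), e[25:].std()
print(f"mean residual {mean_resid:.2e}, DW {dw:.2f}, spread ratio {s_hi / s_lo:.2f}")
```

A DW value far from 2 or a spread ratio far from 1 would suggest serially correlated residuals or heteroscedasticity, respectively, and hence a violation of the conditions above.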
Problems

Pr. 5.1 Table 5.14 lists various properties of saturated water in the temperature range 0–100 °C.
(a) Investigate first-order and second-order polynomials that fit saturated vapor enthalpy to temperature in °C. Identify the better model by looking at the R2, RMSE, and CV values for both models. Predict the value of saturated vapor enthalpy at 30 °C along with the 95% CI and 95% prediction intervals.
(b) Repeat the above analysis for specific volume, but investigate third-order polynomial fits as well. Predict the value of specific volume at 30 °C along with the 95% CI and 95% prediction intervals.
(c) Calculate the skill factors for the second- and third-order models with the first-order model as the baseline.

Pr. 5.2 Regression of home size versus monthly energy use
It is natural to expect that monthly energy use in a home increases with the size of the home. Table 5.15 assembles 10 data points of home size (in square feet) versus energy use (kWh/month).
You will analyze this data as follows:
(a) Plot the data as is and visually determine linear or polynomial trends in the data. Perform the regression and report the model goodness-of-fit and model parameter values using (i) all 10 points, and (ii) a bootstrap analysis with 10 samples. Compare the results of both methods and draw pertinent conclusions.
(b) Repeat the above analysis taking the logarithm of (home area) versus energy use.
(c) Which of these two models would you recommend for future use? Provide justification.

Pr. 5.3 Tensile tests on a steel specimen yielded the results shown in Table 5.16.
(a) Assuming the regression of y on x to be linear, estimate the parameters of the regression line and determine the 95% CI for x = 4.5.
(b) Now regress x on y and estimate the parameters of the regression line. For the same value of y predicted in (a) above, determine the value of x. Compare this value with the value of 4.5 assumed in (a). If different, discuss why.
(c) Compare the R2 and CV values of both models. Discuss the differences together with the results of part (b).
(d) Plot the residuals of both models and identify the one preferable for OLS.
(e) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of (a) and (b) above.

Pr. 5.4 The yield of a chemical process was measured at three temperatures (in °C), each with two concentrations of a particular reactant, as recorded in Table 5.17.
(a) Use OLS to find the best values of the coefficients a, b, and c assuming the equation: y = a + b·t + c·x.
(b) Calculate the R2, RMSE, and CV of the overall model as well as the SE of the parameters.
(c) Using the beta-coefficient concept described in Sect. 5.4.5, determine the relative importance of the two independent variables on the yield.
Table 5.14 Data table for Problem 5.1

Temperature t (°C):            0      10     20     30     40     50     60     70     80     90     100
Specific volume v (m3/kg):     206.3  106.4  57.84  32.93  19.55  12.05  7.679  5.046  3.409  2.361  1.673
Sat. vapor enthalpy (kJ/kg):   2501.6 2519.9 2538.2 2556.4 2574.4 2592.2 2609.7 2626.9 2643.8 2660.1 2676
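Part (a) of Pr. 5.1 can be sketched with NumPy using the enthalpy data of Table 5.14. The RMSE here uses d.f. = n − p, where p is the number of model parameters; the comparison indices the problem asks for are printed.

```python
import numpy as np

# Saturated-vapor enthalpy data from Table 5.14
t = np.arange(0, 101, 10, dtype=float)
h = np.array([2501.6, 2519.9, 2538.2, 2556.4, 2574.4, 2592.2,
              2609.7, 2626.9, 2643.8, 2660.1, 2676.0])

for order in (1, 2):
    coeffs = np.polyfit(t, h, order)
    resid = h - np.polyval(coeffs, t)
    n, p = len(t), order + 1
    rmse = np.sqrt(np.sum(resid ** 2) / (n - p))   # RMSE with d.f. = n - p
    cv = 100 * rmse / h.mean()
    print(f"order {order}: RMSE = {rmse:.3f} kJ/kg, CV = {cv:.4f}%")

# Point prediction at 30 deg C with the second-order model
print(f"h(30 C) ~ {np.polyval(np.polyfit(t, h, 2), 30.0):.1f} kJ/kg")
```

The confidence and prediction intervals asked for in the problem require, in addition, the parameter covariance matrix; this sketch covers only the model comparison and the point prediction.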
Table 5.15 Data table for Problem 5.2

Home area (sq. ft):   1290 1350 1470 1600 1710 1840 1980 2230 2400 2930
Energy use (kWh/mo):  1182 1172 1264 1493 1571 1711 1804 1840 1956 1954

Table 5.16 Data table for Problem 5.3

Tensile force x:  1   2   3   4   5   6
Elongation y:     15  35  41  63  77  84

Table 5.17 Data table for Problem 5.4

Temperature, t:    40   40   50   50   60   60
Concentration, x:  0.2  0.4  0.2  0.4  0.2  0.4
Yield y:           38   42   41   46   46   49
Table 5.18 Data table for Problem 5.5a

LF:     85   80   70   74   67   87   78   73   72   69   82   89
CCoal:  15   17   27   23   20   29   25   14   26   29   24   23
CEle:   4.1  4.5  5.6  5.1  5.0  5.2  5.3  4.3  5.8  5.7  4.9  4.8

a Data available electronically on book website
Table 5.19 Data table of outlet water temperature Tco (°C) for Problem 5.6a

              Ambient wet-bulb temperature Twb (°C)
Range R (°C)  20     21.5   23     23.5   26
10            25.54  26.47  27.31  27.29  29.08
13            26.51  27.30  27.69  28.18  29.84
16            27.34  27.86  28.14  29.16  29.88
19            27.68  28.40  29.24  29.29  30.98
22            27.89  29.15  29.19  29.34  30.83

a Data available electronically on book website
Pr. 5.5 Cost of electric power generation versus load factor and cost of coal
The cost to an electric utility of producing power (CEle) in mills per kilowatt-hour ($10^-3/kWh) is a function of the load factor (LF) in % and the cost of coal (CCoal) in cents per million Btu. Relevant data are assembled in Table 5.18.
(a) Investigate different models (first-order and second-order, with and without interaction terms) and identify the best model for predicting CEle vs. LF and CCoal. Use stepwise regression if appropriate. (Hint: plot the data and look for trends first.)
(b) Perform residual analysis.
(c) Calculate the R2, RMSE, CV, and DW of the overall model as well as the SE of the parameters. Is DW relevant?
(d) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of the best model identified earlier.

Pr. 5.6 Modeling of cooling tower performance
Manufacturers of cooling towers often present catalog data showing outlet water temperature Tco as a function of ambient air wet-bulb temperature (Twb) and range R (the difference between inlet and outlet water temperatures). Table 5.19 assembles data for a specific cooling tower.
(a) Identify an appropriate model (investigate first-order linear and second-order polynomial models without and with interaction terms for Tco) by looking at the R2, RMSE, and CV values, the individual t-values of the parameters, as well as the behavior of the overall model residuals.
(b) Calculate the skill factor of the final model compared to the baseline model (assumed to be the first-order model without interaction effects).
(c) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(d) Summarize the additional insights which the k-fold analysis has provided.

Pr. 5.7 Steady-state performance testing of a solar thermal flat-plate collector
Solar thermal collectors are devices that convert the radiant energy from the sun into useful thermal energy that goes toward heating, say, water for domestic or industrial applications. Because of low collector time constants, heat capacity effects are usually small compared to the hourly time step used to drive the model. The steady-state useful energy qC delivered by a solar flat-plate collector of surface area AC is given by the Hottel-Whillier-Bliss equation (see any textbook on solar thermal collectors, e.g., Reddy 1987):

qC = AC FR [IT ηn − UL (TCi − Ta)]+    (5.74)
where FR is called the heat removal factor and is a measure of the solar collector performance as a heat exchanger (it can be interpreted as the ratio of actual heat transfer to the maximum possible heat transfer); ηn is the optical efficiency, or the product of the transmittance and absorptance of the cover and absorber of the collector at normal solar incidence; UL is the overall heat loss coefficient of the collector, which is dependent on collector design only; IT is the radiation intensity on the plane of the collector; TCi is the temperature of the fluid entering the collector; and Ta is the ambient temperature. The + sign denotes that only positive values are to be used, which physically implies that the collector should not be operated if qC is negative, that is, when the collector loses more heat than it can collect (which can happen under low radiation and high TCi conditions).
Steady-state collector testing is the best manner for a manufacturer to rate his product. From an overall heat balance on the collector fluid and from Eq. 5.74, the expressions for the instantaneous collector efficiency ηC under normal solar incidence are:

ηC = qC / (AC IT) = mC cpC (TCo − TCi) / (AC IT) = FR ηn − FR UL (TCi − Ta)/IT    (5.75a, b)

where mC is the total fluid flow rate through the collectors, cpC is the specific heat of the fluid flowing through the collector, and TCi and TCo are the inlet and exit temperatures of the fluid to the collector. Thus, measurements (done, of course, as per the standard protocol, ASHRAE 1978) of IT, TCi, and TCo are made under a prespecified and controlled value of fluid flow rate, from which ηC can be calculated using Eq. 5.75a. The test data are then plotted as ηC against the reduced temperature [(TCi − Ta)/IT], as shown in Fig. 5.37. A linear fit is made to these data points by regression using Eq. 5.75b, from which the values of FRηn and FRUL are deduced. If the same collector is tested on different days, slightly different numerical values are obtained for the two parameters FRηn and FRUL, which are often, but not always, within the uncertainty bands of the estimates. Model misspecification (i.e., the model is not perfect, which can occur, for example, if the collector heat losses are not strictly linear) is partly the cause of such variability. This is somewhat disconcerting to a manufacturer since it introduces ambiguity as to which values of the parameters to present in the product specification sheet. The data points of Fig. 5.37 are assembled in Table 5.20. Assume that water is the working fluid.
(a) Perform OLS regression using Eq. 5.75b and identify the two parameters FRηn and FRUL along with their variance. Plot the model residuals and study their behavior.
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of the best model identified earlier.
(c) Draw a straight line visually through the data points and determine the x-axis and y-axis intercepts. Estimate the FRηn and FRUL parameters and compare them with those determined from (a).
(d) Calculate the R2, RMSE, and CV values of the model.

Fig. 5.37 Test data points of thermal efficiency of a double-glazed flat-plate liquid collector versus reduced temperature. The regression line of the model given by Eq. 5.75 is also shown (From ASHRAE (1978) © American Society of Heating, Refrigerating and Air-conditioning Engineers, Inc., www.ashrae.org)
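The core of part (a) reduces to a straight-line fit of the Table 5.20 data. A minimal sketch follows: np.polyfit performs the OLS fit of Eq. 5.75b, with FRηn read off the intercept and FRUL off the slope magnitude. The units of FRUL assume x is in (m2 K)/W, an assumption since the table does not state units for x.

```python
import numpy as np

# Data from Table 5.20: reduced temperature x = (T_ci - T_a)/I_T and
# collector efficiency y, converted from % to a fraction
x = np.array([0.009, 0.011, 0.025, 0.025, 0.025, 0.025, 0.050,
              0.051, 0.052, 0.053, 0.056, 0.056, 0.061, 0.062,
              0.064, 0.065, 0.065, 0.069, 0.071, 0.071, 0.075,
              0.077, 0.080, 0.083, 0.086, 0.091, 0.094])
y = np.array([64, 65, 56, 56, 52.5, 49, 35, 30, 30, 31, 29, 29, 29,
              25, 27, 26, 24, 24, 23, 21, 20, 20, 16, 14, 14, 12, 10]) / 100.0

slope, intercept = np.polyfit(x, y, 1)   # OLS fit of Eq. 5.75b
FR_eta_n = intercept                     # y-intercept
FR_UL = -slope                           # magnitude of the slope, W/(m2 K)
print(f"FR.eta_n = {FR_eta_n:.3f}, FR.UL = {FR_UL:.2f} W/m2-K")
```

The parameter variances asked for in (a) would follow from the OLS covariance matrix (e.g., np.polyfit with cov=True); only the point estimates are sketched here.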
(e) Calculate the F-statistic to test for overall model significance.
(f) Perform t-tests on the individual model parameters.
(g) Use the model to predict collector efficiency when IT = 800 W/m2, TCi = 35 °C, and Ta = 10 °C.
(h) Determine the 95% CL intervals for the mean and individual responses for (g) above.
(i) The steady-state model of the solar thermal collector assumes the heat loss term, given by [UL(TCi − Ta)], to be linear in the temperature difference between the collector inlet temperature and the ambient temperature. One wishes to investigate whether the model improves if the loss term includes an additional second-order term:
(i) Derive the resulting expression for collector efficiency analogous to Eq. 5.75b. (Hint: start with the fundamental heat balance equation, Eq. 5.74.)
(ii) Does the data justify the use of such a model?
Table 5.20 Data table for Problem 5.7a

x:      0.009  0.011  0.025  0.025  0.025  0.025  0.050  0.051  0.052  0.053  0.056  0.056  0.061  0.062
y (%):  64     65     56     56     52.5   49     35     30     30     31     29     29     29     25

x:      0.064  0.065  0.065  0.069  0.071  0.071  0.075  0.077  0.080  0.083  0.086  0.091  0.094
y (%):  27     26     24     24     23     21     20     20     16     14     14     12     10

a Data available electronically on book website
Pr. 5.8 Dimensionless model for fans or pumps
The performance of a fan or pump is characterized in terms of the head or pressure rise across the device and the flow rate for a given shaft power. The use of dimensionless variables simplifies and generalizes the model. Dimensional analysis (consistent with the fan affinity laws for changes in speed, diameter, and air density) suggests that the performance of a centrifugal fan can be expressed as a function of two dimensionless groups representing the pressure head and flow coefficient, respectively:

Ψ = SP / (D2 ω2 ρ)  and  Φ = Q / (D3 ω)    (5.76)

where SP is the static pressure, Pa; D is the diameter of the wheel, m; ω is the rotational speed, rad/s; ρ is the density, kg/m3; and Q is the volume flow rate of air, m3/s. For a fan operating at constant density, it should be possible to plot one curve of Ψ vs. Φ that represents the performance at all speeds and diameters for this generic class of pumps. The performance of a certain 0.3 m diameter fan is shown in Table 5.21.
(a) Convert the given data into the two dimensionless groups defined by Eq. 5.76.
(b) Next, plot the data and formulate two or three promising functions.
(c) Identify the best function by looking at the R2, RMSE, CV, and DW values and at the residual behavior.
(d) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(e) Summarize the additional insights which the k-fold analysis has provided.
Table 5.21 Data table for Problem 5.8a

Rotation ω (rad/s):       157   157   157   157   157   157   126   126   126   126   126   126
Flow rate Q (m3/s):       1.42  1.89  2.36  2.83  3.02  3.30  1.42  1.79  2.17  2.36  2.60  3.30
Static pressure SP (Pa):  861   861   796   694   635   525   548   530   473   428   351   114

Rotation ω (rad/s):       94    94    94    94    94    63    63    63    63
Flow rate Q (m3/s):       0.94  1.27  1.89  2.22  2.36  0.80  1.04  1.42  1.51
Static pressure SP (Pa):  304   299   219   134   100   134   122   70    55

a Data available electronically on book website
16. From Stoecker (1989), with permission from McGraw-Hill.
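Part (a) of Pr. 5.8 is direct arithmetic. The sketch below evaluates Eq. 5.76 for the first row of Table 5.21, using D = 0.3 m and the STP air density of 1.204 kg/m3 stated with the problem; the helper function name is ours.

```python
# Dimensionless groups of Eq. 5.76 for the first data row of Table 5.21
D = 0.3      # wheel diameter, m
rho = 1.204  # air density at STP, kg/m3

def groups(SP, Q, omega):
    """Return (pressure-head coefficient, flow coefficient) per Eq. 5.76."""
    psi = SP / (D ** 2 * omega ** 2 * rho)
    phi = Q / (D ** 3 * omega)
    return psi, phi

psi, phi = groups(SP=861.0, Q=1.42, omega=157.0)
print(f"Psi = {psi:.3f}, Phi = {phi:.3f}")
```

Applying the same function to all 21 rows collapses the four constant-speed curves onto the single Ψ vs. Φ curve the problem asks for.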
Table 5.22 Data table for Problem 5.10a

kT:      0.1    0.15   0.2    0.25   0.3    0.35   0.4    0.45
(Id/I):  0.991  0.987  0.982  0.978  0.947  0.903  0.839  0.756

kT:      0.5    0.55   0.6    0.65   0.7    0.75   0.8    0.85   0.9
(Id/I):  0.658  0.55   0.439  0.333  0.244  0.183  0.164  0.166  0.165

a Data available electronically on book website
Table 5.23 Data table for Problem 5.11

Balance point temp. (°C):  25    20    15    10    5    0    −5
VBDD (°C-Days):            4750  3900  2000  1100  500  100  0
Note for Pr. 5.8: assume the density of air at STP conditions to be 1.204 kg/m3.
Hint for Pr. 5.10: make sure that the function is continuous at the hinge points.
Pr. 5.9 Consider the data used in Example 5.6.3, meant to illustrate the use of weighted regression for replicate measurements with nonconstant variance. For the same data set, identify a model using the logarithmic transform approach similar to that shown in Example 5.6.2.
Pr. 5.11 Modeling variable-base degree-days with balance point temperature at a specific location
Degree-day methods provide a simple means of determining annual energy use in envelope-dominated buildings operated constantly and with simple HVAC systems, which can be characterized by a constant efficiency. Such simple single-measure methods capture the severity of the climate in a particular location. The variable-base degree-day (VBDD) method is conceptually similar to the simple degree-day method but is an improvement since it is based on the actual balance point of the house instead of the outdated default value of 65 °F or 18.3 °C (Reddy et al. 2016). Table 5.23 assembles the VBDD values for New York City, NY, computed from actual climatic data over several years at this location.
Pr. 5.10 Spline models for solar radiation
This problem involves using splines for functions with abrupt hinge points. Several studies have proposed correlations to predict different components of solar radiation from more routinely measured components. One such correlation relates the fraction of hourly diffuse radiation on a horizontal surface (Id) to the global radiation on a horizontal surface (I) through a quantity known as the hourly atmospheric clearness index (kT = I/I0), where I0 is the extraterrestrial hourly radiation on a horizontal surface at the same latitude, time, and day of the year (Reddy 1987). The latter is an astronomical quantity and can be predicted almost exactly. Data have been gathered (Table 5.22) from which a correlation (Id/I) = f(kT) needs to be identified.
(a) Plot the data and visually determine the likely locations of the hinge points. (Hint: there should be two points, one at either extreme.)
(b) Previous studies have suggested the following three functional forms: a constant model for the lower range, a second-order model for the middle range, and a constant model for the higher range. Evaluate with the data provided whether this functional form still holds, and report pertinent models and relevant goodness-of-fit indices.
(a) Identify a suitable regression curve for VBDD versus balance point temperature for this location and report all pertinent statistics (goodness-of-fit, model parameter estimates, and their CL).
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates and their CL with the results from (a).

Pr. 5.12 Consider Example 5.7.2, where two types of buildings were modeled following the full-model (FM) and reduced-model (RM) approaches using categorical variables. Only whether the model slope parameters of the two building types differed was evaluated there. Extend the analysis to test whether both the model slope and intercept parameters are affected by the type of building.
Table 5.24 Data table for Pr. 5.13a

Year  Month  E (W/ft2)  To (°F)  foc
94    Aug    1.006      78.233   0.41
94    Sep    1.123      73.686   0.68
94    Oct    0.987      66.784   0.67
94    Nov    0.962      61.037   0.65
94    Dec    0.751      52.475   0.42
95    Jan    0.921      49.373   0.65
95    Feb    0.947      53.764   0.68
95    Mar    0.876      59.197   0.58
95    Apr    0.918      65.711   0.66
95    May    1.123      73.891   0.65
95    Jun    0.539      77.840   0
95    Jul    0.869      81.742   0
95    Aug    1.351      81.766   0.39
95    Sep    1.337      76.341   0.71
95    Oct    0.987      65.805   0.68
95    Nov    0.938      56.714   0.66
95    Dec    0.751      52.839   0.41
96    Jan    0.921      49.270   0.65
96    Feb    0.947      55.873   0.66
96    Mar    0.873      55.200   0.57
96    Apr    0.993      66.221   0.65
96    May    1.427      78.719   0.64
96    Jun    0.567      78.382   0.1
96    Jul    1.005      82.992   0.2

a Data available electronically on book website
Pr. 5.13 Change point models of utility bills in variable occupancy buildings
Example 5.7.1 illustrated the use of linear spline models to model monthly energy use in a commercial building versus outdoor dry-bulb temperature. Such models are useful for several purposes, one of which is energy conservation. For example, the energy manager may wish to track the extent to which energy use has been increasing over the years, or the effect of a recently implemented energy conservation measure (such as a new chiller). For such purposes, one would like to correct, or normalize, for any changes in weather, since an abnormally hot summer could obscure the beneficial effects of a more efficient chiller. Hence, factors that change over the months or years need to be considered explicitly in the model. Two common normalization factors are changes to the conditioned floor area (e.g., an extension to an existing building wing) and changes in the number of students in a school. A model regressing monthly utility energy use against outdoor temperature is appropriate for buildings with constant occupancy (such as residences) or even offices. However, buildings such as schools are practically closed during summer, and hence the occupancy rate needs to be included as a second regressor. The functional form of the model, in such cases, is a multivariate change point model given by:

y = β0,un + β0 foc + β1,un x + β1 foc x + β2,un (x − xc) I + β2 foc (x − xc) I    (5.77)
where x is the monthly mean outdoor temperature (To) and y is the electricity use per square foot of the school (E). Also, foc = (Noc/Ntotal) is the ratio of the number of days in the month when the school is in session (Noc) to the total number of days in that particular month (Ntotal). The factor foc can be determined from the school calendar. Clearly, the unoccupied fraction fun = (1 − foc).
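Since xc is the only quantity that makes the model nonlinear, Eq. 5.77 can be fit by ordinary least squares for any fixed trial value of xc; repeating this over a grid of candidate change points and retaining the fit with minimum RMSE recovers the change point. The sketch below uses synthetic data (not the Table 5.24 values) and illustrative coefficient values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(45.0, 85.0, n)        # monthly mean outdoor temperature To (°F)
f_oc = rng.uniform(0.0, 0.7, n)       # occupied fraction of the month
xc_true = 62.0
I = (x > xc_true).astype(float)       # indicator variable (per Eq. 5.71b)
y = (0.5 + 0.4*f_oc + 0.002*x + 0.004*f_oc*x
     + 0.03*(x - xc_true)*I + 0.02*f_oc*(x - xc_true)*I
     + rng.normal(0.0, 0.02, n))      # synthetic E (W/ft2)

def rmse_for(xc):
    """OLS fit of the Eq. 5.77 form for a fixed trial change point xc."""
    I = (x > xc).astype(float)
    X = np.column_stack([np.ones(n), f_oc, x, f_oc*x,
                         (x - xc)*I, f_oc*(x - xc)*I])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((y - X @ beta)**2))

grid = np.arange(50.0, 75.0, 0.5)     # candidate change-point temperatures
best_xc = grid[int(np.argmin([rmse_for(xc) for xc in grid]))]
print("best change point (°F):", best_xc)
```

The parsimonious model would then be obtained by refitting after dropping terms whose estimated coefficients are statistically insignificant.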
The term I represents an indicator variable whose numerical value is given by Eq. 5.71b. Note that the change point temperatures for occupied and unoccupied periods are assumed to be identical since monthly data do not allow this separation to be identified. Consider the monthly data assembled for an actual school (shown in Table 5.24).

(a) Plot the data and look for change points. Note that the model given by Eq. 5.77 has 7 parameters, of which xc (the change point temperature) is the one that makes the estimation nonlinear. By inspection of the scatter plot, assume a reasonable value for this variable and proceed to perform a linear regression as illustrated in Example 5.7.1. The search for the best value of xc (the one with minimum RMSE) would require several OLS regressions assuming different values of the change point temperature.
(b) Identify the parsimonious model and estimate its parameters. Note that of the six parameters appearing in Eq. 5.77, some may be statistically insignificant, and appropriate care should be exercised in this regard. Report appropriate model and parameter statistics.
(c) Perform a residual analysis and discuss the results.
(d) Repeat the analysis using k-fold cross-validation with k = 4 and compare the model parameter estimates with those of (b) above.

Pr. 5.14 Determining energy savings from monitoring and verification (M&V) projects

A crucial element in any energy conservation program is the ability to verify savings from measured energy use data; this is referred to as monitoring and verification (M&V). Energy service companies (ESCOs) are required to perform this as part of their services. Figure 5.38 depicts how energy savings are estimated. A common M&V protocol involves measuring
Fig. 5.38 Schematic representation of energy use prior to and after installing energy conservation measures (ECM) and of the resulting energy savings
the monthly total energy use at the facility for the whole year before the retrofit (the baseline or "pre-retrofit" period) and for a whole year after the retrofit (the "post-retrofit" period). The time taken to implement the energy-saving measures (the "construction period") is neglected in this simple example. One first identifies a baseline regression model of energy use against ambient dry-bulb temperature To during the pre-retrofit period, Epre = f(To). This model is then used to predict energy use during each month of the post-retrofit period using the corresponding ambient temperature values. The difference between model-predicted and measured monthly energy use is the energy savings during that month.
Energy savings = Model-predicted pre-retrofit use − Measured post-retrofit use    (5.78)

The determination of the annual savings resulting from the energy retrofit and its uncertainty are finally determined. It is very important that the uncertainty associated with the savings estimates be determined as well for meaningful conclusions to be reached regarding the impact of the retrofit on energy use. You are given monthly data of outdoor dry-bulb temperature (To) and area-normalized whole building electricity use (WBe) for two years (Table 5.25). The first year is the pre-retrofit period before a new energy management and control system (EMCS) for the building is installed, and the second is the post-retrofit period. The construction period, that is, the period it takes to implement the conservation measures, is taken to be negligible.

Table 5.25 Data table for Problem 5.14 (a)

           Pre-retrofit period                 Post-retrofit period
Month      To (°F)   WBe (W/ft2)    Month      To (°F)   WBe (W/ft2)
1994 Jul   84.04     3.289          1995 Jul   83.63     2.362
Aug        81.26     2.827          Aug        83.69     2.732
Sep        77.98     2.675          Sep        80.99     2.695
Oct        71.94     1.908          Oct        72.04     1.524
Nov        66.80     1.514          Nov        62.75     1.109
Dec        58.68     1.073          Dec        57.81     0.937
1995 Jan   56.57     1.237          1996 Jan   54.32     1.015
Feb        60.35     1.253          Feb        59.53     1.119
Mar        62.70     1.318          Mar        58.70     1.016
Apr        69.29     1.584          Apr        68.28     1.364
May        77.14     2.474          May        78.12     2.208
Jun        80.54     2.356          Jun        80.91     2.070

a Data available electronically on book website
(a) Plot time series and x-y plots and see whether you can visually distinguish the change in energy use as a result of installing the EMCS (similar to Fig. 5.38).
(b) Evaluate at least two different models for the pre-retrofit period (one of them being a model with indicator variables) and select the better model.
(c) Repeat the analysis using bootstrapping (10 samples are adequate) and compare the model parameter estimates with those of (a) and (b) above.
(d) Use this baseline model to determine month-by-month energy use during the post-retrofit period, representative of energy use had the conservation measure not been implemented.
(e) Determine the month-by-month as well as the annual energy savings (this is the "model-predicted pre-retrofit energy use" of Eq. 5.78).
(f) The ESCO which suggested and implemented the ECM claims a savings of 15%. You have been retained by the building owner as an independent M&V consultant to verify this claim. Prepare a short report describing your analysis methodology, results, and conclusions. (Note: you should also calculate the 90% uncertainty in the estimated savings assuming zero measurement uncertainty. Only the cumulative annual savings and their uncertainty are required, not month-by-month values.)
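The core arithmetic of Eq. 5.78 can be sketched as follows, using values rounded from Table 5.25. The 90% uncertainty shown is a deliberately simplified RMSE-based propagation (it ignores the covariance of the parameter estimates), not the full procedure expected in the problem:

```python
import numpy as np

# Monthly values rounded from Table 5.25 (Jul through Jun)
To_pre  = np.array([84.0, 81.3, 78.0, 71.9, 66.8, 58.7, 56.6, 60.4, 62.7, 69.3, 77.1, 80.5])
E_pre   = np.array([3.29, 2.83, 2.68, 1.91, 1.51, 1.07, 1.24, 1.25, 1.32, 1.58, 2.47, 2.36])
To_post = np.array([83.6, 83.7, 81.0, 72.0, 62.8, 57.8, 54.3, 59.5, 58.7, 68.3, 78.1, 80.9])
E_post  = np.array([2.36, 2.73, 2.70, 1.52, 1.11, 0.94, 1.02, 1.12, 1.02, 1.36, 2.21, 2.07])

# Baseline (pre-retrofit) OLS model: E = a + b * To
X = np.column_stack([np.ones_like(To_pre), To_pre])
beta, *_ = np.linalg.lstsq(X, E_pre, rcond=None)
resid = E_pre - X @ beta
n, p = len(E_pre), 2
rmse = np.sqrt(resid @ resid / (n - p))

# Eq. 5.78: model-predicted pre-retrofit use minus measured post-retrofit use
E_pred = beta[0] + beta[1] * To_post
monthly_savings = E_pred - E_post
annual_savings = monthly_savings.sum()

# Crude 90% uncertainty on the annual sum, treating monthly prediction errors
# as independent with standard error ~ RMSE (t ~ 1.8 for 10 d.o.f.)
u90 = 1.8 * rmse * np.sqrt(len(To_post))
print(f"annual savings ~ {annual_savings:.2f} W/ft2-months, +/- {u90:.2f} (90%)")
```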
6
Design of Physical and Simulation Experiments
Abstract
One of the objectives of performing engineering experiments is to assess performance/quality improvements (or system response variable) of a product under different changes/variations during the manufacturing process (called “treatments”). Experimental design is the term used to denote the series of planned experiments to be undertaken to compare the effect of one or more treatments or interventions on a response variable. The analysis of such data once collected entails methods which are a logical extension of the Student ttest and oneway ANOVA hypothesis tests meant to compare two or more population means of samples. Design of Experiments (DOE) is a broader term which includes deﬁning the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact manner in which samples for testing need to be selected, specifying the conditions and executing the test sequence where one variable is varied at a time, analyzing the data collected to verify (or refute) statistical hypotheses, and then drawing meaningful conclusions. Selected experimental design methods are discussed such as full and fractional factorial designs, and complete block and Latin squares designs. The parallel between model building in a DOE framework and linear multiple regression is illustrated. Also discussed are response surface modeling (RSM) designs, which are meant to accelerate the search toward optimizing a process or ﬁnding the proper product mix by simultaneously varying more than one continuous treatment variable. It is a sequential approach where one starts with test conditions in a plausible area of the search space, analyzes test results to determine the optimal direction to move, performs a second set of test conditions, and so on till the required optimum is reached. Central composite design (CCD) is often used for RSM situations for
continuous treatment variables since it allows fitting a second-order response surface with greater efficiency. Computer simulations are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of factors one can consider. There are parallels between the traditional physical DOE approach and designs based on computer simulations. The last section of this chapter discusses the similarities and the important considerations/differences between experimental design in both fields. It presents the various methods of sampling when the set of input design variables is very large (in some cases the RSM-CCD design can be used, but Latin hypercube Monte Carlo and its variants such as the Morris method are much more efficient), for performing sensitivity analysis to identify important input variables (similar to screening), and for reducing the number of computer simulations by adopting space-filling interpolation methods (also called surrogate modeling).
6.1 Introduction

6.1.1 Types of Data Collection
All statistical data analyses are predicated on acquiring proper data, and the more "proper" the data, the sounder the statistical analysis. Basically, data can be collected in one of three ways (Montgomery 2017): (i) A retrospective cohort study involving a control group and a test group. For example, in medical and psychological research, data are collected from a group of individuals exposed to or vaccinated against a certain factor and compared to another, control group; (ii) An observational study where data are collected during normal operation of the system and the observer cannot
intervene; relevant analysis methods have been discussed briefly in Sect. 3.8 and in Sect. 10.3; (iii) A designed experiment where the analyst can control certain system inputs and has the leeway to frame/perform the sequence of experiments as desired. This is the essence of a body of knowledge referred to as design of experiments (DOE), the focus of this chapter.

One of the objectives of performing designed experiments in an engineering context is to improve some quality of products during their manufacture. Specifically, this involves evaluating process yield improvement or detecting changes in system performance, i.e., the response to different specific changes (called treatments). An example related to metallurgy is studying the influence of adding carbon to iron in different concentrations to increase the strength and toughness of steels. The word "treatment" is generic and is used to denote an intervention in a process such as, say, an additive nutrient in fertilizers, use of different machines during manufacture, design changes in manufacturing components/processes, etc. (Box et al. 1978).

Often data are collected without proper reflection on the intended purpose, and the analyst then tries to do the best they can. The two previous chapters dealt with statistical techniques for analyzing data which had already been gathered. However, no amount of "creative" statistical data analysis can reveal information not available in the data itself. The richness of a data set is determined not by the amount of data but by the extent to which all possible states of the system are represented in it. This is especially true for observational data sets, where data are collected while the system is under routine day-to-day operation without any external intervention by the observer.
The process of proper planning and execution of experiments, intentionally designed to provide data rich in information especially suited for the intended objective with low/least effort and cost, is referred to as experimental design. Optimal experimental design is one which stipulates the conditions under which each observation should be taken to minimize/maximize the system response or inherent characteristics. Practical considerations and constraints often complicate the design of optimal experiments, and these factors need to be explicitly considered.
6.1.2 Purpose of DOE
Experimental design techniques were developed at the beginning of the twentieth century, primarily in the context of agricultural research, subsequently migrating to industrial engineering and then to other fields. The historic reason for their development was to ascertain, by hypothesis testing, whether a certain treatment increased the yield or improved the strength of the product. The statistical techniques which stipulate how each of the independent variables or factors has to be varied so as to obtain the most information about system behavior, quantified by a response variable, with a minimum of tests (and hence, low/least effort and expense) are the essence of the body of knowledge known as experimental design. Design of experiments (DOE) is broader in scope and includes additional aspects: defining the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact way samples for testing need to be selected, specifying the conditions and executing the test sequence, analyzing the data collected to verify (or refute) statistical hypotheses, and then drawing meaningful conclusions. Often the terms experimental design and DOE are used interchangeably, with the context dispelling any confusion.

Experimental design involves one or more aspects where the intent is to:
(i) "screen" a large number of possible candidates or likely variables/factors and identify the dominant variables. These candidate factors are then subject to more extensive investigation;
(ii) formulate the test conditions and sequence so that sources of unsuspected and uncontrollable/extraneous errors can be minimized while eliciting the necessary "richness" in system behavior; and
(iii) build a suitable mathematical model between the factors and the response variable using the data set acquired.
This involves both hypothesis testing to identify the signiﬁcant factors as well as model building and residual diagnostic checking. Many software packages have a DOE wizard which walks one through the various steps of deﬁning the entire set of experimental combinations. Typical application goals of DOE are listed in Table 6.1. The relative importance of these goals depends on the
Table 6.1 Typical application goals of DOE

- Hypothesis testing (Sects. 6.2, 6.4, 6.5): Determine whether there is a difference between the means of the response variable for the different levels of a factor (i.e., to verify whether the new product or process is indeed an improvement over the status quo)
- Factor screening (Sect. 6.4.2): Determine which factors greatly influence the response variable
- Factor interaction (Sect. 6.4): Determine whether (or which) factors interact with each other
- Model development (Sects. 6.4, 6.5): Determine the functional relationship between factors and the response variable
- Response surface design (Sect. 6.6): Experimental design which allows determining the numerical values or settings of the factors that maximize/minimize the response variable
speciﬁc circumstance. For example, often and especially so in engineering model building, the dominant regressor or independent variable set is known beforehand, acquired either from mechanistic insights or prior experimentation; and so, the screening phase may be redundant. In the context of calibrating a detailed simulation program with monitored data (Sect. 6.7), the problem involves dozens of input parameters. Which parameters are inﬂuential can be determined from a sensitivity analysis which is directly based on the principles of screening tests in DOE.
6.1.3 DOE Terminology
Literature on DOE has its own somewhat unique terminology which needs to be well understood. Referring to Fig. 1.8, a component or system can be represented by a simple block diagram, which consists of controllable input variables, uncontrollable inputs, system response, and the model structure denoted by the mathematical model and its parameter vector. The description of important terms is assembled in Table 6.2. Of special importance are the three terms: treatment factors, nuisance factors, and random effects. These and other terms appearing in this table are discussed in this chapter.
6.2 Overview of Different Statistical Methods

6.2.1 Different Types of ANOVA Tests
Recall the two statistical tests covered in Chap. 4, both of which involve comparing the mean values of one factor or variable corresponding to samples drawn from different populations. The Student t-test is a hypothesis test meant to compare two population means using the t-distribution (Sect. 4.2). The one-factor or one-way ANOVA (Sect. 4.3) is a variance-based statistical F-test to concurrently compare three or more population means of samples gathered after the system has been subjected to one or more treatments or interventions. Note that it could also be used for comparing two population means, but the t-test is simpler. These tests are not meant for cases where more than one factor influences the outcome. Consider an example where an animal must be fattened (increased weight fetches a higher sale price), and the effect of two different animal feeds is being evaluated. If the tests are conducted without considering the confounding random effect of other disturbances which may influence the outcome, one obtains a large variance in the mean, and the effect of the feed type is difficult to isolate/identify statistically
Table 6.2 Description of important terms used in DOE

1. Component/system response: Usually a continuous variable (akin to the dependent variable in regression analysis discussed in Chap. 5)
2. Factors: Controllable component or system variables/inputs, usually discrete or qualitative/categorical. Continuous variables need to be discretized
2a. Treatment or primary factors: The interventions or controlled variables whose impact on the response is the primary purpose of adopting DOE
2b. Nuisance or secondary factors: Variables which are sources of variability and can be controlled by blocking. They are not of primary interest to the experimenter but need to be included in the model
2c. Covariates: Nuisance factors that cannot be controlled but can be measured prior to, or during, the experiment. They can be included in the model if they are of interest
3. Random effects: Extraneous or unspecified (or lurking) disturbances which the experimenter cannot control or measure, but which impact system response and obfuscate the statistical analysis
4. Factor levels: Discrete values of the treatment and nuisance factors selected for conducting experiments (e.g., low, medium, and high can be coded as −1, 0, and +1). For continuous variables, the range of variation is discretized into a small number of numerical value sets or levels
5. Blocking: Clamping down on the variability of known nuisance factors to minimize or remove their influence/impact on the variability of the main factor. Reduces experimental error and increases the power of the statistical analysis. By not blocking a nuisance factor, it is much more difficult to detect whether the primary factor is significant or not
6. Randomization: Executing test runs in a random order to average out the impact of random effects. Meant to minimize the effect of "noise" in the experiments and improve estimation of other factors using statistical methods
7. Replication: Repeating each factor/level combination independently of any previous run to reduce the effect of random error. Replication allows estimating the experimental error and provides a more precise estimate of the actual parameter values
8. Experimental units: Physical entities such as lab equipment setups, pilot plants, field parcels, etc. for conducting DOE experiments. Any practical situation will have a limitation on the number of units available
9. Block: A noun used to denote a set or group of experimental units where one of the treatment factors has been blocked
(Fig. 6.1). It is in such cases that DOE can determine, by adopting suitable block designs, whether (and by how much) a change in one input variable (or intended treatment) among several known and controllable inputs affects the mean behavior/response/output of the process/system. Two-way ANOVA is used to compare mean differences in one dependent continuous variable between groups which have been subject to different interventions or treatments involving two independent discrete factors, with one of them being a nuisance/secondary factor/variable. Similarly, three-way ANOVA involves techniques where two nuisance factors are blocked.

The design of experiments can be extended to include:
(i) Several factors, but only two to four factors will be discussed here for the sake of simplicity. The factors could be either controllable by the experimenter or not (random or extraneous factors). The source of variation of controllable variables can be reduced by fixing them at preselected levels, called blocking, and collecting different sets of experimental data. The analysis of these different sets or experimental units or blocks allows the effect of treatment variables to be better discerned. If a model is being identified, the impact of cofactors can be included as well. The effect of the uncontrollable/unspecified disturbances requires suitable experimental procedures involving randomization and replication;
(ii) Several levels or treatments, but only two to four levels will be considered here for conceptual simplicity. The levels of a factor are the different values or categories which the factor can assume. This can be dictated by the type of variable (which can be continuous or discrete or qualitative) or selected by the experimenter. In the case of continuous variables, their range of variation is discretized into a small set of numerical values; as a result, the levels have a magnitude associated with them. This is not the case with qualitative/categorical variables, where there is no magnitude involved; the grouping is done based on some classification criterion such as different types of treatments.

Fig. 6.1 The variance in two samples of the response variable (weight) against two treatments (factors 1 and 2) can be large when several nuisance factors are present. In such cases, one-way ANOVA tests are inadequate. DOE strategies reduce the variance and bias due to secondary or nuisance factors and average out the influence of random or uncontrollable effects. This is achieved by restricted randomization, replication, and blocking, resulting in tighter limits as shown
6.2.2 Link Between ANOVA and Regression
ANOVA and linear regression are equivalent when used to test the same hypotheses. The response variable is continuous in both cases, while the factors or independent variables are discrete for ANOVA and continuous in the regression setting (Sect. 5.7 discusses the use of discrete or dummy regressors, but this is not the predominant situation). ANOVA can be viewed as a special case of regression analysis with all independent factors being qualitative. However, there is a difference in their application. ANOVA is an analysis method, widely used in experimental statistics, which addresses the question: what is the expected difference in the mean response between different groups/categories? Its main concern is to reduce the residual variance of a response variable in such a manner that the individual impact of specific factors can be better determined statistically. On the other hand, regression analysis is a mathematical modeling tool that aims to develop a predictive model for the change in the response when the predictor(s) change by a given amount (or between different groups/categories). Achieving high model goodness-of-fit and predictive accuracy are the main concerns.
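This equivalence can be verified numerically: the one-way ANOVA F statistic equals the overall F statistic of a regression of the response on group indicator (dummy) variables. A small self-contained check with made-up data for three treatment groups:

```python
import numpy as np

# Hypothetical response data for three treatment groups, four replicates each
groups = [np.array([12.1, 11.8, 12.6, 12.3]),
          np.array([13.0, 13.4, 12.9, 13.6]),
          np.array([11.2, 11.5, 10.9, 11.4])]
y = np.concatenate(groups)
n, k = len(y), len(groups)

# Classical one-way ANOVA: between-group vs within-group mean squares
grand = y.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Same test as OLS regression on indicator variables for groups 2 and 3
d2 = np.r_[np.zeros(4), np.ones(4), np.zeros(4)]
d3 = np.r_[np.zeros(8), np.ones(4)]
X = np.column_stack([np.ones(n), d2, d3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = ((y - X @ beta) ** 2).sum()
sst = ((y - grand) ** 2).sum()
F_reg = ((sst - sse) / (k - 1)) / (sse / (n - k))

print(f"F (ANOVA) = {F_anova:.2f}, F (regression) = {F_reg:.2f}")
```

The regression's residual sum of squares equals the within-group sum of squares, so the two F values are identical by construction.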
6.2.3 Recap of Basic Model Functional Forms
Section 5.4 discussed higher-order regression models involving more than one regressor. The basic additive linear model was given by Eq. 5.22a, while its "normal" transformation is one where the individual regressor variables have their mean values subtracted:
Fig. 6.2 Points required for modeling linear and nonlinear effects
y = β0′ + β1 (x1 − x̄1) + β2 (x2 − x̄2) + ⋯ + βk (xk − x̄k) + ε
ð5:20Þ
where k is the number of regressors and ε is the error or unexplained variation in y. The intercept term β0′ can be interpreted as the mean response of y. The basic regression model can be made to capture nonlinear variation in the response variable. For example, as shown in Fig. 6.2, a third observation would allow quadratic behavior (either concave or convex) to be modeled. Linear models capture variation when there is no interaction between regressors. The first-order linear model with two interacting regressor variables can be stated as:

y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
ð5:21Þ
The term (β3 x1 x2) is called the interaction term. How the presence of interaction affects the shape of the family of curves was previously illustrated in Fig. 5.8. The second-order quadratic model with interacting terms (Eq. 5.28) was also presented. The number of runs/trials or data points must be greater than the number of terms in the model in order to (i) estimate the model parameters, (ii) determine, by hypothesis testing, whether they are statistically significant or not, and (iii) obtain an estimate of the random/pure error. Even with blocking and randomization, it is important to adopt a replication strategy to increase the power of the DOE design.
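As an illustration of these points, the interacting model of Eq. 5.21 can be fit to a replicated 2 × 2 factorial data set coded at ±1 levels; with 8 runs and 4 model terms, 4 degrees of freedom remain for estimating pure error. The data and coefficient values below are invented for the sketch:

```python
import numpy as np

# Two replicates at each of the four (x1, x2) corner settings, coded +/-1
x1 = np.array([-1, -1, +1, +1, -1, -1, +1, +1], dtype=float)
x2 = np.array([-1, +1, -1, +1, -1, +1, -1, +1], dtype=float)
rng = np.random.default_rng(1)
# Synthetic response from y = 10 + 2*x1 + 3*x2 + 1.5*x1*x2 + noise
y = 10 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rng.normal(0, 0.1, 8)

# OLS fit of the first-order model with interaction (Eq. 5.21)
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = beta
print(f"b0={b0:.2f} b1={b1:.2f} b2={b2:.2f} b3={b3:.2f}")
# Replication (8 runs > 4 parameters) leaves 8 - 4 = 4 residual degrees of
# freedom, giving an estimate of pure/random error as noted in the text.
```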
6.3 Basic Concepts

6.3.1 Levels, Discretization, and Experimental Combinations
A full factorial design is one where experiments are done for each and every combination of the factors and their levels. The number of combinations is then given by:
Number of experiments for full factorial = ∏(i = 1 to k) nᵢ
ð6:1Þ
where k is the number of factors, nᵢ is the number of levels of factor i, and i is the index over the factors. For example, a factorial experiment with three factors involving one two-level factor, a three-level factor, and a four-level factor would have 2 × 3 × 4 = 24 runs or trials. For the special case when all factors have the same number of levels n, the number of experiments = nᵏ. A widely used experimental design is the 2ᵏ factorial design since it is intuitive and easy to design. It derives its terminology from the fact that only two levels for each of the k factors are presumed, one indicative of the lower level of its range of variation (coded as "−") and the other representing the higher level (coded as "+"). The factors can be qualitative or continuous; if the latter, they are discretized into categories or levels. Figure 6.3a illustrates how the continuous regressors x1 and x2 are discretized depending on their range of variation into four system states, while Fig. 6.3b depicts how these four observations would appear in a scatter plot should they exhibit no factor interaction (that is why the lines are parallel). For two factors, the number of experiments/trials (without any replication) would be 2² = 4; for three factors, 2³ = 8; and so on. The formalism of coding the low and high levels of the factors as −1 and +1, respectively, is most widespread; other ways of coding variables can also be adopted. A full factorial design involving three factors (k = 3) at two levels each (n = 2) would involve 8 tests. These are shown conceptually in Fig. 6.4 as the 8 corner points of the cube. One could add center points to a two-level design at the center of the cube; this would provide an estimate of residual error (the variability in the response not captured by the model) and allow one to test the goodness of fit of a linear model.
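Equation 6.1 and the ±1 coding are easy to reproduce programmatically; a short sketch enumerating the mixed-level example above and the 2³ corner points of Fig. 6.4:

```python
from itertools import product

# Mixed-level full factorial: one 2-level, one 3-level, one 4-level factor;
# Eq. 6.1 gives 2 * 3 * 4 = 24 runs
mixed = list(product(range(2), range(3), range(4)))
print("mixed-level runs:", len(mixed))

# 2^3 design: factors A, B, C each coded -1 (low) or +1 (high), i.e., the
# 8 corner points of the cube in Fig. 6.4
corners = list(product([-1, +1], repeat=3))
print("2^3 runs:", len(corners), "e.g.", corners[0], "...", corners[-1])
```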
Fig. 6.3 Illustration of how models are built from factorial design data. A 2² factorial design is assumed. (a) Discretization of the range of variation of the regressors x1 and x2 into "low" and "high" ranges, and (b) regression of the system performance data as they appear on a scatter plot (there is no factor interaction since the two lines are shown parallel)

Fig. 6.4 Full factorial design for three factors at two levels (2³ design) involves 8 experiments mapped to the corners of a cube. The three factors (A, B, C) are coded as +1 or −1 to denote the high- and low-level settings of the factors. Section 6.4.2 discusses this representation in more detail

Adding such center points would also indicate whether quadratic effects are present; however, it would be unable to identify the specific factor causing this behavior. To capture the quadratic effect of individual factors, one needs to add experimental points at the center of each of the 6 surfaces of the cube. This strategy is not used in the classic factorial designs but has been adopted in later designs discussed in
Sect. 6.6.4. Designs with more than 3 factors are often referred to as hypercube designs and cannot be drawn graphically. Figure 6.5 is a flowchart of some of the simpler cases encountered in practice and discussed in this chapter. One distinguishes between two broad categories of traditional designs: (a) Full factorial and fractional factorial designs (Sect. 6.4): Full factorial designs are the most conservative of all design types, where it is assumed that all trials can be run randomly by varying one factor/variable at a time. For example, with 3 factors at 3 levels each, the number of combinations is 3³ = 27, while that for four factors is 3⁴ = 81, and so on. Full factorial designs apply to instances when all factors at all their individual levels are deemed equally important and one wishes to compare k treatment/factor means. They allow estimating all possible interactions. Of special importance in this category of designs is the 2ᵏ design, which is limited to each factor having two levels only. It is often used at the early stages of an investigation for factor screening and involves an iterative approach, with each iteration providing incremental insight into factor dominance and into model building. The total number of runs required is much lower than that needed for the full factorial design. The number of experiments required for a full-factorial design increases exponentially, and it is rare to adopt such designs. Moreover, not all the factors initially selected may be dominant, thereby warranting that the list of factors be screened. The fractional factorial design is appropriate in such instances. It only requires a subset of the trials of the full factorial design, but then some of the main effects and two-way interactions are confounded and cannot be separated from the effects of other higher-order interactions. (b) Complete block and Latin squares (Sect.
6.5): Complete block design is used for instances when only the effect of one categorical treatment variable is being investigated over a range of different conditions when one or more categorical nuisance variables are present. The experimental units can be grouped in such a way that complete block designs require fewer experiments than the full factorial and allow the effect of interaction to be modeled. Latin squares are a special type of complete block design requiring an even smaller number of experiments or trials. However, they have limitations: interaction effects cannot be captured, and they are restricted to the case of an equal number of levels for all factors.
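The run counts quoted above (2 × 3 × 4 = 24 for the mixed-level example, nᵏ for equal levels) can be enumerated mechanically. The following sketch is not from the book; the factor names and level labels are invented for illustration, and only the Python standard library is used:

```python
# Sketch (illustrative factor names/levels): enumerating full-factorial runs.
from itertools import product

factors = {
    "A": ["low", "high"],           # two-level factor
    "B": ["low", "mid", "high"],    # three-level factor
    "C": [1, 2, 3, 4],              # four-level factor
}

# One run per combination of one level from each factor.
runs = list(product(*factors.values()))
print(len(runs))  # 2 * 3 * 4 = 24, as in the text

# Special case: k factors at n levels each gives n**k runs
n, k = 2, 3
assert len(list(product(range(n), repeat=k))) == n ** k  # 2^3 = 8
```

The same `itertools.product` call generalizes to any number of factors and any mix of level counts.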
Fig. 6.5 Flow chart meant to provide an overview of the applicability of the various traditional DOE methods covered in different sections of this chapter. Response surface design (treated in Sect. 6.6) is akin to a three-level design while requiring fewer experiments; this is not shown. Monte Carlo methods are also not indicated in the flowchart since they are not traditional and are adopted for different types of applications (such as computer simulation experiments)
6.3.2 Blocking
The general strategies or foundational principles adopted by DOE in order to reduce the effect of nuisance factors and random effects are to block what factors you can, and to randomize and replicate (or repeat) what you cannot (Box et al. 1978). The concept of blocking is a form of stratified sampling whereby experimental tests or items in the sample of data are grouped into blocks according to some “matching” criterion, so that the similarity of subjects within each block or group is maximized while that from block to block is minimized. Pharmaceutical companies wishing to test the effectiveness of a new drug adopt this concept extensively. Since different people react differently, grouping of subjects is done according to some criteria (such as age, gender, body fat percentage, etc.). Such blocking results in more uniformity within groups. Subsequently, a random administration of the drug to half of the people within each group/block, with a placebo to the other half, would constitute randomization. Thus, any treatment differences within each block taken separately would be more pronounced than if randomization were done without blocking.
6.3.3 Unrestricted and Restricted Randomization
After the experimental design is formulated, the sequence of the tests, i.e., the selection of the combinations of different levels and different factors, should be done in a random manner. This randomization would reduce the effect of random or extraneous factors beyond the control of the experimenter or arising from inherent experimental bias on the results of the statistical analysis. The simplest design is the randomized unrestricted block design, which involves selecting at random the combinations of the factors and levels under which to perform the experiments. This type of design, if done naively, is not very efﬁcient and may require an unnecessarily large number of experiments to be performed. The concept is illustrated with a simple example, from the agricultural area from which DOE emerged. Say, one wishes to evaluate the yield of four newly developed varieties of wheat (labeled x1, x2, x3, x4). Since the yield is affected in addition by regional climatic and soil differences, one would like to perform the evaluation at four different locations. If one had four plots of land at each
Table 6.3 Example of unrestricted randomized block design (one of several possibilities) for one factor of interest at levels x1, x2, x3, x4 and one nuisance variable (Regions 1, 2, 3, 4)

Region 1: x1, x3, x1, x2
Region 2: x1, x2, x3, x1
Region 3: x2, x1, x3, x3
Region 4: x3, x4, x2, x4
Table 6.5 Standard method of assembling test results for a balanced (3 × 2) design with two replication levels

                      Factor A
            Level 1   Level 2   Level 3   Average
Factor B
  Level 1   10, 14    23, 21    31, 27    21
  Level 2   18, 14    16, 20    21, 25    19
Average     14        20        26        20
This is not an efficient design since x1 appears twice under Region 1 and not at all in Region 4

Table 6.4 Example of restricted randomized block design (one of several possibilities) for the same example as that of Table 6.3

Region 1: x1, x2, x3, x4
Region 2: x2, x3, x4, x1
Region 3: x3, x4, x1, x2
Region 4: x4, x1, x2, x3
individual region (i.e., 16 experimental units in total), all the tests could be completed over one period or cycle. Consider the more realistic case when, unfortunately, only one plot of land or field station is available at each of the four different geographic regions, i.e., only four experimental units in all. The total time duration of the evaluation process due to this limitation would now require four time periods. This simple case illustrates the importance of designing one’s DOE strategy keeping in mind physical constraints such as the number of experimental units available and the total time duration within which to complete the entire evaluation. The simplest way of assigning which station will be planted with which variety of wheat is to do so randomly; one such result (among many possible ones) is shown in Table 6.3. Such an unrestricted randomization leads to needless replication (e.g., wheat type x1 is tested twice in Region 1 and not at all in Region 4) and is not very efficient. Since the intention is to reduce variability in the uncontrolled variable, in this case the “region” variable, one can insist that each variety of wheat be tested in each region. There are again several possibilities, with one being shown in Table 6.4. This example illustrates the principle of restricted randomization by blocking the effect of the uncontrollable factor. This aspect is further discussed in Sect. 6.5.2. Note that the intent of this investigation is to determine the effect of wheat variety on total yield. The location of the field or station is a “nuisance” variable, but in this case it can be controlled by suitable blocking. However, there may be other disturbances which are uncontrollable (say, excessive rainfall in one of the regions during the test) and, even worse, some of the disturbances may be unknown. Such effects can be partially compensated for by replication, i.e., repeating the tests more than once for each combination.
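An assignment along the lines of Table 6.4 can be sketched in code. The cyclic-shift-plus-shuffle construction below is one common way to build such a restricted randomized plan (the function name and the seed are ours, not the book's); it guarantees that each variety appears exactly once in every region (column) and every period (row):

```python
# Sketch (assumed construction): a randomly shuffled cyclic Latin square,
# so each wheat variety is planted once per region and once per period.
import random

def latin_square(treatments, seed=None):
    """Rows = time periods, columns = regions; each treatment once per row and column."""
    rng = random.Random(seed)
    order = list(treatments)
    rng.shuffle(order)                                   # randomize the base row
    k = len(order)
    square = [order[i:] + order[:i] for i in range(k)]   # cyclic shifts
    rng.shuffle(square)                                  # randomize the period order
    return square

plan = latin_square(["x1", "x2", "x3", "x4"], seed=42)
for period in plan:
    print(period)
```

Shuffling the columns as well would further randomize the region assignment; the cyclic shifts are what guarantee the Latin-square property.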
6.4 Factorial Designs
6.4.1 Full Factorial Design
Consider two factors (labeled A and B) that are to be studied at a and b levels respectively. This is often referred to as an (a × b) design, and the standard manner of representing the results of the test is by assembling them as shown in Table 6.5. Each factor-level combination can be tested more than once to minimize the effect of random errors, and this is called replication. Though more tests are done, replication reduces experimental errors introduced by extraneous factors not explicitly controlled during the experiments that can bias the results. Often, for mathematical convenience, each combination is tested at the same replication level, and this is called a balanced design. Thus, Table 6.5 is an example of a (3 × 2) balanced design with replication r = 2. The above terms are perhaps better understood in the context of regression analysis (treated in Chap. 5). Let Z be the response variable which is linear in the regressor variable X, and a model needs to be identified. Further, say, another variable Y is known to influence Z, which may corrupt the sought-after relation. Selecting three specific values of X is akin to selecting three levels for the factor X (say, x1, x2, x3). The nuisance effect of variable Y can be “blocked” by performing the tests at preselected fixed levels or values of Y (say, y1 and y2). The corresponding scatter plot is shown in Fig. 6.6. Repeat testing at each of the six combinations in order to reduce experimental errors is akin to replication; in this example, replication r = 3. Finally, if the 18 tests are performed in random sequence, the experimental design would qualify as a full factorial random design. The averages shown in Table 6.5 correspond to those of either the associated row or the associated column. Thus, the average of the first column (factor A at level 1), i.e. {10, 14, 18, 14}, is shown as 14, and so on. Plots of the average response versus the levels of a factor yield a graph which depicts the trend, called the main effect of the factor. Thus, Fig.
6.7a suggests that the average response tends to increase linearly as factor A changes from A1 to A3, while that of factor B decreases a
little as factor B changes from B1 to B2. The effect of the factors on the response may not be purely additive, and an interaction or multiplicative term (see Eq. 5.25) may have to be included as well. In such cases, the two factors are said to interact with each other. Whether this interaction effect is statistically significant or not can be determined from the results shown in Table 6.6. The effect of going from A1 to A3 is 17 under B1 and only 7 under B2. This suggests interaction effects. A simpler and more direct approach is to graph the two-factor interaction plot as shown in Fig. 6.7b. Since the lines are not parallel (in this case they cross each other), one would infer interaction between the two factors. However, in many instances, such plots are not conclusive enough, and one needs to perform statistical tests to determine whether the main or the interaction effects are significant or not (illustrated below). Figure 6.8 shows the type of interaction plot one would obtain for the case when the interaction effects are not at all significant. ANOVA decompositions allow breaking up the observed total sum of squares variation (SST) into its various contributing causes (the one-factor ANOVA was described in Sect. 4.3.1). For a two-factor ANOVA decomposition (Devore and Farnum 2005):

SST = SSA + SSB + SS(AB) + SSE   (6.2a)

where the observed total sum of squares:

SST = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{m=1}^{r} (y_ijm − ⟨y⟩)² = (stdev)² (abr − 1)   (6.2b)
Fig. 6.6 Correspondence between block design approach and multiple regression analysis
Fig. 6.7 Plots for the (3 × 2) balanced factorial design. (a) Main effects of factors A and B with mean and 95% intervals (data from Table 6.5). (b) Two-factor interaction plot (data from Table 6.6)
Table 6.6 Interaction effect calculations for the data in Table 6.5

Effect of changing A (B fixed at B1):
  A1 and B1: (10 + 14)/2 = 12
  A2 and B1: (23 + 21)/2 = 22
  A3 and B1: (31 + 27)/2 = 29
  Change: 29 − 12 = 17

Effect of changing A (B fixed at B2):
  A1 and B2: (18 + 14)/2 = 16
  A2 and B2: (16 + 20)/2 = 18
  A3 and B2: (21 + 25)/2 = 23
  Change: 23 − 16 = 7
sum of squares associated with factor A:

SSA = br Σ_{i=1}^{a} (Āi − ⟨y⟩)²   (6.2c)

sum of squares associated with factor B:

SSB = ar Σ_{j=1}^{b} (B̄j − ⟨y⟩)²   (6.2d)

error or residual sum of squares:

SSE = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{m=1}^{r} (y_ijm − ȳij)²   (6.2e)
with
y_ijm = observation under the mth replication when A is at level i and B is at level j
a = number of levels of factor A
b = number of levels of factor B
r = number of replications per cell
Āi = average of all response values at the ith level of factor A
B̄j = average of all response values at the jth level of factor B
ȳij = average of y for each cell (i.e., across replications)
⟨y⟩ = grand average of all y values
i = 1, . . ., a is the index for levels of factor A
j = 1, . . ., b is the index for levels of factor B
m = 1, . . ., r is the index for replicates.

The sum of squares associated with the AB interaction is SS(AB), and this is deduced from Eq. 6.2a since all other quantities can be calculated. A linear statistical model, referred to as a random effects model, between the response and the two factors which includes the interaction term between factors A and B can be deduced. More specifically, this is called a nonadditive two-factor model (it is nonadditive because the interaction term is present). It assumes the following form, given that one starts with the grand average and then adds the individual effects of the factors, the interaction terms, and the noise or error term:

y_ij = ⟨y⟩ + αi + βj + (αβ)ij + εij   (6.3)
where αi represents the main effect of factor A at the ith level = Āi − ⟨y⟩, with Σ_{i=1}^{a} αi = 0;
βj the main effect of factor B at the jth level = B̄j − ⟨y⟩, with Σ_{j=1}^{b} βj = 0;
(αβ)ij the interaction between factors A and B = ȳij − (⟨y⟩ + αi + βj), with Σ_{j=1}^{b} Σ_{i=1}^{a} (αβ)ij = 0;
and εij is the error (or residual), assumed uncorrelated with mean zero and variance σ² = MSE.
Fig. 6.8 An example of a twofactor interaction plot when the factors have no interaction
The analysis of variance is done as described earlier, but care must be taken to use the correct degrees of freedom to calculate the mean squares (refer to Table 6.7). The analysis of variance model (Eq. 6.3) can be viewed as a special case of multiple linear regression (more specifically, of regression with indicator variables; see Sect. 5.7.3). This concept is illustrated in Example 6.4.1.
Table 6.7 Computational procedure for a two-factor ANOVA design

Source of variation   Sum of squares   Degrees of freedom   Mean square                        Computed F statistic   Degrees of freedom for p-value
Factor A              SSA              a − 1                MSA = SSA/(a − 1)                  MSA/MSE                a − 1, ab(r − 1)
Factor B              SSB              b − 1                MSB = SSB/(b − 1)                  MSB/MSE                b − 1, ab(r − 1)
AB interaction        SS(AB)           (a − 1)(b − 1)       MS(AB) = SS(AB)/[(a − 1)(b − 1)]   MS(AB)/MSE             (a − 1)(b − 1), ab(r − 1)
Error                 SSE              ab(r − 1)            MSE = SSE/[ab(r − 1)]              –                      –
Total variation       SST              abr − 1              –                                  –                      –
Example 6.4.1 Two-factor ANOVA analysis and random effects model fitting
Using the data from Table 6.5, determine whether the main effect of factor A, the main effect of factor B, and the interaction effect of AB are statistically significant at α = 0.05. Subsequently, identify the random effects model.
It is recommended that one start by generating the treatment and effect plots as shown in Fig. 6.7a, b. The response increases with the increasing level of factor A, while it decreases a little with factor B. The effect of factor A on the response looks more pronounced. First, using all 12 observations, one computes the grand average ⟨y⟩ = 20 and the standard deviation stdev = 6.015. Then, following Eq. 6.2b:

SST = stdev² · (abr − 1) = 6.015² × [(3)(2)(2) − 1] = 398
SSA = (2)(2) × [(14 − 20)² + (20 − 20)² + (26 − 20)²] = 288
SSB = (3)(2) × [(21 − 20)² + (19 − 20)²] = 12
SSE = [(10 − 12)² + (14 − 12)² + (18 − 16)² + (14 − 16)² + (23 − 22)² + (21 − 22)² + (16 − 18)² + (20 − 18)² + (31 − 29)² + (27 − 29)² + (21 − 23)² + (25 − 23)²] = 42

Then, from Eq. 6.2a:

SS(AB) = SST − SSA − SSB − SSE = 398 − 288 − 12 − 42 = 56

Next, the expressions shown in Table 6.7 result in:

MSA = SSA/(a − 1) = 288/(3 − 1) = 144
MSB = SSB/(b − 1) = 12/(2 − 1) = 12
MS(AB) = SS(AB)/[(a − 1)(b − 1)] = 56/[(2)(1)] = 28
MSE = SSE/[ab(r − 1)] = 42/[(3)(2)(1)] = 7

The statistical significance of the factors can now be evaluated by computing the F-values and comparing them with the corresponding critical values.

• Factor A: F = MSA/MSE = 144/7 = 20.57. Since the critical F-value for degrees of freedom (2, 6) at the 0.05 significance level is Fc(2, 6) = 5.14, and because calculated F > Fc, one concludes that this factor is indeed significant at the 95% confidence level (CL).
• Factor B: F = MSB/MSE = 12/7 = 1.71. Since Fc(1, 6) at the 0.05 significance level = 5.99, this factor is not significant.
• Interaction AB: F = MS(AB)/MSE = 28/7 = 4.00. Since Fc(2, 6) at the 0.05 significance level = 5.14, this interaction is not significant.

These results are also assembled in Table 6.8 for easier comprehension.
The use of Eq. 6.3 can also be illustrated in terms of this example. The main effects of A and B are given by the differences between the cell averages and the grand average ⟨y⟩ = 20 (see Table 6.5):

α1 = (14 − 20) = −6; α2 = (20 − 20) = 0; α3 = (26 − 20) = 6
β1 = (21 − 20) = 1; β2 = (19 − 20) = −1

and those of the interaction terms by (refer to Table 6.6):

(αβ)11 = 12 − (20 − 6 + 1) = −3; (αβ)21 = 22 − (20 + 0 + 1) = 1; (αβ)31 = 29 − (20 + 6 + 1) = 2
(αβ)12 = 16 − (20 − 6 − 1) = 3; (αβ)22 = 18 − (20 + 0 − 1) = −1; (αβ)32 = 23 − (20 + 6 − 1) = −2

Finally, following Eq. 6.3, the random effects model can be expressed as:
Table 6.8 Results of the ANOVA analysis for Example 6.4.1

Source              Sum of squares   d.f.   Mean square   F-ratio   p-value
Main effects
  A: Factor A       288.0            2      144.0         20.57     0.0021
  B: Factor B       12.0             1      12.0          1.71      0.2383
Interactions
  AB                56.0             2      28.0          4.00      0.0787
Residual            42.0             6      7.0           –         –
Total (corrected)   398.0            11     –             –         –

All F-ratios are based on the residual mean square error
ȳ_ij = 20 + {−6, 0, 6}_i + {1, −1}_j + {−3, 1, 2, 3, −1, −2}_ij   (6.4a)

with i = 1, 2, 3 and j = 1, 2

For example, the cell corresponding to (A1, B1) has a mean value of 12, which is predicted by the above model as: ȳ_11 = 20 − 6 + 1 − 3 = 12, and so on. Finally, the prediction error of the model has a variance σ² = MSE = 7. Recasting the above model (Eq. 6.4a) as a regression model with indicator variables may be insightful (though cumbersome) to those more familiar with regression analysis methods:

y_ij = 20 + (−6)I1 + (0)I2 + (6)I3 + (1)J1 + (−1)J2 + (−3)I1J1 + (1)I2J1 + (2)I3J1 + (3)I1J2 + (−1)I2J2 + (−2)I3J2   (6.4b)

where the Ii and Jj are indicator variables corresponding to the α and β terms.
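The decomposition of Eqs. 6.2a–e and the numbers of Example 6.4.1 can be checked with a few lines of standard-library Python. This is a sketch of our own (the cell values are those of Table 6.5, not code from the book):

```python
# Sketch: two-factor ANOVA decomposition (Eq. 6.2) for the Table 6.5 data.
cells = {  # (i, j) -> replicate observations; i = level of A, j = level of B
    (1, 1): [10, 14], (2, 1): [23, 21], (3, 1): [31, 27],
    (1, 2): [18, 14], (2, 2): [16, 20], (3, 2): [21, 25],
}
a, b, r = 3, 2, 2
all_y = [y for obs in cells.values() for y in obs]
grand = sum(all_y) / len(all_y)                       # <y> = 20

# Row/column averages over b*r (resp. a*r) observations
A_avg = {i: sum(sum(cells[(i, j)]) for j in (1, 2)) / (b * r) for i in (1, 2, 3)}
B_avg = {j: sum(sum(cells[(i, j)]) for i in (1, 2, 3)) / (a * r) for j in (1, 2)}

SST = sum((y - grand) ** 2 for y in all_y)                                 # Eq. 6.2b
SSA = b * r * sum((A_avg[i] - grand) ** 2 for i in A_avg)                  # Eq. 6.2c
SSB = a * r * sum((B_avg[j] - grand) ** 2 for j in B_avg)                  # Eq. 6.2d
SSE = sum((y - sum(obs) / r) ** 2 for obs in cells.values() for y in obs)  # Eq. 6.2e
SSAB = SST - SSA - SSB - SSE                                               # Eq. 6.2a

MSE = SSE / (a * b * (r - 1))
F_A = (SSA / (a - 1)) / MSE
print(SST, SSA, SSB, SSAB, SSE, round(F_A, 2))  # 398.0 288.0 12.0 56.0 42.0 20.57
```

The printed values match those derived by hand in Example 6.4.1.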
6.4.2 2ᵏ Factorial Designs

The above treatment of full factorial designs can lead to a prohibitive number of runs when numerous levels need to be considered. As pointed out by Box et al. (1978), it is wise to design a DOE investigation in stages, with each successive iteration providing incremental insight into influential factors, type of interaction, etc., while suggesting subsequent investigations. Factorial designs, primarily 2ᵏ and 3ᵏ, are of great value at the early stages of an investigation, where many possible factors are investigated with the intention of either narrowing down the number (screening) or of getting a preliminary understanding of the mathematical relationship between factors and the response variable. They are, thus, viewed as a logical lead-in to the response surface method discussed in Sect. 6.6. The associated mathematics and interpretation of 2ᵏ designs are simple and can provide insights into the framing of more sophisticated and complete experimental designs called sequential designs, which allow for more precise parameter estimation (a practical example is given in Sect. 10.4.2). They are popular in R&D of products and processes and are used extensively. They can also be used during computer simulation experiments; see, for example, Hou et al. (1996) for evaluating the performance of building energy systems (discussed in Sect. 6.7.4). Figure 6.4 depicts the full factorial design for three factors at two levels (2³ design), which involves 8 experiments mapped to each of the corners of the cube. The three factors (A, B, C) are coded as +1 or −1 to denote the high- and low-level settings of the factors. Table 6.9 depicts a quick and easy way of setting up a two-level three-factor design (following the standard form suggested by Yates). Notice that the last factor column has four (−) followed by four (+), the column before it successive pairs of (−) and (+), and the second column alternating (−) and (+). The Yates algorithm is easily extended to a higher number of factors. However, the sequence in which the runs are to be performed should be randomized; a good way is to simply sample the set of trials {1, . . ., 8} in a random fashion without replacement. The approach can be modified to treat the case of parameter interaction. Table 6.9 is simply modified by including separate columns for the interaction terms, as shown in Table 6.10. The product of any two columns of the factors yields a column for the effect of the interaction term (e.g., the interaction of A and B is denoted by AB). The appropriate sign for the interactions is determined by multiplying the signs of each of the two corresponding terms. For example, AB for trial 1 would be coded as (−)(−) = (+); and so on.¹ Note that every column has an equal number of (−) and (+) signs. The orthogonality property (discussed in the next section) relates to the fact that the sum of the products of the signs in any two columns is zero.
Table 6.9 The standard form (suggested by Yates) for setting up the two-level three-factor (or 2³) design

Trial   A   B   C   Response
1       −   −   −   y1
2       +   −   −   y2
3       −   +   −   y3
4       +   +   −   y4
5       −   −   +   y5
6       +   −   +   y6
7       −   +   +   y7
8       +   +   +   y8
Table 6.10 The standard form of the two-level three-factor (or 2³) design with interactions

Trial   A   B   C   AB   AC   BC   ABC   Response
1       −   −   −   +    +    +    −     y1
2       +   −   −   −    −    +    +     y2
3       −   +   −   −    +    −    +     y3
4       +   +   −   +    −    −    −     y4
5       −   −   +   +    −    −    +     y5
6       +   −   +   −    +    −    −     y6
7       −   +   +   −    −    +    −     y7
8       +   +   +   +    +    +    +     y8
¹ The statistical basis of this simple coding process is given by Box et al. (1978).
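The sign table of Tables 6.9 and 6.10, and the orthogonality property just mentioned, can be generated and verified programmatically. The sketch below is our own (standard library only) and builds the 2³ design in standard order:

```python
# Sketch: 2^3 sign table in standard (Yates) order, with interaction columns,
# plus a check of the balance and orthogonality properties cited in the text.
import math

# A alternates fastest, then B in pairs, then C in fours (eight trials)
rows = [{"A": a, "B": b, "C": c} for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]

columns = {f: [r[f] for r in rows] for f in "ABC"}
for inter in ("AB", "AC", "BC", "ABC"):
    # an interaction column is the elementwise product of its parent columns
    columns[inter] = [math.prod(r[f] for f in inter) for r in rows]

# Every column balances: equal numbers of (-1) and (+1) entries ...
assert all(sum(col) == 0 for col in columns.values())
# ... and any two distinct columns have zero dot product (orthogonality)
names = list(columns)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        assert sum(x * y for x, y in zip(columns[p], columns[q])) == 0

print(columns["AB"])  # [1, -1, -1, 1, 1, -1, -1, 1], matching Table 6.10
```

Replacing the hard-coded "ABC" with k factor names extends the same construction to any 2ᵏ design.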
The standard form shown in Table 6.10 can be rewritten as in Table 6.11 for the 2³ design with interactions by expanding each of the interaction columns into their (+) and (−) columns respectively. For example, AB in Table 6.10 is (+) for trials 1, 4, 5, and 8, and these responses are listed under the AB+ column in Table 6.11. This is referred to as the response table form and is advantageous in that it allows the analysis to be done in a clear and modular manner.

Table 6.11 Response table representation for the 2³ design with interactions (omitting the ABC term) generated by expanding Table 6.10. Each (+) column collects the responses of the trials run at the high setting of that effect, and each (−) column those at the low setting:

A+: y2, y4, y6, y8     A−: y1, y3, y5, y7
B+: y3, y4, y7, y8     B−: y1, y2, y5, y6
C+: y5, y6, y7, y8     C−: y1, y2, y3, y4
AB+: y1, y4, y5, y8    AB−: y2, y3, y6, y7
AC+: y1, y3, y6, y8    AC−: y2, y4, y5, y7
BC+: y1, y2, y7, y8    BC−: y3, y4, y5, y6

Each column is summed and averaged over its four entries (yielding Ā+, Ā−, B̄+, etc.), and each effect is the difference of the paired averages, e.g., (Ā+ − Ā−).

The physical interpretation of the measure of interaction AB is that it is the difference between the average change in the response with factor A and that with factor B. Similarly, AB+ denotes the average effect of A on the response variable when B is held fixed (or blocked) at the B+ level; on the other hand, AB− denotes the effect of A when B is held fixed at the lower level B−. The main effect of A is simply:

Main effect of A = Ā+ − Ā− = (1/4)[(y2 + y4 + y6 + y8) − (y1 + y3 + y5 + y7)]   (6.5)

The main effect of, say, factor C can be determined simply as:

Main effect of C = C̄+ − C̄− = (y5 + y6 + y7 + y8)/4 − (y1 + y2 + y3 + y4)/4   (6.6a)

Similarly, the interaction effect of, say, BC can be determined as the average of the B effect when C is held constant at +1 minus the B effect when C is held constant at −1:

Interaction effect of BC = B̄C+ − B̄C− = (1/4)[(y1 + y2 + y7 + y8) − (y3 + y4 + y5 + y6)]   (6.6b)

where the overbar indicates the average value. Statistical textbooks on DOE provide elaborate details of how to obtain estimates of all main and interaction effects when more factors are to be considered, and then how to use statistical procedures such as ANOVA to identify the significant ones.

Thus, the individual and interaction effects directly provide a prediction model of the form:

y = b0 + b1·A + b2·B + b3·C  (main effects)
    + b12·AB + b13·AC + b23·BC + b123·ABC  (interaction terms)   (6.7a)

The intercept term is given by the grand average of all the response values y. This model is analogous to Eq. 5.25, which is one form of the additive multiple linear models discussed in Chap. 5. Note that Eq. 6.7a has eight parameters, and with eight experimental runs the model fit will be perfect with no variance. A measure of the random error can only be deduced if the degrees of freedom (d.f.) > 0, and so replication (i.e., repeats of runs) is necessary. Another option, relevant when interaction effects are known to be negligible, is to adopt a model which includes main effects only:

y = b0 + b1·A + b2·B + b3·C   (6.7b)

In this case, d.f. = 4, and so a measure of the random error of the model can be determined.

Example 6.4.2 Deducing a prediction model for a 2³ factorial design
Consider a problem where three factors {A, B, C} are presumed to influence a response variable y. The problem is to specify a DOE design, collect data, ascertain the statistical importance of the factors, and then identify a prediction model. The numerical values of the factors or regressors corresponding to the high and low levels are assembled in Table 6.12. It was decided to use two replicate tests for each of the 8 combinations to enhance accuracy. Thus, 16 runs were performed, and the results are tabulated in the standard form as suggested by Yates (Table 6.9) and shown in Table 6.13.
(a) Identify statistically significant terms
This tabular data can be used to create a table similar to Table 6.11, which is left to the reader. Then, the main effects and interaction terms can be calculated following Eq. 6.6a and Eq. 6.6b. Main effect of factor A:

= [1/((2)(4))] × [(26 + 29) + (21 + 22) + (23 + 22) + (18 + 18) − (34 + 40) − (33 + 35) − (24 + 23) − (19 + 18)] = −47/8 = −5.875

while the effect sum of squares SSA = (−47.0)²/16 = 138.063.

Table 6.12 Assumed low and high levels for the three factors (Example 6.4.2)

Factor   Low level   High level
A        0.9         1.1
B        1.20        1.30
C        20          30
Table 6.13 Standard table (Example 6.4.2)

Trial   A     B     C    Responses (two replicates)
1       0.9   1.2   20   34, 40
2       1.1   1.2   20   26, 29
3       0.9   1.3   20   33, 35
4       1.1   1.3   20   21, 22
5       0.9   1.2   30   24, 23
6       1.1   1.2   30   23, 22
7       0.9   1.3   30   19, 18
8       1.1   1.3   30   18, 18
Data available electronically on book website
Similarly, the main effects of B = −4.625 and C = −9.375, while the interaction effects are AB = −0.625, AC = 5.125, BC = −0.125, and ABC = 0.875. The results of the ANOVA analysis are assembled in Table 6.14. One concludes that the main effects A, B, and C and the interaction effect AC are significant at the 0.01 level. The main effect and interaction effect plots are shown in Figs. 6.9 and 6.10. These plots do suggest that interaction effects are present only for factors A and C, since those lines are clearly not parallel.
(b) Identify prediction model
Only four terms, namely A, B, C, and the AC interaction, are found to be statistically significant at the 0.05 level (see Table 6.14). In such a case, the functional form of the prediction model reduces to:

y = b0 + b1 xA + b2 xB + b3 xC + b4 xA xC   (6.8a)

Substituting the values of the effect estimates determined earlier results in

y = 25.313 − 2.938 xA − 2.313 xB − 4.688 xC + 2.563 xA xC   (6.8b)

where coefficient b0 is the mean of all observations. Also, note that the values of the model coefficients are half the values of the main and interaction effects determined in part (a). For example, the main effect of factor A was calculated to be (−5.875), which is twice the (−2.938) coefficient for the xA factor shown in the equation above. The division by 2 is needed because of the way the factors were coded, i.e., the high and low levels, coded as +1 and −1, are separated by 2 units. The performance equation thus determined can be used for predictions. For example, when xA = +1, xB = −1, xC = −1, one gets y = 26.813, which agrees reasonably well with the average of the two replicates performed (26 and 29) despite dropping the nonsignificant interaction terms.
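As a check, the effect estimates and the reduced prediction model of Example 6.4.2 can be reproduced from the response-table logic of Eqs. 6.5–6.6 (effect = average of high-level runs minus average of low-level runs). This sketch is ours; the helper names are invented, and the data are those of Table 6.13:

```python
# Sketch: effect estimates for Example 6.4.2 via the response-table logic.
import math

signs = {  # per trial (standard order): coded levels of (A, B, C)
    1: (-1, -1, -1), 2: (1, -1, -1), 3: (-1, 1, -1), 4: (1, 1, -1),
    5: (-1, -1, 1), 6: (1, -1, 1), 7: (-1, 1, 1), 8: (1, 1, 1),
}
reps = {1: (34, 40), 2: (26, 29), 3: (33, 35), 4: (21, 22),
        5: (24, 23), 6: (23, 22), 7: (19, 18), 8: (18, 18)}

def effect(term):
    """term is e.g. 'A' or 'AC'; a trial's sign is the product of its parents' signs."""
    tot = {+1: 0.0, -1: 0.0}
    for t, (a, b, c) in signs.items():
        s = math.prod({"A": a, "B": b, "C": c}[f] for f in term)
        tot[s] += sum(reps[t])
    return (tot[+1] - tot[-1]) / 8   # 8 = 2 replicates x 4 trials per level

print(effect("A"))   # -5.875, matching part (a)
print(effect("AC"))  # 5.125

# Reduced model of Eq. 6.8b: coefficients are half the effects
grand = sum(sum(v) for v in reps.values()) / 16
def predict(xa, xb, xc):
    return (grand + effect("A") / 2 * xa + effect("B") / 2 * xb
            + effect("C") / 2 * xc + effect("AC") / 2 * xa * xc)

print(predict(1, -1, -1))  # 26.8125, i.e. the text's 26.813
```

The same `effect` helper works for any 2ᵏ design once `signs` and `reps` are filled in.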
Table 6.14 Results of the ANOVA analysis

Source              Sum of squares   d.f.   Mean square   F-ratio   p-value
Main effects
  Factor A          138.063          1      138.063       41.68     0.0002
  Factor B          85.5625          1      85.5625       25.83     0.0010
  Factor C          351.563          1      351.563       106.13    0.0000
Interactions
  AB                1.5625           1      1.5625        0.47      0.5116
  AC                105.063          1      105.063       31.72     0.0005
  BC                0.0625           1      0.0625        0.02      0.8941
  ABC               3.063            1      3.063         0.92      0.3640
Residual (error)    26.5             8      3.3125        –         –
Total (corrected)   711.438          15     –             –         –

Interaction effects AB, BC, and ABC are not significant (Example 6.4.2a)
Fig. 6.9 Main effect scatter plots for the three factors for Example 6.4.2
Fig. 6.10 Interaction plots for Example 6.4.2
(c) Comparison with linear multiple regression approach
The parallel between this approach and regression modeling involving indicator variables was discussed previously (Sects. 5.7.3 and 6.2.2). For example, if one were to perform a multiple regression on the above data with the three regressors coded as −1 and +1 for low and high values respectively, one obtains the results shown in Table 6.15.
Note that the same four variables (A, B, C, and AC interaction) are statistically signiﬁcant while the model coefﬁcients are identical to the ones determined by the ANOVA analysis. If the regression were to be redone with only these four variables present, the model coefﬁcients would be identical. This is a great advantage with factorial designs in that one could include additional variables incrementally in the model without impacting the model
Table 6.15 Results of performing a multiple linear regression on the same data with regressors coded as +1 and −1 (Example 6.4.2c)

Parameter                       Estimate   Standard error   t-statistic   p-value
Constant                        25.3125    0.455007         55.631        0.0000
Factor A                        −2.9375    0.455007         −6.45595      0.0002
Factor B                        −2.3125    0.455007         −5.08234      0.0010
Factor C                        −4.6875    0.455007         −10.302       0.0000
Factor A × Factor B             −0.3125    0.455007         −0.686803     0.5116
Factor A × Factor C             2.5625     0.455007         5.63178       0.0005
Factor B × Factor C             −0.0625    0.455007         −0.137361     0.8941
Factor A × Factor B × Factor C  0.4375     0.455007         0.961524      0.3644
Table 6.16 Goodness-of-fit statistics of different multiple linear regression models (Example 6.4.2)

Regression model                   R²      Adjusted R²   RMSE
With all terms                     0.963   0.930         1.820
With only four significant terms   0.956   0.940         1.684
Fig. 6.11 Observed versus predicted values for the regression model indicate larger scatter at high values (Example 6.4.2)
Fig. 6.12 Model residuals versus model predicted values highlight the larger scatter present at higher values, indicative of non-additive errors (Example 6.4.2)
coefficients of variables already identified. Why this is so is explained in Sect. 6.4.3. Table 6.16 assembles pertinent goodness-of-fit indices for the complete model and for the model with only the four significant regressors. Note that while the R2 value of the former is higher (a misleading statistic to consider when dealing with multivariate model building), the Adj-R2 and the RMSE of the reduced model are superior. Finally, Figs. 6.11 and 6.12 are model predicted versus observed plots, which allow one to ascertain how well the model has fared; in this case, there seems to be a larger scatter at higher values, indicative of non-additive errors. This suggests that a linear additive model may not be the best choice, and the analyst may undertake further refinements if time permits.

In summary, DOE involves the complete reasoning process of defining the structural framework, i.e., prescribing the exact manner in which samples for testing need to be selected, and the conditions and sequence under which the testing needs to be performed under specific restrictions imposed by space, time, and the nature of the process (Mandel 1964). The applications of DOE have expanded to the area of model building as well. It is now used to identify which subsets among several possible variables influence the response variable, and to determine a quantitative relationship between them.
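The Adj-R2 entries of Table 6.16 can be reproduced from the R2 values alone using Adj-R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), where p counts the regressors excluding the intercept. The sketch below assumes n = 16 observations (a 2^3 design with one replicate, consistent with the error and total degrees of freedom quoted for this example):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R2 for a model with p regressors (intercept excluded)
    fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Full model of Table 6.15: 3 main effects + 4 interactions = 7 regressors
full = adjusted_r2(0.963, n=16, p=7)      # ~0.931
# Reduced model: only the four significant terms A, B, C, and AC
reduced = adjusted_r2(0.956, n=16, p=4)   # ~0.940
```

The reduced model's higher Adj-R2 despite its lower R2 is exactly the behavior discussed in the text.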
6.4.3 Concept of Orthogonality

An important concept in DOE is orthogonality, by which it is implied that trials should be framed such that the data matrix X results in (X^T X) being a diagonal matrix, where X^T is the transpose of X. In such a case, the off-diagonal terms of the matrix (X^T X) will be zero, i.e., the regressors are uncorrelated. This leads to the best designs since it minimizes the variance of the regression coefficients. (Recall from basic geometry that two straight lines are perpendicular when the product of their slopes is equal to −1; orthogonality is an extension of this concept to multiple dimensions. Refer to Sect. 5.4.2 for a refresher.) For example, consider Table 6.10, where the standard form for the two-level three-factor design is shown. Replacing low and high values (i.e., − and +) by −1 and +1, and noting that an extra column of 1s needs to be introduced to take care of the constant term in the model (see Eq. 5.31), results in the regressor matrix being defined by:

        [ 1  −1  −1  −1 ]
        [ 1  +1  −1  −1 ]
        [ 1  −1  +1  −1 ]
    X = [ 1  +1  +1  −1 ]        (6.9)
        [ 1  −1  −1  +1 ]
        [ 1  +1  −1  +1 ]
        [ 1  −1  +1  +1 ]
        [ 1  +1  +1  +1 ]

The reader can verify that the off-diagonal terms of the matrix (X^T X) are indeed zero. All n^k factorial designs are thus orthogonal, i.e., (X^T X)^−1 is a diagonal matrix with nonzero diagonal components. This leads to the soundest parameter estimation (as discussed in Sect. 9.2.3). Another benefit of orthogonal designs is that parameters of regressors already identified remain unchanged as additional regressors are added to the model, thereby allowing the model to be developed incrementally. Thus, the effect of each term of the model can be examined independently. These are two great benefits when factorial designs are adopted for model identification.

Example 6.4.3 Matrix approach to inferring a prediction model for a 2^3 design (from Beck and Arnold 1977, by permission)
This example will illustrate the analysis procedure for a complete 2^k factorial design with three factors. The model, assuming a linear form, is given by Eq. 5.28 and includes main and interaction effects. Denoting the three factors by x1, x2, and x3, the regressor matrix X will have four main effect parameters (the intercept term is the first column) as well as the four interaction terms, as shown in Fig. 6.13. For example, the 6th column is the product of the 2nd and 3rd columns, and so on.

Fig. 6.13 The coded regression matrix with four main and four interaction parameters with eight experiments

          1   x1   x2   x3   x1x2  x1x3  x2x3  x1x2x3
        [ 1   −1   −1   −1   +1    +1    +1    −1 ]
        [ 1   +1   −1   −1   −1    −1    +1    +1 ]
        [ 1   −1   +1   −1   −1    +1    −1    +1 ]
    X = [ 1   +1   +1   −1   +1    −1    −1    −1 ]
        [ 1   −1   −1   +1   +1    −1    −1    +1 ]
        [ 1   +1   −1   +1   −1    +1    −1    −1 ]
        [ 1   −1   +1   +1   −1    −1    +1    −1 ]
        [ 1   +1   +1   +1   +1    +1    +1    +1 ]

Let us assume that a DOE has yielded the following eight values for the response variable:

    Y^T = [49  62  44  58  42  73  35  69]        (6.10)

The intention is to identify a parsimonious model, i.e., one in which only the statistically significant terms of the model given by Eq. 5.28 are retained. The inverse (X^T X)^−1 = (1/8)·I, and the (X^T Y) terms can be deduced by taking the sums of the y_i terms multiplied by either (+1) or (−1) as indicated in X^T. The coefficient b0 = 54 (the average of all eight values of y), while b1, following Eq. 6.5, is: b1 = [(62 + 58 + 73 + 69) − (49 + 44 + 42 + 35)]/(4 × 2) = 11.5, and so on. The resulting model is:

    y_i = 54 + 11.5·x1i − 2.5·x2i + 0.75·x3i + 0.5·x1i·x2i + 4.75·x1i·x3i − 0.25·x2i·x3i + 0.25·x1i·x2i·x3i        (6.11)
With eight parameters and eight observations (and no replication), the model fit will be perfect with zero degrees of freedom; this is referred to as a saturated model. This is not a prudent situation since the model variance cannot be computed, nor can the p-values of the various terms be inferred. Had a replicated design been adopted, an estimate of the variance in the model could have been conveniently obtained and some measure of the goodness-of-fit of the model deduced (as in Example 6.4.1). In this case, the simplest recourse is to drop one of the terms from the model (say, the x1x2x3 interaction term) and then perform the ANOVA analysis.
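The matrix computations of Example 6.4.3 take only a few lines; a sketch assuming NumPy is available:

```python
import numpy as np

# Coded 2^3 design of Fig. 6.13 in standard order: x1 alternates fastest
x1 = np.array([-1, 1, -1, 1, -1, 1, -1, 1])
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1])
x3 = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
X = np.column_stack([np.ones(8), x1, x2, x3,
                     x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])
y = np.array([49, 62, 44, 58, 42, 73, 35, 69])

# Orthogonality: X'X = 8I, so (X'X)^-1 = (1/8)I and least squares reduces
# to b = X'y / 8
assert np.allclose(X.T @ X, 8 * np.eye(8))
b = X.T @ y / 8
# b reproduces the coefficients of Eq. 6.11:
# [54, 11.5, -2.5, 0.75, 0.5, 4.75, -0.25, 0.25]
```

Because the columns are orthogonal, dropping any column leaves the remaining entries of b unchanged, which is the incremental-modeling property noted above.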
Table 6.17 Results of the ANOVA analysis (Example 6.4.3)

Source              Sum of squares   D.f.   Mean square   F-ratio   p-value
Main effects
  Factor x1         1058             1      1058          2116      0.0138
  Factor x2         50.0             1      50.0          100       0.0635
  Factor x3         4.50             1      4.50          9.00      0.2050
Interactions
  x1x2              2.00             1      2.00          4.00      0.2950
  x1x3              180.5            1      180.5         361       0.0335
  x2x3              0.50             1      0.50          1.00      0.5000
Residual or error   0.50             1      0.50          –         –
Total (corrected)   1296             7      –             –         –
Because of the orthogonal behavior, the significance of the dropped term can be evaluated at a later stage without affecting the model terms already identified. The effect of individual terms is now investigated in a manner similar to the previous example. The ANOVA analysis shown in Table 6.17 suggests that only the terms x1 and (x1x3) are statistically significant at the 0.05 level. However, the p-value for x2 is close, and so it would be advisable to keep this term. The parsimonious model is then directly stated as:

    y_i = 54 + 11.5·x1i − 2.5·x2i + 4.75·x1i·x3i        (6.12)

Fig. 6.14 Illustration of the differences between (a) full factorial and (b) fractional factorial design for a 2^3 DOE experiment. Several different combinations of fractional factorial designs are possible; only one such combination is shown

The above example illustrates how data gathered within a DOE design and analyzed following the ANOVA method can yield an efficient functional predictive model of the data. It is left to the reader to repeat the analysis illustrated in Example 6.4.1, where an identical model was obtained by straightforward use of multiple linear regression. Note that orthogonality is maintained only if the analysis is done with the coded variables (−1 and +1), and not with the original ones.

Recall that a 2^2 factorial design implies two regressors or factors, each at two levels, say "low" and "high." Since there are only two states, one can only frame a first-order functional model to the data, such as Eq. 6.12. Thus, a 2^2 factorial design is inherently constrained to identifying a first-order linear model between the regressors and the response variable. If the mathematical relationship requires higher-order terms, multilevel factorial designs are required (to identify polynomial models such as Eq. 5.23). For example, the 3^k design requires the range of variation of each factor to be aggregated into three levels, such as "low," "medium," and "high." For a situation with three factors (i.e., k = 3), one needs to perform 27 experiments even if no replication tests are considered; this is more than three times the number of tests needed for the 2^3 design. Thus, the additional higher-order insight can only be gained at the expense of a larger number of runs which, for a higher number of factors, may
become prohibitive. In such instances, central composite designs are often advisable since they allow second-order effects to be modeled with 2^k designs; this design is discussed in Sect. 6.6.4.
6.4.4 Fractional Factorial Designs
One way of greatly reducing the number of runs, provided interaction effects are known to be negligible, is to adopt fractional factorial designs. The 27 tests needed for a full 3^3 factorial design can then be reduced to only 9 tests. Thus, instead of 3^k tests, an incomplete block design would only require 3^(k−1) tests. A graphical interpretation of how a fractional factorial design differs from a full factorial one for a 2^3 instance is illustrated in Fig. 6.14. Three factors are involved (A, B, and C) at two levels each (−1, +1). While 8 test runs, corresponding to the 8 corners of the cube, are performed for the full factorial, only 4 runs are required for the fractional factorial, as shown. The Latin squares design, discussed in Sect. 6.5.2, is a type of fractional factorial method. The interested reader can refer to Box et al. (1978) or Montgomery (2017) for a detailed treatment of fractional factorial design methods.
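For the 2^3 case of Fig. 6.14, one such half-fraction can be generated by imposing a defining relation on the full design. The choice I = ABC below is an assumption for illustration (the figure does not state which of the possible fractions is drawn):

```python
from itertools import product

# Full 2^3 factorial: 8 runs at coded levels -1/+1 (corners of the cube)
full = list(product([-1, 1], repeat=3))

# Half-fraction defined by I = ABC: keep the runs where x1*x2*x3 = +1
half = [(a, b, c) for a, b, c in full if a * b * c == 1]

# 4 runs remain, and each factor is still balanced (two -1s and two +1s)
assert len(half) == 4
assert all(sum(run[j] for run in half) == 0 for j in range(3))
```

The price of halving the runs is aliasing: with I = ABC, each main effect is confounded with a two-factor interaction, which is why such designs presume negligible interactions.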
Table 6.18 Machining time (in minutes) for Example 6.5.1

           Operator
Machine    1        2        3        4        5        6        Average
1          42.5     39.3     39.6     39.9     42.9     43.6     41.300
2          39.8     40.1     40.5     42.3     42.5     43.1     41.383
3          40.2     40.5     41.3     43.4     44.9     45.1     42.567
4          41.3     42.2     43.5     44.2     45.9     42.3     43.233
Average    40.950   40.525   41.225   42.450   44.050   43.525   42.121

Data available electronically on book website
Table 6.19 ANOVA table for Example 6.5.1

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
Machines              15.92            3                    5.31          3.34                   0.048
Operators             42.09            5                    8.42          –                      –
Error                 23.84            15                   1.59          –                      –
Total                 81.86            23                   –             –                      –

6.5 Block Designs

6.5.1 Complete Block Design

Complete block design pertains to the instance when one wishes to investigate the effect of only one primary or treatment factor/variable on the response when other secondary factors (or nuisance variables) are present. The effect of these nuisance variables is minimized or eliminated by blocking. Since two factors are involved, such problems are often referred to as two-way ANOVA problems. The following example serves to illustrate the concept of randomized complete block design with one nuisance factor.

Example 6.5.1 Evaluating performance of four machines while blocking effect of operator dexterity (from Walpole et al. 2007, by permission of Pearson Education)
The performance of four different machines M1, M2, M3, and M4 is to be evaluated in terms of the time needed to manufacture a widget. It is decided that the same widget will be made on these machines by six different machinists/operators in a randomized block experiment. The machines are assigned in a random order to each operator. Since dexterity is involved, there will be a difference among the operators in the time needed to machine the widget. Table 6.18 assembles the machining times in minutes after 24 tests have been completed. Here, the machine type is the primary treatment factor, while the nuisance factor is the operator (the intent of the study could have been the reverse). The effect of this nuisance factor is blocked or controlled by the randomized complete block design where all operators use all four machines. The analysis calls for testing the hypothesis at
the 0.05 level of significance that the performance of the machines is identical. Let Factor A correspond to the machine type and B to the operator. Thus a = 4 and b = 6, with replication r = 1. Then, Eq. 6.2a reduces to:

    SST = SSA + SSB + SSE

where

    SSA = (6)·[(41.300 − 42.121)^2 + (41.383 − 42.121)^2 + ...] = 15.92
    SSB = (4)·[(40.950 − 42.121)^2 + (40.525 − 42.121)^2 + ...] = 42.09

Total variation = SST = (abr − 1)·stdev^2 = (23)·(1.8865^2) = 81.86

Subsequently, SSE = 81.86 − 15.92 − 42.09 = 23.84

The ANOVA table can then be generated as depicted in Table 6.19. The F-statistic = (5.31/1.59) = 3.34, which is significant at probability p = 0.048. One would conclude that the performance of the machines cannot be taken to be similar at the 0.05 significance level (this is a close call, though, and would merit further investigation!). What can one infer about differences in the dexterity of the machinists?

As illustrated earlier, graphical displays of data can provide useful diagnostic insights in ANOVA types of problems as well. For example, simply plotting the raw observations around each treatment mean can provide a feel for the variability between sample means and within samples. Figure 6.15 depicts all the data as well as the mean variation. One notices two unusually different values which stand out, and it may be wise to go back and study the experimental conditions which produced these results. Without these, the interaction effects seem small.
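The sums of squares above can be verified numerically from the Table 6.18 data; a sketch assuming NumPy is available:

```python
import numpy as np

# Machining times of Table 6.18: rows = machines (treatment), cols = operators (blocks)
y = np.array([[42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
              [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
              [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
              [41.3, 42.2, 43.5, 44.2, 45.9, 42.3]])
a, b = y.shape                                   # a = 4 machines, b = 6 operators
grand = y.mean()

ssa = b * ((y.mean(axis=1) - grand) ** 2).sum()  # machines: ~15.92
ssb = a * ((y.mean(axis=0) - grand) ** 2).sum()  # operators: ~42.09
sst = ((y - grand) ** 2).sum()                   # ~81.86
sse = sst - ssa - ssb                            # ~23.85 (Table 6.19 rounds to 23.84)

# F for machines with (a-1) and (a-1)(b-1) degrees of freedom: ~3.34
f_machines = (ssa / (a - 1)) / (sse / ((a - 1) * (b - 1)))
```

The slight difference in SSE arises only from rounding the intermediate values in the hand calculation.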
Fig. 6.15 Factor mean plots of the two factors with six levels for the operator variable and four for the machine variable
Fig. 6.18 Normal probability plot of the residuals
A random effects model can also be identified. In this case, an additive linear model is perhaps adequate, such as:

    y_ij = <y> + α_i + β_j + ε_ij        (6.13)

Inspection of the residuals can provide diagnostic insights regarding violations of normality and non-uniform variance, akin to regression analysis. Since the model predictions are given by:

    ŷ_ij = <y> + (A_i − <y>) + (B_j − <y>) = A_i + B_j − <y>        (6.14a)

the residuals of the (i,j) observation are:

    ε_ij = y_ij − ŷ_ij = y_ij − (A_i + B_j − <y>),   i = 1, ..., 4 and j = 1, ..., 6        (6.14b)

Fig. 6.16 Scatter plot of the residuals versus the six operators
Fig. 6.17 Scatter plot of residuals versus predicted values
Two different residual plots have been generated. Figures 6.16 and 6.17 reveal that the variance of the errors versus operators and versus model predicted values are fairly random except for two large residuals (as noted earlier). Further, a normal probability plot of the model residuals seems to show some departure from normality, and this issue may need further scrutiny (Fig. 6.18). An implicit and important assumption in the above model design is that the treatment and block effects are additive, i.e., negligible interaction effects. In the context of Example 6.5.1, it means that if, say, Operator 3 is on average 0.5 min faster than Operator 2 on machine 1, the same difference also holds for machines 2, 3, and 4. This pattern would be akin to that depicted in Fig. 6.3 where the mean responses of different blocks differ by the same amount from one treatment to the next. In many experiments, this assumption of additivity does not hold, and the treatment and block effects interact (as illustrated in Fig. 6.7b). For example, Operator 1 may be faster by 0.5 min on the average than Operator 2 when machine 1 is used, but he may be slower by, say, 0.3 min on the average than Operator 2 when machine 2 is used. In such a case, the operators and the machines are said to be interacting.
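The residuals of Eq. 6.14b can be computed directly from the Table 6.18 data; in this sketch (NumPy assumed), the largest-magnitude residuals are plausible candidates for the unusual observations noted above:

```python
import numpy as np

# Table 6.18 again: rows = machines, cols = operators
y = np.array([[42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
              [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
              [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
              [41.3, 42.2, 43.5, 44.2, 45.9, 42.3]])
grand = y.mean()
A = y.mean(axis=1, keepdims=True)    # machine (row) means A_i
B = y.mean(axis=0, keepdims=True)    # operator (column) means B_j

resid = y - (A + B - grand)          # Eq. 6.14b via broadcasting
assert abs(resid.sum()) < 1e-9       # residuals of the additive model sum to zero

# Cell with the largest-magnitude residual (0-based indices)
i, j = np.unravel_index(np.abs(resid).argmax(), resid.shape)
```

Sorting |ε_ij| this way is a quick numerical counterpart to the visual inspection of Figs. 6.16 and 6.17.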
The above treatment of full factorial designs was limited to one nuisance factor. The treatment can be extended to a greater number of factors, but the analysis gets messier though the extension is quite straightforward; see, for example, Box et al. (1978) or Montgomery (2017).

6.5.2 Latin Squares

For the special case when all factors have the same number of levels, the number of experiments necessary for a complete factorial design which includes all main effects and interactions is n^k, where k is the number of factors and n the number of levels. If certain assumptions are made, this number can be reduced considerably (see Fig. 6.14). Such methods are referred to as fractional factorial designs. The Latin squares approach is one such special design meant for problems: (i) involving three factors with one treatment and two non-interacting nuisance factors (i.e., k = 3), (ii) that allow blocking in two directions, i.e., eliminating two sources of nuisance variability, (iii) where the number of levels for each factor is the same, and (iv) where interaction terms among factors are negligible (i.e., the interaction terms (αβ)_ij in the statistical effects model given by Eq. 6.3 are dropped).

This results in a significant reduction in the number of experimental runs, especially when several levels are involved. However, replication is advisable to reduce random error, to estimate the experimental error, and to provide more precise estimates of the parameter values. A Latin square for n levels, denoted by (n × n), is a square of n rows and n columns with each of the n^2 cells containing one specific treatment that appears once, and only once, in each row and column. Consider a three-factor experiment at four different levels each. The number of experiments required for a full factorial, i.e., to map out the entire experimental space, would be 4^3 = 64. For incomplete factorials, the number of experiments reduces to 4^2 = 16.

A Latin square is said to be reduced (also normalized, or in standard form) if both its first row and its first column are in their natural order. The standard manner of specifying a (4 × 4) Latin square design with three factors is shown in Table 6.20a. One of the blocked factors is represented by levels (1, 2, 3, 4) and the other by (I, II, III, IV), with the primary or treatment factor denoted by (A, B, C, D). A randomized design (one of several) is shown in Table 6.20b. Note that the Latin square design shown in Table 6.20b is not unique. The number of possible combinations N grows exponentially with the number of levels n: for n = 3, N = 12; for n = 4, N = 576; and for n = 5, N = 161,280.

For 3 factors at 3 levels, each Latin square design only needs 3^2 = 9 as against 3^3 = 27 experiments required for the full factorial design. Thus, Latin square designs reduce the required number of experiments from n^3 to n^2 (where n is the number of levels), thereby saving cost and time. In general, the fractional factorial design requires n^(k−1) experiments, while the full factorial requires n^k. A simple way of generating Latin square designs for higher values of n is to write the levels in order in the first row, with the subsequent rows generated by simply shifting the sequence of levels one space to the left. One then needs to randomize (and perhaps include replicates as well) to average out the effect of random influences.

Table 6.21 assembles the analysis of variance equations for a Latin square design, which will be illustrated in Example 6.5.2. Latin square designs usually have a small number of error degrees of freedom (e.g., 2 for a 3 × 3 and 6 for a 4 × 4 design), which allows a measure of model variance to be deduced.

In summary, while the randomized block design allows blocking of one source of variation, the Latin square design allows systematic blocking of two sources of variability for problems involving three factors (k = 3 with two nuisance factors). The restrictions of this design are that (i) all three factors must have the same number of levels, and (ii) no interaction effects are present. The concept, under the same assumptions as those for the Latin square design, can be extended to problems

Table 6.20 A (4 × 4) Latin square design with three factors and four levels. The treatment factor levels are (A, B, C, D) while those of the two nuisance factors are shown as (1, 2, 3, 4) and (I, II, III, IV). Note that each treatment occurs in every row and column. (a) The standard manner of specifying the design, called reduced form. (b) One of several possible randomized designs

(a)
       1   2   3   4
  I    A   B   C   D
  II   B   C   D   A
  III  C   D   A   B
  IV   D   A   B   C

(b)
       1   2   3   4
  I    A   B   D   C
  II   D   C   A   B
  III  B   D   C   A
  IV   C   A   B   D
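The left-shift construction just described can be sketched as follows (levels coded 0 to n−1 rather than letters):

```python
def latin_square(n: int) -> list:
    """Reduced (standard-form) Latin square: the first row is in natural order
    and each subsequent row is the previous one shifted one place to the left."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

square = latin_square(4)
# Every level appears exactly once in each row and each column
assert all(sorted(row) == [0, 1, 2, 3] for row in square)
assert all(sorted(row[k] for row in square) == [0, 1, 2, 3] for k in range(4))
```

Randomly permuting the rows, columns, and level labels of this reduced square then yields randomized designs such as the one in Table 6.20b.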
Table 6.21 The analysis of variance equations for an (n × n) Latin square design

Source of variation   Sum of squares   Degrees of freedom   Mean square                    Computed F statistic
Row                   SSR              n − 1                MSR = SSR/(n − 1)              FR = MSR/MSE
Column                SSC              n − 1                MSC = SSC/(n − 1)              FC = MSC/MSE
Treatment             SSTr             n − 1                MSTr = SSTr/(n − 1)            FTr = MSTr/MSE
Error                 SSE              (n − 1)(n − 2)       MSE = SSE/[(n − 1)(n − 2)]     –
Total                 SST              n^2 − 1              –                              –

with four factors (k = 4) where three sources of variability need to be blocked; this is done using Graeco-Latin square designs (see Box et al. 1978; Montgomery 2017).

Example 6.5.2 Evaluating impact of air filter type on breathing complaints with school vintage and season being nuisance factors
Since three factors are involved, such problems are often referred to as three-way ANOVA problems. To reduce breathing-related complaints from students, four different types of air-cleaning filters (labeled A, B, C, and D, which are the treatment factors) are being considered for mandatory replacement of the existing air filters in all schools in a school district. Since seasonal effects are important, tests are to be performed during each of the four seasons (and corrected for the days when the school is in session during each of these seasons). Further, it is decided that tests should be conducted in four schools representative of different vintages (labeled 1 through 4). Because of the potential for differences in the HVAC systems between old and new schools, it is logical to insist that each filter type be tested at each school during each season of the year. It would have been advisable to have replicates, but that would have increased the duration of the testing period from one year to two years, which was not acceptable to the school board.

(a) Develop a DOE design
This is a three-factor problem with four levels in each. The total number of treatment combinations for a completely randomized design would be 4^3 = 64. The selection of the same number of categories for all three criteria of classification can be done following a Latin square design, with the analysis of variance performed using the results of only 16 treatment combinations. One such Latin square is given in Table 6.22. The rows and columns represent the two sources of variation one wishes to control. One notes that in this design, each treatment occurs exactly once in each row and in each column. Such a balanced arrangement allows the effect of the air-cleaning filter to be separated from that of the season variable.

Note that if interaction between the sources of variation is present, the Latin square model cannot be used;
Table 6.22 Experimental design (Example 6.5.2)

School vintage   Fall   Winter   Spring   Summer
1                A      B        C        D
2                D      A        B        C
3                C      D        A        B
4                B      C        D        A
Table 6.23 Data table showing the number of breathing complaints

School vintage   Fall     Winter   Spring   Summer   Average
1                A 70     B 75     C 68     D 81     73.5
2                D 66     A 59     B 55     C 63     60.75
3                C 59     D 66     A 39     B 42     51.50
4                B 41     C 57     D 39     A 55     48.00
Average          59.00    64.25    50.25    60.25    58.4375

A, B, C, and D are four different types of air filters being evaluated (Example 6.5.2)
Data available electronically on book website
this assessment ought to be made based on previous studies or expert opinion.

(b) Perform an ANOVA analysis
Table 6.23 summarizes the data collected under such an experimental protocol, where the numerical values shown are the number of breathing-related complaints per season, corrected for the number of days when the school is in session and for changes in the student population. Assuming that the various sources of variation do not interact, the objective is to statistically determine whether filter type affects the number of breathing complaints. A secondary objective is to investigate whether any (and, if so, which) of the nuisance factors (school vintage and season) are influential. Generating scatter plots such as those shown in Fig. 6.19 for school vintage and filter type is a logical first step. The standard deviation is stdev = 12.91, while the averages of the four treatments or filter types are:
Fig. 6.19 Scatter plots of number of complaints vs (a) school vintage and (b) ﬁlter type
Table 6.24 ANOVA results following equations shown in Table 6.21 (Example 6.5.2)

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
School vintage        1557.2           3                    519.06        11.92                  0.006
Season                417.69           3                    139.23        3.20                   0.105
Filter type           263.69           3                    87.90         2.02                   0.213
Error                 261.37           6                    43.56         –                      –
Total                 2499.94          15                   –             –                      –
    A = 55.75,  B = 53.25,  C = 61.75,  D = 63.00

In this example, one would make a fair guess, based on the between and within variation, that filter type is probably not an influential factor on the number of complaints while school vintage may be. The analysis of variance approach, or ANOVA, is likely to be more convincing because of its statistical rigor. From the probability values in the last column of Table 6.24, it can be concluded that the number of complaints is strongly dependent on the school vintage, only marginally significant with respect to season (p = 0.105, i.e., close to the 0.10 level), and not statistically significant with respect to filter type.
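The ANOVA of Table 6.24 can be reproduced from Tables 6.22 and 6.23 using the equations of Table 6.21; a sketch assuming NumPy is available:

```python
import numpy as np

# Complaint counts of Table 6.23 (rows: school vintage 1-4; cols: Fall..Summer)
y = np.array([[70., 75., 68., 81.],
              [66., 59., 55., 63.],
              [59., 66., 39., 42.],
              [41., 57., 39., 55.]])
# Filter type assigned to each cell, per the Latin square of Table 6.22
treat = np.array([["A", "B", "C", "D"],
                  ["D", "A", "B", "C"],
                  ["C", "D", "A", "B"],
                  ["B", "C", "D", "A"]])
n = 4
grand = y.mean()

ssr = n * ((y.mean(axis=1) - grand) ** 2).sum()   # school vintage (rows): ~1557.2
ssc = n * ((y.mean(axis=0) - grand) ** 2).sum()   # season (columns): ~417.69
sstr = n * sum((y[treat == t].mean() - grand) ** 2 for t in "ABCD")  # filter: ~263.69
sse = ((y - grand) ** 2).sum() - ssr - ssc - sstr  # ~261.37

mse = sse / ((n - 1) * (n - 2))       # 6 error degrees of freedom
f_vintage = (ssr / (n - 1)) / mse     # ~11.92
f_filter = (sstr / (n - 1)) / mse     # ~2.02
```

The treatment means recovered by `y[treat == t].mean()` match the values A = 55.75, B = 53.25, C = 61.75, and D = 63.00 quoted above.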
6.6 Response Surface Designs

6.6.1 Applications
Recall that the factorial methods described in the previous section can be applied to either continuous or discrete qualitative/categorical variables, with only one variable changed at a time. The 2^k factorial method allows both screening, to identify dominant factors, and the identification of a robust linear predictive model. In a historic timeline, these techniques were extended to optimizing a process or product by Box and Wilson in the early 1950s. A special class of mathematical and statistical techniques was developed to identify models and analyze data relating a response to a set of continuous treatment variables, with the intent of determining the conditions under which a maximum (or a minimum) of the response variable is obtained when one or more of the variables are simultaneously changed (Box et al. 1978). For example, the optimal mix of two alloys which would result in a product of maximum strength can be deduced by fitting the data from factorial experiments with a model from which the optimum is determined either by calculus or by search methods (described in Chap. 7 under optimization methods). These models, called response surface (RS) models, are usually framed as second-order models (sometimes the first order is adequate if the optimum is far from the initial search space) which are linear in the parameters. RS designs involve not just the modeling aspect, but also recommendations on how to perform the sequential search involving several DOE steps.

The reader may wonder why most of the DOE models treated in this chapter assume empirical polynomial models. This is for historic reasons: the types of applications which triggered the development of DOE were not understood well enough to adopt mechanistic functional forms. Empirical polynomial models are linear in the parameters but can be nonlinear in their functional form due to interaction terms and higher-order terms in the variables (such as Eq. 6.7a).
6.6.2 Methodology
A typical RS experimental design involves three general phases (screening, optimizing, and confirming) performed with the specific intention of limiting the number of experiments required to achieve a rich data set. This will be illustrated using the following example. The R&D staff of a steel company wants to improve the strength of the metal sheets sold. They have identified a preliminary list of five factors that might impact the strength of their metal sheets: the concentrations of chemical A and chemical B, the annealing temperature, the time to anneal, and the thickness of the sheet casting.

The first phase is to run a screening design to identify the main factors influencing the metal sheet strength. Thus, those factors that are not important contributors to the metal sheet strength are eliminated from further study. How to perform such screening tests involving the 2^k factorial design was discussed in Sect. 6.4.2. It was concluded that the chemical concentrations A and B are the main treatment factors that survive the screening design.

To optimize the mechanical strength of the metal sheets, one needs to know the relationship between the strength of the metal sheet and the concentrations of chemicals A and B in the mix; this is done in the second phase, which requires a sequential search process. The following steps are undertaken during the sequential search:

(i) Identify the levels of the amounts of chemicals A and B to study. Three distinct values of each factor are usually necessary to fit a quadratic function, so standard two-level designs are not appropriate for fitting curved surfaces.
(ii) Generate the experimental design using one of several factorial methods.
(iii) Run the experiments.
(iv) Analyze the data using ANOVA to identify the statistical significance of factors.
(v) Draw conclusions and develop a model for the response variable. The quadratic terms in these equations approximate the curvature in the underlying response function. If a maximum or minimum exists inside the design region, the point where that value occurs can be estimated. Unfortunately, this is unlikely to be the case: the approximate model identified is representative of the behavior of the metal in the local design space only, while the global optimum may lie outside the search space.
(vi) Using optimization methods (such as calculus-based methods or search methods such as steepest descent), move in the direction where the overall optimum is likely to lie (refer to Sect. 7.4.2).
(vii) Repeat steps (i) through (vi) until the global optimum is reached. Once the optimum has been identiﬁed, the R&D staff would want to conﬁrm that the new, improved metal sheets have higher strength; this is the third phase. They would resort to hypothesis tests involving running experiments to support the alternate hypothesis that the strength of the new, improved metal sheet is greater than the strength of the existing metal sheet. In summary, the goals of the second and third phases of the RS design are to determine and then conﬁrm, with the needed statistical conﬁdence, the optimum levels of chemicals A and B that maximize the metal sheet strength.
6.6.3 First- and Second-Order Models
In most RS problems, the form of the relationship between the response and the regressors is unknown. Consider the case where the yield (Y) of a chemical process is to be maximized, with temperature (T) and pressure (P) being the two independent variables (Montgomery 2017). The 3D plot (called the response surface plot in DOE terminology) is shown in Fig. 6.20, along with its projection onto a 2D plane, known as a contour plot. The maximum yield is achieved at T = 138 and P = 28, at which the maximum yield is Y = 70. If one did not know the shape of this curve, one simple approach would be to assume a starting point (say, T = 115 and P = 20, as shown) and repeatedly perform experiments in an effort to reach the maximum point. This is akin to a univariate optimization search (see Sect. 7.4), which is not very efficient. In this example involving a chemical process, varying one variable at a time may work because of the symmetry of the RS plot. However, in cases (and this is often so) where the RS plot is asymmetrical, or when the search location is far away from the optimum, such a univariate search may erroneously indicate a non-optimal maximum. A superior manner, and the one adopted in most numerical methods, is the steepest gradient method, which involves adjusting all the variables together (see Sect. 7.4). As shown in Fig. 6.21, if the responses Y at each of the four corners of the square are known by experimentation, a suitable model is identified (in the figure, a linear model is assumed and so the sets of lines for different values of Y are parallel). The steepest gradient method involves moving along a direction perpendicular to the sets of lines (indicated by the "steepest descent" direction in the figure) to another point where the next set of experiments ought to be performed. Repeated use of this testing, modeling, and stepping is likely to lead one close to the sought-after maximum or minimum (provided one is not caught in a local peak, valley, or saddle point).
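One cycle of this testing-modeling-stepping procedure can be sketched numerically. The response function below is a hypothetical stand-in (the true surface of Fig. 6.20 is not tabulated in the text), and the half-step sizes and the stepping factor are likewise assumptions for illustration; only the starting point (T = 115, P = 20) and the optimum location (T = 138, P = 28) come from the example:

```python
def yield_surface(T, P):
    """Hypothetical stand-in for the yield surface of Fig. 6.20,
    constructed so that its maximum lies at T = 138, P = 28 with Y = 70."""
    return 70.0 - 0.02 * (T - 138.0) ** 2 - 0.5 * (P - 28.0) ** 2

# 2^2 design centered on the starting point (T = 115, P = 20)
T0, P0, dT, dP = 115.0, 20.0, 5.0, 2.0
corners = [(s1, s2) for s1 in (-1, 1) for s2 in (-1, 1)]
y = [yield_surface(T0 + s1 * dT, P0 + s2 * dP) for s1, s2 in corners]

# First-order model y = b0 + b1*x1 + b2*x2 via the orthogonality shortcut
b1 = sum(s1 * yi for (s1, _), yi in zip(corners, y)) / 4   # slope in T
b2 = sum(s2 * yi for (_, s2), yi in zip(corners, y)) / 4   # slope in P

# Step the test region along the fitted gradient (steepest ascent for a maximum)
scale = max(abs(b1), abs(b2))
T1, P1 = T0 + 2 * dT * b1 / scale, P0 + 2 * dP * b2 / scale
# Both slopes are positive here, so the next experiments move toward higher T and P
```

Repeating the fit-and-step cycle from (T1, P1), and switching to a second-order model near the top, is the essence of the sequential search of Sect. 6.6.2.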
6.6 Response Surface Designs
Fig. 6.20 A three-dimensional response surface between the response variable (the expected yield) and two regressors (temperature and pressure), with the associated contour plots indicating the optimal value. (From Montgomery 2017 by permission of John Wiley and Sons)
Fig. 6.21 Figure illustrating how the ﬁrstorder response surface model (RSM) ﬁt to a local region can progressively lead to the global optimum using the steepest descent search method
The following recommendations are noteworthy to minimize the number of experiments to be performed:

(a) During the initial stages of the investigation, a first-order polynomial model in some region of the range of variation of the regressors is usually adequate. Such models have been extensively covered in Chap. 5, with Eq. 5.29 being the linear first-order model form in vector notation. 2^k factorial designs, both full and fractional, are good choices at the preliminary stage of the RS investigation. As stated earlier, due to the benefit of orthogonality, these designs are recommended since they minimize the variance of the regression coefficients.

(b) Once close to the optimal region, polynomial models higher than the first order are advised. This could be a second-order polynomial (involving just the main effects) or a higher-order polynomial which also includes quadratic effects and interactions between pairs of factors (two-factor interactions) to account for curvature. Quadratic models are usually sufficient for most engineering applications, though increasing the order of approximation could sometimes further reduce model errors. Of course, it is unlikely that a polynomial model will be a reasonable approximation of the true functional relationship over the entire space of the independent variables, but for a relatively small region, it usually works well. Note that rarely would all the terms of the quadratic model be needed; how to identify a parsimonious model has been illustrated in Examples 6.4.2 and 6.4.3.

6.6.4 Central Composite Design and the Concept of Rotation

One must assume three levels for the factors in order to fit quadratic models. For a 3^k factorial design with the number of factors k = 3, one needs 27 experiments with no replication, which, for k = 4, grows to 81 experiments. Thus, the
6 Design of Physical and Simulation Experiments
Fig. 6.22 A central composite design (CCD) contains two sets of experiments: a fractional factorial or "cube" portion, which serves as a preliminary stage where one can fit a first-order (linear) model, and a group of axial or "star" points that allow estimation of curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. In addition, a certain number of center points are also used to capture inherent random variability in the process or system behavior. (a) CCD for two factors. (b) CCD for three factors
number of trials at each iteration point increases exponentially. Hence, 3^k designs become impractical for k > 3. A more efficient design requiring fewer experiments uses the concept of rotation, also referred to as axisymmetry. An experimental design is said to be rotatable if the trials are selected such that they are equidistant from the center. Since the location of the optimum point is unknown, such a design results in equal precision of estimation in all directions. In other words, the variance of the response variable at any point in the regressor space is a function only of the distance of the point from the design center. A central composite design (CCD) contains three components (Berger and Maurer 2002):

(a) A two-level (fractional) factorial design, which estimates the main and two-factor interaction terms; (b) a "star" or "axial" design, which, in conjunction with the other two components, allows estimation of curvature by allowing quadratic terms to be introduced in the model function; (c) a set of center points (essentially random repeats of the center point), which provides a measure of process stability by reducing model prediction error and allowing one to estimate that error. The center points also provide a check for curvature, i.e., if the response surface is curved, the center points will be lower or higher than predicted by the design points (see Fig. 6.22). The total number of experimental runs for a CCD with k factors is

N = 2^k + 2k + c    (6.15)

where c is the number of center points.

This equation is confirmed by Fig. 6.22. For a two-factor experimental design, the CCD generates 4 factorial points and 4 axial points, i.e., 4 + 4 + 1 = 9 points (assuming only one center point). For a three-factor experimental design, the CCD generates 8 factorial points and 6 axial points, i.e., 8 + 6 + 1 = 15 points. The factorial or "cube" portion and center points (shown as circles in Fig. 6.22) may aid in fitting a first-order (linear) model during the preliminary stage while still providing evidence regarding the importance of a second-order contribution or curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. The star points represent new extreme values (low and high) for each factor in the design. The number of center points for some useful CCDs has also been suggested. Sometimes, more center points than the numbers suggested are introduced; nothing is lost by this except the cost of performing the additional runs. For a two-factor CCD, it is recommended that at least two center points be used, while many researchers are said to routinely use as many as 6-8 points. CCDs are the most widely used RSM experimental designs since they inherently satisfy the desirable design properties of orthogonal blocking and rotatability and allow for efficient estimation of the quadratic terms in the second-order model. Central composite designs with two and three factors, along with the manner of coding the levels of the factors at which experimental tests must be conducted, are shown more clearly in Fig. 6.23a, b, respectively. If the distance from the center of the design space to a factorial point is ±1 unit for each factor, the distance from the center of the design space to a star point is ±α with α > 1. The precise value of α depends on certain properties desired for the design, such as orthogonal blocking, and on the number of
Fig. 6.23 Rotatable central composite designs (CCD) for two factors and three factors during RSM. The black dots indicate locations of experimental runs
factors involved. To maintain rotatability, the value of α for a CCD is chosen such that:

α = (n_f)^(1/4)    (6.16)
where n_f is the number of experimental runs in the factorial portion. For example, if the experiment has 2 factors, the full factorial portion contains 2^2 = 4 points, and the value of α for rotatability is α = (2^2)^(1/4) = 1.414. If the experiment has 3 factors, α = (2^3)^(1/4) = 1.682; if the experiment has 4 factors, α = (2^4)^(1/4) = 2; and so on. As shown in Fig. 6.23, CCDs usually have axial points outside the "cube" (unless one intentionally specifies α ≤ 1 due to, say, safety concerns in performing the experiments). Finally, since the design points describe a circle circumscribed about the factorial square, the optimum values must fall within this experimental region. If not, suitable constraints must be imposed on the function to be optimized (as illustrated in the example below). For further reading on CCD, the texts by Box et al. (1978) and Montgomery (2017) are recommended. Many software packages have a DOE wizard, which walks one through the various steps of defining the entire set of experimental/simulation combinations.

Example 6.6.1 Optimizing the deposition rate for a tungsten film on a silicon wafer⁸

A two-factor rotatable central composite design (CCD) was run to optimize the deposition rate for a tungsten film on a silicon wafer. The two factors are the process pressure (in kPa) and the ratio of hydrogen H2 to tungsten hexafluoride WF6 in the reaction atmosphere. The levels for these factors are given in Table 6.25. Let x1 be the pressure factor and x2 the ratio factor for the two coded factors. A rotatable CCD with three center points was adopted, with the experimental results
⁸ From Buckner et al. (1993) with small modification.
Table 6.25 Assumed low and high levels for the two factors (Example 6.6.1)

Factor          Low level    High level
Pressure        0.4          8.0
Ratio H2/WF6    2            10
Table 6.26 Results of the CCD rotatable design for two coded factors with 3 center points (Example 6.6.1)

x1        x2        y
-1        -1        3663
 1        -1        9393
-1         1        5602
 1         1        12488
-1.414     0        1984
 1.414     0        12603
 0        -1.414    5007
 0         1.414    10310
 0         0        8979
 0         0        8960
 0         0        8979

Data available electronically on book website
assembled in Table 6.26. For example, for the pressure factor, the low level of 0.4 is coded as -1 and the high level of 8.0 as +1. The numerical values of the response are left unaltered. A second-order linear regression with all 11 data points results in a model with Adj-R² = 0.969 and RMSE = 608.9. The model coefficients assembled in Table 6.27 indicate that the coefficients of (x1·x2) and (x1²·x2²) are not statistically significant. Dropping these terms results in a better model with Adj-R² = 0.983 and RMSE = 578.8. The corresponding values of the reduced model coefficients are shown in Table 6.28. In determining whether the model can be further simplified, one notes that the highest p-value among the independent variables is 0.0549, belonging to (x2²). Since this p-value is greater than or equal to 0.05, that term may not be statistically significant and one could consider removing it from the model; this, however, is a close call.
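The reduced second-order fit can be reproduced with ordinary least squares. The sketch below uses the coded design of Table 6.26, with the ± signs on the coded levels restored from the standard rotatable-CCD layout (four factorial corners, axial points at ±α, three center points); it should recover the Table 6.28 estimates to within rounding.

```python
import numpy as np

# Coded design points of the two-factor rotatable CCD (Table 6.26):
# 4 factorial points, 4 axial points at +/-alpha, 3 center points.
a = 2**0.5  # alpha = (2^2)^(1/4) = sqrt(2) ~ 1.414, per Eq. 6.16
x1 = np.array([-1, 1, -1, 1, -a, a, 0, 0, 0, 0, 0])
x2 = np.array([-1, -1, 1, 1, 0, 0, -a, a, 0, 0, 0])
y = np.array([3663, 9393, 5602, 12488, 1984, 12603,
              5007, 10310, 8979, 8960, 8979], dtype=float)

# Reduced second-order model: y = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2
X = np.column_stack([np.ones_like(y), x1, x2, x1**2, x2**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", np.round(b, 1))
```

The five fitted coefficients correspond, in order, to the constant, x1, x2, x1², and x2² rows of Table 6.28.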
Table 6.27 Model coefficients for the second-order complete model with coded regressors (Example 6.6.1)

Parameter    Estimate    Standard error    t-statistic    p-value
Constant     8972.6      351.53            25.5246        0.0000
x1           3454.43     215.284           16.046         0.0001
x2           1566.79     215.284           7.27781        0.0019
x1*x2        289.0       304.434           0.949303       0.3962
x1^2         -839.837    277.993           -3.02107       0.0391
x2^2         -657.282    277.993           -2.36438       0.0773
x1^2*x2^2    310.952     430.60            0.722137       0.5102
Table 6.28 Model coefficients for the reduced model with coded regressors (Example 6.6.1)

Parameter    Estimate    Standard error    t-statistic    p-value
Constant     8972.6      334.19            26.8488        0.0000
x1           3454.43     204.664           16.8785        0.0000
x2           1566.79     204.664           7.65544        0.0003
x1^2         -762.044    243.63            -3.12787       0.0204
x2^2         -579.489    243.63            -2.37856       0.0549
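The t-statistics in Tables 6.27 and 6.28 are simply each estimate divided by its standard error; the p-values then follow from a Student-t distribution with n − p = 11 − 5 = 6 degrees of freedom for the reduced model. A quick check (with the quadratic estimates taken as negative, consistent with Eq. 6.17):

```python
# t-statistic for each coefficient of the reduced model (Table 6.28):
# t = estimate / standard error; under H0 (coefficient equals zero)
# t follows a Student-t distribution with n - p = 11 - 5 = 6 d.o.f.
estimates = [8972.6, 3454.43, 1566.79, -762.044, -579.489]
std_errors = [334.19, 204.664, 204.664, 243.63, 243.63]

t_stats = [est / se for est, se in zip(estimates, std_errors)]
print([round(t, 4) for t in t_stats])
```

The computed values match the t-statistic column of Table 6.28, confirming, for instance, that the (x2²) term sits right at the conventional 5% significance boundary.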
Thus, the final RSM model with coded regressors is:

y = 8972.6 + 3454.4 x1 + 1566.8 x2 − 762.0 x1² − 579.5 x2²    (6.17)

It would be wise to look at the residuals. Figure 6.24 suggests quite good agreement between the model and the observations. However, Fig. 6.25 indicates that one of the points (the first row, i.e., y = 3663) is unusual since its studentized residual is high (recall that studentized residuals measure how many standard deviations each observed value of y deviates from the fitted model using all of the data except that observation; see Sect. 5.6.2). Points greater than 3 in absolute value warrant a close look and, if necessary, may be removed prior to model fitting.

Fig. 6.24 Observed versus model predicted values (Example 6.6.1)

Fig. 6.25 Studentized residuals versus model predicted values (Example 6.6.1)

The optimal values of the two coded regressors associated with the maximum response are determined by taking partial derivatives of Eq. 6.17 and setting them to zero, resulting in x1 = 2.267 and x2 = 1.353. However, this optimum lies outside the experimental region used to identify the RSM model (see Fig. 6.26) and is unacceptable. A constrained optimization is warranted since the spherical constraint of a rotatable CCD must be satisfied; i.e., from Eq. 6.16, α = (2^2)^(1/4) = 1.414. This guarantees that the optimal condition falls within the experimental region assumed. Resorting to a constrained optimization (see Sect. 7.3) results in the optimal values of the regressors x1* = 1.253 and x2* = 0.656, representing a maximum deposition rate y* = 12,883. The low and high values of the two regressors are shown in Table 6.25. These optimal values of the coded variables can be transformed back in terms of the original variables to yield pressure = 8.56 kPa and ratio H2/WF6 = 8.0. Finally, a confirmatory experiment would have to be conducted in the neighborhood of this optimum.
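Since the unconstrained optimum lies outside the region and the fitted quadratic is concave, the constrained maximum must lie on the boundary circle of radius α. A dense scan over that boundary (a sketch, not the formal constrained-optimization treatment of Sect. 7.3) recovers the reported optimum:

```python
import numpy as np

# Maximize the fitted RSM (Eq. 6.17) subject to the spherical design
# constraint x1^2 + x2^2 <= alpha^2 with alpha = 1.414. The quadratic
# is concave and its unconstrained maximum falls outside the region,
# so the constrained maximum lies on the boundary circle.
def y_hat(x1, x2):
    return 8972.6 + 3454.4*x1 + 1566.8*x2 - 762.0*x1**2 - 579.5*x2**2

alpha = 1.414
theta = np.linspace(0.0, 2.0*np.pi, 200001)
x1, x2 = alpha*np.cos(theta), alpha*np.sin(theta)
y = y_hat(x1, x2)
i = np.argmax(y)
print(f"x1* = {x1[i]:.3f}, x2* = {x2[i]:.3f}, y* = {y[i]:.0f}")
```

The scan reproduces x1* ≈ 1.253, x2* ≈ 0.656, and y* ≈ 12,883, matching the values quoted in the example.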
6.7 Simulation Experiments

6.7.1 Background
Computer simulations of product behavior, processes, and systems, ranging from the simple (say, a solar photovoltaic system) to the extremely complex (such as the design of a wide-area distributed energy system involving traditional and renewable power generation subunits, or the simulation of future climate change scenarios), have acquired great importance and a well-respected role in all branches of scientific and engineering endeavor. These virtual tools are based on "behavioral models" at a certain level of abstraction which allow prediction, assessment, and verification of the performance of products and systems under different design and operating
Fig. 6.26 Contour plot of the RSM model given by Eq. 6.17 in terms of the two coded factors (Example 6.6.1)
conditions. They are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of issues one can consider. Note that simulations are done assuming a set of discrete values for the design parameters/variables, which is akin to the selection of specific levels for treatment and secondary factors while conducting physical experiments. Thus, the primary purpose of performing simulation studies is to learn as much as possible about system behavior under different sets of design factors/variables/parameters at the lowest possible cost (Shannon 1975). The parallel between traditional physical-experimental DOE methods and the selection of samples of input vectors of design variables for simulations is obvious. Unfortunately, the technical developments in the design of computer simulations have tended to be siloed, with rediscovery and duplication slowing down progress. It is common to have several competing commercial simulation programs meant for the same purpose but with differing degrees of accuracy, sophistication, and capabilities; however, such issues will not be considered here. It is simply assumed that a reasonably high-fidelity, validated computer simulation program is available for the intended purpose. Further, this discussion is aimed at computationally intensive long-run simulations, i.e., instances where a single computer simulation run may require minutes or hours to complete. The material in this section is primarily focused on the use of detailed hourly or sub-hourly timestep computer simulation programs for the design of energy-efficient buildings, which predict the hourly energy consumption and the indoor comfort conditions for a whole year for any building
geometry and climatic conditions (see, for example, Clarke 1993). These programs include models for the dynamic thermal response of the building envelope and for the performance of the different types of HVAC systems, and they consider the specific manner in which the building is scheduled and operated (widely used programs are EnergyPlus, TRNSYS, Modelica, and eQuest; refer to, for example, de Wit 2003). Commonly, computer simulations can be used for one or more of the following applications:

(a) During design, i.e., while conceptualizing the system before it is built. This generally involves two different tasks:

(i) Sizing equipment or systems based on point conditions, such as peak conditions during the normal course of operation. System reliability, safety, etc. must meet some code-specified criteria. Examples involve (1) a sloping roof that is structurally able to support code-specified conditions (say, location-specific 50-year maximum snow loading), (2) extreme conditions of outdoor temperature and solar radiation for sizing heating and cooling equipment (often based on 1% and 99% annual probability criteria) for building HVAC equipment in a specific location, and (3) mitigation measures to be activated to avoid environmental pollutants due to vehicles exceeding some prespecified threshold in a specific city neighborhood.

(ii) Long-term temporal simulation for system performance prediction under a preselected range of design conditions or during the normal range of
operating conditions, assuming efficient or optimal control as per prevailing industry norms. Two examples are: (1) simulating the energy use in a building controlled in a standard energy-efficient manner on an hour-by-hour basis over a standard year, and (2) predicting the annual electricity produced by a PV system assuming fault-free behavior. Such capability allows optimal or satisfactory options to be evaluated based on circumstance-specific constraints in addition to criteria such as energy or cost or both.

(b) During day-to-day operation, using model-based short-term forecasting and optimization techniques to control and operate the system as efficiently as possible. Such conditions may deviate from the idealized and standard conditions assumed during design. Typical applications are model-based supervisory control of energy systems (such as a distributed energy system in conjunction with building HVAC systems) and fault detection and diagnosis.

(c) System response/performance under extraordinary/rare events which result in full or partial component/system failure leading to large functionality loss and severe economic and social hardship (e.g., a hurricane knocking down power lines). Simulations are performed to study system robustness and the ability to recover quickly; this aspect falls under reliability (Sect. 7.5.6) and resilience analysis (complementary to sustainability, briefly mentioned in Sect. 12.7.6).

Note that the above applications may cover conditions wherein the model/simulation program inputs are either assumed to be deterministic or stochastic. The latter is an instance when some of the physical parameters of the model are not known with certainty and need to be expressed by a probability distribution. For example, during the construction of the wall assembly of a building, deviation from design specifications is common (referred to as specification uncertainty).
Another source of uncertainty is that several external drivers (such as weather) may not be known with the needed accuracy at the intended site. In addition, the conditions under which the system is operated may not be known properly and this introduces scenario uncertainty. For example, an architect during design may assume the building to be operated for 12 h/day while the owner may subsequently use it for 16 h/day. Finally, given the complexity of the actual system performance, the underlying models used for the simulation are simpliﬁcations of reality (modeling uncertainty). The extent to which all these uncertainties along with the uncertainty of the numerical method used to solve the set of modeling equations (called numerical uncertainty) would affect the ﬁnal design should be evaluated.
Surprisingly, this aspect has yet to reach the necessary level of maturity for practicing architects and building energy analysts to adopt routinely. The impact of uncertainties in building simulations, and how to address them realistically, has been discussed by de Wit (2003).
6.7.2 Similarities and Differences Between Physical and Simulation Experiments
DOE in conjunction with model building can be used to simplify the search for an optimum or a satisficing solution when detailed simulation programs of physical systems requiring long computer run times are to be used. The similarity of such problems to DOE experiments on physical processes is obvious, since the former requires:

(a) Important design parameters (or independent variables) and their range of variability to be selected, which is identical to the selection of treatment and secondary factors.

(b) The number of levels for the factors to be decided, based on some prior insight into the dependence between the response and the set of design parameters.

(c) Experimental design: the specific combinations of primary factors to be selected for which one predicts system responses to improve the design. This involves making multiple runs of the computer model using specific values and pairings of these input parameters.

(d) Finally, the data analysis phase, which allows insights into the following: (i) screening: performing a sensitivity analysis to determine a subset of dominant model input parameter combinations; (ii) model structure: gaining insights into the structure of the performance model, such as whether linear, second-order, or cross-product terms are needed; (iii) design optimization: finding an optimum (or near-optimum) either from an exhaustive search of the set of discrete simulation results or by fitting an appropriate mathematical model to the data, referred to as surrogate modeling (discussed in Sect. 6.7.5).

There are, however, some major differences. One major difference is that computer experiments are deterministic, i.e., one gets the same system response under the same set of inputs. Replication is not required, and only one center point is sufficient if a CCD design is adopted. A second major difference is that, since the analyst selects the input variable set prior to each simulation run, blocking and control of the variables are inherent and no special consideration is required (Dean et al.
2017). Another difference is that the initial number of design variables tends to be much larger than that
involving physical experiments. The resulting range of input combinations, or design space, is very large, requiring different approaches to generate the input combination samples, to reduce the initial variable set by sensitivity analysis (similar to screening), and to reduce the number of computer simulations during the search for a satisficing/optimal solution by adopting space-filling interpolation methods (also called "surrogate models"). These aspects are discussed below.
6.7.3 Monte Carlo and Allied Sampling Methods
The Monte Carlo (MC) method was introduced earlier in the context of uncertainty analysis (Sect. 3.7.2). The MC approach, of which there are several variants, comprises that branch of computational mathematics that relies on experiments using random numbers to infer the response of a system (Hammersley and Handscomb 1964). Chance events are artificially recreated numerically (on a computer), the simulation is run many times, and the results provide the necessary insights. MC methods provide approximate solutions to a variety of deterministic and stochastic problems; hence their widespread appeal. The methods vary but tend to follow a particular pattern when applied to computer simulation situations: (i) define a domain of possible inputs, (ii) generate inputs randomly from an assumed probability distribution over the domain, (iii) perform a deterministic computation on the inputs, and (iv) aggregate and analyze the results. The many advantages of MC methods are conceptual simplicity, a low level of mathematics, applicability to a large number of different types of problems, the ability to account for correlations between inputs, and suitability to situations where model parameters have unknown distributions. MC methods are numerical methods in that all the uncertain inputs must be assigned a definite probability distribution. For each simulation, one value is selected at random for each input based on its probability of occurrence. Numerous such input sequences are generated, and simulations are performed. Provided the number of runs is large, the simulation output values will be normally distributed irrespective of the probability distributions of the inputs (this follows from the Central Limit theorem described in Sect. 4.2.1). Even though nonlinearities between the inputs and output are accounted for, the accuracy of the results depends on the number of runs.
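Steps (i)-(iv) can be illustrated with a toy crude-MC propagation; the model Q = U·A·ΔT and the input distributions below are hypothetical, chosen only to show the pattern, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

# (i) Domain of inputs: two uncertain model inputs, e.g., a heat transfer
# coefficient U ~ Normal(2.0, 0.1) and an area A ~ Uniform(9, 11);
# the values are purely illustrative.
n = 100_000
U = rng.normal(2.0, 0.1, n)    # (ii) sample inputs from their distributions
A = rng.uniform(9.0, 11.0, n)

# (iii) Deterministic computation on each input sample (toy model Q = U*A*dT)
dT = 5.0
Q = U * A * dT

# (iv) Aggregate: the output's mean and spread summarize the propagated
# uncertainty in the response.
print(Q.mean(), Q.std())
```

With independent inputs, the output mean converges to E[U]·E[A]·ΔT = 100, and the spread of Q quantifies how the input uncertainties propagate through the (nonlinear, here multiplicative) model.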
However, given the power of modern computers, the relatively large computational effort is no longer a major limitation except in very large simulation studies. The concept of "sampling efficiency" has been used to compare different schemes of implementing MC methods and was introduced earlier (refer to Eq. 4.49 in Sect. 4.7.4 dealing with stratified sampling). In the present context of computer simulations, this term assumes a different meaning and Eq. 4.49 needs to be modified. Here, a more efficient scheme of sampling a prechosen set of input variables along with their range of variation is one which results in greater spread or variance in the resulting simulation response set with fewer simulations n (thus requiring less computing time). Say two methods, methods 1 and 2, are to be compared. Method 1 created a sample of n1 simulation runs, while method 2 created a sample of n2 runs. The resulting two sets of the response variable outputs were found to have variances σ1² and σ2². Then, the sampling efficiency of method 1 with respect to method 2 is:

ε1/ε2 = (σ1²/σ2²)/(n1/n2)    (6.18)
where (n1/n2) is called the labor ratio and σ1²/σ2² the variance ratio. MC methods have emerged as a basic and widely used generic approach to quantify variabilities associated with model predictions and for examining the relative importance of model parameters that affect model performance (Spears et al. 1994; de Wit 2003). There are different types of MC methods depending on the sampling algorithm for generating the trials (Helton and Davis 2003):

(a) Random sampling methods, which were the historic manner of explaining MC methods. They involve using random sampling for estimating integrals (i.e., for computing areas under a curve and solving differential equations). However, there is no assurance that a sample element will be generated from any subset of the sample space. Important subsets of space with low probability but high consequences are likely to be missed.

(b) Crude MC, which uses traditional random sampling where each sample element is generated independently following a prespecified distribution.

(c) Stratified MC (also called "importance sampling"), where the population is divided into groups or strata according to some prespecified criterion, and sampling is done so that each stratum is guaranteed representation (unlike the crude MC method). Thus, it has the advantage of forcing the inclusion of specified subsets of sampling space while maintaining the probabilistic character of random sampling. This method is said to be an order of magnitude more efficient than the crude MC method. A major problem is the necessity of defining the strata and calculating their probabilities, especially for high-dimension situations.
Fig. 6.27 The LHMC sampling method is based on quantiles, i.e., dividing the range of variation of each input variable into intervals of equal probability and then combining successive variables by sampling without replacement. LHMC is more efficient than basic MC sampling since it assures that each interval is sampled with the same density
(d) Latin hypercube MC (LHMC), a stratified sampling-without-replacement technique for generating a set of input vectors from a multidimensional distribution. This is often used to construct computer experiments for performing sensitivity and uncertainty analysis on complex systems. It can be viewed as a compromise procedure combining many of the desirable features of random and stratified sampling. A Latin hypercube is the generalization of the Latin square to an arbitrary number of dimensions. It is most appropriate for design problems involving computer simulations because of its higher efficiency (McKay et al. 2000). LHMC is said to yield an unbiased estimator of the mean, but the estimator of the variance is biased (the bias is unknown but generally small).
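A minimal LHMC sampler can be written in a few lines; the version below draws uniform (0, 1) marginals (samples from other distributions would be obtained by applying their inverse CDFs) and is an illustrative sketch, not code from any of the cited references.

```python
import numpy as np

rng = np.random.default_rng(1)

def latin_hypercube(m, k, rng):
    """m samples of a k-dimensional input vector, each marginal uniform
    on (0, 1), with exactly one sample per equal-probability stratum."""
    # one random value inside each of the m strata, for each variable
    u = (rng.random((m, k)) + np.arange(m)[:, None]) / m
    # pair the strata across variables at random, without replacement
    for j in range(k):
        u[:, j] = rng.permutation(u[:, j])
    return u

sample = latin_hypercube(16, 3, rng)  # m = 16 runs, k = 3 factors
# each column visits every interval [i/16, (i+1)/16) exactly once
print(np.sort(np.floor(sample * 16), axis=0)[:, 0])
```

The permutation step is what implements the random pairing without replacement: every variable still covers all m strata, but the strata are matched across variables at random.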
LHMC sampling, which can be considered to be a factorial method, is conceptually easy to grasp. Say the input variable/factor vector is of dimension k and a sample of size m is to be generated from p = [p1, p2, p3, . . ., pk]. The range of each variable pj is divided into n disjoint or non-overlapping intervals of equal probability⁹ (see Fig. 6.27, where n = 8 and m = 16), and values are selected randomly from each interval. The m values thus obtained for p1 are paired at random without replacement with similarly obtained m values for p2. These pairs are then combined in a random manner, again without replacement, with the m values of p3 to form triples. This process is continued until m samples of k-tuples are formed. Note that this method of generating samples assures that each interval/subspace is sampled with the same density; this leads to greater efficiency compared to the basic MC method (as shown in Fig. 6.27). A more detailed discussion of how to construct LHMC samples, and why this method is advantageous in terms of computer-time efficiency, is given by Dean et al. (2017). How to modify this method to deal with correlated variables has also been proposed and is briefly discussed in the next section.

6.7.4 Sensitivity Analysis for Screening
Sensitivity analysis is akin to factor screening in DOE and is different from an uncertainty analysis (treated in Sects. 3.6 and 3.7). The aim of sensitivity analysis is to explore/determine/identify the impact of input factors/variables/parameters on the predicted/simulated output variable/response, and then to quantify their relative importance.¹⁰ During many design studies, the performance of the system often depends on relatively few significant factors and to a much lesser degree on several insignificant ones (this is the Pareto principle, a rule of thumb often stated as 20/80!). The mapping between random inputs and the obtained model outputs can be explored with various techniques to determine the effects of the individual factors. The simplest manner of identifying parameter importance, appropriate for low-dimension input parameter vectors, is to generate scatterplots for each factor versus the response, which can visually reveal the relationships between them. Another possibility is to use the popular least squares techniques to construct a regression model that relates the parameters with the response. Several studies have proposed using partial correlation coefficients of linear regression (Sect. 5.4.5) or using stepwise regression (Sect. 5.4.6) for ranking
⁹ Recall that quantiles split sorted data or a probability distribution into equal parts.
¹⁰ Uncertainty analyses, on the other hand, use probabilistic values of model inputs to estimate probability distributions of model outputs.
Fig. 6.28 Figure illustrating linear and nonlinear sensitivity behavior of the annual energy use of a building under different sets of design variables (from Lam and Hui 1996). (a) Linear effect of four different options of glazing design. (b) Nonlinear effect of three different external shading designs
parameter importance and for detecting parameter interaction (one such study is that by Wang et al. 2014). The use of linear regression is questionable; it may be suitable when system performance is linear (as exhibited by energy use in certain types of buildings and their HVAC system behavior, for certain design parameters, for a narrow range of parameter variation, etc.), but may not be of general applicability (see Fig. 6.28). Data mining methods, such as the random forest algorithm (Sect. 11.5), are generally superior to regression-based methods in both screening and ranking parameters for building energy design (see, e.g., Dutta et al. 2016). There are several more formal sensitivity analysis methods (see Saltelli et al. 2000 and relevant technical papers). In such methods, the analyst is often faced with the difficult task of selecting the one most appropriate for the application at hand. A report by Iman and Helton (1985) compares different sensitivity analysis methods as applied to complex engineering systems and summarizes current knowledge in this area. Note that the modeling equations on which the simulation is based are often nonlinear in the parameters, and even if linear, the parameters may interact. In that case, the sensitivity of a parameter/factor may vary from point to point in the parameter space, and random samples in different regions will be required for proper analysis. Thus, one needs to distinguish between local sensitivity analysis (LSA) and global sensitivity analysis (GSA) (Heiselberg et al. 2009). There are two types of sensitivities: (i) LSA (or the one-parameter-at-a-time method), which describes the influences of individual design parameters on system response with all other parameters held constant at their standard or base values, i.e., at a specific local region in the parameter space. It is satisfactory for linear models and when the sensitivity of
each individual input is independent of the value of the other inputs (often not true), (ii) GSA which provides insight into the inﬂuence of a single design parameter on system response when all other parameters are varied together based on their individual ranges and probability distributions. These two types are discussed below. (a) LSA or local sensitivity analysis The general approach to determining individual sensitivity coefﬁcients is summarized below: (i) Formulate a base case reference and its description. (ii) Study and break down the factors into basic parameters (parameterization). (iii) Identify parameters of interest and determine their base case values. (iv) Determine which simulation outputs are to be investigated and their practical implications. (v) Introduce perturbations to the selected parameters about their base case values one at a time. Factorial and fractional factorial designs are commonly used. (vi) Study the corresponding effects of the variation/perturbation on the simulation outputs. (vii) Determine the sensitivity coefﬁcients for each selected parameter. The ones with the largest values are deemed more signiﬁcant. Sensitivity coefﬁcients (also called, elasticity in economics, as well as inﬂuence coefﬁcients) are deﬁned in various ways as shown in Table 6.29. All these formulae involve discrete changes to the variables denoted by step change Δ as against partial derivatives. The ﬁrst form is simply the
256
6
Design of Physical and Simulation Experiments
Table 6.29 Different forms of sensitivity coefficient (from Lam and Hui 1996)

Form | Formula | Dimension | Common name(s)
1  | ΔOP / ΔIP | With dimension | Sensitivity coefficient, influence coefficient
2a | (ΔOP / OP_BC) / (ΔIP / IP_BC) | % OP change per % IP change | Influence coefficient, point elasticity
2b | (ΔOP / OP_BC) / ΔIP | With dimension | Influence coefficient
3a | [ΔOP / ((OP1 + OP2)/2)] / [ΔIP / ((IP1 + IP2)/2)] | % OP change per % IP change | Arc midpoint elasticity, meant for two inputs
3b | (ΔOP / ΔIP) × (mean IP / mean OP) | % OP change per % IP change | (See note 2)

Notes:
1. ΔOP, ΔIP = changes in output and input, respectively; OP_BC, IP_BC = base case values of output and input, respectively; IP1, IP2 = two values of input; OP1, OP2 = two values of the corresponding output; mean OP, mean IP = mean values of output and input, respectively.
2. Form 3b is the slope of the linear regression line divided by the ratio of the mean output to the mean input values.
derivative of the output variable (OP) with respect to the input parameter (IP). The second group uses the base case values to express the sensitivity as a percentage change, while the third group uses the mean values to express the percentage change (this is similar to the forward differencing and central differencing approaches used in numerical methods). Form (1) is the local sensitivity coefficient and is the simplest to interpret. Forms (2a), (3a), and (3b) have the advantage that the sensitivity coefficients are dimensionless. However, form (3a) can only be applied to a one-step change and cannot be used for multiple sets of parameters. In general, such methods are of limited use for simulation models with complex sets of nonlinear functions and a large set of input variables.

Figure 6.28 illustrates the linear and nonlinear behavior of two different sets of design strategies. The first set consists of four glazing design variables, while the second set consists of three external shading variables (all the variables are described in the figure). Variations in the parameters related to the four different window designs (Fig. 6.28a) are linear, and the local sensitivity method would provide the necessary insight. Further, it is clear that the shading coefficient (SC) has the largest impact on annual energy use. The behavior of the sensitivity coefficients of the external shading designs likely to impact the energy use of the building is clearly nonlinear (Fig. 6.28b). The projection ratio of the egg-crate external shading design (shown as EG) has the largest impact on energy use. Further, all three external shading designs exhibit an exponential asymptotic behavior.

(b) GSA or global sensitivity analysis
Global sensitivity methods allow parameter sensitivity analysis and ranking for nonlinear models over the entire design space.
The simplest approaches applied to the design of energy-efficient buildings include traditional randomized factorial two-level sampling designs (e.g., Hou et al. 1996), variations of the Latin squares, and CCD in conjunction with quadratic regression (e.g., Snyder et al. 2013). These approaches are suitable when the number of input variables is relatively low (say, up to 6–7 variables with 2 or 3 levels). However, for a larger number of variables, such approaches are usually infeasible since detailed simulation programs are computationally intensive and have long runtimes. One cannot afford to perform separate simulation runs for sensitivity analysis and for acquiring the system response for different input vector combinations. While being an attractive way of generating multidimensional input vector samples for extensive computer experiments, LHMC is also the most popular method for performing sensitivity studies, i.e., for screening variable importance (Hofer 1999). Once the LHMC simulation runs have been performed, one can identify the strong or influential parameters and/or the weak ones based on the results for the response variable. For example, the designer of an energy-efficient building would be more concerned with identifying the subset of input parameters that are more likely to lead to low annual energy consumption (such parameters can be referred to as "strong" or influential parameters). This is achieved by a process called regional sensitivity analysis. If the "weak" parameters can be fixed at their nominal values and removed from further consideration during design, the parameter space would be reduced enormously, somewhat alleviating the "curse of dimensionality."

There are several approaches one could adopt; a rather simple and intuitive statistical method is described below. Assume that m "candidate" input/design parameters were initially selected, with each parameter discretized into three levels or states (i.e., n = 3). The necessary number of LHMC runs (say, 1,000) are then conducted.
Out of these 1,000 runs, it was found that only 30 runs had response variable values in the acceptable range corresponding to 30 speciﬁc vectors of input parameters. One would expect
6.7 Simulation Experiments
257
Table 6.30 Critical thresholds for the chi-square statistic with different significance levels for 2 degrees of freedom

α:         0.001    0.005    0.01     0.05     0.2      0.3      0.5      0.9
Threshold: 13.815   10.597   9.210    5.991    3.219    2.408    1.386    0.211

the influential input parameters to appear more often in one level in this "acceptable subset" of input parameter vectors than the weak or noninfluential ones. In fact, the latter are likely to be randomly distributed among the acceptable subset of vectors. Thus, the extent to which the number of occurrences of an individual parameter differs from 10 within each discrete state would indicate whether this parameter is strong or weak. This is a type of sensitivity test where the weak and strong parameters are identified using nonrandom pattern tests (Saltelli et al. 2000). The well-known chi-square (χ²) test for comparing distributions (see Sect. 2.4.3g) can be used to assess statistical independence for each and every parameter. First, the χ² statistic is computed for each of the m input variables as:

χ² = Σ_{s=1}^{3} (p_{obs,s} − p_exp)² / p_exp    (6.19)

where p_obs is the observed number of occurrences, p_exp is the expected number (in the example above, this will be 10), and the subscript s refers to the index of the state (in this case, there are three states). If the observed number is close to the expected number, the χ² value will be small, indicating that the observed distribution fits the theoretical distribution closely. This would imply that the particular parameter is weak, since the corresponding distribution can be viewed as being random. Note that this test requires that the degrees of freedom (d.f.) be selected as (number of states − 1), i.e., in our case d.f. = 2. The critical values for the χ² distribution for different significance levels α are given in Table 6.30. If the χ² statistic for a particular parameter is greater than 9.210, one could assume it to be very strong, since the associated statistical probability is greater than 99%. On the other hand, a parameter having a value below 1.386 (α = 0.5) could be considered weak, and those in between the two values as uncertain in influence.

(c) The Morris method
A method superior to the one described above for performing GSA is the variance-based approach, which depends on the decomposition of the variance of the response variable. One such variant has proven to be particularly attractive for many practical design problems. The elementary effects method, also referred to as the Morris method (Morris 1991), is an MC approach utilizing random sampling and a one-at-a-time (OAT) approach to generate vectors of input parameters. It is an extension of the derivative-based methods (see Table 6.29 for different types of working definitions). It includes variance as an additional sensitivity index for screening the global space and allows combining first- and second-order sensitivity analysis. Each of the k input variables is defined within a range of a continuous variable, normalized by min-max scaling (see Eq. 3.15) and discretized into subintervals (equal to the number of levels selected for each parameter) with equal probability (see Fig. 6.27). It uses an LHMC input vector generation method where the pairing of successive parameter combinations retains their correlation behavior. A trajectory is defined as a set of (k + 1) simulation runs, or a vector/sequence of the k parameters, with each successive point differing from the preceding one in one variable value only, by a multiple of a predefined step size Δi. Multiple trajectories need to be generated during the parametric design.

How the vector sequence is generated is illustrated using an example with four parameters in Table 6.31.

Table 6.31 Simple example of how the Morris method generates a trajectory of 5 simulation runs with four design parameters (k = 4)

                  Parameter k1 | Parameter k2 | Parameter k3 | Parameter k4
Run 0 (baseline)  k1,1 | k2,1 | k3,1 | k4,1
Run 1             k1,2 | k2,1 | k3,1 | k4,1
Run 2             k1,2 | k2,2 | k3,1 | k4,1
Run 3             k1,2 | k2,2 | k3,2 | k4,1
Run 4             k1,2 | k2,2 | k3,2 | k4,2

Note: once a parameter has been resampled (second subscript 2), its new value is frozen for all subsequent runs in the trajectory.

The baseline run (Run 0) is generated by randomly selecting the discrete levels of each parameter (indicated as k1,1, k2,1, . . ., k4,1). For Run 1, one of the parameters, in this case k1, is resampled with the other three parametric values unchanged. Run 2 consists of randomly selecting one of the remaining parameters (the selection should not be sequential; in this case, k2 has been selected for easier comprehension) and then randomly resampling it (shown as k2,2). This is repeated for the remaining parameters k3 and k4. Hence, only one parameter is resampled for each run, and that value is frozen for the remaining subsequent runs in the trajectory. The second simulation trajectory is created by randomly selecting a new set of combinations for Run 1 and using the OAT approach for subsequent runs, similar to the first trajectory.
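The trajectory construction of Table 6.31 can be sketched in code. The following is a minimal illustration, not an implementation from the book: the number of parameters, the number of levels, and the random seed are arbitrary choices, and levels are represented simply as integer indices.

```python
import random

# Sketch of Morris trajectory generation: a baseline run plus k one-at-a-time
# changes, with each resampled parameter staying frozen afterwards.
def morris_trajectory(k, levels, rng):
    """Return k + 1 runs; run j differs from run j-1 in exactly one parameter."""
    point = [rng.randrange(levels) for _ in range(k)]   # baseline (Run 0)
    runs = [point[:]]
    order = rng.sample(range(k), k)                     # random, non-sequential order
    for i in order:
        new_level = rng.choice([l for l in range(levels) if l != point[i]])
        point[i] = new_level                            # change one parameter...
        runs.append(point[:])                           # ...and freeze it from now on
    return runs

rng = random.Random(1)
traj = morris_trajectory(k=4, levels=3, rng=rng)
for run in traj:
    print(run)
```

Each printed run would then be handed to the simulation program as one input vector of the trajectory.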
It has been found that only a relatively small number of such trajectory simulation runs are needed. Once the
Fig. 6.29 Variation of the mean and standard deviation (μ* and σ) for the 23 building design variables chosen, with annual energy use as the response variable (from Didwania et al. 2023). Only the seven design variables falling outside the indicated envelope are influential; the others are shown anonymously as dots
simulations are performed, the elementary effect (EE) of an individual parameter or factor i can be calculated for all points in the trajectory as follows (Sanchez et al. 2014):

EE_i = [y(k + e_i Δ_i) − y(k)] / Δ_i    (6.20)

where y is the response variable (or design criterion or objective function) and Δ_i is the predefined step size. The term e_i is a vector of zeros except for its ith component, which takes on integer values by which different levels of the discretized parameter can be selected. Each trajectory with (k + 1) simulation runs provides an estimate of the elementary effects for each of the k parameters or variables. A set of r such different trajectories is defined, so the total number of simulation runs = [r × (k + 1)]. The average μ_i and standard deviation σ_i of the elementary effects are computed for each parameter i over the r trajectories t:

μ_i = (1/r) Σ_{t=1}^{r} EE_{i,t}    (6.21)

σ_i = [ 1/(r − 1) Σ_{t=1}^{r} (EE_{i,t} − μ_i)² ]^{1/2}    (6.22)
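Equations 6.20–6.22 can be sketched for a single parameter as follows. The response values, step size, and number of trajectories below are invented for illustration; note that the commonly used indicator μ* is the mean of the absolute elementary effects.

```python
import statistics

# Sketch of Eqs. 6.20-6.22: elementary effects of one parameter collected from
# r trajectories, then their mean and standard deviation. The (perturbed, base)
# response pairs and the step size are hypothetical.
def elementary_effect(y_perturbed, y_base, delta):
    """Eq. 6.20 for one trajectory point."""
    return (y_perturbed - y_base) / delta

delta = 0.25
# (y(k + e_i*delta), y(k)) pairs from r = 4 hypothetical trajectories:
runs = [(108.0, 100.0), (105.0, 101.0), (112.0, 102.0), (104.0, 99.0)]
ees = [elementary_effect(yp, yb, delta) for yp, yb in runs]

mu = statistics.mean(ees)                        # Eq. 6.21
sigma = statistics.stdev(ees)                    # Eq. 6.22 (1/(r-1) form)
mu_star = statistics.mean(abs(e) for e in ees)   # mu* uses absolute EEs
print(mu, mu_star, sigma)
```

In a full screening study, this computation is repeated for each of the k parameters, and the (μ*, σ) pairs are plotted as in Fig. 6.29.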
The ensemble of trajectory runs is analyzed by computing and plotting two statistical indicators, namely the mean and standard deviation (μ* and σ) of the response to each design or input parameter. The relative importance of each parameter on the response variable, and of parameter interactions on the response, can be determined as follows:¹¹
1. Negligible – low average (μ*) and low standard deviation (σ)
2. Linear and additive – high average (μ*) and low standard deviation (σ)
3. Nonlinear or presence of interactions – high standard deviation (σ)

¹¹ Note that this process is akin to the regional sensitivity approach using the chi-square test given by Eq. 6.19.

How the final selection is done based on the above criteria is illustrated in Fig. 6.29. The points falling outside the zone of influence indicated by a curved line are deemed influential. One notes that, out of the 23 parameters investigated, only 7 are significant and interactive. A lower μ* indicates that changing the variable will not have a substantial impact on the objective function; a lower σ suggests that its impact on the objective function is not affected by other parameters and is not nonlinear. If the model parameters have a significantly nonlinear effect, then it is suggested that an additional analysis involving second-order effects be performed, as described by Sanchez et al. (2014). The Morris method is said to require far fewer simulation trajectory runs (i.e., to be more efficient) than the traditional LHMC method when a large number of parameters with possible interaction and second-order effects are being screened. The greater the number of trajectory simulation runs, the greater the accuracy, but studies have shown that 15–20 trajectory runs are adequate with 25 or so design variables. The Morris method has also been combined with the parallel coordinate graphical representation to enhance the ability of architects and designers to visually explore different combinations of sustainable building design that meet prestipulated ranges of variation of multicriteria objective functions (Didwania et al. 2023).

(d) Closure
Helton and Davis (2003) cite over 150 references in the area of sensitivity analysis, discuss the clear advantages of LHMC for the analysis of complex systems, and enumerate the reasons for the popularity of such methods. Another popular GSA method similar to the Morris method, meant for complex mathematical models, is the Sobol method (Sobol 2001). It generates a sample that is uniformly distributed over the unit
hypercube and uses variance-based global sensitivity indices to determine first-order, second-order, and total effects of individual variables or groups of variables on the model output. It is said to be more stable than the Morris method but requires more simulation runs. The results of both methods have been reported to be similar by Didwania et al. (2023) for a case study involving the design of energy-efficient buildings. In addition to the LHMC method and the variance-based GSA methods, variable screening and ranking of variables/factors by importance can also be done by data mining methods (such as the random forest algorithm discussed in Sect. 11.5). Dutta et al. (2016) illustrate this approach in the context of energy-efficient building design as a more attractive alternative to the CCD design approach reported by Snyder et al. (2013) and described in Problem 11.14.
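The variance-decomposition idea behind such indices can be illustrated with a brute-force double-loop estimate of the first-order index S_i = Var(E[y | x_i]) / Var(y). This is only a conceptual sketch with an invented linear toy model and arbitrary sample sizes, not the efficient quasi-random sampling scheme of the actual Sobol method:

```python
import random

# Brute-force first-order sensitivity indices for a toy "simulation" model.
# S_i = Var(E[y | x_i]) / Var(y); all inputs uniform on [0, 1).
def model(x1, x2, x3):
    return 4 * x1 + 2 * x2 + 0.5 * x3   # hypothetical linear response

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

rng = random.Random(0)
n_outer, n_inner = 200, 200
y_all = [model(rng.random(), rng.random(), rng.random())
         for _ in range(n_outer * n_inner)]
var_y = variance(y_all)

s = []
for i in range(3):
    cond_means = []
    for _ in range(n_outer):
        xi = rng.random()                      # fix parameter i at one value
        ys = []
        for _ in range(n_inner):
            x = [rng.random() for _ in range(3)]
            x[i] = xi
            ys.append(model(*x))
        cond_means.append(sum(ys) / n_inner)   # E[y | x_i = xi]
    s.append(variance(cond_means) / var_y)     # Var of the conditional means

print([round(v, 2) for v in s])
```

For this model the exact indices are 16/20.25 ≈ 0.79, 4/20.25 ≈ 0.20, and 0.25/20.25 ≈ 0.01, so the estimates correctly rank x1 as the dominant variable.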
6.7.5 Surrogate Modeling
Fig. 6.30 Simple example with two design variables to visually illustrate the surrogate modeling iterative approach as applicable to the design of an energy-efficient building. The dots indicate the preliminary computer simulation runs, which provide insights into how to narrow the solution space for subsequent simulation runs. (From Westermann and Evins 2019)

All the experimental design methods discussed above are referred to as static sampling approaches since all the samples are defined prior to running the batch of simulations and are not adjusted depending on simulation outcomes. A more efficient approach, which speeds up convergence while requiring fewer simulation runs, is iterative sampling. This is akin to RSM which, as described in Sect. 6.6, is an iterative approach that accelerates the search toward the optimum condition by simultaneously varying more than one variable at a time. It proceeds by first identifying a model between a response and a set of several
continuous treatment variables over an initial limited solution space, analyzes test results to determine the optimal direction to move, performs a second set of test conditions, and so on until the desired optimum is reached. A similar methodology for iteratively reducing the number of simulations during the search for an optimum, and for speeding up convergence, can be adopted for detailed computer simulation programs. Specifically, the following steps are undertaken:
(i) Initially select a relatively small set of specific values and pairings of the input parameters (akin to performing factorial experiments).
(ii) Make multiple runs of the computer model and perform a sensitivity analysis to determine a subset of dominant model input parameter combinations (akin to screening).
(iii) Fit an appropriate mathematical model between the response and the set of independent variables (usually a second-order polynomial model with/without interactions).
(iv) Use this fitted response surface polynomial model as a surrogate (replacement or proxy) for the computer model to rapidly revise/shrink the original search space.
(v) Repeat steps (ii) to (iv) till the desired optimal/satisficing design solution is reached.

Figure 6.30 visually illustrates the general approach in a conceptually clear manner for a simple case involving only two design variables. An energy-efficient building is to be designed by varying two design variables: WWR (window-to-wall ratio) and SHGC (solar heat gain coefficient of the window). The traditional factorial-type experimental design would require that the solution space be uniformly blanketed to find a satisfactory or optimal solution. The same insight
could be provided by fewer simulation runs by performing an initial set of limited runs (indicated as dots), selecting a finer grid in the subspace of interest, and then iteratively zeroing in on the design solution. This approach is akin to the response surface modeling (RSM) approach described in Sect. 6.6. The significant benefit of the mathematical surrogate model approach is that it allows the solution to be reached more quickly using calculus methods than by computer simulations alone. However, a higher level of domain knowledge and analytical skill is demanded of the analyst. A good review of surrogate modeling techniques in general can be found in Dean et al. (2017), while their application to the design of sustainable buildings, along with an extensive literature review, is provided by Westermann and Evins (2019).
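The iterative narrowing idea of Fig. 6.30 can be sketched with a cheap analytic stand-in for the building simulation. The quadratic "energy model" and the search settings below are invented for illustration; a real study would call a detailed simulation program (and typically fit a polynomial surrogate) instead of evaluating a closed-form function:

```python
# Sketch of iterative search-space refinement with two design variables
# (WWR = window-to-wall ratio, SHGC = solar heat gain coefficient).
def annual_energy(wwr, shgc):
    """Hypothetical annual energy use (kWh/m2) with a minimum near (0.3, 0.4)."""
    return 100 + 80 * (wwr - 0.3) ** 2 + 60 * (shgc - 0.4) ** 2

def refine(bounds, levels=5, iterations=3):
    """Evaluate a coarse grid, then repeatedly shrink the grid around the best point."""
    (w_lo, w_hi), (s_lo, s_hi) = bounds
    best = None
    for _ in range(iterations):
        grid_w = [w_lo + i * (w_hi - w_lo) / (levels - 1) for i in range(levels)]
        grid_s = [s_lo + i * (s_hi - s_lo) / (levels - 1) for i in range(levels)]
        best = min((annual_energy(w, s), w, s) for w in grid_w for s in grid_s)
        _, w_best, s_best = best
        # Shrink the search box to one grid cell around the current best point.
        w_half = (w_hi - w_lo) / (levels - 1)
        s_half = (s_hi - s_lo) / (levels - 1)
        w_lo, w_hi = w_best - w_half, w_best + w_half
        s_lo, s_hi = s_best - s_half, s_best + s_half
    return best

energy, wwr, shgc = refine(bounds=((0.1, 0.9), (0.1, 0.9)))
print(round(energy, 3), round(wwr, 3), round(shgc, 3))
```

With 5 levels and 3 iterations, this uses 75 "simulation" calls, far fewer than uniformly blanketing the space at the final grid resolution would require.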
6.7.6 Summary
Sensitivity analysis as pertinent to computer-based simulation design involves three major stages:

Stage 1: Preprocessing or selection of independent design variable combinations
The preprocessing stage involves selecting design variables of interest and identifying practical ranges based on the building type, project requirements, and owner specifications. If nonlinear relationships between predictors and response are suspected, then a minimum of three levels (which allow nonlinear and interaction effects to be explicitly captured) should be selected for each variable for the traditional factorial design methods, while two levels can be used for a rotatable CCD design. The number of evaluative combinations increases exponentially with the number of factors and levels. For example, 15 variables at three levels would lead to 3^15 ≈ 14 × 10^6 combinations, an impractical number of simulations. An experimental design technique is essential to select fewer runs while ensuring stratified (representative) sampling of the variable space. LHMC is popular since it is numerically efficient and its results are easy to interpret when performing sensitivity and uncertainty analyses of complex systems with many design parameters (Helton and Davis 2003; Heiselberg et al. 2009). Alternatively, a more traditional CCD design could also be adopted if the initial variable set is relatively small (say, less than 6–7 variables).¹² Since the samples are defined prior to simulation and not adjusted
depending on simulation outcomes, this approach is called static sampling (more popular and easier to implement than the iterative approach).

Stage 2: Simulation-based generation of system responses
The selected variable combinations can now be input into an hourly building energy simulation program for batch processing. The responses could be direct outputs of the chosen simulation program, such as annual energy use and peak demand, or derived metrics such as energy costs or environmental impacts. A database of such simulation outputs is created, which is then used in the next stage.

Stage 3: Postprocessing of simulation results
Traditionally, least squares regression analysis has been the most popular technique for developing a model to predict energy consumption over the range of variation of the numerous design parameters. However, many studies have found that global regression techniques are questionable for this purpose and that nonparametric data mining methods, such as the random forest algorithm (described in Sect. 11.5), are superior both in feature selection (identification of the most influential design variables) and as global prediction models.
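Stage 1's stratified sampling can be sketched with a minimal Latin hypercube generator. The variable names and ranges below are invented examples; the pairing across variables is purely random here, whereas refined LHMC variants also control correlation between columns:

```python
import random

# Sketch of Latin hypercube sampling of k design variables with n strata each:
# every variable's range is cut into n equal-probability bins, and each bin is
# sampled exactly once, with a random pairing across variables.
def latin_hypercube(ranges, n, rng):
    """Return n samples of len(ranges) variables, one point per stratum."""
    k = len(ranges)
    samples = [[0.0] * k for _ in range(n)]
    for j, (lo, hi) in enumerate(ranges):
        strata = list(range(n))
        rng.shuffle(strata)             # random pairing across variables
        for i, s in enumerate(strata):
            u = (s + rng.random()) / n  # one uniform draw inside stratum s
            samples[i][j] = lo + u * (hi - lo)
    return samples

rng = random.Random(42)
design = latin_hypercube([(0.1, 0.9),    # hypothetical window-to-wall ratio
                          (0.2, 0.8),    # hypothetical solar heat gain coefficient
                          (1.0, 5.0)],   # hypothetical wall insulation RSI value
                         n=10, rng=rng)
print(len(design), len(design[0]))
```

Each row of `design` would then be one input vector for a batch simulation run in Stage 2.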
Problems

Pr. 6.1 Full-factorial design for evaluating three different missile systems¹³
A full-factorial experiment is conducted to determine which of three different missile systems is preferable. The propellant burning rate for 24 static firings was measured using four different propellant types. The experiment performed duplicate observations (replicate r = 2) of burning rates (in minutes) at each combination of the treatments. The data, after coding, are given in Table 6.32.

Table 6.32 Burning rates in minutes for the (3 × 4) case with two replicates (Problem 6.1)

Missile system | Propellant b1 | b2 | b3 | b4
A1 | 34.0, 32.7 | 30.1, 32.8 | 29.8, 26.7 | 29.0, 28.9
A2 | 32.0, 33.2 | 30.2, 29.8 | 28.7, 28.1 | 27.6, 27.8
A3 | 28.4, 29.3 | 27.3, 28.9 | 29.7, 27.3 | 28.8, 29.1

Data available electronically on book website
¹² Refer to Problem 6.14 in the context of energy-efficient building design.
¹³ From Walpole et al. (2007), by permission of Pearson Education.
The following hypothesis tests are to be studied:
(i) There is no difference in the mean propellant burning rates when different missile systems are used.
(ii) There is no difference in the mean propellant burning rates of the four propellant types.
(iii) There is no interaction between the different missile systems and the different propellant types.

Pr. 6.2 Random effects model for worker productivity
A full-factorial experiment was conducted to study the effect of indoor environment conditions (depending on such factors as dry bulb temperature, relative humidity, . . .) on the productivity of workers manufacturing widgets. Four groups of workers, distinguished by such traits as age, gender, . . ., were selected and called G1, G2, G3, and G4. The number of widgets produced over a day by two members of each group under three different environmental conditions (E1, E2, and E3) was recorded. These results are assembled in Table 6.33. Using the 0.05 significance level, test the hypotheses that: (a) different environmental conditions have no effect on the number of widgets produced, (b) different worker groups have no effect on the number of widgets produced, and (c) there are no interaction effects between both factors. Subsequently, identify a suitable random effects model, study model residual behavior, and draw relevant conclusions.

Table 6.33 Number of widgets produced daily using a replicate r = 2 (Problem 6.2)

Environmental condition | G1 | G2 | G3 | G4
E1 | 227, 221 | 214, 259 | 225, 236 | 260, 229
E2 | 187, 208 | 181, 179 | 232, 198 | 246, 273
E3 | 174, 202 | 198, 194 | 178, 213 | 206, 219

Data available electronically on book website

Pr. 6.3 Two-factor two-level (2²) factorial design (complete or balanced)
Consider a brand of variable speed electric motor which is meant to operate at different ambient temperatures (factor X) and at different operating speeds (factor Y). The time to failure in hours (the response variable) has been measured for the four different treatment groups (conducted in a randomized manner) at a replication level of three (Table 6.34).

Table 6.34 Data table for Problem 6.3

Factor X \ Factor Y | Low | High
Low  | 82, 78, 86 | 64, 61, 57
High | 67, 74, 67 | 46, 54, 52

(a) Code the data in the table using the standard form suggested by Yates (see Table 6.10).
(b) Generate the main effect and interaction plots and summarize observations.
(c) Repeat the analysis procedure described in Example 6.4.2 and identify a suitable prediction model for the response variable at the 0.05 significance level.
(d) Compare this model with one identified using linear OLS multiple regression.

Pr. 6.4 The thermal efficiency of solar thermal collectors decreases as their average operating temperature increases. One means of improving thermal performance is to use selective surfaces for the absorber plates, which have the special property that the absorption coefficient is high for solar radiation and low for infrared radiative heat losses. Two collectors, one without a selective surface and one with, were tested at four different operating temperatures with replication r = 4. The experimental results for thermal efficiency in % are tabulated in Table 6.35.
(i) Perform an analysis of variance to test for significant main and interaction effects.
(ii) Identify a suitable random effects model.
(iii) Identify a linear regression model and compare your results with those from part (ii).
(iv) Study model residual behavior and draw relevant conclusions.

Table 6.35 Thermal efficiencies (%) of the two solar thermal collectors (Problem 6.4)

Mean operating temperature (°C) | 80 | 70 | 60 | 50
Without selective surface | 28, 34, 38, 29 | 31, 32, 33, 35 | 34, 39, 41, 38 | 40, 42, 41, 41
With selective surface | 33, 38, 41, 36 | 33, 34, 38, 36 | 35, 40, 43, 42 | 43, 45, 44, 45

Data available electronically on book website

Pr. 6.5 Complete factorial design (3²) with replication
The carbon monoxide (CO) emissions in g/m³ from automobiles (the response variable) depend on the amount of ethanol added to a standard fuel (Factor A) and the air/fuel ratio (Factor B). A standard 3^k factorial design (i.e., three levels) with k = 2 and two replicates results in the values shown in
Table 6.36 Data table for Problem 6.5

Trial # | Factor A | Factor B | CO emissions (response Y)
1 | −1 | −1 | 66, 62
2 |  0 | −1 | 78, 81
3 | +1 | −1 | 90, 94
4 | −1 |  0 | 72, 67
5 |  0 |  0 | 80, 81
6 | +1 |  0 | 75, 78
7 | −1 | +1 | 68, 66
8 |  0 | +1 | 66, 69
9 | +1 | +1 | 60, 58

Data available electronically on book website
Table 6.36; low, medium, and high levels are coded as −1, 0, and +1.
(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Confirm that this is an orthogonal array (by taking the inverse of (X^T X)).
(d) Using the matrix inverse approach (see Eq. 6.11), identify a factorial design model.
(e) Perform a multiple linear regression analysis with only the significant terms and identify a suitable prediction model for the response variable at the 0.05 significance level.

Pr. 6.6 Composite design
The following data were experimentally collected using a composite design (Table 6.37):

Table 6.37 Data table for Problem 6.6

X1 | X2 | Y
−1 | −1 | 2
+1 | −1 | 4
−1 | +1 | 3
+1 | +1 | 5
−2 |  0 | 1
+2 |  0 | 4
 0 | −2 | 1
 0 | +2 | 5
 0 |  0 | 3

Data available electronically on book website

(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Identify a prediction model.
(d) Using least squares, evaluate first-order and second-order models. Compare them with the model identified in step (c).
(e) Determine the optimal value.
(f) Criticize the analysis and suggest improvements to the design procedure (for example, more center points, replicates, more decimal places in the response variable, . . . Is this a rotatable design?).

Pr. 6.7 The close similarity between a factorial design model and a multiple linear regression model was illustrated in Example 6.4.2. You will repeat this exercise with data from Example 6.4.3.
(a) Identify a multiple linear regression model and verify that the parameters of all regressors are identical to those of the factorial design model.
(b) Verify that the model coefficients do not change when the multiple linear regression is redone with the reduced model using variables coded as −1 and +1.
(c) Perform a forward stepwise linear regression and verify that you get back the same reduced model with the same coefficients.
Pr. 6.8 2³ factorial analysis for strength of concrete mix
A civil construction company wishes to maximize the strength of its concrete mix with three factors or variables: A (water content), B (coarse aggregate), and C (silica). A 2³ full-factorial set of experimental runs, consistent with the nomenclature of Table 6.10, was performed. The results are assembled in Table 6.38.
(a) Analyze these data, generate the ANOVA table, and identify statistically meaningful terms. (Hint: You will find that none are significant, which is probably due to d.f. = 1 for the residual error. It would have been more robust to do replicate testing.)
(b) Analyze the data using stepwise multiple linear regression and identify the statistically significant factors and interactions (at 0.005 significance).
(c) Identify the complete linear regression model using all main and interaction terms and verify that the model coefficients of the statistically significant terms are identical to those from stepwise regression (step b). This is one of the major strengths of the 2^k factorial design.
(d) Develop a factorial design model for this problem using only the statistically significant terms (Table 6.38).
Pr. 6.9 Predictive model inferred from 23 factorial design on a large laboratory chiller14 Table 6.39 assembles steadystate data of a 23 factorial series of laboratory tests conducted on a 90 ton centrifugal chiller. There are three response variables (Tcho = chilled water leaving the evaporator, Tcdi = cooling water entering the condenser, and Qch = chiller cooling load) with two levels each, thereby resulting in 8 data points without any replication. Note that there are small differences in the high and low levels of each of the factors because of operational control variability during testing. The chiller coefﬁcient of performance (COP) is the response variable. (a) Perform an ANOVA analysis, and check the importance of the main and interaction terms using the 8 data points indicated in the table. (b) Identify the parsimonious predictive model from the above ANOVA analysis. (c) Identify a least square regression model with coded variables and compare the model coefﬁcients with those from the model identiﬁed in part (b). Table 6.38 Data table for Problem 6.8
Trial  Level of factors          Response:
       A     B     C             Replication 1
1      -1    -1    -1            58.27
2      +1    -1    -1            55.06
3      -1    +1    -1            58.73
4      +1    +1    -1            52.55
5      -1    -1    +1            54.88
6      +1    -1    +1            58.07
7      -1    +1    +1            56.60
8      +1    +1    +1            59.57
(d) Generate model residuals and study their behavior (influential outliers, constant variance, and near-normal distribution).
(e) Reframe both models in terms of the original variables and compare the internal prediction errors.
(f) Using the four data sets indicated in the table as holdout points meant for cross-validation, compute the NMSE, RMSE, and CV values of both models. Draw relevant conclusions.

Pr. 6.10 Blocking design for machining time
Consider Example 6.5.1, where the performance of four machines was analyzed in terms of machining time, with operator dexterity being a factor to be blocked. How to identify an additive linear model was also illustrated. It was pointed out that interaction effects may be important.
(a) Reanalyze the data to determine whether the interaction terms are statistically significant or not.
(b) It was noted from the residual plots that two of the extreme data points were suspect. Can you redo the analysis while accounting for this fact?

Pr. 6.11 Latin square design with k = 3
Reduction in nitrogen oxides due to gasoline additives (the main treatment factor) is to be analyzed for different types of automobiles (factor X) under different drivers (factor Y). One does not expect interaction effects, and so a Latin square series of experiments is conducted assuming five different levels for all three factors. The following table assembles the results for this design, with nitrogen oxide emissions shown as continuous numerical values. The different gasoline additives are the treatments, coded as A, B, C, D, and E (Table 6.40).
Table 6.38, Response: Replication 2 (trials 1–8): 57.32, 55.53, 57.95, 53.09, 55.2, 58.76, 56.16, 58.87
Data available electronically on book website
Table 6.39 Laboratory tests from a centrifugal chiller (Problem 6.9)

Data for model development:
Test #  Tcho (°C)  Tcdi (°C)  Qch (kW)  COP
1       10.940     29.816     315.011   3.765
2       10.403     29.559     103.140   2.425
3       10.038     21.537     289.625   4.748
4        9.967     18.086     122.884   3.503
5        4.930     27.056     292.052   3.763
6        4.541     26.783     109.822   2.526
7        4.793     21.523     354.936   4.411
8        4.426     18.666     114.394   3.151

Data for cross-validation:
Test #  Tcho (°C)  Tcdi (°C)  Qch (kW)  COP
1        7.940     29.628     286.284   3.593
2        7.528     24.403     348.387   4.274
3        6.699     24.288     188.940   3.678
4        7.306     24.202      93.798   2.517

Data available electronically on book website

14 Adapted from a more extensive table from data collected by Comstock and Braun (1999). We are thankful to James Braun for providing this data.
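For parts (c) and (f) of Pr. 6.9, a minimal numpy sketch that codes the three factors onto [−1, +1] from their observed ranges and fits a main-effects least-squares model to the eight development points of Table 6.39; extending it to interaction terms and to the holdout set follows the same pattern. The min–max coding rule used here is one common convention, not the only one:

```python
import numpy as np

# Model-development data from Table 6.39 (Tcho degC, Tcdi degC, Qch kW, COP)
dev = np.array([
    [10.940, 29.816, 315.011, 3.765],
    [10.403, 29.559, 103.140, 2.425],
    [10.038, 21.537, 289.625, 4.748],
    [ 9.967, 18.086, 122.884, 3.503],
    [ 4.930, 27.056, 292.052, 3.763],
    [ 4.541, 26.783, 109.822, 2.526],
    [ 4.793, 21.523, 354.936, 4.411],
    [ 4.426, 18.666, 114.394, 3.151],
])
# Cross-validation holdout points from the same table
val = np.array([
    [7.940, 29.628, 286.284, 3.593],
    [7.528, 24.403, 348.387, 4.274],
    [6.699, 24.288, 188.940, 3.678],
    [7.306, 24.202,  93.798, 2.517],
])

def code(col):
    """Map a physical variable onto the coded [-1, +1] range."""
    lo, hi = col.min(), col.max()
    return 2 * (col - lo) / (hi - lo) - 1

# Main-effects model matrix: intercept plus the three coded factors
Xdev = np.column_stack([np.ones(len(dev))] + [code(dev[:, j]) for j in range(3)])
b, *_ = np.linalg.lstsq(Xdev, dev[:, 3], rcond=None)

resid = dev[:, 3] - Xdev @ b              # internal (training) residuals
rmse_int = np.sqrt(np.mean(resid ** 2))   # internal RMSE for part (e)/(f)
```

For external validation the holdout points must be coded with the same low/high values as the development data, not with their own ranges.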
6 Design of Physical and Simulation Experiments

Table 6.40 Data table for Problem 6.11

Factor X  Y1     Y2     Y3     Y4     Y5
X1        A 24   B 20   C 19   D 24   E 24
X2        B 17   C 24   D 30   E 27   A 36
X3        C 18   D 38   E 26   A 27   B 21
X4        D 26   E 31   A 26   B 23   C 22
X5        E 22   A 30   B 20   C 29   D 31

Data available electronically on book website

Table 6.41 Data for Problem 6.12 where the grades are out of 100

Time period  Course 1   Course 2   Course 3   Course 4
1            A 84       B 79       C 63       D 97
2            B 91       C 82       D 80       A 93
3            C 59       D 70       A 77       B 80
4            D 75       A 91       B 75       C 68

Data available electronically on book website
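For Pr. 6.12, the sums of squares of a complete Latin square decompose additively because the row, column, and treatment effects are mutually orthogonal. A numpy sketch using the grades of Table 6.41; the resulting F statistic, with 3 and 6 degrees of freedom, would be compared against the tabulated F(0.05; 3, 6) value:

```python
import numpy as np

# Grades from Table 6.41; rows = time periods, columns = courses
y = np.array([
    [84., 79., 63., 97.],
    [91., 82., 80., 93.],
    [59., 70., 77., 80.],
    [75., 91., 75., 68.],
])
# Professor (treatment) letters occupying each cell of the Latin square
prof = np.array([
    ["A", "B", "C", "D"],
    ["B", "C", "D", "A"],
    ["C", "D", "A", "B"],
    ["D", "A", "B", "C"],
])

grand = y.mean()
ss_total = ((y - grand) ** 2).sum()
ss_rows = 4 * ((y.mean(axis=1) - grand) ** 2).sum()   # time periods
ss_cols = 4 * ((y.mean(axis=0) - grand) ** 2).sum()   # courses
treat_means = {p: y[prof == p].mean() for p in "ABCD"}
ss_treat = 4 * sum((m - grand) ** 2 for m in treat_means.values())
ss_error = ss_total - ss_rows - ss_cols - ss_treat

# F statistic for the professor effect: 3 treatment d.f., 6 error d.f.
F_prof = (ss_treat / 3) / (ss_error / 6)
```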
(a) Briefly state the benefits and limitations of the Latin square design.
(b) State the relevant hypothesis tests one could perform.
(c) Generate the analysis of variance table and identify the significant factors at the 0.05 significance level.
(d) Perform a linear regression analysis and identify a parsimonious model.

Pr. 6.12 Latin squares for teaching evaluations
The College of Engineering of a large university wishes to evaluate the teaching capabilities of four professors. To eliminate any effects due to different courses offered during different times of the day, a Latin square experiment was performed in which the letters A, B, C, and D represent the four professors. Each professor taught one section of each of the four different courses scheduled at each of four different times of the day. Table 6.41 shows the grades assigned by these professors to 16 students of approximately equal ability. At the 0.05 level of significance, test the hypothesis that different professors have no effect on the grades.

Pr. 6.13 As part of the first step of a response surface (RS) approach, the following linear model was identified from preliminary experimentation using two coded variables:

y = 55 − 2.5x1 + 1.2x2   with   −1 ≤ xi ≤ +1    (6.23)
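For a first-order model such as this, the path of steepest ascent simply moves from the design center in the direction of the gradient of the fitted model; a small numpy sketch (the step size of 0.25 coded units in the dominant variable is an arbitrary illustrative choice):

```python
import numpy as np

# Fitted first-order coefficients of x1 and x2 (coded units)
b = np.array([-2.5, 1.2])

# The path of steepest ascent moves from the design center (0, 0)
# in the direction of the gradient, i.e., proportional to (b1, b2).
step = 0.25 * b / np.abs(b).max()        # 0.25 coded units in the dominant variable
path = np.array([k * step for k in range(5)])  # successive trial points

# Predicted response along the path (should be monotonically increasing)
y_pred = 55 + path @ b
```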
Determine the path of steepest ascent and draw this path on a contour plot.

Pr. 6.14 Building design involving simulations15
This is a simplified example to illustrate how DOE can be used in conjunction with a detailed building energy simulation program to find optimal values of the design parameters that minimize annual energy use. Deru et al. (2011) undertook a project of characterizing the commercial building stock in the US and developing reference models for them. Fifteen commercial building types and one multifamily residential building were determined to represent approximately two-thirds of the commercial building stock. The input parameters for the building models came from several sources; some were determined from ASHRAE standards, and the rest were determined from other studies of data and standard practices. National data from the 2003 CBECS (EIA 2003) were used to determine the appropriate, average mix of representative buildings, with the intention of representing 70% of US commercial building floor area. The building selected for this example is a medium office building of about 52,000 ft², three floors, and a 1.5 aspect ratio. Building energy use depends on several variables, but only 5 variables are assumed, as shown in Table 6.42: lighting power density (LPD), window shading coefficient (SC), exterior wall R-value (EWR), window U-value (WU), and window-to-wall ratio (WWR). All these design variables are continuous, and the range of variation set by the architect is

15 We thank Steve Snyder for this design problem, which is fully discussed in Snyder et al. (2013).

Table 6.42 Design parameters along with their range of variability and the response variables (Pr. 6.14)

Design parameter                                Variable name  Range
Lighting power density                          LPD            0.8–1.5 W/ft²
Window shading coefficient                      SC             0.2–0.7
Exterior wall R-value (total resistance)        EWR            7.8–27 h·ft²·°F/Btu
Window U-value (overall heat loss coefficient)  WU             0.26–1 Btu/(h·ft²·°F)
Window-to-wall ratio                            WWR            0.1–0.5

Response variables
Electricity use (annual)                        E_elec         10³ kWh
Natural gas use (annual)                        E_gas          10⁶ Btu
also shown. The study used the CCD design at two levels to generate an "optimal" set of 43 input combinations (edge + axial + center = 2⁵ + 2 × 5 + 1 = 32 + 10 + 1 = 43). Each of these combinations was then simulated by the detailed building energy simulation program to yield the annual energy use of both electricity and natural gas, as assembled in the last two columns of Table B.3 in Appendix B and also available electronically. The location is Madison, WI, and the TMY2 climate data file was used.
(a) Analyze this study in terms of design approach. Start by identifying the edge points, the axial points, and the central point. Compare this design with the full factorial and the partial factorial design methods when the behavior is known to be nonlinear (i.e., 3 levels for each factor). Generate the combinations and discuss benefits and limitations compared to CCD.
(b) Electricity use in kWh must be converted into the same units as that of natural gas. Assume a conversion factor of 33% (this is the efficiency of electricity generation and supply). Combine the two annual energy use quantities into a total thermal energy use, which is the aggregated variable to be considered below.
(c) Perform an ANOVA analysis and identify significant terms and interactions of the aggregated energy use variable. Fit a response surface model and deduce the optimal design point (note: the design variables are bounded as indicated in Table 6.42). Discuss advantages and limitations.
(d) If the study were to be expanded to include more variables (say 15), how would you proceed? Lay out your experimental design procedure in logical successive steps. (Hint: consider a two-step process: sensitivity analysis as well as identification of the optimal combination of design variables.) (Table B.3)
Table B.3 Five factors at two-level CCD design combinations and the two response variable values (building energy use per year of electricity and natural gas) found by simulation. Only a few rows are shown for comprehension; the entire data set is given in Appendix B3. The units of the variables are specified in Table 6.42

Run #  LPD    SC     EWR     WU     WWR    E_elec   E_gas
1      0.800  0.450  17.400  0.630  0.300  444.718  117.445
2      1.003  0.555  13.364  0.786  0.216  463.986  116.974
3      1.003  0.555  21.436  0.786  0.216  463.374  117.064
...
43     1.500  0.450  17.400  0.630  0.300  523.427  116.534
Data for this problem are given in Appendix B.3 and also available electronically on book website
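The 43-run count of the CCD used in Pr. 6.14 can be verified by generating the design in coded units. A sketch using only the standard library and numpy; the axial distance alpha = 2 is an assumed illustrative value, and the study's actual axial spacing may differ:

```python
import itertools
import numpy as np

k = 5           # number of design factors (LPD, SC, EWR, WU, WWR)
alpha = 2.0     # axial distance (a common, though not unique, choice)

# Corner ("edge") points of the 2^k factorial portion
corners = list(itertools.product([-1.0, 1.0], repeat=k))

# Axial (star) points: one factor at +/-alpha, all others at the center
axial = []
for j in range(k):
    for s in (-alpha, alpha):
        pt = [0.0] * k
        pt[j] = s
        axial.append(tuple(pt))

center = [(0.0,) * k]
ccd = np.array(corners + axial + center)
# 2^5 + 2*5 + 1 = 32 + 10 + 1 = 43 runs, matching the study
```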
References

Beck, J.V. and K.J. Arnold, 1977. Parameter Estimation in Engineering and Science, John Wiley and Sons, New York.
Berger, P.D. and R.E. Maurer, 2002. Experimental Design with Applications in Management, Engineering, and the Sciences, Duxbury Press.
Box, G.E.P., W.G. Hunter and J.S. Hunter, 1978. Statistics for Experimenters, John Wiley & Sons, New York.
Buckner, J., D.J. Cammenga and A. Weber, 1993. Elimination of TiN peeling during exposure to CVD tungsten deposition process using designed experiments, Statistics in the Semiconductor Industry, Austin, Texas: SEMATECH, Technology Transfer No. 92051125AGEN, vol. I, 4445371.
Clarke, J.A., 1993. Assessing building performance by simulation, Building and Environment, 28(4), pp. 419–427.
Comstock, M.C. and J.E. Braun, 1999. Development of Analysis Tools for the Evaluation of Fault Detection and Diagnostics in Chillers, ASHRAE Research Project 1043-RP; also Ray W. Herrick Laboratories, Purdue University, HL 9920: Report #40363, December.
Dean, A., D. Voss and D. Draguljić, 2017. Design and Analysis of Experiments, 2nd ed., Springer-Verlag, New York.
Deru, M., K. Field, D. Studer, K. Benne, B. Griffith, P. Torcellini, B. Liu, M. Halverson, D. Winiarski, M. Yazdanian, J. Huang and D. Crawley, 2011. U.S. Department of Energy Commercial Reference Building Models of the National Building Stock, National Renewable Energy Laboratory, NREL/TP-550-46861, U.S. Department of Energy, February.
Devore, J. and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd ed., Thomson Brooks/Cole, Australia.
De Witt, S., 2003. Chap. 2, Uncertainty in Building Simulation, in Advanced Building Simulation, eds. A.M. Malkawi and G. Augenbroe, Spon Press, Taylor and Francis, New York.
Didwania, K., T.A. Reddy and M. Addison, 2023. Synergizing Design of Building Energy Performance using Parametric Analysis, Dynamic Visualization and Neural Network Modeling, J. of Arch. Eng., American Society of Civil Engineers, vol. 29, issue 4, Sept.
Dutta, R., T.A. Reddy and G. Runger, 2016. A Visual Analytics Based Methodology for Multi-Criteria Evaluation of Building Design Alternatives, ASHRAE Winter Conference paper, OR-16-C051, Orlando, FL, January.
EIA, 2003. Commercial Building Energy Consumption Survey (CBECS), www.eia.gov/consumption/commercial/data/2005/
Hammersley, J.M. and D.C. Handscomb, 1964. Monte Carlo Methods, Methuen and Co., London.
Heiselberg, P., H. Brohus, A. Hesselholt, H. Rasmussen, E. Seinre and S. Thomas, 2009. Application of sensitivity analysis in design of sustainable buildings, Renewable Energy, 34(2009), pp. 2030–2036.
Helton, J.C. and F.J. Davis, 2003. Latin hypercube sampling and the propagation of uncertainty of complex systems, Reliability Engineering and System Safety, vol. 81, pp. 23–69.
Hofer, E., 1999. Sensitivity analysis in the context of uncertainty analysis for computationally intensive models, Computer Physics Communication, vol. 117, pp. 21–34.
Hou, D., J.W. Jones, B.D. Hunn and J.A. Banks, 1996. Development of HVAC system performance criteria using factorial design and DOE-2 simulation, Tenth Symposium on Improving Building Systems in Hot and Humid Climates, pp. 184–192, May 13–14, Fort Worth, TX.
Iman, R.L. and J.C. Helton, 1985. A Comparison of Uncertainty and Sensitivity Analysis Techniques for Computer Models, Sandia National Laboratories report NUREG/CR-3904, SAND 84-1461.
Lam, J.C. and S.C.M. Hui, 1996. Sensitivity analysis of energy performance of office buildings, Building and Environment, vol. 31, no. 1, pp. 27–39.
Mandel, J., 1964. The Statistical Analysis of Experimental Data, Dover Publications, New York.
McKay, M.D., R.J. Beckman and W.J. Conover, 2000. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 42(1, Special 40th Anniversary Issue), 55–61.
Montgomery, D.C., 2017. Design and Analysis of Experiments, 9th ed., John Wiley & Sons, New York.
Morris, M.D., 1991. Factorial sampling plans for preliminary computational experiments, Technometrics, 33(2), pp. 161–174.
Saltelli, A., K. Chan and E.M. Scott (eds.), 2000. Sensitivity Analysis, John Wiley and Sons, Chichester.
Sanchez, D.G., B. Lacarriere, M. Musy and B. Bourges, 2014. Application of sensitivity analysis in building energy simulations: Combining first- and second-order elementary effects methods, Energy and Buildings, 68(2014), pp. 741–750.
Shannon, R.E., 1975. Systems Simulation: The Art and Science, Prentice-Hall, Englewood Cliffs, New Jersey.
Snyder, S., T.A. Reddy and M. Addison, 2013. Automated Design of Buildings: Need, Conceptual Approach, and Illustrative Example, ASHRAE Conference paper #DA-13-C010, January.
Sobol, I.M., 2001. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Mathematics and Computers in Simulation, 55(2001), pp. 271–280, Elsevier.
Spears, R., T. Grieb and N. Shiang, 1994. Parameter uncertainty and interaction in complex environmental models, Water Resources Research, vol. 30(11), pp. 3159–3169.
Walpole, R.E., R.H. Myers, S.L. Myers and K. Ye, 2007. Probability and Statistics for Engineers and Scientists, 8th ed., Prentice-Hall, Upper Saddle River, NJ.
Wang, M., J. Wright and A. Brownlee, 2014. A comparison of approaches to stepwise regression for the indication of variables sensitivities used with a multi-objective optimization problem, ASHRAE Annual Conference, paper SE-14-C060, Seattle, WA, June.
Westermann, P. and R. Evins, 2019. Surrogate modelling for sustainable building design – A review, Energy and Buildings, 198(2019), pp. 170–186.
7 Optimization Methods
Abstract
This chapter provides an introductory overview and foundation of traditional optimization techniques, along with pertinent engineering applications and illustrative examples. These techniques apply to situations where the impact of uncertainties is relatively minor and can be viewed as a subset of the broader domain of decision-making (treated in Chap. 12). The chapter starts by defining the various terms used in the optimization literature, such as the objective function and the different types of constraints, followed by a description of the various steps involved in an optimization problem, such as sensitivity and post-optimality analysis. Simple graphical methods are used to illustrate the fact that one may have problems with unique, no, or multiple solutions, and that one may encounter instances when not all the constraints are active, and some may even be redundant. Analytical methods involving calculus-based techniques (such as the Lagrange multiplier method) as well as numerical search methods, both for unconstrained and constrained problems as relevant to univariate and multivariate problems, are reviewed, and the usefulness of the slack-variable approach is explained. Subsequently, solutions to problems which can be grouped as linear, quadratic, nonlinear, or mixed-integer programming are described, while highlighting the differences between them. Several simple examples illustrate the theoretical approaches, while more in-depth practical examples involving network models and supervisory control of an integrated energy system are also presented. How such optimization approaches can be used for system reliability studies involving breakage of one or more components or links in a power grid is also illustrated. Methods that allow global solutions as against local ones, such as simulated annealing and genetic algorithms, are briefly described. Finally, the important topic of dynamic optimization is covered, which applies to optimizing a trajectory over time, i.e., to situations when a series of decisions have to be made to define or operate a system over a set of discrete time-sequenced stages. There is a vast amount of published material on the subject of optimization, and this chapter is simply meant to provide a good foundation and adequate working understanding for the reader to tackle the more complex and ever-evolving extensions and variants of optimization problems.
7.1 Introduction

7.1.1 What Is Optimization?
One of the most important tools for both the design and operation of engineering systems is optimization, which corresponds to the case of finding optimal solutions under low uncertainty. This branch of applied mathematics, also studied under "operations research" (OR),1 is the use of specific methods where one tries to minimize or maximize a global characteristic (say, the cost or the benefit) whose variation is modeled by an "objective function." The setup of the optimization problem involves not only the mathematical formulation of the objective function but, just as importantly, the explicit and complete framing of a set of constraints. Optimization problems arise in almost all branches of industry or society, for example, in product and engineering process design, production scheduling, logistics, traffic control, and even strategic planning. Optimization in an engineering context involves certain basic aspects consisting of some or all of the following: (i) the framing of a situation or problem (for which a solution or a course of action is sought) in terms of a mathematical model often called the objective function; this could be a simple

1 "Operations research" is the scientific/mathematical/quantitative discipline adopted by industrial and business organizations to better manage their complex business operations/systems with hundreds of variables, for optimal operation/scheduling and for planning for future expansion growth.
expression, or framed as a decision tree model in case of multiple outcomes (deterministic or probabilistic) or sequential decision-making stages; (ii) defining the range of constraints to the problem in terms of input parameters which may be dictated by physical considerations; (iii) placing bounds on the solution space of the output variables in terms of some practical or physical constraints; (iv) defining or introducing uncertainties in the input parameters and in the types of parameters appearing in the model; (v) mathematical techniques which allow solving such models efficiently (short execution times) and accurately (unbiased solutions); and (vi) sensitivity analysis to gauge the robustness of the optimal solution to various uncertainties.

Framing of the mathematical model involves two types of uncertainties: (i) epistemic, or lack of complete knowledge of the process or system, which can be reduced as more data are acquired; and (ii) aleatory uncertainty, which has to do with the stochasticity of the process and which cannot be reduced by collecting more data. These notions were introduced in Sect. 1.3.2 and are discussed at some length in Chap. 12 while dealing with decision analysis. This chapter deals with traditional optimization techniques as applied to engineering applications, which are characterized by low aleatory and low epistemic uncertainty.

Further, recall the concept of abstraction presented in Sect. 1.2.2 in the context of formulating models. It pertains to the process of deciding on the level of detail appropriate for the problem at hand without, on the one hand, oversimplification, which may result in loss of important system behavior predictability, while on the other hand, avoiding the formulation of an overly detailed model, which may result in undue data and computational resources as well as time spent in understanding the model assumptions and the results generated. The same concept of abstraction also applies to the science of optimization.
One must set a level of abstraction commensurate with the complexity of the problem at hand and the accuracy of the solution sought. Consider a problem framed as ﬁnding the optimum of a continuous function. There could, of course, be the added complexity of considering several discrete options; but each option has one or more continuous variables requiring proper control to achieve a global optimum. The simple example given in Pr. 1.9 from Chap. 1 will be used to illustrate this case.
7.1.2 Simple Example
Example 7.1.1 Function minimization

Fig. 7.1 Pumping system whose operational power consumption is to be minimized (Example 7.1.1)

Two pumps with parallel networks (Fig. 7.1) deliver a volumetric flow rate F = 0.01 m³/s of water from a reservoir to the destination. The pressure drops in Pascals (Pa) of each network are given by Δp1 = (2.1 × 10¹⁰)·F1² and Δp2 = (3.6 × 10¹⁰)·F2², where F1 and F2 are the flow rates through each branch in m³/s. Assume that both the pumps and their motor assemblies have equal efficiencies η1 = η2 = 0.9. Let P1 and P2 be the electric power in Watts (W) consumed by the two pump-motor assemblies. The total power draw must be minimized. Since power consumed is equal to volume flow rate times pressure drop, the objective function to be minimized is the sum of the power consumed by both pumps:

J = (Δp1·F1)/η1 + (Δp2·F2)/η2 = (2.1 × 10¹⁰·F1³)/0.9 + (3.6 × 10¹⁰·F2³)/0.9    (7.1)

The sum of both flows is equal to 0.01 m³/s, and so F2 can be eliminated in Eq. 7.1. Thus, the sought-after solution is the value of F1 which minimizes the objective function J:

J = Min{J} = Min[(2.1 × 10¹⁰·F1³)/0.9 + (3.6 × 10¹⁰·(0.01 − F1)³)/0.9]    (7.2)

subject to the constraint that F1 > 0. From basic calculus, dJ/dF1 = 0 provides the optimum solution, from which F1 = 0.00567 m³/s and F2 = 0.00433 m³/s, and the total power of both pumps P = 7501 W. The extent to which non-optimal performance is likely to lead to excess power can be gauged (referred to as post-optimality analysis) by simply plotting the function J vs. F1 (Fig. 7.2). In this case, the optimum is rather broad; the system can be operated such that F1 is in the range of 0.005–0.006 m³/s without much power penalty. On the other hand, sensitivity analysis would involve a study of how the optimum value is affected by certain parameters. For example, Fig. 7.3 shows that varying the efficiency of pump 1 in the range of 0.85–0.95 has negligible impact on the optimal result. However, this may not be the case for
some other variable. A systematic study of how various parameters impact the optimal value falls under “sensitivity analysis,” and there exist formal methods of investigating this aspect of the problem. Finally, note that this is a very simple optimization problem with a simple imposed constraint which was not even considered during the optimization. ■
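Example 7.1.1 can be reproduced numerically in a few lines of Python; the closed-form optimum below follows from setting dJ/dF1 = 0 in Eq. 7.2:

```python
import math

F_TOTAL = 0.01   # total flow, m^3/s
ETA = 0.9        # pump-motor efficiency
C1, C2 = 2.1e10, 3.6e10   # pressure-drop coefficients of the two branches

def total_power(F1):
    """Objective J of Eq. 7.2: combined electric power of both pumps, in W."""
    F2 = F_TOTAL - F1
    return (C1 * F1**3 + C2 * F2**3) / ETA

# Setting dJ/dF1 = 0 gives C1*F1^2 = C2*(F_TOTAL - F1)^2, i.e.,
#   F1 = F_TOTAL * sqrt(C2) / (sqrt(C1) + sqrt(C2))
F1_opt = F_TOTAL * math.sqrt(C2) / (math.sqrt(C1) + math.sqrt(C2))
P_min = total_power(F1_opt)   # about 7501 W, as in the text
```

Evaluating `total_power` on a grid around `F1_opt` also reproduces the post-optimality observation that the optimum is broad over roughly 0.005–0.006 m³/s.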
7.2 Terminology and Classification

7.2.1 Definition of Terms
A mathematical formulation of an optimization problem involving control of an engineering system consists of the following terms:
Fig. 7.2 One type of postoptimality analysis involves plotting the objective function for Total Power against ﬂow rate through pump 1 (F1) to evaluate the shape of the curve near the optimum. In this case, there is a broad optimum indicating that the system can be operated nearoptimally over this range without much corresponding power penalty
Fig. 7.3 Sensitivity analysis with respect to efﬁciency of pump 1 on the overall optimum
(i) Decision variables or process variables, say (x1, x2 . . . xn), whose respective values are to be determined. These can be either discrete or continuous variables. An example is the air flow rate in an evaporative cooling tower;
(ii) Control variables, which are the physical quantities that can be varied by hardware according to the numerical values of the decision variables sought. Determining the "best" numerical values of these variables is the basic intent of optimization. An example is the control of the fan speed to achieve the desired air flow rate through the evaporative cooling tower;
(iii) Objective function, an analytical formulation of an appropriate measure of performance of the system (or characteristic of the design problem) in terms of the decision variables. An example is the total electric power in Example 7.1.1;
(iv) Constraints or restrictions imposed on the values of the decision variables. These can be of two types: non-negative constraints, for example, flow rates cannot be negative; and functional constraints (also called structural constraints), which can be equality, non-equality, or range constraints that specify a range of variation over which the decision variables can be varied. These can be based on direct considerations (such as not exceeding the capacity of energy equipment, or limitations of temperature and pressure control values) or on indirect ones (when mass and energy balances have to be satisfied);
(v) Model parameters, the constants appearing in the constraint and objective equations.
7.2.2 Categorization of Methods
Optimization methods can be categorized in a number of ways.
Fig. 7.4 Examples of unimodal, bimodal and multimodal peaks for a univariate unconstrained function showing the single, dual and multicritical points, respectively
Fig. 7.5 Two types of critical points for the unconstrained bivariate problem: (a) saddle point, (b) global minimum of a unimodal function
(i) Univariate/multivariate problems with uni/multimodal critical points. Univariate problems are those where only one variable is involved in the objective function, which can be a unimodal, bimodal, or multimodal function (Fig. 7.4). Critical points are those where the first derivative of the function is zero, and they can be either local or global maxima or minima depending on whether the second derivative is negative or positive, respectively. If the second derivative is zero, this indicates a saddle point. By extension, critical points of a function of two variables are those points at which both partial derivatives of the function are zero. The saddle point (Fig. 7.5) is one where the slopes (or function derivatives) in orthogonal directions are both zero. As the number of variables in the function increases, the solution of the problem becomes exponentially more difficult. Searching for optimum points of nonlinear problems poses the great danger of getting stuck in a local minimum; software programs often avoid this situation by adopting a technique called "multistart," where several searches are undertaken from different starting points of the feasible search space.
(ii) Analytical vs. numerical methods. Analytical methods apply to cases when the optimization of the objective function can be expressed as a mathematical
relationship using subsidiary equations which can be solved directly or by calculus methods. This contrasts with numerical search methods, which require that the function or its gradient at successive locations be determined by an algorithm that allows homing in on the solution. The categorization of these methods is based on the algorithm used. Even though more demanding computationally, search methods are especially useful and often adopted for discontinuous or complex problems.
(iii) Linear vs. nonlinear methods. Linear optimization problems (or linear programming, LP) involve linear models with a linear objective function and linear constraints. The theory is well developed, and solutions can be found quickly and robustly. There is an enormous amount of published literature on LP problems, and LP has found numerous practical applications involving up to several thousands of independent variables. There are several well-known techniques to solve them (the Simplex method in operations research, used to solve a large set of linear equations, being the best known). However, many problems from engineering to economics require the use of nonlinear models or constraints, in which case nonlinear programming2 (NLP) techniques must be used. In some cases, nonlinear models (e.g., equipment models such as chillers, fans, pumps, and cooling towers) can be expressed as quadratic models, and algorithms more efficient than nonlinear programming ones have been developed; this falls under quadratic programming (QP) methods. Calculus-based methods are best suited for finding the solution of simpler problems; for more involved problems resulting in complex nonlinear simultaneous equations, a search process would be required.
(iv) Continuous vs. discontinuous. When the objective functions are discontinuous, calculus-based methods can break down. In such cases, one could use non-gradient-based methods or even heuristic-based computational methods such as simulated annealing, particle swarm optimization, or genetic algorithms. The latter are very powerful in that they can overcome problems associated with local minima and discontinuous functions, but they need long computing times, have no guarantee of finding the global optimum, and require a certain amount of knowledge on the part of the analyst. Another form of discontinuity arises when one or more of the variables are discrete as against continuous. Such cases fall under the classification known as integer or discrete programming.
(v) Static vs. dynamic. If the optimization is done with time not being a factor, then the procedure is called static. However, if optimization is to be done over a time period where decisions can be made at several subintervals of that period, then a dynamic optimization method is warranted. Two such examples are when one needs to optimize the route taken by a salesman visiting different cities as part of his road trip, or when the operation of a thermal ice storage supplying cooling to a building must be optimized during the several hours of the day when high electric demand charges prevail. Whenever possible, analysts make simplifying assumptions to make the optimization problem static.
(vi) Deterministic vs. probabilistic. This depends on whether one neglects or considers the uncertainty associated with various parameters of the objective function and the constraints. The need to treat these uncertainties together, and in a probabilistic manner, rather than one at a time (as is done in a sensitivity analysis), has led to the development of several numerical techniques, the Monte Carlo technique being the most widely used (Sect. 12.2.8).
(vii) Traditional vs. stochastic adaptive search. Most of the methods referred to above can be designated as traditional methods, in contrast to stochastic methods developed in the last three to four decades (also referred to as global optimization methods or metaheuristic methods). The latter are especially meant for very complex multivariate nonlinear problems. Three of the best-known methods are simulated annealing, particle swarm optimization, and genetic algorithms. Each of these uses metaheuristic search techniques, that is, a mix of selective combination of intermediate results and randomness. These algorithms are patterned after the evolutionary and flocking behavior which nature adopts for physical and biological processes to gradually and adaptively improve the search towards a near-optimal solution.

2 Programming does not refer to computer programming, but arises from the use of "program" by the United States military to refer to proposed training and logistics schedules, which were the problems studied by Dantzig (the primary developer of the Simplex method).

Fig. 7.6 An example of a constrained optimization problem with no feasible solution
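The "multistart" remedy mentioned under (i) can be sketched in a few lines: run a simple gradient descent from several random starting points of a bimodal toy function and keep the best candidate. The function, step size, and iteration count below are arbitrary illustrative choices:

```python
import random

def f(x):
    """A bimodal test function: local minimum near x = 1.35,
    global minimum near x = -1.47 (illustrative only)."""
    return x**4 - 4 * x**2 + x

def grad(x):
    """Analytical derivative of f."""
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Multistart: several random starting points, keep the best local result
random.seed(0)
starts = [random.uniform(-3, 3) for _ in range(10)]
candidates = [descend(x0) for x0 in starts]
x_best = min(candidates, key=f)
```

A single descent started on the wrong side of the local maximum would settle in the shallower basin near x = 1.35; taking the best of many starts recovers the global minimum with high probability, which is exactly the rationale behind multistart.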
7.2.3 Types of Objective Functions and Constraints
Single criterion optimization is one where a single objective function can be formulated. For example, an industrialist is considering starting a factory to assemble photovoltaic (PV) cells into PV modules. Whether to invest or not, and if yes, at what capacity level, are issues that can both be framed as single criterion optimization problems. However, if maximizing the number of jobs created is another (altruistic) objective, then the problem must be treated as a multicriteria decision problem. Such cases are discussed in Sects. 12.5–12.7.

Establishing the objective function is often simple. The real challenge is usually in specifying the complete set of constraints. A feasible solution is one that satisfies all the stated constraints, while an infeasible solution is one where at least one constraint is violated. The optimal solution is a feasible solution that has the most favorable value (either maximum or minimum) of the objective function, and it is this solution that is sought. The optimum can be a single point or several points, and some problems may have no optimal solution at all. Figure 7.6 shows a function to be maximized subject to several constraints (six in this case). Note that there is no feasible solution, and one of the constraints must be relaxed or the
272
7
Optimization Methods
Fig. 7.7 An example of a constrained optimization problem with more than one optimal solution
problem reframed. In some optimization problems, one can obtain several equivalent optimal solutions. This is illustrated in Fig. 7.7, where several combinations of the two variables, which define the line segment shown, are possible optima.

Sometimes, an optimal solution may not necessarily be the one selected for implementation. A “satisficing” solution (a blend of the words “satisfactory” and “optimizing”) may be the one selected for actual implementation; it reflects the difference between theory (which yields an optimal solution) and the situations faced in practice (due to actual implementation issues, heuristic constraints that cannot be expressed mathematically, the need to treat unpredictable occurrences, risk attitudes of the owner/operator, . . .). Some practitioners also refer to such solutions as “near-optimal,” though this carries a somewhat negative connotation.

A final issue with optimization problems is that the constraints defined differ in their influence on the optimal solution; sometimes they have no role at all, and such constraints are referred to as nonbinding. The detection of such superfluous constraints is not simple; relaxing them would often simplify the solution search.
7.2.4 Sensitivity Analysis and Post-Optimality Analysis
Model parameters are often not known with certainty: they could be based on models identified from partial or incomplete observations, or they could even be guesstimates. The optimum is only correct insofar as the model is accurate, and the model parameters and constraints reflective of the actual situation. Hence, the optimal solution determined needs to be reevaluated in terms of how the various types of uncertainties affect it. This is done by sensitivity analysis, which determines a range of values:

(i) Of the model parameters over which the optimal solution will remain unchanged or vary within an allowable range so as to stay near-optimal. This would flag critical parameters which may require closer investigation, refinement, and monitoring.
(ii) Over which the optimal solution will remain feasible with adjusted values for the decision variables (the allowable range to stay feasible, i.e., within which the constraints are satisfied). This would help identify influential constraints.

Further, the above evaluations can be performed by adopting (see Sect. 6.7.4):

(i) Individual parameter sensitivity, where one parameter at a time in the original model is varied (or perturbed) to check its effect on the optimal solution;
(ii) Total sensitivity (also called parametric programming), which involves the study of how the optimal solution changes as many parameters change simultaneously over some range. It thus provides insight into “correlated” parameters and trade-offs in parameter values. Such evaluations are conveniently done using Monte Carlo methods (Sect. 12.2.8).
7.3 Analytical Methods
Calculus-based solution methods can be applied to both linear and nonlinear problems and are the ones to which undergraduate students are most likely to be exposed. They can be used for problems where the objective function and the constraints are differentiable. These methods are also referred to as classical or traditional optimization methods, as distinct from stochastic methods such as simulated annealing or evolutionary algorithms. A brief review of calculus-based analytical and search methods is presented below.
7.3.1 Unconstrained Problems
The basic calculus of the univariate unconstrained optimization problem can be extended to the multivariate case of dimension n by introducing the gradient vector ∇ and by recalling that the gradient of a scalar y is deﬁned as:
∇y = (∂y/∂x1) i1 + (∂y/∂x2) i2 + … + (∂y/∂xn) in      (7.3)

where i1, i2, …, in are unit vectors and y is the objective function of the n variables: y = y(x1, x2, …, xn).

With this terminology, the condition for optimality of a continuous function y is simply:

∇y = 0      (7.4)

However, the optimality may be associated with stationary points which could be minimum, maximum, saddle, or ridge points. Since objective functions are conventionally expressed as a minimization problem, one seeks the minimum of the objective function. Recall that for the univariate case, assuring that the optimal value found is a minimum (and not a maximum or a saddle point) involves computing the numerical value of the second derivative at this optimal point and checking that its value is positive. Graphically, a minimum for a continuous function is found (or exists) when the function is convex, while a saddle point is found for a combination function (see Fig. 7.8). In the multivariate optimization case, one checks whether the Hessian matrix (i.e., the second derivative matrix, which is symmetrical) is positive definite or not. It is tedious to check this condition by hand for any matrix whose dimensionality is greater than 2, and so computer programs are invariably used for such problems. A simple hand calculation method (which works well for low-dimension problems) for ascertaining whether the optimal point is a minimum (or maximum) is to simply perturb the optimal solution vector obtained by a small amount, compute the objective function, and determine whether this value is higher (or lower) than the optimal value found.

Fig. 7.8 Illustrations of convex, concave and combination functions. A convex function is one where every point on the line joining any two points on the graph does not lie below the graph at any point. A combination function is one that exhibits both convex and concave behavior during different portions with the switchover being the saddle point

Example 7.3.1 Determine the minimum value of the following function:

y = 1/(4x1) + 8x1²x2 + 1/x2²      (7.5)

First, the two first-order derivatives are found:

∂y/∂x1 = −1/(4x1²) + 16x1x2   and   ∂y/∂x2 = 8x1² − 2/x2³

Setting the above two expressions to zero and solving results in x1* = 0.2051 and x2* = 1.8114, at which condition the optimal value of the objective function is y* = 2.133. It is left to the reader to verify these results and to check whether this is indeed the minimum. ■
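The result can be checked numerically. The short sketch below is not from the book: it combines the two first-order conditions into a closed form (the exponent 1/7 follows from eliminating x2 between them, a manipulation worked out here) and then applies the perturbation check suggested above.

```python
# Sketch (an assumption, not the book's code): closed-form solution of the
# first-order conditions of Example 7.3.1, plus the perturbation test.

def y(x1, x2):
    # objective of Eq. 7.5
    return 1.0 / (4.0 * x1) + 8.0 * x1**2 * x2 + 1.0 / x2**2

# dy/dx1 = 0 and dy/dx2 = 0 combine to x1^7 = 1/65536 and x2 = 1/(64*x1^3)
x1 = (1.0 / 65536.0) ** (1.0 / 7.0)
x2 = 1.0 / (64.0 * x1**3)
y_opt = y(x1, x2)

# hand-calculation check for a minimum: nearby points must give a larger y
for dx1, dx2 in ((0.01, 0.0), (-0.01, 0.0), (0.0, 0.01), (0.0, -0.01)):
    assert y(x1 + dx1, x2 + dx2) > y_opt

print(round(x1, 4), round(x2, 4), round(y_opt, 3))   # → 0.2051 1.8114 2.133
```

The printed values match the example, and the perturbation loop confirms that the stationary point is indeed a minimum.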
7.3.2 Direct Substitution Method for Equality Constrained Problems
The simplest approach is the direct substitution method where, for a problem involving n variables and m equality constraints, one tries to eliminate the m constraints by direct substitution and then solves the objective function using the unconstrained solution method described above. This approach was used earlier in Example 7.1.1.

Example 7.3.2³ Direct substitution method
Consider the simple optimization problem stated as:

Minimize f(x) = 4x1² + 5x2²      (7.6a)

subject to: 2x1 + 3x2 = 6      (7.6b)

Either x1 or x2 can be eliminated without difficulty. Say, the constraint equation is used to solve for x1, which is then substituted into the objective function. This yields the unconstrained objective function:

f(x2) = 14x2² − 36x2 + 36

The optimal value is x2* = 1.286, from which, by substitution, x1* = 1.071. The resulting value of the objective function is f(x)* = 12.857.
³ From Edgar et al. (2001) by permission of McGraw-Hill.
Fig. 7.9 Graphical representation of how direct substitution can reduce a function with two variables x1 and x2 into one with a single variable. The unconstrained optimum is at (0, 0) at the center of the contours (From Edgar et al. 2001 by permission of McGraw-Hill)
This simple problem allows a geometric visualization to better illustrate the approach. As shown in Fig. 7.9, the objective function is a paraboloid plotted on the z axis with x1 and x2 being the other two axes. The constraint is represented by a plane surface which intersects the paraboloid as shown. The resulting intersection is a parabola whose optimum is the solution being sought. Notice how this constrained optimum differs from the unconstrained optimum, which occurs at (0, 0) (Fig. 7.9). The above approach requires that one variable first be explicitly expressed as a function of the remaining variables and then eliminated from all equations; this procedure is continued till there are no more constraints. Unfortunately, this approach is unlikely to be of general applicability in most problems.
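The substitution result of Example 7.3.2 is easy to verify with the vertex formula for a quadratic; the few lines below simply restate the example, with no additional material.

```python
# Example 7.3.2 check: after eliminating x1 = (6 - 3*x2)/2, the objective
# becomes the quadratic f(x2) = 14*x2^2 - 36*x2 + 36, minimized at its vertex.
a, b, c = 14.0, -36.0, 36.0
x2 = -b / (2.0 * a)                # vertex of the unconstrained quadratic
x1 = (6.0 - 3.0 * x2) / 2.0       # back-substitute into 2*x1 + 3*x2 = 6
f_opt = a * x2**2 + b * x2 + c

print(round(x1, 3), round(x2, 3), round(f_opt, 3))   # → 1.071 1.286 12.857
```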
7.3.3 Lagrange Multiplier Method for Equality Constrained Problems
A more versatile and widely used approach that allows the constrained problem to be reformulated into an unconstrained one is the Lagrange multiplier approach. Consider an optimization problem involving an objective function y, a set of n decision variables x, and a set of m equality constraints h(x):

Minimize y = y(x)      (objective function)      (7.7a)
subject to h(x) = 0      (equality constraints)      (7.7b)

The Lagrange multiplier method simply absorbs the equality constraints into the objective function and states that the optimum occurs when the following modified objective function is minimized:

J = min{y(x)} = y(x) − λ1h1(x) − λ2h2(x) − … − λmhm(x)      (7.8)

where the quantities λ1, λ2, …, λm are called the Lagrange multipliers, one applied to each of the m equality constraints. The optimization problem thus involves minimizing J with respect to both x and the Lagrange multipliers. The elimination of the constraints comes at the price of increasing the dimensionality of the problem from n to (n + m); stated differently, one now seeks the optimal values of (n + m) variables as against those of the n variables which optimize the function y.

A simple example with one equality constraint serves to illustrate this approach. The objective function y = 2x1 + 3x2 is to be optimized subject to the constraint x1x2² = 48. Figure 7.10 depicts this problem visually, with the two variables being the two axes and the objective function represented by a series of parallel lines for different assumed values of y. Since the constraint is a curved line, the optimal solution is obviously the point where the tangent vector of the curve (shown as a dotted line) is parallel to these lines (shown as point A).

Example 7.3.3⁴ Optimizing a solar water heater system using the Lagrange method
A solar water heater consisting of a solar collector and a fully mixed storage tank is to be optimized for lowest first cost consistent with the following specified system performance. During the day, the storage temperature is to be raised gradually from an initial 30 °C (equal to the ambient temperature Ta) to a final desired temperature Tmax, while during the night heat is to be withdrawn from storage such that the storage temperature drops back to 30 °C for the next day’s
⁴ From Reddy (1987).
operation. The system should be able to store 20 MJ of thermal heat over a typical day of the year, during which HT, the incident radiation over the collector operating time (assumed to be 10 h), is 12 MJ/m². The collector performance characteristics⁵ are FRη0 = 0.8 and FRUL = 4.0 W/m²·°C. The costs of the solar subsystem components are: fixed cost Cb = $600, collector area proportional cost Ca = $200/m² of collector area, and storage volume proportional cost Cs = $200/m³ of storage volume. Assume that the average inlet temperature to the collector TCi during a charging cycle over a day is equal to the arithmetic mean of Tmax and Ta. Let AC (m²) and VS (m³) be the collector area and storage volume respectively. The objective function is:

J = 600 + 200AC + 200VS      (7.9)

In essence, the optimization involves determining the most cost-effective sizes of the collector area and of the storage tank that can deliver the required amount of thermal energy at the end of the day. A larger collector area would require a smaller storage volume (and vice versa); but then the water temperature in the storage tank would be higher, thereby penalizing the thermal efficiency of the solar collector array. The constraint on the daily amount of solar energy collected is found from the collector performance model expressed as:

QC = AC[HT FRη0 − FRUL(TCi − Ta)Δt]      (7.10a)

where Δt is the number of seconds during which the collector operates. Plugging numerical values results in:

20 × 10⁶ = AC[(12 × 10⁶)(0.8) − (4)(3600)(10)((Tmax + 30)/2 − 30)]

or

20 × 10⁶ = (11.76 − 0.072Tmax) × 10⁶ · AC      (7.10b)

A heat balance on the storage over the day yields:

QC = Mcp(Tmax − Tinitial)   or   20 × 10⁶ = VS(1000)(4190)(Tmax − 30)

from which

Tmax = 20/[(4.19)VS] + 30

Substituting this back into the constraint Eq. 7.10b results in:

AC(9.6 − 0.344/VS) = 20      (7.11)

This allows the combined Lagrangian objective function to be deduced as:

J = 600 + 200AC + 200VS − λ[AC(9.6 − 0.344/VS) − 20]      (7.12)

The resulting set of Lagrangian equations is:

∂J/∂AC = 0 = 200 − λ(9.6 − 0.344/VS)
∂J/∂VS = 0 = 200 − λAC(0.344/VS²)
∂J/∂λ = 0 = AC(9.6 − 0.344/VS) − 20

Solving this set of nonlinear equations is not straightforward, and numerical search methods (discussed in Sect. 7.4) need to be adopted. In this case, the sought-after optimal values are AC* = 2.36 m² and VS* = 0.308 m³. The value of the Lagrange multiplier is λ = 23.58, and the corresponding initial cost is J* = $1134. The Lagrange multiplier can be interpreted as a sensitivity coefficient, which in this example corresponds to the marginal cost of solar thermal energy. In other words, increasing the thermal requirement by 1 MJ would lead to an increase of λ = $23.58 in the initial cost of the optimal solar system. ■

⁵ See Pr. 5.7 for a description of the solar collector model.

Fig. 7.10 Optimization of the linear function y = 2x1 + 3x2 subject to the constraint shown. The problem is easily solved using the Lagrange multiplier method to yield optimal values of (x1* = 3, x2* = 4). Graphically, the optimum point A occurs where the constraint function and the lines of constant y (which, in this case, are linear) have a common normal line indicated by the arrow at A
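One way to solve the Lagrangian system numerically is sketched below; it is not the book's code. The first and third equations are used to eliminate λ and AC analytically (a manipulation performed here, not given in the text), leaving a single residual in VS that a plain bisection can drive to zero; the bracketing interval is an assumption chosen by inspection.

```python
# Sketch: solving the Lagrangian conditions of Example 7.3.3 by bisection.
# From dJ/dAC = 0 and dJ/dlambda = 0:  lambda = 200/D and AC = 20/D,
# with D = 9.6 - 0.344/VS; the remaining condition dJ/dVS = 0 is the residual.

def residual(VS):
    D = 9.6 - 0.344 / VS
    lam = 200.0 / D
    AC = 20.0 / D
    return lam * AC * 0.344 / VS**2 - 200.0   # zero when dJ/dVS = 0

lo, hi = 0.1, 1.0          # assumed bracket: residual(lo) > 0 > residual(hi)
for _ in range(60):        # bisection (residual is decreasing on this bracket)
    mid = 0.5 * (lo + hi)
    if residual(mid) > 0:
        lo = mid
    else:
        hi = mid

VS = 0.5 * (lo + hi)
D = 9.6 - 0.344 / VS
AC, lam = 20.0 / D, 200.0 / D
J = 600.0 + 200.0 * AC + 200.0 * VS
print(round(AC, 3), round(VS, 3), round(lam, 2), round(J, 0))
```

The bisection converges to AC ≈ 2.357 m², VS ≈ 0.309 m³, λ ≈ 23.57, and J ≈ $1133, agreeing with the rounded values quoted in the example.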
7.3.4 Problems with Inequality Constraints
Most practical problems have constraints in terms of the independent variables, and often these assume the form of
Fig. 7.11 Graphical solution of Example 7.3.4 (constraint lines x1 ≤ 5.5, x2 ≤ 3.5, and x1 + x2 ≤ 7; objective function line x1 + 2x2 = 10.5)
inequality constraints. There are several semi-analytical techniques that allow the constrained optimization problem to be reformulated into an unconstrained one, and the way this is done is what differentiates these methods. In such cases, one can avoid the use of generalized optimization solvers if such software programs are unavailable. Notice that no inequality constraints appear in Examples 7.3.2 or 7.3.3. When optimization problems involve inequality constraints, these can be re-expressed as equality constraints by introducing additional variables, called slack (or artificial) variables. Each inequality constraint requires a new slack variable. The order of the optimization problem will increase, but the efficiency in the subsequent numerical solution approach outweighs this drawback. The following simple example serves to illustrate this approach.

Example 7.3.4 Consider the following problem with x1 and x2 being continuous variables:

Objective function: maximize J = (x1 + 2x2)      (7.13a)

Subject to the constraints:

2x2 ≤ 7
x1 + x2 ≤ 7
2x1 ≤ 11
x1, x2 ≥ 0      (7.13b)

The calculus-based solution involves introducing three additional variables for the three constraints (to simplify the analysis, the last two conditions that the variables x1 and x2 be positive can be discarded and the resulting solutions verified to meet these constraints):

Max J = (x1 + 2x2)
subject to:
2x2 + x3 = 7
x1 + x2 + x4 = 7
2x1 + x5 = 11

with x3, x4, x5 ≥ 0. Because of the ≥ sign, the slack variables x3, x4, and x5 can only assume zero or positive values. Using standard matrix inversion results in a maximum value of the objective function J* = 10.5 at x1* = 3.5, x2* = 3.5. The graphical solution is shown in Fig. 7.11. The dashed lines indicate the constraints, and the region within which the maximum should lie after meeting the constraints is shown partially hatched. The objective function line drawn as a solid line assumes a maximum value as indicated by the circled point.

The above example was a linear optimization problem since the objective function and all the constraints were linear. For such problems, the slack variables are first-order unknown quantities. For nonlinear problems involving a nonlinear objective function or one or more nonlinear constraints, the standard way of introducing slack variables is as quadratic terms. In the example above, if the second constraint were x1² + x2 ≤ 7, it would be expressed as x1² + x2 + x4² = 7. The other slack variables x3 and x5 would then also be represented by their squares.
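Example 7.3.4 is small enough to solve by brute force: for a linear program, the optimum lies at a vertex of the feasible polygon, so enumerating all pairwise constraint intersections suffices. The sketch below is an illustration of that fact, not the book's "standard matrix inversion" procedure.

```python
from itertools import combinations

# Brute-force vertex enumeration for the LP of Example 7.3.4:
# maximize x1 + 2*x2  s.t.  2*x2 <= 7, x1 + x2 <= 7, 2*x1 <= 11, x1, x2 >= 0.
# Constraints stored as a*x1 + b*x2 <= c (last two encode x1 >= 0, x2 >= 0).
cons = [(0.0, 2.0, 7.0), (1.0, 1.0, 7.0), (2.0, 0.0, 11.0),
        (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]

def feasible(x1, x2, tol=1e-9):
    return all(a * x1 + b * x2 <= c + tol for a, b, c in cons)

best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue                       # parallel constraint lines: no vertex
    x1 = (c1 * b2 - c2 * b1) / det     # Cramer's rule for the intersection
    x2 = (a1 * c2 - a2 * c1) / det
    if feasible(x1, x2):
        J = x1 + 2.0 * x2
        if best is None or J > best[0]:
            best = (J, x1, x2)

J, x1, x2 = best
print(J, x1, x2)   # → 10.5 3.5 3.5
```

This recovers the circled optimum of Fig. 7.11. Vertex enumeration grows combinatorially and is only sensible for tiny problems; the simplex method exists precisely to avoid it.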
7.3.5 Penalty Function Method
Another widely used method for constrained optimization is the penalty factor method, whereby the problem is converted into an unconstrained one. It is especially useful when the constraint is not very rigid or not very important. Consider the problem stated in Example 7.3.2, with the
possibility that the constraints can be inequality constraints as well. Then, a new unconstrained function is framed as:

J = min{y(x)} = min[ y(x) + Σ Pi(hi)² ]   (sum over i = 1, …, k)      (7.14)
where Pi is called the penalty factor for constraint i, with k being the number of constraints. The choice of this penalty factor provides the relative weighting of the constraint compared to the function. For high Pi values, the search will satisfy the constraints but move more slowly in optimizing the function. If Pi is too small, the search may terminate without satisfying the constraints adequately. The penalty function can in general assume any form⁶, but the nature of the problem may often influence the selection. For example, when a forward model is being calibrated with experimental data, one has some prior knowledge of the numerical values of the model parameters. Instead of simply performing a calibration based on minimizing the least square errors, one could frame the problem as an unconstrained penalty factor problem where the function to be minimized consists of a term representing the root sum of square errors and a penalty factor term, which may be the square deviations of the model parameters from their respective estimates. The following example illustrates this approach; it is described further in Sect. 9.5.2 when dealing with nonlinear parameter estimation.

Example 7.3.5 Minimize the following problem using the penalty function approach:

y = 5x1² + 4x2²   s.t.   3x1 + 2x2 = 6      (7.15)

Let us assume a simple form of the penalty factor and frame the problem as:

J = min[ y + P(h)² ] = min[ 5x1² + 4x2² + P(3x1 + 2x2 − 6)² ]      (7.16)

Then:

∂J/∂x1 = 10x1 + 6P(3x1 + 2x2 − 6) = 0      (7.17a)

and

∂J/∂x2 = 8x2 + 4P(3x1 + 2x2 − 6) = 0      (7.17b)

Solving these equations results in x1 = 6x2/5 which, when substituted back into Eq. 7.17a, yields:

x2 = 36P / (12 + (108/5)P + 12P)

The optimal values of the variables are found as the limiting values when P becomes very large. In this case, x2* = 1.071 and, subsequently, from x1 = 6x2/5, x1* = 1.286; these are the optimal solutions sought. ■

⁶ The penalty function should always remain positive; for example, one could specify an absolute value for the function rather than the square.
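The limiting behavior can be demonstrated in a few lines; the closed-form expression for x2(P) below simply restates the result derived above, evaluated for increasing penalty factors.

```python
# Sketch: penalty solution of Example 7.3.5 in closed form, showing the
# minimizer approaching the true constrained optimum as P grows.

def x_penalty(P):
    # stationary point of J = 5*x1^2 + 4*x2^2 + P*(3*x1 + 2*x2 - 6)^2,
    # using x1 = (6/5)*x2 and x2 = 36*P / (12 + (108/5)*P + 12*P)
    x2 = 36.0 * P / (12.0 + (108.0 / 5.0) * P + 12.0 * P)
    return (6.0 / 5.0) * x2, x2

for P in (1.0, 10.0, 1000.0):
    x1, x2 = x_penalty(P)
    print(P, round(x1, 4), round(x2, 4))
```

For small P the constraint is badly violated; by P = 1000 the solution is already within about 0.04% of the exact constrained optimum (x1* = 9/7 ≈ 1.286, x2* = 15/14 ≈ 1.071).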
7.4 Numerical Unconstrained Search Methods
Most practical optimization problems must be solved using numerical search methods. The search toward an optimum is done either exhaustively (or blindly) or systematically and progressively, using an iterative approach to gradually zero in on the optimum. Because the search is performed at discrete points, the precise optimum will not be known. The best that can be achieved is to specify an interval of uncertainty In, which is the range of x values in which the optimum is known to exist after n trials or function calls. The explanation of the various solution methods below is mainly meant to provide conceptual understanding.
7.4.1 Univariate Methods
The search methods differ depending on whether the problem is univariate or multivariate, unconstrained or constrained, and has continuous or discontinuous functions. Some basic methods for univariate problems without constraints are described below.

(a) Exhaustive or direct search: This is the least imaginative approach but is straightforward and simple to use, and is appropriate for simple situations. As illustrated for a maximum-seeking situation in Fig. 7.12, the initial range or interval I0 over which the solution is sought is divided into a number of discrete equal intervals (which dictates the “interval of uncertainty”), and the function values are calculated at all points (seven in the figure) simultaneously. Then, the maximum point is easily identified. The interval of uncertainty after n function calculations is In = I0/[(n + 1)/2].

(b) Basic sequential search: This involves progressively eliminating ranges or regions of the search space based on pairs of observations made sequentially. As illustrated in Fig. 7.13 for the unimodal single-variate problem, one starts by dividing the interval I0 = (a, b) into, say, three intervals and calculating the function values y1 and y2 as shown. If y2 < y1, one can then say that the maximum lies in the range (a, x2), and if y2 > y1, it lies in the range (x1, b). For the case when the function values y1 = y2, one would assume that the optimal value lies close to the center of this interval. Note that the search
Fig. 7.12 Conceptual illustration of the direct search method
Fig. 7.13 Conceptual illustration of the basic sequential search process
Fig. 7.14 Comparison of reduction ratios of different univariate search methods. Note the logarithmic scale of the ordinate. (Adapted from Stoecker 1989)
process reuses one of the two function evaluations for the next step. The search is continued until the optimum point is determined within the preset range of uncertainty. This search algorithm is not very efficient computationally, and better numerical methods are available.

(c) More efficient sequential search methods: These methods differ from the basic sequential search method in that irregularly spaced intervals are used over the interval. The three most commonly used are the dichotomous search, the Golden Section search, and the Fibonacci method (see, e.g., Beveridge and Schechter 1970 or Venkataraman 2002). Numerical efficiency (or power) of a solution method involves both
robustness of the solution and fast execution times. A metric used to compare the execution time of different search methods is the reduction ratio RR = (I0/In) where I0 is the original interval of uncertainty, while In is the range of uncertainty after n trials. Figure 7.14 allows a comparison of the three sequential search methods with the exhaustive search method. The dichotomous search algorithm places the two starting points closer to the midpoint than spacing them equally over the starting interval as is adopted in the basic search procedure. This narrows the search interval for the next iteration, which increases the search efﬁciency and the RR. The way the spacing of the two points is selected is what differentiates the three algorithms.
Fig. 7.15 (a) The Golden Section rule divides a segment into two intervals following the ratio r = 0.618 as shown. (b) Two search points x1 and x2 are determined within the search interval {a,b} following r1 = r2 = r

The Fibonacci method is said to be the most efficient, especially when greater accuracy is demanded. It requires that the number of trials n be selected in advance, which is a limitation in cases where one has no prior knowledge of the behavior of the function near the maximum. If the function is very steep close to the maximum, selecting too few points would not yield an accurate estimate of the maximum. The other minor disadvantage is that it requires calculations of the function y(x) at rather odd values of x. A modified Fibonacci search method has also been developed for situations when one is unable to select the number of trials in advance (Beveridge and Schechter 1970).

The Golden Section method is a compromise, being slightly less efficient than the Fibonacci method but requiring no preselection of the number of trials n. While the Fibonacci method requires different ratios of the ranges to be selected as the search progresses, the Golden Section method assumes only a single value, namely 0.618. Figure 7.15 illustrates how the two initial intervals r = r1 = r2 allow the points x1 and x2 to be determined. Two function calls at x1 and x2 allow one to narrow the interval (a,b) to (x1,b). The next calculation is done by considering the interval (x1,b) and dividing it again into two ranges with end points (x1, x2′). The function value at x1 can be reused, while at the new point x2′ the function value has to be determined afresh. The search is thus continued until an acceptable range of uncertainty is reached. The Golden Section method is robust and is a widely used numerical search method. The above discussion is pertinent for unimodal functions; multimodal functions require much greater care.

(d) Newton-Raphson method
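The Golden Section logic described above can be sketched in a few lines. The test function and search interval below are arbitrary choices for illustration; the key property is that the ratio r = (√5 − 1)/2 ≈ 0.618 lets one of the two interior function values be reused at every iteration.

```python
import math

# Sketch of the Golden Section search for a unimodal maximum on [a, b].
def golden_max(f, a, b, tol=1e-6):
    r = (math.sqrt(5.0) - 1.0) / 2.0          # 0.618...
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                            # maximum lies in (x1, b)
            a, x1, f1 = x1, x2, f2             # reuse x2 as the new x1
            x2 = a + r * (b - a)
            f2 = f(x2)
        else:                                  # maximum lies in (a, x2)
            b, x2, f2 = x2, x1, f1             # reuse x1 as the new x2
            x1 = b - r * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

# illustrative unimodal function: maximum of -(x - 2)^2 on [0, 5] is at x = 2
x_star = golden_max(lambda x: -(x - 2.0) ** 2, 0.0, 5.0)
print(round(x_star, 4))   # → 2.0
```

Because r² = 1 − r, the surviving interior point of each iteration lands exactly where the next iteration needs it, which is why only one new function call is required per step.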
Gradient-based methods allow faster convergence than the previous methods. They are numerical methods, but the step size is varied based on the slope of the function. The Newton-Raphson method is quite popular; it was developed for finding roots of single nonlinear equations but can be applied to optimization problems as well if the equation is taken to be the derivative of the objective function. It is iterative, with each step determined from a linear Taylor series expansion of the derivative of the objective function, namely (dJ/dx). The algorithm is essentially as follows: (i) assume a starting value x0 and calculate the first derivative (dJ/dx) at x0, (ii) determine the slope of the first-derivative function (d²J/dx²) at x0, (iii) update the search value based on the slope to determine the next value x1 = (x0 + Δx), and (iv) continue till the desired convergence is reached. Step (iii) of the algorithm is based on the following approximation:

(dJ/dx)(x0 + Δx) = (dJ/dx)(x0) + (d²J/dx²)(x0)·Δx      (7.18a)

Setting the left-hand side to zero (the condition being sought) and solving for the step size:

Δx = −(dJ/dx)(x0) / (d²J/dx²)(x0)      (7.18b)
Example 7.4.1⁷ Minimize

J(x) = (x − 1)²(x − 2)(x − 3)   s.t.   2 ≤ x ≤ 4      (7.19)

The numerical search process is illustrated graphically in Fig. 7.16. As shown, the function has roots (i.e., it cuts the x-axis) at points 1, 2, and 3. The function J(x) and its first derivative are plotted in the figure. The first derivative of the objective function is the equation for which the roots must be determined:

dJ/dx = 2(x − 1)(x − 2)(x − 3) + (x − 1)²(x − 3) + (x − 1)²(x − 2) = 0

and the second derivative is:

d²J/dx² = 2(x − 2)(x − 3) + 4(x − 1)(x − 3) + 4(x − 1)(x − 2) + 2(x − 1)²

⁷ From Venkataraman (2002).
Fig. 7.16 Graphical illustration of Example 7.4.1 using the Newton-Raphson method
Assume, say, a starting value of x0 = 3 (note: this value is selected to fall within the range constraint stipulated in the problem). The function value is J = 0, the first derivative (dJ/dx) = 4, and (d²J/dx²) = 16. From Eq. (7.18b), Δx = −0.25, and the corresponding value of (dJ/dx) at x = 2.75 is 0.875. The derivative is not zero as yet, and so another iteration is required. Starting from x1 = 2.75, the second iteration yields J = −0.5742, (dJ/dx) = 0.875 and (d²J/dx²) = 9.25, from which Δx = −0.0946 and the new (dJ/dx) = 0.104. The value of the derivative has decreased, indicating that we are closer to zero, but more iterations are required. In fact, five iterations are needed to reach a value of (dJ/dx) = 0, at which point x = 2.6404 and J = −0.6197.

As a precaution it is urged, especially when dealing with nonlinear functions, that the search be repeated with different starting values to assure oneself that a global minimum has indeed been reached. It is important to realize that if one had stipulated the constraint as 0 ≤ x ≤ 4 and chosen a starting value of x0 = 0.5, the solution would have converged to x = 1, which is a local minimum (see Fig. 7.16). This highlights the dangers of local convergence when dealing with nonlinear functions. Further, it is clear that the number of iterations will decrease if the starting value is taken close to the global solution. Under certain circumstances, all gradient-based methods may fail to converge, such as when the function derivative during the search equals zero. Algorithms based on a combination of gradient-based and search methods have also been developed; they exhibit desirable robustness while remaining efficient.
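The iteration sequence above is straightforward to reproduce; the sketch below is a direct transcription of Eq. 7.18b, with the convergence tolerance and iteration cap being arbitrary choices made here.

```python
# Sketch reproducing Example 7.4.1: Newton-Raphson on dJ/dx = 0 for
# J(x) = (x - 1)^2 * (x - 2) * (x - 3), starting at x0 = 3.

def dJ(x):
    return (2 * (x - 1) * (x - 2) * (x - 3)
            + (x - 1) ** 2 * (x - 3)
            + (x - 1) ** 2 * (x - 2))

def d2J(x):
    return (2 * (x - 2) * (x - 3) + 4 * (x - 1) * (x - 3)
            + 4 * (x - 1) * (x - 2) + 2 * (x - 1) ** 2)

x = 3.0                                  # starting value inside 2 <= x <= 4
for _ in range(20):
    x += -dJ(x) / d2J(x)                 # step size from Eq. 7.18b
    if abs(dJ(x)) < 1e-10:
        break

J = (x - 1) ** 2 * (x - 2) * (x - 3)
print(round(x, 4), round(J, 4))   # → 2.6404 -0.6197
```

Restarting with x = 0.5 (and the relaxed constraint 0 ≤ x ≤ 4) makes the same loop converge to the local minimum at x = 1, which reproduces the caution stated in the text.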
Fig. 7.17 Conceptual illustration of ﬁnding a minimum point of a bivariate function using a pattern or lattice search method. From an initial point 1, the best subsequent move involves determining the function values around that point at discrete grid points (points 2 through 9) and moving to the point with the lowest function value
7.4.2 Multivariate Methods
Realistically, the efficiency of single-variate search algorithms is not a concern given the computing power available nowadays. For multivariate problems, on the other hand, efficiency is a critical aspect since the number of combinations increases exponentially. Numerous multivariate search methods are available in the literature, but only the basic ones are described below. These methods can be grouped into zero-order (when only the function values are to be determined), first-order (based on the first derivative or linear gradient), and second-order (requiring the second-order derivatives or quadratic polynomials). One can also categorize the methods as “numerical” or “analytical” or a combination of both. Search methods are most robust and appropriate for non-differentiable or discontinuous functions, while calculus-based methods are generally efficient for problems with continuous functions. One also distinguishes between valley-descending methods for minimization problems and hill-climbing methods for maximization problems. Three general solution approaches for unconstrained optimization problems are described below in terms of bivariate functions for easier comprehension.

(a) Pattern or lattice search is a directed search method where one starts at one point in the search space (shown as point 1 in Fig. 7.17), calculates values of the function at several points around the initial point (points 2–9), and moves to the point which has the lowest value (shown as point 5). This process is repeated till the overall minimum is found. This combination of exploratory moves and heuristic search done iteratively is the basis of the Hooke-Jeeves pattern search method, which is quite popular. Sometimes, one may use a coarse grid search initially, find the optimum within an interval of
7.4 Numerical Unconstrained Search Methods
Fig. 7.18 Conceptual illustration of ﬁnding a minimum point of a bivariate function using the univariate search method. From an initial point 1, the gradient of the function is used to ﬁnd the optimal point value of x2 keeping x1 ﬁxed, and so on till the optimal point 5 is found
uncertainty, then repeat the search using a ﬁner grid. Note that this is not a calculusbased method since the function calls are made at discrete surrounding points, and is not very efﬁcient computationally, especially for higher dimension problems. However, it is more robust than calculusbased algorithms and simple to implement. (b) Univariate search method (Fig. 7.18) involves ﬁnding the minimum of one variable at a time keeping the others constant and repeating this process iteratively. This method accelerates the process of reaching the minimum (or maximum) point of a function compared to the lattice search. One starts by using some preliminary values for all variables other than the one being optimized and ﬁnds the optimum value for the selected variable using a onedimension search process or a calculusbased onestep process (shown as x1). One then selects a second variable to optimize while retaining this optimal value of the ﬁrst variable, and ﬁnds the optimal value of the second variable, and so on for all remaining variables until no signiﬁcant improvement is found between successive searches. The entire process often requires more than one iteration, as shown in Fig. 7.18, a real danger is that it can get trapped in a local minimum region. The onedimensional searches do not necessarily require numerical derivatives giving this algorithm an advantage when used with functions that are not easily differentiable or discontinuous. However, the search process is inherently not very efﬁcient.
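The exploratory-move logic of method (a) can be sketched in a few lines of code; the quadratic objective and the grid-shrinking rule below are illustrative assumptions, not taken from the text:

```python
def pattern_search(f, x0, step=1.0, shrink=0.5, tol=1e-6):
    """Lattice (pattern) search: evaluate the 8 grid points around the
    current point and move to the best one; shrink the grid once no
    neighbor improves, and stop when the grid spacing falls below tol."""
    x, y = x0
    best = f(x, y)
    while step > tol:
        # function calls at the surrounding grid points (points 2-9)
        neighbors = [(x + i * step, y + j * step)
                     for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i, j) != (0, 0)]
        cand = min(neighbors, key=lambda p: f(*p))
        if f(*cand) < best:              # move to the lowest-valued point
            (x, y), best = cand, f(*cand)
        else:                            # no improvement: refine the grid
            step *= shrink
    return x, y, best

# illustrative objective with known minimum at (1, -2)
f = lambda x, y: (x - 1) ** 2 + (y + 2) ** 2
print(pattern_search(f, (5.0, 5.0)))
```

Shrinking the grid when no neighbor improves mimics the coarse-to-fine strategy mentioned above.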
Example 7.4.2 Illustration of the univariate search method
Consider the following function of two variables, which is to be minimized using the univariate search process starting with an initial value of x2 = 3:
Fig. 7.19 Conceptual illustration of finding a minimum point of a bivariate function using the steepest-descent search method. From an initial point 1, the gradient of the function is determined and the next search point is obtained by moving in that direction, and so on till the optimal point 4 is found
y = x1 + 16/(x1·x2) + x2/2    (7.20a)
First, the partial derivatives are found:

∂y/∂x1 = 1 - 16/(x2·x1²)  and  ∂y/∂x2 = -16/(x1·x2²) + 1/2    (7.20b)
Next, the initial value of x2 = 3 is used to find the next iterative value of x1 from the (∂y/∂x1) expression:

∂y/∂x1 = 1 - 16/((3)·x1²) = 0, from where x1 = 4/√3 = 2.309

The other partial derivative is then used with this value of x1 to yield:

-16/((4/√3)·x2²) + 1/2 = 0, from where x2 = 3.722
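The hand iteration above can be continued programmatically; a minimal sketch, with the closed-form coordinate updates obtained by setting each partial derivative in Eq. 7.20b to zero:

```python
from math import sqrt

def univariate_search(x2=3.0, tol=1e-9, max_iter=200):
    """Univariate search for y = x1 + 16/(x1*x2) + x2/2 (Eq. 7.20a).
    From Eq. 7.20b, dy/dx1 = 0 gives x1 = sqrt(16/x2), and
    dy/dx2 = 0 gives x2 = sqrt(32/x1); the two one-variable
    optimizations are simply alternated until convergence."""
    x1 = sqrt(16.0 / x2)                 # first half-cycle: 4/sqrt(3) = 2.309
    for _ in range(max_iter):
        x2_new = sqrt(32.0 / x1)         # optimize x2 with x1 held fixed
        x1_new = sqrt(16.0 / x2_new)     # optimize x1 with x2 held fixed
        if abs(x1_new - x1) < tol and abs(x2_new - x2) < tol:
            return x1_new, x2_new
        x1, x2 = x1_new, x2_new
    return x1, x2

print(univariate_search())   # converges to the optimum (x1, x2) = (2, 4)
```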
The new value of x2 is now used for the next cycle, and the iterative process is repeated until consecutive improvements turn out to be sufficiently small to suggest convergence. It is left to the reader to verify that the optimal values are x1 = 2, x2 = 4. ■
(c) Steepest-descent search (Fig. 7.19) is a widely used calculus-based approach because of its efficiency. The computational algorithm involves three steps. One starts with a guess value (represented by point 1), selected somewhat arbitrarily but, if possible, close to the optimal value. One then evaluates the gradient of the function at the current point by computing the partial derivatives either analytically or numerically. Finally, one moves along this gradient (hence the terminology "steepest") by deciding, somewhat arbitrarily, on the step size. The relationship between the step sizes Δxi and the partial derivatives (∂y/∂xi) is:

Δx1/(∂y/∂x1) = Δx2/(∂y/∂x2) = ... = Δxi/(∂y/∂xi)    (7.21)

Steps 2 and 3 are performed iteratively until the minimum (or maximum) point is reached. A note of caution: too large a step size can result in numerical instability, while too small a step size increases computation time. The above valley-descending methods are suited for function minimization; the algorithm is easily modified to the hill-climbing situation for problems requiring maximization.

Example 7.4.3 Illustration of the steepest descent method
Consider the following function with three variables, which is to be minimized:

y = 72x1/x2 + 360/(x1·x3) + x1·x2 + 2x3    (7.22)

Assume a starting point of (x1 = 5, x2 = 6, x3 = 8). At this point, the value of the function is y = 115. These numerical values are inserted into the expressions for the partial derivatives:

∂y/∂x1 = 72/x2 - 360/(x3·x1²) + x2 = 72/6 - 360/((8)(5)²) + 6 = 16.2
∂y/∂x2 = -72x1/x2² + x1 = -72(5)/(6)² + 5 = -5
∂y/∂x3 = -360/(x1·x3²) + 2 = -360/((5)(8)²) + 2 = 0.875    (7.23)

In order to compute the next point, a step size must be assumed. Arbitrarily assume Δx1 = -1 (the reader can verify that a negative value results in a decrease in the function value y). Applying Eq. 7.21 with the gradient values from Eq. 7.23 results in Δx1/16.2 = Δx2/(-5) = Δx3/0.875, from where Δx2 = 0.309 and Δx3 = -0.054. Thus, the new point is (x1 = 4, x2 = 6.309, x3 = 7.946). The reader can verify that the new point has resulted in a decrease in the functional value from 115 to 98.1. Repeated use of the search method will gradually result in the optimal value being found. ■
(d) More efficient methods based on quadratic convergence. An improvement over the pattern or lattice search algorithm is Powell's conjugate direction method, which searches along a set of directions conjugate or orthogonal with respect to the objective function. Instead of discrete function calls at surrounding points to direct the search as in pattern search (referred to as a zero-order method since it does not require derivatives to be determined), this method uses a quadratic polynomial approximation for local interpolation of the objective function. The solution of the quadratic approximation serves as the starting point of the next iteration, and so on, while also providing an indication of the step size. Note that if the objective function is quadratic to start with, the approximation is exact, and the minimum point is found in one step. Otherwise, several iterations are needed, but the convergence is very rapid. The Powell algorithm is not a calculus-based method; however, it is unsuitable for complicated nonlinear objective functions. Also, it may be computationally inefficient for higher-dimension problems and, worse, may not converge to the global optimum if the search space is not symmetrical. The Fletcher-Reeves method greatly improves the search efficiency of the steepest gradient method by adopting the concept of quadratic convergence. The transformation allows determining good search directions and distances based on the shape of the target function near the initial guess, and then progresses towards the local minimum. This is a calculus-based method which involves using the Hessian matrix to determine the search direction. The advantage is that this method uses information about the local curvature of the fit statistics as well as its local gradients, which often tends to stabilize the search results. Textbooks such as Venkataraman (2002) provide more details about the mathematical theory and ways to code these methods into a software program.
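The first steepest-descent step of Example 7.4.3 can be reproduced numerically; a minimal sketch that scales the step so that Δx1 = -1, as in the example:

```python
def y(x1, x2, x3):
    # objective of Example 7.4.3 (Eq. 7.22)
    return 72 * x1 / x2 + 360 / (x1 * x3) + x1 * x2 + 2 * x3

def grad(x1, x2, x3):
    # analytic partial derivatives (Eq. 7.23)
    return (72 / x2 - 360 / (x3 * x1 ** 2) + x2,
            -72 * x1 / x2 ** 2 + x1,
            -360 / (x1 * x3 ** 2) + 2)

x = [5.0, 6.0, 8.0]              # starting point, where y = 115
g = grad(*x)                     # (16.2, -5.0, 0.875)
scale = -1.0 / g[0]              # chosen so that dx1 = -1, per the example
x_new = [xi + scale * gi for xi, gi in zip(x, g)]
print(x_new, y(*x_new))          # about (4, 6.309, 7.946), y about 98.1
```

Iterating this update (with a suitable step-size rule) drives the point towards the minimum.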
7.5 Linear Programming (LP)8

7.5.1 Standard Form
Recall the concept of numerical efficiency (or power) of a method of solution, involving both robustness of the solution and fast execution times. Optimization problems which can be framed as linear problems (even at the expense of a small loss in accuracy) have great numerical efficiency. Only if the objective function and the constraints (either equalities or inequalities) are all linear functions is the problem designated a linear optimization problem; otherwise, it is deemed a nonlinear optimization problem. The objective can involve one or more functions to be either minimized or maximized (either objective can be treated identically since it is easy to convert one into the other).

8 "Programming" is synonymous with planning activities or optimization in operations research.
The standard form of linear programming problems is:

minimize f(x) = cᵀx    (7.24a)

subject to: g(x): Ax = b    (7.24b)
where x is the column vector of variables of dimension n, b that of the constraint limits of dimension m, c that of the cost coefficients of dimension n, and A is the (m × n) matrix of constraint coefficients.

Example 7.5.1 Express the following linear two-dimensional problem in standard matrix notation:

Maximize f(x): 3186 + 620x1 + 420x2    (7.25a)

subject to

g1(x): 0.5x1 + 0.7x2 ≤ 6.5
g2(x): 4.5x1 - x2 ≤ 35
g3(x): 2.1x1 + 5.2x2 ≤ 60    (7.25b)

with range constraints on the variables x1 and x2 being that these should not be negative. This is a problem with two variables (x1 and x2). However, three slack variables need to be introduced to reframe the three inequality constraints as equality constraints, which makes the problem one with five unknown variables. The three inequality constraints are rewritten as:

g1(x): 0.5x1 + 0.7x2 + x3 = 6.5
g2(x): 4.5x1 - x2 + x4 = 35
g3(x): 2.1x1 + 5.2x2 + x5 = 60    (7.26)

Hence, the terms appearing in the standard form (Eqs. 7.24a and b) are:

c = [-620 -420 0 0 0]ᵀ,  x = [x1 x2 x3 x4 x5]ᵀ,

A = | 0.5   0.7  1  0  0 |
    | 4.5  -1.0  0  1  0 |
    | 2.1   5.2  0  0  1 |,

b = [6.5 35 60]ᵀ    (7.27)
Note that the objective function is recast as a minimization problem simply by reversing the signs of the coefficients. Also, the constant does not appear in the optimization since it can simply be added to the optimal value of the function at the end. Step-by-step solutions of such optimization problems are given in several textbooks such as Edgar et al. (2001), Hillier and Lieberman (2001), and Stoecker (1989).
A commercial optimization software program was used to determine the optimal value of the above objective function: f(x)* = 9803.8. Note that in this case, since the inequalities are of the "less than or equal to" type, the numerical values of the slack variables (x3, x4, x5) will be non-negative. The optimal values for the primary variables are x1* = 8.493 and x2* = 3.219, while those for the slack variables are x3* = 0, x4* = 0, x5* = 25.424 (implying that constraints 1 and 2 in Eq. 7.25b have turned out to be binding equality constraints). ■
There is a great deal of literature on efficient algorithms to solve linear problems, which are referred to as linear programming methods. Because of its efficiency, the Simplex algorithm is the most popular numerical technique for solving large sets of linear equations. It proceeds by moving from one feasible solution to another, with each step improving the value of the objective function; it also provides the information needed to perform a sensitivity analysis at the same time. Hence, formulating problems as linear problems (even when they are not strictly so) offers a great advantage in the solution phase. Such problems arise in numerous real-world applications where limited resources, such as machines (e.g., an airline with a fixed number of airplanes that must serve a preset number of cities each day) or materials, are to be allocated or scheduled in an optimal manner among several competing solutions/pathways.
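Example 7.5.1 can be verified with an off-the-shelf LP solver. A sketch using scipy.optimize.linprog (assuming SciPy is available), which handles slack variables internally so the inequality form can be passed directly:

```python
from scipy.optimize import linprog

# Maximize 3186 + 620*x1 + 420*x2 (Eq. 7.25a): linprog minimizes,
# so the cost coefficients are negated (as in Eq. 7.27); the
# constant 3186 is added back at the end.
c = [-620, -420]
A_ub = [[0.5, 0.7],      # g1
        [4.5, -1.0],     # g2
        [2.1, 5.2]]      # g3
b_ub = [6.5, 35.0, 60.0]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
x1, x2 = res.x
f_opt = 3186 - res.fun           # maximum of the original objective
print(x1, x2, f_opt)             # about 8.493, 3.219, 9803.8
```

The sign reversal converts the maximization into the minimization form expected by the solver.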
7.5.2 Example of an LP Problem9
This example illustrates how the objective function and the constraints are to be framed given the necessary data, and then expressed in standard form. Specifically, the problem involves optimizing the mix of different mitigation technology pathways available to reduce pollution from a steel-manufacturing company. A steel company wishes to reduce its pollution emissions (specifically particulates, sulfur oxides, and hydrocarbons), which are generated in two of its processes: blast furnaces for making pig iron and open-hearth furnaces for changing iron into steel. For both types of equipment, there are three viable technological solutions: taller smokestacks, filters, and better fuels. The amounts of required reduction for each of the three pollutants and the reductions of emissions achievable by each abatement option are given in Table 7.1, while the costs of the abatement methods are shown in Table 7.2. This problem is formulated in terms of the six fractions xi shown in Table 7.3. The unit costs (shown in Table 7.2) are
9 Adapted from Hillier and Lieberman (2001).
assumed to be constant, and so are the emission reduction rates (shown in Table 7.1).

Table 7.1 Maximum feasible reduction in emission rates (10^6 lb/year) for different abatement technologies, and required total annual emission reductions for different pollutants

Pollutant | Taller smokestacks (blast / open-hearth) | Filters (blast / open-hearth) | Better fuels (blast / open-hearth) | Required reduction
Particulates | 12 / 9 | 25 / 20 | 17 / 13 | 60
Sulfur oxides | 35 / 42 | 18 / 31 | 56 / 49 | 150
Hydrocarbons | 37 / 53 | 28 / 24 | 29 / 20 | 125

Table 7.2 Annual costs for different abatement technologies if maximum feasible capacity is implemented

Abatement method | Cost, blast furnaces ($ millions) | Cost, open-hearth furnaces ($ millions)
Taller smokestacks | 8 | 10
Filters | 7 | 6
Better fuels | 11 | 9

Table 7.3 Decision variables: fractions (xi) of the maximum feasible capacity of an abatement technology implemented, for the three technologies and the two manufacturing options

Abatement method | Blast furnaces | Open-hearth furnaces
Taller smokestacks | x1 | x2
Filters | x3 | x4
Better fuels | x5 | x6

The objective function is framed in terms of minimizing total annual cost subject to the required emission reductions and the requirement that the fractions xi be non-negative and no greater than 1:

Minimize J = 8x1 + 10x2 + 7x3 + 6x4 + 11x5 + 9x6
s.t. 12x1 + 9x2 + 25x3 + 20x4 + 17x5 + 13x6 = 60
     35x1 + 42x2 + 18x3 + 31x4 + 56x5 + 49x6 = 150
     37x1 + 53x2 + 28x3 + 24x4 + 29x5 + 20x6 = 125
     x1, x2, x3, x4, x5, x6 ≥ 0
     x1, x2, x3, x4, x5, x6 ≤ 1    (7.28a, b)

Following the standard notation (Eqs. 7.24a and b), the problem is stated as:

Minimize J = cᵀx
Constraints: g(x): Ax = b

with

c = [8 10 7 6 11 9]ᵀ
x = [x1 x2 x3 x4 x5 x6]ᵀ

A = | 12  9  25  20  17  13 |
    | 35  42 18  31  56  49 |
    | 37  53 28  24  29  20 |

b = [60 150 125]ᵀ    (7.29)

A commercial solver yields the following values at the optimal point: x1* = 1.0, x2* = 0.623, x3* = 0.343, x4* = 1.0, x5* = 0.048, x6* = 1.0, while the optimal objective function (total cost) is J* = $32.154 million. Thus, since x1* = 1.0, all the blast furnaces that can be converted to taller smokestacks should be modified accordingly, and so on. It is left to the reader to perform a post-optimality analysis to determine the sensitivity of the solution. For example, the individual manufacturing units come in discrete sizes, and so the six fractions would have to be rounded to the closest number of units; how the total cost and the total emissions change in such a case needs to be assessed. Strictly speaking, this is then a mixed integer programming problem, which is usually harder to solve; the above continuous approach is a simplification that works well in most cases.
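The steel-plant problem of Eqs. 7.28 and 7.29 can likewise be solved with a few lines; a sketch using scipy.optimize.linprog (assuming SciPy is available) with the equality constraints and the 0-1 bounds on the fractions:

```python
from scipy.optimize import linprog

c = [8, 10, 7, 6, 11, 9]             # annual costs, $ millions (Table 7.2)
A_eq = [[12, 9, 25, 20, 17, 13],     # particulates reduction (Table 7.1)
        [35, 42, 18, 31, 56, 49],    # sulfur oxides
        [37, 53, 28, 24, 29, 20]]    # hydrocarbons
b_eq = [60, 150, 125]                # required reductions, 1e6 lb/year

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 6)
print(res.x.round(3), round(res.fun, 3))   # J* is about 32.15 ($ millions)
```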
7.5.3 Linear Network Models
Network models are suitable for modeling systems with discrete nodes/vertices/points (such as junctions) interconnected by links/lines/edges/branches (such as water pipes for district energy distribution or power lines) through which matter or power can flow. They have been increasingly applied to engineered infrastructures recovering from partial or complete failure due to extreme weather events10 (such as electric power transmission and wide-area water distribution systems) and have even found applications in the social sciences and for modeling internet and social media communication interactions/dynamics (a comprehensive text is that by Newman 2010). A recent building sciences application of network modeling was developed by Sonta and Jain (2020), in which a social and organizational human network structure was learned using ambient sensing data from distributed plug-load energy sensors in commercial buildings. In essence, network models are "simplified representations
10 Such issues are now being increasingly studied under the general area referred to as "resilience."
that reduces a system to an abstract structure capturing only the basics of connection patterns with vertices or nodes for components and edges capturing some basic relationship of the node and of the system" (Alderson and Doyle 2010). The focus is on the topology of the essential structural interconnections among components (and not just on individual components), and on the behavior of the system under event-based disruptions of such interconnections. A network with m edges connecting n nodes (called a graph) can be formulated as a set of n linear equations using Kirchhoff's conservation law, which results in an (m × n) matrix called an incidence matrix. For simple cases without complex constraints, this leads to direct solutions (see Strang 1998 or Newman 2010). However, any realistic network would generally require a numerical method to solve the set of linear equations. It has been pointed out (for example, by Alderson et al. 2015) that the representation of an actual engineered system by a simplified surrogate network can be misleading if done simplistically. Hence, some sort of validation of the network topology, modeling equations, and simulation is needed before one can place confidence in the analysis results.
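The node-edge bookkeeping can be made concrete with a toy graph (the four-node example below is an illustrative assumption, not a network from the text); following Strang's convention, the incidence matrix is m × n, with -1 at each edge's start node and +1 at its end node:

```python
def incidence_matrix(n_nodes, edges):
    """m x n incidence matrix: row k has -1 at the tail node and +1 at
    the head node of edge k (nodes numbered from 1). Kirchhoff's
    conservation law at the nodes then reads A-transpose times the
    edge-flow vector equals zero."""
    A = [[0] * n_nodes for _ in edges]
    for k, (i, j) in enumerate(edges):
        A[k][i - 1] = -1
        A[k][j - 1] = +1
    return A

# toy directed graph with 4 nodes and 5 edges
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
A = incidence_matrix(4, edges)
for row in A:
    print(row)
```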
7.5.4 Example of Maximizing Flow in a Transportation Network
This example is taken from Vugrin et al. (2014) to illustrate how to analyze transportation networks. Figure 7.20 represents a simple network with 7 nodes and 12 links where the objective is to maximize flow from starting node 1 to specified end node 7. The limiting capacities of the various links are specified (and indicated in the figure). A fictitious return link (shown dotted) with infinite capacity needs to be introduced to complete the circuit. The optimization model for this flow problem can be framed as:

max x71(t)
s.t.  Σ_{i∈In} xi(t) - Σ_{i∈On} xi(t) = 0,   n = 1, ..., 7
      0 ≤ xi(t) ≤ Ki(t)   ∀i    (7.30)
Note that the time element t has been retained even though this analysis only involves the steady-state situation. The second expression is the conservation constraint whereby the sum of inflows In is equal to the sum of outflows On at each node n. Ki denotes the limiting capacity of link i, and the symbol ∀ states that the constraint applies to all members i of the set. For the uninterrupted case, i.e., when no links are broken, the maximum flow is 14 units, while the flows through the
Fig. 7.20 Flow network topology with 7 nodes and 12 links. The intent is to maximize ﬂow from node 1 to node 7 under different breakage scenarios. The limiting ﬂow capacities of various links are shown above the corresponding lines. The dotted line is a ﬁctitious link to complete the ﬂow circuit
Table 7.4 Flows through different links for two scenarios in order to maximize total flow from node 1 to node 7 (see Fig. 7.20)

Link | Uninterrupted case | Compromised case (links 1-4, 2-3, 3-4 broken)
1-2 | 3.0 | 3.0
1-3 | 7.0 | 7.0
1-4 | 4.0 | -
2-3 | 0.0 | -
2-5 | 3.0 | 3.0
3-4 | 0.0 | -
3-5 | 4.0 | 4.0
3-6 | 3.0 | 3.0
4-6 | 4.0 | 0.0
6-5 | 1.0 | 0.0
5-7 | 8.0 | 7.0
6-7 | 6.0 | 3.0
Flow from 1 to 7 | 14.0 | 10.0
individual links are assembled in Table 7.4. The same equations can be modified to analyze the situation when one or more of the links break. The corresponding flows for a breakage scenario in which links 1-4, 2-3, and 3-4 are compromised are also assembled in Table 7.4; in this case, the maximum flow reduces to 10 units. Such analyses can be performed assuming different scenarios of one or more link breakages. These types of evaluations are usually done in the framework of reliability analyses during the design of such networks.
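The formulation of Eq. 7.30 can be sketched as a linear program on a small hypothetical network (the 7-node network of Fig. 7.20 is not reproduced here, since its link capacities appear only in the figure). A fictitious return arc from sink to source is added, its flow is maximized, and conservation is imposed at every node; SciPy is assumed available:

```python
from scipy.optimize import linprog

# hypothetical network: (tail, head, capacity); max flow from 1 to 4 is 5
edges = [(1, 2, 3), (1, 3, 2), (2, 3, 1), (2, 4, 2), (3, 4, 3),
         (4, 1, 1e6)]            # fictitious return link, "infinite" capacity
n_nodes = 4

# conservation at each node: inflow - outflow = 0 (Eq. 7.30)
A_eq = [[0.0] * len(edges) for _ in range(n_nodes)]
for k, (i, j, _) in enumerate(edges):
    A_eq[i - 1][k] -= 1.0        # edge k leaves node i
    A_eq[j - 1][k] += 1.0        # edge k enters node j
b_eq = [0.0] * n_nodes

c = [0.0] * len(edges)
c[-1] = -1.0                     # maximize flow on the return arc

res = linprog(c, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, cap) for (_, _, cap) in edges])
print(-res.fun)                  # maximum flow through the network
```

Broken links can be simulated simply by setting the corresponding capacities to zero and re-solving.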
7.5.5 Mixed Integer Linear Programming (MILP)
Mixed integer linear programming (MILP) problems are a special category of linear optimization problems in which some of the variables are integers or even binary (such as a piece of
equipment being on or off, or a "yes/no" decision coded as 1 and 0). Integer/binary variables arise in scheduling problems involving multiple pieces of equipment. For example, if a large facility has numerous power generation units to meet a variable load, determining which units to operate so as to minimize operating costs is an integer problem, while determining the fraction of their rated capacity at which to operate them is a continuous variable problem. Both issues taken together would be treated as a MILP problem (as illustrated in the solved example in Sect. 7.7). For example, Henze et al. (2008) developed and validated an optimization environment for a pharmaceutical facility with a chilled water plant with ten different chillers (electrical and absorption) that adopts mixed integer programming to optimize chiller selection (scheduling) and dispatch for any cooling load condition, while an overarching dynamic programming approach selects the optimal charge/discharge strategy of the chilled water thermal energy storage system. Another example is a manufacturer who can produce different types of widgets and must decide how many items of each widget type to manufacture in order to maximize profit. Typically, such problems can be set up as standard linear optimization problems, with the added requirement that some of the variables must be integers. MILP problems are generally solved using an LP-based branch-and-bound algorithm (see Hillier and Lieberman 2001). The basic LP-based branch-and-bound can be described as follows. Start by removing all the integrality restrictions on decision variables which can only take on integer values; the resulting LP is called the LP relaxation of the original MILP. On solving this problem, if it so happens that the result satisfies all the integrality restrictions, even though these were not explicitly imposed, then that is the optimal solution sought.
If not, as is usually the case, the normal procedure is to pick one variable that is restricted to be integer but whose value in the LP relaxation is fractional. For the sake of argument, suppose that this variable is x and its value in the LP relaxation is 3.3. One can then exclude this fractional value by imposing, in turn, the constraints x ≤ 3.0 and x ≥ 4.0. This process is done sequentially for all the integer variables and is somewhat tedious. Usually, commercial optimization programs have in-built capabilities, and the user does not need to specify/perform such additional steps. Generally, it can be stated that (mixed) integer programming problems are much harder to solve than linear programming problems. MILP problems also arise in circumstances where the constraints are either-or, or in the more general case when "K out of N" constraints need to be satisfied. A simple example illustrates the former case (Hillier and Lieberman 2001). Consider a case when at least one of the following inequalities must hold:
Either 3x1 + 2x2 ≤ 18
Or     x1 + 4x2 ≤ 16    (7.31)

This can be reformulated as:

3x1 + 2x2 ≤ 18 + My
x1 + 4x2 ≤ 16 + M(1 - y)    (7.32)
where M is a very large number and y is a binary variable (either 1 or 0). Solving these two constraints along with the objective function will provide the solution. A practical example of how MILP can be used for supervisory control is given in Sect. 7.7 wherein the component models of various energy equipment are framed as linear functions (a useful simpliﬁcation in many cases).
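The either-or construction of Eqs. 7.31 and 7.32 can be tested by enumerating the binary variable y and solving the resulting LP for each value; the objective max x1 + x2 is an illustrative assumption, since the text specifies only the constraints (SciPy assumed available):

```python
from scipy.optimize import linprog

M = 1e6                          # big-M constant
best = None
for y in (0, 1):                 # enumerate the binary variable of Eq. 7.32
    # y = 0 enforces 3x1 + 2x2 <= 18; y = 1 enforces x1 + 4x2 <= 16
    A_ub = [[3, 2], [1, 4]]
    b_ub = [18 + M * y, 16 + M * (1 - y)]
    res = linprog([-1, -1], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None), (0, None)])
    if res.status == 0 and (best is None or -res.fun > best[0]):
        best = (-res.fun, y, res.x)

print(best)   # objective 16 with y = 1, i.e. the second constraint active
```

A true MILP solver would treat y as a decision variable; with a single binary variable, enumerating both cases is equivalent.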
7.5.6 Example of Reliability Analysis of a Power Network
Consider a simple electric power transmission system with seven loads and two generators (at nodes 1 and 3), as shown in Fig. 7.21.11 The step-down transformers are the nodes of this network, while the high-voltage power lines are the links or lines indicated by arrows. The loss of power in the lines will be neglected, and the analysis will be done assuming DC current flow (AC current analysis is more demanding computationally, requiring the solution of nonlinear equations and consideration of sophisticated stabilizing feedback control loops). Moreover, network models only capture energy/power quantities, not current and voltage, as essential variables in a power system; this limits the ability to model cascading failures. Network models essentially capture snapshots of power systems at discrete time intervals, while actual power networks are continuous-time dynamical systems. (a) Mathematical model The flow model for the network shown i